Enabling SSO with CQ5 – Part III

In previous part , we discussed about protecting CQ5 author instance when CQ5 acts as a service provider (SP). In this blogpost, we’ll cover how to protect any published resource/website. We’ll be using Shibboleth SP for the same.

Necessary Steps  : 

  1. Installing LDAP Server.
  2. Installing Shibboleth IdP.
  3. Installing Apache tomcat on Ubuntu.
  4. Configuring Shibboleth IdP.
  5. Installation of SP.
  6. Make Apache aware of shibboleth
  7. Configuration of Shibboleth SP and providing it to IdP.
  8. Protecting and accessing the CQ published resource.

Steps 1-4 have already been covered in Part  I. Subsequent steps are explained below :

  1. Installation of SP

Installing SP and it’s configuration was a bit difficult task for me as the existing documentation was not very clear. I would like to mention few important points and brief the installation steps below .

  • Installing Apache2

sudo apt-get install apache2
  • Installing Shibboleth SP

Easiest way to install SP is via command line.

sudo apt-get install libapache2-mod-shib2 shibboleth-sp2-schemas

When the installation is completed, various components of Shibboleth will be placed in appropriate directories based on the OS file system layout. You may check:

  • Shibboleth configuration files will be placed at /etc/shibboleth/ and the necessary Apache configuration  in /etc/httpd/conf.d/shib.conf
  • shibd will be installed to /usr/sbin and may be managed using /sbin/service and /sbin/chkconfig
  • An appropriate version of mod_shib and other pluggable modules will be installed to /usr/lib/shibboleth/
  • Logs will be located in /var/log/shibboleth/shibd.log

The installation directory structure may vary depending upon the OS version/type you are using. It might happen that some of the folders/files mentioned above might not be present in your file system. In my case, I was not able to see /etc/httpd/conf.d/shib.conf file . If same thing happens with you, then make all apache related configurations in apache2.conf file as per the apache installation directory structure in your system.

  1. Make Apache aware of Shibboleth

Tell apache where to find the mod_shib you just installed (assuming you are using Apache 2.2). Add the below line in apache2.conf file.

 LoadModule mod_shib /usr/lib/apache2/modules/mod_shib_22.so 

Note : (For CQ only ) If you have enabled dispatcher, make sure you specify the above line before dispatcher loading i.e. shibd should load before dispatcher gets load.

  1. Configuration of Shibboleth SP and providing it to IdP.

The way shibboleth works is by running a daemon called shibd at the same time apache runs, and then mod_shib.so knows how to talk to shibd.

    • Configure /etc/shibboleth/shibboleth2.xml

      This file tells mod_shib and shibd all about your setup. It is probably already full of data, but be careful – you need to configure it to know which IdP you’re connecting to. Take a backup of the original file before directly editing this one.

        • Make sure the ‘entityID’ points at your machine. Update the entityID in  <ApplicationDefaults> tag as follows . (You can provide your own ID as well)

          <ApplicationDefaults REMOTE_USER="eppn persistent-id targeted-id" entityID="http://<your domain>/shibboleth">

          I’ll be using http://my.domain.com/shibboleth  as  the Entity Id.

        • Configure the IdP you want to use. Provide the entity ID of the IdP configured in part I or you can simply copy the entity ID present in <SAML_IDP_HOME>/metadata/idp-metadata.xml file
      <SSO entityID="https://idp.intelligrape.com/idp/shibboleth">SAML2 SAML1</SSO>
        • Also, specify from where the IdP’s metadata will come from. At the IdP server , copy its metadata file idp-metadata.xml located at <SAML_IDP_HOME>/metadata directory and place it at your SP server at location /etc/shibboleth/ directory. Add the below tag in /etc/shibboleth/shibboleth2.xml if not already present.
       <MetadataProvider type="XML" file="idp-metadata.xml"/>
    • Configure SP’s Metadata file
      • If you read the instructions at the shibboleth site they seem to VERY STRONGLY IMPLY that you need to construct your own Metadata file for an SP. It is PARTLY true. To start off with, let http://my.domain.com/Shibboleth.sso/Metadata provide a Metadata file for you. shibd will create one on-the-fly and serve it to you automatically.

      • Copy and paste the contents in a file say sp-metadata.xml. Save this file to the location <SAML_IDP_HOME>/metadata. Once you create or acquire metadata for SP, you must supply it to the IdP. Similarly, the IdP MUST supply its metadata to the SP which was already done in the previous step.
      • At IdP server add the following to the relying-party.xml file (at <SAML_IDP_HOME>/conf/) just after the IdP’s own metadata is defined:

        <pre><metadata:MetadataProvider xsi:type="FilesystemMetadataProvider" xmlns="urn:mace:shibboleth:2.0:metadata"
        id="SPMETADATA" metadataFile="<SAML_IDP_HOME>/metadata/sp-metadata.xml"/></pre>
    • In relyingParty.xml, we need to specify the details of IdP’s metadata and any other service provider’s metadata file that relies on our IdP. IdP’s metadata file is already provided in Part I. While specifying the relying party and metadata of SP, value of Provider attribute in <rp:RelyingParty> tag should be the same as that of EntityId specified in above metadata file. Also, change encryptAssertions attribute to “never”.
<rp:RelyingParty id="my.domain.com" provider="http://my.domain.com/shibboleth" defaultSigningCredentialRef="IdPCredential"
 <rp:ProfileConfiguration xsi:type="saml:SAML2SSOProfile" inclu deAttributeStatement="true"
 assertionLifetime="PT5M" assertionProxyCount="0"
 signResponses="never" signAssertions="always"
 encryptAssertions="never" encryptNameIds="never"
 <rp:ProfileConfiguration xsi:type="saml:SAML2ArtifactResolutio nProfile" signResponses="never" signAssertions="always"
 encryptAssertions="never" encryptNameIds="never"/>
    • Modify the /etc/shibboleth/attribute-map.xml file to have list of all the attributes being released from IdP. There are already a lot of attributes mentioned in the file but are commented out. You can find out your own attribute and uncomment it or you can use the below.
<Attribute name="urn:mace:dir:attribute-def:uid" id="uid"/>

8. Protecting and accessing the CQ published resource.

  • In apache2.conf , enable the dispatcher if not already enabled. You can refer the existing documentation or this excellent blog can be followed.

  • We can use shibboleth to secure the published content. Only authenticated users can access a resource , otherwise the request should not be processed. Below is the block diagram for how the request is processed.

  •  In order to check that SP is working, protect a directory by acquiring a shibboleth session. Add the below in shib.conf (if present in your system) , otherwise you can specify the same in httpd.conf file. For ubuntu , you can specify it in the file where virtual hosts entries are present. In my case it was /etc/apache2/sites-available/default . It may vary depending upon the installation structure.

    <Location /path/to/secure/content>
    # this Location directive is what redirects apache over to the IdP.
    AuthType shibboleth
    ShibRequestSetting  requireSession 1
    require valid-user

    Note that the path mentioned in Location element is relative to the Directory root. A sample configuration can be found at the end of the document.

  • Try to  access the published page , you should be presented with a IDP login page. Pass the valid credentials to access the page.
  • Note that this configuration can be used to protect any cached website/directory. This implementation is not just limited to CQ.

    Troubleshooting :

    • Please make sure to restart tomcat to reflect the changes you do in IdP’s configuration files. For starting/shutting tomcat , go to <TOMCAT_HOME>/bin and run the startup.sh/shutdown.sh respectively.

    • Restart apache server after every change in the apache.conf files.

    • If you have enabled dispatcher module for CQ, make sure you enable it for CQ cached directory only to avoid unwanted errors.

    <Directory /path/to/CQ/cached/content>
    <IfModule disp_apache2.c>
    SetHandler dispatcher-handler
    Options FollowSymLinks
    AllowOverride None
    • When you access the published page, make sure not to specify the port no <4503>.

    • Sample Virtual Host Entry in apache configuration .

    <VirtualHost *:80>
    ServerAdmin webmaster@localhost
    ServerName publish.intelligrape.com
    DocumentRoot /var/cache/apache2/cq-cache
    <Directory />
    Allow from all
    Options FollowSymLinks MultiViews
    AllowOverride None
    <Location /content>
    AuthType shibboleth
    ShibRequireSession On
    require valid-user
    ErrorLog ${APACHE_LOG_DIR}/error.log
    LogLevel warn
    CustomLog ${APACHE_LOG_DIR}/access.log combined

    Reference Links





P.S. This post was originally published on my company’s website.

Hope it helps !! 🙂

Enabling SSO with CQ5 – Part II

In first part of our tri-part blog series ,  we discussed about the installation and configuration of Shibboleth IdP. We’ll be focusing on the following two use cases :

Use-Case I : Protecting CQ5 author instance when CQ5 acts as a service provider (SP).

Use-Case II: Protecting any published resource/website.
This blogpost will target to provide the solution for the first use case and provide a way to protect CQ5 author instance.
Basic Workflow: 
When user accesses a protected resource, the SP determines if the user has an active session. If there is no valid session, SP will prepare an authentication request and send that SAML authentication request to IdP. Shibboleth IdP will check for valid session on its end, if no session exists, login screen will be presented to the user to enter the login credentials. IdP will in turn request LDAP for user credentials, fetches the necessary information, generates a SAML response and send it to SP. User is now trying again to access the protected resource, but this time the user has a session and SP knows who they are. SP will service the user’s request and send back the requested data.
AEM provides support for the SAML 2.0 Authentication Request and can act as a SAML service provider.
Necessary Steps  : 
  1. Installing LDAP Server.
  2. Installing Shibboleth IdP.
  3. Installing Apache tomcat on Ubuntu.
  4. Configuring Shibboleth IdP.
  5. Creating the SP’s metadata file (AEM in this case )  and providing it to IdP.
  6. AEM configuration.
  7. Accessing AEM author instance.
Steps 1-4 have already been covered in Part  I. Subsequent steps are explained below :
  1. Creating the SP’s metadata file (AEM in this case ) and providing it to IdP.
  • As AEM is acting as an SP, it needs to provide it’s metadata file to IdP. Create and open a file say <SAML_IDP_HOME>metadata/adobecq.xml and paste the below :
<md:EntityDescriptor xmlns:md="urn:oasis:names:tc:SAML:2.0:metadata" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" entityID="https://sp.intelligrape.com">

<md:SPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol urn:oasis:names:tc:SAML:1.1:protocol">
<ds:KeyInfo xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Id="SPInfo">
//copy the public key from <SAML_IDP_HOME>/credentials/idp.crt and paste here

Location="http://idp.intelligrape.com/Shibboleth.sso/SLO/SOAP" xmlns="urn:oasis:names:tc:SAML:2.0:metadata"/>
Location="http://idp.intelligrape.com/Shibboleth.sso/SLO/Redirect" xmlns="urn:oasis:names:tc:SAML:2.0:metadata"/>
Location="http://idp.intelligrape.com/Shibboleth.sso/SLO/POST" xmlns="urn:oasis:names:tc:SAML:2.0:metadata"/>
Location="http://idp.intelligrape.com/Shibboleth.sso/SLO/Artifact" xmlns="urn:oasis:names:tc:SAML:2.0:metadata"/>

<md:AssertionConsumerService Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST" Location="http://localhost:4502/saml_login" index="1"/>
  • In relyingParty.xml, we need to specify the details of IdP’s metadata and any other service provider’s metadata file that relies on our IdP. IdP’s metadata file is already provided in Part I. While specifying the relying party and metadata of AEM (SP), value of Provider attribute in <rp:RelyingParty> tag should be the same as that of EntityId specified in above metadata file. Also, change encryptAssertions attribute to “never”.

    <rp:RelyingParty id="sp.intelligrape.com" provider="https://sp.intelligrape.com" defaultSigningCredentialRef="IdPCredential"
    <rp:ProfileConfiguration xsi:type="saml:SAML2SSOProfile" includeAttributeStatement="true"
    assertionLifetime="PT5M" assertionProxyCount="0"
    signResponses="never" signAssertions="always"
    encryptAssertions="never" encryptNameIds="never"
    <rp:ProfileConfiguration xsi:type="saml:SAML2ArtifactResolutionProfile"
    signResponses="never" signAssertions="always"
    encryptAssertions="<strong>never</strong>" encryptNameIds="never"/>
    <rp:ProfileConfiguration xsi:type="saml:SAML2LogoutRequestProfile"
  • Provide the AEM’s metadata file created in above step to IdP:

<metadata:MetadataProvider id="CQMETADATA" xsi:type="metadata:FilesystemMetadataProvider"
maxRefreshDelay="P1D" />
  1.  AEM configuration.

For better understanding of SAML 2.0 Authentication Handler, please go through the official documentation . As mentioned in documentation, we need to provide the public and private keys to AEM. After creating a node named saml under /etc/key

  • Create a binary property idp_cert inside this node and upload the idp.crt file from <SAML_IDP_HOME>/credentials/idp.crt path.
  • Create a binary property private inside this node .Upload the below mentioned  newly converted key here.The private key must be in PKCS8 format.  To convert a PEM encoded private key to PKCS8 with openssl :
openssl pkcs8 -topk8 -inform PEM -outform DER -in idp.key  -nocrypt &amp;gt; pkcs8.key

where idp.key is the key that needs to be converted and pkcs8.key is the resulted key after conversion.

  • Update the ReferrerFilter,add the <IDP Hostname> in allow hosts property.

In Adobe Granite SAML 2.0 Authentication handler :

  • Set the IdP URL to the path where the SAML authentication request should be sent to or simply paste the below : https://idp.intelligrape.com:8443/idp/profile/SAML2/POST/SSO
  • Set the SP provider id to the ID specified in above metadata file (adobecq.xml) i.e. https://sp.intelligrape.com 

  • Uncheck the Use Encryption property in Adobe Granite SAML 2.0 Authentication Handler.

  • User Id attribute’s value should be the same as the ID of the <SAML Attribute > that contains the user id to authenticate the user.

  • For GroupMembership property , provide the id of the <SAML Attribute > here which will contain the list of groups to which the user will be added to after creation. You can add an attribute called “OU” in user profiles maintained in LDAP. Value of this attribute should be the valid group name present in CRX. Make sure you release the attribute and do the necessary configuration needed in <SAML_IDP_HOME>/conf/attribute-resolver.xml and <SAML_IDP_HOME>/conf/attribute-filter.xmlfiles. Refer Part I for more details. Please note that the value of the GroupMembership property should be the same as you provide the ID of the attribute “OU” in attribute-resolver.xml file.

  1. Accessing AEM author instance. 
  • Making a request to AEM at http://<host&gt;:<port>/ should redirect to IdP login page . Try to login with valid user profiles maintained in LDAP. You should be able to login if credentials are valid and all the above configuration is correct.

Troubleshooting Steps:

  1. After getting the IdP login screen, you log in using an LDAP user credential and are presented with a 404 error code and an error stack trace on welcome page.

This means that the imported user does not have appropriate READ permissions. You could give READ permission to this user, although the appropriate way would be to give READ permission to the group (say “iggroup”) to which imported LDAP users are assigned as members.It can be achieved by creating a group “iggroup” as member of the “contributor” group so that “iggroup” inherits the default permissions from the “contributor” group. In SAML authentication handler service , we specify the group membership attribute which will contain the list of groups the newly created user will be added to. To achieve this, add an attribute say ou in LDAP user profile , and specify the same attribute name in IdP attribute-filter.xml file. The id specified in attribute-filter.xml will be used as the value for groupMembership attribute in SAML handler.

  1. If IdP login page doesn’t appear , cross check the configuration again.

References :




P.S. This post was originally published on my company’s website.

Hope it helps !! 🙂

Enabling SSO with CQ5 – Part I

Single sign-on (SSO) is a mechanism where by a single action of user authentication and authorization can permit a user to access all computers and systems for which he has access permission, without the need to enter multiple passwords. While implementing SSO as part of one of our projects , we thought of protecting CQ5 author and publish instances. I’ll be covering it in tri-part blog series which will be composed of following three parts :

Part I : Laying the groundwork for installing, configuring Shibboleth (IdP)
Part II : Protecting CQ5 author instance when CQ5 acts as a service provider
Part III : Protecting any published resource/website

We’ll be using Shibboleth as a way to provide SSO. Shibboleth is an open-source project that provides Single Sign-On capabilities and allows sites to make informed authorization decisions for individual access of protected online resources in a privacy-preserving manner.

It’s an open source implementation for identity management, providing a web-based single sign-on mechanism across different organizational boundaries. It is a federated system, supporting secure access to resources across security domains. Information about a user is sent from a home identity provider (IdP) to a service provider (SP) which prepares the information for protection of sensitive content and use by applications. Please visit Shibboleth home page if you are not familiar with it.

In this post of this tri-blog series, I will cover installation and configuration of Shibboleth IdP.
Basic Concept:

When user accesses a protected resource, the SP determines if the user has an active session. If there is no valid session, SP will prepare an authentication request and send that SAML authentication request to IdP. Shibboleth IdP will check for valid session on its end, if no session exists, login screen will be presented to the user to enter the login credentials. IdP will in turn request LDAP for user credentials, fetches the necessary information, generates a SAML response and send it to SP. User is now trying again to access the protected resource, but this time the user has a session and SP knows who they are. SP will service the user’s request and send back the requested data

 Software stack being used:
OS : Ubuntu 12.10
CQ5 Server : AEM 5.6.1
LDAP Server : openLDAP (It can be any other LDAP server).
IDP (Identity Provider ): Shibbolet IDP 2.4.0 (Download the IDP from the given link).
App Server for IDP : Apache Tomcat 6 (Download the apache tomcat for linux).
Following Steps will be covered :
  1. Installing LDAP Server.
  2. Installing Shibboleth IDP.
  3. Installing Apache tomcat on Ubuntu.
  4. Configuring Shibboleth IDP.
  1. Installing LDAP Server.
For installing LDAP and configuring it for AEM (if needed), Installing and configuring LDAP for AEM can be followed.
  1. Installing Shibboleth IDP
  • After downloading the Shibboleth IDP from the link mentioned above , execute the install.sh file . Make sure you run it as a sudo user. You will be prompted to enter the IdP hostname. Provide the fully qualified name as per your requirement. Going forward, we’ll be using idp.intelligrape.com as IdP hostname in the forthcoming post.

  • Edit /etc/hosts file to have an entry like below:  idp.intelligrape.com idp 
 3. Installing Apache tomcat on Ubuntu.
  • Generate the keystore files for tomcat to enable SSL.
keytool -genkey -alias tomcat -keyalg RSA  -keystore \path\to\my\keystore 
  • Apply the SSL certificate generated in above step to TOMCAT_HOME/conf/server.xml. (TOMCAT_HOME refers to the installation directory of Tomcat)
<Connector port="8443" protocol="HTTP/1.1" SSLEnabled="true"
maxThreads="150" scheme="https" secure="true"
keystoreFile="<your keystore file generated in above step>"
keystorePass="&amp;lt;your password here&amp;gt;"  clientAuth="false"
sslProtocol="TLS" >

Copy   idp.war from SAML_IDP_HOME/war/idp.war to TOMCAT_HOME/webapps . Alternatively, you can create and open TOMCAT_HOME/conf/Catalina/localhost/idp.xml file and paste the following content into it. This step will make sure that TOMCAT will always serve the latest war.

 <Context docBase="/opt/shibboleth-idp/war/idp.war" privileged="true"
antiResourceLocking="false" antiJARLocking="false"
unpackWAR="false" swallowOutput="true" /> 
  • Create the directory TOMCAT_HOME/endorsed and copy the .jar files included in the IdP source endorsed (SAML_IDP_HOME/lib/endorsed) directory into the newly created directory.
  • Quick Test accessing https://idp.intelligrape.com:8443/idp/profile/Status returns Ok.
  1. Configuring Shibboleth IDP
  • Modify SAML_IDP_HOME/metadata/idp-metadata.xml to make sure all the location attribute points to idp app on tomcat 8443. (SAML_IDP_HOME refers to the installation directory of Shibboleth IdP.)
  • Modify SAML_IDP_HOME/conf/attribute-resolver.xml to add definition of attribute and ldap connect string. Find out the tag and uncomment the attributes you want to release to SP. or Either copy the below . Source attribute Id should be same as present in LDAP user profiles.
<resolver:AttributeDefinition xsi:type="ad:Simple"
xmlns="urn:mace:shibboleth:2.0:resolver:ad" id="uid" sourceAttributeID="uid">
<resolver:Dependency ref="myLDAP" />
<resolver:AttributeEncoder xsi:type="enc:SAML1String"
name="urn:mace:dir:attribute-def:uid" />
<resolver:AttributeEncoder xsi:type="enc:SAML2String"
nameFormat="urn:oasis:names:tc:SAML:2.0:nameid-format:transient" name="uid"
friendlyName="uid" />

Find out the tag and uncomment it for LDAP setting. Customize the same as per your LDAP settings.

 <resolver:DataConnector id="myLDAP" xsi:type="dc:LDAPDirectory"
      ldapURL="<your LDAP hostname>"
      baseDN="<your base dn settings>"
      principal="<principal user to login to LDAP>"
      principalCredential="<Password for admin user>"
  • Modify SAML_IDP_HOME/conf/attribute-filter.xml for each attribute you are releasing to be used by SP. (Add it inside afp:AttributeFilterPolicy tag. )

<afp:AttributeRule attributeID="uid">
        <afp:PermitValueRule xsi:type="basic:ANY"/>
  • Modify SAML_IDP_HOME/conf/handler.xml to remove all the entries for authentication except “UsernamePassword” and “PreviousSession”. You must comment out the “RemoteUser” element or the authentication will be open for any user. Removal of “PreviousSession” will disable SSO support, that is it will require the user to authenticate on every request.

  • Modify SAML_IDP_HOME/conf/logging.xml for detail debug trace. Increase the log level to Debug.

  • Modify SAML_IDP_HOME/conf/login.config to provide the login credential for LDAP. Uncomment the code present in ShibUserPassAuth or simply copy the below. Please customize it according to your LDAP connection settings

 edu.vt.middleware.ldap.jaas.LdapLoginModule required
           ldapUrl="&amp;amp;lt;Ldap host setting like ldap://ldap-test.intelligrape.com:389 &amp;amp;gt;"
           baseDn="&amp;amp;lt;base dn settings like dc=intelligrape,dc=com&amp;amp;gt;"
           bindDn="&amp;amp;lt;Principal user dn settings like cn=admin,dc=intelligrape,dc=com &amp;amp;gt;"
           bindCredential="&amp;amp;lt;Principal user password&amp;amp;gt;"
           port="&amp;amp;lt;port where ldap is running&amp;amp;gt;"
           userField="&amp;amp;lt;ldap attribute which will be used as username e.g. uid"&amp;amp;gt;
  • In relyingParty.xml, we need to specify the details of IdP’s metadata and any other service provider’s metadata file that relies on our IdP. For now, we’ll provide details for IdP’s metadata only. Enter the following details into relyingParty.xml , if the same doesn’t exist already.
<metadata:MetadataProvider id="IdPMD"
maxRefreshDelay="P1D" />
  • For each rp:ProfileConfiguration tag, change encryptAssertionsattribute to “never”.

This should cover the basic configuration for Shibboleth Idp.

We’ll introduce SP and its configurations in forthcoming posts. Stay tuned for next parts..

P.S. This post was originally published on my company’s website

Hope it helps !! 🙂

Talend : Passing values from Parent to Child Job

In the previous post, I discussed automating the Content Migration using Talend. As the power of Talend lies with creating meaningful Jobs which solves a business problem. More often than not, one job has to talk to another job.

Use case:

While working on a migration project, I had a use case where I wanted to pass a value from a Parent Job to a Child Job.


globalMap is a useful way to pass values, but it can only be used to pass the values within the same Job. For Instance, If I have a Job A like below:


where tJava_1 is setting a key in globalMap,

globalMap.put("testVal","Test Value 1");

and tJava_2 is trying to retrieve its value.

System.out.println("testVal in the same job is "+(String)globalMap.get("testVal"));

Its value can be retrieved from the components defined in the same Job but if we try to access those values outside this Job, we’ll get the null value.


This problem can be solved by utilizing the power of “Context Variables”.Create a Job 2 like as shown below and define a context variable “testValue”

job 3

Create a Job B like as shown below:  (Job 2 is a tRunJob component which is calling the above Job)


Here tJava_1 is setting a key in globalMap,

globalMap.put("testVal","Test Value 1");

To access the above value in Job 2 , define this component’s properties as below. Note the “Context Param” definition, this is how we can set the value in the context variable and retrieve it in the child job.

final job

tJava_1 in Child Job can simply retrieve its value :

System.out.println("testVal in theanother job is "+context.testValue);

Hope it helps !! 🙂

Automating Content Migration Using Talend

Redevelopment of a website is often triggered because of three major factors:

  • The current website is built on the technology stack which is now obsolete.
  • Redesigning/Revamping the existing website, either to address the weaknesses in the current system or to add significant features.
  • Switching to a new technology platform, such as a new Content Management System (say AEM).

Often these factors are coupled together, selecting the new technology platform combined with redesigning the existing site. Identification of the new technology platform is a difficult task and subject to various factors like budget, feasibility, stability of the new technology stack, maintenance support, time to market etc. No matter what that choice is, more often than not gives birth to migration projects. An organization that has thousands of Pages, Articles, Assets etc would want to retain that data rather than creating everything from scratch. Migration has a very wide scope, but this blog post will talk about Content Migration.

Content Migration is a process of migrating the existing Digital Media of an organization to the new System. Being involved in various Content Migration Projects, I can say that this is not a simple process.

A change in technology platforms makes the migration challenging, as does a major restructure or redesign of the site.

Content Migration can be achieved by either of the following two ways or sometimes combined:

  1. Manual: Ctrl+C and Ctrl+V are the favorite keyboard shortcuts for every developer. The manual way is always the easiest yet the most painful one. If it is about a few pages, you might want to copy the content from the old site and paste into the new publishing tool. But, if the old system contains thousands of pages, would you want to follow that route? Maybe you can hire a team of content authors who’ll do the job for you. But a manual process is error-prone.
  2. Automated: Option of automating the entire process of migration is clearly an appealing one. Using some tool/methodology where you can define the rules for the migration process. This requires little or no manual effort. Talend Open Studio (ETL tool) is one such tool which can be used to automate the content migration process . You can refer Talend Open Studio Reference Guide for better understanding of the tool.

There are three basic requirements for migration:

  1. The input export of the existing content. It can be in any form e.g. Delimited Text file, XML file etc depending on the existing system.
  2. The output format i.e. What should be the end result of the migration process? Which data from the existing system should map to the new system (AEM in our case)? You should be clear with all the mapping and transformation rules specific to the new system. As we are dealing with migration to AEM, then we need to define the mappings between the existing content and AEM components. For instance, if the input extract received is an XML file then you would have to define the mappings among XML tags and the properties of an AEM component.
  3. Loading Mechanism which defines how the content gets loaded into the target System. This is a very important part as whole migration process will be designed based on the method of load. We’ve chosen the approach of creating a valid CQ Package which can be installed from CRX package manager. One of the major advantages of using this approach is that we can easily rollback and uninstall the package.

A basic migration job created using Talend looks like as follows:

main job.png

Each block in the above picture is a component, tRunJob in this case which calls another sub-job. The connectors between two such blocks define the transition i.e. how and when do we want the next block to be executed. In this case, these transitions are called as triggers.

This main job consists of four sub-jobs. Purpose of each sub-job is explained below:

  1. Pre-migration Cleanup: This job reads the input content (say XML) and breaks it into smaller manageable chunks (multiple XML files) which can be worked upon individually. The job can be modified to handle scenarios like Internal URL mapping, resolving the character encoding issues, define any tag mapping rules etc.
  2. Extraction & Transformation: This job reads the XMLs created in the previous step one by one, transforms it to AEM specific .content.xml schema and stores it under the required jcr_root hierarchy on the file system.
  3. Post Migration Cleanup:  This job is required if there are any post-migration cleanups that need to be done.
  4. Packaging: This is the final step of migration which creates the archive of the pages migrated in the above steps. Keep in mind that the package needs to be AEM compatible i.e. it should contain jcr_root & META_INF folder and associated metadata properties as per AEM packaging standard.

Content Migration is an important activity in redevelopment of a website and it needs proper planning. While you can automate the migration process but it will always require human eyes to approve the migrated content.

Hope this helps !! 🙂