Automating Content Migration Using Talend

Redevelopment of a website is usually triggered by one of three major factors:

  • The current website is built on a technology stack that is now obsolete.
  • The existing website needs a redesign or revamp, either to address weaknesses in the current system or to add significant features.
  • The organization is switching to a new technology platform, such as a new Content Management System (say AEM).

Often these factors are coupled: a new technology platform is selected and the existing site is redesigned at the same time. Identifying the new technology platform is a difficult task and depends on factors such as budget, feasibility, stability of the new technology stack, maintenance support and time to market. Whatever the choice is, it more often than not gives birth to a migration project. An organization that has thousands of pages, articles and assets would want to retain that data rather than create everything from scratch. Migration has a very wide scope, but this blog post will talk about Content Migration.

Content Migration is the process of migrating the existing digital media of an organization to the new system. Having been involved in various Content Migration projects, I can say that this is not a simple process.

A change in technology platforms makes the migration challenging, as does a major restructure or redesign of the site.

Content Migration can be achieved in either of the following two ways, or sometimes a combination of both:

  1. Manual: Ctrl+C and Ctrl+V are every developer's favorite keyboard shortcuts. The manual way is always the easiest, yet the most painful. If only a few pages are involved, you might simply copy the content from the old site and paste it into the new publishing tool. But if the old system contains thousands of pages, would you want to follow that route? You could hire a team of content authors to do the job for you, but a manual process is error-prone.
  2. Automated: Automating the entire migration process is clearly an appealing option. You use a tool or methodology in which you define the rules for the migration process, which then requires little or no manual effort. Talend Open Studio (an ETL tool) is one such tool that can be used to automate content migration. You can refer to the Talend Open Studio Reference Guide for a better understanding of the tool.

There are three basic requirements for migration:

  1. The input: an export of the existing content. It can be in any form, e.g. a delimited text file or an XML file, depending on the existing system.
  2. The output format, i.e. what should be the end result of the migration process? Which data from the existing system should map to the new system (AEM in our case)? You should be clear about all the mapping and transformation rules specific to the new system. As we are dealing with a migration to AEM, we need to define the mappings between the existing content and AEM components. For instance, if the input extract is an XML file, you would have to define the mappings between XML tags and the properties of an AEM component.
  3. The loading mechanism, which defines how the content gets loaded into the target system. This is a very important part, as the whole migration process will be designed around the method of load. We’ve chosen the approach of creating a valid CQ package which can be installed from the CRX Package Manager. One of the major advantages of this approach is that we can easily roll back by uninstalling the package.
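To make the second requirement a little more concrete, here is a minimal plain-Java sketch of a tag-to-property mapping. In our project the mapping lives in Talend components, so this is only an illustration: the source tags (headline, summary), the chosen target properties and the class name are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class TagMappingSketch {

    // Hypothetical mapping rules: source XML tag -> property on the page content node.
    static final Map<String, String> TAG_TO_PROPERTY = new LinkedHashMap<>();
    static {
        TAG_TO_PROPERTY.put("headline", "jcr:title");
        TAG_TO_PROPERTY.put("summary", "jcr:description");
    }

    // Escapes characters that are not allowed inside an XML attribute value.
    // Note: values starting with '{' or '[' may need extra escaping in .content.xml;
    // that case is not handled in this sketch.
    static String escapeAttr(String value) {
        return value.replace("&", "&amp;").replace("<", "&lt;")
                    .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Parses one source article and renders a minimal .content.xml for a cq:Page.
    static String toContentXml(String sourceXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(sourceXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse source XML", e);
        }
        StringBuilder props = new StringBuilder();
        for (Map.Entry<String, String> rule : TAG_TO_PROPERTY.entrySet()) {
            NodeList hits = doc.getElementsByTagName(rule.getKey());
            if (hits.getLength() > 0) {
                // Use the first matching tag as the property value.
                String value = hits.item(0).getTextContent().trim();
                props.append("\n        ").append(rule.getValue())
                     .append("=\"").append(escapeAttr(value)).append("\"");
            }
        }
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             + "<jcr:root xmlns:jcr=\"http://www.jcp.org/jcr/1.0\"\n"
             + "    xmlns:cq=\"http://www.day.com/jcr/cq/1.0\"\n"
             + "    jcr:primaryType=\"cq:Page\">\n"
             + "    <jcr:content\n"
             + "        jcr:primaryType=\"cq:PageContent\"" + props + "/>\n"
             + "</jcr:root>\n";
    }

    public static void main(String[] args) {
        String source = "<article><headline>Tom &amp; Jerry</headline>"
                      + "<summary>A short summary</summary></article>";
        System.out.println(toContentXml(source));
    }
}
```

Keeping the rules in a single map means the mapping can be reviewed and changed without touching the rendering code, which is roughly what a mapping component gives you in an ETL tool.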

A basic migration job created using Talend looks as follows:

[Screenshot: main job.png]

Each block in the above picture is a component, tRunJob in this case, which calls a sub-job. The connectors between two such blocks define the transition, i.e. how and when the next block is executed. In this case, these transitions are called triggers.

This main job consists of four sub-jobs. The purpose of each sub-job is explained below:

  1. Pre-migration Cleanup: This job reads the input content (say XML) and breaks it into smaller, manageable chunks (multiple XML files) which can be worked upon individually. The job can be modified to handle scenarios such as internal URL mapping, resolving character encoding issues and defining tag mapping rules.
  2. Extraction & Transformation: This job reads the XML files created in the previous step one by one, transforms each into the AEM-specific .content.xml schema and stores it under the required jcr_root hierarchy on the file system.
  3. Post-migration Cleanup: This job is required only if any cleanup needs to be done after the migration.
  4. Packaging: This is the final step of the migration, which creates the archive of the pages migrated in the above steps. Keep in mind that the package needs to be AEM compatible, i.e. it should contain the jcr_root and META-INF folders and the associated metadata properties as per the AEM packaging standard.
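The packaging step can be pictured with a small plain-Java sketch that zips the transformed files into the layout described above. It is not the Talend job itself. The group, name, version, filter root and class name are hypothetical, and a real package may carry more metadata than the two files under META-INF/vault shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class PackageSketch {

    // Builds an in-memory package zip: package metadata under META-INF/vault,
    // migrated content under jcr_root. Map keys are repository paths such as
    // "/content/mysite/.content.xml".
    static byte[] buildPackage(String group, String name, String version,
                               String filterRoot, Map<String, String> filesUnderJcrRoot) {
        try {
            // Package metadata in the standard Java XML properties format.
            Properties props = new Properties();
            props.setProperty("group", group);
            props.setProperty("name", name);
            props.setProperty("version", version);
            ByteArrayOutputStream propsXml = new ByteArrayOutputStream();
            props.storeToXML(propsXml, null, "UTF-8");

            // The filter tells the package manager which repository subtree the package covers.
            String filterXml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                    + "<workspaceFilter version=\"1.0\">\n"
                    + "    <filter root=\"" + filterRoot + "\"/>\n"
                    + "</workspaceFilter>\n";

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
                writeEntry(zip, "META-INF/vault/properties.xml", propsXml.toByteArray());
                writeEntry(zip, "META-INF/vault/filter.xml", filterXml.getBytes(StandardCharsets.UTF_8));
                for (Map.Entry<String, String> file : filesUnderJcrRoot.entrySet()) {
                    writeEntry(zip, "jcr_root" + file.getKey(),
                               file.getValue().getBytes(StandardCharsets.UTF_8));
                }
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void writeEntry(ZipOutputStream zip, String path, byte[] data) throws IOException {
        zip.putNextEntry(new ZipEntry(path));
        zip.write(data);
        zip.closeEntry();
    }

    // Lists the entry names of a zip, handy for a quick sanity check of the layout.
    static List<String> listEntries(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("/content/mysite/.content.xml", "<jcr:root/>"); // placeholder content
        byte[] pkg = buildPackage("my_packages", "migrated-content", "1.0", "/content/mysite", files);
        System.out.println(listEntries(pkg));
    }
}
```

Checking the entry list before uploading is a cheap way to catch a wrong folder layout early, since a package without jcr_root or the META-INF metadata will not install as expected.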

Content Migration is an important activity in the redevelopment of a website and it needs proper planning. While you can automate the migration process, it will always require human eyes to approve the migrated content.

Hope this helps !! 🙂

Changing the Port of a Running AEM Instance

AEM derives the port number from the name of the quickstart jar file. As the documentation says, we can configure AEM to run on a different port by renaming the jar file.

But this requires an instance shutdown. What if we want to change the port number of a running AEM instance? What if it is an e-Commerce site and shutting down the server means the loss of customers which implies the loss of business? This blog post will talk about how to change the port of a running AEM instance.

Use case:

In a production AEM environment, we access the AEM instance through a web server, typically Apache Web Server. The Dispatcher module that sits on Apache communicates with the AEM server, so an end user is oblivious to the fact that an AEM server even exists. But if there is a vulnerability in your application that can be exploited through the port AEM is running on, an attacker can reach the server through that port and cause serious security problems. It is therefore always recommended to change the default port. The port the AEM server runs on is normally hidden, but what if an attacker gets hold of that information? In that case a production system needs to change its port. Changing the port in the usual way requires downtime, and downtime can have serious impacts.

Resolution:

From AEM 6 onwards, we can change the port of a running AEM instance. Go to the Felix Console and search for Apache Felix Jetty Based Http Service.

[Screenshot: jetty-service]

Replace the default value of HTTP Port (highlighted above) with the new port number and hit Save. As soon as you save the configuration, AEM will start running on the new port. To verify that the configuration works, reload the page: it will not open. Now open the same page with the new port specified in the above configuration, and you will see that AEM is running on the new port.
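If you prefer to script this verification instead of reloading pages in a browser, a plain TCP connection check is enough. The following is a minimal sketch. The host and the two port numbers in main are hypothetical, so substitute your own old and new ports.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class PortCheck {

    // Returns true if something accepts TCP connections on host:port within the timeout.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: nothing is reachable on that port.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical values: 4502 as the old port, 4510 as the new one.
        System.out.println("old port open: " + isListening("localhost", 4502, 1000));
        System.out.println("new port open: " + isListening("localhost", 4510, 1000));
    }
}
```

After a successful change you would expect the old port to report false and the new one true. Keep in mind that this only proves that something is listening on the port, not that it is AEM.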

Hope it helps !! 🙂

Creating Custom Node Type in JCR

In this blog post, I’ll talk about the various ways of creating a custom node type and deploying it across multiple instances. We’ll be using AEM 5.6.1 as our CQ server.

A. Creating and Registering the Custom Nodetype 

Broadly, there are three ways of creating custom node types:

  1. Using the Node Type Administration console
  2. Programmatically
  3. Using Package Manager

We’ll discuss them one by one:

  1. Using Node Type Administration Console 
  • Using CND files.

The Compact Namespace and Node Type Definition (CND) notation provides a compact standardized syntax for defining node types and making namespace declarations. The notation is intended both for documentation and for programmatically registering node types. Existing documentation can be followed for creating the CND file.

Go to the Node Type Administration console, click on Import Node Type, copy/paste the contents of the CND file into the text area, and keep the “Automatically register nodetype” and “Automatically register defined namespaces” checkboxes checked. Click on Submit and your custom node type will be registered.

  • Without using CND files

Go to the Node Type Administration console, click on Create Node Type and enter the details such as node name, child node definitions, property definitions and supertypes. Click on the [Register Node Type] link at the bottom of the page to register the newly created node type. Check that the node type appears in the Node Type Administration console.

2. Programmatically

We can register the nodetype programmatically as well.

  • Using CND file.

We can use the JCR Commons CndImporter to register it. Create a CND file, say nodetypes.cnd, containing the definition of the new node type, make this file a part of the bundle, and pass its contents to CndImporter to register the node type.
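As an illustration, here is what the CND for a node type with two string properties could look like, using the same cp namespace and names as the JCR API example further below. To keep the sketch free of external dependencies, the CND is held in a Java string and only wrapped in a Reader. The class name is hypothetical, and the CndImporter call itself appears only as a comment because it needs jackrabbit-jcr-commons and an open JCR session.

```java
import java.io.Reader;
import java.io.StringReader;

class CndSketch {

    // Contents one might put into nodetypes.cnd: a namespace declaration
    // followed by a node type with two string properties.
    static final String NODE_TYPES_CND = String.join("\n",
            "<cp = 'https://codepearlz.wordpress.com/CustomNode'>",
            "[cp:testNodeType] > nt:base",
            "  - cp:Name (string)",
            "  - cp:City (string)",
            "");

    // CndImporter reads the definition from a Reader.
    static Reader cndReader() {
        return new StringReader(NODE_TYPES_CND);
    }

    public static void main(String[] args) {
        // Inside a bundle, with an open javax.jcr.Session named session, the reader
        // would be handed to Jackrabbit's importer (not executed in this sketch):
        //   CndImporter.registerNodeTypes(cndReader(), session);
        System.out.println(NODE_TYPES_CND);
    }
}
```

In a real bundle you would more likely load nodetypes.cnd from the bundle's resources than inline it as a string; the inline form is used here only to keep the example short.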

  • Without using CND file.

We can use the JCR API to create a new node type and register it. Following is the code snippet to register it.

import javax.jcr.NamespaceRegistry;
import javax.jcr.PropertyType;
import javax.jcr.Session;
import javax.jcr.nodetype.NodeTypeManager;
import javax.jcr.nodetype.NodeTypeTemplate;
import javax.jcr.nodetype.PropertyDefinitionTemplate;

// slingRepository is an org.apache.sling.jcr.api.SlingRepository reference
Session session = slingRepository.loginAdministrative(null);
try {
    NodeTypeManager manager = session.getWorkspace().getNodeTypeManager();

    // Register the namespace used by the node type and its properties
    NamespaceRegistry ns = session.getWorkspace().getNamespaceRegistry();
    ns.registerNamespace("cp", "https://codepearlz.wordpress.com/CustomNode");

    // Create node type
    NodeTypeTemplate nodeTypeTemplate = manager.createNodeTypeTemplate();
    nodeTypeTemplate.setName("cp:testNodeType");

    // Create two new string properties
    PropertyDefinitionTemplate customProperty1 = manager.createPropertyDefinitionTemplate();
    customProperty1.setName("cp:Name");
    customProperty1.setRequiredType(PropertyType.STRING);

    PropertyDefinitionTemplate customProperty2 = manager.createPropertyDefinitionTemplate();
    customProperty2.setName("cp:City");
    customProperty2.setRequiredType(PropertyType.STRING);

    // Add properties to node type
    nodeTypeTemplate.getPropertyDefinitionTemplates().add(customProperty1);
    nodeTypeTemplate.getPropertyDefinitionTemplates().add(customProperty2);

    // Register node type; 'true' allows an existing definition to be updated
    manager.registerNodeType(nodeTypeTemplate, true);
    session.save();
} finally {
    // Always release the administrative session
    session.logout();
}

3. Using Package Manager

We can register a node type via Package Manager as well. In Package Manager, upload a CQ package containing the custom nodetypes.cnd and install it. Check that the custom node types are registered in the Node Type Administration console.

Troubleshooting:

    • After registering the node type, make sure it is visible in the Node Type Administration console. If it is not registered, check error.log for more insight.
    • The CND file should be in the proper format to avoid unwanted errors.
    • Java 7 introduced stricter verification and changed the class format a bit so that it contains a stack map, used to verify that code is correct. If you are using Java 7, pass the parameters -XX:MaxPermSize=512m -Xmx1520m -XX:-UseSplitVerifier while starting the instance from the command line. Refer to this link for more details.

B. Deploying the Custom Nodetype across multiple instances

If we have enabled clustering, our multiple author and publish instances will be running on separate machines. We would want the new node type to be visible in all the instances. This can be done in two ways:

  1. If we are registering the new node type programmatically, then deploying the bundle will make it visible across all the instances.
  2. Whenever a new node type gets registered in the repository, three files get updated. custom_nodetypes.xml at <CQ author instance directory>/crx-quickstart/repository/repository/nodetypes contains the definition of the new node type. ns_idx.properties and ns_reg.properties at <CQ author instance directory>/crx-quickstart/repository/repository/namespaces hold the details of the newly added namespaces. Copying these files to the same locations on all the instances will make the node type visible there. Note that this requires an instance restart.

Hope it helps !! 🙂

Debugging in AEM

While working on a complex requirement in our project, we felt the need to continuously analyze the flow. Logs are of good help, but we wanted to analyze the complete flow. In this scenario, the debugging feature of an IDE comes in very handy. Software stack being used:

  • CQ Server: AEM 5.6.1
  • IDE: IntelliJ IDEA 12.0

First we need to start our CQ instance in debug mode. We can do so by running the following command:

 java -jar cq5-author-4502.jar -fork -forkargs -- -Xdebug -Xrunjdwp:transport=dt_socket,address=59865,suspend=n,server=y -Xmx1520m -XX:MaxPermSize=512m -XX:-UseSplitVerifier 

The debugger talks to the JVM over a socket, so we first need to open that socket by specifying a port number while starting the instance. In the above command, address=59865 creates the socket for us. In the IDE, we need to set up a remote connection for CQ and specify the same port number that was used while starting CQ. Follow the steps below to set up a remote connection in IntelliJ IDEA.

  1. Go to the Run menu (top of the window) and select Edit Configurations.
  2. Select Defaults and click on “+” to add a new configuration. A list of all the options will appear. Select “Remote” from that list.
  3. Enter the details in the window as per your need. Specify the same port number that was used while starting the CQ instance in debug mode.
  4. Click on “OK” to save the configuration.

Below is a screenshot for reference.

[Screenshot: debug]

Add breakpoints in the Java files that you want to debug by double-clicking on the line, and start the newly created configuration in Debug mode.

Troubleshooting:

1. While starting the AEM instance, make sure the JVM has enough heap size for running the CQ server; otherwise it will fork the JVM and the parameters will not be passed to the forked JVM. Use the -fork -forkargs -- options to ensure that the command line parameters get passed to the JVM.

2. If you are using Java 7, make sure to specify the -XX:-UseSplitVerifier parameter to avoid unwanted strict verification errors while debugging the bundle.
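If the IDE refuses to connect, it helps to confirm that the JVM really received the debug arguments. The sketch below inspects the arguments of the JVM it runs in and looks for the JDWP agent. The class name is hypothetical, and to check AEM itself the same logic would have to run inside the instance, for example from a bundle, which is an assumption of this sketch and not something the steps above require.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class DebugArgsCheck {

    // True if any JVM argument enables the JDWP debug agent, either in the
    // older -Xrunjdwp form used above or in the newer -agentlib:jdwp form.
    static boolean hasJdwpAgent(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith("-Xrunjdwp") || arg.startsWith("-agentlib:jdwp")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments passed to the current JVM (not the program arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM arguments: " + jvmArgs);
        System.out.println("debug agent enabled: " + hasJdwpAgent(jvmArgs));
    }
}
```

If this reports false for the forked JVM, the debug options were most likely given to the launcher process only, which is the situation the -forkargs option is meant to avoid.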

Hope it helps !! 🙂