Job guarantee in AEM as a Cloud Service

Saurabh Khare
5 min read · Jul 1, 2021


Who would not like some assurance that their task will be completed while they are busy working on something they like? In today's world, having a job guarantee is of paramount importance, and I'm not talking about your occupation, but about the background tasks running in Adobe Experience Manager (AEM) that form the backbone of this industry-leading content management solution.


Sling jobs are more important than ever before in AEM as a Cloud Service (AEMaaCS). AEMaaCS is a cloud-native offering that uses Kubernetes clusters to run multiple instances of the AEM application in multiple pods. The clusters scale horizontally and automatically to sustain the throughput of the application, so pods are added and removed in real time.


It is therefore vital to leverage Sling jobs to guarantee the processing of your tasks in this cluster-aware environment.

This article is not about how to implement Sling jobs, as there is plenty of public documentation available, but about deciding when to use Sling jobs in general and discovering some pitfalls you may encounter while harnessing their power in AEMaaCS.

Let's talk about when you would want to leverage Sling jobs for running background tasks in AEMaaCS.

Task scheduling

The Sling Commons Scheduler can no longer be relied upon for scheduling, as execution cannot be guaranteed due to the nature of the cloud platform discussed above. The Sling Commons Scheduler will still run on AEMaaCS, but expect the Cloud Manager quality gate to flag your code and ask you to convert it to a Sling scheduled job. If a scheduled task is not business critical, or it runs frequently enough that a later run can make up for a missed execution, the Sling Commons Scheduler can be left unchanged; for any new implementation, however, it makes more sense to opt for Sling scheduled jobs for more resilient task execution.
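As a sketch, a cron-style schedule that previously lived in a Sling Commons Scheduler component can be registered as a Sling scheduled job through the JobManager API. The topic, cron expression, and property below are placeholders, not anything prescribed by Sling:

```java
import java.util.Collections;

import org.apache.sling.event.jobs.JobManager;
import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;

@Component(immediate = true)
public class NightlyCleanupScheduler {

    @Reference
    private JobManager jobManager;

    @Activate
    protected void activate() {
        // Scheduled jobs are persisted, so guard against registering a
        // duplicate schedule every time the component restarts on a new pod.
        boolean alreadyScheduled = jobManager.getScheduledJobs().stream()
                .anyMatch(info -> "com/myco/jobs/cleanup".equals(info.getJobTopic()));
        if (!alreadyScheduled) {
            jobManager.createJob("com/myco/jobs/cleanup")
                    .properties(Collections.singletonMap("path", "/var/myco/tmp"))
                    .schedule()
                    .cron("0 0 2 * * ?") // every day at 02:00
                    .add();
        }
    }
}
```

Unlike a plain scheduler callback, the scheduled job is queued and processed by the job engine, so a missed trigger caused by a pod recycle is not silently lost.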

Cluster aware task processing

While AEM has had clustering support built in for years, author-tier cluster topologies were still uncommon even in large enterprise-level deployments, as a performant author tier was achievable through optimal vertical scaling. Developers therefore tended to avoid writing cluster-aware code, since it unnecessarily complicates operations, especially write operations. With AEMaaCS shipping with MongoMK in a cluster deployment, writing cluster-aware code for write operations becomes essential. For write jobs we usually do not want to write more than once to the same repository area, so it is better to leverage the out-of-the-box queue configurations to spawn cluster-aware Sling jobs. We will see later how to create cluster-aware Sling jobs when we look at the different queue configurations available.
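What makes Sling jobs cluster aware is that the job engine hands each job to exactly one consumer in the cluster. A minimal consumer, using a hypothetical topic and write operation, might look like this:

```java
import org.apache.sling.event.jobs.Job;
import org.apache.sling.event.jobs.consumer.JobConsumer;
import org.osgi.service.component.annotations.Component;

// The job engine guarantees that each job is processed by exactly one
// consumer instance in the cluster, and at least once overall.
@Component(
    service = JobConsumer.class,
    property = { JobConsumer.PROPERTY_TOPICS + "=com/myco/jobs/writeback" }
)
public class WritebackJobConsumer implements JobConsumer {

    @Override
    public JobResult process(Job job) {
        String path = job.getProperty("path", String.class);
        try {
            // ... perform the write operation against the repository at 'path' ...
            return JobResult.OK;     // finished; the job is removed
        } catch (RuntimeException e) {
            // Returning FAILED requests a retry; letting the exception
            // propagate would cancel the job permanently.
            return JobResult.FAILED;
        }
    }
}
```

Because only one instance processes the job, two pods never race to write to the same repository area for the same job.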

Long running background processes

As mentioned in the AEM as a Cloud Service development guidelines, it is advisable to avoid long-running background tasks, as pods can go up and down at any time. However, if you still need to perform background tasks, such as bringing data from third-party services into AEM, Sling jobs come to the rescue: they can be made resumable, since the job progress data is stored in the repository. Another important point is that if any of these background tasks needs to write to the repository, the task should run on author only and replicate the content to the publish service if required. This is a mandatory requirement, as publish services are now read-only, so custom application code cannot write any data to the publish repository.

As a side note, any resource-intensive job must save its progress at set intervals so it can survive a restart caused by the pod being recycled by the autoscaler. Otherwise, the newly spawned pod will start the job again from its initial state.
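Progress tracking is built into the JobExecutor API, which the job engine prefers over a plain JobConsumer when both are registered. A resumable import could be sketched as follows; the topic, batch count, and the "lastCompletedBatch" checkpoint property are all illustrative assumptions:

```java
import org.apache.sling.event.jobs.Job;
import org.apache.sling.event.jobs.consumer.JobExecutionContext;
import org.apache.sling.event.jobs.consumer.JobExecutionResult;
import org.apache.sling.event.jobs.consumer.JobExecutor;
import org.osgi.service.component.annotations.Component;

@Component(
    service = JobExecutor.class,
    property = { JobExecutor.PROPERTY_TOPICS + "=com/myco/jobs/import" }
)
public class ThirdPartyImportExecutor implements JobExecutor {

    private static final int TOTAL_BATCHES = 100;

    @Override
    public JobExecutionResult process(Job job, JobExecutionContext context) {
        // If the job was restarted on a new pod, resume from the last
        // checkpoint persisted for this job instead of batch 0.
        int start = job.getProperty("lastCompletedBatch", 0);
        context.initProgress(TOTAL_BATCHES, -1);
        context.incrementProgressCount(start);

        for (int batch = start; batch < TOTAL_BATCHES; batch++) {
            if (context.isStopped()) {
                return context.result().message("Stopped at batch " + batch).cancelled();
            }
            // ... fetch one batch from the third-party service, persist it,
            //     and save the checkpoint alongside the imported data ...
            context.incrementProgressCount(1);
        }
        return context.result().message("Import finished").succeeded();
    }
}
```

The executor also honors stop requests via isStopped(), which matters when the pod receives a shutdown signal.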


Queue configuration options

By default, if you do not add any queue configuration, your Sling jobs are added to the default main queue. This is fine unless you want to change how your jobs are processed.

We will go through a few of the important configurations that decide how your jobs are processed in AEM Cloud Service environments.

  • Queue type: If your Sling jobs are write intensive and you are not sure whether parallel jobs will write to the same repository area, it makes sense to set queue.type to ORDERED. If the jobs are resource intensive and the queue is configured to run jobs in PARALLEL, the autoscaler will spawn new pods to process the new jobs, which might update the same area of the repository (say, some node under /var) and produce an inconsistent state.
  • Maximum retries: What value to use for this field sometimes becomes a topic of discussion within teams. By default it is set to 10, but one should choose carefully: with an ordered queue, job retries prevent the queue from dispatching the next job and thereby slow down progress. If you know at what point the job fails and that the next retry will get stuck at the same point, there is no rationale for keeping this value high.
    Another point I cannot stress enough: a job is only retried if it did not finish in the succeeded or cancelled state. It will not be retried if the consumer throws an exception; in that case the job is cancelled. Hence it is very important to implement robust error handling and to return the correct result from catch blocks.
  • Queue priority: Do not change the queue priority unless you know what you are doing. AEM does a lot of background work using Sling jobs, and changing the priority can affect how those jobs are processed.
  • Prefer Creation Instance: This is best left to the discretion of the job engine. Jobs can be offloaded to another node in the cluster, and unless there is a very specific reason, such as processing content that is only present on one node, there is no need to enable it.
  • Maximum Parallel Jobs: This is not important for AEMaaCS, as we do not have control over the CPU configuration, but for AMS/on-premise setups I am copying this verbatim from the Adobe documentation:
    Adobe recommends that you do not exceed 50% of the CPU cores. To adjust this value, go to the following: http://<host>:<port>/system/console/configMgr/org.apache.sling.event.jobs.QueueConfiguration. Set queue.maxparallel to a value that represents 50% of the CPU cores of the server that hosts your AEM instance. For example, for 8 CPU cores, set the value to 4.
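Putting these options together, a custom queue is declared as an OSGi factory configuration for the org.apache.sling.event.jobs.QueueConfiguration PID. The sketch below pins a hypothetical write-intensive topic to an ordered queue with a low retry count; the queue name, topic, and values are illustrative:

```json
{
  "queue.name": "myco-writeback-queue",
  "queue.topics": ["com/myco/jobs/writeback"],
  "queue.type": "ORDERED",
  "queue.retries": 3,
  "queue.retrydelay": 2000,
  "queue.priority": "NORM"
}
```

In an AEMaaCS project this would typically live under the project's OSGi config folder, e.g. as org.apache.sling.event.jobs.QueueConfiguration~myco-writeback.cfg.json.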

Conclusion

There has been a paradigm shift with the introduction of AEM as a Cloud Service. Being a cloud-native solution, it integrates seamlessly with the other offerings under the Adobe Experience Cloud umbrella and opens new avenues of asynchronous task processing with Adobe I/O Events. Before going there, it is important to understand how Sling jobs can be leveraged optimally to complete the ecosystem. Hopefully you now have a better understanding of how this is possible and can use it to your advantage.


Written by Saurabh Khare

Senior Technical Architect @ Adobe. 12 years of experience with interest in Full stack applications and cloud native service architectures.
