Introduction
At a high level, most applications consist of workflows that process tasks. Within each workflow are steps that perform actions on those tasks. In modern applications, these steps are typically implemented as container services or serverless microservices.
To use an analogy: the application is an organisation such as a fast-food restaurant, a workflow is a team – for example, the team that takes an order, prepares it and brings it to the customer – and the steps are the individual team members performing their respective jobs.
This article will compare two approaches for managing the flow of tasks through a workflow: service orchestration vs service choreography.
Service Orchestration
Let’s start with service orchestration. In our analogy, this approach is like having a traditional manager run the team. The manager controls each team member’s activity as they process the food orders. Team members only respond to instructions from the manager, and each member’s output must return to the manager before the order proceeds to the next member.
Service orchestration has its benefits. There is oversight across all the steps in a workflow, and errors can be flagged and acted upon immediately. In step 6 of the example above, the manager accepts the order and then passes it to the first step. The manager has a map of all the steps, knowing which step comes next in the workflow and how the task will flow. The manager also knows when a step fails and can respond appropriately to that failure – such as retrying the step or notifying an employee to rectify the issue.
The challenge with this approach lies in scaling. As the application and its services grow, orchestration becomes increasingly inefficient. The orchestrator becomes a bottleneck – like a single manager trying to micromanage 100 team members. If the orchestrator is unavailable due to bugs, overload or other issues, the services are unable to function independently, and the workflow stops working.
Service Orchestration on AWS
A simplified service orchestration architecture could look like this on Amazon Web Services (AWS).
In the middle, we have AWS Step Functions, a fully managed serverless orchestration service on AWS. Step Functions is essentially a workflow manager responsible for managing the flow of tasks and executing the different steps in the workflow in the right order. In this example, the processing steps are implemented as AWS Lambda microservices – also serverless – and the final fulfilment step uses Amazon Simple Email Service (SES) to send emails.
Step Functions has built-in capabilities for error handling. If it does not receive a response from a step due to some issue, it can automatically retry the step or send a notification to the developers to resolve the matter manually.
Each step waits its turn and only starts processing when the orchestrator sends it a task. When a service receives a task, it acknowledges receipt and begins processing.
When the step completes processing, it will send the result back to the orchestrator, which will review it and execute the next step in the workflow.
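To make this concrete, here is a minimal sketch of what such an orchestration could look like in code, using boto3 to register a Step Functions state machine written in the Amazon States Language. The state names, function ARNs, role ARN and retry values are hypothetical placeholders, and for simplicity the fulfilment step is shown as a Lambda function that would call SES, rather than a production-ready setup.

```python
import json
import boto3

# Hypothetical state machine: processing steps with automatic retries and a
# catch-all notification path. All ARNs below are placeholders.
definition = {
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:validate-order",
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 2,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyDevelopers"}],
            "Next": "PrepareOrder",
        },
        "PrepareOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:prepare-order",
            "Next": "SendConfirmationEmail",
        },
        "SendConfirmationEmail": {
            # Fulfilment step: a Lambda function that calls SES to send the email.
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:send-confirmation",
            "End": True,
        },
        "NotifyDevelopers": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:notify-developers",
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="order-workflow",  # hypothetical name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole",
)
```

The orchestrator then invokes each state in order, passing each step’s output as the next step’s input.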
Service Choreography
A different approach that addresses the challenges of orchestration is service choreography. In this approach, the steps are independent, and they manage themselves. When tasks arrive in the workflow, their status is broadcast. The first step recognises the broadcast of a new task and automatically picks it up instead of waiting for an orchestrator to tell it what to do. When the service completes the task, it broadcasts that fact, and the next step in the workflow retrieves and processes the task.
Individual steps are not aware of each other, nor is there any central orchestrator aware of all steps in the workflow. Instead, each step is simply aware of the broadcast it needs to look for and the broadcast to send when it has completed processing the task.
For example, using our analogy of a fast-food restaurant, a customer places an order (1). As we saw earlier, with service orchestration, each team member will wait for the orchestrator to tell them to process the order when it’s their turn. In service choreography, the arrival of a new order will be automatically picked up by the ‘ordering’ member, who will confirm and validate the order (2) then add it to the system (3) and broadcast it to the team (4). Next, the ‘cooking’ member will recognise the broadcast of a newly accepted order and pick it up for preparation (5). Once the order is prepared, the cooking member will ring the bell to broadcast that the order is ready (6). This time, the broadcast is recognised by the delivery member (7), who will bring the order to the customer (8).
A benefit of this approach is reduced waiting time: every step has the authority to execute its job independently, removing the dependency on – and the bottleneck of – an orchestrator. Each step executes based on broadcast changes in the environment, such as the arrival of a new task or the completion of a previous one. The broadcasts do not perform any data validation, such as checking the accuracy of the prepared order; they only announce that the task’s status has changed.
Many modern applications interact with both internal application services and external services, and eliminating the need for an orchestrator makes it easier to add, change or remove services without changing existing code. Avoiding a central orchestrator also makes this approach more scalable and resilient, as the bottleneck and single point of failure are removed.
Service Choreography on AWS
With service choreography, there are three main components.
- First, the producers create and send tasks – the customer in our analogy
- Then, the consumers receive and process these tasks – the steps with their microservices in our earlier example
- Lastly, a router broadcasts the tasks between the steps
There are various services in AWS that can take on the role of a router, depending on the use case. For example, Amazon Simple Notification Service (SNS) provides a simple mechanism for delivering messages from producers to consumers. Amazon EventBridge provides more sophisticated capabilities, such as defining rules that route messages to specific consumers based on pre-defined criteria.
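As an illustration, the sketch below uses boto3 to create an EventBridge rule that routes ‘order accepted’ events to a ‘cooking’ consumer. The bus name, event pattern and Lambda ARN are assumptions made for the example, and the custom bus is assumed to already exist.

```python
import json
import boto3

events = boto3.client("events")

# Rule on a custom event bus that matches newly accepted orders
# (bus name, source and detail-type are hypothetical).
events.put_rule(
    Name="route-accepted-orders",
    EventBusName="orders-bus",
    EventPattern=json.dumps({
        "source": ["restaurant.ordering"],
        "detail-type": ["OrderAccepted"],
    }),
    State="ENABLED",
)

# Deliver matching events to the 'cooking' consumer, here a Lambda function.
events.put_targets(
    Rule="route-accepted-orders",
    EventBusName="orders-bus",
    Targets=[{
        "Id": "cooking-service",
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:cooking",
    }],
)
```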
A big difference between choreography routers and orchestration managers is that once a task is sent, the router does not wait for a reply from the consumer. In our analogy, once the food is cooked, the cooking member simply rings the bell; they do not wait for anyone to acknowledge the bell before continuing with the next order.
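In code, this fire-and-forget behaviour might look like the sketch below: the cooking service publishes an ‘order prepared’ event to the bus and carries on, without waiting for any consumer to process it. The bus name, event source and payload are hypothetical.

```python
import json
import boto3

events = boto3.client("events")

# Ring the bell: publish the event and move on. The call returns as soon as the
# bus has accepted the event; no consumer response is awaited.
events.put_events(
    Entries=[{
        "EventBusName": "orders-bus",  # hypothetical bus
        "Source": "restaurant.cooking",
        "DetailType": "OrderPrepared",
        "Detail": json.dumps({"orderId": "1234", "items": ["burger", "fries"]}),
    }]
)
```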
Benefits and Challenges
Benefits
Improved Productivity
Choreography removes the need for services to wait for instructions before processing a task. In addition, services do not have to wait for one another to complete. In some cases, services can work in parallel. This can reduce the time needed to complete a task.
Highly Scalable and Available
Routers remove the single point of failure that an orchestrator represents. In AWS, we can use fully managed services with multiple redundancies and automated scaling, such as SQS, SNS or EventBridge. These services offload the maintenance burden and ensure the high availability and scalability of the router.
Highly Flexible
Service choreography is highly flexible, as services can easily be added or removed by subscribing them to, or unsubscribing them from, the router. There is no need to manage a map of all steps in the workflow, though we need to be careful not to create a gap in the workflow when removing a step.
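For example, with SNS as the router, adding or removing a consumer might be as simple as the sketch below; the topic and queue ARNs and the ‘loyalty points’ service are hypothetical.

```python
import boto3

sns = boto3.client("sns")

# Add a new 'loyalty points' consumer by subscribing its queue to the topic
# (both ARNs are placeholders).
subscription = sns.subscribe(
    TopicArn="arn:aws:sns:eu-west-1:123456789012:order-prepared",
    Protocol="sqs",
    Endpoint="arn:aws:sqs:eu-west-1:123456789012:loyalty-points-queue",
    ReturnSubscriptionArn=True,
)

# Remove the consumer again without touching any other service in the workflow.
sns.unsubscribe(SubscriptionArn=subscription["SubscriptionArn"])
```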
Loose Coupling
Service choreography replaces the tight integration between services with loose coupling, which prevents a broken step from losing the task and its data. The task remains with the router until the step is fixed, and the workflow can continue without anything being lost.
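A common way to get this behaviour is to place an SQS queue between the router and each consumer. The sketch below assumes such a queue and a hypothetical process_order function.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/cooking-queue"  # hypothetical

def process_order(body: str) -> None:
    """Hypothetical processing step; replace with real business logic."""
    print(f"Preparing order: {body}")

# Receive a task; it stays on the queue (merely hidden) until explicitly deleted.
response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=10)

for message in response.get("Messages", []):
    process_order(message["Body"])
    # Delete only after successful processing. If the service crashes before this
    # line, the message becomes visible again and the task is not lost.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```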
Reduced Cost
Orchestrators tend to spend a lot of time waiting for responses, which results in being billed for idle time. The event-driven nature of choreography avoids this. In addition, router services tend to have a lower cost per message than the per-millisecond billing of the compute services needed to run an orchestrator. These differences can lead to significant cost savings.
Error Handling for Message Delivery
Routers such as Amazon SNS and Amazon EventBridge can automatically retry when tasks fail to reach their destination. After repeated failures, the tasks can be redirected to a Dead-Letter Queue, which stores the failed tasks so developers can be notified to remediate the issue. Once resolved, the tasks can be sent back into the workflow to continue processing.
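With EventBridge, for example, retries and a Dead-Letter Queue can be configured per target, roughly as in the sketch below; the rule name, retry values and ARNs are hypothetical.

```python
import boto3

events = boto3.client("events")

# Retry delivery for a while, then park undeliverable tasks in an SQS
# dead-letter queue for later analysis (all names and ARNs are placeholders).
events.put_targets(
    Rule="route-accepted-orders",
    EventBusName="orders-bus",
    Targets=[{
        "Id": "cooking-service",
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:cooking",
        "RetryPolicy": {
            "MaximumRetryAttempts": 5,
            "MaximumEventAgeInSeconds": 3600,
        },
        "DeadLetterConfig": {
            "Arn": "arn:aws:sqs:eu-west-1:123456789012:orders-dlq",
        },
    }],
)
```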
Challenges
Reduced Observability
One challenge of service choreography is reduced observability during troubleshooting. With an orchestrator, we can easily identify which step has failed and track down the root cause. However, a proper logging and monitoring strategy, such as proactive logging, can minimise the impact of this challenge.
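One such strategy is to log a shared identifier from every service so a task can be traced across the workflow even without an orchestrator. The sketch below is a hypothetical SQS-triggered Lambda handler that logs the order ID at each stage; the event shape and field names are assumptions.

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    """Hypothetical consumer: log the same orderId in every service so the
    end-to-end flow can be reassembled from CloudWatch Logs."""
    for record in event.get("Records", []):
        detail = json.loads(record["body"])
        logger.info(json.dumps({"step": "cooking", "orderId": detail.get("orderId"), "status": "received"}))
        # ... process the order here ...
        logger.info(json.dumps({"step": "cooking", "orderId": detail.get("orderId"), "status": "completed"}))
```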
Re-training
Most developers are familiar with calling APIs to query services and waiting for a response. In service choreography, however, services interact with one another through a publish/subscribe (pub/sub) model, which involves a learning curve.
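The shift is illustrated by the sketch below: an SNS-subscribed Lambda consumer has no caller waiting for a return value, so instead of returning a response it publishes the next event. The topic ARN and message shape are hypothetical.

```python
import json
import boto3

sns = boto3.client("sns")

def handler(event, context):
    """Hypothetical SNS-subscribed consumer in a pub/sub workflow."""
    for record in event["Records"]:
        order = json.loads(record["Sns"]["Message"])
        # ... prepare the order ...
        # No caller is waiting for a return value; publish the next event instead.
        sns.publish(
            TopicArn="arn:aws:sns:eu-west-1:123456789012:order-prepared",  # placeholder
            Message=json.dumps({"orderId": order["orderId"], "status": "PREPARED"}),
        )
```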
Incompatibility with Existing Projects
Existing projects typically require a re-design to take advantage of service choreography. The difference between a monolithic architecture and choreography is significant, requiring a complete rebuild of the application using a process such as event storming. Moving from orchestration to choreography, however, requires less effort and can often be done by changing the existing architecture rather than rebuilding it.
Error Handling for Failures
While routers can handle tasks not being delivered to a service, errors within the services often need to be handled by the services themselves. Using Figure 4 as an example, after Step 7 the delivery member must know what to do if an order is wrongly prepared. For such orders, the service might send the task back to the ordering system instead of delivering it to the customer, so that the cooking step can be retried. Repeated errors could be handled in the same way as failed message deliveries: the service can send them to a Dead-Letter Queue for analysis and follow-up, but this capability needs to be designed and implemented in the architecture.
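A sketch of such service-level error handling is shown below: a hypothetical delivery step sends wrongly prepared orders back for re-cooking and moves repeatedly rejected orders to a Dead-Letter Queue. All names, checks and ARNs are assumptions made for illustration.

```python
import json
import boto3

events = boto3.client("events")
sqs = boto3.client("sqs")

DLQ_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/delivery-dlq"  # placeholder

def order_is_correct(order: dict) -> bool:
    """Hypothetical check; real logic would compare the order to what was prepared."""
    return order.get("items") == order.get("preparedItems")

def handle_prepared_order(order: dict) -> None:
    """Hypothetical delivery step with its own error handling."""
    if order_is_correct(order):
        print(f"Delivering order {order.get('orderId')}")
        return

    if order.get("rejectedCount", 0) >= 3:
        # Repeated failures: park the task in a Dead-Letter Queue for follow-up.
        sqs.send_message(QueueUrl=DLQ_URL, MessageBody=json.dumps(order))
        return

    # Send the task back so the cooking step can be retried.
    events.put_events(Entries=[{
        "EventBusName": "orders-bus",  # placeholder bus
        "Source": "restaurant.delivery",
        "DetailType": "OrderRejected",
        "Detail": json.dumps({**order, "rejectedCount": order.get("rejectedCount", 0) + 1}),
    }])
```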
Conclusion
While orchestration works, it does have its challenges. As the application scales, there is a significant risk of the orchestrator becoming a bottleneck, causing the application to perform slowly. Regardless of the size of the application, the orchestrator is a single point of failure that could render the application unusable if it breaks or is unavailable.
Service Choreography should be the preferred approach for service integrations when designing modern applications. It improves the user experience for your customers by minimising the waiting time required between steps and improving your application’s speed. It also provides your application with the flexibility to easily add new capabilities and the scalability to keep up with increasing demand.
Service Choreography comes with its own set of challenges that developers will need to mitigate by following best practices and designing error handling into the architecture.