What is AWS Step Functions?
Step Functions is an orchestration service that allows you to model workflows as state machines. You design your state machine using a JSON-based specification language, and you can then start an execution of your state machine in a number of ways, such as a direct StartExecution API call, an API Gateway endpoint, or a scheduled CloudWatch Events rule.
The Step Functions service manages the execution state, and either handles errors or performs retries as specified.
In most cases, each state in the state machine invokes a Lambda function. You can also incorporate branching logic, perform tasks in parallel, or even create SWF-style activities to integrate with external systems. Step Functions offers the ability to wait for an arbitrary amount of time between states, which is really difficult to do in an elegant, cost-efficient way with Lambda.
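As a sketch, here is a minimal state machine that invokes a Lambda function, waits ten minutes, then invokes another (the function names and ARNs are placeholders):

```json
{
  "StartAt": "FirstTask",
  "States": {
    "FirstTask": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:first-task",
      "Next": "WaitTenMinutes"
    },
    "WaitTenMinutes": {
      "Type": "Wait",
      "Seconds": 600,
      "Next": "SecondTask"
    },
    "SecondTask": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:second-task",
      "End": true
    }
  }
}
```

The Wait state costs nothing while it waits, whereas keeping a Lambda function alive for ten minutes would mean paying for idle execution time.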
Sadly, workflows can’t be visually designed. Instead, JSON is used for design, and the visualization tool is used to visually validate the design. Azure Logic Apps and IBM's Node-RED both offer the ability to design workflows visually, and hopefully, Step Functions will follow suit in the near future.
When Should It Be Used?
Step Functions charges based on the number of state transitions. At $25 per million state transitions, plus the cost of the Lambda invocations, it is a comparatively expensive service. Considering that workflows can be implemented as code inside a single Lambda function, when should Step Functions be used?
I typically reserve Step Functions for three types of workflows:
1. Business Critical Workflows
Consumers are happy to pay a premium to insure expensive purchases such as a car or a house against unforeseen events. Engineers are happy to pay a premium for workflows that they really want to succeed. Good examples are payment and subscription flows: the things that actually earn money.
For these business-critical workflows, it makes sense to pay a little extra to have more flexibility around error handling and retries, in order to give the workflows the best chance to succeed.
2. Complex Workflows
For complex workflows that involve many different states and branching logic, the visual workflow is a powerful design and diagnostic tool.
For example, an application support team is able to look at the workflow diagram for a running or completed execution and understand what happened. The team can intuitively understand the state of the system and how it got there without knowing the ins and outs of its implementation.
This is possible because the important design decisions in the workflow have been lifted out of the code and made explicit in a visual format that anyone can follow.
Equally, if the diagram is shown to a product person (or any other non-technical user), they would understand it without knowing how the underlying code works. This makes collaboration much easier, and you can quickly identify misunderstandings when everyone is on the same page.
3. Long Running Workflows
For workflows that cannot complete within the five-minute execution limit for Lambda, you should also consider using Step Functions.
The Lambda team discourages the use of recursive Lambda functions because it’s easy to get them wrong.
Instead, you should use an orchestration service like Step Functions. You can put explicit branching checks in place and enforce timeouts at the workflow level. This helps prevent accidental infinite recursions.
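Such a loop can be sketched as a Choice state that re-enters the Task until a completion flag is set, with a workflow-level TimeoutSeconds acting as a safety net against runaway executions (the function ARN and the $.done field are illustrative):

```json
{
  "Comment": "Illustrative long-running loop with an explicit exit check",
  "TimeoutSeconds": 3600,
  "StartAt": "DoWork",
  "States": {
    "DoWork": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:do-work",
      "Next": "CheckDone"
    },
    "CheckDone": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.done",
          "BooleanEquals": true,
          "Next": "Finished"
        }
      ],
      "Default": "DoWork"
    },
    "Finished": {
      "Type": "Succeed"
    }
  }
}
```

If the execution runs past the TimeoutSeconds value, Step Functions fails it, so an accidental infinite loop cannot run forever.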
How Does It Work?
At the heart of a Step Functions state machine is the Amazon States Language, the JSON-based language you use to define the states and the transitions between them.

A state is a unit of work or flow control within the state machine. These are the available state types:

- Task: executes the Lambda function identified by the Resource field. The output from the Lambda function is then passed on to the next state as input. One caveat to remember here is that TimeoutSeconds defaults to 60 if not specified; if the function runs for longer, the state fails with a States.Timeout error. It's a good practice to always match TimeoutSeconds with the timeout setting for the function.
- Pass: passes input to output without doing any work.
- Wait: causes the state machine to wait before transitioning to the next state.
- Succeed: terminates the state machine successfully.
- Fail: terminates the state machine and marks it as a failure.
- Choice: adds branching logic to the state machine.
- Parallel: performs tasks in parallel.
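For example, a Task state whose TimeoutSeconds matches a function configured with a 30-second timeout might look like this (the state name and ARN are placeholders):

```json
"ProcessOrder": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-order",
  "TimeoutSeconds": 30,
  "End": true
}
```

Without the explicit TimeoutSeconds, the state would fail after 60 seconds even though the function itself is still allowed to run.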
Input and Output
When an execution is started, it is presented with a JSON input. That input is bound to the symbol $ and passed on as the input to the first state in the state machine.
By default, the output of each state is bound to $ and becomes the input of the next state. However, you can use the ResultPath field to bind a state's result to a path on $ instead, preserving the other fields on $. (The related OutputPath field selects which portion of the final output is passed on.)
For example, suppose the input to a Task state contains a field x, and the output of the task is 84. Specifying ResultPath as $.y binds the result to the y field, so the output of the Task state retains x alongside the new y field.
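A minimal sketch of this behaviour, assuming a hypothetical compute-y function and illustrative field names x and y:

```json
"ComputeY": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:compute-y",
  "ResultPath": "$.y",
  "End": true
}
```

With input { "x": 42 } and a task result of 84, the output of this state would be { "x": 42, "y": 84 }.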
Similarly, if you don’t want to present the entire JSON object $ as input to a Lambda function, then you can also use InputPath to select parts of $.
The following gif illustrates how $ changes as it passes through a number of states. InputPath and ResultPath are used to carefully select values from $ as input to the Task states and to bind the outputs to new fields.
You can specify how a state should be retried when it fails using the Retry field, which takes an array of retriers. For example:

```json
"Retry": [
  {
    "ErrorEquals": [ "ErrorA", "ErrorB" ]
  },
  {
    "ErrorEquals": [ "ErrorC" ]
  }
]
```

If this Task state failed with ErrorA or ErrorB, it would be retried according to the first retrier; if it failed with ErrorC, the second retrier would apply. Each retrier can also specify IntervalSeconds, BackoffRate, and MaxAttempts fields to control the retry behaviour.
If they are not specified, then the following default values would be used:
- IntervalSeconds: 1s
- BackoffRate: 2.0
- MaxAttempts: 3
After the max number of retry attempts has been exhausted, the execution fails with the last error unless you add a Catch field.
Like the Retry field, the Catch field specifies how the state machine should handle different types of errors and what states it should transition to next (the Next state names below are illustrative):

```json
"Catch": [
  {
    "ErrorEquals": [ "ErrorA", "ErrorB" ],
    "Next": "HandleKnownErrors"
  },
  {
    "ErrorEquals": [ "ErrorC" ],
    "Next": "HandleErrorC"
  }
]
```

Rather than enumerating every error type in the last catcher, for example:

```json
"ErrorEquals": [ "ErrorA", "ErrorB", "ErrorC" ]
```

you can use the special States.ALL error type as a catch-all for the LAST catcher in the Catch array:

```json
"ErrorEquals": [ "States.ALL" ]
```
Like other AWS services, Step Functions has a long list of limits. Here are a few important ones:
- Maximum execution time: one year
- Maximum execution history retention time: 90 days
- When you start an execution, the execution name must be unique in your AWS account and region for 90 days.
- There are regional limits on API calls to Step Functions such as ListExecutions and ListStateMachines. These limits are generally very low and refill slowly, so be mindful of them when your system needs to make regular API calls to Step Functions.
Aside from these service limits, the biggest limitation with Step Functions is the fact that you can’t spawn concurrent Lambda invocations dynamically. Imagine a state machine that reads a CSV file in S3, and then, for each row, spawns a Lambda function to perform some processing.
This is currently not possible with the Parallel state in Step Functions because the number of parallel tasks has to be specified ahead of time.
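The branches of a Parallel state are fixed in its definition, as in this sketch (the state names and ARNs are placeholders):

```json
"ProcessInParallel": {
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "TaskA",
      "States": {
        "TaskA": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:task-a",
          "End": true
        }
      }
    },
    {
      "StartAt": "TaskB",
      "States": {
        "TaskB": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:task-b",
          "End": true
        }
      }
    }
  ],
  "End": true
}
```

There is no way to generate extra branches at runtime based on, say, the number of rows in the CSV file.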
Challenges to Monitoring and Debugging
Every state machine exposes a number of metrics in CloudWatch, allowing you to monitor the execution time and success rate of its executions and create alarms against failures.
In the Step Functions console, if you select one of your state machines, you can see the history of all recent executions and their statuses.
You can then drill into a particular execution to see what happened. The Visual workflow pane shows the current progress (when the execution is still running) or the outcome of the execution. You can click a step to see the input, output, and exception details for that step.
Whilst there is a link to the CloudWatch Logs log group for the function, you still have to go and find the relevant log stream yourself.
The Execution event history pane displays a detailed history of all state transitions, including the timestamp (in UTC) and the relative elapsed time since the start of the execution. This is useful for identifying performance issues and slow running steps.
When you have steps that are executed multiple times in a single state machine execution, the Visual workflow pane only shows you what happened the LAST time that step was executed. The event history, on the other hand, shows you every invocation.
You can expand the TaskStateEntered and TaskStateExited events to see the input and output of the state.
When a Lambda function error occurs, you can also expand the LambdaFunctionFailed event to see the error details.
While Step Functions offers many tools to help you with monitoring and debugging, the problem is that they exist in an isolated ecosystem.
Modern applications are made up of many independently deployable services, all working together to make things happen. My state machines are a part of that application. As an engineer, I need a unified tool for monitoring all of these different services. It’s not helpful to have to jump between different tools and AWS consoles to collect the information I need to understand the end-to-end flow of data.
When trying to understand and debug the end-to-end flow of data, you also need to know what happened OUTSIDE the state machine. How was the execution started? From where did the data originate?
It’s for these reasons that I really like what the Epsagon guys are building. One of the nice features of their tool is the ability to link a Step Functions execution with its upstream functions. This lets you see at a glance not only what happened inside the state machine execution, but also what happened before it.
Passing Correlation IDs through Step Functions Executions
On my own blog, I have previously written about how you can capture and forward correlation IDs through various Lambda event sources such as API Gateway, SNS, and Kinesis data streams. As you might have noticed from earlier screenshots, we can apply the same technique with Step Functions.
If you don’t want to build your own mechanism for flowing correlation IDs through the Lambda functions in the state machine, or are not sure how to implement one yourself, Epsagon can help, as it does this out of the box.
Another nice feature of Epsagon is that it shows you the logs for the relevant Lambda invocation, which is a lot better than just taking you to the CloudWatch Logs log group!