
Hitchhiker's Guide to AWS Step Functions


Announced during re:Invent 2016, AWS Step Functions is the spiritual descendant of the not-so-simple-to-use Simple Workflow (SWF) service. It addresses many of its predecessor’s usability issues and makes AWS Lambda the centerpiece. In this post, we will review all you need to know about Step Functions.

Getting Started

What is AWS Step Functions?

Step Functions is an orchestration service that allows you to model workflows as state machines. You design your state machine using a JSON-based specification language, and then you can start an execution of your state machine in three ways:

The Step Functions service manages the execution state, and either handles errors or performs retries as specified.

In most cases, each state in the state machine invokes a Lambda function. You can also incorporate branching logic, perform tasks in parallel, or even create SWF-style activities to integrate with external systems. Step Functions offers the ability to wait for an arbitrary amount of time between states, which is really difficult to do in an elegant, cost-efficient way with Lambda.

Step Functions also allows users to visualize the state machine at both design time and execution time. For example, a pipeline to ingest a large S3 file into DynamoDB might look something like this:

Step function state machine

Sadly, workflows can’t be visually designed. Instead, JSON is used for design, and the visualization tool is used to visually validate the design. Azure Logic Apps and IBM Node-RED both offer the ability to design workflows visually, and hopefully, Step Functions will follow suit in the near future.

When Should it Be Used?

Step Functions charges based on the number of state transitions. At $25 per million state transitions, plus the cost of the Lambda invocations, it is a comparatively expensive service. Considering that workflows can be implemented as code inside a single Lambda function, when should Step Functions be used?

I typically reserve Step Functions for three types of workflows:

1. Business Critical Workflows

Consumers are happy to pay a premium to insure expensive purchases such as a car or a house against unforeseen failures. Engineers are happy to pay a premium for workflows that they really want to succeed. Good examples are payment and subscription flows, the things that actually earn money.

For these business-critical workflows, it makes sense to pay a little extra to have more flexibility around error handling and retries, in order to give the workflows the best chance to succeed.

2. Complex Workflows

For complex workflows that involve many different states and branching logic, the visual workflow is a powerful design and diagnostic tool.

For example, an application support team is able to look at the workflow diagram for a running or completed execution and understand what happened. The team can intuitively understand the state of the system and how it got there without knowing the ins and outs of its implementation.

This is possible because the important design decisions in the workflow have been lifted out of the code and made explicit in a visual format that anyone can follow.

Visual workflow

Equally, if the diagram is shown to a product person (or any other non-technical user), they would understand it without knowing how the underlying code works. This makes collaboration much easier, and you can quickly identify misunderstandings when everyone is on the same page.

3. Long Running Workflows

For workflows that cannot complete within the five-minute execution limit for Lambda, you should also consider using Step Functions.

The Lambda team discourages the use of recursive Lambda functions because it’s easy to get them wrong.

Instead, you should use an orchestration service like Step Functions. You can put explicit branching checks in place and enforce timeouts at the workflow level. This helps prevent accidental infinite recursions.
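
To make this concrete, here is a minimal sketch of a polling workflow that could stand in for a recursive function. The state names, the check-job-status function, its ARN, and the $.status field are all made up for illustration, and the sketch assumes the function returns an object with a status field. The top-level TimeoutSeconds field is what caps the whole execution (one hour here).

```json
{
  "Comment": "Hypothetical polling workflow with a workflow-level timeout",
  "StartAt": "CheckJobStatus",
  "TimeoutSeconds": 3600,
  "States": {
    "CheckJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:check-job-status",
      "TimeoutSeconds": 30,
      "Next": "IsJobDone"
    },
    "IsJobDone": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "DONE",
          "Next": "Finished"
        }
      ],
      "Default": "WaitBeforeRetry"
    },
    "WaitBeforeRetry": {
      "Type": "Wait",
      "Seconds": 60,
      "Next": "CheckJobStatus"
    },
    "Finished": {
      "Type": "Succeed"
    }
  }
}
```

The Choice state is the explicit exit check, and if the job never reports DONE, the execution is timed out after an hour instead of looping forever.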

How Does it Work?

At the heart of a Step Functions state machine are the state definitions and how inputs are propagated from one state to the next.


A state is the way you tell the state machine to “do something.” Here are the seven types of states you can have:


Task

Executes the Lambda function identified by the Resource field. The output from the Lambda function is then passed on to the next state as input.

"TaskState": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:1234556788:function:hello-world",
  "Next": "NextState",
  "TimeoutSeconds": 300
}

One caveat to remember here is that TimeoutSeconds defaults to 60 if not specified. This would fail the state with a States.Timeout error after 60s, even if the Lambda function is still running! Equally, if the function itself times out before the TimeoutSeconds value, then Step Functions is not able to distinguish the timeout error from other types of errors. It makes handling specific errors more difficult.

It’s a good practice to always match TimeoutSeconds with the timeout setting for the function.


Pass

Passes input to output without doing any work.
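
A Pass state is also handy for injecting fixed data, or for stubbing out a Task while you sketch a workflow. In this minimal example, the state name and the config field are made up for illustration; Result supplies the data and ResultPath says where to put it on $.

```json
{
  "InjectDefaults": {
    "Type": "Pass",
    "Result": { "maxItems": 100 },
    "ResultPath": "$.config",
    "Next": "NextState"
  }
}
```

Given an input of { "x": 42 }, the next state would receive { "x": 42, "config": { "maxItems": 100 } }.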


Wait

Causes the state machine to wait before transitioning to the next state.
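
For instance, a Wait state that pauses for ten seconds could look like the sketch below (the state names are placeholders). Instead of Seconds, you can give an absolute time with Timestamp, or read either value from the input with SecondsPath or TimestampPath.

```json
{
  "WaitTenSeconds": {
    "Type": "Wait",
    "Seconds": 10,
    "Next": "NextState"
  }
}
```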


Succeed

Terminates the state machine successfully.


Fail

Terminates the state machine and marks it as a failure.
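
The two terminal states, Succeed and Fail, are both tiny. In this sketch, the state names, error name, and cause text are invented for illustration; Error and Cause describe the failure and show up in the execution history.

```json
{
  "OrderCompleted": {
    "Type": "Succeed"
  },
  "OrderRejected": {
    "Type": "Fail",
    "Error": "OrderRejectedError",
    "Cause": "The payment provider declined the charge."
  }
}
```

Neither state has a Next field, since the execution ends there.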


Choice

Adds branching logic to the state machine.
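
As a rough sketch, a Choice state that routes large orders to a manual review might look like this. The state names and the $.total field are assumptions made for the example.

```json
{
  "CheckOrderValue": {
    "Type": "Choice",
    "Choices": [
      {
        "Variable": "$.total",
        "NumericGreaterThan": 1000,
        "Next": "ManualReview"
      }
    ],
    "Default": "AutoApprove"
  }
}
```

Rules in Choices are evaluated in order and the first match wins; Default is used when none of them match. Note that a Choice state has no Next of its own, because each rule carries its own.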


Parallel

Performs tasks in parallel.
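
Here is a minimal sketch of a Parallel state with two branches. The state names, function names, and ARNs are hypothetical.

```json
{
  "FetchInParallel": {
    "Type": "Parallel",
    "Branches": [
      {
        "StartAt": "FetchProfile",
        "States": {
          "FetchProfile": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:fetch-profile",
            "End": true
          }
        }
      },
      {
        "StartAt": "FetchOrders",
        "States": {
          "FetchOrders": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:fetch-orders",
            "End": true
          }
        }
      }
    ],
    "Next": "NextState"
  }
}
```

Each branch is a small state machine with its own StartAt and States. The Parallel state finishes when all of its branches finish, and its output is an array holding each branch’s output in the order the branches are listed.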

Input and Output

When an execution is started, it is presented with a JSON input. That input is bound to the symbol $ and passed on as the input to the first state in the state machine.

By default, the output of each state would be bound to $ and becomes the input of the next state. However, you can use the ResultPath field to bind the output from a state to a path on $ instead, preserving other fields on $. (OutputPath, by contrast, filters which part of the resulting $ is passed on to the next state.)

For example, if the input to a Task state is the following:

{ "x": 42 }

If the output of the Task state is 84, then specify ResultPath as $.y as follows:

"DoubleInput": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:1234556788:function:double",
  "Next": "NextState",
  "ResultPath": "$.y"
}

The output of this Task state would become:

{ "x": 42, "y": 84 }

Similarly, if you don’t want to present the entire JSON object $ as input to a Lambda function, then you can also use InputPath to select parts of $.

The following gif illustrates how $ changes as it passes through a number of states. InputPath and ResultPath are used to carefully select values from $ as input to the Task states, binding the outputs to new fields.

Input and output
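
To put the two fields side by side: in the Amazon States Language, InputPath selects what a state receives, and ResultPath controls where the state’s result lands on $. The sketch below assumes a hypothetical double function that takes a bare number and returns twice its value.

```json
{
  "DoubleInput": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:double",
    "InputPath": "$.x",
    "ResultPath": "$.y",
    "Next": "NextState"
  }
}
```

With an input of { "x": 42 }, the function is invoked with 42 rather than the whole object. If it returns 84, the state’s output is { "x": 42, "y": 84 }.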

Error Handling

You can specify how a state should be retried by adding a Retry field to its definition.

"TaskState": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:1234556788:function:hello-world",
  "Next": "NextState",
  "Retry": [
    {
      "ErrorEquals": [ "ErrorA", "ErrorB" ],
      "IntervalSeconds": 1,
      "BackoffRate": 2.0,
      "MaxAttempts": 2
    },
    {
      "ErrorEquals": [ "ErrorC" ],
      "IntervalSeconds": 5
    }
  ]
}

If this TaskState failed with ErrorA or ErrorB, the execution engine would retry the state with two more attempts. IntervalSeconds specifies the delay before the first retry attempt. Subsequent retries would multiply the delay by BackoffRate. For example, with an IntervalSeconds of 1s and BackoffRate of 2.0, the delays between retries would be 1s, 2s, 4s, 8s…

If they are not specified, then the following default values would be used:

  • IntervalSeconds: 1s
  • BackoffRate: 2.0
  • MaxAttempts: 3

After the maximum number of retry attempts has been exhausted, the execution would fail with the last error unless you add a Catch field.

"TaskState": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:1234556788:function:hello-world",
  "Next": "NextState",
  "Retry": [
    {
      "ErrorEquals": [ "ErrorA", "ErrorB" ],
      "IntervalSeconds": 1,
      "BackoffRate": 2.0,
      "MaxAttempts": 2
    },
    {
      "ErrorEquals": [ "ErrorC" ],
      "IntervalSeconds": 5
    }
  ],
  "Catch": [
    {
      "ErrorEquals": [ "ErrorA", "ErrorB", "ErrorC" ],
      "Next": "RecoveryState"
    },
    {
      "ErrorEquals": [ "States.ALL" ],
      "Next": "TerminateMachine"
    }
  ]
}

Like the Retry field, you can specify how the state machine should handle different types of errors and what states it should transition to next. For the LAST catcher in the Catch array, you can also use the special States.ALL error type as a catch-all.
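
One detail worth knowing: by default, when a catcher fires, the error output replaces the state’s input, so the recovery state loses the original data. A catcher can take a ResultPath to keep both. In this sketch, the $.error field name is an arbitrary choice and the ARN is a placeholder.

```json
{
  "TaskState": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:hello-world",
    "Next": "NextState",
    "Catch": [
      {
        "ErrorEquals": [ "States.ALL" ],
        "ResultPath": "$.error",
        "Next": "RecoveryState"
      }
    ]
  }
}
```

RecoveryState would then receive the original input plus an error field containing the Error name and its Cause.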


Limits

Like other AWS services, Step Functions has a long list of limits. Here are a few important ones:

  • Maximum execution time: one year
  • Maximum execution history retention time: 90 days
  • When you start an execution, the execution name must be unique in your AWS account and region for 90 days.
  • There are regional limits on API calls to Step Functions such as ListExecutions and ListStateMachines. These limits are generally very low and refill slowly, so be mindful of them when your system needs to make regular API calls to Step Functions.
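
Because execution names have to stay unique for 90 days, it helps to build the name from something that is naturally unique. As a rough sketch, a StartExecution request might look like this; the state machine ARN, the naming scheme (an order ID plus a timestamp), and the input are all invented for illustration.

```json
{
  "stateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:order-workflow",
  "name": "order-1234-2018-07-01T12-00-00Z",
  "input": "{\"orderId\": \"1234\"}"
}
```

Note that input is a JSON document encoded as a string, and that the timestamp uses hyphens instead of colons, since colons are not allowed in execution names.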

Aside from these service limits, the biggest limitation with Step Functions is the fact that you can’t spawn concurrent Lambda invocations dynamically. Imagine a state machine that reads a CSV file in S3, and then, for each row, spawns a Lambda function to perform some processing.

This is currently not possible with the Parallel state in Step Functions because the number of parallel tasks has to be specified ahead of time.

Challenges to Monitoring and Debugging

Every state machine exposes a number of metrics in CloudWatch, allowing you to monitor the execution time and success rate of its executions and create alarms against failures.

Step function metrics

In the Step Functions console, if you select one of your state machines, you can see the history of all recent executions and their statuses.

Step function executions

You can then drill into a particular execution to see what happened. The Visual workflow pane shows the current progress (when the execution is still running) or outcome of the execution. You can click a step to see the input, output, and exception details for that step.

Step function execution details

Whilst there is a link to the CloudWatch Logs log group for the function, you still have to go and find the relevant log stream yourself.

The Execution event history pane displays a detailed history of all state transitions, including the timestamp (in UTC) and the relative elapsed time since the start of the execution. This is useful for identifying performance issues and slow running steps.

Execution event history

When you have steps that are executed multiple times in a single state machine execution, the Visual workflow pane only shows you what happened the LAST time that step was executed. The event history, on the other hand, shows you each invocation.

You can expand the TaskStateEntered and TaskStateExited events to see the input and output of the state.

Entering a state

Exiting a state


When a Lambda function error occurs, you can also expand the LambdaFunctionFailed event to see the error details.

Step function failure

Isolated Ecosystem

While Step Functions offers many tools to help you with monitoring and debugging, the problem is that they exist in an isolated ecosystem.

Modern applications are composed of many independently deployable services, all working together to make things happen. My state machines are a part of that application. As an engineer, I need a unified tool for monitoring all of these different services. It’s not helpful for me to have to jump between different tools and AWS consoles to collect the information I need to understand the end-to-end flow of data.

When trying to understand and debug the end-to-end flow of data, you also need to know what happened OUTSIDE the state machine. How was the execution started? From where did the data originate?

It’s for these reasons that I really like what the Epsagon guys are building. One of the nice features of their tool is the ability to link Step Functions executions with their upstream functions. This enables you to see at a glance not only what happened inside the state machine execution, but also what happened before it.

Step function in Epsagon

Passing Correlation IDs through Step Functions Executions

On my own blog, I have previously written about how you can capture and forward correlation IDs through various Lambda event sources such as API Gateway, SNS, and Kinesis data streams. As you might have noticed from earlier screenshots, we can apply the same technique with Step Functions.

Input parameter passing

Using a Middy middleware like this one, we can capture correlation IDs in the invocation input and include them in our logs.
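
For illustration, an execution input that carries a correlation ID might look like the sketch below. The __context__ wrapper and the field names are assumptions made for this example, not something Step Functions defines; any field that all of your functions agree on would work.

```json
{
  "orderId": "1234",
  "__context__": {
    "x-correlation-id": "7f3c2b1e-9d4a-4c6e-8a55-0b1f2e3d4c5a"
  }
}
```

Each function then reads the ID from its input, includes it in its logs, and passes it along in its output (or the state uses ResultPath to preserve it on $) so that it reaches the next state.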

Tracing Step Functions


If you don’t want to build your own mechanism for flowing correlation IDs through the Lambda functions in the state machine, or are not sure how to implement such a mechanism yourself, Epsagon can help, as it does this out of the box.

Another nice feature of Epsagon is that it shows you the logs for the relevant Lambda invocation, which is a lot better than just taking you to the CloudWatch Logs log group!

Getting the Lambda CloudWatch


About Author

Yan Cui

I’m an experienced engineer who has worked with AWS for nearly 10 years. I have been an architect and lead developer in a variety of industries, ranging from investment banks and e-commerce to mobile gaming. I have worked up and down the stack, from writing cloud-hosted functions with AWS Lambda all the way down to implementing a custom reliable-UDP protocol for real-time multiplayer mobile games. In the last 2 years I have worked extensively with AWS Lambda in production, and I have been very active in sharing my experiences and the lessons I have learnt; some of my work has even made its way into the Well-Architected whitepaper published by AWS. I am a polyglot in both spoken and programming languages: I am fluent in both English and Mandarin, and count C#, F#, Scala, Node.js and Erlang amongst the programming languages that I have worked with professionally. Although I enjoy learning different programming languages and paradigms, I still hold F# as my undisputed favourite. I am a regular speaker at user groups and conferences internationally, and I am also the instructor for Production-Ready Serverless and one of the co-authors of F# Deep Dives. In my spare time I share my thoughts on topics such as AWS, serverless, functional programming and chaos engineering on this very blog.
