
How to Handle AWS Lambda Errors Like a Pro

In this post you will understand:
  1. How AWS Lambda errors and retries work, and the idea behind them.
  2. What consequences this has for your code.
  3. How to build your system using AWS Step Functions to control error handling, together with a useful resource for doing that.

Anyone familiar with Serverless knows that it does not just mean running your monolithic code on Lambda functions: it is a different architecture for your whole system. In this architecture, the system is composed of distributed nodes activated by asynchronous events. Each node must be designed as an independent component with its own API (a 'black box'), even when that API is not exposed to the outside world. So how can we define these nodes accurately? It turns out that it has a lot to do with handling errors correctly and dealing correctly with AWS retry behavior.

Lambda Retry Behavior

Lambda functions can fail in three cases:
  1. An unhandled exception is raised: invalid input was received, an external API failed, or a programming bug occurred.
  2. Timeout: a Lambda that runs longer than its configured timeout is forcibly terminated with a ‘Task timed out after … seconds’ message. The default timeout is 3 seconds (6 seconds when deploying with the Serverless Framework), and the maximum is 15 minutes.
  3. Out of memory: in this case, the Lambda usually terminates with ‘Process exited before completing request’, and the log report shows ‘Max Memory Used’ equal to ‘Memory Size’.

Out of memory error
 
When that happens (and be sure that it will), your Lambda will probably be retried according to the following behavior:
  1. Synchronous events: with event sources such as API Gateway, or with synchronous invocation using the SDK, the invoking application is responsible for retrying according to the response it gets from the Lambda. This is the least interesting case because it resembles regular error handling in a monolith.
  2. Asynchronous events: for most event sources, the Lambda is invoked asynchronously, meaning that no application is waiting to respond to the failure, so AWS takes care of it by itself. It triggers the Lambda again with the same event, usually twice within the following ~3 minutes (though in rare cases it may take up to six hours, and a different number of retries may occur). If all retries fail, the event usually should be recorded rather than thrown away. For that purpose, the important DLQ feature lets you configure a Dead Letter Queue (an Amazon SQS queue or an Amazon SNS topic) that receives such events.
  3. Stream-based events: currently, the only event sources of this type are Amazon Kinesis Data Streams and DynamoDB streams. A failing Lambda is triggered again and again until the data expires or is processed successfully. Unlike with asynchronous events, the event source is blocked until that point.
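
To make the DLQ option for asynchronous events concrete, here is a minimal sketch in Python of attaching a Dead Letter Queue to an existing function. It assumes a boto3 Lambda client is passed in; the function name and queue ARN in the comment are hypothetical placeholders, and the function's execution role is assumed to be allowed to send messages to the queue.

```python
def attach_dead_letter_queue(lambda_client, function_name, target_arn):
    """Send events that failed all asynchronous retries to the given queue.

    lambda_client is expected to be boto3.client("lambda"); target_arn is the
    ARN of the SQS queue that should receive the failed events.
    """
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        DeadLetterConfig={"TargetArn": target_arn},
    )


# Example call (hypothetical names, requires AWS credentials):
#   import boto3
#   attach_dead_letter_queue(
#       boto3.client("lambda"),
#       "record-user-operation",
#       "arn:aws:sqs:us-east-1:123456789012:record-user-operation-dlq",
#   )
```

Passing the client in as a parameter keeps the helper easy to exercise locally with a stub before pointing it at a real account.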

In this post, I’ll refer mostly to the most common and problematic case of asynchronous events, though some of the advice is relevant to the other cases as well. For a detailed explanation of retry behavior, check the AWS docs.
 

Retry Behavior Consequences

Keep calm and try again!
Since each Lambda might be executed several times with the same input, even though the 'caller' did not intend the operation to run several times, the Lambda must be what's called idempotent: running it more than once with the same input has no additional effect.

This term is not specific to Serverless functions: a classic example is a network API in which the response to a request did not arrive, so the same request is sent again. In a Serverless architecture, a similar case may happen when, for example, a Lambda times out before receiving such a response. Even if that is highly unexpected, in some cases incorrect retry handling may cause severe problems, such as violating the DB structure.
"Idempotence is the property of certain operations in mathematics and computer science that they can be applied multiple times without changing the result beyond the initial application" (Wikipedia).
But wait, what if the same operation has to be executed twice when it's not a retry? For example, let's say that the Lambda receives a user operation log as input and is responsible for recording it in a database. In that case, we need to differentiate between a retry and the case where the Lambda is simply triggered with the same input because the user performed the same operation again.

A good solution is to treat the Lambda's request ID as if it were part of the input itself, because the same ID is given only when the Lambda is retried. To extract it, use context.awsRequestId in Node.js (or the corresponding field in other languages, such as context.aws_request_id in Python). This is actually the general approach for detecting retry executions.

Using the request ID to be genuinely idempotent is not always convenient. In the previous example, the ID would have to be saved in the DB as well, so that subsequent invocations could decide whether to add a new record. Another solution may be to use an in-memory data store (such as Redis), but again, it adds quite significant overhead.

Step Functions to the Rescue

It turns out that AWS Step Functions is a useful, even crucial, service when building a Serverless application that handles errors and retries properly.

Let's say that in response to an event, the application has to perform several operations. If all of them are combined into the same Lambda, the code usually has to check, for each operation, whether it needs to be redone so that the whole Lambda remains idempotent, and that can be a real pain. It's important to understand the difference here from monolithic applications, in which the application itself can be responsible for making retries since it can wait between them, and that's not possible in Serverless.

On the other hand, with Step Functions, we can run each operation in a different Lambda and define the transitions between them as suits the specific case. Moreover, we can control the retry behavior (number of retries and delay duration) to make it most suitable too, and even disable it when that's the right thing to do. From my experience, creating a state machine even for a single Lambda is the easiest workaround to disable unwanted retry behavior.
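
To show what controlling retries per operation might look like, here is a small sketch that builds a state machine definition in the Amazon States Language as a Python dictionary. The state names and Lambda ARNs are hypothetical; the Retry fields (ErrorEquals, IntervalSeconds, MaxAttempts, BackoffRate) are the standard ones, and a MaxAttempts of 0 means the error is never retried.

```python
import json


def lambda_task(resource_arn, max_attempts, interval_seconds=5,
                backoff_rate=2.0, next_state=None):
    """Build a Task state with an explicit retry policy for all errors."""
    state = {
        "Type": "Task",
        "Resource": resource_arn,
        "Retry": [
            {
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": interval_seconds,  # delay before first retry
                "MaxAttempts": max_attempts,          # 0 disables retries
                "BackoffRate": backoff_rate,          # delay multiplier per retry
            }
        ],
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


# Hypothetical two-step flow: the first step may be retried, the second never.
definition = {
    "StartAt": "RecordOperation",
    "States": {
        "RecordOperation": lambda_task(
            "arn:aws:lambda:us-east-1:123456789012:function:record-operation",
            max_attempts=3,
            next_state="NotifyUser",
        ),
        "NotifyUser": lambda_task(
            "arn:aws:lambda:us-east-1:123456789012:function:notify-user",
            max_attempts=0,
        ),
    },
}

definition_json = json.dumps(definition, indent=2)
```

The resulting JSON string is what would be supplied as the state machine definition; each operation gets its own policy instead of inheriting Lambda's fixed asynchronous retries.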

Step by step
 
If you are already familiar with Step Functions, you may know that, unfortunately, their currently available triggers are only API Gateway and manual execution using the SDK. Because of that, we have created a template for a Python Lambda that can be used as glue code to execute a state machine asynchronously in response to any event, which can be found in this Gist.

A complete ready-to-use template is available on a public repository. 

To deploy this Lambda you should use the Serverless framework, with the awesome serverless-resources-env plugin in order to pass the state machine ARN easily. Make sure also to use serverless-step-functions and serverless-pseudo-parameters to define the state machine easily as in the following example. 

We effectively made a state machine that is triggered by an SNS event, and the event is accessible to the initial step's Lambda as input. Because we named the state machine execution after the invoker Lambda's request ID, everything becomes idempotent. If the invoker Lambda is retried, AWS gives it the same request ID, and AWS then won't execute the state machine again since the execution name is the same. Conceptually, the execution name of the state machine is also part of its input. This solution is useful in many situations, but keep in mind that it also adds some complexity overhead that affects debugging and overall observability of the system.
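
The idea can be sketched roughly as follows. This is not the Gist itself, only a minimal illustration under my own assumptions: the state machine ARN is read from a hypothetical STATE_MACHINE_ARN environment variable, and the Step Functions client is passed in so the core logic can be tried without AWS. It relies on the boto3 start_execution call, which refuses to start a second execution with a name that was already used for a closed execution or with different input.

```python
import json
import os


def start_state_machine(sfn_client, state_machine_arn, event, context):
    """Start one execution per Lambda request ID; return its ARN or None."""
    try:
        response = sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            name=context.aws_request_id,  # identical on every retry
            input=json.dumps(event),      # e.g. the SNS event, passed as-is
        )
    except sfn_client.exceptions.ExecutionAlreadyExists:
        # A previous attempt already started this execution: nothing to do.
        return None
    return response["executionArn"]


def handler(event, context):
    import boto3  # provided by the Lambda Python runtime

    return start_state_machine(
        boto3.client("stepfunctions"),
        os.environ["STATE_MACHINE_ARN"],  # hypothetical variable name
        event,
        context,
    )
```

Catching the "already exists" error matters here: without it, a retried invoker would fail again on every attempt and eventually land in the DLQ even though the state machine ran fine.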

Using Step Functions with SNS and Lambda
 
It's important to understand the error handling mechanism of Step Functions, which is different from Lambda's. For every Task state, a timeout duration can be set, so that if the Task does not finish in time, a States.Timeout error is generated. This timeout is practically unlimited, but for the typical case of a Task executing a Lambda, the Lambda's actual timeout is determined only by its own configured value (so it cannot be made longer this way =/ ). Therefore, make sure to configure the Task timeout to be equal to the Lambda's timeout. Retries of a Task are disabled by default and can be configured explicitly (unlike for Lambda).
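
As a small sketch of the timeout advice, the helper below builds a single-Task definition whose TimeoutSeconds mirrors the Lambda's own timeout and routes States.Timeout to a Fail state. The state names, the error label, and the ARN used in the usage line are hypothetical; since no Retry field is given, the Task is not retried.

```python
def build_definition(lambda_arn, lambda_timeout_seconds):
    """Single-Task state machine whose Task timeout matches the Lambda's."""
    return {
        "StartAt": "ProcessEvent",
        "States": {
            "ProcessEvent": {
                "Type": "Task",
                "Resource": lambda_arn,
                # Keep this equal to the timeout configured on the function.
                "TimeoutSeconds": lambda_timeout_seconds,
                # No "Retry" field: a failed Task is not retried.
                "Catch": [
                    {"ErrorEquals": ["States.Timeout"], "Next": "TimedOut"}
                ],
                "End": True,
            },
            "TimedOut": {
                "Type": "Fail",
                "Error": "TaskTimedOut",  # hypothetical error label
                "Cause": "The Task did not finish within its timeout.",
            },
        },
    }


definition = build_definition(
    "arn:aws:lambda:us-east-1:123456789012:function:process-event", 30
)
```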

Conclusions

Error handling in AWS Serverless architecture is quite confusing, and understanding how it affects your system is not always easy. I think it would be better if the retry behavior of AWS Lambda were configurable (as it is for Step Functions), and a retry counter field in the context parameter is an obvious missing feature.

Nevertheless, I believe that the proposed architecture with Step Functions is useful in many cases, and besides helping to handle errors and retries correctly, it encourages separation of components, which is good practice in the Serverless world.

Looking forward to hearing your thoughts, questions and feedback on Medium.

About Author

Ron Yishai

I am a software engineer with an academic background in Mathematics, experienced in cyber-security, reverse engineering, and machine learning, and recently also a Serverless enthusiast. For the past year, I have been working as a senior developer at Epsagon, a startup focused on bringing observability to serverless cloud applications using distributed tracing and AI technologies, tackling the unique challenges of such environments.
