Distributed Circuit Breakers in Event-Driven Architectures on AWS
Intro
I was working on a presentation about handling event failures in Event-Driven Architectures. At some point, I got into describing why a circuit breaker is needed. I realized that I used custom implementations based on Elasticache in my projects. When I started thinking about “lighter” ways of setting them, I realized there is no “off-the-shelve” mechanism for setting up circuit breakers in Serverless architectures. I did some research, and here are my thoughts on the topic.
Why is Serverless so special?
The Circuit Breaker is stateful. We have to check its state before invocation and has to track all requests. Why? Because we are not opening circuit breakers based on a single invocation but on the RATE of failures.
In some implementations also based on the rate of slow requests duration.
We must track the state in a single place instead of a distributed state among all lambda instances. Let’s get back to the beginning.
Why do we need a Circuit Breaker?
Scenario
Imagine event processing, where you take tasks from the queue, load data from a third party, and then do “something” with the data. Let’s say we are going to persist them on a DynamoDB table.
Everything fails
One day, our 3rd party may have one of those bad days and start to fail. At that point, it will be perfect, and everyone would appreciate it if we were respectful about using our 3rd party system; since we have noticed that it is under a lot of stress, we could stop sending them more and more requests. We could save money, resources, and the nerves of some members of the SRE team.
What can we do then? Backoff idea
The first idea that comes to mind is to apply the back-off strategy to our consumers. We could change the visibility timeout of received and failed messages. There are a few key things to remember here:
- setting the maximum delay time value,
- and applying jitter.
Assuming that we have received a message like the one below, we could do it using one simple call to ChangeMessageVisibility
API.
We have to extract ApproximateReceiveCount
which contains information about the amount of processing attempts of the message and calculates backoff with jitter based on it — e.g., like in the snippet below.
Next, we need to use the AWS SQS SDK to change the visibility timeout using the receiptHandle
of each record, as shown in the snippet below.
Remember to add IAM policy to your AWS Lambda function — e.g. thgrough SAM managed policy: SQSPollerPolicy
Results
Backoff will retry messages at extended intervals. Thanks to this, we are not always processing the same messages in the loop, but we can give our third-party system some breathing space in the hope that it will get better in the meantime. Also, we can save our dead letter queues in case of limited maximum retries.
We may be proud of ourselves and pat ourselves on the back, but we have changed… not much. In the case of severe outages, AWS will limit our SQS consumers invocations, but in the case of high throughput systems, we are still in the same situation from a 3rd party system perspective.
What's even worse is that we're spending money for all these lambda invocations, and if our third party is down, the calls will be as slow as our timeout, which could cost us a lot. How can we prevent this from happening?
Circuit Breaker comes into play
Circuit breakers work like circuit breakers in an electrical circuit. When they detect failure, they trip open, and no more requests are sent. We want to close the circuit when the third-party system is back online.
That reconnaissance is done in a half-open state when we are sending “scout” requests. Those requests are some small groups that check whether the system behind the circuit breaker is still down or not. In case the system is down, the circuit remains open; if it works again, we close the circuit and return to business as usual.
Please take a look at the animation below to understand it.
If you are still confused about the process, see the diagram below.
Serverless State of game
This is not a big problem on servers because you can use one of the circuit breaker implementations, and there you go. You store the state on your server locally. Of course, not all instances will be in sync, but that’s usually not a big problem. With serverless, it’s way different because we have tons of instances, and you can’t pass state between each of those small instances.
What are the key considerations here?
- Complexity of the solution
- Cost of solution as we have to:
- read the state before each operation
- persist the effect of the call after each operation
I was thinking about a solution that could easily work for most of us in Serverless space. I typically store the Circuit Breaker state in Elasticache. Why? Because it’s there, it’s blazing fast, and—apart from custom implementation—it's easy to use. On the other hand, I understand that not all serverless systems run in the VPC and under heavy load, although that “heavy load” part is the most interesting here.
So, let me go through implementations found on the internet and then into something I came up with.
General purpose Option: Jeremy Daly’s classic
Almost every article about circuit breakers refers to Jeremy Daly’s article… so here we go.
It’s the same pattern that I am using and just mentioned above. It’s just a classic circuit breaker with the state in the centralized storage. It works great, it’s battle-tested, and I believe it’s the best way to set up circuit breakers in distributed systems as it provides the following:
- Super granular solution supporting multiple CBs in your logic
- Since logic is in your code, you can return fallbacks
- It can be used with any event-source mappings & types of invocations
This pattern requires Elasticache, so typically, it requires VPC as well. What can we do if we don’t want to build our solution in a VPC?
Non-VPC variant
DynamoDB considerations
The author mentions that it could also be used with DynamoDB for non-VPC lambdas. Is it worth considering that option in a high-volume system? Of course, it depends on your scale. Why?
- 1000 WCU limit on partition — so you will have to do tricks on persisting circuit breaker state above 1000 RPS (kudos Tycko Franklin for pointing that out)
- The cost of circuit breaker persistence might be high if you work with a high-throughput system.
Alternative approach? Momento!
Using Momento, you won’t have to put your lambda function into VPC, and you can save money at the same time. It’s also a super scalable and low-latency solution.
Words of warning
Wrong Implementations
I looked on the internet and have seen plenty of wrong implementations. Please correct me if I’m wrong, but even the most popular implementation in the space is incorrect, as it overrides the state from a single instance in a distributed system, which might cause many headaches.
Implementation: https://github.com/gunnargrosch/circuitbreaker-lambda
Wrong source of information
I’ve seen suggestions to use SSM and health check status, but if you get into that implementation, please do not check SSM before every call, as it may cost you a lot. There are some workarounds, though. Take a look at the thread linked below.
ESM Circuit Breaker Patterns
Some circuit breaker patterns are not as generic as the previous pattern and specialize in event consumers running with ESM (Event Source Mapping).
Christoph Gerkens’s Specialised ESM circuit breaker
The article covers an exciting idea. The plan is to use SQS Consumer ESM as a circuit breaker. We can enable/disable the ESM state to open/close the circuit.
When it comes to Half Open
state and sending “scout requests,” the author suggests using an additional lambda called “Trial Message Poller,” which will read one of the messages from the queue and send it over to our SQS Consumer lambda. Based on the result, we will close or keep the circuit open.
This approach is excellent as we don’t have to do anything in our lambda, and it’s completely detached from its logic.
Cost
There is no store, so it should be cheap, right? The author recommends using high-resolution metrics. This is mandatory if you don’t want to end up with circuit breakers with a minutes-long processing drift. I agree, but we also look here through a high-throughput processing lens.
We can configure alarms out of default high-resolution lambda metrics provided by AWS as they provide a lot of informations about lambda state
In that case, collecting many metrics might get expensive, and storing those metrics should be done carefully. PutMetricData
API is quite expensive, at $0.01 per 1000 invocations, so you could consider switching to EMF. With that price tag and tracking event after every invocation, using a proper circuit breaker with state storage might be cheaper. Just do your math beforehand.
Remember also that you must create a custom alarm metric when using partial failures.
Complexity
“Trial Message Poller” introduces additional complexity and is not something I would like to maintain. It must be coupled with the whole solution and cannot be reused with different types of triggers. Even with SQS, I can imagine potential issues related to visibility timeouts or switching to partial failures.
Potential loss of messages
When messages are not loaded from the queue and “rot” in case of a long and severe 3rd party failure, we might LOSE OUR MESSAGES. Yes, if messages are not reaching maxReceiveCount
and retentionPeriod
will elapse, messages won’t be moved to the dead-letter queue.
If you don’t have specific business requirements, do not try to over optimize
retentionPeriod
andmaxReceiveCount
by picking low values. Limiting time of processing can be done on a contract level.
Simplified ESM circuit breaker
Inspired by Christoph Gerken’s article, I considered simplifying that architectural pattern.
The principle is based on mirroring the full circuit breaker state in the CloudWatch alarm states and thus using it as a state of the circuit.
The metrics defined in the original article (Invocations
and Errors
to calculate the failure rate) are available in high resolution for free by default, so we can use them for tracking failure rates and defining alarms even with 10s granularity. (This solution can work based on any alarm according to your needs; the ESM Manager has to listen to changes on different alarms.)
A half-open state doesn’t have to be managed through Step Functions. We can use INSUFFICIENT_DATA
state of the CloudWatch alarm. There is one drawback here — the half-open period is based on a limited sample of messages consumed by SQS consumers instead of a single message taken from the queue. It’s also important to use TreatMissingData: missing
.
FailureRateAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
Metrics:
- Expression: "100 * errors / MAX([errors, invocations])"
Id: "failureRate"
Label: "failureRate"
ReturnData: true
- Id: "errors"
MetricStat:
Metric:
MetricName: Errors
Namespace: AWS/Lambda
Dimensions:
- Name: FunctionName
Value: !Ref YourFunction
Period: 10
Stat: Sum
ReturnData: false
- Id: "invocations"
MetricStat:
Metric:
MetricName: Invocations
Namespace: AWS/Lambda
Dimensions:
- Name: FunctionName
Value: !Ref YourFunction
Period: 10
Stat: Sum
ReturnData: false
EvaluationPeriods: 5
DatapointsToAlarm: 3
TreatMissingData: missing
Threshold: 80
ComparisonOperator: GreaterThanOrEqualToThreshold
State Management
When an alarm is triggered, it will open a circuit, and there won’t be any lambda invocations. The alarm will be moved then into the INSUFFICIENT_DATA
state. Which then can send “scout requests” through limited concurrency processing of the queue that, after failing, can open or close the circuit after successful processing.
In the case of using default AWS Lambda metrics, we can save on that new approach, but it’s still prone to the “Potential loss of messages” issue from the previous option.
Implementation
Enabling and disabling ESM is easy. The only thing missing is an option to add any metadata to ESM, even a name. Because that feature is missing, it has to be added to the stack as a parameter.
If the option were possible, this implementation would only enable/disable the original ESM and a special ESM created for the half-open state, which could also be enabled/disabled.
My implementation is written in go with some custom instrumentation and OTEL integration, but you can use something as simple as in the snippet here.
My implementation is based there on SQS, but it can be adapted to all different event-driven ESMs configs customized for a half-open state after small customizations.
Testing
The solution was tested with FIS using several different alarm configurations. Below, you can see the behavior of example alarm state transitions.
This approach lacks flexibility in controlling how often to move into a half-open state, but it’s good if you don’t have access to the code or don’t want to add any complexity to your lambda handler.
Summary
Getting circuit breakers right in a distributed system is quite tricky, and there are many different trade-offs and approaches to consider. In some cases, you might not need a distributed circuit breaker at all—e.g., in the case of a small fleet of lambda consumers, where having circuit state memory might be good enough.
If you are unlucky and it’s not your use case, I hope that article can help you decide and/or give you some ideas on what to consider during your implementation. Below, I created a decision tree that might make it easier for you. I believe that the classic approach is still the best, but I can see some use cases from ESM-based circuit breakers.
If you would like to discuss that topic, you can join the #believeinserverless community and discuss that with the community here.
Thank you for your time, and I would like to hear your thoughts on the topic! Let’s get in touch:
PS
Yet another approach
Sheen Brisals approach to Circuit Breakers with retries and archiving events that you can read about here: https://sbrisals.medium.com/amazon-eventbridge-archive-replay-events-in-tandem-with-a-circuit-breaker-c049a4c6857f or watch author presenting his idea in here (great talk BTW):