Do you need P? Systems Correctness Practices at AWS

Marcin Sodkiewicz
7 min readFeb 11, 2025

--

A non-big tech engineer’s view of formal and semi-formal methods

Intro

Recently, I came across an AWS paper titled “Systems Correctness Practices at AWS” by Marc Brooker and Ankush Desai, which you can find linked below.

I love systems reliability topics and AWS, so I was hooked by the title and started to read it immediately. I found some interesting things in the paper about using formal methods in designing systems, which is not a part of my toolbelt.

Would it change my workflow, and what do I think about them?

AWS Way

According to the article, AWS ensures system reliability through a combination of formal methods (like TLA+ and P programming language) and semi-formal approaches (like property-based testing, deterministic simulation, fuzzing, and fault injection with FIS). These techniques help identify bugs early, enable performance optimizations, and verify critical security components.

Semi-formal methods are well-known techniques in the industry, but I never used formal methods during system design. AWS has used formal methods for quite a long time. In the paper “How Amazon Web Services uses formal methods”, you can find that TLA+ was a successful technique that led to this conclusion 10 years ago:

At AWS, formal methods have been a big success. They have helped us prevent subtle, serious bugs from reaching production, bugs that we would not have found via any other technique. They have helped us to make aggressive optimizations to complex algorithms without sacrificing quality.

source: https://lamport.azurewebsites.net/tla/formal-methods-amazon.pdf

While TLA+ proved to be successful, something more approachable to programmers was needed, and this is where P language comes into play.

There is no example in the paper, and I was super curious how such an experiment could look like.

On top of that AWS uses formal method with Dafny, Kani & Lean. It seems that there is plenty of formal proofs done during AWS services design, and that’s great!

End-user practicality

In case it was not clear before, the whole formal method approach with P is meant to “run chaos engineering” at the design time, instead of (post)implementation time. Sounds exciting!

Authors are complaining about the lack of adoption and thus talent pool on the market. It was not a shocker to me, as formal methods are not very common in the industry. In the paper there is part about it:

Despite significant success in scaling formal and semi-formal testing methods across AWS over the past 15 years, several challenges persist, particularly in industrial adoption of formal methods. The primary barriers for formal methods tools include their steep learning curve and the specialized domain expertise required. Additionally, many of these tools remain academic in nature and lack user-friendly interfaces

The education gap begins at the academic level, where correctness even basic formal reasoning approaches are rarely taught, making it difficult for graduates from top institutions to adopt these tools. Although formal methods and automated reasoning are crucial for industry applications, they continue to be viewed as niche fields. We anticipate that increased industry adoption of formal methods and automated reasoning will attract more talent to this domain.

Okay, so it seems that we have a huge gap, and there is low adoption in the industry and at the universities. So what are we missing? Is it something that is really niche and should stay that way, or should we add yet another type of testing on top of our semi-formal method of resiliency testing?

I was trying to find some test cases that would convince me that I should change the way I deliver software. I started by watching this talk by Ankush Desai (one of the authors of the paper) and Bikash Behera.

After watching this session, I can say that I was totally not convinced. As much as I am excited about the whole idea, there was no “Aha!” moment during this talk, and the part with the example presented was strange to me.

An example of such a case should be an eye-opener, and in the presented scenario, I knew all of the potential issues with that architecture with a glimpse of an eye. 2PC was clear straight away, and the conclusion that we need self-healing architectures and that idempotency is important in EDA is basically a baseline of best engineering practices. In the case of going as deep as using formal methods, I expect something more.

The strangest part of it was the conclusions. The speaker states that P language is:

  • easy — even though the authors themselves in the paper say that there is a steep learning curve
  • fast—speaker (I suspect that an expert in the field) built that model within 2 weeks, and the model was based on the same pattern repeated multiple times. Maybe it is fast in comparison to other formal methods, but I have no experience in the area. Yet, it doesn’t seem fast to me in this scenario. Repo with this scenario can be found here: https://github.com/aws-samples/p-model-transaction-processing-systems/tree/main. As you can see, this is also code, so it means that you have to maintain it, and it can be buggy.

I downloaded Peasy for Visual Studio Code, checked the repo with examples from the talk, and after that, I was still interested in getting some more information and checked https://p-org.github.io/P/.

My general feeling is that despite the author's ambitions, it will still be a niche. And that’s good. As much I think formal methods are a good technique to spot issues in the case of building complex distributed systems for a big tech cloud computing company. Complex distributed system behavior and its corner cases are their core business.

I don’t think that the majority of companies building systems on AWS need formal methods. Based on the Pareto principle, there is much more value in the easier techniques like using semi-formal methods.

Semi-formal methods adoption

The adoption of semi-formal methods is another story. Still, I have an impression that, e.g., chaos engineering has way lower adoption than we would think it is in the industry, which is also stated in the paper.

Even well-established semi-formal approaches face adoption challenges. For example, deterministic simulation, a key distributed-systems testing technique used successfully at AWS and in projects like FoundationDB, remains unfamiliar to many experienced distributed-systems developers joining AWS.

Similar gaps exist in the adoption of other proven methodologies such as fault-injection testing, property-based testing, and fuzzing. The challenge is educating distributed-systems developers about these testing methods and tools, teaching the art of rigorous thinking. The education gap begins at the academic level, where correctness even basic formal reasoning approaches are rarely taught, making it difficult for graduates from top institutions to adopt these tools. Although formal methods and automated reasoning are crucial for industry applications, they continue to be viewed as niche fields. We anticipate that increased industry adoption of formal methods and automated reasoning will attract more talent to this domain.

This only confirms my observations about adoption, as well as the general interest in the topic across software engineers.

Summary

I guess I won’t continue my journey with getting into the world of formal methods for now. I don’t feel that it's the lacking tool in my engineering toolbelt, and I guess the same applies to the majority of the companies.

It was mentioned many times regarding big tech techniques that you shouldn’t do as <name of the big tech> do because you are not <name of the big tech>, and the same rule applies here, I guess.

According to the fantastic paper mentioned (that you can find here): “92 percent of catastrophic failures in tested distributed systems were triggered by incorrect handling of nonfatal errors”. Authors even conclude that:

While high statement coverage on error handling code might seem difficult to achieve, aiming for higher statement coverage in testing might still be a better strategy than a strategy of applying random fault injections.

source: https://www.researchgate.net/publication/266722925_Simple_Testing_Can_Prevent_Most_Critical_Failures_---_An_Analysis_of_Production_Failures_in_Distributed_Data-intensive_Systems

I will stick to good “old” best industry practices and chaos engineering with AWS FIS. In the end, it proved to be very effective. In case you would like to hear what I have to say about handling failures in the EDAs, you can watch my talk:

Last words

I encourage you to read this paper, though, as it covers many good practices and is a really easy and good read. There are many interesting things in the paper, like, e.g., metastable systems. As extra material for the paper, I would recommend watching a talk by Marc Brooker from last re:invent on the topic.

--

--

No responses yet