Legacy software & Expanding Software Capabilities with Serverless Event-Driven Architectures

10 min read · Mar 2, 2025

How to build modern systems in symbiosis with digital fossils

Context

The company I work for is growing fast. Last year, we broke record after record, and we have an appetite for more. Even though we have been in the cloud for a long time, our success depends on critical, specialized, domain-specific software that is nowhere near greenfield. It's the thing we all "love" as engineers: legacy systems and, what's more, third-party legacy software.

In this article, I will describe some ways to tame legacy systems and build event-driven software that works in harmony with them.

Legacy Software

First, let me clarify what kind of legacy software I will discuss here. By legacy I don't mean a project that another team built some time ago and whose codebase you fully control. You might find better alternatives in that case, although some of these concepts might still interest you.

I will focus on third-party legacy software that we can't modify, and partially on a scenario in which you can't get events about internal changes because of technical limitations or cost-related contract difficulties.

In the case of legacy systems, a couple of things might be blocking us. To be ready for the growth of our business, we need to focus on some key areas presented in the graphic below, like R&D, profitability and flexibility. I believe that achieving them with this type of software will not be easy (or even possible).

Okay… is legacy really that bad?

Legacy software is not bad as long as it serves its purpose.

Usually, this software was built to solve a complex business problem and does it well. Someone put effort into building it, your company paid for it, and it just works. If you are using this software and want to extend it, it means it is still doing its job well, or you were forced to do so (for example, due to scalability, cost, or problematic maintenance).

What might be a problem here?
Lack of flexibility. Software and business reality evolve, and in many situations, we want to be able to adjust functionality and scale as needed. If our business grows, our infrastructure must follow.

You could argue that our vendor can scale that software for us. However, due to technical limitations, this is not always the case. In many cases, such software is not ready for modern-day requirements, and in other cases, the license fees can eat up your margin.

My biggest issue with legacy software is related to the most dangerous phrase in business: "We've Always Done It This Way". Such software anchors a way of thinking and can tremendously impact an organization. If you want to disrupt, you must rethink some parts of running your business.

In the case of a deeply rooted legacy system, you could think of the analogy of the five monkeys experiment. If you don't know what I am referring to, take a look here: https://www.throwcase.com/2014/12/21/that-five-monkeys-and-a-banana-story-is-rubbish/. Still… it's a very colorful and easy-to-understand anecdote.

Data liberation

If we deal with a legacy system that we can't modify, we have to integrate with it somehow. Calling it whenever we need data from it might not be an option: if we want to operate on a much bigger scale than in the past, that system might not keep up. Also, spikes in traffic in our modern system might cause disruptions in the core system, which is more fragile than systems built for the modern cloud era.

Direct communication, which will eventually couple the systems and spread connections between the old and new systems in many places, might not be the best option — especially if our legacy system covers multiple domains.

I am a big fan of event-driven architecture, and in this scenario, I would like to stick to my guns. We can run a data liberation process that will load changes from the legacy system and publish them into an event stream. That way, many different systems can consume those events, opening new possibilities for our organization.

It might not be as easy as drawing a block on the schema below. We have to be sure that we are okay with eventual consistency, which we are doomed to anyway, since we can't plug into the legacy system's internal processes. We will always be eventually consistent; even if your third party emits events, we will still be eventually consistent. What we can do is minimize the lag, and it is super important to agree on the correct SLAs for it.

Implementing such a data liberation process might take many different forms, depending on the acceptable processing lag, the character of the system, and many other aspects. It might look like the process on the slide above.
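To make this more concrete, here is a minimal sketch of one polling-based liberation step, assuming a hypothetical legacy REST endpoint and a Kinesis stream named liberated-events (all names, and the use of Kinesis at all, are my assumptions, not a prescription):

```python
import hashlib
import json

import boto3
import requests  # assumed HTTP client for the hypothetical legacy API

kinesis = boto3.client("kinesis")

LEGACY_URL = "https://legacy.example.internal/api/prices"  # hypothetical endpoint
STREAM_NAME = "liberated-events"                           # hypothetical stream


def poll_and_publish(business_key: str, last_known_hash: str | None) -> str | None:
    """Load the current state for one business key and publish it only if it changed."""
    response = requests.get(f"{LEGACY_URL}/{business_key}", timeout=10)
    response.raise_for_status()
    payload = response.json()

    # Hash the payload so unchanged data is not re-published downstream.
    payload_hash = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    if payload_hash == last_known_hash:
        return None  # nothing changed since the previous reload

    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps({"key": business_key, "state": payload}).encode(),
        # Partitioning by business key keeps events for one entity in order.
        PartitionKey=business_key,
    )
    return payload_hash  # caller persists this for the next comparison
```

Where the previous hash is stored (a DynamoDB item, the job state, etc.) and how business keys are enumerated are deliberately left out of this sketch.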

Partitioning tasks by business key (into shards), so that work for the same partition is never processed concurrently, is crucial to prevent conflicts in a data liberation process. It also makes persistence easier, as we know that there are no conflicts across consumers.
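Purely as an illustration of that partitioning, a deterministic hash-based shard assignment is usually enough (SHARD_COUNT and shard_for are made-up names):

```python
import hashlib

SHARD_COUNT = 8  # illustrative; size it to what the source system can handle


def shard_for(business_key: str) -> int:
    """Map a business key to a shard deterministically, so all work for one
    key always lands in the same partition and is never processed in parallel."""
    digest = hashlib.sha256(business_key.encode()).hexdigest()
    return int(digest, 16) % SHARD_COUNT
```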

Another critical aspect is throughput control, as we don't want to overwhelm our source system. I recommend not spreading data loading across too many different processes (I made that mistake in the past, and it caused issues in the source system). Otherwise, it becomes hard to control throughput to the external system, to make it behave predictably, and to prioritize certain reloads. This requires synchronizing reloads across components by implementing a throttling mechanism backed by a distributed data store.
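One possible sketch of such a throttling mechanism is a per-minute counter maintained with a DynamoDB conditional update; the table name and the limit below are assumptions:

```python
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

THROTTLE_TABLE = "reload-throttle"   # hypothetical table, partition key "pk"
MAX_RELOADS_PER_MINUTE = 60          # illustrative limit for the legacy system


def try_acquire_reload_slot() -> bool:
    """Atomically count reloads in the current one-minute window and refuse
    new ones when the window is full, so the source system is never swamped."""
    window = int(time.time() // 60)
    try:
        dynamodb.update_item(
            TableName=THROTTLE_TABLE,
            Key={"pk": {"S": f"window#{window}"}},
            UpdateExpression="ADD reload_count :one",
            ConditionExpression="attribute_not_exists(reload_count) OR reload_count < :max",
            ExpressionAttributeValues={
                ":one": {"N": "1"},
                ":max": {"N": str(MAX_RELOADS_PER_MINUTE)},
            },
        )
        return True
    except ClientError as error:
        if error.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # window exhausted; back off and retry later
        raise
```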

Still, there are a ton of things that we have to take care of in such a process, like:

  • Reload optimization (e.g., the more likely it is that some data will change, the more frequently reloads should be triggered)
  • Checking the hash of previously loaded data to avoid overriding it mindlessly (see the sketch below). Remember that conditional writes are charged regardless of the condition check result.
  • Throttling data reloads at the business key level
  • Plugging into business processes to detect data changes
  • Preparing for versioning (and making it part of the hash)

and many more aspects. On top of that, we have to make sure that our data processing is fail-safe. You can watch my talk on the topic that I linked below:
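Coming back to the hash check and versioning bullets from the list above, a minimal sketch could look like this (the table and attribute names are illustrative). Note the cheap read before the write: since even a failed conditional write is billed, comparing hashes in code first avoids paying for pointless writes:

```python
import hashlib
import json

import boto3

dynamodb = boto3.client("dynamodb")

TABLE = "liberated-entities"  # hypothetical table, partition key "pk"
DOC_VERSION = "v3"            # part of the hash, so bumping it forces reloads


def persist_if_changed(business_key: str, payload: dict) -> bool:
    """Write the loaded data only when its hash differs from what is stored."""
    body = json.dumps(payload, sort_keys=True)
    new_hash = hashlib.sha256(f"{DOC_VERSION}:{body}".encode()).hexdigest()

    current = dynamodb.get_item(
        TableName=TABLE,
        Key={"pk": {"S": business_key}},
        ProjectionExpression="content_hash",
    )
    stored_hash = current.get("Item", {}).get("content_hash", {}).get("S")
    if stored_hash == new_hash:
        return False  # identical data already persisted; skip the write

    # Safe without a condition here only because work is partitioned per
    # business key, so no other consumer writes the same item concurrently.
    dynamodb.put_item(
        TableName=TABLE,
        Item={
            "pk": {"S": business_key},
            "content_hash": {"S": new_hash},
            "body": {"S": body},
        },
    )
    return True
```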

Extend

Now that the data from our source systems has been “liberated”, we can finally extend our system's capabilities.

Data loaded from the legacy system might be only one of the inputs into our system. We can have other data sources affecting our system, and they might be push-based.

We have to pick the approach that will suit our needs for those data sources and have some decisions to make.

  • We can store events in separate tables or in a single table.
  • We can store all events to keep a full history, or store only the last state and keep track of past events in an audit. (don’t get triggered, more about it later)

What will the data access model be? Do we want a CQRS approach, doing the heavy lifting on the backend and preparing dedicated views for our consumers, or do we want to process all events from the stream and build a read model during a request? It is also possible to have both options, depending on the process requirements and preferences.

I like to do the heavy lifting on the backend and process, compose, and handle all of those events so that they effectively end up as a tailored read solution. In that case, having a document with a fully composed aggregate is very useful. It’s also a good source for streaming aggregate changes to the downstream systems (e.g., via SNS or EventBridge). We have to remember DynamoDB’s item size limit, though.
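As a sketch of that composition step, assuming a hypothetical flight-aggregates table and a naive merge as the fold logic, it could look like this; the 400 KB figure is DynamoDB's per-item limit:

```python
import json

import boto3

DYNAMODB_ITEM_LIMIT = 400 * 1024  # DynamoDB caps a single item at 400 KB

table = boto3.resource("dynamodb").Table("flight-aggregates")  # hypothetical name


def persist_aggregate(aggregate_id: str, events: list[dict]) -> None:
    """Fold all known events into one aggregate document and store it as a
    single item, so the read side and downstream streams get a composed view."""
    aggregate: dict = {"id": aggregate_id, "version": 0}
    for event in sorted(events, key=lambda e: e["sequence"]):
        aggregate.update(event["payload"])       # naive merge, for illustration
        aggregate["version"] = event["sequence"]

    if len(json.dumps(aggregate, default=str).encode()) > DYNAMODB_ITEM_LIMIT:
        # Beyond this point the aggregate needs splitting or offloading to S3.
        raise ValueError(f"aggregate {aggregate_id} exceeds the DynamoDB item limit")

    # Note: numeric values must be Decimals for boto3's resource API.
    table.put_item(Item=aggregate)
```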

After persisting the aggregate in that document, we can consume a stream of its updates and put them into the tailored model and database that best fit the read side of our system.

Of course, if we don’t want an aggregate doc with this approach, we can also build views from the stored events every time a new event occurs.
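For the aggregate-document flow described above, a sketch of the Lambda attached to the aggregate table's DynamoDB stream might look like the following; the table name, the event source, and the shape of the read view are all assumptions:

```python
import json

import boto3
from boto3.dynamodb.types import TypeDeserializer

events_client = boto3.client("events")
read_table = boto3.resource("dynamodb").Table("flight-read-model")  # hypothetical
deserializer = TypeDeserializer()


def handler(event, context):
    """Triggered by the aggregate table's DynamoDB stream: project each change
    into the tailored read model and fan it out via EventBridge."""
    for record in event["Records"]:
        if "NewImage" not in record["dynamodb"]:
            continue  # e.g. deletions are not projected here
        aggregate = {
            key: deserializer.deserialize(value)
            for key, value in record["dynamodb"]["NewImage"].items()
        }

        # Tailored view: keep only what the read side actually needs.
        read_table.put_item(
            Item={"id": aggregate["id"], "summary": aggregate.get("summary", "")}
        )

        # Fan out the change so other systems can build their own views.
        events_client.put_events(
            Entries=[{
                "Source": "aggregates.flights",  # illustrative source name
                "DetailType": "AggregateChanged",
                "Detail": json.dumps(aggregate, default=str),
                "EventBusName": "default",
            }]
        )
```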

Event Types

I mentioned before that “We can store all events to have a full history of events or store only the last state and keep track of past events in the audit,” and it will depend on the type of processed events. I like to think about events in two categories:

  • Event Carried State Transfer: events that carry information about the changed state of some entity, without the reason why. For example: “the price model for flight X changed on route Y”. In such cases, where we are only informed about the change, you might need to store just the latest state; if you also need to keep track of past events, an audit might be the most cost-effective place for them.
  • Delta events: facts that occurred in some system. The event is of the form “something happened”, with information about what happened. In those situations, I always recommend storing all of the events and materializing the state from them.
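To make the distinction concrete, here are two illustrative event shapes based on the flight example above; the names and fields are mine, not a standard:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PriceModelChanged:
    """Event-carried state transfer: the full current state, without the
    reason why. Usually only the newest version matters; older ones can go
    to an audit store."""
    flight: str
    route: str
    price_model: dict
    version: int


@dataclass
class SeatReserved:
    """Delta event: a fact that happened. All of these are kept, and the
    state is materialized by replaying them."""
    reservation_id: str
    flight: str
    seat: str
    occurred_at: datetime
```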

Delayed & order of events

You have to remember to handle events in order inside your system. In some situations, it can be critical; in others, it won't be.

Some events might arrive not only out of order but also delayed, e.g., due to a network issue. The big question is, “What does it mean for an event to be delayed?” In my opinion, it's not a technical requirement but a business decision. Hear me out: a late event is a CONSUMER concept. Why do I think so?

You might have multiple consumers, and they might consume those events for different purposes. What we have to do is trade off latency against determinism.

For example, order status information during checkout might change every x milliseconds. If an event is delayed, the status may be incorrect for some time, but it will become consistent as soon as the delayed event is received.

Yet, from the perspective of a system that consumes events for nightly reporting, where we process events in batches long after they occurred, a delayed event won't affect our processing or consistency.

How can we handle out-of-order events here?

Drop
For event-carried state transfer events, we can drop an incoming event if a newer version has already been persisted. This won't be an option for delta events.
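A minimal sketch of that drop logic, using a DynamoDB conditional write keyed on the event's version (the table and attribute names are illustrative):

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

TABLE = "price-models"  # hypothetical table holding the latest state per entity


def apply_state_event(entity_id: str, version: int, state: str) -> bool:
    """Persist an event-carried state transfer event only if it is newer than
    what is already stored; otherwise drop it silently."""
    try:
        dynamodb.put_item(
            TableName=TABLE,
            Item={
                "pk": {"S": entity_id},
                "version": {"N": str(version)},
                "state": {"S": state},
            },
            # "#v" aliases the attribute name to avoid reserved-word clashes.
            ConditionExpression="attribute_not_exists(#v) OR #v < :incoming",
            ExpressionAttributeNames={"#v": "version"},
            ExpressionAttributeValues={":incoming": {"N": str(version)}},
        )
        return True
    except ClientError as error:
        if error.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # an equal or newer version is already stored: drop
        raise
```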

Event Sourcing
We can store events and materialize the state based on the whole event history across the same stream. This is definitely my go-to option for delta events. It should be yours as well. It gives you auditability, safety & ease of reprocessing.

Since we rebuild the state after every new event, based on all known facts, we can reconstruct the correct state even if the events arrived in the wrong order.
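As a sketch, assuming the delta events live in a DynamoDB table whose sort key is the business sequence number (my assumption), materializing the state could look like this; pagination and snapshots are omitted for brevity:

```python
import boto3
from boto3.dynamodb.conditions import Key

events_table = boto3.resource("dynamodb").Table("delta-events")  # hypothetical


def materialize(stream_id: str) -> dict:
    """Rebuild state from the full event history of one stream. Events are
    read in business order (the sort key), so late arrivals cannot corrupt
    the result: the next rebuild simply includes them in the right place."""
    result = events_table.query(
        KeyConditionExpression=Key("pk").eq(stream_id),
        ScanIndexForward=True,  # ascending by sort key, i.e. sequence number
    )

    state: dict = {"id": stream_id, "reserved_seats": set()}
    for event in result["Items"]:
        if event["type"] == "SeatReserved":
            state["reserved_seats"].add(event["seat"])
        elif event["type"] == "SeatReleased":
            state["reserved_seats"].discard(event["seat"])
    return state
```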

Reprocessing

We might want to build different views from the facts (events) we have gathered. In that situation, we have to retrigger processing for each aggregate. Since we are storing events in an append-only log and/or we have the latest entity snapshots coming from event-carried state events, we need a mechanism to reprocess all those events.

This is particularly easy with our legacy polling mechanism because we can update the document version: changing the hash enforces change detection during the reload process, which triggers the reprocessing of all our aggregates.
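Building on the persist_if_changed sketch from the data liberation section, forcing such a reprocessing run can be as small as a version bump:

```python
# Bumping the document version changes every hash computed by
# persist_if_changed(), so the next reload treats all previously loaded data
# as "changed", re-emits it, and thereby reprocesses every aggregate.
DOC_VERSION = "v4"  # was "v3"; the value itself is illustrative
```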

Remember that this might be an expensive and time-consuming process in some cases.

Summary

This approach works well for me after running it for a few years on several projects.

This architecture is easy to extend & reason about. Served data is fit for purpose, with easy support for caching, which can improve the cost-effectiveness and performance of the solution. It’s fully auditable, and since all views are built based on facts, it gives me a feeling of confidence that my system is rock solid. Apart from that, it offers freedom of choice regarding the evolution of my architecture.

I wish all of us could get a stream of events from our legacy systems, as that would remove the most troublesome part of this architecture. Unfortunately, in many situations this is not possible. What we can do is build systems that always offer downstream systems push-based integration instead of pull-based integration.
