AWS & OpenTelemetry: Collector architecture at scale
This article covers different approaches to collecting your telemetry with OpenTelemetry, from the simplest setups to more complex solutions.
Intro
This article is based on part of my talk about OpenTelemetry, which is why some of the visuals used are simply slides. Below, you can get a sneak peek of what we are going to cover.
Our goal is to move data from the apps collecting telemetry to the system that makes it useful (our vendor) in a secure, manageable, and flexible way. So now, let's take a look at the different options that we have.
Direct Integration
We'll start with the simplest solution: the one without any collector at all. In this setup, every application sends its telemetry directly to the vendor, skipping the OTel Collector entirely.
Vendor: by that term I mean a solution that stores and visualizes our telemetry. It might be Grafana, Elasticsearch, New Relic, Datadog, Honeycomb, or any other similar solution.
This is a super easy setup. We just have to configure our app to send telemetry directly to our vendor's OTLP endpoint.
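To make this concrete, here is a rough sketch of the app-side configuration using the standard OTel SDK environment variables; the endpoint URL, header name, and service name below are placeholders for whatever your vendor documents:

```yaml
# Sketch: standard OTel SDK environment variables pointing straight at a vendor.
# The endpoint, header name, and API key are placeholders; check your vendor's docs.
environment:
  OTEL_SERVICE_NAME: "checkout-service"
  OTEL_EXPORTER_OTLP_ENDPOINT: "https://otlp.example-vendor.com"
  OTEL_EXPORTER_OTLP_HEADERS: "x-api-key=<vendor-api-key>"   # credentials live in every single app
  OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf"
```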
With this approach, all your applications are tightly coupled to the vendor. In case of a vendor change, you must redeploy all of the company's apps. You might say this is a corner case, and you are right. The real issue is that ANY significant change related to how telemetry is collected means you must redeploy ALL of your apps.
You cannot modify how the collected data is processed (say, to adjust attributes), batching is suboptimal, and only head-based sampling is available. On top of that, maintaining such a solution can turn into an absolute nightmare.
The biggest issue is that the security credentials you have to send to your vendor in a header end up scattered across all of your applications.
Agent
Another approach is sending data through a collector running next to our application: a daemon installed on an EC2 instance, a sidecar container, or a Lambda layer with an OTel Collector.
Thanks to running a collector in "agent" mode, we can improve our processing by configuring the collector and embracing OTel Collector goodies like batching, retries, modifying collected data, and much more, all through configuration.
With the agent setup, we must maintain collectors running next to all of our apps. If you have more than a few services, this is not something you want to do: upgrading collectors and applying security patches is required for every single one of them.
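As an illustration, a minimal agent-mode collector configuration could look roughly like this; the vendor endpoint, header name, and environment variable are placeholders:

```yaml
# Sketch of an agent-mode collector config: receive OTLP locally,
# batch, retry on failure, and forward to the vendor.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlp:
    endpoint: https://otlp.example-vendor.com:4317   # placeholder vendor endpoint
    headers:
      x-api-key: ${env:VENDOR_API_KEY}               # placeholder credential variable
    retry_on_failure:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```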
Gateway mode
This means we can run the collector centrally, e.g., behind a load balancer. Let's first look at the architecture, which we will cover more deeply later in the article. It adds new infrastructure to "maintain" and something running inside your account, but in reality there is very little to do with such a setup apart from picking the right size for the collector nodes and setting up auto scaling.
Gateway: Collection
In this setup, we expose the solution to all services through a single endpoint (for example, a load balancer). Such a solution is highly available and scalable with a compute cluster behind it (e.g., ECS). Since everything goes through that one endpoint, we can replace anything behind it and evolve our architecture in the future.
Gateway: Centralized configuration
What is great about this setup is that our configuration can be centralized.
Collector
Upgrading the OTel Collector with this setup is as simple as redeploying our IaC with the latest collector version. Then it's an upgrade for all systems at once.
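For example, if the gateway runs on ECS defined in CloudFormation, the collector version can be a single parameter, so an upgrade is just a tag bump and a stack deployment. The snippet below is trimmed to the relevant parts, and the resource names and version are hypothetical:

```yaml
# Sketch (CloudFormation): the collector version is one parameter, trimmed of
# roles, logging, and networking details for brevity.
Parameters:
  CollectorVersion:
    Type: String
    Default: "0.102.0"   # hypothetical version, pick the release you target

Resources:
  GatewayTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: otel-gateway
      Cpu: "512"
      Memory: "1024"
      NetworkMode: awsvpc
      RequiresCompatibilities: [FARGATE]
      ContainerDefinitions:
        - Name: otel-collector
          Image: !Sub "otel/opentelemetry-collector-contrib:${CollectorVersion}"
          PortMappings:
            - ContainerPort: 4317
            - ContainerPort: 4318
```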
Centralized configuration
The setup that works really well for me is based on SSM Parameter Store and EventBridge. A Lambda function detects any change to the SSM parameter and forces a redeployment on ECS. That way, keeping the configuration and the cluster running it in sync is super easy: changes are made centrally and deployed in seconds.
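A sketch of the glue: an EventBridge rule that fires on changes to a (hypothetical) collector-config parameter and invokes a Lambda that forces a new ECS deployment. The Lambda itself and its invoke permission are omitted:

```yaml
# Sketch (CloudFormation): SSM Parameter Store Change events trigger a redeploy.
# Parameter name and function name are hypothetical.
Resources:
  CollectorConfigChangedRule:
    Type: AWS::Events::Rule
    Properties:
      State: ENABLED
      EventPattern:
        source: ["aws.ssm"]
        detail-type: ["Parameter Store Change"]
        detail:
          name: ["/observability/collector/config"]
          operation: ["Create", "Update"]
      Targets:
        - Id: redeploy-collectors
          # Hypothetical Lambda calling ecs:UpdateService with forceNewDeployment=true
          Arn: !GetAtt RedeployCollectorsFunction.Arn
```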
Secure configuration
Configuration can be easily injected into the collector nodes. What is more, it can embed values from environment variables. This is great, as we can inject the vendor's credentials through container secrets. Thanks to that, we can store our credentials centrally and securely.
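Roughly, it can look like this: the collector config refers to an environment variable, and the ECS container definition injects that variable from Secrets Manager. The secret ARN, header name, and variable name are placeholders:

```yaml
# Sketch: collector config resolves the credential from the container environment.
exporters:
  otlp:
    endpoint: https://otlp.example-vendor.com:4317
    headers:
      x-api-key: ${env:VENDOR_API_KEY}

# ...and in the ECS container definition (CloudFormation), the variable is
# populated from Secrets Manager, so no key ever lands in the config file:
#   Secrets:
#     - Name: VENDOR_API_KEY
#       ValueFrom: arn:aws:secretsmanager:eu-west-1:111111111111:secret:o11y/vendor-api-key
```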
Gateway: Sending data to vendor
Since data is collected centrally, we can do many things with it. We can batch, compress, and send data much more efficiently.
We can also remove, rename, or add attributes, which works wonders. For example, you can lower the cardinality of metrics by centrally removing an attribute that brings little to no value, without touching any app. This is especially handy for attributes added by default through auto-instrumentation.
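As a sketch, one way to do this is the transform processor with an OTTL statement that drops the offending attribute; the attribute name below is just an example of a noisy one:

```yaml
# Sketch: drop a high-cardinality attribute centrally, for metrics and spans alike.
processors:
  transform/drop-noisy-attrs:
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "http.user_agent")   # hypothetical noisy attribute
    trace_statements:
      - context: span
        statements:
          - delete_key(attributes, "http.user_agent")
# Add "transform/drop-noisy-attrs" to the relevant pipelines in the service section.
```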
Apart from that, we can replace/change/extend the list of vendors to which we are sending data, which gives us the option to send THE SAME DATA to different vendors at the same time. With such a powerful feature, we can compare different solutions based on precisely the same data, so it’s an apples-to-apples comparison.
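In collector terms, that is simply a pipeline with more than one exporter. A sketch, with placeholder vendor endpoints and credentials:

```yaml
# Sketch: fan the same pipeline out to two vendors for an apples-to-apples comparison.
exporters:
  otlp/vendor-a:
    endpoint: https://otlp.vendor-a.example.com:4317
    headers:
      x-api-key: ${env:VENDOR_A_API_KEY}
  otlphttp/vendor-b:
    endpoint: https://otlp.vendor-b.example.com
    headers:
      x-api-key: ${env:VENDOR_B_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/vendor-a, otlphttp/vendor-b]
```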
Gateway: Serverless telemetry collection
Our configuration can be distributed and used by other components in our company. We can use a pattern similar to the one we discussed for the cluster collector configuration, but store it in S3. We can reference that configuration directly in a Lambda function through an environment variable; the layer will pick it up and send telemetry to the collector inside the account.
We can also skip the layer and send data directly, referencing our endpoint through environment variables.
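A hedged example of both options for a single function; the layer ARN, bucket, and endpoint are placeholders, and the exact environment variable names and S3 URI format should be checked against the ADOT Lambda layer documentation for the version you use:

```yaml
# Sketch (SAM), trimmed to the relevant parts: a Lambda with an OTel/ADOT layer
# pulling its collector config from S3. All names below are placeholders.
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    Layers:
      - arn:aws:lambda:<region>:<layer-account>:layer:aws-otel-python-amd64-<version>   # placeholder layer ARN
    Environment:
      Variables:
        AWS_LAMBDA_EXEC_WRAPPER: /opt/otel-instrument
        OPENTELEMETRY_COLLECTOR_CONFIG_FILE: s3://my-observability-bucket.s3.<region>.amazonaws.com/lambda-collector.yaml
        # Without the layer, a plain SDK can point straight at the gateway instead:
        # OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-gateway.internal:4318
```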
Gateway: Serverless telemetry collection — non-VPC lambdas
We want a common strategy for all of our solutions, so what about non-VPC Lambdas? They can't communicate with our internal load balancer due to networking.
To cover that scenario, we must either put all our Lambdas into the VPC OR extend the solution with an additional public load balancer (blue lines on the diagram below).
We can secure it with WAF and a rotated security token that is sent to our collectors alongside the telemetry data. That security token can be part of the configuration and is updated on each rotation. Thanks to that, token management is centralized, and we don't need to bother anyone with it.
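As a sketch, the token can simply ride along as an extra header on the exporter, with WAF matching on that header at the public load balancer; the header name, endpoint, and environment variable below are hypothetical:

```yaml
# Sketch: the public-facing path adds a rotated token header for WAF to validate.
# The token value comes from the centrally distributed configuration.
exporters:
  otlphttp:
    endpoint: https://otel-gateway.public.example.com
    headers:
      x-telemetry-token: ${env:TELEMETRY_INGEST_TOKEN}
```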
Multi-Layer Gateway Mode
In large-scale systems, sampling is super important. It affects our observability bill, and it is also essential so that we are not swamped with useless data and can cut through the noise.
To get more flexibility, we have to switch to tail-based sampling, where we can decide what we actually want to sample. We can postpone the sampling decision until "the end of the trace". In other words, we can keep only traces with errors, or sample them at a different rate than successful ones. Sounds great! Where is the catch?
This type of sampling is more expensive to run. Why? It's stateful, so it uses more resources on the collector, and we need a new "layer" of collectors, fed by the loadbalancing exporter, to collect that data.
If you don’t understand why we need that extra layer, you can read in detail about that in another article I wrote — here:
So after applying that change, our setup could look like this:
We will have to introduce a new configuration for the first layer of loadbalancing collectors, which reroutes requests to the proper nodes. The rest of the configuration remains the same.
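A rough sketch of the two layers (hostnames, endpoints, and sampling policies are placeholders): the first layer uses the loadbalancing exporter keyed by trace ID, and the second layer runs the tail_sampling processor:

```yaml
# Sketch, layer 1: stateless collectors route whole traces to a consistent
# layer-2 node via the loadbalancing exporter (hostname is a placeholder).
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-sampling-layer.internal
        port: 4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
---
# Sketch, layer 2: stateful collectors make the tail-based decision, e.g. keep all
# errors and 10% of everything else. Receivers/exporters as in the earlier gateway config.
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp]
```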
Multi-account (multilayer?) architecture
What if we have multiple accounts with solutions based on OpenTelemetry and want all of the benefits of the previous architecture (centralized configuration, centralized setup for low maintenance, security segregation, and so on), but across many accounts?
We don’t want this ugly thing below, right? It will cause us a maintenance headache again.
We want the same centralized architecture, ready to support multiple accounts. We can move our solution to a dedicated observability account, communicate with it through Transit Gateway, and protect it with a Network Firewall for extra security. That way, we keep the security and flexibility of this architecture.
Multi-Account setup: Configuration distribution
All configuration related to the collectors and the vendor credentials should remain in the observability account. This further improves security and makes updates easier for the whole organization.
In the previous approaches, we shared configs internally inside the same account, which made them easy to control. Here, they are stored in a separate account, yet some configuration should, or has to, be shared with all of the applications in the organization, such as:
- Collector endpoints
- OTel environment variables: propagators, span attribute limits, compression type, metrics temporality, resource providers to exclude
- The Lambda layer version that should be used, if needed
- The location of the collector configuration for VPC and non-VPC resources (e.g., for Lambda layers)
How can we do it at scale? Our solution is based on a Secrets Manager secret shared with the accounts within our organization.
Then, if you want to reference values, you can refer to them through dynamic references in your stacks. One caveat here: if you don't pin the secret versionId, your value won't update, as CloudFormation won't see a change. That can be painful. Because of that, we have stack sets that deploy SSM parameters to all target accounts whenever our exported values change. Thanks to that, all changes are easy to reference, and there are no synchronization issues.
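For illustration, this is what consuming those shared values can look like in an application stack; the secret ARN, JSON key, and parameter name are hypothetical:

```yaml
# Sketch: environment variables in a consuming stack, trimmed to the relevant part.
Environment:
  Variables:
    # Dynamic reference to the shared secret (a full ARN is needed cross-account).
    # Without pinning a version id, CloudFormation sees no change and won't update the value.
    OTEL_EXPORTER_OTLP_ENDPOINT: '{{resolve:secretsmanager:arn:aws:secretsmanager:eu-west-1:111111111111:secret:o11y/shared-config:SecretString:collector_endpoint}}'
    # Alternative used here: SSM parameters fanned out to every account by stack sets
    # and refreshed whenever the exported values change.
    # OTEL_EXPORTER_OTLP_ENDPOINT: '{{resolve:ssm:/o11y/collector-endpoint}}'
```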
Summary
I hope you have learned a thing or two from reading this article, or at least have a spark of new ideas for setting up your OpenTelemetry collectors at scale. If you are doing something different (and hopefully better), please write to me on LinkedIn. I would love to hear about your experiences and o11y journey!
Since it's re:Invent season, if I could have one wish, it would be a managed service that makes this article obsolete. Setting up OTel collectors on AWS should be easy and unified across the industry.
Thank you for your time, and I would like to hear your thoughts on the topic! Let’s get in touch: