AWS & OpenTelemetry

Marcin Sodkiewicz
12 min read · Apr 4, 2023

Find out how to set up the OpenTelemetry Collector on AWS in Gateway mode, and how and why to centralize your OTEL configuration.

Let’s dive into a journey of running OpenTelemetry on AWS.

What is OpenTelemetry?

OpenTelemetry is a standard that was created by merging OpenTracing and OpenCensus. It is now a CNCF project and is supported by the big names in the technology world. Just to name a few of them: AWS, Azure, Datadog, Dynatrace, Elastic, Grafana, Lumigo, New Relic and Splunk.

All of these companies believe in this standard and invest in it, because they can achieve more by working on it together, especially given the number of agents needed for all the languages and platforms. The list of supported languages, libraries and frameworks in OpenTelemetry is quite impressive. There is some controversy about how some of the vendors work with open source, but it looks better and better every year. Believe me, I have been working with it almost from the beginning.

The whole point of this fascinating standard is to provide a common, unified way of collecting telemetry signals, and to decouple your data visualisation from the way the data is collected by your software.

If you still don’t get the point, imagine this: with this standard in place, it’s super easy for any company building an SDK or a library, or writing any code, to integrate with it. Without it, what are the options?

Until now there were a couple of standards, but it usually ended up like in the meme above. Now we have one that has won the interest of both companies and the community! It’s a really great thing for the industry.

Another great use case is building skills in your team. Take a software house with many different customers: with OpenTelemetry you can easily integrate with each customer’s telemetry provider using your in-house skills, rather than becoming an expert in yet another telemetry integration with all its quirks.

How does OpenTelemetry work?

The central concept is the collector, a component that defines how data is received, how it is transformed and where it is sent. It’s made of 3 parts:

  • Receivers: define what data can be sent to the collector and in which format. Examples: OpenTelemetry (OTLP), Jaeger, Zipkin, Prometheus.
  • Processors: define how received data is processed. Examples: batching, memory limiting, sampling, attribute modifications (e.g. assigning an environment attribute, hashing an attribute value or simply renaming an attribute).
  • Exporters: define where collected data is sent. Examples: Jaeger, Prometheus, Zipkin or OTLP/OTLP HTTP, which usually means New Relic, Datadog, Dynatrace, Elastic, Lumigo or any other big player. Typically you set up an OTLP endpoint and an Authorization header for the service.

Diagram from: https://opentelemetry.io/docs/collector/
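
To make the three roles a bit more tangible, here is a toy model in Go. It is not how the collector is implemented: the Span, Processor, Exporter and Pipeline types and the AddEnvironment processor are all hypothetical names invented for this sketch. It only shows the flow of a batch through processors and then out to every exporter.

```go
package main

import "fmt"

// Span is a deliberately tiny stand-in for a telemetry record.
type Span struct {
	Name       string
	Attributes map[string]string
}

// Processor transforms a batch of spans; Exporter ships a batch somewhere.
type Processor func([]Span) []Span
type Exporter func([]Span) error

// Pipeline wires processors and exporters together, similar in spirit to
// one entry under service.pipelines in a collector configuration.
type Pipeline struct {
	Processors []Processor
	Exporters  []Exporter
}

// Consume plays the receiver role: it accepts a batch, runs every processor
// in order and then hands the result to every exporter.
func (p Pipeline) Consume(batch []Span) error {
	for _, process := range p.Processors {
		batch = process(batch)
	}
	for _, export := range p.Exporters {
		if err := export(batch); err != nil {
			return err
		}
	}
	return nil
}

// AddEnvironment is a hypothetical processor that stamps every span with an
// "environment" attribute.
func AddEnvironment(env string) Processor {
	return func(batch []Span) []Span {
		for i := range batch {
			if batch[i].Attributes == nil {
				batch[i].Attributes = map[string]string{}
			}
			batch[i].Attributes["environment"] = env
		}
		return batch
	}
}

func main() {
	var exported []string
	// A hypothetical exporter that just records what it was given.
	record := func(batch []Span) error {
		for _, s := range batch {
			exported = append(exported, s.Name+" env="+s.Attributes["environment"])
		}
		return nil
	}

	p := Pipeline{
		Processors: []Processor{AddEnvironment("prod")},
		Exporters:  []Exporter{record},
	}
	if err := p.Consume([]Span{{Name: "GET /bookings"}}); err != nil {
		fmt.Println("export failed:", err)
		return
	}
	fmt.Println(exported)
}
```

In the real collector these pieces are declared in configuration rather than code, but the order is the same: receive, process, export.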

Standardisation

So as you can see, OpenTelemetry is trying to standardise and unify the way we collect, process and export telemetry. That is just the technical aspect. What about other aspects of standardisation?

Conventions

This is something that is really critical to observability in general — unification. OpenTelemetry has a big role to play here as an entity driving the unification of format and naming conventions in logs, metrics and traces.

I can’t stress enough how important this is, for many reasons. It helps you correlate logs easily and build unified dashboards across all your applications, and your log cluster won’t run into conflicts in the log model. For example, an HTTP status can be stored as a number, as text, or even as an object with a status description and its status code.

Another super important thing is to increase the cardinality of your telemetry data by assigning business-related attributes. But if everyone starts calling the same thing by different names, your valuable telemetry data won’t be useful. For example: adding a trace attribute with key: “userId”, “customerId” & “user” interchangeably will make a mess out of your data. However, this is something that OpenTelemetry won’t help you with and you will have to take care of it yourself in your organisation.
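
One lightweight way to keep attribute names consistent inside an organisation is a small shared package of agreed keys, plus a helper that maps known aliases onto them. The sketch below is only an illustration: the canonical key "user.id" and the alias table are assumptions made up for this example, not names defined by OpenTelemetry or by this article.

```go
package main

import "fmt"

// AttrUserID is the single key the organisation agreed on (hypothetical).
const AttrUserID = "user.id"

// aliases lists spellings that have crept into services over time and the
// canonical key each of them should be replaced with.
var aliases = map[string]string{
	"userId":     AttrUserID,
	"customerId": AttrUserID,
	"user":       AttrUserID,
}

// CanonicalKey returns the agreed key for a known alias and leaves any
// other key untouched.
func CanonicalKey(key string) string {
	if canonical, ok := aliases[key]; ok {
		return canonical
	}
	return key
}

func main() {
	for _, key := range []string{"userId", "customerId", "user", "bookingId"} {
		fmt.Printf("%s -> %s\n", key, CanonicalKey(key))
	}
}
```

Importing the constant instead of typing the string by hand is usually enough to stop new variants from appearing; the alias table helps while older services are being cleaned up.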

Semantic Conventions resources:

If you would like to take a closer look at the conventions picked by OpenTelemetry, see the semantic conventions for:

  • Logs
  • Metrics
  • Traces

How to set up OpenTelemetry on AWS?

Collector setup

When it comes to setting up a collector, we usually have 2 choices: agent or gateway.

Agent
Running a collector instance alongside your application, for example as a sidecar container or locally as a daemon. I do not recommend this approach. Why?

  • Distributed configuration makes any update really hard
  • Distributed configuration can be problematic from a security point of view, especially if you need secure access to your 3rd party data collector. So I guess: always.
  • Sampling, and tail-based sampling in particular, is problematic
  • Upgrades and security patches are harder to roll out

Gateway
Running the collector in a centralized fashion, e.g. behind a load balancer. In my opinion, it’s the perfect choice. Why is that?

  • Centralized configuration that keeps your security keys in a single place
  • Centralized attribute management: e.g. renaming, adding or deleting attributes can be done in a single config and will be consistent across all apps
  • Probabilistic sampling and tail-based sampling make sense
  • Easy upgrades and security patches
  • You can reload the whole data collection setup with a single update, and that’s where the magic happens. We’ll get back to that later.
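
To illustrate why sampling is easier to reason about in one central place, here is a minimal Go sketch of a probabilistic decision derived from the trace ID, so that every span of a trace gets the same verdict wherever the rule is applied. This is an illustration only and not the algorithm used by the collector’s sampling processors; keepTrace and the FNV hash are choices made for this example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// keepTrace returns true for roughly `percent` percent of trace IDs.
// The decision depends only on the trace ID, so all spans belonging to one
// trace are either kept together or dropped together.
func keepTrace(traceID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return h.Sum32()%100 < percent
}

func main() {
	traceID := "4bf92f3577b34da6a3ce929d0e0e4736" // example trace ID

	fmt.Println("keep at 25%:", keepTrace(traceID, 25))
	fmt.Println("keep at 100%:", keepTrace(traceID, 100)) // always true
	fmt.Println("keep at 0%:", keepTrace(traceID, 0))     // always false
}
```

With a gateway, the percentage lives in a single configuration; with agents, every instance would need the same value rolled out to it.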

Collector architecture

A setup that works really well for us is running the collector on Fargate behind an ELB, where it collects all telemetry data.

Our collector configuration is defined in an SSM parameter with a placeholder for the token, which is injected from Secrets Manager via the Secrets section of the ECS task definition. Thanks to this, the token is not spread across multiple services, but sits in a single place. What’s more, each update to our SSM configuration is picked up by AWS EventBridge, which triggers a Lambda function that forces a redeployment of our ECS collector service. This means that within seconds of an SSM configuration update, the new configuration is picked up by our collectors.
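
The decision step of such a redeployment trigger could look roughly like the Go sketch below. It assumes the incoming event has the shape of an EventBridge "Parameter Store Change" notification and that the collector configuration lives in a parameter called OTEL-COLLECTOR-CONFIG, which is a hypothetical name. A real handler would go on to call the ECS UpdateService API with forceNewDeployment enabled; that call is left out to keep the sketch free of SDK dependencies.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parameterStoreEvent models only the fields of the EventBridge event that
// this sketch needs.
type parameterStoreEvent struct {
	Source     string `json:"source"`
	DetailType string `json:"detail-type"`
	Detail     struct {
		Name      string `json:"name"`
		Operation string `json:"operation"`
	} `json:"detail"`
}

// shouldRedeploy reports whether the event is an update of the given
// collector configuration parameter.
func shouldRedeploy(raw []byte, configParam string) (bool, error) {
	var ev parameterStoreEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return false, fmt.Errorf("decoding event: %w", err)
	}
	return ev.Source == "aws.ssm" &&
		ev.DetailType == "Parameter Store Change" &&
		ev.Detail.Name == configParam &&
		ev.Detail.Operation == "Update", nil
}

func main() {
	event := []byte(`{
		"source": "aws.ssm",
		"detail-type": "Parameter Store Change",
		"detail": {"name": "OTEL-COLLECTOR-CONFIG", "operation": "Update"}
	}`)

	redeploy, err := shouldRedeploy(event, "OTEL-COLLECTOR-CONFIG")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// This is the point where a real handler would force a new ECS deployment.
	fmt.Println("redeploy collector:", redeploy)
}
```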

Applications can simply use the load balancer endpoint to report their telemetry. In my case the serverful applications are JVM-based, so we use the OpenTelemetry Agent, and it’s as simple as setting the OTEL_SERVICE_NAME variable for the application name and the OTEL_EXPORTER_OTLP_ENDPOINT variable for the collector endpoint.

Unfortunately, there is one case where the setup requires a little more effort: AWS Lambda.

AWS Lambda Setup

Layer

To integrate a Lambda function with OpenTelemetry, you have to run the collector as a Lambda layer. Keep in mind, though, that the publicly distributed version of the OTEL Lambda collector layer might be outdated, so you have to maintain your own layer distribution, as per the OpenTelemetry Lambda FAQ.

You can build and deploy this layer using AWS CodeBuild. You will need to fork the OpenTelemetry Lambda repository to use your own buildspec definition. Keep in mind that the layer will need to be distributed to multiple regions and potentially multiple accounts, for 2 architectures: x86 and arm.

Here is an example buildspec definition where you can specify comma-separated REGIONS and SHARE_WITH_PRINCIPALS.

version: 0.2

phases:
  install:
    runtime-versions:
      golang: 1.18

  build:
    commands:
      - echo Build started
      - cd collector
      - make package

  post_build:
    commands:
      - >
        if [[ -v REGIONS ]]; then IFS=',' read -ra regions <<< "$REGIONS"; for region in "${regions[@]}"; do
        aws lambda publish-layer-version --layer-name $LAYER_NAME --zip-file fileb://build/collector-extension.zip --compatible-architectures $COMPATIBLE_ARCHITECTURE --region $region --compatible-runtimes nodejs14.x nodejs16.x nodejs18.x java11 python3.8 python3.9 --query 'LayerVersionArn' --output text;
        if [[ -v SHARE_WITH_PRINCIPALS ]]; then IFS=',' read -ra items <<< "$SHARE_WITH_PRINCIPALS"; for item in "${items[@]}"; do
        aws lambda add-layer-version-permission --layer-name $LAYER_NAME --region $region --version-number $(aws lambda list-layer-versions --layer-name $LAYER_NAME --region $region --query 'LayerVersions[0].Version') --statement-id sharingWith$item --principal $item --action lambda:GetLayerVersion;
        done; fi
        done; fi
      - echo Build completed for $(aws lambda list-layer-versions --layer-name $LAYER_NAME --query 'LayerVersions[0]')

Sharing layer version

Sharing a common lambda layer version might be something you want to have in place, so you could add it to your CI/CD pipeline.

Lambdas in your organization can then reference the latest published layer version via resolve:ssm, or pin a specific version of the SSM parameter.

Layers:
  - !Sub
    - '{{resolve:ssm:${LayerParam}}}'
    - LayerParam:
        Fn::ImportValue: OTEL::LambdaLayer::arm64

Sharing layer configuration definition

I mentioned earlier that your goal should be centralised configuration. So how do we share configuration across all lambdas? At the moment, the only way to share configuration between multiple lambdas is through S3.

Again, what I would recommend is to keep the configuration of lambdas centralised — both the ones running inside VPC, and the ones running in public (if you can’t run all the lambdas inside VPC).

A setup that works really well is to keep both the internal and the external lambda configurations in SSM. Then, using EventBridge, we can capture all configuration changes and put them into an S3 bucket that will be used as the configuration data source for our lambda layers.

The S3 configuration location can also be stored in SSM. This will make configuration path updates much easier. This way you can change the location config completely and don’t have to update all the lambdas in your organisation every time you want to change the config.

It’s also important to grant S3 permissions to lambda in order to retrieve S3 collector configuration. I would recommend creating a single IAM Managed Policy with the necessary access that can be used by your Lambda functions.

Here is a sample architecture for setting it up:

The internal configuration is super easy as we just need to redirect the lambda layer collector to report data to our already existing load balancer that we saw in the diagram above. Here is an example of how to do this:

InternalOtelConfiguration:
  Type: AWS::SSM::Parameter
  Properties:
    Description: !Sub ${Environment} Internal lambda configuration
    Name: !Sub OTEL-LAMBDA-COLLECTOR-CONFIG-INTERNAL-${Environment}
    Type: String
    Value: !Sub
      - |
        receivers:
          otlp:
            protocols:
              grpc:
                endpoint: localhost:4317
              http:
                endpoint: localhost:4318
        exporters:
          otlp:
            endpoint: ${CollectorEndpoint}
        service:
          pipelines:
            traces:
              receivers: [otlp]
              exporters: [otlp]
            metrics:
              receivers: [otlp]
              exporters: [otlp]
          telemetry:
            logs:
              level: warn
      - CollectorEndpoint: !FindInMap [ Environments, !Ref Environment, InternalCollectorEndpoint ]

It’s more difficult when it comes to the public endpoint, because we have to protect it with a token. Since we can’t use the same token forever, we need to rotate it. After each rotation, the secret updates our SSM configuration (and, through it, S3) as well as the listener rules in our ELB. It’s important to remember that the previous token may still be in use by our lambda layers for some time after the rotation, and this needs to be handled properly. Otherwise we would have gaps in our telemetry.

ExternalOtelConfiguration:
  DependsOn: PublicOtelEndpointHeader
  Type: AWS::SSM::Parameter
  Properties:
    Description: !Sub ${Environment} External lambda configuration
    Name: !Sub OTEL-LAMBDA-COLLECTOR-CONFIG-EXTERNAL-${Environment}
    Type: String
    Value: !Sub
      - |
        receivers:
          otlp:
            protocols:
              grpc:
                endpoint: localhost:4317
              http:
                endpoint: localhost:4318
        exporters:
          otlp:
            endpoint: ${CollectorEndpoint}
            headers:
              Authorization: ${PublicApiKey}
        service:
          pipelines:
            traces:
              receivers: [otlp]
              exporters: [otlp]
            metrics:
              receivers: [otlp]
              exporters: [otlp]
          telemetry:
            logs:
              level: warn
      - PublicApiKey: !Sub '{{resolve:secretsmanager:${PublicOtelEndpointHeader}:SecretString:AuthToken}}'
        CollectorEndpoint: !FindInMap [ Environments, !Ref Environment, ExternalCollectorEndpoint ]

Lambda layer configuration

Now we can set up our lambda configuration using environment variables:

  • OPENTELEMETRY_COLLECTOR_CONFIG_FILE: centralized collector config path
  • OPENTELEMETRY_EXTENSION_LOG_LEVEL: logging level of the OTEL layer
  • OTEL_SERVICE_NAME: meaningful service name

It’s also important to grant lambda access to configuration S3 bucket.

SomeLambda:
  Type: AWS::Serverless::Function
  Properties:
    Handler: bootstrap
    CodeUri: ../codeDir
    Runtime: provided.al2
    Architectures:
      - arm64
    Environment:
      Variables:
        OPENTELEMETRY_COLLECTOR_CONFIG_FILE:
          !Sub
            - '{{resolve:ssm:${ConfigLocation}}}'
            - ConfigLocation:
                Fn::ImportValue: !Sub OTEL::${Environment}::CollectorConfig::(In/Ex)ternal
        OPENTELEMETRY_EXTENSION_LOG_LEVEL: warn
        OTEL_SERVICE_NAME: my-super-app-name
    Policies:
      - Fn::ImportValue:
          !Sub OBSERVABILITY-${Environment}::CollectorConfig::AccessPolicy

Lambda instrumentation

Now that we have all the infrastructure set up to collect our telemetry and the lambda layer set up, we need to instrument our logic. This will vary depending on the language and framework you’re using. All details can be found here: https://opentelemetry.io/docs/instrumentation/.

Let’s use Go as an example here.

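  • Create trace provider
import (
	"context"
	"log"

	"github.com/pkg/errors"
	"go.opentelemetry.io/contrib/detectors/aws/lambda"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	// Assumption: use the semconv version that matches your SDK release.
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// NewProvider builds a tracer provider that exports spans over OTLP/gRPC to
// the collector running in the Lambda layer and registers it globally.
func NewProvider(ctx context.Context) (*trace.TracerProvider, error) {
	// The layer's collector listens on localhost, so no TLS is needed.
	exporter, err := otlptracegrpc.New(ctx, otlptracegrpc.WithInsecure())
	if err != nil {
		log.Printf("Error creating exporter: %v", err)
		return nil, errors.Wrap(err, "Error creating exporter")
	}

	resourcesMerged, err := buildResources(ctx)
	if err != nil {
		return nil, err
	}

	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(resourcesMerged),
	)

	otel.SetTracerProvider(tp)
	return tp, nil
}

// buildResources merges attributes from the environment (e.g.
// OTEL_SERVICE_NAME), the Lambda detector and a few static values.
func buildResources(ctx context.Context) (*resource.Resource, error) {
	resources, err := resource.New(ctx,
		resource.WithFromEnv(),
		resource.WithDetectors(lambda.NewResourceDetector()),
		resource.WithAttributes(
			// getServiceVersion is the application's own helper (not shown).
			semconv.ServiceVersionKey.String(getServiceVersion()),
			semconv.TelemetrySDKLanguageGo,
		),
	)
	if err != nil {
		log.Printf("Error creating custom resources: %v", err)
		return nil, errors.Wrap(err, "Error creating custom resources")
	}

	return resources, nil
}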
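  • Wrap your Lambda function handler
// Imports assumed by this snippet:
//   "github.com/aws/aws-lambda-go/lambda"
//   "go.opentelemetry.io/contrib/instrumentation/github.com/aws/aws-lambda-go/otellambda"
//   "go.opentelemetry.io/otel/propagation"
//   "go.opentelemetry.io/otel/sdk/trace"

// InstrumentHandler wraps the handler so that each invocation gets a span and
// the provider is flushed at the end of the invocation.
func InstrumentHandler(tp *trace.TracerProvider, handlerFunc interface{}) interface{} {
	return otellambda.InstrumentHandler(handlerFunc,
		otellambda.WithTracerProvider(tp),
		otellambda.WithFlusher(tp),
		otellambda.WithPropagator(propagation.TraceContext{}))
}

...

lambda.Start(InstrumentHandler(tp, handler))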
  • Annotate your spans

You can annotate your logic either with custom span attributes or with existing OTEL integrations. Here is an example of the integration with AWS SDK for Go v2. Apart from that, it’s business as usual.

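// provideAwsConfig loads the default AWS SDK v2 configuration and adds the
// OpenTelemetry middlewares so that AWS API calls are traced.
func provideAwsConfig() aws.Config {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		panic(errors.Wrap(err, "failed to load configuration"))
	}
	// Only instrument the config once we know it was loaded successfully.
	otelaws.AppendMiddlewares(&cfg.APIOptions)
	return cfg
}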

You can also integrate it with popular frameworks like gin. It’s as simple as registering a middleware, as in the snippet below.

router.Use(otelgin.Middleware(serviceName))

It’s highly recommended to annotate your spans with business-related attributes. It can be done like this:

span := trace.SpanFromContext(ctx)
span.SetAttributes(attribute.String("bookingId", booking.ID))

Centralized configuration super powers

We have just gone through the whole process of setting up configuration for our applications in a single place. So right now, with a single CloudFormation/SSM update, we can redirect all our telemetry to another provider or many of them — just to compare their service for example. How can we do this? You can simply specify multiple exporters as in the snippet below:

receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  otlp:
    endpoint: https://otlp.nr-data.net:4317
    headers:
      api-key: ${!YOUR_NR_KEY}
    compression: gzip
  otlphttp/elastic:
    endpoint: https://your-account.apm.eu-west-1.aws.cloud.es.io:443
    headers:
      Authorization: ${!YOUR_ELASTIC_TOKEN}
  otlphttp/lumigo:
    # traces_endpoint takes the full URL; a plain endpoint would get /v1/traces appended
    traces_endpoint: https://ga-otlp.lumigo-tracer-edge.golumigo.com/v1/traces
    headers:
      Authorization: ${!YOUR_LUMIGO_TOKEN}
  otlphttp/dynatrace:
    endpoint: https://{your-environment-id}.live.dynatrace.com/api/v2/otlp
    headers:
      Authorization: ${!YOUR_DYNATRACE_TOKEN}
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp, otlphttp/elastic, otlphttp/lumigo, otlphttp/dynatrace]
    metrics/newrelic:
      receivers: [otlp]
      exporters: [otlp, otlphttp/elastic, otlphttp/dynatrace]

Sounds great, and you can do it in a single update. On a personal account this is nothing special. But when you’re managing hundreds of services and thousands of lambdas, it can blow your mind: it’s one of the easiest tasks you could pick up, and it has a huge impact. The snippet above reports exactly the same telemetry data to multiple vendors, so you get an apples-to-apples comparison of telemetry for the whole organisation without any code changes. Just WOW.

Another great case is when you discover that someone has added data that shouldn’t be in your telemetry. In this case, you can easily delete or hash certain span attributes. It’s just a matter of adding the processor to your definition and then using it in your pipeline:

processors:
  attributes:
    actions:
      - key: user.super-secret-key
        action: delete
      - key: user.email
        action: hash
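
To show what these two actions do to a span, here is a minimal Go sketch that applies them to a plain attribute map. It is an illustration, not the collector’s code: the Action type and Apply function are made up for this example, and SHA-256 is an assumption, since the hash algorithm used by the real attributes processor depends on the collector version.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Action mirrors one entry of the "actions" list in the configuration above.
type Action struct {
	Key string
	Do  string // "delete" or "hash"
}

// Apply runs the actions in order against a span's attributes, modifying the
// map in place. Keys that are not present are skipped.
func Apply(attrs map[string]string, actions []Action) {
	for _, a := range actions {
		value, ok := attrs[a.Key]
		if !ok {
			continue
		}
		switch a.Do {
		case "delete":
			delete(attrs, a.Key)
		case "hash":
			sum := sha256.Sum256([]byte(value))
			attrs[a.Key] = hex.EncodeToString(sum[:])
		}
	}
}

func main() {
	attrs := map[string]string{
		"user.super-secret-key": "s3cr3t",
		"user.email":            "jane@example.com",
		"http.method":           "GET",
	}

	Apply(attrs, []Action{
		{Key: "user.super-secret-key", Do: "delete"},
		{Key: "user.email", Do: "hash"},
	})

	// The secret is gone, the e-mail is replaced by its digest and
	// http.method is untouched.
	fmt.Println(attrs)
}
```

Hashing keeps the attribute usable for grouping and correlation (the same e-mail always produces the same digest) while removing the readable value from your telemetry backend.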

Things to watch out for

OpenTelemetry is still an emerging standard, which means that not everything is GA and a lot of features are experimental. That includes things you might consider basic, like the metrics API for Go. It works fine, but keep in mind that it could change.

Documentation is getting better, yet it’s spread across many articles, release notes and READMEs, and sometimes the only option is to dive into the code to find your answer. So please, be patient. That’s the beauty of open source: we can do it.

When I was using OpenTelemetry, there were a few times when things started to break after upgrading the collector version. That’s the “beauty” of being an early adopter, but it doesn’t happen that often, so don’t worry too much about it. Still, it’s usually better to pin a specific version (as usual) rather than use “latest” in production.

Remember that telemetry is not the same as auditing. So please do not base your business audits on span attributes, for example. It’s not a good choice because, firstly, your traces are being sampled and, secondly, audit is audit, not telemetry. Audit data has different business and legal requirements, including retention time, and it’s not something you can choose whether or not to retain.

Summary

Telemetry is a really important topic and it’s super important for all types of organisations. I hope you enjoyed this article and I would love to hear your stories and opinions about OpenTelemetry.

If you want to read more about the whole concept, I can definitely recommend the book “Distributed Tracing in Practice”.

There is also a new book on the topic (still in MEAP) by Michael Hausenblas.
