Increase your DevOps Maturity & Modernize Your Monitoring Capabilities using AWS X-Ray
DevOps Capabilities: Observability & Monitoring
DevOps practices and tools are founded on the three principles of DevOps: Flow, Feedback, and Continuous Learning. Feedback and continuous learning are the less popular of these principles, but they are the most critical for achieving high quality and resilience. Traditional monitoring achieves feedback 'after the fact', simply gathering and reporting data on running artifacts, usually limited to infrastructure. Observability is a more modern DevOps practice that takes the basic principles of monitoring and extends them to the application. As opposed to focusing solely on production infrastructure and reporting the typical infrastructure metrics of CPU, memory, and disk space, modern observability also includes monitoring of application performance. This blog post focuses on achieving full Application Performance Monitoring (APM) capabilities using AWS X-Ray.
The Challenge AWS X-Ray Solves
When we develop applications, there is often a certain opacity that results from low observability, or more simply put, from de-prioritizing non-functional requirements. It is common in software development to focus on getting a feature out the door rather than building in observability and sustainability from the start.
What if we could take the first step on that journey with minimal effort? The value of AWS X-Ray is just that – a low barrier to entry tool to begin the process of making application performance actionable.
Moving to agile software development methodologies increases the velocity of software development teams, which may also increase production release cadence. With this increased cadence, modern observability into application performance is critical to ensure performance-related Service Level Objectives (SLOs) and Service Level Indicators (SLIs) are maintained. This is validated through research conducted by DevOps Research and Assessment (DORA). DORA's research indicates that comprehensive monitoring and observability is a critical capability and an indicator of high-performing teams. Monitoring key SLIs (aligned with Service Level Objectives (SLOs) set according to Service Level Agreements (SLAs)) allows us to determine how change impacts our applications. This is increasingly important as we transition from large monolithic systems to distributed cloud-native architectures, as the number of moving parts needing to be monitored has vastly increased.
How AWS X-Ray Works
AWS CloudWatch provides observability into computing metrics out of the box; however, this service lacks insight into an application's performance. Utilizing the Custom Metrics feature to record a Service Level Indicator such as 'Response Time' for a specific endpoint will provide actionable visibility. Unfortunately, this metric does nothing to answer the question of why the response time is what it is, and that is the exact question AWS X-Ray is designed to answer.
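As a sketch of the CloudWatch side of this, the snippet below times an endpoint and builds the payload a custom 'Response Time' metric would use. The `BFFApp` namespace and the `home` handler are hypothetical; in a real service the payload would be passed to `boto3.client('cloudwatch').put_metric_data(...)`, which is shown only as a comment here.

```python
import time
from functools import wraps

def publish_response_time(endpoint, value_ms):
    """Build a CloudWatch custom-metric payload for a 'Response Time' SLI.

    In a real deployment this dict would be passed to
    boto3.client('cloudwatch').put_metric_data(**payload);
    here we simply return it.
    """
    return {
        "Namespace": "BFFApp",  # hypothetical namespace
        "MetricData": [{
            "MetricName": "ResponseTime",
            "Dimensions": [{"Name": "Endpoint", "Value": endpoint}],
            "Value": value_ms,
            "Unit": "Milliseconds",
        }],
    }

def timed(metric_name):
    """Decorator that measures a handler's wall-clock response time in ms."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                publish_response_time(metric_name, elapsed_ms)
        return wrapper
    return decorator

@timed("home")
def home():
    time.sleep(0.01)  # stand-in for real work
    return {"status": "ok"}
```

A metric like this tells you *that* the endpoint is slow; X-Ray's traces tell you *where* the time is going.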
Essentially there are two components of AWS X-Ray that are integrated into the application's stack:
- A language-specific SDK (library) that is integrated into the application code to record trace segments as requests are processed.
- An infrastructure component, the X-Ray daemon, that collects the segments produced by the library and forwards them to AWS. The daemon is quite easy to deploy via an OS package or a Docker container. As the segments are pushed to AWS, the hosted AWS X-Ray console is used to view the assembled traces.
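To make the SDK-to-daemon hand-off concrete, the sketch below builds a minimal X-Ray segment document and shows how it would be pushed to the local daemon, which listens on UDP port 2000 and expects each document to be preceded by a small JSON header. The `id` and random portion of the `trace_id` are hard-coded placeholders; a real SDK generates these randomly.

```python
import json
import socket
import time

# Header the X-Ray daemon expects before each UDP-delivered segment document.
DAEMON_HEADER = json.dumps({"format": "json", "version": 1})

def make_segment(name):
    """Build a minimal X-Ray segment document.

    trace_id format: '1-<8 hex digits of epoch seconds>-<24 hex digits>'.
    The ids below are illustrative placeholders only.
    """
    now = time.time()
    return {
        "name": name,
        "id": "70de5b6f19ff9a0a",  # 16-hex-digit segment id (placeholder)
        "trace_id": "1-%08x-a006649127e371903a2de979" % int(now),
        "start_time": now,
        "end_time": now + 0.05,
    }

def send_to_daemon(segment, host="127.0.0.1", port=2000):
    """Serialize header + segment and push it to the daemon over UDP."""
    payload = (DAEMON_HEADER + "\n" + json.dumps(segment)).encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))
    sock.close()
    return payload
```

In practice the SDK handles all of this for you; the point is that instrumentation is just lightweight, fire-and-forget UDP messages to a local process, which is why the runtime overhead is low.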
How to Use AWS X-Ray
As a simple demonstration, this example will utilize a ‘Backend for Frontend’ pattern for an imaginary enterprise social app. The BFF pattern highlights distributed tracing quite well and should provide a good demo.
We will look at a simple ‘home’ endpoint for the app which returns a list of team members, some TODOs, and the last five documents that the individual worked on. We will define an SLO for this ‘home’ endpoint to be a response or load time of 150ms. This code will be intentionally poor to provide interesting traces.
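The naive version of this handler might look like the sketch below. The three backend functions are hypothetical stand-ins (the blog's actual services are not shown), with `time.sleep` simulating network latency; the key point is that the calls run in series, so the endpoint's latency is the sum of its dependencies.

```python
import time

# Hypothetical stand-ins for the three backend services;
# each sleep simulates network latency plus processing time.
def get_team_members(user_id):
    time.sleep(0.05)
    return ["alice", "bob", "carol"]

def get_todos(user_id):
    time.sleep(0.03)
    return ["review PR", "update docs"]

def get_recent_documents(user_id, limit=5):
    time.sleep(0.04)
    return ["design.md", "notes.txt"][:limit]

def home(user_id):
    """Naive BFF 'home' endpoint: each dependency is called in series,
    so total latency is the *sum* of the three calls."""
    return {
        "team": get_team_members(user_id),
        "todos": get_todos(user_id),
        "documents": get_recent_documents(user_id),
    }
```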
AWS Console: Trace Map and Segment View
The trace that AWS X-Ray creates begins with the entry into the system and collects segments from our instrumented services to provide a holistic view of the request. The AWS X-Ray console provides two main sections to analyze: the Trace Map and the Segment View. The Trace Map is quite nice as a higher-level view of the system, closely matching the application design from Figure 1. It makes it easy to visualize how the different services impact the SLO of the endpoint.
Immediately this trace highlights a few problems with the application:
- The trace map indicates that an 'Address Service' is being called several times, which is not part of the design.
- The average response time of 36ms from this service, multiplied by the seven requests, totals 252ms, which is quite different from the 2.64 seconds reported by the member service.
- The segment timeline inside the BFF Application indicates the service calls are being executed in series.
- A duration of 2.8 seconds is far greater than the 150ms budget defined by the SLO.
At this point it is obvious that fixing or optimizing the Member Service will provide the largest gains toward reaching the service level objective. As this is a distributed tracing system, the trace for the Member Service is also available, and a quick look at it shows significant processing time after each request to the address service.
Reviewing the code answers the question of what is happening between each request to the address service, but it raises several new questions:
- First, if the address service accepts an array of member identifiers, why are we sending the requests one by one?
- Second, the 'team' endpoint looks like code cut and pasted from the 'member' endpoint; is the member address even needed when retrieving a list of team members?
- The address service seems quite fragile; what will happen once this application is in the hands of hundreds of users?
These are all interesting questions; regardless, the design does not require the address service, so it can be removed. In addition, the BFF Application will be updated to parallelize the requests to the dependent services.
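The parallelization fix can be sketched with the standard library's `concurrent.futures`; the backend functions are the same hypothetical stand-ins as before. Because the three calls are independent, running them concurrently drops the endpoint's latency from the sum of the calls to roughly the slowest single call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Same hypothetical backends as before; sleeps simulate latency.
def get_team_members(user_id):
    time.sleep(0.05)
    return ["alice", "bob", "carol"]

def get_todos(user_id):
    time.sleep(0.03)
    return ["review PR", "update docs"]

def get_recent_documents(user_id):
    time.sleep(0.04)
    return ["design.md", "notes.txt"]

def home(user_id):
    """Parallelized 'home' endpoint: the three independent calls run
    concurrently, so total latency approaches the slowest single call
    (~50ms here) rather than the sum of all three (~120ms)."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        team = pool.submit(get_team_members, user_id)
        todos = pool.submit(get_todos, user_id)
        docs = pool.submit(get_recent_documents, user_id)
        return {
            "team": team.result(),
            "todos": todos.result(),
            "documents": docs.result(),
        }
```

In the trace, this change shows up as the backend segments overlapping in the timeline instead of stacking end to end.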
In our final trace, we have achieved the SLO of 150ms as indicated by the total duration of 121ms. The requests to the backend services are performing as they should – in parallel. There is an opportunity for additional gains by looking into the work the TODO service is performing.
Instrumenting and measuring an application's performance is far more powerful than simply determining whether it is up or down. Distributed tracing provides a high-level view of an application's architecture while offering the ability to drill down and focus on the key transactions that impact service level objectives.
AWS X-Ray is one of many Observability tools that can be used to instrument application code. Although it may be limited in features and the languages it supports, it does provide a very low barrier to entry and requires very little infrastructure to achieve its purpose.