Monitoring

CDN observability: What HTTP logs tell you about user experience

In recent studies by an observability tool vendor, over 50% of developers and operators said that they rely on customers to tell them about errors, which means that there is a sizable blind spot in their monitoring systems. Conversely, teams that have comprehensive monitoring and observability solutions in place were 30% more likely to be in the highest-performing group of DevOps teams (see the DORA State of DevOps report 2018 for more details). So today, let's look at closing another gap in making web applications observable and finding errors quickly: collecting metrics from the CDN (Content Delivery Network) logs and generating real-time insights, like increases in error rates on certain pages, or dips in processed traffic.

HTTP logs are surprisingly rich data sources. When processed, they can expose data about traffic patterns and error rates, split across different types of pages and stages in the user journey. For example, reading HTTP logs in an e-commerce environment, we might see traffic on category and search pages rising first, followed by product detail pages, followed by view cart and checkout calls. That makes these logs an interesting source of information for forecasts. By looking at the trends for category and search pages alone, we can make informed forecasts of the order count 10-30 minutes ahead.

Anatomy of an example HTTP log

HTTP logs were the primary source of user experience monitoring in the early days of the internet. They have fallen out of favor since; web analytics vendors place JavaScript code on pages instead, in order to collect data from a user perspective across different web applications. That way, HTTP logs don't have to be collected from those individual applications, and it becomes easy to track what each user does during their session. However, web analytics tools have some blind spots, which make them problematic for monitoring purposes:

  • All pages must be instrumented. If the JavaScript tracking code is missing on a certain page, or a certain application, then that page or application won't show up in reports.
  • Accuracy of reporting depends on how pages are instrumented. If incorrect tracking code is inserted on a page, the page will be incorrectly reported.
  • Connection problems and server errors will not show up. If a user can't retrieve an HTML page, the JavaScript instrumentation will not be triggered and web analytics won't receive a report.
  • When the CDN delivers cached content, we will continue to see web analytics data even when the web servers themselves are already down.

As outlined in "10 steps for fixing scalability issues in large-scale web applications", we need accurate measurements from a user perspective to be able to detect issues, and to calibrate load tests. Processing HTTP logs from the CDN closes many of the blind spots in web analytics. CDNs like Akamai can make HTTP log streams available for a whole site, across all underlying applications. That makes it possible to process the data in real time, to better understand what web responses users are getting. Processing CDN logs is not an algorithmically difficult problem; the difficulty is the sheer volume of data. The challenge is to build a scalable solution that can process millions of log lines per day at a low cost.

For a client in the e-commerce industry, we built a simple yet scalable solution: Edgesense is an open-source tool using serverless functions and streaming services on AWS. How it works:

  1. Collects data from the CDN
  2. Processes the data, in order to map URLs to certain types of pages and stages in the user journey
  3. Calculates metrics on the data, such as the number of total requests, and the number of requests by HTTP status code and URL pattern
  4. Makes the metrics available to monitoring systems, for visualization and alerting
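
To make steps 2 and 3 more concrete, here is a minimal Python sketch of mapping URLs to page types and counting requests by page type and status class. It is an illustration only, not Edgesense's actual code: the URL patterns, the page type names, and the functions classify_url, aggregate and error_rate are all hypothetical, and it assumes that log lines have already been parsed into (path, status) pairs.

```python
import re
from collections import Counter

# Hypothetical URL patterns for an e-commerce site; real rules depend on the site.
PAGE_TYPE_PATTERNS = [
    ("search", re.compile(r"^/search")),
    ("category", re.compile(r"^/category/")),
    ("product_detail", re.compile(r"^/product/")),
    ("cart", re.compile(r"^/cart")),
    ("checkout", re.compile(r"^/checkout")),
]


def classify_url(path):
    """Map a request path to a page type; the first matching pattern wins."""
    for page_type, pattern in PAGE_TYPE_PATTERNS:
        if pattern.match(path):
            return page_type
    return "other"


def aggregate(records):
    """Count requests per (page type, status class) from (path, status) pairs."""
    counts = Counter()
    for path, status in records:
        status_class = f"{status // 100}xx"  # e.g. 503 -> "5xx"
        counts[(classify_url(path), status_class)] += 1
    return counts


def error_rate(counts, page_type):
    """Share of 5xx responses among all requests for one page type."""
    total = sum(n for (pt, _), n in counts.items() if pt == page_type)
    errors = counts[(page_type, "5xx")]
    return errors / total if total else 0.0


# Made-up sample records, for illustration only.
records = [
    ("/search?q=shoes", 200),
    ("/product/123", 200),
    ("/product/456", 503),
    ("/checkout/payment", 500),
    ("/about", 404),
]
counts = aggregate(records)
print(error_rate(counts, "product_detail"))
```

In a streaming setup, counters like these would typically be computed per time window (for example, per minute) and then pushed to the monitoring system as metrics, so that alerts can fire when the error rate for a page type rises.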

The client has been using this solution since mid-2018. On several occasions, their teams have been able to quickly detect problems after software releases, lowering their mean time to resolve (MTTR) for this kind of incident. The CDN log data is also being used in reporting DevOps KPIs, such as error rates.