Not every web application was born in the cloud, and not every one satisfies the criteria for clean and scalable 12-factor applications. In fact, successful web businesses tend to build complex interconnected systems over time. Making sure those systems perform well even under sudden spikes of traffic, or when a 3rd-party service fails, is not an easy task. It's even harder to retrofit a scalable and resilient architecture onto an existing system that wasn't built that way. Look no further than Black Friday 2018 for some highly visible website outages, often caused by overloaded servers or unexpected issues with 3rd-party APIs and page components.
However, it's absolutely possible to (a) restore robustness, scalability and good user experience to an existing systems landscape, and (b) prepare for a special event - say, a holiday season promotion or a major new feature launch - without re-engineering the whole system or madly throwing hardware at the issue. This article outlines the 10 major steps of such a project and how they relate to each other.
Before going into the specifics of the methodology, a quick word on project management aspects. In mid-size to large organizations, many different teams will need to share the objectives of robustness and scalability, and work in alignment. Therefore, standard project management methods (agile and traditional) apply - getting buy-in from management and various stakeholders, securing the participation of key people and overall resourcing. However, "fixing scalability" isn't a great definition of project scope - rather, the first part of the project is about continuous improvement, in order to increase performance and reduce risks until the system is in a "good enough" state. The second part of the project leads up to "competition day", the special event for which we need everything to be in optimum shape.
We will want to provide value early to show that the method is working, and give some reassurance that things can be fixed. Therefore, we recommend starting to measure early, sharing KPIs with the whole team, and then running several iterations to improve them in a somewhat open-ended setup. Success of the project / program is defined as "KPIs are good enough and special event goes well". Though the activities below are described as distinct steps, they are not meant to be taken sequentially - overall success hinges on the ability to build a continuous feedback cycle that will yield improvements in all dimensions.
1. Observability
The goal of observability is to bring existing issues to light, and create visibility into interdependencies and scalability constraints in the system. Everyone has monitoring, everyone has logs, but telemetry data has to be put into context to be able to make sense of it. Some key practices we have found to be effective:
- Monitoring from the user perspective - using RUM (real-user monitoring) and user-facing metrics such as HTTP error rates to measure (and report on) system health. We like gathering data on the CDN layer; see "CDN observability: What HTTP logs tell you about user experience" on this site.
- Using APM (Application Performance Monitoring) tools across the entire infrastructure to be able to easily see how components interact.
- Setting up curated metrics on dashboards and publicly visible information radiators, to help teams develop a shared understanding of how their components contribute to the user experience.
Improving observability is the basis for selecting the right parts of the system to optimize, and understanding system behavior during load tests (step 4), incidents and game days (step 6), and finally on competition day (step 10).
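As a minimal sketch of the user-facing error-rate metric mentioned above, the following assumes that HTTP status codes have already been extracted from CDN logs. The function name and the sample data are illustrative and not tied to any particular CDN's log format.

```python
from collections import Counter


def error_rates(status_codes):
    """Return the share of 4xx and 5xx responses among all responses."""
    # Group status codes by their first digit (2xx, 3xx, 4xx, 5xx).
    classes = Counter(code // 100 for code in status_codes)
    total = sum(classes.values())
    if total == 0:
        return {"4xx": 0.0, "5xx": 0.0}
    return {"4xx": classes[4] / total, "5xx": classes[5] / total}


# Hypothetical sample: ten responses, one client error, two server errors.
sample = [200, 200, 304, 404, 500, 200, 503, 200, 200, 200]
rates = error_rates(sample)
print(rates)
```

Tracking the 5xx share over time, rather than raw error counts, keeps the metric comparable between quiet periods and traffic peaks.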
2. Architecture Principles
Every organization hopefully has architecture guidelines and best practices, but are they known? Are they tested for and adhered to? Do people buy into them? A "community of practice" approach goes a long way towards establishing some easily remembered guidelines that developers will actually apply in practice. Of course, having guidelines doesn't retrofit them onto existing software. But this step prepares the software improvements (step 5) and resiliency testing (step 6). Also, feedback from production performance - during normal operations, load tests and game days - will inform continuous improvement of the architecture principles.
3. Traffic Forecasts
An ideal system can scale to meet any demand. All real-world systems have scalability constraints, however, so getting business stakeholders to estimate the load on the system on peak-load days is essential for actually testing for those constraints. Getting realistic forecasts is a science in itself, but having them is essential for preparing load tests (step 4) and hardware scaling (step 7).
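Business stakeholders usually think in sessions or orders rather than requests per second, so a forecast typically needs a small conversion before it can drive a load test. The sketch below shows one simple way to do this; the parameter names, the peak factor and all numbers are assumptions for illustration, not figures from a real project.

```python
def peak_requests_per_second(sessions_per_hour, requests_per_session, peak_factor):
    """Convert a business forecast into a target request rate.

    peak_factor accounts for traffic not being evenly spread within the hour
    (hypothetical value; it would normally be derived from historical data).
    """
    average_rps = sessions_per_hour * requests_per_session / 3600
    return average_rps * peak_factor


# Hypothetical forecast: 180,000 sessions in the busiest hour,
# 40 requests per session, bursts 50% above the hourly average.
target_rps = peak_requests_per_second(
    sessions_per_hour=180_000, requests_per_session=40, peak_factor=1.5
)
print(target_rps)
```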
4. Load Testing & Analysis
In our experience, running load tests on the production environment is essential in any non-trivial architecture. Admittedly, it's not easy to build load tests that accurately simulate the behavior of actual users, while not modifying actual production data. However, the goal is not to cover 100% of functionality with load tests. Instead, we want to have a solid understanding of the gaps between simulation and reality, and be able to estimate the remaining risks from those gaps. We have developed iterative methods for aligning simulated traffic with actual user traffic and scaling tests up to the forecasted levels. See "Making load tests life-like" on this site for more detail on this.
Observing system behavior (step 1) during load tests is essential for understanding the most relevant architectural issues (to be addressed in step 5) and scalability constraints (to be addressed in step 7).
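One simple way to quantify the gap between simulation and reality is to compare the traffic mix, that is, the share of requests per endpoint in production logs versus in the load test. The following sketch assumes request counts per endpoint are already available; it is an illustration of the idea, not the iterative method referenced above, and the endpoint names and counts are made up.

```python
def traffic_mix(counts):
    """Turn absolute request counts per endpoint into shares of total traffic."""
    total = sum(counts.values())
    return {endpoint: n / total for endpoint, n in counts.items()}


def mix_gaps(real_counts, simulated_counts):
    """Per endpoint: simulated share minus real share.

    Positive values mean the load test over-represents the endpoint,
    negative values mean it under-represents (or misses) it.
    """
    real = traffic_mix(real_counts)
    simulated = traffic_mix(simulated_counts)
    endpoints = set(real) | set(simulated)
    return {e: simulated.get(e, 0.0) - real.get(e, 0.0) for e in endpoints}


# Hypothetical counts: the load test does not exercise /checkout at all.
real = {"/": 50, "/search": 30, "/checkout": 20}
simulated = {"/": 50, "/search": 50}
gaps = mix_gaps(real, simulated)
for endpoint, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{endpoint}: {gap:+.2f}")
```

Endpoints with large negative gaps are the ones whose behavior under load remains an untested risk.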
5. Software Improvements
Using our set of architecture principles and findings from the load tests, we can now "surgically" address the most impactful issues in the software architecture. Multiple development teams may need to work together to address issues; APM traces will give valuable clues into where services are not working together well. It's essential to get the bulk of software improvements done before throwing hardware at the rest of the issues (step 7).
6. Game Days & Resiliency
While load tests and software improvements can prepare the system for well-defined and expected situations, issues can be caused by a multitude of internal and external dependencies - for example, internal services or 3rd party SaaS services that send unexpected responses, infrastructure issues, error loops and cascading failures.
The goal of the resiliency track is to harden the system against the highest ranked risks. Although Chaos Engineering has developed methods to methodically inject various failures, we have found that much manual experimentation is needed to make progress in this area, and that requires developers as well as product owners to buy into the concept and participate. An organization in which developers don't share the pain of outages, because they are shielded by operations teams or platform teams, is less likely to establish successful "Game Days" (establishing DevOps culture is beyond the scope of this article, however). Resiliency improvements lay the groundwork for improving availability metrics, and reduce the risk of unexpected trouble during competition day (step 10).
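To make "hardening against a failing dependency" more concrete, here is a minimal sketch of a circuit breaker, a common pattern that this article does not prescribe but that often comes up in this track. After repeated failures it stops calling the dependency for a while and serves a fallback instead, which helps avoid error loops and cascading failures. All names and thresholds are assumptions for illustration.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and use a fallback instead."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable for testing
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Circuit is open: skip the dependency entirely.
                return fallback()
            # Waiting period is over: allow one trial call (half-open).
            self.opened_at = None
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


def fetch_recommendations():
    """Hypothetical 3rd-party call that is currently failing."""
    raise TimeoutError("recommendation service did not respond")


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
for _ in range(3):
    # The page still renders, just without recommendations.
    print(breaker.call(fetch_recommendations, fallback=lambda: []))
```

A game day is a good occasion to verify that such a fallback actually produces an acceptable user experience, not just that the code path exists.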
7. Hardware Scaling
Software improvements can't solve all scalability issues, especially when 3rd-party software is involved, so we acknowledge that judiciously adding hardware and adjusting autoscaling policies (after addressing software issues) is a necessary activity in increasing scalability. Every hardware and software change will need to prove its effectiveness through load testing, though. With an optimized and scaled-up / scaled-out system that can meet the forecasted load, the system is now nearly ready for competition day (step 10).
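A rough first estimate of how far to scale out can be derived from the forecasted peak and the per-instance throughput measured in load tests. The sketch below assumes a stateless tier that scales roughly linearly, which real systems often do not; the numbers and the target utilization are hypothetical, and any result would still need to be confirmed by another load test.

```python
import math


def instances_needed(peak_rps, rps_per_instance, target_utilization=0.7):
    """Estimate the instance count that serves peak_rps with some headroom.

    rps_per_instance is the sustainable throughput of one instance as
    measured in a load test; target_utilization keeps spare capacity
    for forecast errors and uneven load distribution.
    """
    usable_rps = rps_per_instance * target_utilization
    return math.ceil(peak_rps / usable_rps)


# Hypothetical values: 3,000 requests/s forecast, 250 requests/s per instance.
count = instances_needed(peak_rps=3000, rps_per_instance=250)
print(count)
```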
8. Contingency Planning
With improved software architecture, validated through load tests and game days, the system is now theoretically able to withstand a wide range of unusual situations. However, systems tend to be creative in finding new ways to fail, and traffic forecasts will almost certainly be inaccurate. Thus, we recommend creating a ranked list of remaining risks, and developing contingency plans for the highest-ranked ones. Ideally, all plans would be tested beforehand to prove that they work, though we are willing to acknowledge that not all catastrophic failure situations can be easily simulated. The contingency plans will become part of the competition day runbook (step 10).
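The ranked list of remaining risks can be as simple as a likelihood-times-impact score. The sketch below shows that scoring scheme; the 1-to-5 scales, the scoring formula and the example risks are assumptions for illustration, and many teams will prefer a spreadsheet over code for this.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # 1 (rare) to 5 (almost certain)
    impact: int      # 1 (minor) to 5 (site down)

    @property
    def score(self):
        return self.likelihood * self.impact


def top_risks(risks, limit):
    """Return the highest-scoring risks, which get a contingency plan first."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)[:limit]


# Hypothetical remaining risks after load tests and game days.
remaining = [
    Risk("Traffic exceeds forecast", likelihood=3, impact=4),
    Risk("Payment provider API times out", likelihood=2, impact=5),
    Risk("Search index falls behind", likelihood=2, impact=2),
]
for risk in top_risks(remaining, limit=2):
    print(risk.score, risk.name)
```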
9. Last Optimizations
With the system hopefully already in a much more stable and reliable state as measured by the KPIs, the remaining step is to embrace change - forecasts will change, timelines will change, configurations will be modified, and it's useful to expect some last-minute turbulence before the special event starts. However, most organizations try to avoid last-minute turbulence in the technical domain by establishing feature freezes and more rigid change control before the event starts, to minimize the risk of change-related issues.
10. Competition Day: Event & Situation Room
Nearly done! What remains is to establish effective communication and decision structures for the event day, make sure all relevant aspects of the system are being watched, have the contingency plans and runbooks ready, and be prepared to jump in for last-minute scaling or configuration changes if necessary. Good luck!
Anything can fail, anytime. But after successfully running several improvement iterations following this 10-step program, we have found that actual competition days can be very relaxed, and serve to confirm the good work in preparing for all possible situations without introducing huge amounts of change.