Site reliability

Disaster Recovery and resilience in microservices architectures

Back when monolithic enterprise applications were the norm, there was a simple blueprint for disaster recovery: make sure there are backups, and if anything happens to the data center in which the application is hosted, make sure that you can restore your backups to a secondary data center. This also meant that disaster recovery capabilities could be added to an application as an afterthought, and that mainly IT operations teams (and not necessarily software development teams) would need to worry about it. In today's highly distributed architectures, that approach doesn't work anymore, yet many of the IT industry's best practices for business continuity are still stuck in the old patterns. Here, we propose a more modern and agile approach to business continuity planning (BCP) and disaster recovery (DR).

BCP/DR still matters

Business Continuity Planning (BCP) is the practice of identifying threats to the company and ensuring that critical business functions can continue even when disaster strikes. Disaster Recovery (DR) plans, procedures and tools are designed to ensure controlled recovery from disasters, and thus support the goal of business continuity. Both contribute towards "resilience", the ability of an organization (or society as a whole) to deal with adverse circumstances.

Ideally, a company meticulously maintains an inventory of its critical business processes, regularly assesses the business impact of various threats to those processes ("BIA", Business Impact Analysis) and maintains policies, procedures, tools and records that make sure that everyone is prepared when disruptive events happen.

At a certain stage of maturity, companies need to have BCP/DR capabilities, even if they are not in the business of operating critical infrastructure such as power plants. Regulatory bodies as well as B2B customers want proof that companies can run their business even under adverse conditions. Violating service level agreements (SLAs) in B2B contracts can have serious consequences. Even without SLAs, data center failures carry high costs, estimated at over $740,000 per outage on average by Emerson and the Ponemon Institute in 2016. With digital transformation making industries more dependent on IT, the cost is bound to rise.

Traditional BCP/DR approaches

In the world of large enterprise IT, a large number of critical processes have relied on a small set of monolithic software applications over the last decades. For recovering from disasters, that is a fairly simple situation to plan for: as long as there are only a few applications to worry about, and they are connected only through batch jobs that synchronize data between them, a one-size-fits-all approach can be used to recover from most threat scenarios on an infrastructure level. The DR recipe found in traditional organizations usually consists of the following:

  1. Backup/Restore systems for important data, running daily or more frequently, often focusing on a single database system or virtual machine snapshots.
  2. A method of replicating data to a secondary data center, and being able to provision new servers (or turn on existing ones) there quickly.
  3. Disaster recovery plans that describe how to initiate the failover to the secondary data center.
  4. Annual tests of the disaster recovery plans, and documentation of those tests.

This recipe has the great advantage that only a few teams (often, the infrastructure and database teams) are responsible for disaster recovery, and most other teams don't need to worry about resilience too much. Of course, there are some drawbacks to that approach:

  • It assumes that important data is centralized in a few spots, and is blind to any persistent data stored elsewhere.
  • It's expensive to test frequently, so the plans often fall out of date between tests.
  • It uses the same procedure for all applications, and constrains the architectural choices that can be made in each application.
  • It takes hours or even days to fully implement in case of a disaster.

They don't work for microservices

Netflix popularized the concept of "microservices", small services with a narrow scope. Their reasoning is that breaking a large system into many small independent parts is the only way to keep up a high rate of change while still maintaining overall stability. Though there are certainly challenges with implementing that approach, microservices have become the de facto standard for designing large systems. That means that modern applications are composed of a network of small services, each of which manages its own persistent data in different databases and storage locations. Often, they are hosted at different cloud locations around the globe. What does that mean for BCP/DR?

First, it means that the traditional "Backup/Restore to secondary data center" plan is to BCP/DR what the firewall is to IT security: an outdated approach that can't cover the needs of distributed systems. It's good to have this basic capability, but it's not sufficient: resilience needs to be built into systems on a higher level in order to be effective.

Second, it means that microservice architectures are less vulnerable to large catastrophic risks, but more vulnerable to many smaller risks, such as outages of other services. This can also be seen in the DevOps survey.

Building DR capabilities instead of a DR plan

In an architecture composed of networked microservices, a major requirement is that they store persistent data independently in order to reduce dependencies. That thwarts the idea of maintaining a single DR plan that centers on backing up data in a large database. Instead, what matters is that each service has the capability to deal with extreme circumstances. What was a DR plan in the old world needs to turn into a set of DR capabilities, owned independently by different teams, but still governed and actively managed.

Where a monolithic system needs a DR plan, a microservice architecture needs a DR blueprint that can be (but doesn't have to be) used by service owners.

Where a backup/restore-based DR system is provided for a monolithic system, DR capabilities need to be built and maintained for each service in a microservice architecture.

Where a DR test is performed every year in a monolithic system, resilience needs to be continuously tested in a microservice architecture.
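
One way to read "continuously tested" is as an automated check that runs with every build rather than once a year. The following is a minimal sketch under that assumption; the names used (RecommendationClient, product_page, FALLBACK_ITEMS) are hypothetical and not taken from any particular system. It injects an outage into a dependency and verifies that the calling service degrades gracefully instead of failing.

```python
import unittest


class DependencyUnavailable(Exception):
    """Raised when a downstream service cannot be reached."""


class RecommendationClient:
    """Hypothetical client for a downstream recommendation service."""

    def __init__(self, healthy: bool = True) -> None:
        self.healthy = healthy  # set to False to inject an outage

    def top_items(self, user_id: str) -> list[str]:
        if not self.healthy:
            raise DependencyUnavailable("recommendation service is down")
        return [f"item-for-{user_id}"]


# Static content served when the dependency is unavailable (hypothetical).
FALLBACK_ITEMS = ["bestseller-1", "bestseller-2"]


def product_page(user_id: str, client: RecommendationClient) -> dict:
    """Build a page, falling back to static content if the dependency fails."""
    try:
        items, degraded = client.top_items(user_id), False
    except DependencyUnavailable:
        items, degraded = FALLBACK_ITEMS, True
    return {"user": user_id, "items": items, "degraded": degraded}


class ResilienceTest(unittest.TestCase):
    def test_page_survives_dependency_outage(self) -> None:
        page = product_page("u1", RecommendationClient(healthy=False))
        self.assertTrue(page["degraded"])
        self.assertEqual(page["items"], FALLBACK_ITEMS)

    def test_page_uses_dependency_when_healthy(self) -> None:
        page = product_page("u1", RecommendationClient(healthy=True))
        self.assertFalse(page["degraded"])
        self.assertEqual(page["items"], ["item-for-u1"])


if __name__ == "__main__":
    unittest.main()
```

A test like this only covers one failure mode of one dependency, but because it runs on every change, it catches regressions in the fallback path long before an annual DR exercise would.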

Practices for building DR capabilities

We have found these practices to be effective when building DR capabilities:

  1. The Business Continuity Policy is still relevant. Independent of application architectures, people still need to know that resilience and continuity are important concepts, and be aware of the impact that system outages could have on business goals and processes. Thus, the BC policy is still relevant, but it doesn't help if only the infrastructure team knows it exists. Instead, it needs to be communicated to all teams that plan, build and run critical systems.
  2. Establish a highly visible risk management practice to find resilience risks and assess their impact in business terms. All teams building technology should be encouraged to think about the threats their systems face, and contribute their assessment of substantial risks into a central inventory. This helps technology leaders allocate resources (give development time) to address the largest risks and balance delivery of new features with ensuring overall stability of the system.
  3. Maintain service dependency maps. In highly distributed application landscapes, there is a risk of cascading failures of services. Shielding services from failures in other services starts with knowing which services they depend on, and what to do when those services are not available. Likewise, assessing the risks from outages of services is hardly possible without knowing which business processes and other services depend on them. Therefore, keeping a map of dependencies between processes and services up-to-date (and making the map available to all teams who need it) is crucial to understanding where resilience can be improved.
  4. Measure and continuously improve resilience. When disasters don't frequently occur, it's not easy to measure how prepared an organization is to deal with them. However, it's often easy to identify a vulnerable system by looking at its availability, spikes in response times or effort involved in deploying new versions. Measuring and publicly reporting such performance indicators goes a long way to keeping people aware of the goal of resilience. Google's Site Reliability Engineering (SRE) approach, for instance, emphasizes the need for development teams to deliver on their service level objectives (SLOs), and encourages operational teams to push back on delivery of new features if stability objectives have not been met.
  5. Implement a DR blueprint, consisting of tools and procedures. Though there can be no one-size-fits-all solution to disaster recovery, the DR needs of different services can be very similar. For instance, services using MySQL as their database technology can benefit from a "best practice" for replicating data across global regions and switching to a replica when the primary fails. Similarly, Velero can serve as a base technology for backing up and restoring services that run on Kubernetes clusters.
  6. Demand a demonstration of DR capabilities when services go into production. Depending on an organization's software lifecycle model or product maturity model, not all services need to have the same level of resilience. For critical services however, there should be controls that make sure that a service is reasonably resilient as soon as it is live. Documenting resilience capabilities in a "Service Runbook" is a good start, but only by continuously testing for resilience can you make sure that the plans actually work.
  7. Encourage game days to test how services respond to unusual conditions, and document the results from the tests. Not every organization needs to have a "Chaos Testing" practice, and not many microservices architectures have been built to withstand random loss of individual services. However, nobody can expect DR plans that were never tested to work. And though there is a risk involved in simulating outages, the risk is higher when creating brittle architectures by assuming that nothing will fail.
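
As a rough illustration of practice 3, a dependency map can be kept as plain data and queried for the "blast radius" of an outage. The sketch below is only an assumption about how such a map might look: the service and process names are hypothetical, and a real map would more likely be generated from tracing data or a service catalog than maintained by hand.

```python
from collections import deque

# Hypothetical dependency map: each key depends directly on the listed services.
# Entries prefixed with "process:" stand for business processes.
DEPENDS_ON = {
    "process:order-to-cash": ["checkout"],
    "process:customer-support": ["ticketing"],
    "checkout": ["payments", "inventory"],
    "payments": ["fraud-check"],
    "ticketing": ["customer-db"],
    "inventory": [],
    "fraud-check": [],
    "customer-db": [],
}


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return everything that depends, directly or transitively, on `failed`."""
    # Invert the map so we can walk from a service to its dependents.
    dependents: dict[str, set[str]] = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)

    # Breadth-first walk over the inverted map.
    impacted: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for node in dependents.get(current, ()):
            if node not in impacted:
                impacted.add(node)
                queue.append(node)

    # In case of a dependency cycle, do not report the failed service itself.
    impacted.discard(failed)
    return impacted


if __name__ == "__main__":
    # Which services and business processes are affected if fraud-check fails?
    print(sorted(impacted_by("fraud-check", DEPENDS_ON)))
```

With the sample data, an outage of the hypothetical fraud-check service reaches payments, checkout and the order-to-cash process. Those are the services that need a defined fallback for that failure, and the process whose business impact should be assessed.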

Corporate Digital Responsibility

With "Digital Innovation" top-of-mind in many industries, it's easy to see that decision makers are willing to take risks and ship digital products quickly in order to test new markets as Minimum Viable Products (MVPs). The proponents of "Corporate Digital Responsibility" argue that the risks from those MVPs are ignored, and that disasters don't happen frequently enough to provide adequate "market feedback" that could make sure that the risks are addressed.

However, it's questionable whether new regulatory requirements for CDR would actually contribute to greater resiliency of digital services. How does one audit the ability of a system to recover from various disruptive events, when those events happen very seldom? It's likely that a regulatory approach to resilience would be based on checklists of tried-and-true traditional practices, and would thus enforce the same kind of monolithic "one-size-fits-all" approach that is getting less and less relevant in distributed systems and diverse operating environments.

Rather, I would hope that increasing market awareness for stability and resiliency, in combination with better built-in resilience of components, will eventually lead to services that are more reliable. With society increasingly networked and dependent on digital services for everyday life, the importance of resilience is sure to rise.