From a 20,000-foot perspective, it's hard to imagine how the "Capacity Management" process as described by ITIL could remain relevant in the cloud era. It was made for centralized groups that are responsible for scarce compute and storage resources, whereas with cloud infrastructure, resources are for all practical purposes no longer limited, and managing them is the cloud provider's job.
Does cloud infrastructure make capacity management someone else's problem?
No, it doesn't - the focus for capacity management just shifts to higher layers. Managing scalability limits and making efficient use of resources on the application level is still highly relevant, even on cloud infrastructure. It may be the cloud provider's responsibility to provide sufficient resources, but as an application owner in a hyperscale environment, you still need to make sure your app doesn't hit scalability limits.
And with infrastructure cost decreasing, why not just ruthlessly overprovision all resources in lieu of keeping an eye on capacity? Unfortunately, infrastructure cost is not zero yet, and you will find overprovisioned infrastructure quickly eating into your digital products' margin - not to mention running out of capacity when application usage suddenly skyrockets.
Autoscaling solves all problems?
The two main risks addressed by capacity management are:
- Running out of resources, and risking service downtimes
- Overspending due to excessive resource usage
Autoscaling in the cloud is positioned to solve both problems at the same time, by making sure that services have just the right amount of resources at any given time. But it's not that easy. A closer look at a typical cloud-native application reveals a spider web of interconnected scalability limits, bottlenecks and configuration parameters. Take a service running on AWS EC2 as an example:
- An application running on an EC2 instance is confined to that instance's resource limits - CPU, network, storage throughput (configurable through volume sizes and provisioned IOPS), in addition to application-specific limits such as thread pool sizes, connection pool sizes, heap limits.
- Though an autoscaling group can provision new EC2 instances when needed, new instances will take some time to spin up, making resource supply less elastic than potential demand.
- The maximum number of EC2 instances will be constrained by network CIDR block limits and the approved maximum of EC2 instances (per region and instance type, and total per region).
- Applications will depend on shared components that have their own limits and may not autoscale: internet gateways, proxies, artifact repositories for deployment, and AWS APIs such as tagging and CloudWatch.
- Persistent storage will impose its own scalability limits, such as instance sizes and provisioned storage for RDS databases, provisioned DynamoDB reads and writes, and sharding and block size settings for streams.
- While container clusters such as Kubernetes reduce the spin-up time for services and allow for more advanced autoscaling, the configured resource limits of the clusters themselves remain, in addition to the limits of any persistent storage systems.
- Even serverless applications have maximum concurrent invocations, function timeouts and maximum memory settings, and they use up IP addresses when running inside a private VPC.
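To make the interplay of these limits concrete, here is a minimal Python sketch. It assumes every limit can be translated into a request rate the service could sustain, which is a simplification, and all names and numbers are hypothetical. It shows two ideas from the list above: the smallest ceiling caps the whole service, and slow instance spin-up has to be covered by spare capacity.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Limit:
    """One scalability limit, expressed as the request rate it can sustain."""
    name: str
    max_requests_per_sec: float


def binding_limit(limits: Iterable[Limit]) -> Limit:
    """Return the limit that caps the whole service (the smallest ceiling)."""
    return min(limits, key=lambda limit: limit.max_requests_per_sec)


def headroom_needed(growth_per_sec: float, spin_up_sec: float) -> float:
    """Spare capacity (requests/sec) needed to absorb demand growth
    while newly launched instances are still starting up."""
    return growth_per_sec * spin_up_sec


# Hypothetical numbers, for illustration only.
limits = [
    Limit("instance CPU x max instances", 12_000),
    Limit("database connection pool", 4_500),
    Limit("provisioned table writes", 6_000),
]

bottleneck = binding_limit(limits)
print(f"Service ceiling: {bottleneck.max_requests_per_sec:.0f} req/s ({bottleneck.name})")

# Demand growing by 10 req/s every second, instances needing 180 s to start.
print(f"Headroom needed: {headroom_needed(10, 180):.0f} req/s")
```

In this made-up example, adding EC2 instances would not help at all, because the database connection pool is the binding limit. In a real system the limits interact in less tidy ways, which is why measurement and load testing remain necessary.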
In any non-trivial application, scalability and cost-efficiency are still dependent on good design, performance testing and continuous measurement of performance, cost and resource saturation. Those aspects are, in essence, what cloud capacity management is about.
Cloud capacity management focuses on applications, not on resource pools
Capacity management in the pre-cloud era was very focused on managing resource pools (such as shared storage, mainframe compute, virtualization environments), and forecasting slow-moving trends in their usage. This part of capacity planning remains relevant for groups managing shared resources - public cloud providers, private cloud providers, or groups managing container clusters. In the age of digital transformation and DevOps, there is growing demand for capacity management from groups managing applications and services - everyone operating cloud-based applications in a dynamic environment has a responsibility for balancing application performance and cost.
Cloud capacity management needs to
- be aware of interdependencies between resources, and associated scalability limits
- be able to deal with rapid change in software and infrastructure, and
- be able to deal with spikes in demand that are impossible to plan for ahead of time.
So what kind of tool stack will support capacity management for cloud services?
Capacity management tools need to be integrated into monitoring and cloud management
The objective of cloud capacity management is to make sure that services have sufficient, but not excessive, resources available to them at any given time. In an ideal world, capacity managers need to be able to observe traffic and saturation, monitor resource bottlenecks and limits, simulate load on the system, and analyze the behavior of systems under load.
Let's assume that your business service will experience seasonal peaks, such as Black Friday sales. Service limits and dependencies frequently change, as new deployments take place nearly every day. You run performance tests against your production environment, generating huge amounts of monitoring data on various levels. How do you make sense of the data? Which metrics indicate that a subsystem has reached a limit? How do system errors propagate through the network of services? Which leading indicators could have given early warning of trouble to come?
We think that next-generation tools will need to provide these critical capabilities:
- Observe: monitor and measure application inputs and outputs, traffic and latency, errors, saturation of relevant limits on various levels of the stack.
- Connect: build an awareness of what services are running and what dependencies they have on infrastructure and other services, building a topology and discovering configured limits as well as potential bottlenecks that are approaching.
- Simulate: simulate load and perform stress tests, to discover hidden constraints and dependencies - beyond a certain level of complexity, experimentation needs to replace knowledge of theoretical scalability limits.
- Analyze: analyze behavior of systems under load, and generate insights and recommendations based on trends, correlation, dependencies and saturation of constraints.
- Act: take automated action to prevent services from running into capacity limits, and to release resources that are no longer needed.
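As a rough illustration of how the Observe, Analyze and Act steps could fit together, the following Python sketch computes the saturation of a single limit, fits a straight-line trend through recent samples, and applies a deliberately simple policy. The function names, thresholds and sample values are assumptions made for this example, and a linear trend is only one of many possible models.

```python
from typing import Optional, Sequence


def saturation(used: float, limit: float) -> float:
    """Fraction of a limit currently in use (1.0 means the limit is reached)."""
    return used / limit


def intervals_until_limit(samples: Sequence[float], limit: float) -> Optional[float]:
    """Fit a least-squares line through equally spaced samples and estimate
    how many further sampling intervals remain until the trend crosses the
    limit. Returns None if there are too few samples or no upward trend."""
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples)) / sum(
        (x - mean_x) ** 2 for x in range(n)
    )
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (n - 1))


def recommend(samples: Sequence[float], limit: float,
              warn_at: float = 0.8, min_intervals: float = 3) -> str:
    """Hypothetical policy: act when the latest sample is close to the limit,
    or when the trend is projected to reach it within min_intervals."""
    remaining = intervals_until_limit(samples, limit)
    if saturation(samples[-1], limit) >= warn_at:
        return "add capacity"
    if remaining is not None and remaining < min_intervals:
        return "add capacity"
    return "no action"


# Hypothetical samples: open database connections against a pool limit of 100.
print(recommend([10, 20, 30, 40], limit=100))
print(recommend([60, 70, 80, 90], limit=100))
```

The first call sees 40% saturation and a trend that is still several intervals away from the limit; the second sees 90% saturation and recommends adding capacity. A production tool would apply the same idea across many limits at once, using the discovered topology to decide which ones matter for a given service.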
What types of tools can be used in cloud capacity management?
Many organizations have tools for observing (monitoring) infrastructure and simulating load (performance tests). Only larger organizations have tools for connecting data sources and running analytics.
Traditional capacity management tools have typically been built for groups that manage shared resource pools - compute and storage. Some of their methodology may still apply to running Kubernetes clusters, for example, but most features will be difficult to apply to a fast-changing, application-centric world.
APM (Application Performance Management) tools use distributed tracing to discover connections between applications, and in many cases also between applications and infrastructure. That's excellent for building a topology of components and keeping it up-to-date, something that is difficult to do in a manually-maintained CMDB. However, APM's focus is not really to connect data across all layers and look at resource usage; most tools focus on finding application-level dependencies and troubleshooting user-facing issues.
Cloud Management Platforms are driven by the need to govern self-service cloud usage, and manage the infrastructure cost from cloud providers. Most of them collect not only billing data, but also data on resource usage, and offer "rightsizing" functionality - a type of capacity management focused on virtual machines and storage volumes. The platforms that aim to be the only way applications are deployed to (multi-) clouds are theoretically in a good position to connect application-level and infrastructure-level data; however, they rarely go as deep into monitoring and topology-building as APM tools do.
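The idea behind "rightsizing" can be sketched in a few lines of Python. This is a simplified illustration, not how any particular platform works: it looks at CPU only, and the size catalog, the 60% target utilization and the function name are all assumptions made for the example.

```python
from typing import Mapping

# Hypothetical catalog: size name -> number of vCPUs.
CATALOG = {"small": 2, "medium": 4, "large": 8, "xlarge": 16}


def rightsize(current_size: str, peak_cpu_util: float,
              target_util: float = 0.6,
              catalog: Mapping[str, int] = CATALOG) -> str:
    """Return the smallest size that keeps the observed peak load at or
    below target_util. Falls back to the largest size if none is big enough."""
    busy_vcpus = catalog[current_size] * peak_cpu_util
    needed_vcpus = busy_vcpus / target_util
    for name, vcpus in sorted(catalog.items(), key=lambda item: item[1]):
        if vcpus >= needed_vcpus:
            return name
    return max(catalog, key=lambda name: catalog[name])


# An "xlarge" machine peaking at 20% CPU can be downsized ...
print(rightsize("xlarge", peak_cpu_util=0.20))
# ... while a "medium" machine peaking at 90% CPU should grow.
print(rightsize("medium", peak_cpu_util=0.90))
```

A realistic recommendation would also need to consider memory, network and storage throughput, as well as how the peak was measured, which is exactly where infrastructure-level data alone stops being sufficient.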
The emerging "AIOps" software category provides tools for processing more data than humans can handle. Given the rapid rate of change and the amount of data available in modern systems, there can be no doubt that humans can't watch and process all data needed for capacity management. Semi-intelligent analysis systems will need to help humans make sense of the data more effectively, and pinpoint potential scalability bottlenecks.
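To give a flavor of the simplest kind of automated analysis, the Python sketch below flags samples that deviate strongly from the preceding window of values. It is a basic rolling z-score check, far simpler than what AIOps products actually do; the window size, threshold and latency values are assumptions chosen for the example.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_outliers(values: Sequence[float], window: int = 5,
                  threshold: float = 3.0) -> List[int]:
    """Return the indexes of values that lie more than `threshold` standard
    deviations away from the mean of the `window` values preceding them."""
    outliers = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # A perfectly flat history (sigma == 0) is skipped in this sketch.
        if sigma > 0 and abs(values[i] - mu) > threshold * sigma:
            outliers.append(i)
    return outliers


# Hypothetical response times in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 250, 101]
print(find_outliers(latencies_ms))
```

Here only the spike at index 6 is reported. Even this naive check shows the trade-off such tools have to manage: a spike inflates the statistics of the windows that follow it, so later anomalies can be masked unless the method is made more robust.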
Who is in charge of managing cloud capacity?
In a resource-oriented capacity management approach, it's clear who is responsible for capacity planning: it's the IT infrastructure team that provides shared resource pools such as storage and compute clusters, whereas application-centric teams are in charge of consumption.
In an application-oriented capacity management approach, the responsibility for capacity planning needs to be with the teams owning each service. In essence, that means that modern tools need to enable a number of small development teams to easily perform tasks that were once the domain of a centralized group of specialists. Capacity management tools need to enable developers and DevOps engineers to fully own their "slice" of the system, and observe, connect, simulate and analyze their components.
Building a siloed operations specialist group for capacity management will fall short of the mark, as centralized teams are increasingly struggling to keep up with development teams' speed of delivery. Instead, a central capacity management group can act as a competence center, offering consulting and a tool stack for semi-autonomous service teams. It might also be organized as a less formal "community of practice", with individuals sharing practices and knowledge.
What's your approach to cloud capacity management?
It's obvious that there is no simple answer to the question; many groups and technical layers are involved in the process, and all practices and tools will highly depend on who is responsible for capacity, and on whether organizations focus more on cost efficiency or on user experience and time to market for digital services.
This article attempted to outline the emerging area, and it's interesting to see how related the software categories for capacity management, cloud management, APM and AIOps are with regard to their contributions to helping balance user experience and cost. To continue on your journey into the wonderful world of capacity, see below for some examples of tools and the areas they focus on.
Capacity Management Tools
Focus areas: Resource utilization in aggregate, forecasts, saturation limits, reservations, "what if" analysis
Examples: BMC TrueSight Capacity Optimization, Densify, ITRS Insights Capacity Planner, Syncsort Athene
Cloud Management Tools
Focus areas: Analyze cloud cost and usage, "rightsizing" and cleaning up resources, define policies, cloud migration planning, self-service provisioning and deployment
Examples: Cisco CloudCenter, Cloudability, CloudCheckr, CloudHealth (VMware), RightScale
APM Tools
Focus areas: Distributed tracing through applications and services, metrics for application and infrastructure performance, finding incident root causes
Examples: AppDynamics (Cisco), Datadog, Instana, New Relic
AIOps Tools
Focus areas: Collecting machine data, running statistics and machine learning, detecting anomalies and outliers, finding incident root causes
Examples: BigPanda, CA AIOps, FixStream, Moogsoft