In a survey called "From DevOps to AIOps" that ran in the first quarter of 2019, I asked development and operations professionals about the state of their organization and technology. The respondents were mainly from SaaS and e-commerce companies, where both failure to deliver features and failure to operate services safely usually cause intense pain. What I wanted to learn about was:
- What are the largest problems teams face?
- How do they deal with the pressure to deliver features faster and faster?
- How do they detect and resolve incidents?
- What types of monitoring capabilities have they developed?
- How do they make sure services have enough resources and capacity, while also managing infrastructure cost?
- How could the new breed of "AIOps" tools address teams' needs in SaaS and e-commerce?
The results are interesting; I'll outline some of the findings in this article.
Faster delivery + insufficient automation + low observability = headaches in production
Across the board, people report that they struggle to keep up with business demands to deliver features faster. Complaints about incidents and availability, as well as the effort that goes into building and maintaining CI/CD pipelines, also rank high.
I think there is a connection here: Organizations are pushing to accelerate their rate of software delivery. To that end, they need to automate as much of the QA and deployment work as possible. But once the quick wins from automating the basic tasks have been harvested, the effort for automating the full cycle rises steeply (especially in software that wasn't built for delivery automation!). That leaves teams with gaps in their delivery pipeline as well as less-than-ideal tooling. Combine this with software components that may not have been designed for maximum resiliency, and it's clear that the risk of incidents goes up.
Unexpectedly, nearly all respondents report that infrastructure cost is not a major issue. I would have expected organizations to be more sensitive to public cloud costs, but there doesn't seem to be much pressure to cut costs in this area.
Cloud-native stacks trump enterprise stacks in nearly all dimensions
I'm glad I included a question about the type of technical architecture - enterprise applications or cloud-native services - in the survey; this turned out to be the biggest factor influencing an organization's performance as well as the maturity of its tools! On practically all measurements featured in the survey, teams using cloud-native stacks fare better:
- They can ship features faster
- The development team can take greater responsibility for operations (DevOps)
- The software components get better scores with regard to availability and scalability
- Teams with cloud-native stacks use more advanced observability tools
- They are more proactive in sizing resources and managing cost
Looking at the issues that remain even with cloud-native stacks (albeit at a lower level), it's interesting to see that they differ from the types of issues that enterprise software teams encounter: issues with resiliency, root cause detection and infrastructure have the highest mind share, whereas problems with undetected incidents and scalability incidents are hardly ever mentioned by cloud-native teams.
This, again, seems plausible: teams operating a cloud-native service landscape will have a certain degree of auto-scaling built in, and will certainly be more aware of the need to detect and resolve issues in production quickly. However, they will be dealing with an interconnected network of services, and they are more likely to encounter resilience issues here than in a monolithic enterprise application. And when something goes wrong, it's not trivial to find out which service or component is at fault, so comprehensive monitoring, log aggregation and distributed tracing matter more than in an enterprise software setting.
Contact me if you're interested in the detailed results and numbers.