SRE Doesn’t Scale

We encounter a lot of organizations talking about or attempting to implement SRE as part of our consulting at Real Kinetic. We’ve even discussed and debated ourselves, ad nauseam, how we can apply it at our own product company, Witful. There’s a brief, unassuming section in the SRE book tucked away towards the tail end of chapter 32, “The Evolving SRE Engagement Model.” Between the SLIs and SLOs, the error budgets, alerting, and strategies for handling change management, it’s probably one of the most overlooked parts of the book. It’s also, in my opinion, one of the most important.

Chapter 32 starts by discussing the “classic” SRE model and then, towards the end, how Google has been evolving beyond this model. “External Factors Affecting SRE”, under the “Evolving Services Development: Frameworks and SRE Platform” heading, is the section I’m referring to specifically. This part of the book details challenges and approaches for scaling the SRE model described in the preceding chapters. This section describes Google’s own shift towards the industry trend of microservices, the difficulties that have resulted, and what it means for SRE. Google implements a robust site reliability program which employs a small army of SREs who support some of the company’s most critical systems and engage with engineering teams to improve the reliability of their products and services. The model described in the book has proven to be highly effective for Google but is also quite resource-intensive. Microservices only serve to multiply this problem. The organizations we see attempting to adopt microservices along with SRE, particularly those who are doing it as a part of a move to cloud, frequently underestimate just how much it’s about to ruin their day in terms of thinking about software development and operations.

It is not going from a monolith to a handful of microservices. It ends up being hundreds of services or more, even for the smaller companies. This happens every single time. And that move to microservices, in combination with cloud, unleashes a whole new level of autonomy and empowerment for developers who, often coming from a more restrictive ops-controlled environment on prem, introduce all sorts of new programming languages, compute platforms, databases, and other technologies. The move to microservices and cloud is nothing short of a Cambrian Explosion for just about every organization that attempts it. I have never seen this not play out to some degree, and it tends to be highly disruptive. Some groups handle it well; others do not. Usually, however, this brings an organization’s delivery to a grinding halt as they try to get a handle on the situation. In some cases, I’ve seen it take a year or more for a company to actually start delivering products in the cloud after declaring they are “all in” on it. And that’s just the time it takes to start delivering, not to actually get products delivered.

How does this relate to SRE? In the book, Google says a result of moving towards microservices is that both the number of requests for SRE support and the cardinality of services to support have increased dramatically. Because each service has a base fixed operational cost, even simple services demand more staffing. Additionally, microservices almost always imply an expectation of lower lead time for deployment. This is invariably one of the reasons we see organizations adopting them in the first place. This reduced lead time was not possible with the Production Readiness Review model they describe earlier in chapter 32 because it had a lead time of months. For many of the organizations we work with, a lead time of months to deliver new products and capabilities to their customers is simply not viable. It would be like rewinding the clock to when they were still operating on prem and completely defeat the purpose of microservices and cloud.
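To make the fixed-cost point concrete, here is a rough back-of-the-envelope sketch. The per-service hours and the per-SRE capacity are made-up assumptions chosen for illustration, not figures from the book, and the function name is hypothetical.

```python
import math

# Hypothetical figures, chosen only to illustrate the shape of the problem.
BASE_HOURS_PER_SERVICE_PER_WEEK = 2.0  # assumed fixed operational cost of any service
SRE_OPS_HOURS_PER_WEEK = 20.0          # assumed hours one SRE can give to operational work


def sres_needed(service_count: int) -> int:
    """Return the SRE headcount required if every service gets direct support."""
    total_hours = service_count * BASE_HOURS_PER_SERVICE_PER_WEEK
    return math.ceil(total_hours / SRE_OPS_HOURS_PER_WEEK)


if __name__ == "__main__":
    for count in (1, 10, 100, 500):
        print(f"{count:>4} services -> {sres_needed(count)} SREs")
```

Under these assumptions, headcount grows linearly with the number of services no matter how simple each service is, which is the staffing problem the book describes.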

But here’s the key excerpt from the book: “Hiring experienced, qualified SREs is difficult and costly. Despite enormous effort from the recruiting organization, there are never enough SREs to support all the services that need their expertise.” The authors conclude, “the SRE organization is responsible for serving the needs of the large and growing number of development teams that do not already enjoy direct SRE support. This mandate calls for extending the SRE support model far beyond the original concept and engagement model.”

Even Google, who has infinite money and an endless recruiting pipeline, says the SRE model—as it is often described by the people we encounter referencing the book—does not scale with microservices. Instead, they go on to describe a more tractable, framework-oriented model to address this through things like codified best practices, reusable solutions, standardization of tools and patterns, and, more generally, what I describe as the “productization” of infrastructure and operations.

Google enforces standards and opinions around things like programming languages, instrumentation and metrics, logging, and control systems surrounding traffic and load management. The alternative to this is the Cambrian Explosion I described earlier. The authors enumerate the benefits of this approach such as significantly lower operational overhead, universal support by design, faster and lower overhead SRE engagements, and a new engagement model based on shared responsibility rather than either full SRE support or no SRE support. As the authors put it, “This model represents a significant departure from the way service management was originally conceived in two major ways: it entails a new relationship model for the interaction between SRE and development teams, and a new staffing model for SRE-supported service management.”
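As a loose illustration of what “universal support by design” can look like in code, here is a minimal sketch of a shared framework wrapper that gives every handler the same logging and metrics by default. This is not Google’s framework; the decorator name, the in-memory counter, and the example service are all hypothetical.

```python
import functools
import logging
import time
from collections import Counter

logger = logging.getLogger("platform")

# In-memory stand-in for a real metrics backend (hypothetical).
REQUEST_COUNTS = Counter()


def standard_handler(service: str):
    """Wrap a request handler with the platform's default logging and metrics."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            outcome = "error"  # overwritten only if the handler returns normally
            try:
                result = func(*args, **kwargs)
                outcome = "ok"
                return result
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                REQUEST_COUNTS[(service, func.__name__, outcome)] += 1
                logger.info(
                    "service=%s handler=%s outcome=%s duration_ms=%.1f",
                    service, func.__name__, outcome, elapsed_ms,
                )
        return wrapper
    return decorator


# A development team writes only the business logic; the standards come for free.
@standard_handler("checkout")
def create_order(item_id: str) -> dict:
    return {"item_id": item_id, "status": "created"}
```

The point of the sketch is the division of labor: the team that owns the wrapper maintains it once, and every service that adopts it is instrumented the same way without a dedicated SRE.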

For some reason, this little detail gets lost and, consequently, we see groups attempting to throw people at the problem, such as embedding an SRE on each team. In practice, this usually means two things: 1) hiring a whole bunch of SREs, which even Google admits is difficult and costly, and 2) the embedded person typically just becoming the “whipping boy” for the team. More often than not, this individual is some poor ops person who gets labeled “SRE.”

With microservices, which again almost always hit you with a near-exponential growth rate once you adopt them, you simply cannot expect to have a handful of individuals who are tasked with understanding the entirety of a microservice-based platform and be responsible for it. SRE does not mean developers get to just go back to thinking about code and features. Microservices necessitate developers having skin in the game, and even Google has talked about the challenges of scaling a traditional SRE model and why a different tack is needed.

“The constant growth in the number of services at Google means that most of these services can neither warrant SRE engagement nor be maintained by SREs. Regardless, services that don’t receive full SRE support can be built to use production features that are developed and maintained by SREs. This practice effectively breaks the SRE staffing barrier. Enabling SRE-supported production standards and tools for all teams improves the overall service quality across Google.”

My advice is to stop thinking about SRE as an implementation specifically and instead think about the problems it’s solving a bit more abstractly. It’s unlikely your organization has Google-level resources, so you need to consider the constraints. You need to think about the roles and responsibilities of developers as well as your ops folks. They will change significantly with microservices and cloud out of necessity. You’ll need to think about how to scale DevOps within your organization and, as part of that, what “DevOps” actually means to your organization. In fact, many groups are probably better off simply removing “SRE” and “DevOps” from their vocabulary altogether because they often end up being distracting buzzwords. For most mid-to-large-sized companies, some sort of framework- and platform-oriented model is usually needed, similar to what Google describes.

I’ve seen it over and over. This hits companies like a ton of bricks. It requires looking at some hard org problems and a lot of self-reflection that many companies find uncomfortable or just difficult to do. But it has to be done. It’s also an important piece of context when applying the SRE book. Don’t skip over chapter 32. It might just be the most important part of the book.

Real Kinetic helps clients build great engineering organizations. Learn more about working with us.

13 Replies to “SRE Doesn’t Scale”

Matthew Thompson says:

October 6, 2021 at 11:51 pm

“My advice is to stop thinking about SRE as an implementation specifically and instead think about the problems it’s solving a bit more abstractly.”

I think this is great advice. One of the overriding themes for me in Google’s SRE documentation is that this is to manage infrastructure.
Assuming most organisations are moving to a cloud-based model, the problem Google are solving with SRE becomes different for the readers.
The intersection with Service Management becomes really important. I think that the first question should be: what Service are you looking to make more reliable, and what are the problems with it?
Makes me think the S should stand for Service not Site :-)

Quick note on my view on helping to solve the scaling problem.

Supporting and managing Production toil is the main part of Service Reliability Engineering (gotta try ;-)) and you rotate team members through that (e.g. week on call, week doing morning checks). Those not on production toil rotate through supporting product/development groups and teams, taking responsibility for support of other environments and release engineering as well as the Production backlog. Align to the sprints of other teams for the best shot at harmony and consistent planning of work.

1. Shashi Prashanth says:
  
  October 7, 2021 at 2:38 pm
  
  This is the model that I am following and it works great.
  
  1. Matthew Thompson says:
    
    October 14, 2021 at 3:12 am
    
    That’s great to read Shashi – when you say it works great do you find it’s good practice and provides a good steady state or that it’s scaling? (or both!! ;-))
    
    My experience was going from a team being assigned work as it came in to aligning team members to other teams and they manage the work requests directly.
    
2. Ash P. says:
  
  October 11, 2021 at 7:11 am
  
  I totally echo Matthew’s sentiments. That it’s important to consider the problems rather than merely running the solution e.g. decrease pager fatigue rather than run a pager service. Or maybe I’m reading this too simple-like :)
  
  1. Matthew Thompson says:
    
    October 14, 2021 at 3:01 am
    
    I don’t think you are Ash :)
    
    In line with that, if you look at the problems, SRE will scale.
    
Prasanna kumar says:

October 7, 2021 at 6:50 am

Great insights. I have witnessed the scenarios described literally word for word. The assumption that an SRE can run all the tools to fix all the problems, while in reality he or she is an Ops, QE, or Infra person labeled as SRE who has no clue, is a slap in the face.
While the pattern of microservices is common and SREs can help drive a faster time to market with automation, maintaining that level of skills might not be possible in every organization. Until there is a proper balance of skills, I totally agree that SRE doesn’t scale.

Shashi Prashanth says:

October 7, 2021 at 2:47 pm

I do agree and am experiencing and seeing this in reality. Various score cards, cost factors, reusability, standardization, checklists, picking and choosing services to support will be helpful to avoid the need for more folks.

SNiazi says:

October 7, 2021 at 4:02 pm

You are right on the mark. SRE, DevOps, and Observability have become buzzwords these days and everyone is latching on to them, especially software vendors.

I manage internet-facing systems running in a cloud/microservices environment with millions of users, and it’s literally becoming a maze of complexity to manage the growing number of microservice stacks running in different environments.

Hiring a team of SREs without a process and structure in place to effectively manage bugs, incidents, and issues, push them into the dev sprints, and get them thoroughly tested and vetted with testers really requires a set of skills, temperament, and culture.

Without this process, structure, culture, and budget to support this backend (regardless of whether you are a startup or a big enterprise), merely standing up a team of SREs is like putting the cart before the horse, or hammering a square peg into a round hole and hoping things will work flawlessly.

1. Matthew Thompson says:
  
  October 14, 2021 at 3:27 am
  
  I think those are insightful comments, SNiazi.
  
  I think the challenge you are highlighting is the skills and capability, particularly the leadership/management of SRE teams. As a side note, this role could be very difficult if the SRE is just a rebranded job title of an Ops Analyst or Engineer. You have to have a clear view of the benefits of SRE throughout the team.
  
  For the leadership, there’s a reporting structure that’s needed to facilitate good management of Service Reliability Engineering. Tracking trends in SLIs and incidents comes to mind. It aligns with more traditional ITIL functions.
  
  It has a real impact across a technology function. For example, if SLIs are decreasing, find out why. Is the ratio of technical debt to engineers on toil wrong? Similar to tracking incident root causes (problem management), how does this information flow up to senior management so they can be aware that some products are not reliable?
  