Operational Maturity

Note: This is a talk that was given at PagerDuty Summit on September 12th, 2018.

The first thing we must do when trying to measure our Operational Maturity is define Operational Maturity. Then we have to measure it. This seems trite and silly to say, but every organization has its own character, and therefore its own definition. For this talk we will refer to Operational Maturity as: a team's awareness of its impact on people, customers and company.

So how do we define the metrics? For my team, it's Service Health. Every member of my team of Resilience Engineers needs to know the Health of every service. But it's different for me - I need to know the overall System Health (as does the rest of Management). But that's very tactical for me as a Manager. As a team Manager I need to know the Team Health. Not only that, but we can extrapolate up - my Boss needs to know the health of each individual team, true, but they also need to know the overall Organizational Health.

Well - there are some tools out there that we can use to get this information. Sort of. And this has been how we kinda sorta try to figure it out. How many JIRA tickets of a specific type came in yesterday, or during this last oncall sprint? Can we look at the number of PagerDuty Incidents? What about the Slack Channels where people kick off interrupts? How about the Nagios Alerts that drive PagerDuty and interrupt normal day-to-day flow? What do the External Monitors and Metrics show? It's frankly a mess and honestly, as much as we want to know this information, we don't actually get a good measure. Especially as it's not really trackable or comparable across teams.

Let's have a look at what we intend to drive with this data/information. Well, at Castlight, it's in the name of my department. I run Resilience Engineering. To some, it looks like traditional SRE (and kinda is). To others, it's pure DevOps (and that's a whole talk in itself). To us it's a combination of SRE Methods and DevOps Cultural Practices, coupled with a mandate for education. We pride ourselves in the software industry on our reliance on accountability. And trust me, I love Accountability - it means I know who to page! But Resilience isn't about accountability - who to point at to fix something that went wrong. It's about Responsibility. If you are responsible for your code/service, then you want to proactively nurture it and collaborate with teams before things go wrong, or at the instant that things seem to be going wrong. And this means that people don't get paged. No - I'm not trying to put PagerDuty out of business - I need PagerDuty right now to page when things get broken, but there's more to what we can get out of it. Because building Resiliency isn't just in the code, it's in the team. As much as I care how functional the code is/will be in production, I care how functional the team can be on a day-to-day basis.

Lots of people have tried to drive this cultural and behavioral change. There are plenty of movements out there - we've definitely heard "Make your Devs oncall". And true - that needs to be a part of it, but it's reductionist. Just because Devs are oncall doesn't mean your practice is better. In fact, you may drive more burnout. "Prevent Oncall Burnout" - and yes I do have the PagerDuty sticker, but this is treating a symptom - why do we get Oncall Burnout? Because we overload specific people, and people approach problems in varying (and in some cases "interesting") ways. What we need is a set of metrics that promote responsible engineering. This treats the root cause. We want to look at things like the Mean Time To Respond AND the Mean Time To Resolve as a combinatorial metric. We need to look at the total cost of Response - in engineer time as well as SLA penalties. We want to see the time without Major Incident. We want to see the general feeling of the team - the VIBE if you like.
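To make those metrics a little more concrete, here is a minimal Python sketch of how they could be computed from a list of incident records. Everything in it is an assumption for illustration: the `Incident` fields, the function names, the flat hourly rate and the sample data are all hypothetical, and this is not the PagerDuty data model or the way we calculate things at Castlight. Team VIBE is left out, because that needs a survey rather than a timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape, not a real PagerDuty export schema.
    triggered: datetime
    acknowledged: datetime
    resolved: datetime
    responders: int            # engineers pulled into the response
    major: bool = False
    sla_penalty: float = 0.0   # penalty paid out for this incident


def mean_time_to_respond(incidents):
    """Average gap between an incident triggering and someone acking it."""
    total = sum((i.acknowledged - i.triggered for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_resolve(incidents):
    """Average gap between an incident triggering and being resolved."""
    total = sum((i.resolved - i.triggered for i in incidents), timedelta())
    return total / len(incidents)


def total_cost_of_response(incidents, hourly_rate):
    """Engineer time (at an assumed flat hourly rate) plus SLA penalties."""
    engineer_hours = sum(
        i.responders * (i.resolved - i.triggered).total_seconds() / 3600
        for i in incidents
    )
    return engineer_hours * hourly_rate + sum(i.sla_penalty for i in incidents)


def time_without_major_incident(incidents, now):
    """Time since the last major incident was resolved, or None if none."""
    majors = [i.resolved for i in incidents if i.major]
    return now - max(majors) if majors else None


# Made-up sample data: one major overnight page and one minor daytime one.
incidents = [
    Incident(datetime(2018, 9, 1, 2, 0), datetime(2018, 9, 1, 2, 10),
             datetime(2018, 9, 1, 3, 0), responders=2, major=True,
             sla_penalty=500.0),
    Incident(datetime(2018, 9, 5, 14, 0), datetime(2018, 9, 5, 14, 2),
             datetime(2018, 9, 5, 14, 30), responders=1),
]

# Respond and Resolve are reported together, as a pair, per the text above.
respond_and_resolve = (mean_time_to_respond(incidents),
                       mean_time_to_resolve(incidents))
cost = total_cost_of_response(incidents, hourly_rate=100.0)
quiet_time = time_without_major_incident(incidents, datetime(2018, 9, 12, 3, 0))
```

Keeping Respond and Resolve side by side is the point: a fast ack with a slow resolve is the "ack and snooze" pattern, and neither number shows that on its own.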

What we can do with this in the real world is Promote Responsible engineering from the Top Down, by rewarding those who actually solve issues (and don't just ack and snooze!), AND from the Bottom Up, because we can allow time off for those who need it (this last week was a doozy for someone - they need R&R).

I refer to the analysis of these sorts of metrics as Metalytics. We know that when you monitor and measure the health of your team it effectively gives insight into overall system health. It's kinda akin to finding a planet by looking at the star's wobble. Not only do we now have insight into the overall system health, as well as the team health, but by managing the team health we are better placed to keep the system whole. A Healthy team is one that can respond when needed in a highly functional and productive manner.

And for the business (who ultimately has to understand why we want to spend time looking at things that aren't actually planets, but stars), it means we save money. Actual Cash Dollars. We have more productive engineers, because they aren't stressed by oncall rotations that keep them up at night. There are fewer sick days, and lower attrition because we have a happier workforce. That means we don't spend lots of money finding the best firefighter (or more likely the least bad firefighter that we can afford) but rather, we retain great firefighters who get to work on decent projects. If we measure it we can manage it (Measurement is nothing without Management, and as the line usually attributed to Peter Drucker goes, "you can't manage what you don't measure"). So we can decrease our MTTR and our Total Cost of Response, and increase our Time without Major Incident. Doing that decreases our SLA costs.

On a final note, that actually means that we can increase potential Revenue. The Quality of the system is also the Quality of the team. And when you buy SaaS, you aren't just buying the code, but the support and the response. Someone with 5 9s is going to be able to maximize the returns over someone with 3 9s for the same quality/type of service.
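For a sense of scale on the difference between 3 9s and 5 9s, the arithmetic can be sketched in a few lines of Python. The function name is hypothetical, and the figures assume a 365-day year with no allowance for planned maintenance windows. Under those assumptions, 3 9s (99.9%) allows roughly 525.6 minutes of downtime a year, which is about 8.76 hours, while 5 9s (99.999%) allows roughly 5.26 minutes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines):
    """Yearly downtime budget for an availability of `nines` nines.

    For example, nines=3 means 99.9% availability, so 0.1% of the
    year may be spent down.
    """
    unavailable_fraction = 10 ** -nines
    return MINUTES_PER_YEAR * unavailable_fraction


three_nines = allowed_downtime_minutes(3)
five_nines = allowed_downtime_minutes(5)
```

That is a hundred-fold difference in the downtime budget, and it is the team's ability to respond, not just the code, that decides which side of it you land on.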