Software Operational Fitness

Glossary

SLA - A Service Level Agreement, a ratified agreement on how a service will perform.

Introduction

Contrary to what most people think, software isn't a one-and-done unit of work. New features and data about customer usage profiles are always in flux. In a previous article we discussed Failure Containers, a way to have software components behave gracefully in case of failures. In this article, we will discuss how to establish a framework for ensuring healthy ops.

The Situation

We are going to assume the team is using some scanning products for security. We are also assuming there is some ticketing system that customers are using for support.

As I look back on my time at Amazon, I remember fondly the concept of "working backwards". If we apply this principle to our software fitness, we can create a list of criteria that make software successful and healthy. For example (including but not limited to):

Security Scanner Results - How many security issues do we have in the System?
Tickets - How many Tickets do we have, and how many of them are over SLA?
Cost Analysis - Have there been any anomalies in our cost breakdown in the past few days?
Metric/Alarms Analysis - Are there any metrics/alarms that need to be addressed?
Incidents - Were there any Incidents?
Do we have any Outstanding Action Items that need to be addressed?
What do our JMX metrics look like? Do we need to scale?
What does peak volume look like?

These are a few ways to ensure you can measure what's going on in your system. Feel free to mix and match.
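To make a list like this measurable, each criterion needs a number and a threshold. Here is a minimal sketch in Python of how that could look. The criterion names, thresholds, and ticket ages are all made up for illustration, and the "higher is worse" rule is an assumption that fits counts like tickets and findings.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


@dataclass(frozen=True)
class Criterion:
    """One row of the fitness list. Assumes higher values are worse."""
    name: str
    value: float    # this week's measurement
    warn_at: float  # at or above this, the criterion turns yellow
    fail_at: float  # at or above this, the criterion turns red

    def status(self) -> Status:
        if self.value >= self.fail_at:
            return Status.RED
        if self.value >= self.warn_at:
            return Status.YELLOW
        return Status.GREEN


def tickets_over_sla(ages_in_days, sla_days):
    """Count open tickets that are older than the SLA allows."""
    return sum(1 for age in ages_in_days if age > sla_days)


# Hypothetical numbers for one week.
criteria = [
    Criterion("Open security findings", value=3, warn_at=1, fail_at=5),
    Criterion(
        "Tickets over SLA",
        value=tickets_over_sla([1, 4, 9, 12], sla_days=7),
        warn_at=1,
        fail_at=3,
    ),
    Criterion("Incidents this week", value=0, warn_at=1, fail_at=2),
]

for criterion in criteria:
    print(f"{criterion.name}: {criterion.status().value}")
```

Your own thresholds will differ; the point is that every criterion resolves to a clear pass, warn, or fail.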

Stoplight Dashboards Are Scalable Mediums for Data

If we draw from the Pyramid Principle - put critical information first, then details of that info later - we can design some messaging that will be explicit at first glance.

Constructing a Stoplight Dashboard can help us with this. It is a central page that lists our criteria, the relevant metric(s), and whether we are passing or failing those metric(s). I recommend something tabular, because you can see what the previous weeks looked like and whether your metrics are trending downward. Being able to see the trend of each metric is critical: you can dig into anomalies very quickly together as a team.
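As a rough illustration of the tabular idea, the sketch below keeps a few weeks of history per criterion and adds a trend column. The week labels and numbers are invented, and the trend rule (compare the latest week to the one before, lower is better) is just one simple assumption you could swap out.

```python
def trend(history):
    """Compare the latest week with the one before it (lower is better)."""
    if len(history) < 2:
        return "n/a"
    previous, latest = history[-2], history[-1]
    if latest < previous:
        return "improving"
    if latest > previous:
        return "worsening"
    return "flat"


def render(rows, weeks):
    """Render criteria as a plain-text table, one column per week."""
    header = ["Criterion", *weeks, "Trend"]
    body = [
        [name, *map(str, history), trend(history)]
        for name, history in rows.items()
    ]
    table = [header, *body]
    # Pad every column to the width of its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    )


# Hypothetical four weeks of data; each history has one value per week.
weeks = ["W1", "W2", "W3", "W4"]
rows = {
    "Open security findings": [12, 9, 9, 4],
    "Tickets over SLA": [2, 3, 5, 6],
}
print(render(rows, weeks))
```

The same shape works in a spreadsheet: one row per criterion, one column per week, and a final column that tells you which direction things are heading.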

Use this Stoplight Dashboard as a forcing function so that your teammates and manager regularly see risks and the metric profile of the application. You will be able to onboard engineers faster, be more thoughtful in code reviews, and give visibility to directors and others above them.

In my past, any anomaly in the metrics was subject to further investigation. Because we took extreme ownership of our services, we were able to nip issues in the bud.

When you’re working with a globally deployed product, these stoplight dashboards can help the team manage multiple deployments and decide when to investigate an issue in a particular geography.
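For multiple deployments, one possible approach is to roll each region up to its worst status and list the regions that need a look. This is only a sketch: the region names and check names below are hypothetical, and "worst status wins" is an assumption about how you would want to summarize.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def worst_status(statuses):
    """Return the most severe status, or green when there are no checks."""
    return max(statuses, key=SEVERITY.__getitem__, default="green")


def regions_to_investigate(dashboard):
    """List regions whose worst check is not green, most severe first."""
    rollup = {
        region: worst_status(checks.values())
        for region, checks in dashboard.items()
    }
    flagged = [region for region, status in rollup.items() if status != "green"]
    return sorted(flagged, key=lambda region: -SEVERITY[rollup[region]])


# Hypothetical per-region results for two checks.
dashboard = {
    "us-east": {"tickets": "green", "alarms": "green"},
    "eu-west": {"tickets": "yellow", "alarms": "green"},
    "ap-south": {"tickets": "green", "alarms": "red"},
}
print(regions_to_investigate(dashboard))
```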

Build Forcing Functions with Meetings

Reviewing the dashboard in a regularly scheduled meeting gives the team visibility into what is going on in the system, and lets you manage the operation of your supported products. This creates a baseline of information your team can use to make decisions about work priority and changes in architecture.

Incident Retrospectives Can Help Predict the Future

The meetings are the right place to discuss Incidents that have occurred. They allow you to raise the bar on incident identification, remediation, and analysis with your team. If you are looking to make an impact as a rising star engineer, this is an excellent place to begin influencing the team.

I have had great luck with a retrospective for any customer-impacting incident consisting of:

Timeline of the Event
Full Root Cause Analysis (dig very deep)
Immediate Remediation Steps
Proposed Prevention
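If you want to keep retrospectives consistent, the four sections above can be captured as a small record that reports what is still missing. The sketch below is one way to do that; the field names mirror the list, and the example incident is entirely made up.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Retrospective:
    """A retrospective record with the four sections listed above."""
    title: str
    timeline: list = field(default_factory=list)
    root_cause: str = ""
    immediate_remediation: list = field(default_factory=list)
    proposed_prevention: list = field(default_factory=list)

    def missing_sections(self):
        """Names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical incident with only the timeline filled in so far.
retro = Retrospective(
    title="Checkout latency spike",
    timeline=["10:02 alarm fired", "10:15 rollback started"],
)
print(retro.missing_sections())
```

A retrospective is ready for review once `missing_sections()` comes back empty.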

I watched other teams that did not follow this methodology in their ops suffer from the same issue multiple times. This contributed to engineer fatigue and customer impact.

Buy-In May Be Difficult, but Easier With Data

If you’re having trouble getting buy-in from your team or manager, try making your own stoplight dashboard for some problematic feature. Use data from logging, time-series metrics, and the number of incidents/tickets you see for that feature. Running through the exercise of the root cause analysis may demonstrate to your team that this “retrospective” also acts as proactive prevention for other similar incidents.
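To pick which feature to start with, a simple tally of incidents and tickets per feature is often enough. Here is a small sketch, assuming you can export records with a feature label; the record shape and feature names are hypothetical.

```python
from collections import Counter


def rank_features(records):
    """Count incidents and tickets per feature, most affected first."""
    counts = Counter(record["feature"] for record in records)
    return counts.most_common()


# Hypothetical export from a ticketing or incident system.
records = [
    {"kind": "ticket", "feature": "export"},
    {"kind": "incident", "feature": "export"},
    {"kind": "ticket", "feature": "login"},
    {"kind": "ticket", "feature": "export"},
]
print(rank_features(records))
```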

Stoplight Dashboards can be built with Excel, or with dashboarding software like Datadog or Grafana. Feel free to start with something simple to show your team how this Ops ritual would work.

In my past, I was able to bring a Stoplight dashboard into a weekly review for my team. Over the course of three to six months, we reduced security risks by 95% and reduced the number of incoming tickets by 60%.

The best part was that my team saw the value and then consistently raised the bar for root cause analysis and system architecture.

Conclusion

A regularly scheduled ops meeting with an explicit rubric will take you a few hours per week, but the benefits are invaluable. In my experience, these sorts of meetings and structured reviews contributed to preventative fixes, ownership, and even multiple kudos to my team for leading the org in onboarding time and team efficacy.