Your friendly neighborhood AppSec advisor and honeypot enthusiast. Formerly @ Goldman Sachs and Ernst & Young. Find his thoughts in code form committed to GitHub.
Working with so many great enterprise customers has its benefits. One of the biggest is that you are always learning something new. With a diverse set of organizations operating in different ways and solving different problems, learning something new is unavoidable.
In this blog post, I want to share a very simple and helpful availability metric I learned about while responding to a customer's request. This metric is called aggregate availability, and it comes straight out of Google's seminal book on the subject, Site Reliability Engineering. I imagine aggregate availability is not new to many readers, but if it is new to you, read more about it here.
It involves a very simple calculation. Aggregate availability is equal to successful requests divided by total requests. For Signal Sciences' customers, later in this post I provide a Python script to run this calculation across all your sites.
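As a rough illustration of the calculation itself, here is a minimal sketch. It is not the script referred to above; the function names `aggregate_availability` and `format_availability` are hypothetical, and the sketch assumes successful requests can be derived as total requests minus server errors. The figures passed in are taken from the sample output shown later in this post.

```python
from typing import Optional


def aggregate_availability(total_requests: int, server_errors: int) -> Optional[float]:
    """Return successful requests / total requests as a fraction.

    Returns None when there were no requests, since the ratio is
    undefined for a site with no traffic.
    """
    if total_requests <= 0:
        return None
    successful_requests = total_requests - server_errors
    return successful_requests / total_requests


def format_availability(value: Optional[float]) -> str:
    """Render the fraction as a percentage with two decimal places."""
    if value is None:
        return "No site activity."
    return f"{value:.2%}"


# Figures from the sample output in this post.
print(format_availability(aggregate_availability(1952594, 7670)))  # 99.61%
print(format_availability(aggregate_availability(1936173, 35)))    # 100.00%
print(format_availability(aggregate_availability(0, 0)))           # No site activity.
```

Note that the second site still had 35 server errors; its availability only reads 100.00% because of rounding to two decimal places.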
In the use case where I applied this, a successful request was defined as any request that resulted in an HTTP status code below 500. The 5XX series of status codes indicates that some type of server-side error occurred, so those requests count as unsuccessful.
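To make that definition concrete, here is a small sketch that totals the unsuccessful requests from a mapping of status code to request count. The mapping shape and the name `count_server_errors` are assumptions for illustration only, not the format any particular API returns.

```python
from typing import Mapping


def count_server_errors(status_counts: Mapping[int, int]) -> int:
    """Sum the request counts for every status code of 500 or above.

    Anything below 500 (including 4XX client errors) is treated as a
    successful request under the definition used in this post.
    """
    return sum(count for code, count in status_counts.items() if code >= 500)


# Hypothetical counts, for illustration only.
example_counts = {200: 9000, 302: 400, 404: 550, 500: 30, 503: 20}
print(count_server_errors(example_counts))  # 50
```

Under this definition a 404 still counts as a success, because the server handled the request correctly even though the client asked for something that does not exist.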
Conveniently, Signal Sciences maintains data on total requests processed by agents, as well as time-series data with counts for all status codes. This data is available via the API; the two endpoints used are Get Overview Report Data and Get Timeseries Request Info.
Note that the script can take one argument, which specifies the time period to run the calculation for. In the example below, -7d was provided to run the calculation against all requests from the last 7 days.
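If you want to validate a time-period argument like this before using it, one possible approach is sketched below. This is an assumption about how such an argument could be handled, not a description of what the script does. Only -7d appears in this post; the other unit letters (m, h, w) and the name `parse_relative_period` are hypothetical.

```python
import re
from datetime import timedelta

# Hypothetical unit letters; only "d" is confirmed by the -7d example.
_UNITS = {"m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_relative_period(arg: str) -> timedelta:
    """Convert a relative period such as '-7d' into a timedelta."""
    match = re.fullmatch(r"-(\d+)([mhdw])", arg.strip())
    if match is None:
        raise ValueError(f"Expected a period like '-7d', got {arg!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


print(parse_relative_period("-7d"))  # 7 days, 0:00:00
```

Rejecting malformed values up front gives a clearer error than letting a bad period surface later as an empty or confusing result.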
Signal Sciences Demo Site1:
Total Requests: 1952594 Server Errors: 7670 Aggregate Availability: 99.61%
Signal Sciences Demo Site2:
Total Requests: 1936173 Server Errors: 35 Aggregate Availability: 100.00%
Signal Sciences Demo Site3:
Total Requests: 0 Server Errors: 0 Aggregate Availability: No site activity.
Aggregate availability is definitely a useful metric to keep an eye on. A drop in availability would be a huge concern for many organizations. I hope you've learned something new in this blog post. I'll continue to share tips and new learnings like this one in the future.