#88: SLI, SLO and SLA: a number, a threshold and a legal document respectively
Many people, when asked about their SLA, simply shout "99%". The correct answer to that question is probably a long, boring PDF written by lawyers. Yes, an SLA is a legal obligation, not a metric or a number. You probably meant an SLI or an SLO.
So, a Service Level Indicator, SLI for short, is simply a metric: a number that can be objectively measured within your system and that describes its health. Typical SLIs are response time, uptime and error rate. SLIs are what you put on those shiny dashboards on the office monitors. Response time, typically measured in milliseconds, tells you how fast your system responds. Uptime tells you what fraction of a given period the system was operational. Error rate tells you what fraction of requests ended with a failure.
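As a rough illustration, the three SLIs mentioned above could be computed like this. It is only a minimal sketch: the `Request` record, the function names and the idea of deriving uptime from periodic health-check probes are assumptions made for this example, not something any particular monitoring tool prescribes.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    duration_ms: float
    ok: bool


def mean_response_time_ms(requests):
    """Average response time in milliseconds (expects a non-empty list)."""
    return sum(r.duration_ms for r in requests) / len(requests)


def error_rate(requests):
    """Fraction of requests that ended with a failure."""
    return sum(1 for r in requests if not r.ok) / len(requests)


def uptime(probes):
    """Fraction of health-check probes that succeeded (an assumed definition)."""
    return sum(1 for up in probes if up) / len(probes)


# Made-up sample data.
requests = [Request(50, True), Request(150, True), Request(100, False), Request(100, True)]
print(mean_response_time_ms(requests))    # 100.0
print(error_rate(requests))               # 0.25
print(uptime([True, True, True, False]))  # 0.75
```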
All these metrics, and many more, change over time. They are dynamic. Of course, you can think of many other SLIs, more relevant to your actual business. The point is, SLIs show how your system is really doing from a bird's-eye view.
When you are providing a service to a customer, the customer may be interested in your SLIs. Before they make a purchase or during normal operations, your indicators matter to their business. For example, if your uptime is poor or response times skyrocket, their SLIs may be impacted as well. To avoid disappointments, the customer may require you to keep certain SLIs high or low. They may ask you to keep uptime above 99.99%. Or to keep the average response time below 100 milliseconds. This is understandable.
Such a threshold is known as an SLO: Service Level Objective. Your engineering objective is to keep your SLIs within the agreed SLOs. 99.99% is one objective. 100 milliseconds is another.
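To make the relationship concrete, here is a small sketch of checking a measured SLI against an SLO. The `Slo` class and its field names are hypothetical. Treating a value exactly on the threshold as passing is also an assumption; a real agreement would have to spell that out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO: a threshold plus the direction that counts as good."""
    name: str
    threshold: float
    higher_is_better: bool

    def is_met(self, sli_value):
        # Assumption: a value exactly on the threshold still meets the SLO.
        if self.higher_is_better:
            return sli_value >= self.threshold
        return sli_value <= self.threshold


# The two example objectives from the text.
uptime_slo = Slo("uptime_percent", 99.99, higher_is_better=True)
latency_slo = Slo("avg_response_time_ms", 100.0, higher_is_better=False)

print(uptime_slo.is_met(99.995))  # True
print(latency_slo.is_met(140.0))  # False
```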
OK, but an SLO should not be just a gentlemen's agreement. It should be backed by a legal obligation. If you fail to meet those thresholds, legal consequences may apply. Typically you must make a partial refund. The contract that spells this out is called a Service Level Agreement, SLA for short.
An SLA should be quite precise. For example, how is response time measured? Does it include the network round-trip? From which location? Also, is it the average, the median or the 99th percentile? Finally, SLA documents typically define multiple SLOs. The bigger the violation, the higher the refund you can expect.
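Why the choice of statistic matters can be seen in a short sketch. The data is made up, `percentile` is a simple nearest-rank implementation written for this example, and the refund tiers are purely hypothetical numbers, not taken from any real SLA.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# 98 fast requests and 2 very slow ones (made-up data).
durations_ms = [50.0] * 98 + [2000.0] * 2

print(statistics.mean(durations_ms))    # 89.0
print(statistics.median(durations_ms))  # 50.0
print(percentile(durations_ms, 99))     # 2000.0

# Hypothetical tiers: (minimum monthly uptime in percent, refund in percent).
REFUND_TIERS = [(99.99, 0), (99.9, 10), (99.0, 25)]


def refund_percent(uptime_percent):
    """Return the refund for the first tier the measured uptime still reaches."""
    for min_uptime, refund in REFUND_TIERS:
        if uptime_percent >= min_uptime:
            return refund
    return 50  # below the lowest tier


print(refund_percent(99.95))  # 10
```

With this data, a "below 100 milliseconds" objective is met on the average and the median, yet badly missed at the 99th percentile, which is exactly why the agreement has to name the statistic.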
Defining uptime is even more complex. If a critical feature is broken due to a bug, can you still consider your application up? On the other hand, if your website is down for maintenance every single day for an hour, does that count as an outage? I hope it doesn't. Otherwise, the website of Polish Railway would have an uptime of less than 97%.
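The arithmetic behind such uptime figures is easy to check. The sketch below uses hypothetical helper names and assumes a 30-day month. It shows that one hour of downtime per day gives roughly 95.8% uptime, which is indeed below 97%, and that a 99.99% objective leaves only a few minutes of downtime per month.

```python
def uptime_percent(downtime_hours, period_hours):
    """Uptime as a percentage of the given period."""
    return 100 * (1 - downtime_hours / period_hours)


def allowed_downtime_minutes(slo_percent, period_days=30):
    """Downtime budget in minutes implied by an uptime SLO (30-day month assumed)."""
    return (1 - slo_percent / 100) * period_days * 24 * 60


# One hour of maintenance every day.
print(round(uptime_percent(1, 24), 2))            # 95.83

# Budget left by a 99.99% uptime objective over 30 days.
print(round(allowed_downtime_minutes(99.99), 2))  # 4.32
```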
SLAs are also important internally, within an organization. At Google, there used to be a distributed locking service called Chubby. It was so reliable that everyone basically assumed it had a 100% SLA. So, when Chubby had an outage, a lot of other services failed as well. The solution was surprising. If Chubby hadn't had an outage in the last quarter, they purposefully took it down, at random. Developers learned how to deal with outages, knowing that the SLO is real.
That’s it, thanks for listening, bye!