🟢 The green line indicates the proportion of total non-error requests till that corresponding time point.
🔴 The red line is a static line at 99% health.
Regular Alerting: There would be 17 alarms for every time the Service’s error_rate was > 1%. And this would result in: - Alert fatigue is real bro! - Constant distraction towards Maintenance and no time for features.
Service Level Objective: The only alert that mattered was when the Service Level Objective dropped below the desired 99% because they are not sensitive towards point-in-time fluctuations.
How to set Service Level Objectives?
At first, the thought of not paying attention to every Error seems counterintuitive, but as the business matures and the scale increases, one agrees that failures are inevitable.
How do you define the “acceptable” behavior of this Service? And the variables are:
🎯 How many requests should be allowed to fail, or Compliance Target.
⏱️ Say you are ok if 1% of the requests fail. 1% over what Compliance Period? A year, a week, a month, an hour ......?
How to choose the correct compliance?
⏱️ Compliance Period
Usually 1x or 2x of your engineering sprint cycle; this ensures that errors from one sprint do not break the budget of the next sprint. However, shorter periods also tend to be more sensitive to failure and may create more noise.
🎯 Compliance Target
Make sure to choose a limit that serves you well without multiple breakages. For example, if you desire to set 99.9%, it’s wise to start with 99% and see if it served you well.