SkillRary

Connectez-vous pour commenter

Difference Between SLAs, SLOs and SLIs

  • Amruta Bhaskar
  • Jun 22, 2021
  • 0 commentaires
  • 2331 Vues

The SLA, SLO, and SLI are related concepts though they're different concepts.

  • SLA or Service Level Agreement is a contract that the service provider promises customers on service availability, performance, etc.
  • SLO or Service Level Objective is a goal that service provider wants to reach.
  • SLI or Service Level Indicator is a measurement the service provider uses for the goal.

So here is the relationship. The service provider needs to collect metrics based on SLI, define thresholds of metrics based on SLO, and monitor the thresholds of metrics so that it won't break SLA. In practice, the SLIs are the metrics in the monitoring system; the SLOs are alerting rules, and the SLAs are the numbers of the monitoring metrics applying to the SLOs.

Usually, the SLO and the SLA are similar while the SLO is tighter than the SLA. The SLOs are generally used for internal-only, and the SLAs are for external. If a service availability violates the SLO, operations need to react quickly to avoid it breaking SLA, otherwise, the company might need to refund some money to customers.

The SLA, SLO, and SLI are based on such assumption that is the service will not be available 100%. Instead, we guarantee that the system will be available greater than a certain number, for example, 99.5%.

Identifying SLIs that Matter to the User

The team knows it needs to examine availability and set SLOs for it, so it begins looking at the user journey. The QA team has already done some documentation, so the team refers to the user journeys outlined there and augments this documentation with their journeys.

The team identifies critical points that receive the brunt of complaints. Team members also look into black box monitoring, a tactic that helps identify issues from a user’s perspective. With black box monitoring, the team acts as an external user of the service with no access to the internal monitoring tools. This allows team members to concentrate on a few metrics that directly correlate with user happiness.

After looking at their user’s journey, the team determines that the individual load pages of each tab on the expenses feature don’t load slowly individually, but when someone needs to skim through 2 or more pages, it becomes tedious. So the team also decides to create an SLO for response time at the load balancers as well.

Establishing Corresponding SLOs

After the team determines its SLIs, it’s time to set up the SLOs. The team is looking at the availability of the site (a common complaint), as well as the latency issue on the expense page. While the team plans to add more SLOs later, these two will serve as the guinea pigs.

For the latency issue, the team sets an SLO for all pages to load in under 1 second. This faster load time means that users won’t be irritated scrolling through multiple pages. The team then moves on to the availability of SLO.

Based on traffic levels, customer usage, NPS scores, the team has determined that its customers are likely to be happy with 99.5% availability. On the other hand, data from previous months suggests customer satisfaction and usage doesn’t seem to increase when uptime is greater than 99.9%. This means that there’s no reason to optimize at this point for higher than a 99.5% uptime metric.

  • With SLOs in place, the team will need to work on what to do if these targets are missed by creating an error budget policy. This policy will detail:
  • The acceptable level of failure in the system over a given period (the error budget)
  • Alerting and on-call procedures for the service
  • Escalation policies in the event of error budget depletion

An agreement to halt feature development and focus on the reliability after a certain amount of time where the error budget is exceeded.

Once everyone agrees, the SLOs are launched. The team watches carefully and reiterates at the monthly error budget meeting. After a few months, the team feels confident enough to add more SLOs.

Agreeing on SLAs

SLAs are an external metric, therefore not goaled the same way as SLOs. SLAs are a business agreement with users that dictates a certain level of usability. The engineering team is aware of SLAs but doesn’t set them. Instead, the team sets SLOs more stringently than the SLAs, giving themselves a buffer.

For example, the team’s 99.5% availability SLO means the service can only be down 3.65 hours per month. However, the SLA that the organization signs with users specify that it must maintain a 99% availability. This means the service can be down 7.31 hours per month. The team has a buffer of 3.66 hours per month. Now, the team can work on new features with guardrails for reliability. The organization will benefit from happier users and the team has the confidence to innovate while remaining reliable.

When used together, SLI, SLO, and SLAs are powerful tools that allow you to provide the best for your users. While it can be tricky to get these metrics right, a culture of revision, iteration, and blamelessness will help you achieve your reliability goals.

Connectez-vous pour commenter

( 0 ) commentaires