
Systems Observability - What To Consider Before Writing Your First Line of Code

Posted on: April 7, 2019 at 11:30 AM (13 min read)

Observability is still a relatively novel concept to many organisations. While it’s easy to say that you want to do “observability”, it is in fact a state that your systems are in. Monitoring, logging, and tracing of systems are things that we should do in order to achieve observable systems.

This is the first post in a series centred around observability practices, and it covers the basic practices that help you achieve more observable systems. The next post in this series will cover observability as code, and how I’ve used Terraform to make setting up DataDog a breeze for new applications. I’ll also want to cover SLIs, and how they should be determined when making applications production-ready.

I am also paraphrasing a few things from Google’s Site Reliability Handbook and Distributed Systems Observability, both of which I have found incredibly useful in determining what good observability looks like.

When considering observability, it is useful to know what “healthy” looks like for your systems and how you can measure it.

In this post, I’ll aim to cover a few things we should be doing to get there:

- Defining SLIs, SLAs, and SLOs
- Good coding and testing practices
- What we should be monitoring
- What we should be logging
- What we should be alerting on


Defining SLIs, SLAs, and SLOs

Before you even begin to come up with monitoring for your systems, you should understand what “healthy” looks like for them and how that can be determined. While we would all love to have 100% uptime, we should accept that this is impossible to guarantee given the number of variables at play outside of our control.

As Google’s SRE Handbook mentions, the more reliable you want to make a service, the more it costs to operate. Moving from 99.9% to 99.99% availability is a costly endeavour and often requires Herculean efforts to pull off. What we want to define for ourselves is the lowest level of reliability we can get away with (a static, numerical value), and that is our Service-Level Objective (SLO).

A Service-Level Agreement (SLA) is a promise to someone that our availability will meet a certain level over a certain period, and that a penalty will be paid if it doesn’t. The penalty could be a partial refund of a subscription, or additional subscription time added to someone’s account. The severity of the penalties incurred by breaching an SLA should dictate how much money is invested in the reliability of our systems. An SLA could be something like 99% uptime over the course of a year; for a 24/7 service, anything beyond 87.6 hours of downtime breaches that agreement.
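To make the arithmetic concrete, here is a minimal sketch (my own illustration, not something prescribed here) of how to work out the downtime budget for a few availability targets; it reproduces the 87.6-hour figure above.

```python
# Downtime allowed for a given availability target over a given window.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

def allowed_downtime_hours(availability: float, window_hours: float = HOURS_PER_YEAR) -> float:
    """Return how many hours of downtime fit within the availability target."""
    return (1 - availability) * window_hours

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} over a year -> {allowed_downtime_hours(target):.1f} hours of downtime")

# 99.00% over a year -> 87.6 hours of downtime
# 99.90% over a year -> 8.8 hours of downtime
# 99.99% over a year -> 0.9 hours of downtime
```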

A Service-Level Indicator (SLI), however, is a metric we use internally. It should express service availability as a percentage, which we use to determine whether we have been running within our SLO over a certain period of time. Monitoring an SLI and alerting on it should tell us when we need to invest more effort in the reliability of the affected system.
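As a rough sketch of what such an indicator might look like, assuming a simple request-based SLI (the request counts below are placeholder numbers, purely for illustration):

```python
# A request-based availability SLI: good events / total events.
SLO_TARGET = 0.999  # our objective: 99.9% of requests succeed

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully over the measurement window."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing has failed
    return successful_requests / total_requests

# Illustrative numbers for a 30-day window.
sli = availability_sli(successful_requests=2_994_000, total_requests=3_000_000)

print(f"SLI: {sli:.4%}")                   # 99.8000%
print(f"Within SLO: {sli >= SLO_TARGET}")  # False -- time to invest in reliability
```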

Having an idea of all three of these should give us some clarity around what we’re using to determine system availability.

Running distributed systems gives us the flexibility to set different SLOs for the different services we run. With monolithic systems, it becomes increasingly likely that the service cannot degrade gracefully, and an issue unrelated to core business functionality can cause outages in core systems.

Good Coding and Testing Practices

Historically, testing was only ever thought of as a pre-production or pre-release activity. This school of thought is slowly being phased out as development teams become responsible for developing, testing, and operating the services they build.

The idea that production environments are sacred and not meant to be fiddled around with also means that our pre-production environments are at best a pale imitation of what production actually looks like. The 12 Factor App manifesto focuses on the applications we write having minimal divergence between development and production.

While pre-production testing is very much a common practice in modern software development, the idea of testing with live traffic is seen as something alarming. This requires not only a change in mindset, but also importantly requires an overhaul in system design, along with a solid investment in release engineering practices and tooling.

In essence, we want not only to architect for failure, but also to code and test for failure, when the default is to code and test for success. We must also acknowledge that our work isn’t done once we’ve pushed our code to production.

We should look to expand the reach of our testing. The following diagram shows many of the ways we can begin writing more resilient systems:

{{< figure src="/img/testing.webp" alt="Testing" position="center" style="border-radius: 8px;" caption="Figure 3-1 from Distributed Systems Observability by Cindy Sridharan" >}}

What Should We Be Monitoring?

Observability is a superset of both monitoring and testing: it provides insight into unpredictable failure modes that couldn’t be anticipated by monitoring or testing alone.

{{< figure src="/img/monitoring.webp" alt="Monitoring" position="center" style="border-radius: 8px;" caption="Figure 2-1 from Distributed Systems Observability by Cindy Sridharan" >}}

That being said, we should still focus on having a minimal set of requirements for monitoring our systems. A good set of metrics to begin with are the USE Method and the RED Metrics. Depending on the use case, we should be able to monitor some, if not all, of these metrics. Monitoring data should at all times provide a bird’s-eye view of the overall health of a system by recording and exposing high-level metrics over time across all components of the system. This includes, but is not limited to:

Monitoring data accompanying an alert should give us the ability to drill down into the components and units of a system as a first port of call in any incident response, so that we can diagnose the scope and coarse nature of any fault.

It is also worth noting that good monitoring means metrics are being shipped out of our hosts, ideally to a Time Series Database like Prometheus. If you find yourself SSH’ing into a box to debug issues, this usually means that you’re not shipping enough information from your hosts to make your systems observable.

USE Metrics (for every resource, check utilisation, saturation, and errors):

- Utilisation: the proportion of time a resource is busy doing work (e.g. CPU or disk at 90%)
- Saturation: the amount of work a resource cannot service yet, typically queue length
- Errors: the count of error events for that resource

RED Metrics (Rate, Errors, Duration):

- Rate: the number of requests a service is handling per second
- Errors: the number of those requests that fail per second
- Duration: the distribution of time taken to serve each request
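To show how the RED metrics above might be exposed to a time series database like Prometheus, here is a minimal sketch using the Python prometheus_client library; the service name, metric names, and simulated workload are all illustrative assumptions rather than anything prescribed by this post.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# RED metrics for a hypothetical "checkout" service.
REQUESTS = Counter("checkout_requests_total", "Total requests handled")          # Rate
ERRORS = Counter("checkout_request_errors_total", "Total requests that failed")  # Errors
DURATION = Histogram("checkout_request_duration_seconds", "Request latency")     # Duration

def handle_request() -> None:
    REQUESTS.inc()
    with DURATION.time():  # observes elapsed time into the histogram
        try:
            time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
            if random.random() < 0.02:
                raise RuntimeError("simulated failure")
        except RuntimeError:
            ERRORS.inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```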

Application tracing works its way into this setup as well, as traces give us valuable information that could otherwise be lost with basic USE or RED metrics. One thing to look out for is not jumping onto the easiest thing to measure, which is often the mean of some quantity. We can’t necessarily monitor and alert on the mean of something like CPU utilisation, as CPUs can be utilised in a very imbalanced way. The same can be said about latency.

Example: a service reporting a mean latency of 100 ms can look perfectly healthy while 1% of requests take several seconds. Percentiles (p95, p99) expose what the slowest requests are actually experiencing, and they are a far better basis for monitoring and alerting.
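Here is a quick sketch of why the mean misleads, using synthetic latencies (the numbers are entirely made up for illustration):

```python
import random
import statistics

random.seed(42)

# Synthetic latencies (ms): most requests are fast, a small tail is very slow.
latencies = [random.uniform(20, 80) for _ in range(990)]
latencies += [random.uniform(2000, 5000) for _ in range(10)]

cuts = statistics.quantiles(latencies, n=100)         # 99 percentile cut points
print(f"mean: {statistics.mean(latencies):.0f} ms")   # looks perfectly acceptable
print(f"p50:  {cuts[49]:.0f} ms")
print(f"p99:  {cuts[98]:.0f} ms")                     # reveals the slow tail the mean hides
```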

As long as we are monitoring what Google’s SRE Handbook refers to as the Four Golden Signals (latency, traffic, errors, and saturation), we are covering a minimal yet effective set of metrics for spotting bottlenecks and potential outages in our systems. The metrics above should cover all four.

What Should We Be Logging?

While logging gives us a great high-level overview, it is only valuable if we are getting useful information from our logs. It’s not worth logging literally everything your application does, as there are much better ways of introspecting application behaviour, such as tracing.

In most organisations, there always seems to be a severe lack of understanding of the logging platforms that have been implemented. If you’ve read The Phoenix Project, you’ll probably recognise that this lack of understanding comes from having a Brent: the person who knows how to do everything, is responsive to everyone, and is generally the most helpful person around. As a result, Brent becomes a bottleneck for all work. This stems from a culture of “easier to do it than explain it or teach it”, which becomes a real problem when dealing with distributed systems. If you identify a Brent in your organisation, call it out and have that knowledge shared with everyone before it becomes a problem.

When writing our own applications, we must consider what we could use the logs for, and how we can enable more verbose logging (DEBUG). The requirements for logging should include, but not be limited to:

For the last two points (logging major entry, exit, and decision points), the idea is to have them all semantically linked in some way (for example, by a request ID) so that we can follow a single request through every component it touches and reconstruct exactly what happened when something goes wrong.
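As a sketch of what that semantic linkage could look like in practice (the service, fields, and messages are hypothetical), the standard library’s logging module can stamp a request ID onto every log line via a LoggerAdapter:

```python
import logging
import uuid

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s request_id=%(request_id)s %(message)s",
)
logger = logging.getLogger("checkout")

def handle_request(payload: dict) -> None:
    # Attach one request ID to every log line emitted while handling this request.
    log = logging.LoggerAdapter(logger, {"request_id": str(uuid.uuid4())})

    log.info("entry: received request")                        # major entry point
    if not payload.get("items"):
        log.warning("decision: empty basket, rejecting")        # decision point
        return
    log.debug("calling payment service with %d items", len(payload["items"]))
    log.info("exit: request completed")                         # major exit point

handle_request({"items": ["book", "pen"]})
```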

What Should We Be Alerting On?

While each team is able to tweak what they would like to be alerted on, alerts should at a minimum cover the USE and RED metrics mentioned in the monitoring section above.

We should also be careful not to create too much noise with our alerting, as noise dilutes the effectiveness of the alerts that should be informing us of abnormalities in our systems. If we set scaling policies on our services at 70% utilisation, we shouldn’t be alerted on that; we should only be alerted when the pressure on our systems fails to release as part of a scaling activity (i.e. cumulative resource utilisation hitting 80%+ for X amount of time).

There are two common events that may be sent as alerts: Warnings and Breaches.

- Warning: a threshold is being approached or has been exceeded only briefly; no immediate action may be required, but someone should be aware of it
- Breach: a threshold or SLO has been violated for a sustained period; this should page someone and trigger incident response
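A minimal sketch of the sustained-threshold idea described above, assuming placeholder thresholds and window size rather than recommended values:

```python
from collections import deque

WARN_THRESHOLD = 0.70    # scaling should be handling this
BREACH_THRESHOLD = 0.80  # sustained pressure that scaling failed to relieve
WINDOW = 5               # number of consecutive samples that must breach

recent = deque(maxlen=WINDOW)

def evaluate(utilisation: float) -> str:
    """Classify a utilisation sample as ok, warning, or breach."""
    recent.append(utilisation)
    if len(recent) == WINDOW and all(u >= BREACH_THRESHOLD for u in recent):
        return "breach"   # page someone
    if utilisation >= WARN_THRESHOLD:
        return "warning"  # record it, but don't wake anyone up
    return "ok"

for sample in (0.65, 0.72, 0.81, 0.84, 0.86, 0.88, 0.90):
    print(f"{sample:.0%} -> {evaluate(sample)}")
```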

The human cost of alerting is something that should come into consideration when creating alerting policies. A lot of the time, these alerts are “paging events” which either disrupt someone’s day-to-day work or their sleep schedules. As mentioned in Google’s SRE Book (paraphrasing):

…a balanced on-call work distribution limits potential drawbacks of excessive on-call work, such as burnout or inadequate time for project work. (Chapter 11. Being On-Call - Andrea Spadaccini)

Some Notes

In Part Two of this series, I’ll be covering monitoring as code (as all good things should be) in DataDog, and how robust Infrastructure as Code tooling such as Terraform allows for idempotency and repeatability in setting all of this up. I’ve also decided to cover some of the basics of monitoring, like setting useful SLIs and SLOs, and what they mean for your business.

I wouldn’t have been able to write this without all of the great material already available:

- Site Reliability Engineering (Google’s SRE Handbook)
- Distributed Systems Observability by Cindy Sridharan
- The Phoenix Project by Gene Kim, Kevin Behr, and George Spafford
- The Twelve-Factor App