---
theme: seriph
colorSchema: light
title: impl Observability for Apotheosis {}
fonts:
  mono: JetBrains Mono
transition: slide-left
background: /soft-pastel-blend.png
---

impl Observability for Apotheosis {}

moving towards better-understood software


Who Am I?

I'm Jacob! Some people call me dino.


Some things I do:

  • Software Engineer at CarbinX Technologies
  • CTO at Annona
  • Part of a systems research group at UofA

Where you can find me:


Why Care? — The Cost

$76M
median annual cost of high-impact outages per org
$794K
average cost per incident
$300K+
per hour of downtime for 90% of enterprises

Why Care? — The Hidden Cost

33%
of engineer time spent firefighting — another 33% on maintenance
175 min
average time to resolve a single incident
2.5x
harder to detect an incident than to resolve it

Why Care? — It's Getting Worse

+43%
increase in customer-facing incidents year over year
82%
of orgs have MTTR > 1 hour — up from 47% just 4 years ago
10%
of organizations practice full observability today

Why Care? — The Payoff

2.8x
faster issue resolution for observability leaders vs. beginners
2.6x
annual ROI on observability investment
1/2
hourly outage cost with full-stack observability vs. without

Observability vs. Monitoring

"One night a police officer sees a drunk searching the ground beneath a streetlight and asks what he is looking for. The drunk says he has lost his keys. The police officer can't find them either and asks: 'Are you sure you lost them here, under the streetlight?' The drunk replies: 'No, but this is where the light is best.'"
— Brendan Gregg

Observability vs. Monitoring

Monitoring — Is something I know about broken?

Observability — Can I get answers to arbitrary questions about my system?


Methodologies

Scientific Method

Hypothesis-driven investigation

USE Method

Utilization, Saturation, Errors

RED Method

Rate, Errors, Duration


Methodologies — Scientific Method


Methodologies — USE Method

For every resource, check utilization, saturation, and errors.
• Resource — CPU, memory, disk, network, locks
• Utilization — what % of time is the resource busy?
• Saturation — is work queuing up?
• Errors — are there error events?
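As an illustrative sketch, here is USE applied to a single resource (a worker pool); the type and field names are hypothetical, not any library's API.

```rust
// Hypothetical stats for one resource (a worker pool), mapped onto USE.
struct WorkerPoolStats {
    busy_workers: usize,  // used for Utilization
    total_workers: usize,
    queued_jobs: usize,   // Saturation: work waiting on the resource
    job_errors: u64,      // Errors: failures attributable to this resource
}

impl WorkerPoolStats {
    /// Utilization: what fraction of the resource is busy right now?
    fn utilization(&self) -> f64 {
        self.busy_workers as f64 / self.total_workers as f64
    }

    /// Saturation: is work queuing up faster than the resource can serve it?
    fn is_saturated(&self) -> bool {
        self.queued_jobs > 0
    }
}
```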


Methodologies — RED Method

For every service, check the request rate, errors, and duration.
• Rate — how many requests per second?
• Errors — how many are failing?
• Duration — how long do they take?
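A minimal sketch of RED instrumentation for one service, assuming the Rust `prometheus` crate; the metric names and wrapper type are illustrative, and registration is omitted.

```rust
use prometheus::{Counter, Histogram};

struct RedMetrics {
    requests: Counter,   // Rate: how many requests
    errors: Counter,     // Errors: how many are failing
    duration: Histogram, // Duration: how long they take
}

impl RedMetrics {
    // Call once per request, after it completes.
    fn observe(&self, latency_seconds: f64, failed: bool) {
        self.requests.inc();
        if failed {
            self.errors.inc();
        }
        self.duration.observe(latency_seconds);
    }
}
```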

The Three Pillars of Observability

Metrics

A statistic measured over time

"How much? How often?"

Logs

Discrete events with rich contextual detail

"What happened?"

Traces

A request's journey across services and components

"Where did time go?"

Together they form a more complete picture of system behaviour

Metrics

A statistic measured over time
  • CPU usage — % of processor time in use across cores
  • Memory usage — bytes allocated, heap, stack, RSS
  • Network bandwidth — bytes in / out per second
  • Request rate — requests per second handled by a service
  • Error rate — proportion of requests resulting in an error
  • Latency — time to serve a request — p50, p95, p99

Logs

An event that occurred within your program
Unstructured
ERROR Failed to connect to database: timeout after 30s
Structured
```json
{
  "level": "error",
  "message": "Failed to connect to database",
  "error": "timeout after 30s"
}
```
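A hedged sketch of emitting that structured log from Rust, assuming the `tracing` and `tracing-subscriber` crates with the `json` feature enabled:

```rust
use tracing::error;

fn main() {
    // Emit logs as JSON objects rather than formatted strings.
    tracing_subscriber::fmt().json().init();

    error!(error = "timeout after 30s", "Failed to connect to database");
}
```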

Logs — Levels

ERROR
Actionable failures — something broke and needs attention
WARN
Unexpected but recovered — worth tracking, not paging
INFO
Significant business events — meaningful and infrequent
DEBUG
Developer diagnostics — off in production by default
TRACE
Finest-grained detail — execution paths, variable states, loop iterations
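Illustrated with the `tracing` level macros (an assumption; any structured logging library has equivalents):

```rust
use tracing::{debug, error, info, trace, warn};

fn handle_order() {
    error!("payment capture failed: card_declined"); // broke, needs attention
    warn!("retrying after transient timeout");       // recovered, worth tracking
    info!("order placed");                           // significant business event
    debug!("cache lookup took 3ms");                 // developer diagnostics, off in prod
    trace!("entering parse loop, iteration 0");      // finest-grained detail
}
```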

Logs — What to Log

Each log should represent a complete unit of work — include everything needed to understand what happened, why it happened, and how to fix it.
✗ Missing context
{
  "level": "error",
  "message": "Failed to process payment"
}
✓ Full context
{
  "level": "error",
  "message": "Failed to process payment",
  "order_id": "ord_8f3k2",
  "user_id": "usr_19ac4",
  "amount": 49.99,
  "provider": "stripe",
  "error": "card_declined",
  "duration_ms": 320,
  "trace_id": "4bf92f3577b34da6"
}

Logs — Cost & Sampling

Logs scale linearly with traffic, so at high enough volume logging everything becomes prohibitively expensive.
Probabilistic
Log a fixed % of all requests — simple but unsophisticated
Head-based
Decide at request start — low overhead, but may drop interesting events
Tail-based
Decide at request end — keep errors and slow requests, drop healthy-path traffic
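A minimal sketch of a tail-based decision in plain Rust plus the `rand` crate; the 500 ms threshold and type names are illustrative. The point is that the decision happens after the request completes, so errors and slow requests can always be kept.

```rust
use rand::Rng;
use std::time::Duration;

struct CompletedRequest {
    status: u16,
    duration: Duration,
}

// Keep every error and every slow request; sample the healthy rest.
fn should_keep(req: &CompletedRequest, healthy_sample_rate: f64) -> bool {
    if req.status >= 500 {
        return true;
    }
    if req.duration > Duration::from_millis(500) {
        return true;
    }
    rand::thread_rng().gen_bool(healthy_sample_rate)
}
```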


Traces

The journey of a request through your system
A trace is made up of one or more spans — each span represents a logical unit of execution in your program.
Trace 4bf92f: API Gateway (Span 1) → Auth (Span 2) → Payment Service (Span 3) → Cache (Span 4) → DB Query (Span 5)
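A hedged sketch of producing spans like these in Rust with the `tracing` crate; span names and fields are illustrative.

```rust
use tracing::{info_span, instrument};

// #[instrument] wraps the function in its own span, recording the
// arguments as span attributes.
#[instrument]
fn charge_card(order_id: &str) {
    // ... call the payment provider ...
}

fn handle_payment(order_id: &str) {
    // The parent span covers the whole request; child spans nest under it
    // and share the same trace.
    let _request = info_span!("POST /payments", order_id).entered();
    charge_card(order_id);
}
```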

Traces — Anatomy

Name
What operation this span represents
Trace ID
Unique ID shared across all spans in the request
Start / End
When the span began and finished
Status
Ok, Error, or Unset
Attributes
Key-value pairs — user ID, HTTP method, DB query, etc.
Events
Timestamped annotations — "cache miss", "retry"
{
  "name": "POST /payments",
  "trace_id": "4bf92f3577b34da6",
  "span_id": "00f067aa0ba902b7",
  "start": "2024-01-15T10:23:41Z",
  "end": "2024-01-15T10:23:41.320Z",
  "status": "error",
  "attributes": {
    "user_id": "usr_19ac4",
    "http.method": "POST",
    "http.status_code": 402
  },
  "events": [
    { "name": "card_declined" }
  ]
}

Traces — Context Propagation
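Context propagation means the active trace context travels with the request, typically as the W3C `traceparent` header; the callee extracts it so its spans join the same trace. A rough sketch with hypothetical helper functions (real instrumentation libraries do this for you):

```rust
// W3C Trace Context: traceparent = "{version}-{trace_id}-{parent_span_id}-{flags}"
struct TraceContext {
    trace_id: String,
    parent_span_id: String,
}

// Outgoing request: inject the current context as a header value.
fn to_traceparent(ctx: &TraceContext) -> String {
    format!("00-{}-{}-01", ctx.trace_id, ctx.parent_span_id)
}

// Incoming request: extract the context so new spans join the same trace.
fn from_traceparent(header: &str) -> Option<TraceContext> {
    let mut parts = header.split('-');
    let _version = parts.next()?;
    Some(TraceContext {
        trace_id: parts.next()?.to_string(),
        parent_span_id: parts.next()?.to_string(),
    })
}
```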


Traces — Cost & Sampling

Traces scale with traffic in the same way logs do — tracing everything becomes expensive fast.
Same sampling strategies apply — probabilistic, head-based, and tail-based.

What is OpenTelemetry?

A vendor-neutral, open standard for instrumentation from the CNCF:

  • A data format
    • This is what a span/trace/metric/log looks like
  • An API & SDK
    • The interface and actual implementations
  • A wire protocol (OTLP)
    • How telemetry is transported between systems
  • Semantic conventions — standardized attribute names across the industry
    • http.method, db.system, rpc.service — not method, database, service
    • Ensures dashboards, alerts, and queries work consistently across services and vendors
  • And more — a collector, auto-instrumentation...
Instrument once, route anywhere.
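A small sketch of semantic-convention attribute keys on a span, here via the `tracing` crate's dotted field names (an assumption; the OTel export pipeline is omitted):

```rust
use tracing::info_span;

fn process_payment() {
    // The attribute *keys* follow the OTel semantic conventions, so any
    // backend or dashboard understands them without per-service mapping.
    let _span = info_span!(
        "POST /payments",
        http.method = "POST",
        http.status_code = 402,
        db.system = "postgresql"
    )
    .entered();
}
```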

The Grafana Stack

Prometheus
Scrapes and stores metrics — the source of truth for numeric time-series data
Loki
Stores and queries logs — indexes labels only, not full content
Tempo
Stores traces at scale — object storage backed, queryable via TraceQL
Grafana
The unified UI — dashboards, alerts, and cross-signal exploration across all three backends
Alloy
The collector — receives telemetry from your services and routes it to the right backend
Pyroscope
Continuous profiling — always-on flame graphs integrated with traces
Your App → Alloy → Prometheus / Loki / Tempo → Grafana

Prometheus — Metric Types

Counter

Monotonically increasing — only ever goes up

http_requests_total

Gauge

Current snapshot — can go up or down

memory_usage_bytes

Histogram

Bucketed observations — tracks distribution of values

request_duration_seconds_bucket

Summary

Pre-calculated quantiles on the client side

request_duration_seconds{quantile="0.99"}
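A minimal sketch of a counter, a gauge, and a histogram with the Rust `prometheus` crate (Summary omitted; metric names are illustrative):

```rust
use prometheus::{Counter, Gauge, Histogram, HistogramOpts, Opts, Registry};

fn register_metrics(registry: &Registry) -> prometheus::Result<(Counter, Gauge, Histogram)> {
    // Counter: monotonically increasing, only ever goes up.
    let requests = Counter::with_opts(Opts::new("http_requests_total", "Requests served"))?;
    // Gauge: a snapshot of the current value, can go up or down.
    let memory = Gauge::with_opts(Opts::new("memory_usage_bytes", "Resident memory in bytes"))?;
    // Histogram: observations bucketed to capture a distribution (latency, sizes).
    let duration = Histogram::with_opts(HistogramOpts::new(
        "request_duration_seconds",
        "Request latency in seconds",
    ))?;

    registry.register(Box::new(requests.clone()))?;
    registry.register(Box::new(memory.clone()))?;
    registry.register(Box::new(duration.clone()))?;
    Ok((requests, memory, duration))
}
```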

Prometheus — Time Series

A single metric becomes a time series — a sequence of (timestamp, value) pairs recorded at regular intervals.
http_requests_total = 42
Add labels to describe dimensions of the metric — each unique combination of labels creates a new time series.
http_requests_total{method="GET", status="200", service="api"} = 42
Prometheus scrapes these on a schedule and stores each series independently — giving you the ability to filter, aggregate, and compare across dimensions at query time.
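A sketch with the `prometheus` crate's `CounterVec`: label names are declared once, and each distinct combination of label values becomes its own series.

```rust
use prometheus::{CounterVec, Opts};

fn labelled_counter() -> prometheus::Result<()> {
    // Label *names* are fixed up front; values are supplied per observation.
    let requests = CounterVec::new(
        Opts::new("http_requests_total", "Requests served"),
        &["method", "status", "service"], // bounded, low-cardinality labels
    )?;

    // Creates (or reuses) the series http_requests_total{method="GET", status="200", service="api"}.
    requests.with_label_values(&["GET", "200", "api"]).inc();
    Ok(())
}
```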


Prometheus — Cardinality

Every unique label combination creates a new time series — more combinations means more memory and disk.
Good labels
Bounded, low cardinality — method, status_code, service, region
Bad labels
Unbounded, high cardinality — user_id, order_id, trace_id, email
Symptoms
Prometheus OOM, slow queries, huge WAL, scrape timeouts


Grafana Stack — Connecting the Signals

Exemplars
A metric data point that carries a trace ID — see a p99 spike, click the dot, jump straight to the trace for that request
Log → Trace
OTel injects trace_id and span_id into every log line — click a log entry in Loki to open the full trace in Tempo
Trace → Logs
From a slow span in Tempo, filter Loki for all logs sharing that trace_id — see exactly what your code was doing
Service Graph
Derived automatically from trace data — inter-service call rates, error rates, and topology without any extra configuration

Grafana Stack — The Investigation Flow

Alert fires (error rate spike) → Dashboard (see the spike) → Trace (click exemplar) → Logs (root cause)

Alerting Philosophy

Page on symptoms
Elevated error rate, high latency — things users actually experience
Investigate causes
CPU, memory, disk — leading indicators for dashboards and tickets, not pages
Every alert needs
A clear user impact, urgency, and a runbook — if it fires and nobody acts, remove it

Profiling — Flame Graphs

Width = time spent. Read bottom-up. Wide blocks are the hot path.
[Flame graph over the stack: main(), handle_request(), parse_body(), send_response(), serde::de::Deserialize&lt;T&gt; ← bottleneck]

Profiling — Instrumentation

SDK / Library — Add the Pyroscope SDK to your service. You control what gets profiled, labels are rich, and you get per-request granularity. Costs a small runtime overhead.

eBPF — Zero code changes. The kernel observes your process from outside — CPU sampling, off-CPU blocking, memory allocations. Works for any language, any binary.


Profiling — Traces & Profiles

Traces tell you which span was slow. Flame graphs tell you which line of code caused it. Together they close the loop.

In practice — Pyroscope attaches a profile to a trace span. Jump from a slow span directly to the flame graph captured during that exact time window.


Questions?


Thank You