---
theme: seriph
colorSchema: light
title: impl Observability for Apotheosis {}
fonts:
  mono: JetBrains Mono
transition: slide-left
background: /soft-pastel-blend.png
---

impl Observability for Apotheosis {}

moving towards better-understood software


Who Am I?

I'm Jacob! Some people call me dino.


Some things I do:

  • Software Engineer at CarbinX Technologies
  • CTO at Annona
  • Part of a systems research group at UofA

Where you can find me:


Why Care? — The Cost

$76M
median annual cost of high-impact outages per org
$794K
average cost per incident
$300K+
per hour of downtime for 90% of enterprises

Why Care? — The Hidden Cost

33%
of engineer time spent firefighting — another 33% on maintenance
175 min
average time to resolve a single incident
2.5x
harder to detect an incident than to resolve it

Why Care? — It's Getting Worse

+43%
increase in customer-facing incidents year over year
82%
of orgs have MTTR > 1 hour — up from 47% just 4 years ago
10%
of organizations practice full observability today

Why Care? — The Payoff

2.8x
faster issue resolution for observability leaders vs. beginners
2.6x
annual ROI on observability investment
1/2
hourly outage cost with full-stack observability vs. without

Observability vs. Monitoring

"One night a police officer sees a drunk searching the ground beneath a streetlight and asks what he is looking for. The drunk says he has lost his keys. The police officer can't find them either and asks: 'Are you sure you lost them here, under the streetlight?' The drunk replies: 'No, but this is where the light is best.'"
— Brendan Gregg

Observability vs. Monitoring

Monitoring — Is something I know about broken?

Observability — Can I get answers to arbitrary questions about my system?


Methodologies

Scientific Method

Hypothesis-driven investigation

USE Method

Utilization, Saturation, Errors

RED Method

Rate, Errors, Duration


Methodologies — Scientific Method


Methodologies — USE Method

For every resource, check utilization, saturation, and errors.
• Resource — CPU, memory, disk, network, locks
• Utilization — what % of time is the resource busy?
• Saturation — is work queuing up?
• Errors — are there error events?
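As an illustrative sketch, here is USE applied to a single resource (a worker pool); the type and field names are hypothetical, not any library's API.

```rust
// Hypothetical stats for one resource (a worker pool), mapped onto USE.
struct WorkerPoolStats {
    busy_workers: usize,  // used for Utilization
    total_workers: usize,
    queued_jobs: usize,   // Saturation: work waiting on the resource
    job_errors: u64,      // Errors: failures attributable to this resource
}

impl WorkerPoolStats {
    /// Utilization: what fraction of the resource is busy right now?
    fn utilization(&self) -> f64 {
        self.busy_workers as f64 / self.total_workers as f64
    }

    /// Saturation: is work queuing up faster than the resource can serve it?
    fn is_saturated(&self) -> bool {
        self.queued_jobs > 0
    }
}
```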


Methodologies — RED Method

For every service, check the request rate, errors, and duration.
• Rate — how many requests per second?
• Errors — how many are failing?
• Duration — how long do they take?
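A minimal sketch of RED instrumentation for one service, assuming the Rust `prometheus` crate; the metric names and wrapper type are illustrative, and registration is omitted.

```rust
use prometheus::{Counter, Histogram};

struct RedMetrics {
    requests: Counter,   // Rate: how many requests
    errors: Counter,     // Errors: how many are failing
    duration: Histogram, // Duration: how long they take
}

impl RedMetrics {
    // Call once per request, after it completes.
    fn observe(&self, latency_seconds: f64, failed: bool) {
        self.requests.inc();
        if failed {
            self.errors.inc();
        }
        self.duration.observe(latency_seconds);
    }
}
```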

The Three Pillars of Observability

Metrics

A statistic measured over time

"How much? How often?"

Logs

Discrete events with rich contextual detail

"What happened?"

Traces

A request's journey across services and components

"Where did time go?"

Together they form a more complete picture of system behaviour

Metrics

A statistic measured over time
  • CPU usage — % of processor time in use across cores
  • Memory usage — bytes allocated, heap, stack, RSS
  • Network bandwidth — bytes in / out per second
  • Request rate — requests per second handled by a service
  • Error rate — proportion of requests resulting in an error
  • Latency — time to serve a request — p50, p95, p99

Logs

An event that occurred within your program
Unstructured
ERROR Failed to connect to database: timeout after 30s
Structured
```json
{
  "level": "error",
  "message": "Failed to connect to database",
  "error": "timeout after 30s"
}
```
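A hedged sketch of emitting that structured log from Rust, assuming the `tracing` and `tracing-subscriber` crates with the `json` feature enabled:

```rust
use tracing::error;

fn main() {
    // Emit logs as JSON objects rather than formatted strings.
    tracing_subscriber::fmt().json().init();

    error!(error = "timeout after 30s", "Failed to connect to database");
}
```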

Logs — Levels

ERROR
Actionable failures — something broke and needs attention
WARN
Unexpected but recovered — worth tracking, not paging
INFO
Significant business events — meaningful and infrequent
DEBUG
Developer diagnostics — off in production by default
TRACE
Finest-grained detail — execution paths, variable states, loop iterations
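Illustrated with the `tracing` level macros (an assumption; any structured logging library has equivalents):

```rust
use tracing::{debug, error, info, trace, warn};

fn handle_order() {
    error!("payment capture failed: card_declined"); // broke, needs attention
    warn!("retrying after transient timeout");       // recovered, worth tracking
    info!("order placed");                           // significant business event
    debug!("cache lookup took 3ms");                 // developer diagnostics, off in prod
    trace!("entering parse loop, iteration 0");      // finest-grained detail
}
```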

Logs — What to Log

Each log should represent a complete unit of work — include everything needed to understand what happened, why it happened, and how to fix it.
✗ Missing context
{
  "level": "error",
  "message": "Failed to process payment"
}
✓ Full context
{
  "level": "error",
  "message": "Failed to process payment",
  "order_id": "ord_8f3k2",
  "user_id": "usr_19ac4",
  "amount": 49.99,
  "provider": "stripe",
  "error": "card_declined",
  "duration_ms": 320,
  "trace_id": "4bf92f3577b34da6"
}

Logs — Cost & Sampling

Logs scale linearly with traffic, so at high enough volume logging everything becomes prohibitively expensive.
Probabilistic
Log a fixed % of all requests — simple but unsophisticated
Head-based
Decide at request start — low overhead, but may drop interesting events
Tail-based
Decide at request end — keep errors and slow requests, drop healthy-path traffic
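A minimal sketch of a tail-based decision in plain Rust plus the `rand` crate; the 500 ms threshold and type names are illustrative. The point is that the decision happens after the request completes, so errors and slow requests can always be kept.

```rust
use rand::Rng;
use std::time::Duration;

struct CompletedRequest {
    status: u16,
    duration: Duration,
}

// Keep every error and every slow request; sample the healthy rest.
fn should_keep(req: &CompletedRequest, healthy_sample_rate: f64) -> bool {
    if req.status >= 500 {
        return true;
    }
    if req.duration > Duration::from_millis(500) {
        return true;
    }
    rand::thread_rng().gen_bool(healthy_sample_rate)
}
```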


Traces

The journey of a request through your system
A trace is made up of one or more spans — each span represents a logical unit of execution in your program.
Trace 4bf92f: API Gateway (Span 1) → Auth (Span 2) → Payment Service (Span 3) → Cache (Span 4) → DB Query (Span 5)
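A hedged sketch of producing spans like these in Rust with the `tracing` crate; span names and fields are illustrative.

```rust
use tracing::{info_span, instrument};

// #[instrument] wraps the function in its own span, recording the
// arguments as span attributes.
#[instrument]
fn charge_card(order_id: &str) {
    // ... call the payment provider ...
}

fn handle_payment(order_id: &str) {
    // The parent span covers the whole request; child spans nest under it
    // and share the same trace.
    let _request = info_span!("POST /payments", order_id).entered();
    charge_card(order_id);
}
```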

Traces — Anatomy

Name
What operation this span represents
Trace ID
Unique ID shared across all spans in the request
Start / End
When the span began and finished
Status
Ok, Error, or Unset
Attributes
Key-value pairs — user ID, HTTP method, DB query, etc.
Events
Timestamped annotations — "cache miss", "retry"
{
  "name": "POST /payments",
  "trace_id": "4bf92f3577b34da6",
  "span_id": "00f067aa0ba902b7",
  "start": "2024-01-15T10:23:41Z",
  "end": "2024-01-15T10:23:41.320Z",
  "status": "error",
  "attributes": {
    "user_id": "usr_19ac4",
    "http.method": "POST",
    "http.status_code": 402
  },
  "events": [
    { "name": "card_declined" }
  ]
}

Traces — Context Propagation
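Context propagation means the active trace context travels with the request, typically as the W3C `traceparent` header; the callee extracts it so its spans join the same trace. A rough sketch with hypothetical helper functions (real instrumentation libraries do this for you):

```rust
// W3C Trace Context: traceparent = "{version}-{trace_id}-{parent_span_id}-{flags}"
struct TraceContext {
    trace_id: String,
    parent_span_id: String,
}

// Outgoing request: inject the current context as a header value.
fn to_traceparent(ctx: &TraceContext) -> String {
    format!("00-{}-{}-01", ctx.trace_id, ctx.parent_span_id)
}

// Incoming request: extract the context so new spans join the same trace.
fn from_traceparent(header: &str) -> Option<TraceContext> {
    let mut parts = header.split('-');
    let _version = parts.next()?;
    Some(TraceContext {
        trace_id: parts.next()?.to_string(),
        parent_span_id: parts.next()?.to_string(),
    })
}
```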


Traces — Cost & Sampling

Traces scale with traffic in the same way logs do — tracing everything becomes expensive fast.
Same sampling strategies apply — probabilistic, head-based, and tail-based.

What is OpenTelemetry?

A vendor-neutral, open standard for instrumentation from the CNCF:

  • A data format
    • This is what a span/trace/metric/log looks like
  • An API & SDK
    • The interface and actual implementations
  • A wire protocol (OTLP)
    • How telemetry is transported between systems
  • Semantic conventions — standardized attribute names across the industry
    • http.method, db.system, rpc.service — not method, database, service
    • Ensures dashboards, alerts, and queries work consistently across services and vendors
  • And more — a collector, auto-instrumentation...
Instrument once, route anywhere.
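A small sketch of semantic-convention attribute keys on a span, here via the `tracing` crate's dotted field names (an assumption; the OTel export pipeline is omitted):

```rust
use tracing::info_span;

fn process_payment() {
    // The attribute *keys* follow the OTel semantic conventions, so any
    // backend or dashboard understands them without per-service mapping.
    let _span = info_span!(
        "POST /payments",
        http.method = "POST",
        http.status_code = 402,
        db.system = "postgresql"
    )
    .entered();
}
```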

The Grafana Stack

Prometheus
Scrapes and stores metrics — the source of truth for numeric time-series data
Loki
Stores and queries logs — indexes labels only, not full content
Tempo
Stores traces at scale — object storage backed, queryable via TraceQL
Grafana
The unified UI — dashboards, alerts, and cross-signal exploration across all three backends
Alloy
The collector — receives telemetry from your services and routes it to the right backend
Pyroscope
Continuous profiling — always-on flame graphs integrated with traces
Your App → Alloy → Prometheus / Loki / Tempo → Grafana

Prometheus — Metric Types

Counter

Monotonically increasing — only ever goes up

http_requests_total

Gauge

Current snapshot — can go up or down

memory_usage_bytes

Histogram

Bucketed observations — tracks distribution of values

request_duration_seconds_bucket

Summary

Pre-calculated quantiles on the client side

request_duration_seconds{quantile="0.99"}
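A minimal sketch of a counter, a gauge, and a histogram with the Rust `prometheus` crate (Summary omitted; metric names are illustrative):

```rust
use prometheus::{Counter, Gauge, Histogram, HistogramOpts, Opts, Registry};

fn register_metrics(registry: &Registry) -> prometheus::Result<(Counter, Gauge, Histogram)> {
    // Counter: monotonically increasing, only ever goes up.
    let requests = Counter::with_opts(Opts::new("http_requests_total", "Requests served"))?;
    // Gauge: a snapshot of the current value, can go up or down.
    let memory = Gauge::with_opts(Opts::new("memory_usage_bytes", "Resident memory in bytes"))?;
    // Histogram: observations bucketed to capture a distribution (latency, sizes).
    let duration = Histogram::with_opts(HistogramOpts::new(
        "request_duration_seconds",
        "Request latency in seconds",
    ))?;

    registry.register(Box::new(requests.clone()))?;
    registry.register(Box::new(memory.clone()))?;
    registry.register(Box::new(duration.clone()))?;
    Ok((requests, memory, duration))
}
```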

Prometheus — Time Series

A single metric becomes a time series — a sequence of (timestamp, value) pairs recorded at regular intervals.
http_requests_total = 42
Add labels to describe dimensions of the metric — each unique combination of labels creates a new time series.
http_requests_total{method="GET", status="200", service="api"} = 42
Prometheus scrapes these on a schedule and stores each series independently — giving you the ability to filter, aggregate, and compare across dimensions at query time.
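A sketch with the `prometheus` crate's `CounterVec`: label names are declared once, and each distinct combination of label values becomes its own series.

```rust
use prometheus::{CounterVec, Opts};

fn labelled_counter() -> prometheus::Result<()> {
    // Label *names* are fixed up front; values are supplied per observation.
    let requests = CounterVec::new(
        Opts::new("http_requests_total", "Requests served"),
        &["method", "status", "service"], // bounded, low-cardinality labels
    )?;

    // Creates (or reuses) the series http_requests_total{method="GET", status="200", service="api"}.
    requests.with_label_values(&["GET", "200", "api"]).inc();
    Ok(())
}
```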


Prometheus — Cardinality

Every unique label combination creates a new time series — more combinations means more memory and disk.
Good labels
Bounded, low cardinality — method, status_code, service, region
Bad labels
Unbounded, high cardinality — user_id, order_id, trace_id, email
Symptoms
Prometheus OOM, slow queries, huge WAL, scrape timeouts


Grafana Stack — Connecting the Signals

Exemplars
A metric data point that carries a trace ID — see a p99 spike, click the dot, jump straight to the trace for that request
Log → Trace
OTel injects trace_id and span_id into every log line — click a log entry in Loki to open the full trace in Tempo
Trace → Logs
From a slow span in Tempo, filter Loki for all logs sharing that trace_id — see exactly what your code was doing
Service Graph
Derived automatically from trace data — inter-service call rates, error rates, and topology without any extra configuration

Grafana Stack — The Investigation Flow

Alert fires (error rate spike) → Dashboard (see the spike) → Trace (click exemplar) → Logs (root cause)

Alerting Philosophy

Page on symptoms
Elevated error rate, high latency — things users actually experience
Investigate causes
CPU, memory, disk — leading indicators for dashboards and tickets, not pages
Every alert needs
A clear user impact, urgency, and a runbook — if it fires and nobody acts, remove it

Profiling — Flame Graphs

Width = time spent. Read bottom-up. Wide blocks are the hot path.
[Flame graph over the stack: main(), handle_request(), parse_body(), send_response(), serde::de::Deserialize&lt;T&gt; ← bottleneck]

Profiling — Instrumentation

SDK / Library — Add the Pyroscope SDK to your service. You control what gets profiled, labels are rich, and you get per-request granularity. Costs a small runtime overhead.

eBPF — Zero code changes. The kernel observes your process from outside — CPU sampling, off-CPU blocking, memory allocations. Works for any language, any binary.


Profiling — Traces & Profiles

Traces tell you which span was slow. Flame graphs tell you which line of code caused it. Together they close the loop.

In practice — Pyroscope attaches a profile to a trace span. Jump from a slow span directly to the flame graph captured during that exact time window.


Questions?


Thank You