"One night a police officer sees a drunk searching the ground beneath a streetlight and asks what he is looking for. The drunk says he has lost his keys. The police officer can't find them either and asks: 'Are you sure you lost them here, under the streetlight?' The drunk replies: 'No, but this is where the light is best.'"
— Brendan Gregg
Observability vs. Monitoring
Monitoring — Is something I know about broken?
Observability — Can I get answers to arbitrary questions about my system?
Methodologies
Scientific Method
Hypothesis-driven investigation
USE Method
Utilization, Saturation, Errors
RED Method
Rate, Errors, Duration
Methodologies — Scientific Method
Form a hypothesis about the cause, predict what you should observe if it is true, test against real data, and iterate, rather than searching where the light happens to be best.
Methodologies — USE Method
For every resource, check utilization, saturation, and errors.
• Resource — CPU, memory, disk, network, locks
• Utilization — what % of time is the resource busy?
• Saturation — is work queuing up?
• Errors — are there error events?
Methodologies — RED Method
For every service, check the request rate, errors, and duration.
• Rate — how many requests per second?
• Errors — how many are failing?
• Duration — how long do they take?
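As a rough sketch of what RED instrumentation can look like inside a handler, assuming the Rust prometheus and lazy_static crates (the metric names and the handler itself are illustrative, not part of this deck):

```rust
use prometheus::{register_histogram, register_int_counter, Histogram, IntCounter};

lazy_static::lazy_static! {
    // Rate: how many requests per second (use rate() over this counter at query time).
    static ref REQUESTS: IntCounter =
        register_int_counter!("http_requests_total", "Total requests handled").unwrap();
    // Errors: how many are failing.
    static ref ERRORS: IntCounter =
        register_int_counter!("http_request_errors_total", "Requests that failed").unwrap();
    // Duration: how long they take (a histogram gives p50/p95/p99 at query time).
    static ref DURATION: Histogram =
        register_histogram!("http_request_duration_seconds", "Request latency in seconds").unwrap();
}

// Illustrative handler showing where each of the three signals is recorded.
fn handle_request() -> Result<String, String> {
    REQUESTS.inc();
    let timer = DURATION.start_timer();
    let result = do_work(); // stand-in for the real request handling
    timer.observe_duration();
    if result.is_err() {
        ERRORS.inc();
    }
    result
}

fn do_work() -> Result<String, String> {
    Ok("ok".to_string())
}
```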
The Three Pillars of Observability
Metrics
A statistic measured over time
"How much? How often?"
Logs
Discrete events with rich contextual detail
"What happened?"
Traces
A request's journey across services and components
"Where did time go?"
Together they form a more complete picture of system behaviour
Metrics
A statistic measured over time
CPU usage — % of processor time in use across cores
Memory usage — bytes allocated, heap, stack, RSS
Network bandwidth — bytes in / out per second
Request rate — requests per second handled by a service
Error rate — proportion of requests resulting in an error
Latency — time to serve a request — p50, p95, p99
Logs
An event that occurred within your program
Unstructured
ERROR Failed to connect to database: timeout after 30s
Structured
```json
{
"level": "error",
"message": "Failed to connect to database",
"error": "timeout after 30s"
}
```
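A sketch of emitting the structured form from code, assuming the Rust tracing and tracing-subscriber crates with the "json" feature enabled:

```rust
use tracing::error;

fn main() {
    // Emit each log event as one JSON object (tracing-subscriber's "json" feature).
    tracing_subscriber::fmt().json().init();

    // Fields become keys in the JSON object alongside level and message.
    error!(error = "timeout after 30s", "Failed to connect to database");
}
```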
Logs — Levels
ERROR
Actionable failures — something broke and needs attention
WARN
Unexpected but recovered — worth tracking, not paging
INFO
Significant business events — meaningful and infrequent
DEBUG
Developer diagnostics — off in production by default
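A sketch of how these levels map onto code, and one way to keep DEBUG off in production, assuming tracing-subscriber's env-filter feature (the messages are illustrative):

```rust
use tracing::{debug, error, info, warn};
use tracing_subscriber::EnvFilter;

fn main() {
    // Honour RUST_LOG if set; otherwise default to "info", so debug! is dropped.
    let filter = EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info"));
    tracing_subscriber::fmt().with_env_filter(filter).init();

    error!("failed to connect to database");   // actionable failure: needs attention
    warn!("retrying after transient timeout"); // recovered: track it, don't page
    info!("payment accepted");                 // significant business event
    debug!("cache miss, falling back to DB");  // developer diagnostics: filtered out here
}
```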
Prometheus scrapes metrics on a schedule and stores each labelled series independently — giving you the ability to filter, aggregate, and compare across dimensions at query time.
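On the service side, being scraped just means exposing the current values over HTTP in the Prometheus text format. A sketch with the Rust prometheus crate, leaving the HTTP server itself out:

```rust
use prometheus::{Encoder, TextEncoder};

// Render everything in the default registry in the Prometheus text exposition
// format; a real service returns this body from GET /metrics.
fn metrics_body() -> String {
    let encoder = TextEncoder::new();
    let mut buffer = Vec::new();
    encoder
        .encode(&prometheus::gather(), &mut buffer)
        .expect("failed to encode metrics");
    String::from_utf8(buffer).expect("metrics are valid UTF-8")
}
```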
Prometheus — Cardinality
Every unique label combination creates a new time series — more combinations means more memory and disk.
Good labels
Bounded, low cardinality — method, status_code, service, region
Bad labels
Unbounded, high cardinality — user_id, order_id, trace_id, email
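A sketch of the bounded case with the Rust prometheus and lazy_static crates (names are illustrative): the label names are fixed, so the series count is capped by methods times status codes.

```rust
use prometheus::{register_int_counter_vec, IntCounterVec};

lazy_static::lazy_static! {
    // Bounded labels: a handful of methods times a few dozen status codes.
    static ref HTTP_REQUESTS: IntCounterVec = register_int_counter_vec!(
        "http_requests_total",
        "Requests by method and status code",
        &["method", "status_code"]
    )
    .unwrap();
}

fn record(method: &str, status: u16) {
    // Each distinct (method, status_code) pair is one time series.
    let status = status.to_string();
    HTTP_REQUESTS
        .with_label_values(&[method, status.as_str()])
        .inc();
    // A user_id, order_id, or trace_id label here would mint a new series per value.
}
```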
Metric → Trace
A metric data point that carries a trace ID (an exemplar) — see a p99 spike, click the dot, jump straight to the trace for that request
Log → Trace
OTel injects trace_id and span_id into every log line — click a log entry in Loki to open the full trace in Tempo
Trace → Logs
From a slow span in Tempo, filter Loki for all logs sharing that trace_id — see exactly what your code was doing
Service Graph
Derived automatically from trace data — inter-service call rates, error rates, and topology without any extra configuration
Grafana Stack — The Investigation Flow
Alert fires (error rate spike) → Dashboard (see the spike) → Trace (click exemplar) → Logs (root cause)
Alerting Philosophy
Page on symptoms
Elevated error rate, high latency — things users actually experience
Investigate causes
CPU, memory, disk — leading indicators for dashboards and tickets, not pages
Every alert needs
A clear user impact, urgency, and a runbook — if it fires and nobody acts, remove it
Profiling — Flame Graphs
Width = time spent. Read bottom-up. Wide blocks are the hot path.
[Flame graph: main() → handle_request() → parse_body() / send_response(), with the wide serde::de::Deserialize<T> block marked as the bottleneck]
Profiling — Instrumentation
SDK / Library — Add the Pyroscope SDK to your service. You control what gets profiled, labels are rich, and you get per-request granularity. Costs a small runtime overhead.
eBPF — Zero code changes. The kernel observes your process from outside — CPU sampling, off-CPU blocking, memory allocations. Works for any language, any binary.
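For the SDK route, a rough sketch of wiring up the Rust Pyroscope agent, assuming the pyroscope and pyroscope_pprofrs crates; the server URL, application name, and tag are placeholders, and the builder API may differ between versions:

```rust
use pyroscope::PyroscopeAgent;
use pyroscope_pprofrs::{pprof_backend, PprofConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Sample CPU roughly 100 times per second and ship profiles to the server.
    let agent = PyroscopeAgent::builder("http://localhost:4040", "my-service")
        .backend(pprof_backend(PprofConfig::new().sample_rate(100)))
        .tags(vec![("region", "eu-west-1")])
        .build()?;
    let running = agent.start()?;

    // ... the service runs here, profiled continuously ...

    // Stop profiling and flush on shutdown.
    let ready = running.stop()?;
    ready.shutdown();
    Ok(())
}
```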
Profiling — Traces & Profiles
Traces tell you which span was slow. Flame graphs tell you which line of code caused it. Together they close the loop.
In practice — Pyroscope attaches a profile to a trace span. Jump from a slow span directly to the flame graph captured during that exact time window.