Mastering Observability 101 is the flagship open course from TelemetryFlow Observability (TFO) — a structured, 6-week, 12-session journey that takes learners from the fundamentals of system monitoring all the way to modern, vendor-neutral, control-plane–driven observability with OpenTelemetry and TelemetryFlow.
This repository contains the complete curriculum: the master COURSE.md syllabus, the downloadable slide decks in pdf/, and the shared brand/license assets in assets/. The material is designed to be delivered live (instructor-led) or self-paced.
Author: Dwi Fahni Denni · Platform: TelemetryFlow Observability (TFO) · License: CC BY-NC-SA 4.0 · Version: 1.0.0 · Last Update: 2026-06-24
- Overview
- Learning Outcomes
- Who Is This Course For?
- Prerequisites
- Course Architecture
- Curriculum (12 Sessions)
- Slide Decks (PDF)
- How to Use This Repository
- Learning Pathway
- Key Concepts Covered
- Hands-On & Assignments
- Documentation
- Enterprise Training
- Repository Structure
- Installation / Local Use
- Contributing
- Code of Conduct
- License
- Author
- Acknowledgments
Observability is no longer optional. Modern distributed systems — microservices, Kubernetes, event-driven architectures — generate a flood of telemetry data, yet debugging has become harder, not easier. The "Dashboard of Doom" (100 green dashboards while users see errors) is a symptom of the gap between monitoring (asking "is it broken?") and observability (asking "why is it broken?").
TelemetryFlow Academy bridges that gap. The course is built around three progressive arcs:
- Foundations (Weeks 1–2) — From monitoring to observability, the MELT pillars (Metrics, Events, Logs, Traces), and reliability engineering with SLIs/SLOs.
- Deep Dive (Weeks 3–5) — Distributed tracing, structured logging, OpenTelemetry (API, SDK, Collector), profiling, and frontend observability.
- Scale & Platform (Week 6) — The pain of scale (cost, noise, rigidity), control-plane concepts, and a hands-on introduction to TelemetryFlow Observability.
Each session is delivered as a self-contained slide deck (~12–15 slides), paired with discussion prompts, hands-on exercises, and an assignment.
By the end of this course, learners will be able to:
- Explain the difference between monitoring and observability, and articulate why observability is a system property, not a tool.
- Identify and categorize the four pillars (MELT) of telemetry data and choose the right pillar for each scenario.
- Apply the RED, USE, and Google Golden Signals methods, and define meaningful SLIs, SLOs, and error budgets.
- Read and author distributed traces and structured logs, and correlate them via trace_id/span_id.
- Describe the OpenTelemetry architecture (API, SDK, Collector, semantic conventions) and instrument a sample application.
- Configure an OTel Collector pipeline (receivers, processors, exporters, connectors) and reason about deployment topologies.
- Perform end-to-end correlation across metrics, logs, traces, and exemplars, and design dashboards/alerts that avoid alert fatigue.
- Recognize the pain points of telemetry at scale (cost, noise, rigidity, vendor lock-in) and evaluate how a control plane / smart gateway (TelemetryFlow) addresses them.
| Audience | What You'll Get Out of It |
|---|---|
| SRE / Platform Engineers | A structured vocabulary for observability plus hands-on OpenTelemetry & TFO patterns. |
| Backend / Microservices Developers | Learn to instrument code the right way and debug distributed systems fast. |
| DevOps Engineers | Operate the OTel Collector at scale and design resilient telemetry pipelines. |
| Engineering Managers / Architects | Understand the cost/rigor trade-offs and what a control plane unlocks. |
| Students / Career Switchers | A vendor-neutral, modern foundation in observability — no proprietary lock-in. |
- Basic familiarity with Linux command line and HTTP/REST.
- Comfort reading YAML / JSON configuration.
- Exposure to at least one programming language (Go, Python, Java, JS/TS, or similar).
- Conceptual familiarity with containers (Docker) and ideally Kubernetes (helpful but not required for Weeks 1–4).
- No prior observability tooling experience required — we start from first principles.
graph LR
subgraph W1["Week 1 — Foundations"]
S1["S1<br/>From Monitoring<br/>to Observability"]
S2["S2<br/>The Three Pillars<br/>(MELT)"]
end
subgraph W2["Week 2 — Metrics & Reliability"]
S3["S3<br/>Metrics & Reliability<br/>SLIs / SLOs"]
S4["S4<br/>Tracing &<br/>Structured Logs"]
end
subgraph W3["Week 3 — OpenTelemetry I"]
S5["S5<br/>OpenTelemetry<br/>Deep Dive Pt. 1"]
S6["S6<br/>The OTel<br/>Collector"]
end
subgraph W4["Week 4 — Correlation & Scale"]
S7["S7<br/>E2E Correlation<br/>& Dashboarding"]
S8["S8<br/>The Pain<br/>of Scale"]
end
subgraph W5["Week 5 — Advanced Observability"]
S9["S9<br/>Profiling<br/>& RUM"]
S10["S10<br/>Pipeline &<br/>Control Plane"]
end
subgraph W6["Week 6 — TelemetryFlow (TFO)"]
S11["S11<br/>TFO Architecture<br/>& Core Concepts"]
S12["S12<br/>TFO Hands-on<br/>& Prod Readiness"]
end
S1 --> S2 --> S3 --> S4 --> S5 --> S6
S6 --> S7 --> S8 --> S9 --> S10 --> S11 --> S12
style W1 fill:#d1fae5
style W2 fill:#bfdbfe
style W3 fill:#fde68a
style W4 fill:#fecaca
style W5 fill:#ddd6fe
style W6 fill:#fef3c7
The journey is intentionally cumulative: every session builds on the previous one, ending in a hands-on lab on TelemetryFlow Observability where learners apply dynamic routing, tail-based sampling, and cost-control policies.
The full authoritative outline lives in COURSE.md. Below is a quick-reference summary.
The evolution from monoliths → SOA → microservices, why monitoring (known-knowns) breaks down, the Rumsfeld Matrix, and the mindset shift: monitoring asks "is it broken?", observability asks "why is it broken?"
Metrics, Events, Logs, and Traces, with Profiles as an additional signal beyond the four MELT pillars. Cardinality, the danger of siloed pillars, and a decision matrix for choosing the right pillar per scenario.
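To make the cardinality point concrete, here is a minimal sketch (not taken from the course decks) of why a single high-cardinality label is risky: the worst-case number of time series for one metric is the product of the distinct values of each label. The label names and counts below are hypothetical.

```python
from math import prod


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical label sets for an HTTP request counter.
low_cardinality = {"method": 5, "status_class": 5, "service": 20}
# Same metric with one unbounded label added.
high_cardinality = {**low_cardinality, "user_id": 100_000}

print(max_series(low_cardinality))   # 500
print(max_series(high_cardinality))  # 50000000
```

Real systems rarely hit the full product, since not every label combination occurs, but the bound shows how quickly identifiers such as user IDs inflate storage.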
Metric types (Counter, Gauge, Histogram, Summary), the RED and USE methods, Google's Golden Signals, SLIs/SLOs/SLAs, error budgets, and an intro to PromQL.
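As an illustration of how an SLO turns into an error budget, the following sketch assumes a simple request-based SLI (good requests divided by total requests). The function name, the 99.9% target, and the request counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize a request-based SLO: measured SLI, allowed failures, budget used."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the share of requests the SLO allows to fail.
    allowed_failures = (1 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures
    return {
        "sli": 1 - failed_requests / total_requests,
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1 - consumed,
    }


# Hypothetical month: 1,000,000 requests, 250 failures, 99.9% availability SLO.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"SLI: {report['sli']:.5f}")
print(f"Budget consumed: {report['budget_consumed']:.0%}")
```

With these numbers the SLO allows roughly 1,000 failed requests, so 250 failures use about a quarter of the budget.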
The hop-by-hop debugging nightmare, trace anatomy (Trace/Span/Parent IDs), W3C Trace Context (traceparent/tracestate), structured (JSON) logging, and the magic link: injecting trace_id into logs.
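The "magic link" can be sketched in a few lines: parse the incoming traceparent header, then attach its trace_id and span_id to every structured log line. This is a simplified illustration that only accepts version 00 of the header. The helper names parse_traceparent and log_json are hypothetical, and a real service would normally rely on an OpenTelemetry SDK propagator instead of hand-written parsing.

```python
import json
import re
from typing import Optional

# version-traceid-parentid-flags, lowercase hex, as laid out by W3C Trace Context.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[dict]:
    """Parse a version-00 traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid; other versions are out of scope for this sketch.
    if version != "00" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def log_json(message: str, level: str, ctx: Optional[dict]) -> str:
    """Render one structured log line, adding trace_id/span_id when a context exists."""
    record = {"level": level, "message": message}
    if ctx:
        record["trace_id"] = ctx["trace_id"]
        record["span_id"] = ctx["span_id"]
    return json.dumps(record)


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(log_json("payment declined", "error", ctx))
```

Because the log line now carries the same trace_id as the span, a log search for that ID lands on exactly the requests shown in the trace view.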
The fragmented past (vendor lock-in), OpenTelemetry as the CNCF standard (merging OpenTracing + OpenCensus), API vs SDK, semantic conventions, resources, and auto- vs manual-instrumentation.
Why a Collector exists, the Receivers → Processors → Exporters → Connectors pipeline, agent vs gateway deployment modes, YAML config authoring, and the "YAML hell" problem at scale.
The correlation matrix, exemplars, trace-to-log and log-to-metric linking, dashboard anti-patterns (spaghetti, wall of green), alerting on symptoms not causes, and curing alert fatigue.
The dark side of telemetry at scale: cost explosion (ingestion pricing), the "observability tax", noise fatigue, pipeline rigidity (restart-induced data drops), vendor lock-in, and the case for a control plane.
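A back-of-the-envelope calculation shows why ingestion pricing escalates so quickly. The sketch below assumes billing per ingested gigabyte (1 GB = 10^9 bytes) over a 30-day month. The event rate, event size, and price are invented for illustration and do not reflect any vendor's actual pricing.

```python
def monthly_ingest_cost(events_per_sec: float, bytes_per_event: float,
                        price_per_gb: float, keep_ratio: float = 1.0) -> float:
    """Estimated 30-day cost when billing is per ingested GB (1 GB = 10**9 bytes)."""
    seconds_per_month = 30 * 24 * 3600
    ingested_gb = events_per_sec * bytes_per_event * seconds_per_month * keep_ratio / 1e9
    return ingested_gb * price_per_gb


# Hypothetical workload: 20,000 log events/s, 500 bytes each, 0.50 per GB.
full = monthly_ingest_cost(20_000, 500, 0.50)
# Same workload after dropping 70% of events before they reach the backend.
filtered = monthly_ingest_cost(20_000, 500, 0.50, keep_ratio=0.3)

print(f"unfiltered: {full:,.0f} per month")
print(f"filtered:   {filtered:,.0f} per month")
```

Under these assumptions the unfiltered stream is about 25,920 GB per month, which is why filtering before the backend has such a direct effect on the bill.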
Continuous profiling (eBPF, SIGPROF), the holy grail of profiling + traces, frontend observability, Real User Monitoring (RUM), Core Web Vitals, session replay, and synthetics.
The data-plane limit, the control-plane concept, policy-based routing, head- vs tail-based sampling, GitOps for telemetry, and the blueprint: App → Agent → Control Plane/Smart Gateway → Backend.
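The difference between head- and tail-based sampling is easier to see in code. In this simplified sketch, the head sampler decides from the trace ID alone, before any span has finished, while the tail sampler waits for the whole trace and keeps it if it contains an error or is slow. The span fields (start_ms, end_ms, status) and the thresholds are assumptions for illustration, not the format used by any specific collector.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide at trace start; the same trace ID always gets the same answer."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) / 2**32
    return bucket < rate  # bucket is in [0, 1)


def tail_sample(spans: list[dict], latency_ms_threshold: float) -> bool:
    """Tail-based: decide after all spans arrived; keep errors and slow traces."""
    if not spans:
        return False
    has_error = any(span.get("status") == "error" for span in spans)
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or duration >= latency_ms_threshold


fast_ok = [{"start_ms": 0, "end_ms": 40, "status": "ok"}]
slow_ok = [{"start_ms": 0, "end_ms": 900, "status": "ok"},
           {"start_ms": 10, "end_ms": 880, "status": "ok"}]
failed = [{"start_ms": 0, "end_ms": 30, "status": "error"}]

print(tail_sample(fast_ok, 500))  # False
print(tail_sample(slow_ok, 500))  # True
print(tail_sample(failed, 500))   # True
```

The trade-off is visible in the signatures: head sampling needs no buffering but cannot know whether a trace will turn out interesting, while tail sampling must hold every span of a trace until the decision is made.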
Where TFO sits (between OTel Agent and backends), its four core features — dynamic pipeline management, smart routing, centralized configuration (GitOps), and cost control — and how each maps to a Session 8 pain point.
Real-world use cases (cost optimization, multi-IO, tail-based sampling at the gateway), a full hands-on lab (deploy app → OTel Agent → TFO → backends), dynamic routing without agent restarts, and migration best practices.
Each session ships as a polished PDF slide deck in pdf/. Decks are published as sessions are finalized.
| Session | Title | File | Status |
|---|---|---|---|
| 1 | From Monitoring to Observability | Mastering Observability - Session 01.pdf | ✅ Available |
| 2 | The Three Pillars of Observability (MELT) | Mastering Observability - Session 02.pdf | ✅ Available |
| 3 | Metrics & Reliability (SLIs/SLOs) | Mastering Observability - Session 03.pdf | ✅ Available |
| 4 | Distributed Tracing & Structured Logging | Mastering Observability - Session 04.pdf | ✅ Available |
| 5–12 | OpenTelemetry, Scale, TFO | Coming soon | 🚧 In progress |
The slide count per session (~12–15) is documented in COURSE.md next to each session header.
- Read the COURSE.md syllabus top-to-bottom to understand the arc.
- Open the pdf/ deck for each session in order; take notes.
- Complete each session's assignment before moving on.
- Use the discussion prompts to reflect on real incidents in your own systems.
- Schedule two sessions per week (≈ 90 minutes each) over 6 weeks.
- Present the PDF deck; pause for the discussion slides and hands-on exercises.
- Assign the per-session assignment as homework.
- Encourage learners to share "war stories" — Sessions 1 and 8 are explicitly designed for this.
- Adapt the TFO hands-on lab (Session 12) to your local environment.
See CONTRIBUTING.md for how to propose fixes, translations, additional examples, or new sessions.
graph TB
Start([Start: Beginner]) --> Track1{Background?}
Track1 -->|Backend Dev| Core[Weeks 1-2:<br/>Foundations + Tracing]
Track1 -->|DevOps / SRE| Full[Full Course:<br/>All 12 Sessions]
Track1 -->|Manager / Architect| Strategic[Sessions 1, 2, 7, 8,<br/>10, 11 — Strategy Track]
Core --> OTel[Week 3:<br/>OpenTelemetry]
OTel --> Scale[Weeks 4-5:<br/>Correlation & Scale]
Scale --> TFO[Week 6:<br/>TelemetryFlow Hands-on]
Full --> TFO
Strategic --> TFO
TFO --> Graduate([End: Observability<br/>Practitioner])
style Start fill:#d1fae5
style Graduate fill:#fef3c7
style TFO fill:#bfdbfe
- Monitoring vs. Observability · Rumsfeld Matrix (known-knowns → unknown-unknowns)
- The Four Pillars (MELT) · Cardinality trade-offs · High- vs low-cardinality data
- RED method (Rate, Errors, Duration) · USE method (Utilization, Saturation, Errors)
- Google's Golden Signals · SLIs / SLOs / SLAs · Error budgets
- W3C Trace Context (traceparent, tracestate) · Context propagation across HTTP/gRPC
- Structured (JSON) logging · Trace-to-log correlation via trace_id
- Exemplars (linking metrics → traces)
- API vs SDK distinction · No-op defaults · Semantic conventions
- Auto-instrumentation vs manual spans · Resource & service attributes
- Collector pipeline: Receivers · Processors · Exporters · Connectors
- Agent (sidecar/DaemonSet) vs Gateway deployment modes
- Ingestion-based pricing cost models · The "observability tax"
- Head-based vs tail-based sampling · Policy-based routing
- Data plane vs control plane · GitOps for telemetry
- Dynamic pipeline management (no-restart reconfiguration)
- Smart routing (by content / metadata / policy)
- Centralized, GitOps-compatible configuration
- Cost control via noise filtering before expensive backends
Every session ends with either an assignment or an interactive discussion. Highlights:
| Session | Type | Activity |
|---|---|---|
| 1 | Assignment | Identify 3 "unknown-unknowns" in your current project. |
| 2 | Hands-on + Assignment | Categorize scenarios to pillars; map your stack to MELT. |
| 3 | Hands-on | Write a PromQL query to measure an SLI. |
| 4 | Hands-on | Read a waterfall chart; correlate a slow span to an error log. |
| 5 | Hands-on | Auto-instrument a sample app with the OTel SDK. |
| 6 | Hands-on | Author a Collector config (Receiver → Batch → OTLP Exporter). |
| 7 | Hands-on | Navigate an E2E incident: Alert → Metric → Trace → Log. |
| 9 | Hands-on | Analyze a CPU hotspot correlated with a slow span. |
| 11 | Assignment | Design a data-reduction policy for your org. |
| 12 | Hands-on Lab | Deploy app → OTel Agent → TFO → backends; route status=500 to a premium backend without restarting agents. |
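The Session 12 lab goal (send status=500 traffic to a premium backend without restarting agents) can be pictured with a toy router. This is a conceptual sketch only: the Rule and Router classes and the backend names are hypothetical and are not TelemetryFlow's actual API or configuration format. It shows the idea that routing rules are data which can be swapped while the process keeps running.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]  # predicate over one telemetry record
    backend: str                     # destination chosen when the predicate holds


class Router:
    """First matching rule wins; rules can be replaced while the router keeps running."""

    def __init__(self, rules: list[Rule], default_backend: str) -> None:
        self.rules = rules
        self.default_backend = default_backend

    def route(self, record: dict) -> str:
        for rule in self.rules:
            if rule.matches(record):
                return rule.backend
        return self.default_backend


router = Router(rules=[], default_backend="standard")
event = {"service": "checkout", "status": 500}
print(router.route(event))  # standard

# A new policy arrives (for example from a control plane); no restart is needed.
router.rules = [Rule("errors-to-premium", lambda r: r.get("status") == 500, "premium")]
print(router.route(event))  # premium
```

Keeping the rules separate from the routing loop is the design choice that matters here: the data path stays up while the policy changes underneath it.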
telemetryflow-academy/
├── README.md # This file — course overview & quickstart
├── COURSE.md # Master syllabus (authoritative, 12 sessions)
├── CHANGELOG.md # Version history for course materials
├── CONTRIBUTING.md # How to contribute / report issues / suggest sessions
├── CODE_OF_CONDUCT.md # Community standards
├── LICENSE # CC BY-NC-SA 4.0 legal code
├── assets/ # Shared brand & license assets
│ ├── by-nc-sa.png # CC BY-NC-SA 4.0 badge (PNG)
│ └── by-nc-sa.svg # CC BY-NC-SA 4.0 badge (SVG)
├── docs/ # Companion documentation (platform & training context)
│ ├── telemetryflow-platform.md # TelemetryFlow Observability platform overview
│ └── enterprise-training.md # Commercial enterprise observability training
└── pdf/ # Published slide decks (one per session)
├── Mastering Observability - Session 01.pdf
├── Mastering Observability - Session 02.pdf
├── Mastering Observability - Session 03.pdf
├── Mastering Observability - Session 04.pdf
├── ...
└── Mastering Observability - Session 12.pdf
The docs/ folder contains companion documentation that gives learners
deeper context around the platform that caps the course (Sessions 11 & 12).
📄 docs/telemetryflow-platform.md —
TelemetryFlow Observability (TFO) — Platform Context
A learner-friendly adaptation of the canonical TelemetryFlow ecosystem overview. Read this before Sessions 11 & 12 to understand what TFO is, how it is architected, and how every concept from Sessions 1–10 maps to a concrete TFO feature. Includes the full Course → TFO Concept Map table.
🎓 docs/enterprise-training.md —
Enterprise Observability Training
Commercial, instructor-led training that builds on this open course. Covers role-based tracks (SRE, DevOps, Leadership), delivery formats (on-site / virtual / hybrid), customization, TFO Practitioner and TFO Architect certification, and the engagement model. Vendor-neutral by default.
For teams that need to move from concept to production rollout, TelemetryFlow Academy offers commercial, instructor-led enterprise training that builds on this open course (CC BY-NC-SA 4.0).
| | Open Course | Enterprise Training |
|---|---|---|
| Delivery | Self-paced | Instructor-led, cohort-based |
| Customization | Fixed 12-session curriculum | Your stack, your services, your incidents |
| Certification | — | TFO Practitioner / TFO Architect |
| Support | Community (GitHub Issues) | Private channel + office hours |
| License / terms | CC BY-NC-SA 4.0 (free) | Commercial services agreement |
Popular tracks include Observability Jumpstart (2d), OpenTelemetry Immersion (3d), SLO Engineering (2d), and the flagship TFO Production Rollout (5d). Vendor-neutral by default — TFO-specific modules are optional.
👉 See docs/enterprise-training.md for the full
track catalog, delivery formats, certification pathways, and how to
request a quote.
This is a documentation/course repository — there is nothing to compile or run. To use the materials locally:
# Clone the repository
git clone https://github.com/telemetryflow/telemetryflow-academy.git
cd telemetryflow-academy
# Open the syllabus
open COURSE.md # macOS; on Linux use: xdg-open COURSE.md
# Open a slide deck
open "pdf/Mastering Observability - Session 01.pdf"

💡 The PDF decks are the primary learning artifact. They are self-contained and can be read without any additional tooling beyond a PDF viewer.
We welcome contributions that improve clarity, fix errata, add translations, or propose additional sessions and exercises. Please read CONTRIBUTING.md first.
Some contribution ideas:
- 🌍 Translations of slide decks or the syllabus
- 🧪 Hands-on lab templates (Docker Compose, sample apps) to pair with a session
- 📝 Errata & clarifications via issues or PRs
- 💡 New session proposals for a follow-up course (e.g. Observability 201)
Participation in TelemetryFlow Academy — in issues, PRs, discussions, or any live session — is governed by our Code of Conduct. Please be excellent to each other.
TelemetryFlow Academy — Mastering Observability 101
© 2026 Telemetri Data Indonesia
Licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license — CC BY-NC-SA 4.0.
You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material
Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- NonCommercial — You may not use the material for commercial purposes without explicit permission.
- ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
See the full legal code in LICENSE.
Dwi Fahni Denni — Author & Instructor, Mastering Observability 101
TelemetryFlow Observability (TFO) — https://telemetryflow.id
- The OpenTelemetry project and the broader CNCF observability community for establishing the vendor-neutral standard this course teaches.
- The SRE and observability pioneers whose frameworks shape this curriculum — Google SRE (SLIs/SLOs/error budgets), the RED and USE methods (Tom Wilkie and Brendan Gregg), and the W3C Trace Context working group.
- The Telemetri Data Indonesia community for reviewing and supporting the course.
- Every learner who shares a "war story" — those stories make the material better for the next cohort.
Built with ❤️ by Telemetri Data Indonesia
Observability is a property of a system, not a tool. Go build observable systems.

