DDFS Day One

An end-to-end Datadog observability stack for an LLM agent — git clone to fully-instrumented in 15 minutes — wired up the way I'd actually want to run one in production.

Lives at github.com/jonathan-major/DDFS.

A small but representative stack — FastAPI service, LangGraph triage agent, Postgres with pgvector, Redis, Celery worker, static HTML chat UI — instrumented with Datadog APM, Logs, and LLM Observability. The point is to see what "AI-era observability" actually looks like when you take it seriously: APM traces, log/trace correlation, LLM Observability spans, five custom metrics, a seven-tile dashboard, two monitors, and an SLO — all defined as JSON in the repo and pushed via a small Python script.


Why this exists

Most "LLM observability" content falls into one of two failure modes. Either it's a screenshot tour of a vendor dashboard with no working code behind it, or it's a thousand-line code dump with no story about what you'd actually look at when something goes wrong. I wanted to see the whole loop end to end:

  • A non-trivial agent (multi-node LangGraph with two different Claude models, RAG retrieval, a self-eval step, and a human-escalation path)
  • Real APM auto-instrumentation across the boring middle of the stack (FastAPI / psycopg / redis / Celery)
  • Custom application metrics flowing via DogStatsD with bounded cardinality (5 nodes × 4 intent values + service/env constants — no user IDs, no question content in tag values)
  • Dashboards / monitors / SLOs as code, not click-ops

This repo is the result. It tells two stories at once:

  1. AI-era observability. A LangGraph customer-support triage agent over a synthetic feature-flag product corpus ("Bramble," 29 chunks). Every node is wrapped with Datadog LLM Observability decorators (@workflow / @task / @llm / @retrieval / @tool). The trace tree shows prompt, response, model, token counts, cost in dollars, and self-evaluation scores from the confidence-check node, all in one navigable view.
  2. The boring-but-essential rest of the stack. APM auto-instrumentation across FastAPI, Celery, Postgres, Redis, and the Anthropic SDK. Structured JSON logs with dd.trace_id correlation injected by ddtrace. Five custom DogStatsD metrics (ddfs.agent.requests, ddfs.agent.cost_usd, ddfs.agent.confidence_score, ddfs.agent.escalations, and a ddfs.agent.node.duration_ms distribution with per-node p95 percentiles). Infrastructure metrics via Datadog Agent container labels.
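The bounded-cardinality rule for metric tags can also be enforced in code rather than by convention. The following is a minimal sketch, not the repo's implementation: `ALLOWED_TAGS` and `safe_tags` are hypothetical names, the tag keys `node` and `intent` are assumptions, and the node names are taken from this README and may differ from the actual tag values.

```python
# Hypothetical allowlist: the node and intent values come from this README,
# but the tag keys and this helper are assumptions, not the repo's code.
ALLOWED_TAGS = {
    "node": {"classify_intent", "retrieve_docs", "draft_answer",
             "score_confidence", "dispatch"},
    "intent": {"question", "bug_report", "feature_request", "unknown"},
}


def safe_tags(**raw: str) -> list[str]:
    """Return DogStatsD-style 'key:value' tags, dropping anything unbounded."""
    tags = []
    for key, value in sorted(raw.items()):
        allowed = ALLOWED_TAGS.get(key)
        if allowed is None:
            continue  # unknown key (e.g. user_id): never becomes a tag
        # Unexpected values collapse to a single bucket instead of
        # creating a new time series per distinct string.
        tags.append(f"{key}:{value if value in allowed else 'other'}")
    return tags
```

Passing every tag list through a helper like this means a stray user ID or free-text value cannot create new time series by accident.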

What you see in fifteen minutes

git clone https://github.com/jonathan-major/DDFS
cd DDFS
cp .env.example .env             # add DD_API_KEY, DD_APP_KEY, ANTHROPIC_API_KEY
make up                          # docker compose up -d (full stack + Datadog agent)
make seed-docs                   # embed the 29-chunk Bramble corpus into pgvector
make demo                        # fire 20 synthetic questions through the agent
make dd-apply                    # push dashboards/monitors/SLO to your org via API

Open the DDFS Day One — Agent Overview dashboard in your Datadog org and you should see:

  • Requests / min — live traffic from the demo run
  • Median cost / conversation — around $0.006-0.008 with Sonnet+Haiku
  • Escalation rate — fraction of conversations the LLM-as-judge sent to a human
  • Cost per conversation over time — a timeseries you can correlate with make demo runs
  • p95 latency by LangGraph node — five curves; the draft_answer span (the graph's Sonnet draft node) runs ~3-4× slower than the Haiku spans (classify_intent, score_confidence). That one chart tells the "right model for the job" story without a paragraph.
  • Confidence score distribution — average and p10 of the self-evaluation score; the p10 line drops sharply on out-of-corpus questions
  • Intents handled (top list): question, bug_report, feature_request, unknown, as tagged by the classifier

Then drop into APM → ddfs-day-one-api → recent trace. One flame graph shows the entire conversation: FastAPI request → LangGraph workflow → five task spans → nested llm spans for the Haiku and Sonnet calls → retrieval span for pgvector → tool spans for the Redis enqueue and Postgres write. Click any log line; one click later you're at the trace that emitted it.
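The log-to-trace jump works because each JSON log line carries the trace ID. Here is a minimal sketch of a formatter that would do this, assuming ddtrace's log injection is enabled and attaches `dd.trace_id` and `dd.span_id` attributes to each log record. `JsonFormatter` is a hypothetical name, and the repo's actual logging setup may differ.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, carrying trace IDs if present."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # With log injection enabled, ddtrace attaches these attributes
            # to the record; "0" covers logs emitted outside a trace.
            "dd.trace_id": str(getattr(record, "dd.trace_id", "0")),
            "dd.span_id": str(getattr(record, "dd.span_id", "0")),
        }
        return json.dumps(payload)
```

Attach it with `handler.setFormatter(JsonFormatter())` on whatever handler writes to stdout; the Datadog Agent then picks up the `dd.trace_id` field when it ingests the container logs.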

The first interesting surprise comes unprompted: Watchdog Insights flags draft_answer as a 6.3× p95 latency outlier (~5s vs 800ms baseline) without anyone configuring an anomaly rule. That's the kind of thing you don't get for free from a roll-your-own stack.


Architecture

                           [ static HTML chat UI ]
                                       │
                                       ▼
                              [ FastAPI service ]
                                       │
                       ┌───────────────┼────────────────┐
                       ▼               ▼                ▼
                [ LangGraph agent ]  [ Postgres +    [ Redis +
                  classify_intent     pgvector ]      Celery escalation queue ]
                  retrieve_docs                              │
                  draft                                      ▼
                  score_confidence                  [ Celery worker:
                  dispatch                            index_doc,
                       │                              drain_escalations ]
                       ▼
                [ Anthropic Claude ]
                  Haiku  → classify_intent, score_confidence
                  Sonnet → draft (the user-facing generation)

Every Python service runs under ddtrace-run, so FastAPI / psycopg / redis / Celery / Anthropic SDK are auto-instrumented without any code changes. The LangGraph nodes are wrapped with the ddtrace.llmobs decorators so spans land in LLM Observability with the right semantic shape. Both Sonnet and Haiku are used deliberately — Haiku for cheap classification and self-evaluation, Sonnet for the user-facing draft — and the per-node latency widget makes the tradeoff visible.
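To make the model split and the escalation path concrete, here is a plain-Python stand-in for the graph. It is a sketch under stated assumptions, not the LangGraph code in agent/graph.py: `TriageState`, `MODEL_FOR_NODE`, `run_pipeline`, and the `ESCALATION_THRESHOLD` value are all hypothetical, and the model tiers are labels rather than real API model IDs.

```python
from typing import Callable, TypedDict


class TriageState(TypedDict, total=False):
    question: str
    intent: str
    docs: list[str]
    draft: str
    confidence: float
    escalated: bool


# Model routing as described above; tier names are labels, not API model IDs.
MODEL_FOR_NODE = {
    "classify_intent": "haiku",
    "score_confidence": "haiku",
    "draft": "sonnet",
}

ESCALATION_THRESHOLD = 0.7  # hypothetical cut-off, not taken from the repo


def dispatch(state: TriageState) -> TriageState:
    """Send low-confidence answers to the human escalation path."""
    escalated = state["confidence"] < ESCALATION_THRESHOLD
    return {**state, "escalated": escalated}


def run_pipeline(
    state: TriageState,
    nodes: list[Callable[[TriageState], TriageState]],
) -> TriageState:
    """Apply the nodes in order, threading the state through each one."""
    for node in nodes:
        state = node(state)
    return state
```

In the real stack the ordering and the conditional branch live in the StateGraph definition; the point here is only that the cheap model handles the two judgment calls and the expensive one handles the single user-facing generation.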


Repo layout

DDFS/
├── README.md
├── apps/
│   ├── api/                          FastAPI + LangGraph + Celery
│   │   ├── main.py                   ASGI entry
│   │   ├── agent/
│   │   │   ├── graph.py              StateGraph definition
│   │   │   ├── nodes.py              5 nodes: classify / retrieve / draft / score / dispatch
│   │   │   ├── tools.py              pgvector retriever, Redis escalation, Postgres record
│   │   │   ├── state.py              TypedDict state
│   │   │   └── instrumentation.py    DD LLM Obs + DogStatsD wrapper
│   │   ├── routes/agent.py           POST /agent/ask
│   │   ├── db/pgvector_setup.py      Schema + 29-chunk Bramble corpus
│   │   ├── tasks/index_docs.py       Celery worker
│   │   ├── requirements.txt
│   │   └── Dockerfile
│   └── web/                          static HTML chat UI served by nginx
├── monitoring/
│   ├── dashboards/agent-overview.json     7-tile dashboard
│   ├── monitors/                     escalation-rate + LLM-cost-anomaly
│   └── slos/agent-availability.json  99.5% success-rate SLO
├── scripts/
│   └── apply_monitoring.py           pushes monitoring/*.json to DD org via REST API
├── demo/questions.txt                8 synthetic customer questions
├── docs/
│   ├── DAY-ONE.md                    minute-by-minute narrative
│   └── LIVE-DATADOG-TOUR.md          guided tour of the live Datadog views
├── docker-compose.yml
├── Makefile
└── .env.example

How the instrumentation is wired

Every LangGraph node uses a @task decorator from agent/instrumentation.py. The decorator does two things in one wrap: it opens a ddtrace.llmobs task span (so the node shows up as a child of the top-level @workflow in the LLM Observability trace tree), and it times the function body so per-node p95 latency lands as a DogStatsD distribution metric tagged by node name.

@task(name="classify_intent")
def classify_intent(state):
    ...
    metric_increment("ddfs.agent.requests", tags=[f"intent:{intent}"])
    return {...}

The same pattern applies to the rest of the graph: @llm wraps every Claude call, @retrieval wraps the pgvector query, and @tool wraps the Redis and Postgres writes. The agent code stays readable; none of the instrumentation requires touching the LangGraph state shape or the model invocation logic.
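The timing half of that wrapper can be sketched in a few lines. This is an illustration, not the code in agent/instrumentation.py: `timed_task` and `emit_distribution` are hypothetical names, the `node:` tag key is an assumption, and the in-memory `EMITTED` list stands in for a real DogStatsD client so the example runs without an agent.

```python
import functools
import time
from typing import Any, Callable

# Stand-in sink so the sketch runs without a Datadog agent. In the real stack
# this would presumably be datadog.statsd.distribution(name, value, tags=...).
EMITTED: list[tuple[str, float, list[str]]] = []


def emit_distribution(name: str, value: float, tags: list[str]) -> None:
    EMITTED.append((name, value, tags))


def timed_task(name: str) -> Callable:
    """Hypothetical decorator: time a node and emit its duration in ms."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            try:
                # The repo's wrapper also opens an LLM Observability task
                # span around this call; that part is omitted here.
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                emit_distribution(
                    "ddfs.agent.node.duration_ms", elapsed_ms, [f"node:{name}"]
                )

        return wrapper

    return decorator
```

Emitting in a `finally` block means a node that raises still records its duration, so slow failures show up in the p95 chart instead of vanishing from it.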


Monitoring as code

monitoring/*.json is the source of truth for the dashboards, monitors, and SLO. scripts/apply_monitoring.py reads each file, looks up the resource by name in your Datadog org, and either POSTs a new one or PUTs an update. The script is idempotent — re-running it never duplicates resources — and small enough to read in one sitting.

make dd-apply          # pushes monitoring/*.json via the REST API
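The look-up-then-create-or-update decision is what makes the script idempotent. Below is a minimal sketch of that decision as a pure function, assuming the remote resources have already been listed via the Datadog API. `plan_upsert` and `ENDPOINTS` are hypothetical names, not taken from scripts/apply_monitoring.py, and the actual HTTP calls and authentication headers are left out.

```python
from typing import Optional

# Datadog v1 endpoints per resource kind, plus the field that holds the
# human-readable name (dashboards use "title"; monitors and SLOs use "name").
ENDPOINTS = {
    "dashboard": ("/api/v1/dashboard", "title"),
    "monitor": ("/api/v1/monitor", "name"),
    "slo": ("/api/v1/slo", "name"),
}


def plan_upsert(kind: str, local: dict, remote: list[dict]) -> tuple[str, str]:
    """Return (HTTP method, path) needed to make the org match `local`."""
    base, name_field = ENDPOINTS[kind]
    wanted = local[name_field]
    existing_id: Optional[str] = None
    for item in remote:
        if item.get(name_field) == wanted:
            existing_id = str(item["id"])
            break
    if existing_id is None:
        return ("POST", base)  # nothing with this name yet: create
    return ("PUT", f"{base}/{existing_id}")  # same name exists: update in place
```

Matching on name is simple but means renaming a resource in the JSON creates a new one and leaves the old one behind, which is one reason to move to Terraform once state tracking matters.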

For a team that wants state tracking and an audit trail, the dashboard and monitor JSON drops cleanly into Terraform's datadog_dashboard_json and datadog_monitor_json resources, and the SLO definition maps field by field onto datadog_service_level_objective. The script is the lightweight starting point, not the ceiling.


What this is not

  • Not a production app. The agent is small and the corpus is fake. The point is the instrumentation pattern, not the agent logic.
  • Not a Datadog tutorial. The Datadog docs already do that. This repo is a working stack you can extend.

Cost note

Datadog's Free tier covers up to 5 infrastructure hosts with core collection and visualization only — APM, Logs, and LLM Observability are paid features. The 14-day Pro trial is enough to run this stack end-to-end. The dashboards, monitors, and SLO defined as JSON here persist regardless of the trial state.

Datadog does not publish LLM Observability pricing on its public pricing page (the SKU is listed under "AI Observability" but without rates) — check directly with Datadog before pointing this at a production workload.
