LogFlow · Anomaly Engine

Real-time log-stream anomaly detection with incident correlation and visual blast-radius mapping across service dependency graphs.

Why this exists

Modern distributed systems emit thousands of log lines per second. When an incident hits, on-call engineers spend the first ten minutes of an outage answering four questions:

What just changed?
Is this one problem or three?
Who is actually broken — and who is just reacting?
What's about to get hit next?

Off-the-shelf log tools (Splunk, ELK, Datadog) answer question 1 well, but they treat services as independent silos and they surface raw alerts, not stories. You get fourteen separate errors across payments, checkout, cart, ledger, and notifications and have to correlate them in your head.

LogFlow Anomaly Engine is a standalone demo of what a blast-radius- and story-aware log engine looks like. It ingests a live log stream, runs three anomaly detectors in parallel, applies user-defined alert rules, infers the service dependency graph from the logs themselves, clusters every signal into incidents, ranks the root cause, projects the downstream blast radius onto a live D3 force graph, and exposes every layer for drill-down — trace waterfalls, per-service detail drawers, template explorer, cross-service correlation heatmap, full-text log search, and a live alert-rules manager.

Feature highlights

Detection & correlation

Rolling Z-score detector on per-service volume, errors, and p95 latency
IsolationForest detector (scikit-learn) on multi-dimensional service feature vectors
Drain log-template miner (~120 LoC from scratch) detecting brand-new templates after warmup
Alert rules engine with user-defined thresholds + dwell time + live CRUD
Incident correlation engine that clusters signals by time + graph proximity and ranks the root service via weighted reverse BFS
EWMA forecaster per service for a 10-second error-rate forecast

Dependency modeling

Live service dependency graph built incrementally from trace_id / parent-service co-occurrence
Edge decay (configurable) so the graph reflects current behavior
Weighted BFS blast radius with hop distances, bounded depth

UI

Full dark dashboard with a sticky header, KPI bar, and 8 interactive panels
Interactive D3 force graph — click any node for a service drawer; active incident lights up its root with a pulsing ring and every edge in its blast radius turns red
Incident panel — ranked active incidents, root service, state machine (active / resolving / resolved), severity bar, impacted-service chips with hop distances
Service drawer — p50/p95/p99 latency sparklines, forecast, upstream/downstream callers, top templates, recent errors (click any to open its trace waterfall)
Trace waterfall modal — proper span timeline for any trace_id
Drain template explorer with live service/level filters and bar-chart counts
Cross-service correlation heatmap (Pearson on per-second error rates)
Log search — rolling-window full-text search with service + level filters
Live alert-rules manager — toggle, add, delete rules from the UI
Chaos banner showing which scenarios are currently injecting failure
Seven scenario injectors — payments outage, catalog latency, inventory deadlock, notifications flood, search index failure, checkout retry storm, recommendations cold start — each with realistic downstream cascade effects

Observability

/metrics Prometheus endpoint (gauges + counters, per-service labels)

Quickstart

git clone https://github.com/rayancheca/logflow-anomaly-engine.git
cd logflow-anomaly-engine
./run.sh
# → backend  http://127.0.0.1:8766
# → frontend http://127.0.0.1:5174

The run.sh launcher creates the Python venv, installs dependencies, boots FastAPI, then boots Vite.

Manual setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn backend.main:app --host 127.0.0.1 --port 8766

# in another shell
cd frontend
npm install
npm run dev

Trigger a scenario

Click any button under Inject failure scenario. Every scenario injects realistic downstream cascade effects so the incident engine has something to correlate:

Scenario	Primary	Cascade
`payments_outage`	4.5× errors / 5× latency on `payments`	`ledger`, `checkout`, `notifications` warn
`catalog_latency`	8× p95 latency on `catalog`	`search`, `cart` warn
`inventory_deadlock`	3.5× errors on `inventory`	`checkout`, `fulfillment` warn
`notifications_flood`	5.5× warn flood on `notifications`	—
`search_index_fail`	3× errors on `search`	`gateway` warn
`checkout_retry_storm`	3.5× warn on `checkout`	`cart`, `payments` warn
`recommendations_cold`	6× latency on `recommendations`	—

Within seconds the incident panel fills, the service graph pulses the root node (not just a random symptom), and every downstream service is highlighted with its hop distance. Click any chip to open that service's drawer; click any error line to open its trace waterfall.

Architecture

             ┌────────────────────┐
             │  Synthetic Log     │      (scenarios + cascade)
             │  Generator         │
             └─────────┬──────────┘
                       │ LogRecord(JSON)
                       ▼
             ┌────────────────────┐
             │  Async Stream Bus  │      (Kafka-shaped API, in-process)
             └─────────┬──────────┘
                       │
       ┌───────────────┼────────────────┐
       ▼               ▼                ▼
 ┌────────────┐  ┌─────────────┐  ┌──────────────┐
 │  DuckDB    │  │  Drain      │  │  Service     │
 │  Storage   │  │  Template   │  │  Graph       │
 │ (columnar) │  │  Miner      │  │  Builder     │
 └─────┬──────┘  └──────┬──────┘  └──────┬───────┘
       │                │                │
       └────────────────┼────────────────┘
                        ▼
              ┌────────────────────┐
              │  3-layer Anomaly   │  z-score · iforest · new-template
              │    Detector +      │
              │  Rules Engine      │  user thresholds + dwell
              └─────────┬──────────┘
                        │ Anomaly + blast_radius
                        ▼
              ┌────────────────────┐
              │  Incident Engine   │  cluster by time + graph proximity
              │  + Forecaster      │  rank root cause by reverse BFS
              └─────────┬──────────┘
                        ▼
              ┌────────────────────┐
              │  FastAPI + WS hub  │  ticks + REST deep-dive endpoints
              └─────────┬──────────┘
                        ▼
              ┌────────────────────┐
              │  React / D3        │
              │  Dashboard         │
              └────────────────────┘

Tech stack

Layer	Choice	Notes
Stream bus	`asyncio.Queue` with Kafka-shaped API	swap for `confluent-kafka` by replacing one file
Storage	DuckDB	embedded columnar OLAP — same execution model as ClickHouse
Detection	scikit-learn IsolationForest + rolling Z-score + Drain + rules engine	four complementary paradigms
Correlation	Custom incident engine, reverse-BFS root-cause ranking	clusters raw anomalies into stories
Forecast	EWMA (α=0.35) with running variance	cheap 10s-ahead error-rate prediction
API	FastAPI + WebSockets + Prometheus text export	native async, single-digit-ms WS ticks
Frontend	React · TypeScript · Vite	strict types, fast HMR
Styling	Tailwind v3 with a token palette	dark neon theme
Viz	D3.js	force graph + area/line timeline + correlation heatmap + waterfall

Technical deep-dive

1. Log ingestion

The generator simulates a 12-service e-commerce DAG (gateway → auth/catalog/search → cart → checkout → payments → ledger, with recommendations, fulfillment, notifications, analytics and sessions branching off). Each trace walks the DAG randomly, emitting a correlated LogRecord per service with realistic latency, error distributions, and parent/child linkage.

Scenarios inject a primary degradation plus a cascade: a payments outage also poisons ledger, checkout, and notifications with smaller secondary effects and a 2-second lag, so the incident engine has something non-trivial to correlate. Without the cascade you'd just see one red node.

Records flow through a tiny Kafka-shaped pub/sub (stream_bus.py). Two independent consumer groups drain the topic — one writes to DuckDB, the other feeds the Drain miner.

2. Detection

Rate detector. A RollingStat keeps the last 60 samples per service for log volume, error count, and p95 latency. On each tick we compute a Z-score and fire a rate_spike if σ ≥ 3 with a minimum absolute floor.

Feature detector. Every tick we build a per-service feature matrix [total, errors, warns, mean_lat, p95_lat, error_rate], take log1p to dampen heavy tails, and fit a fresh IsolationForest with contamination=0.06.

Structural detector. The Drain class is a compact ~130-line implementation of the Drain log parsing algorithm (He et al., ICWS 2017). Log lines are tokenised, numeric / hex / id tokens are replaced by <*>, and tokens walk a fixed-depth tree to a leaf where the most similar existing template is either merged or a new one is created. Brand-new templates after warmup fire a new_template anomaly.

Rules engine. User-defined rules of the form metric op threshold for duration_s on service fire rule_fired anomalies on transition — the engine tracks per-rule dwell state so a single transient blip doesn't page. Three defaults ship enabled: error_rate > 8% (any), payments p95 > 400ms, notifications > 500/min.

3. Service graph + blast radius

The graph is inferred from the logs, not configured. Every tick we pull the (parent_service, service) pairs for the current window, bump edge weights, and decay the whole graph by 0.92 so the shape reflects current traffic. Blast radius is a weighted forward BFS bounded to max_depth=4, ignoring edges below weight=0.1.

4. Incident correlation + root cause

This is the piece that upgrades raw alerts into stories.

Two anomalies join the same incident iff they fire within a 25-second window and are graph-adjacent (one's blast radius contains the other's service, or their service sets touch). An incident holds a set of member services and their union blast radius, transitions through an active → resolving → resolved state machine, and survives for 45s after its last signal.

Root-cause ranking. Failures propagate from callee → caller: when a leaf service breaks, every upstream caller that depends on it starts erroring. So the root cause is the most-reached service — the one the other members reach by forward blast. For each member s we count how many other members have s in their forward blast radius, tie-breaking by average hop depth. The service with the highest score wins and gets crowned as root_service — then the UI pulses that node, not whichever service happened to alert first.

5. Deep-dive views

Service drawer fetches /api/service/{name} every 1.5s and renders a latency sparkline (p50/p95/p99), forecast banner, upstream/downstream edge tables, top drain templates, and recent errors (each a button that opens the trace waterfall).
Trace waterfall fetches /api/trace/{trace_id} and renders a proper span-indented timeline with per-service colour + latency labels.
Template explorer polls /api/templates every 4s with live service/level filters.
Correlation heatmap polls /api/correlations every 3.5s and draws a D3 Pearson heatmap across per-second error rates.
Log search debounces against /api/search with service+level filters; each hit is a button that opens its trace waterfall.
Rules manager reads rules from the WebSocket tick, writes back via POST /api/rules, DELETE /api/rules/{id}, POST /api/rules/{id}/toggle.

API reference

Method	Path	Purpose
`GET`	`/api/health`	health probe
`GET`	`/api/stats`	full `StreamMessage` snapshot
`GET`	`/api/scenarios`	list scenarios + currently active
`POST`	`/api/scenarios/{name}`	inject a failure scenario
`GET`	`/api/incidents`	historical + active incidents
`GET`	`/api/traces/recent`	recent trace ids (errors_only optional)
`GET`	`/api/trace/{trace_id}`	full trace waterfall
`GET`	`/api/service/{name}`	per-service deep metrics + forecast
`GET`	`/api/templates`	drain templates with filters
`GET`	`/api/search`	rolling-window log search
`GET`	`/api/correlations`	service × service error-rate correlation matrix
`GET`	`/api/rules`	alert rules
`POST`	`/api/rules`	add a rule
`DELETE`	`/api/rules/{id}`	delete a rule
`POST`	`/api/rules/{id}/toggle`	enable/disable a rule
`GET`	`/metrics`	Prometheus text export
`WS`	`/ws/stream`	live push of `StreamMessage` JSON

Project layout

logflow-anomaly-engine/
├── backend/
│   ├── main.py            FastAPI app + REST + WebSocket + /metrics
│   ├── pipeline.py        async orchestrator (generator → bus → storage → detector → incident engine)
│   ├── generator.py       12-service log generator with cascading scenarios
│   ├── stream_bus.py      Kafka-shaped async pub/sub
│   ├── storage.py         DuckDB store + window queries, trace/search/correlation/service helpers
│   ├── drain.py           Drain log-template miner (~130 loc)
│   ├── detector.py        rolling z-score + IsolationForest + new-template
│   ├── incidents.py       incident clustering + reverse-BFS root cause
│   ├── rules.py           alert-rules engine with dwell-time transitions
│   ├── forecast.py        EWMA forecaster
│   ├── graph.py           service graph builder + blast-radius BFS
│   ├── schemas.py         Pydantic wire formats
│   └── config.py          tunables
├── frontend/
│   └── src/
│       ├── App.tsx
│       ├── hooks/useLiveStream.ts
│       └── components/
│           ├── Header.tsx
│           ├── KpiBar.tsx
│           ├── ServiceGraph.tsx
│           ├── IncidentPanel.tsx
│           ├── LogStream.tsx
│           ├── TimelineChart.tsx
│           ├── ScenarioControls.tsx
│           ├── ServiceDrawer.tsx
│           ├── TraceWaterfall.tsx
│           ├── TemplateExplorer.tsx
│           ├── CorrelationHeatmap.tsx
│           ├── SearchBar.tsx
│           └── RulesManager.tsx
├── docs/dashboard.png
├── run.sh
├── requirements.txt
└── README.md

License

MIT.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
backend		backend
docs		docs
frontend		frontend
tests		tests
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
README.md		README.md
requirements.txt		requirements.txt
run.sh		run.sh
state.md		state.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LogFlow · Anomaly Engine

Why this exists

Feature highlights

Detection & correlation

Dependency modeling

UI

Observability

Quickstart

Manual setup

Trigger a scenario

Architecture

Tech stack

Technical deep-dive

1. Log ingestion

2. Detection

3. Service graph + blast radius

4. Incident correlation + root cause

5. Deep-dive views

API reference

Project layout

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LogFlow · Anomaly Engine

Why this exists

Feature highlights

Detection & correlation

Dependency modeling

UI

Observability

Quickstart

Manual setup

Trigger a scenario

Architecture

Tech stack

Technical deep-dive

1. Log ingestion

2. Detection

3. Service graph + blast radius

4. Incident correlation + root cause

5. Deep-dive views

API reference

Project layout

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages