Summary
TraceFlow Lite has excellent per-run eval infrastructure: EvalRecord captures an EvalDecision (PASS/REVISE/FALLBACK) and scores for every trace, and TraceStore.list_traces() returns chronologically ordered records. But there's no cross-run trend layer: if the eval pass rate drops from 95% to 80% across 20 runs, nothing surfaces it.
The gap: snapshot evaluation tells you what a single run did. Cross-run trend analysis tells you whether behavior is getting better or worse over time.
Proposed: TraceTrendAnalyzer
Add a new persistence/trend.py module with pure stdlib (no new dependencies):
from dataclasses import dataclass
from statistics import linear_regression  # Python 3.10+

from persistence.trace_store import TraceStore


@dataclass
class TraceRunSummary:
    """Per-run behavioral summary derived from existing TraceStore data."""
    run_index: int            # 0-based order by created_at
    trace_id: str
    pass_rate: float          # PASS / (PASS + REVISE + FALLBACK)
    revise_rate: float        # REVISE fraction
    fallback_rate: float      # FALLBACK fraction
    avg_score: float | None   # mean of scores dict values, if present


@dataclass
class TraceTrendReport:
    window: int               # number of runs analyzed
    pass_rate_slope: float    # OLS slope (positive = improving)
    pass_rate_direction: str  # "improving" | "worsening" | "stable"
    any_regression: bool      # True if pass_rate_slope < -threshold
    summaries: list[TraceRunSummary]
class TraceTrendAnalyzer:
    """Detect behavioral drift across sequential TraceFlow runs."""

    def __init__(self, store: TraceStore, regression_threshold: float = 0.01):
        self.store = store
        self.threshold = regression_threshold

    def analyze(self, window: int = 20) -> TraceTrendReport:
        # list_traces() is expected to return records in chronological order
        traces = self.store.list_traces(limit=window)
        if len(traces) < 2:
            raise ValueError(f"Need ≥ 2 traces, got {len(traces)}")

        summaries = []
        for i, trace in enumerate(traces):
            evals = self.store.get_evals(trace.trace_id)
            if not evals:
                # No eval records: fall back to the trace's own status
                pass_rate = 1.0 if trace.status.value == "success" else 0.0
                summaries.append(TraceRunSummary(i, trace.trace_id, pass_rate, 0.0, 0.0, None))
                continue

            total = len(evals)
            passes = sum(1 for e in evals if e.decision.value == "pass")
            revises = sum(1 for e in evals if e.decision.value == "revise")
            fallbacks = sum(1 for e in evals if e.decision.value == "fallback")
            scores = [s for e in evals if e.scores for s in e.scores.values()]
            summaries.append(TraceRunSummary(
                i, trace.trace_id,
                passes / total, revises / total, fallbacks / total,
                sum(scores) / len(scores) if scores else None,
            ))

        # Fit pass_rate against run index; the slope is the per-run change
        xs = [float(s.run_index) for s in summaries]
        ys = [s.pass_rate for s in summaries]
        slope, _ = linear_regression(xs, ys)

        if slope > self.threshold:
            direction = "improving"
        elif slope < -self.threshold:
            direction = "worsening"
        else:
            direction = "stable"

        return TraceTrendReport(
            window=len(summaries),
            pass_rate_slope=slope,
            pass_rate_direction=direction,
            any_regression=direction == "worsening",
            summaries=summaries,
        )
CLI Integration
Add a trend command to the existing CLI:
python -m traceflow trend --window 20
# Output:
# TraceTrend (last 20 runs)
# Pass rate slope: -0.023/run ⚠️ WORSENING
# Direction: worsening
# Regression detected: True
#
# Run  trace_id   pass_rate  direction
# 0    abc123...  0.95       -
# 1    def456...  0.90       ↓
# ...
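One way the trend command and its per-run table could be wired up is sketched below with argparse. This is only an illustration: build_parser and format_rows are hypothetical names, the existing TraceFlow CLI may be structured differently, and the sample rows are made up to match the output above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; the real TraceFlow CLI wiring may differ."""
    parser = argparse.ArgumentParser(prog="traceflow")
    sub = parser.add_subparsers(dest="command", required=True)
    trend = sub.add_parser("trend", help="show cross-run eval trend")
    trend.add_argument("--window", type=int, default=20)
    return parser


def format_rows(rows: list[tuple[str, float]]) -> list[str]:
    """Render (trace_id, pass_rate) pairs with a run-over-run arrow."""
    lines = ["Run  trace_id   pass_rate  direction"]
    previous = None
    for index, (trace_id, pass_rate) in enumerate(rows):
        if previous is None or pass_rate == previous:
            arrow = "-"  # first run, or no change from the previous run
        else:
            arrow = "↑" if pass_rate > previous else "↓"
        short_id = trace_id[:6] + "..."
        lines.append(f"{index:<4} {short_id:<10} {pass_rate:<10.2f} {arrow}")
        previous = pass_rate
    return lines


# Parse the same arguments as `python -m traceflow trend --window 20`.
args = build_parser().parse_args(["trend", "--window", "20"])
print(f"TraceTrend (last {args.window} runs)")
for line in format_rows([("abc123ffff", 0.95), ("def456ffff", 0.90)]):
    print(line)
```

The arrow column here compares each run with the one before it, which is a per-row hint only; the headline direction would still come from the fitted slope in the report.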
CI Gate Integration
# In your CI pipeline (store is an existing TraceStore instance):
from persistence.trend import TraceTrendAnalyzer

analyzer = TraceTrendAnalyzer(store, regression_threshold=0.02)
report = analyzer.analyze(window=10)
if report.any_regression:
    raise SystemExit(f"Eval regression detected: {report.pass_rate_slope:.3f}/run")
This complements TraceFlow's existing change-safety model (quality gates prevent bad individual runs) with longitudinal detection (trend analysis catches slow degradation across runs).
Implementation Notes
- Reads existing TraceStore and EvalRecord data, with zero schema changes
- Pure stdlib: statistics.linear_regression (Python 3.10+)
- Additive only — no modifications to existing modules
- Extends the "safe to ship" mission: ship safely AND detect if reliability is drifting post-ship
Reference
The cross-session behavioral drift gap is documented across 65+ independent agent evaluation frameworks in: PDR in Production v2.6 — DOI: 10.5281/zenodo.19376669. TraceFlow's eval gate + trace persistence infrastructure is exactly the right foundation for this layer.