feat(trace-store): TraceTrendAnalyzer — cross-run eval score trend detection #15

@nanookclaw

Summary

TraceFlow Lite has solid per-run eval infrastructure: EvalRecord captures an EvalDecision (PASS/REVISE/FALLBACK) and scores for every trace, and TraceStore.list_traces() returns chronologically ordered records. What it lacks is a cross-run trend layer: if the eval pass rate drops from 95% to 80% across 20 runs, nothing surfaces it.

The gap: snapshot evaluation tells you what a single run did. Cross-run trend analysis tells you whether results are getting better or worse over time.

Proposed: TraceTrendAnalyzer

Add a new persistence/trend.py module that uses only the standard library (no new dependencies):

from dataclasses import dataclass
from statistics import linear_regression
from persistence.trace_store import TraceStore

@dataclass
class TraceRunSummary:
    """Per-run behavioral summary derived from existing TraceStore data."""
    run_index: int          # 0-based order by created_at
    trace_id: str
    pass_rate: float        # PASS count / total evals for the run
    revise_rate: float      # REVISE fraction
    fallback_rate: float    # FALLBACK fraction
    avg_score: float | None # mean of scores dict values, if present

@dataclass
class TraceTrendReport:
    window: int                    # number of runs analyzed
    pass_rate_slope: float         # OLS slope (positive = improving)
    pass_rate_direction: str       # "improving" | "worsening" | "stable"
    any_regression: bool           # True if pass_rate_slope < -threshold
    summaries: list[TraceRunSummary]

class TraceTrendAnalyzer:
    """Detect behavioral drift across sequential TraceFlow runs."""
    
    def __init__(self, store: TraceStore, regression_threshold: float = 0.01):
        self.store = store
        self.threshold = regression_threshold
    
    def analyze(self, window: int = 20) -> TraceTrendReport:
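        # Assumes list_traces() returns oldest-first; if it returns newest-first,
        # reverse the list so that a negative slope means worsening.
        traces = self.store.list_traces(limit=window)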
        if len(traces) < 2:
            raise ValueError(f"Need ≥ 2 traces, got {len(traces)}")
        
        summaries = []
        for i, trace in enumerate(traces):
            evals = self.store.get_evals(trace.trace_id)
            if not evals:
                pass_rate = 1.0 if trace.status.value == "success" else 0.0
                summaries.append(TraceRunSummary(i, trace.trace_id, pass_rate, 0.0, 0.0, None))
                continue
            total = len(evals)
            passes = sum(1 for e in evals if e.decision.value == "pass")
            revises = sum(1 for e in evals if e.decision.value == "revise")
            fallbacks = sum(1 for e in evals if e.decision.value == "fallback")
            scores = [s for e in evals if e.scores for s in e.scores.values()]
            summaries.append(TraceRunSummary(
                i, trace.trace_id,
                passes / total, revises / total, fallbacks / total,
                sum(scores) / len(scores) if scores else None
            ))
        
        xs = [float(s.run_index) for s in summaries]
        ys = [s.pass_rate for s in summaries]
        slope, _ = linear_regression(xs, ys)
        
        if slope > self.threshold:
            direction = "improving"
        elif slope < -self.threshold:
            direction = "worsening"
        else:
            direction = "stable"
        
        return TraceTrendReport(
            window=len(summaries),
            pass_rate_slope=slope,
            pass_rate_direction=direction,
            any_regression=direction == "worsening",
            summaries=summaries
        )

CLI Integration

Add a trend command to the existing CLI:

python -m traceflow trend --window 20

# Output:
# TraceTrend (last 20 runs)
# Pass rate slope: -0.023/run  ⚠️ WORSENING
# Direction: worsening
# Regression detected: True
# 
# Run  trace_id           pass_rate  direction
# 0    abc123...          0.95       -
# 1    def456...          0.90       ↓
# ...
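
One way the report above could be rendered is sketched below. `render_trend` is a hypothetical helper, the column widths are arbitrary, and the per-run arrow is assumed to compare each run's pass rate with the previous run's, which is how the sample output reads. The warning marker on the slope line is omitted for brevity.

```python
def render_trend(rows: list[tuple[str, float]], slope: float, direction: str) -> str:
    """Format (trace_id, pass_rate) rows, ordered oldest first, as plain text."""
    regressed = direction == "worsening"
    lines = [
        f"TraceTrend (last {len(rows)} runs)",
        f"Pass rate slope: {slope:+.3f}/run",
        f"Direction: {direction}",
        f"Regression detected: {regressed}",
        "",
        f"{'Run':<5}{'trace_id':<19}{'pass_rate':<11}direction",
    ]
    previous = None
    for index, (trace_id, pass_rate) in enumerate(rows):
        # Arrow compares this run with the previous one; "-" for the first run or no change.
        if previous is None or pass_rate == previous:
            arrow = "-"
        elif pass_rate > previous:
            arrow = "↑"
        else:
            arrow = "↓"
        short_id = trace_id[:6] + "..."
        lines.append(f"{index:<5}{short_id:<19}{pass_rate:<11.2f}{arrow}")
        previous = pass_rate
    return "\n".join(lines)


report_text = render_trend([("abc123def", 0.95), ("def456abc", 0.90)], -0.05, "worsening")
print(report_text)
```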

CI Gate Integration

# In your CI pipeline:
from persistence.trend import TraceTrendAnalyzer

# `store` is an already-constructed TraceStore instance.
analyzer = TraceTrendAnalyzer(store, regression_threshold=0.02)
report = analyzer.analyze(window=10)
if report.any_regression:
    raise SystemExit(f"Eval regression detected: {report.pass_rate_slope:.3f}/run")

This complements TraceFlow's existing change-safety model (quality gates prevent bad individual runs) with longitudinal detection (trend analysis catches slow degradation across runs).

Implementation Notes

  • Reads existing TraceStore and EvalRecord data — zero schema changes
  • Pure stdlib: statistics.linear_regression (Python 3.10+)
  • Additive only — no modifications to existing modules
  • Extends the "safe to ship" mission: ship safely AND detect if reliability is drifting post-ship
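
Should the project need to run on interpreters older than 3.10, the slope could be computed by hand with the same ordinary least squares formula. The sketch below is an assumption about how such a fallback might look; `ols_slope` is a hypothetical name, and the proposed module itself would also need its type hints adjusted for older versions.

```python
from statistics import fmean
from typing import Sequence


def ols_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Ordinary least squares slope of ys on xs, computed as Sxy / Sxx."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


# Four runs losing 0.05 of pass rate per run.
slope = ols_slope([0.0, 1.0, 2.0, 3.0], [0.95, 0.90, 0.85, 0.80])
```

`fmean` is available from Python 3.8, so this variant only lowers the version floor slightly; it is shown mainly to make the slope computation explicit.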

Reference

The cross-session behavioral drift gap is documented across 65+ independent agent evaluation frameworks in: PDR in Production v2.6 — DOI: 10.5281/zenodo.19376669. TraceFlow's eval gate + trace persistence infrastructure is exactly the right foundation for this layer.
