feat(trace-store): TraceTrendAnalyzer — cross-run eval score trend detection

## Summary

TraceFlow Lite has strong per-run eval infrastructure: `EvalRecord` captures an `EvalDecision` (PASS/REVISE/FALLBACK) and `scores` for every trace, and `TraceStore.list_traces()` returns chronologically ordered records. What is missing is a cross-run trend layer: if the eval pass rate drops from 95% to 80% across 20 runs, nothing surfaces it.

The gap: **snapshot evaluation tells you what a single run did. Cross-run trend analysis tells you whether the system is getting better or worse over time.**

## Proposed: `TraceTrendAnalyzer`

Add a new `persistence/trend.py` module that uses only the standard library (no new dependencies):

```python
from dataclasses import dataclass
from statistics import linear_regression
from persistence.trace_store import TraceStore

@dataclass
class TraceRunSummary:
    """Per-run behavioral summary derived from existing TraceStore data."""
    run_index: int          # 0-based order by created_at
    trace_id: str
    pass_rate: float        # PASS / (PASS + REVISE + FALLBACK)
    revise_rate: float      # REVISE fraction
    fallback_rate: float    # FALLBACK fraction
    avg_score: float | None # mean of scores dict values, if present

@dataclass
class TraceTrendReport:
    window: int                    # number of runs analyzed
    pass_rate_slope: float         # OLS slope (positive = improving)
    pass_rate_direction: str       # "improving" | "worsening" | "stable"
    any_regression: bool           # True if pass_rate_slope < -threshold
    summaries: list[TraceRunSummary]

class TraceTrendAnalyzer:
    """Detect behavioral drift across sequential TraceFlow runs."""
    
    def __init__(self, store: TraceStore, regression_threshold: float = 0.01):
        self.store = store
        self.threshold = regression_threshold
    
    def analyze(self, window: int = 20) -> TraceTrendReport:
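        # Assumes list_traces() returns the records oldest-first; a newest-first
        # order would flip the sign of the fitted slope.
        traces = self.store.list_traces(limit=window)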
        if len(traces) < 2:
            raise ValueError(f"Need ≥ 2 traces, got {len(traces)}")
        
        summaries = []
        for i, trace in enumerate(traces):
            evals = self.store.get_evals(trace.trace_id)
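            if not evals:
                # No eval records for this trace: fall back to the trace status
                # as a 0/1 pass rate and leave the other rates at zero.
                pass_rate = 1.0 if trace.status.value == "success" else 0.0
                summaries.append(TraceRunSummary(i, trace.trace_id, pass_rate, 0.0, 0.0, None))
                continue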
            total = len(evals)
            passes = sum(1 for e in evals if e.decision.value == "pass")
            revises = sum(1 for e in evals if e.decision.value == "revise")
            fallbacks = sum(1 for e in evals if e.decision.value == "fallback")
            scores = [s for e in evals if e.scores for s in e.scores.values()]
            summaries.append(TraceRunSummary(
                i, trace.trace_id,
                passes / total, revises / total, fallbacks / total,
                sum(scores) / len(scores) if scores else None
            ))
        
        xs = [float(s.run_index) for s in summaries]
        ys = [s.pass_rate for s in summaries]
        slope, _ = linear_regression(xs, ys)
        
        if slope > self.threshold:
            direction = "improving"
        elif slope < -self.threshold:
            direction = "worsening"
        else:
            direction = "stable"
        
        return TraceTrendReport(
            window=len(summaries),
            pass_rate_slope=slope,
            pass_rate_direction=direction,
            any_regression=direction == "worsening",
            summaries=summaries
        )
```

## CLI Integration

Add a `trend` command to the existing CLI:

```bash
python -m traceflow trend --window 20

# Output:
# TraceTrend (last 20 runs)
# Pass rate slope: -0.023/run  ⚠️ WORSENING
# Direction: worsening
# Regression detected: True
# 
# Run  trace_id           pass_rate  direction
# 0    abc123...          0.95       -
# 1    def456...          0.90       ↓
# ...
```
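
One way the command could render its output is sketched below. `format_trend` and its arguments are hypothetical names, and the per-row arrow compares each run with the previous one, which is an assumption about what the `direction` column is meant to show.

```python
def format_trend(slope: float, direction: str, rows: list[tuple[str, float]]) -> str:
    """Render a trend report as text. `rows` holds (trace_id, pass_rate), oldest first."""
    lines = [
        f"TraceTrend (last {len(rows)} runs)",
        f"Pass rate slope: {slope:+.3f}/run",
        f"Direction: {direction}",
        f"Regression detected: {direction == 'worsening'}",
        "",
        f"{'Run':<5}{'trace_id':<19}{'pass_rate':<11}direction",
    ]
    prev = None
    for i, (trace_id, rate) in enumerate(rows):
        # Compare with the previous run; the first run has nothing to compare to.
        if prev is None:
            arrow = "-"
        elif rate > prev:
            arrow = "↑"
        elif rate < prev:
            arrow = "↓"
        else:
            arrow = "="
        lines.append(f"{i:<5}{trace_id:<19}{rate:<11.2f}{arrow}")
        prev = rate
    return "\n".join(lines)


# Illustrative values only.
report_text = format_trend(-0.023, "worsening", [("abc123...", 0.95), ("def456...", 0.90)])
print(report_text)
```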

## CI Gate Integration

```python
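# In your CI pipeline:
from persistence.trend import TraceTrendAnalyzer

# `store` is an already-constructed TraceStore for the project under test.
analyzer = TraceTrendAnalyzer(store, regression_threshold=0.02)
report = analyzer.analyze(window=10)
if report.any_regression:
    # A string argument makes SystemExit print the message to stderr and exit with status 1.
    raise SystemExit(f"Eval regression detected: {report.pass_rate_slope:.3f}/run")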
```

This complements TraceFlow's existing change-safety model (quality gates prevent bad individual runs) with longitudinal detection (trend analysis catches slow degradation across runs).

## Implementation Notes

- Reads existing `TraceStore` and `EvalRecord` data — **zero schema changes**
- Pure stdlib: `statistics.linear_regression` (Python 3.10+)
- Additive only — no modifications to existing modules
- Extends the "safe to ship" mission: ship safely AND detect if reliability is drifting post-ship

## Reference

The cross-session behavioral drift gap is documented across 65+ independent agent evaluation frameworks in: *PDR in Production* v2.6 — DOI: [10.5281/zenodo.19376669](https://doi.org/10.5281/zenodo.19376669). TraceFlow's eval gate + trace persistence infrastructure is exactly the right foundation for this layer.
