feat: implement bench skill-coverage#617
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces the bench skill-coverage subcommand, which measures agent skill activation and correlates it with outcomes across a sweep. It includes the command specification, CLI argument parsing, report generation logic, audit redaction integration, and comprehensive integration tests. The reviewer feedback highlights several key areas for improvement: resolving an absolute local file URI used in a documentation link, optimizing trajectory file loading by merging redundant read passes, and caching skill descriptions in a map to avoid repetitive filesystem scans and manifest parsing.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| let root_cfg_opt = if let Ok(traj) = load_trajectory(&traj_path) { | ||
| if let Some(manifest) = &traj.info.manifest { | ||
| serde_json::from_value::<crate::config::schema::RootCfg>( | ||
| manifest.config_redacted.clone(), | ||
| ) | ||
| .ok() | ||
| } else { | ||
| None | ||
| } | ||
| } else { | ||
| None | ||
| }; | ||
| if let Some(root_cfg) = root_cfg_opt { | ||
| instance_enabled = root_cfg.skills.enabled; | ||
| instance_skill_paths.clone_from(&root_cfg.skills.paths); | ||
| } | ||
|
|
||
| if let Ok(traj) = load_trajectory(&traj_path) { | ||
| // Parse active skills from trajectory other metadata | ||
| if let Some(raw_skills) = traj.info.other.get("active_skills") { | ||
| if let Ok(skills) = serde_json::from_value::< | ||
| Vec<crate::skills::ActiveSkillManifest>, | ||
| >(raw_skills.clone()) | ||
| { | ||
| // Merge active skills by name uniquely | ||
| for skill in skills { | ||
| if !active_skills.iter().any(|s| s.name == skill.name) { | ||
| active_skills.push(skill); | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
The trajectory file is loaded and parsed twice for each trajectory path. This is highly inefficient as trajectory files can be very large. Load the trajectory once and perform both the configuration extraction and active skills parsing on the same loaded instance.
if let Ok(traj) = load_trajectory(&traj_path) {
if let Some(manifest) = &traj.info.manifest {
if let Ok(root_cfg) = serde_json::from_value::<crate::config::schema::RootCfg>(
manifest.config_redacted.clone(),
) {
instance_enabled = root_cfg.skills.enabled;
instance_skill_paths.clone_from(&root_cfg.skills.paths);
}
}
// Parse active skills from trajectory other metadata
if let Some(raw_skills) = traj.info.other.get("active_skills") {
if let Ok(skills) = serde_json::from_value::<
Vec<crate::skills::ActiveSkillManifest>,
>(raw_skills.clone())
{
// Merge active skills by name uniquely
for skill in skills {
if !active_skills.iter().any(|s| s.name == skill.name) {
active_skills.push(skill);
}
}
}
}
}References
- Before suggesting I/O optimizations such as merging multiple data processing passes, verify whether the functions actually perform redundant disk access. If a function processes data already loaded into memory by upstream components, merging it with an I/O-bound function is unnecessary and adds complexity without performance gain.
|
|
||
| - Editing, generating, or recommending skill manifests | ||
| - Co-activation analytics across a corpus (`bench skill-coverage` is future) | ||
| - Co-activation analytics across a corpus (implemented in [bench skill-coverage](file:///c:/Users/markm/rust_swe_agent/docs/spec-skill-coverage.md)) |
There was a problem hiding this comment.
The link to spec-skill-coverage.md uses an absolute local file URI (file:///c:/Users/markm/...). This should be a relative link so that it works correctly when viewed on GitHub or other environments.
| - Co-activation analytics across a corpus (implemented in [bench skill-coverage](file:///c:/Users/markm/rust_swe_agent/docs/spec-skill-coverage.md)) | |
| - Co-activation analytics across a corpus (implemented in [bench skill-coverage](spec-skill-coverage.md)) |
| let mut sorted_ids: Vec<String> = sweep.instances.keys().cloned().collect(); | ||
| sorted_ids.sort(); | ||
|
|
||
| let mut instance_data: Vec<InstanceSkillData> = Vec::new(); |
There was a problem hiding this comment.
Define a skill_descriptions map to cache the descriptions of scanned skills during the first pass. This will allow us to avoid redundant filesystem scans and manifest parsing in the subsequent loop.
| let mut instance_data: Vec<InstanceSkillData> = Vec::new(); | |
| let mut instance_data: Vec<InstanceSkillData> = Vec::new(); | |
| let mut skill_descriptions = HashMap::new(); |
| if let Ok(registry) = crate::skills::SkillRegistry::scan_paths(expanded_paths) { | ||
| for manifest in registry.manifests() { | ||
| eligible_skills.insert(manifest.name.clone()); | ||
| } | ||
| } |
There was a problem hiding this comment.
Populate the skill_descriptions map with the skill names and descriptions during the initial registry scan.
if let Ok(registry) = crate::skills::SkillRegistry::scan_paths(expanded_paths) {
for manifest in registry.manifests() {
eligible_skills.insert(manifest.name.clone());
skill_descriptions.insert(manifest.name.clone(), manifest.description.clone());
}
}| for name in &d.eligible_skills { | ||
| all_raw_skills.entry(name.clone()).or_insert_with(|| { | ||
| let expanded_paths = d | ||
| .eligible_paths | ||
| .iter() | ||
| .map(crate::skills::expand_skill_path_pub) | ||
| .collect::<Vec<_>>(); | ||
| let description = crate::skills::SkillRegistry::scan_paths(expanded_paths) | ||
| .ok() | ||
| .and_then(|r| { | ||
| r.manifests() | ||
| .iter() | ||
| .find(|m| &m.name == name) | ||
| .map(|m| m.description.clone()) | ||
| }) | ||
| .unwrap_or_else(|| "Configured skill".to_owned()); | ||
| SkillUniverseEntry { | ||
| name: name.clone(), | ||
| description, | ||
| } | ||
| }); | ||
| } |
There was a problem hiding this comment.
Use the cached skill_descriptions map to retrieve descriptions instead of scanning the filesystem and parsing the skill manifests repeatedly for each eligible skill.
for name in &d.eligible_skills {
all_raw_skills.entry(name.clone()).or_insert_with(|| {
let description = skill_descriptions
.get(name)
.cloned()
.unwrap_or_else(|| "Configured skill".to_owned());
SkillUniverseEntry {
name: name.clone(),
description,
}
});
}There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1905aa5b39
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| per_instance: t.per_instance, | ||
| })?; | ||
| if is_json { | ||
| println!("{}", serde_json::to_string_pretty(&report)?); |
There was a problem hiding this comment.
Preserve the artifact header in JSON stdout
When --format json is used, this prints the bare SkillCoverageReport, so stdout omits the artifact_kind and schema_version fields promised by the new JSON schema in docs/spec-skill-coverage.md. The file artifact is written through artifact::to_writer_pretty, but CLI consumers that parse stdout cannot validate or route the report as a skill_coverage artifact; serialize stdout with the same artifact wrapper or include those fields in the report type.
Useful? React with 👍 / 👎.
| .entry(redacted.clone()) | ||
| .or_insert_with(|| SkillUniverseEntry { | ||
| name: redacted, | ||
| description: entry.description.clone(), |
There was a problem hiding this comment.
Redact skill descriptions before writing coverage artifacts
For configured-but-inactive skills this value comes from scanning the on-disk SKILL.md description and is copied into both stdout and skill-coverage.json without passing through the redactor. In sweeps where a skill description contains a token, key, or other sensitive literal, this re-exposes data that active skill provenance normally redacts in trajectories; redact the description with the same trajectory surface before inserting it into the report.
Useful? React with 👍 / 👎.
| if !active_skills.iter().any(|s| s.name == skill.name) { | ||
| active_skills.push(skill); |
There was a problem hiding this comment.
Count repeated activations across retry trajectories
When an instance has multiple trajectory files (for example run-*.traj.json from retries or resumes) and the same skill activates in more than one run, this de-duplicates by skill name before aggregation. As a result total_activations and the reason counts report one activation for that instance even though the field is documented as the total number of activations; keep per-run activations for those counters while still counting instances_activated distinctly.
Useful? React with 👍 / 👎.
| // Group configurations to detect drift | ||
| let mut config_groups: HashMap<String, (Vec<String>, Vec<String>)> = HashMap::new(); // fingerprint -> (instance_ids, paths) | ||
| for d in &instance_data { | ||
| if d.skills_enabled { |
There was a problem hiding this comment.
Include disabled skill configs in drift detection
When a sweep mixes instances with skills enabled and disabled, this condition skips the disabled instances entirely, so config_groups.len() can remain 1 and skill_set_drift is omitted even though the effective skill configuration changed. This hides a drift scenario that materially affects activation rates; include the enabled/disabled state in the fingerprint rather than dropping disabled configurations.
Useful? React with 👍 / 👎.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## trunk #617 +/- ##
==========================================
+ Coverage 85.35% 85.43% +0.08%
==========================================
Files 115 118 +3
Lines 66346 68039 +1693
==========================================
+ Hits 56627 58127 +1500
- Misses 9719 9912 +193 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6a75667898
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let mut sorted_paths = d.eligible_paths.clone(); | ||
| sorted_paths.sort(); | ||
| let paths_str = sorted_paths.join(","); | ||
| let fingerprint = format!("enabled={}:{}", d.skills_enabled, paths_str); |
There was a problem hiding this comment.
Redact skill paths before serializing drift groups
When a sweep has skill-set drift, these path strings come from the raw resolved [skills] config and are copied into the drift fingerprint (and the same unredacted vector is later emitted as groups[].paths). If an operator's configured skill directory contains a token or other sensitive literal in its path, bench skill-coverage writes it to stdout and skill-coverage.json, unlike the redacted active-skill provenance; run these paths through the trajectory redactor before building the fingerprint/output.
Useful? React with 👍 / 👎.
| for traj_path in traj_paths { | ||
| if let Ok(traj) = load_trajectory(&traj_path) { |
There was a problem hiding this comment.
Fail instead of silently dropping unreadable trajectories
When a trajectory file exists but cannot be parsed or classified, this if let Ok skips it and still pushes the instance as though it had no active skills. For completed sweeps with a truncated run-*.traj.json or schema mismatch, the command exits 0 while underreporting activations, activation rates, and drift; propagate the load error instead of treating corrupted trajectories as inactivity.
Useful? React with 👍 / 👎.
| } | ||
| } | ||
|
|
||
| let redactor = Redactor::default_enabled(); |
There was a problem hiding this comment.
Use the sweep's redaction policy for scanned skills
When the original sweep configured custom redaction literals or patterns, this fresh default redactor does not know about them. The fields later redacted from on-disk SKILL.md manifests, such as inactive skill names and descriptions, can therefore leak values that the trajectories' own redaction policy would have masked; construct the redactor from the sweep config before serializing coverage output.
Useful? React with 👍 / 👎.
| for d in &instance_data { | ||
| let is_active = d.active_skills_unique.iter().any(|s| s.name == raw_name); | ||
| let is_eligible = d.eligible_skills.contains(&raw_name); |
There was a problem hiding this comment.
Match active skills after applying trajectory redaction
For a skill name that is redacted in trajectory provenance, active_skills_unique contains the redacted marker while raw_name comes from the unredacted scanned skill registry. This comparison then treats the activated skill as inactive (or nondeterministically overwrites it with a duplicate redacted key), so secret-bearing skill names produce incorrect activation counts even though the report is supposed to aggregate by redacted name; compare against the redacted skill name or normalize both sides consistently.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8f5becae1b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| .iter() | ||
| .map(crate::skills::expand_skill_path_pub) | ||
| .collect::<Vec<_>>(); | ||
| if let Ok(registry) = crate::skills::SkillRegistry::scan_paths(expanded_paths) { |
There was a problem hiding this comment.
Propagate skill registry scan failures
When an enabled skill path contains an unreadable, malformed, or duplicate SKILL.md, SkillRegistry::scan_paths returns an error but this if let Ok treats that configuration as having no eligible skills. The command then exits successfully with configured-but-inactive skills missing from skill_universe/by_skill, which corrupts the coverage report instead of surfacing the broken skill library to the operator.
Useful? React with 👍 / 👎.
| if in_scope { | ||
| *per_bucket_total.entry(bname).or_default() += 1; |
There was a problem hiding this comment.
Count every bucket instance in outcome totals
When skill-set drift means a skill is only configured for some instances, this denominator is incremented only for instances where the skill is active or eligible. The report schema documents by_outcome.instances_total as the total instances in the bucket, so excluding drifted-out instances overstates usage_rate and removes those not-active outcomes from resolved_rate_when_not_active, making the per-skill deltas misleading across mixed skill configurations.
Useful? React with 👍 / 👎.
| let mut instance_skill_paths = global_skill_paths.clone(); | ||
| let mut instance_enabled = enabled; | ||
|
|
||
| let traj_paths = resolve_trajectory_paths(&args.sweep_dir, id); |
There was a problem hiding this comment.
Fail when a listed instance has no trajectory
When results.json lists an instance but none of the supported trajectory paths exist, traj_paths is empty and the instance is still pushed with no active skills. For completed sweeps with a missing or unbundled trajectory file, the command exits 0 and underreports activations/rates instead of surfacing the missing artifact, so coverage can look like inactivity rather than incomplete input data.
Useful? React with 👍 / 👎.
| m.total_activations.to_string(), | ||
| activated.to_string(), | ||
| format!("{usage:.4}"), | ||
| format!("{:.4}", m.share_of_all_activations), | ||
| reasons_str, |
There was a problem hiding this comment.
Scope text totals to the selected bucket
When --bucket resolved|unresolved|errored is used, these columns still render global total_activations, global activation share, and global reason counts while only instances_activated/rates come from the selected bucket. That makes the filtered text table internally inconsistent; for example an --bucket unresolved row can show activations and reasons that happened only in resolved instances, misleading operators who use the bucket view to analyze failures.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 06413bd6d1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| manifest.config_redacted.clone(), | ||
| ) { | ||
| instance_enabled = root_cfg.skills.enabled; | ||
| instance_skill_paths.clone_from(&root_cfg.skills.paths); |
There was a problem hiding this comment.
Don't scan skills from redacted trajectory paths
When a sweep's redaction config masks part of [skills].paths, each trajectory's info.manifest.config_redacted contains the masked string rather than a real directory. Overwriting instance_skill_paths with that redacted value makes the later SkillRegistry::scan_paths see a nonexistent path and omit configured-but-inactive skills from skill_universe/by_skill (and can report fake drift), even though results.json still has the usable resolved path; keep the unredacted sweep config for scanning and use redacted trajectory config only for display/fingerprints.
Useful? React with 👍 / 👎.
| for traj_path in traj_paths { | ||
| let traj = load_trajectory(&traj_path)?; |
There was a problem hiding this comment.
Attribute retry activations to their own run outcome
When an instance has multiple run-*.traj.json files, this loop merges active skills from every attempt but the instance has only the single aggregate bucket computed above. In a retry/pass@k sweep where a skill appears only in a failed first attempt and a later attempt resolves without it, those failed-attempt activations are still counted under the resolved bucket, which corrupts the by_outcome correlation rates; load each trajectory's outcome alongside its active skills or keep per-run buckets before aggregating.
Useful? React with 👍 / 👎.
| if let Ok(skills) = serde_json::from_value::<Vec<crate::skills::ActiveSkillManifest>>( | ||
| raw_skills.clone(), | ||
| ) { |
There was a problem hiding this comment.
Report malformed active skill manifests
When info.active_skills is present but does not deserialize cleanly, this if let Ok falls through and treats the trajectory as if it had no active skills. For a completed sweep with a truncated active-skill entry or a schema mismatch, skill-coverage exits successfully while underreporting activations and reason counts; propagate the parse error so corrupt provenance is not reported as inactivity.
Useful? React with 👍 / 👎.
| .is_some_and(|n| n.starts_with("run-") && n.ends_with(".traj.json")) | ||
| }) | ||
| .collect(); | ||
| paths.sort(); |
There was a problem hiding this comment.
Sort run trajectories by numeric slot
For instances with ten or more run-*.traj.json files, paths.sort() orders lexicographically (run-10 before run-2). Because the later loop overwrites instance_enabled and instance_skill_paths for each trajectory, the report can use an older run's skill configuration as the final one and omit inactive skills/drift from the real last attempt; sort by the parsed run index instead.
Useful? React with 👍 / 👎.
…sort run trajectories numerically, and handle redacted config paths
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f98870dc64
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let expanded_paths = global_skill_paths | ||
| .iter() | ||
| .map(crate::skills::expand_skill_path_pub) | ||
| .collect::<Vec<_>>(); | ||
| let registry = crate::skills::SkillRegistry::scan_paths(expanded_paths)?; |
There was a problem hiding this comment.
Scan the per-instance skill paths for drifted configs
When an instance's trajectory has a different skills.paths value and those paths are not redacted, instance_skill_paths is used for drift reporting but ignored for discovery here. This scans only the sweep-level global_skill_paths, so configured-but-inactive skills that exist only in the drifted group are omitted from skill_universe/by_skill, even though the spec says drift should report the union of all configured skills across configurations.
Useful? React with 👍 / 👎.
| if let Some(instances) = | ||
| val.get("instances").and_then(serde_json::Value::as_array) |
There was a problem hiding this comment.
Handle resolved_ids when loading run outcomes
For sb-cli reports that use the supported resolved_ids/submitted_ids shape instead of an instances array, this helper records no per-run outcomes and falls back to the heuristic below. Fresh evidence: this new loader only reads instances, while the existing evaluator parser handles the sb-cli resolved_ids/submitted_ids shape; in pass@k sweeps that can attribute a skill activation to the wrong run bucket whenever the resolved sample is not the guessed one.
Useful? React with 👍 / 👎.
| if let Ok(mut root_cfg) = | ||
| toml::from_str::<crate::config::schema::RootCfg>(&manifest.config.resolved) |
There was a problem hiding this comment.
Reject malformed resolved sweep config
If results.json has a malformed manifest.config.resolved, this branch silently leaves skills disabled, paths empty, and the default redactor in place, so the command can exit successfully with inactive skills missing instead of treating the sweep artifact as corrupted. Since the report depends on that resolved config for the skill universe and redaction policy, propagate the parse error when the manifest is present.
Useful? React with 👍 / 👎.
…ape, and propagate resolved config parser errors
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 245f1b6747
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if is_active { | ||
| instances_activated_global += 1; |
There was a problem hiding this comment.
Count activated instances by original instance id
For pass@k/retry sweeps where the same skill activates in more than one run-*.traj.json for the same instance, instance_data contains one row per run slot, so this increment reports two instances_activated (and activation_rate later divides by the number of runs) instead of one distinct instance as promised by the schema. Fresh evidence is that this revision now pushes each run file as a separate InstanceSkillData (id-run-N), so retry attempts inflate instance-level rates while total_activations should be the per-run counter.
Useful? React with 👍 / 👎.
| let run_index = if let Some(idx_str) = filename.split("-run-").nth(1) { | ||
| idx_str.parse::<u32>().unwrap_or(1) |
There was a problem hiding this comment.
Parse run report suffix from the last marker
When bench evaluate was run with a custom run id that itself contains -run- (for example smoke-run-2026), the report file becomes smoke-run-2026-run-2.json but this parses the first marker and records the entries under run 2026 instead of run 2 (or run 1 if the middle segment is nonnumeric). Those explicit per-run outcomes then never match the run-2.traj.json slots, so skill activations fall back to the heuristic and can be bucketed under the wrong outcome; parse the suffix with the last -run- separator.
Useful? React with 👍 / 👎.
| if let Ok(entries) = std::fs::read_dir(report_dir) { | ||
| for entry in entries.flatten() { | ||
| let path = entry.path(); | ||
| if path.extension().and_then(|s| s.to_str()) == Some("json") { |
There was a problem hiding this comment.
Ignore stale sb-cli reports from older evaluations
When the same sweep is evaluated more than once, sb_cli_reports can contain report JSON for multiple run ids, but this loader reads every JSON file and inserts all of their per-run outcomes into the same map. If an older report disagrees with the current evaluation.json, whichever directory entry is visited last wins for a given instance/run slot, so retry skill activations can be bucketed nondeterministically using stale evaluator results; restrict the scan to the report paths/run id recorded in the current evaluation provenance.
Useful? React with 👍 / 👎.
…run index suffix, ignore stale reports
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 39a7fdd276
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| .active_skills_unique | ||
| .iter() | ||
| .any(|s| redactor.redact_text(&s.name, surface::TRAJECTORY).text == redacted_name); | ||
| let is_parent_resolved = resolved_set.contains(&d.parent_instance_id); |
There was a problem hiding this comment.
Use run outcomes for bucket resolved rates
In retry/pass@k sweeps, this uses the parent instance's final resolution for every run row, so an earlier failed run-1.traj.json from an instance that later resolves is counted as resolved inside the unresolved bucket. The run is now correctly placed in d.bucket, but by_outcome.unresolved.resolved_rate_when_active/resolved_rate_when_not_active can still become non-zero (or 1.0) for those failed attempts, corrupting the per-bucket correlation metrics; bucket-specific resolved totals should be based on the run bucket rather than resolved_set for the parent.
Useful? React with 👍 / 👎.
| } | ||
| } | ||
|
|
||
| let _ = parse_report_outcomes(&path, run_index, &mut resolved_by_run); |
There was a problem hiding this comment.
Propagate sb-cli report parse failures
When a current sb_cli_reports/*.json file for a retry/pass@k evaluation is unreadable or malformed, this drops the error and leaves resolved_by_run incomplete, so the later heuristic can silently bucket run activations under the wrong outcome. Because these reports are the only precise source for per-run outcomes after bench evaluate, skill-coverage should fail on parse/read errors for matching reports instead of producing correlation metrics from guessed outcomes.
Useful? React with 👍 / 👎.
| let entry = config_groups | ||
| .entry(fingerprint) | ||
| .or_insert_with(|| (Vec::new(), d.eligible_paths.clone())); | ||
| entry.0.push(d.id.clone()); |
There was a problem hiding this comment.
Count drift groups by parent instance
In retry/pass@k sweeps instance_data contains one row per run-*.traj.json, but this records the synthetic run id (inst-run-N) for drift groups. A single instance with two attempts and the same skill config is therefore reported as two instance_counts, so the drift summary can exceed the number of sweep instances; use distinct parent_instance_ids per fingerprint for the instance count.
Useful? React with 👍 / 👎.
| ) -> Result<(), Error> { | ||
| let text = std::fs::read_to_string(path)?; | ||
| let val = serde_json::from_str::<serde_json::Value>(&text)?; | ||
| if let Some(instances) = val.get("instances").and_then(serde_json::Value::as_array) { |
There was a problem hiding this comment.
Parse array-shaped evaluator reports
The existing evaluator loader accepts sb-cli/generic reports that are a top-level array of {instance_id, resolved} rows, but this per-run loader only checks an instances object field and then the resolved_ids shape. For pass@k sweeps whose sb_cli_reports/*-run-N.json use the supported array shape, resolved_by_run stays empty and the command falls back to the heuristic, which can attribute skill activations to the wrong run outcome.
Useful? React with 👍 / 👎.
…te parse failures, count drift by parent instances, parse array reports
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c1aa5033b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } else if run_index == 1 { | ||
| pass_at_1 | ||
| } else { | ||
| resolved_count == runs || (run_index == runs) |
There was a problem hiding this comment.
Avoid double-counting resolved retry slots
When per-run evaluator reports are unavailable and a pass@k sweep has pass_at_1=true, resolved_count=1, and runs>1, this fallback marks run 1 resolved via pass_at_1 and also marks the last run resolved via run_index == runs. That contradicts the aggregate resolved_count and puts activations from an actually unresolved later retry into the resolved bucket, corrupting the per-outcome skill correlation for sweeps that only have results.json metadata.
Useful? React with 👍 / 👎.
| if nested.exists() { | ||
| return vec![nested]; |
There was a problem hiding this comment.
Include run files beside legacy trajectories
When an instance directory contains a legacy trajectory.json plus newer run-N.traj.json files, this early return loads only the legacy file and drops the retry/pass@k slots. The codebase supports that mixed layout elsewhere by treating trajectory.json as run 1 and run-N.traj.json as later slots, so skill activations and per-run outcome buckets are underreported for those sweeps.
Useful? React with 👍 / 👎.
| if let (Some(inst_id), Some(resolved)) = ( | ||
| row.get("instance_id").and_then(serde_json::Value::as_str), | ||
| row.get("resolved").and_then(serde_json::Value::as_bool), | ||
| ) { |
There was a problem hiding this comment.
Treat missing per-run resolved flags as false
For current sb_cli_reports rows that use the generic evaluator shape with instance_id/eval_exit_reason but omit resolved, the evaluator parser treats the row as unresolved, while this loader drops it because resolved must be a bool. In pass@k sweeps that leaves resolved_by_run incomplete and the later heuristic can move activations from that failed run into the wrong outcome bucket.
Useful? React with 👍 / 👎.
|
|
||
| let run_bucket = if is_run_resolved { | ||
| OutcomeBucket::Resolved | ||
| } else if traj.info.outcome.as_deref() == Some(crate::trajectory::outcome::ERROR) { |
There was a problem hiding this comment.
Preserve errored buckets for legacy trajectories
When a single-run/legacy trajectory omits info.outcome but the corresponding results.json row has outcome: "error", this check falls through to Unresolved instead of using the result-row outcome as a fallback. Those errored instances then inflate the unresolved bucket and hide skill usage in the errored bucket for older or minimally populated trajectory artifacts.
Useful? React with 👍 / 👎.
Implements bench skill-coverage command to analyze skill co-activation and drift analytics across completed sweeps. Closes #494.