feat: implement bench skill-coverage by madmax983 · Pull Request #617 · madmax983/rust_swe_agent

madmax983 · 2026-06-06T02:32:41Z

Implements bench skill-coverage command to analyze skill co-activation and drift analytics across completed sweeps. Closes #494.

…ift analytics

gemini-code-assist

Code Review

This pull request introduces the bench skill-coverage subcommand, which measures agent skill activation and correlates it with outcomes across a sweep. It includes the command specification, CLI argument parsing, report generation logic, audit redaction integration, and comprehensive integration tests. The reviewer feedback highlights several key areas for improvement: resolving an absolute local file URI used in a documentation link, optimizing trajectory file loading by merging redundant read passes, and caching skill descriptions in a map to avoid repetitive filesystem scans and manifest parsing.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-06T02:34:39Z

+            let root_cfg_opt = if let Ok(traj) = load_trajectory(&traj_path) {
+                if let Some(manifest) = &traj.info.manifest {
+                    serde_json::from_value::<crate::config::schema::RootCfg>(
+                        manifest.config_redacted.clone(),
+                    )
+                    .ok()
+                } else {
+                    None
+                }
+            } else {
+                None
+            };
+            if let Some(root_cfg) = root_cfg_opt {
+                instance_enabled = root_cfg.skills.enabled;
+                instance_skill_paths.clone_from(&root_cfg.skills.paths);
+            }
+
+            if let Ok(traj) = load_trajectory(&traj_path) {
+                // Parse active skills from trajectory other metadata
+                if let Some(raw_skills) = traj.info.other.get("active_skills") {
+                    if let Ok(skills) = serde_json::from_value::<
+                        Vec<crate::skills::ActiveSkillManifest>,
+                    >(raw_skills.clone())
+                    {
+                        // Merge active skills by name uniquely
+                        for skill in skills {
+                            if !active_skills.iter().any(|s| s.name == skill.name) {
+                                active_skills.push(skill);
+                            }
+                        }
+                    }
+                }
+            }


The trajectory file is loaded and parsed twice for each trajectory path. This is highly inefficient as trajectory files can be very large. Load the trajectory once and perform both the configuration extraction and active skills parsing on the same loaded instance.

if let Ok(traj) = load_trajectory(&traj_path) { if let Some(manifest) = &traj.info.manifest { if let Ok(root_cfg) = serde_json::from_value::<crate::config::schema::RootCfg>( manifest.config_redacted.clone(), ) { instance_enabled = root_cfg.skills.enabled; instance_skill_paths.clone_from(&root_cfg.skills.paths); } } // Parse active skills from trajectory other metadata if let Some(raw_skills) = traj.info.other.get("active_skills") { if let Ok(skills) = serde_json::from_value::< Vec<crate::skills::ActiveSkillManifest>, >(raw_skills.clone()) { // Merge active skills by name uniquely for skill in skills { if !active_skills.iter().any(|s| s.name == skill.name) { active_skills.push(skill); } } } } }

References

Before suggesting I/O optimizations such as merging multiple data processing passes, verify whether the functions actually perform redundant disk access. If a function processes data already loaded into memory by upstream components, merging it with an I/O-bound function is unnecessary and adds complexity without performance gain.

gemini-code-assist · 2026-06-06T02:34:39Z


 - Editing, generating, or recommending skill manifests
- Co-activation analytics across a corpus (`bench skill-coverage` is future)
+- Co-activation analytics across a corpus (implemented in [bench skill-coverage](file:///c:/Users/markm/rust_swe_agent/docs/spec-skill-coverage.md))


The link to spec-skill-coverage.md uses an absolute local file URI (file:///c:/Users/markm/...). This should be a relative link so that it works correctly when viewed on GitHub or other environments.

Suggested change

- Co-activation analytics across a corpus (implemented in [bench skill-coverage](file:///c:/Users/markm/rust_swe_agent/docs/spec-skill-coverage.md))

- Co-activation analytics across a corpus (implemented in [bench skill-coverage](spec-skill-coverage.md))

gemini-code-assist · 2026-06-06T02:34:39Z

+    let mut sorted_ids: Vec<String> = sweep.instances.keys().cloned().collect();
+    sorted_ids.sort();
+
+    let mut instance_data: Vec<InstanceSkillData> = Vec::new();


Define a skill_descriptions map to cache the descriptions of scanned skills during the first pass. This will allow us to avoid redundant filesystem scans and manifest parsing in the subsequent loop.

Suggested change

let mut instance_data: Vec<InstanceSkillData> = Vec::new();

let mut instance_data: Vec<InstanceSkillData> = Vec::new();

let mut skill_descriptions = HashMap::new();

gemini-code-assist · 2026-06-06T02:34:39Z

+            if let Ok(registry) = crate::skills::SkillRegistry::scan_paths(expanded_paths) {
+                for manifest in registry.manifests() {
+                    eligible_skills.insert(manifest.name.clone());
+                }
+            }


Populate the skill_descriptions map with the skill names and descriptions during the initial registry scan.

if let Ok(registry) = crate::skills::SkillRegistry::scan_paths(expanded_paths) { for manifest in registry.manifests() { eligible_skills.insert(manifest.name.clone()); skill_descriptions.insert(manifest.name.clone(), manifest.description.clone()); } }

gemini-code-assist · 2026-06-06T02:34:40Z

+        for name in &d.eligible_skills {
+            all_raw_skills.entry(name.clone()).or_insert_with(|| {
+                let expanded_paths = d
+                    .eligible_paths
+                    .iter()
+                    .map(crate::skills::expand_skill_path_pub)
+                    .collect::<Vec<_>>();
+                let description = crate::skills::SkillRegistry::scan_paths(expanded_paths)
+                    .ok()
+                    .and_then(|r| {
+                        r.manifests()
+                            .iter()
+                            .find(|m| &m.name == name)
+                            .map(|m| m.description.clone())
+                    })
+                    .unwrap_or_else(|| "Configured skill".to_owned());
+                SkillUniverseEntry {
+                    name: name.clone(),
+                    description,
+                }
+            });
+        }


Use the cached skill_descriptions map to retrieve descriptions instead of scanning the filesystem and parsing the skill manifests repeatedly for each eligible skill.

for name in &d.eligible_skills { all_raw_skills.entry(name.clone()).or_insert_with(|| { let description = skill_descriptions .get(name) .cloned() .unwrap_or_else(|| "Configured skill".to_owned()); SkillUniverseEntry { name: name.clone(), description, } }); }

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1905aa5b39

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-06T02:41:18Z

+        per_instance: t.per_instance,
+    })?;
+    if is_json {
+        println!("{}", serde_json::to_string_pretty(&report)?);


Preserve the artifact header in JSON stdout

When --format json is used, this prints the bare SkillCoverageReport, so stdout omits the artifact_kind and schema_version fields promised by the new JSON schema in docs/spec-skill-coverage.md. The file artifact is written through artifact::to_writer_pretty, but CLI consumers that parse stdout cannot validate or route the report as a skill_coverage artifact; serialize stdout with the same artifact wrapper or include those fields in the report type.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T02:41:18Z

+            .entry(redacted.clone())
+            .or_insert_with(|| SkillUniverseEntry {
+                name: redacted,
+                description: entry.description.clone(),


Redact skill descriptions before writing coverage artifacts

For configured-but-inactive skills this value comes from scanning the on-disk SKILL.md description and is copied into both stdout and skill-coverage.json without passing through the redactor. In sweeps where a skill description contains a token, key, or other sensitive literal, this re-exposes data that active skill provenance normally redacts in trajectories; redact the description with the same trajectory surface before inserting it into the report.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T02:41:18Z

+                            if !active_skills.iter().any(|s| s.name == skill.name) {
+                                active_skills.push(skill);


Count repeated activations across retry trajectories

When an instance has multiple trajectory files (for example run-*.traj.json from retries or resumes) and the same skill activates in more than one run, this de-duplicates by skill name before aggregation. As a result total_activations and the reason counts report one activation for that instance even though the field is documented as the total number of activations; keep per-run activations for those counters while still counting instances_activated distinctly.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T02:41:18Z

+    // Group configurations to detect drift
+    let mut config_groups: HashMap<String, (Vec<String>, Vec<String>)> = HashMap::new(); // fingerprint -> (instance_ids, paths)
+    for d in &instance_data {
+        if d.skills_enabled {


Include disabled skill configs in drift detection

When a sweep mixes instances with skills enabled and disabled, this condition skips the disabled instances entirely, so config_groups.len() can remain 1 and skill_set_drift is omitted even though the effective skill configuration changed. This hides a drift scenario that materially affects activation rates; include the enabled/disabled state in the fingerprint rather than dropping disabled configurations.

Useful? React with 👍 / 👎.

codecov · 2026-06-06T03:30:29Z

Codecov Report

❌ Patch coverage is 81.77215% with 144 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.43%. Comparing base (3e19815) to head (7c1aa50).
⚠️ Report is 2 commits behind head on trunk.

Files with missing lines	Patch %	Lines
src/run/skill_coverage.rs	84.28%	118 Missing ⚠️
src/cli/mod.rs	3.84%	25 Missing ⚠️
src/artifact.rs	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##            trunk     #617      +/-   ##
==========================================
+ Coverage   85.35%   85.43%   +0.08%     
==========================================
  Files         115      118       +3     
  Lines       66346    68039    +1693     
==========================================
+ Hits        56627    58127    +1500     
- Misses       9719     9912     +193

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6a75667898

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-06T03:34:24Z

+        let mut sorted_paths = d.eligible_paths.clone();
+        sorted_paths.sort();
+        let paths_str = sorted_paths.join(",");
+        let fingerprint = format!("enabled={}:{}", d.skills_enabled, paths_str);


Redact skill paths before serializing drift groups

When a sweep has skill-set drift, these path strings come from the raw resolved [skills] config and are copied into the drift fingerprint (and the same unredacted vector is later emitted as groups[].paths). If an operator's configured skill directory contains a token or other sensitive literal in its path, bench skill-coverage writes it to stdout and skill-coverage.json, unlike the redacted active-skill provenance; run these paths through the trajectory redactor before building the fingerprint/output.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T03:34:24Z

+        for traj_path in traj_paths {
+            if let Ok(traj) = load_trajectory(&traj_path) {


Fail instead of silently dropping unreadable trajectories

When a trajectory file exists but cannot be parsed or classified, this if let Ok skips it and still pushes the instance as though it had no active skills. For completed sweeps with a truncated run-*.traj.json or schema mismatch, the command exits 0 while underreporting activations, activation rates, and drift; propagate the load error instead of treating corrupted trajectories as inactivity.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T03:34:24Z

+        }
+    }
+
+    let redactor = Redactor::default_enabled();


Use the sweep's redaction policy for scanned skills

When the original sweep configured custom redaction literals or patterns, this fresh default redactor does not know about them. The fields later redacted from on-disk SKILL.md manifests, such as inactive skill names and descriptions, can therefore leak values that the trajectories' own redaction policy would have masked; construct the redactor from the sweep config before serializing coverage output.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T03:34:24Z

+        for d in &instance_data {
+            let is_active = d.active_skills_unique.iter().any(|s| s.name == raw_name);
+            let is_eligible = d.eligible_skills.contains(&raw_name);


Match active skills after applying trajectory redaction

For a skill name that is redacted in trajectory provenance, active_skills_unique contains the redacted marker while raw_name comes from the unredacted scanned skill registry. This comparison then treats the activated skill as inactive (or nondeterministically overwrites it with a duplicate redacted key), so secret-bearing skill names produce incorrect activation counts even though the report is supposed to aggregate by redacted name; compare against the redacted skill name or normalize both sides consistently.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f5becae1b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-06T03:52:29Z

+                .iter()
+                .map(crate::skills::expand_skill_path_pub)
+                .collect::<Vec<_>>();
+            if let Ok(registry) = crate::skills::SkillRegistry::scan_paths(expanded_paths) {


Propagate skill registry scan failures

When an enabled skill path contains an unreadable, malformed, or duplicate SKILL.md, SkillRegistry::scan_paths returns an error but this if let Ok treats that configuration as having no eligible skills. The command then exits successfully with configured-but-inactive skills missing from skill_universe/by_skill, which corrupts the coverage report instead of surfacing the broken skill library to the operator.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T03:52:29Z

+                if in_scope {
+                    *per_bucket_total.entry(bname).or_default() += 1;


Count every bucket instance in outcome totals

When skill-set drift means a skill is only configured for some instances, this denominator is incremented only for instances where the skill is active or eligible. The report schema documents by_outcome.instances_total as the total instances in the bucket, so excluding drifted-out instances overstates usage_rate and removes those not-active outcomes from resolved_rate_when_not_active, making the per-skill deltas misleading across mixed skill configurations.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T03:52:29Z

+        let mut instance_skill_paths = global_skill_paths.clone();
+        let mut instance_enabled = enabled;
+
+        let traj_paths = resolve_trajectory_paths(&args.sweep_dir, id);


Fail when a listed instance has no trajectory

When results.json lists an instance but none of the supported trajectory paths exist, traj_paths is empty and the instance is still pushed with no active skills. For completed sweeps with a missing or unbundled trajectory file, the command exits 0 and underreports activations/rates instead of surfacing the missing artifact, so coverage can look like inactivity rather than incomplete input data.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T03:52:30Z

+                m.total_activations.to_string(),
+                activated.to_string(),
+                format!("{usage:.4}"),
+                format!("{:.4}", m.share_of_all_activations),
+                reasons_str,


Scope text totals to the selected bucket

When --bucket resolved|unresolved|errored is used, these columns still render global total_activations, global activation share, and global reason counts while only instances_activated/rates come from the selected bucket. That makes the filtered text table internally inconsistent; for example an --bucket unresolved row can show activations and reasons that happened only in resolved instances, misleading operators who use the bucket view to analyze failures.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 06413bd6d1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-06T04:47:35Z

+                    manifest.config_redacted.clone(),
+                ) {
+                    instance_enabled = root_cfg.skills.enabled;
+                    instance_skill_paths.clone_from(&root_cfg.skills.paths);


Don't scan skills from redacted trajectory paths

When a sweep's redaction config masks part of [skills].paths, each trajectory's info.manifest.config_redacted contains the masked string rather than a real directory. Overwriting instance_skill_paths with that redacted value makes the later SkillRegistry::scan_paths see a nonexistent path and omit configured-but-inactive skills from skill_universe/by_skill (and can report fake drift), even though results.json still has the usable resolved path; keep the unredacted sweep config for scanning and use redacted trajectory config only for display/fingerprints.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T04:47:35Z

+        for traj_path in traj_paths {
+            let traj = load_trajectory(&traj_path)?;


Attribute retry activations to their own run outcome

When an instance has multiple run-*.traj.json files, this loop merges active skills from every attempt but the instance has only the single aggregate bucket computed above. In a retry/pass@k sweep where a skill appears only in a failed first attempt and a later attempt resolves without it, those failed-attempt activations are still counted under the resolved bucket, which corrupts the by_outcome correlation rates; load each trajectory's outcome alongside its active skills or keep per-run buckets before aggregating.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T04:47:35Z

+                if let Ok(skills) = serde_json::from_value::<Vec<crate::skills::ActiveSkillManifest>>(
+                    raw_skills.clone(),
+                ) {


Report malformed active skill manifests

When info.active_skills is present but does not deserialize cleanly, this if let Ok falls through and treats the trajectory as if it had no active skills. For a completed sweep with a truncated active-skill entry or a schema mismatch, skill-coverage exits successfully while underreporting activations and reason counts; propagate the parse error so corrupt provenance is not reported as inactivity.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T04:47:35Z

+                            .is_some_and(|n| n.starts_with("run-") && n.ends_with(".traj.json"))
+                    })
+                    .collect();
+                paths.sort();


Sort run trajectories by numeric slot

For instances with ten or more run-*.traj.json files, paths.sort() orders lexicographically (run-10 before run-2). Because the later loop overwrites instance_enabled and instance_skill_paths for each trajectory, the report can use an older run's skill configuration as the final one and omit inactive skills/drift from the real last attempt; sort by the parsed run index instead.

Useful? React with 👍 / 👎.

…sort run trajectories numerically, and handle redacted config paths

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f98870dc64

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-06T22:19:18Z

+                let expanded_paths = global_skill_paths
+                    .iter()
+                    .map(crate::skills::expand_skill_path_pub)
+                    .collect::<Vec<_>>();
+                let registry = crate::skills::SkillRegistry::scan_paths(expanded_paths)?;


Scan the per-instance skill paths for drifted configs

When an instance's trajectory has a different skills.paths value and those paths are not redacted, instance_skill_paths is used for drift reporting but ignored for discovery here. This scans only the sweep-level global_skill_paths, so configured-but-inactive skills that exist only in the drifted group are omitted from skill_universe/by_skill, even though the spec says drift should report the union of all configured skills across configurations.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T22:19:18Z

+                            if let Some(instances) =
+                                val.get("instances").and_then(serde_json::Value::as_array)


Handle resolved_ids when loading run outcomes

For sb-cli reports that use the supported resolved_ids/submitted_ids shape instead of an instances array, this helper records no per-run outcomes and falls back to the heuristic below. Fresh evidence: this new loader only reads instances, while the existing evaluator parser handles the sb-cli resolved_ids/submitted_ids shape; in pass@k sweeps that can attribute a skill activation to the wrong run bucket whenever the resolved sample is not the guessed one.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T22:19:18Z

+        if let Ok(mut root_cfg) =
+            toml::from_str::<crate::config::schema::RootCfg>(&manifest.config.resolved)


Reject malformed resolved sweep config

If results.json has a malformed manifest.config.resolved, this branch silently leaves skills disabled, paths empty, and the default redactor in place, so the command can exit successfully with inactive skills missing instead of treating the sweep artifact as corrupted. Since the report depends on that resolved config for the skill universe and redaction policy, propagate the parse error when the manifest is present.

Useful? React with 👍 / 👎.

…ape, and propagate resolved config parser errors

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 245f1b6747

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-06T23:11:01Z

+            if is_active {
+                instances_activated_global += 1;


Count activated instances by original instance id

For pass@k/retry sweeps where the same skill activates in more than one run-*.traj.json for the same instance, instance_data contains one row per run slot, so this increment reports two instances_activated (and activation_rate later divides by the number of runs) instead of one distinct instance as promised by the schema. Fresh evidence is that this revision now pushes each run file as a separate InstanceSkillData (id-run-N), so retry attempts inflate instance-level rates while total_activations should be the per-run counter.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T23:11:01Z

+                    let run_index = if let Some(idx_str) = filename.split("-run-").nth(1) {
+                        idx_str.parse::<u32>().unwrap_or(1)


Parse run report suffix from the last marker

When bench evaluate was run with a custom run id that itself contains -run- (for example smoke-run-2026), the report file becomes smoke-run-2026-run-2.json but this parses the first marker and records the entries under run 2026 instead of run 2 (or run 1 if the middle segment is nonnumeric). Those explicit per-run outcomes then never match the run-2.traj.json slots, so skill activations fall back to the heuristic and can be bucketed under the wrong outcome; parse the suffix with the last -run- separator.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T23:11:01Z

+        if let Ok(entries) = std::fs::read_dir(report_dir) {
+            for entry in entries.flatten() {
+                let path = entry.path();
+                if path.extension().and_then(|s| s.to_str()) == Some("json") {


Ignore stale sb-cli reports from older evaluations

When the same sweep is evaluated more than once, sb_cli_reports can contain report JSON for multiple run ids, but this loader reads every JSON file and inserts all of their per-run outcomes into the same map. If an older report disagrees with the current evaluation.json, whichever directory entry is visited last wins for a given instance/run slot, so retry skill activations can be bucketed nondeterministically using stale evaluator results; restrict the scan to the report paths/run id recorded in the current evaluation provenance.

Useful? React with 👍 / 👎.

…run index suffix, ignore stale reports

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 39a7fdd276

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-06T23:54:40Z

+                .active_skills_unique
+                .iter()
+                .any(|s| redactor.redact_text(&s.name, surface::TRAJECTORY).text == redacted_name);
+            let is_parent_resolved = resolved_set.contains(&d.parent_instance_id);


Use run outcomes for bucket resolved rates

In retry/pass@k sweeps, this uses the parent instance's final resolution for every run row, so an earlier failed run-1.traj.json from an instance that later resolves is counted as resolved inside the unresolved bucket. The run is now correctly placed in d.bucket, but by_outcome.unresolved.resolved_rate_when_active/resolved_rate_when_not_active can still become non-zero (or 1.0) for those failed attempts, corrupting the per-bucket correlation metrics; bucket-specific resolved totals should be based on the run bucket rather than resolved_set for the parent.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T23:54:40Z

+                        }
+                    }
+
+                    let _ = parse_report_outcomes(&path, run_index, &mut resolved_by_run);


Propagate sb-cli report parse failures

When a current sb_cli_reports/*.json file for a retry/pass@k evaluation is unreadable or malformed, this drops the error and leaves resolved_by_run incomplete, so the later heuristic can silently bucket run activations under the wrong outcome. Because these reports are the only precise source for per-run outcomes after bench evaluate, skill-coverage should fail on parse/read errors for matching reports instead of producing correlation metrics from guessed outcomes.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T23:54:40Z

+        let entry = config_groups
+            .entry(fingerprint)
+            .or_insert_with(|| (Vec::new(), d.eligible_paths.clone()));
+        entry.0.push(d.id.clone());


Count drift groups by parent instance

In retry/pass@k sweeps instance_data contains one row per run-*.traj.json, but this records the synthetic run id (inst-run-N) for drift groups. A single instance with two attempts and the same skill config is therefore reported as two instance_counts, so the drift summary can exceed the number of sweep instances; use distinct parent_instance_ids per fingerprint for the instance count.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T23:54:40Z

+) -> Result<(), Error> {
+    let text = std::fs::read_to_string(path)?;
+    let val = serde_json::from_str::<serde_json::Value>(&text)?;
+    if let Some(instances) = val.get("instances").and_then(serde_json::Value::as_array) {


Parse array-shaped evaluator reports

The existing evaluator loader accepts sb-cli/generic reports that are a top-level array of {instance_id, resolved} rows, but this per-run loader only checks an instances object field and then the resolved_ids shape. For pass@k sweeps whose sb_cli_reports/*-run-N.json use the supported array shape, resolved_by_run stays empty and the command falls back to the heuristic, which can attribute skill activations to the wrong run outcome.

Useful? React with 👍 / 👎.

…te parse failures, count drift by parent instances, parse array reports

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c1aa5033b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-07T01:15:06Z

+                        } else if run_index == 1 {
+                            pass_at_1
+                        } else {
+                            resolved_count == runs || (run_index == runs)


Avoid double-counting resolved retry slots

When per-run evaluator reports are unavailable and a pass@k sweep has pass_at_1=true, resolved_count=1, and runs>1, this fallback marks run 1 resolved via pass_at_1 and also marks the last run resolved via run_index == runs. That contradicts the aggregate resolved_count and puts activations from an actually unresolved later retry into the resolved bucket, corrupting the per-outcome skill correlation for sweeps that only have results.json metadata.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-07T01:15:07Z

+    if nested.exists() {
+        return vec![nested];


Include run files beside legacy trajectories

When an instance directory contains a legacy trajectory.json plus newer run-N.traj.json files, this early return loads only the legacy file and drops the retry/pass@k slots. The codebase supports that mixed layout elsewhere by treating trajectory.json as run 1 and run-N.traj.json as later slots, so skill activations and per-run outcome buckets are underreported for those sweeps.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-07T01:15:07Z

+                if let (Some(inst_id), Some(resolved)) = (
+                    row.get("instance_id").and_then(serde_json::Value::as_str),
+                    row.get("resolved").and_then(serde_json::Value::as_bool),
+                ) {


Treat missing per-run resolved flags as false

For current sb_cli_reports rows that use the generic evaluator shape with instance_id/eval_exit_reason but omit resolved, the evaluator parser treats the row as unresolved, while this loader drops it because resolved must be a bool. In pass@k sweeps that leaves resolved_by_run incomplete and the later heuristic can move activations from that failed run into the wrong outcome bucket.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-07T01:15:07Z

+
+            let run_bucket = if is_run_resolved {
+                OutcomeBucket::Resolved
+            } else if traj.info.outcome.as_deref() == Some(crate::trajectory::outcome::ERROR) {


Preserve errored buckets for legacy trajectories

When a single-run/legacy trajectory omits info.outcome but the corresponding results.json row has outcome: "error", this check falls through to Unresolved instead of using the result-row outcome as a fallback. Those errored instances then inflate the unresolved bucket and hide skill usage in the errored bucket for older or minimally populated trajectory artifacts.

Useful? React with 👍 / 👎.

feat: implement bench skill-coverage command for co-activation and dr…

1905aa5

…ift analytics

gemini-code-assist Bot reviewed Jun 6, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 6, 2026

View reviewed changes

refactor: address code review and codex comments on skill-coverage

6a75667

chatgpt-codex-connector Bot reviewed Jun 6, 2026

View reviewed changes

refactor: address secondary codex comments on skill-coverage

8f5beca

chatgpt-codex-connector Bot reviewed Jun 6, 2026

View reviewed changes

feat: address third round of Codex Review suggestions for skill-coverage

06413bd

chatgpt-codex-connector Bot reviewed Jun 6, 2026

View reviewed changes

feat(bench): attribute retry activations to correct outcome buckets, …

f98870d

…sort run trajectories numerically, and handle redacted config paths

chatgpt-codex-connector Bot reviewed Jun 6, 2026

View reviewed changes

feat(bench): support drifted configs scanning, resolved_ids report sh…

245f1b6

…ape, and propagate resolved config parser errors

chatgpt-codex-connector Bot reviewed Jun 6, 2026

View reviewed changes

address round 5 codex reviews: distinct instance metrics, parse last …

39a7fdd

…run index suffix, ignore stale reports

chatgpt-codex-connector Bot reviewed Jun 6, 2026

View reviewed changes

address round 6 codex reviews: run-specific resolution rates, propaga…

7c1aa50

…te parse failures, count drift by parent instances, parse array reports

chatgpt-codex-connector Bot reviewed Jun 7, 2026

View reviewed changes

	- Co-activation analytics across a corpus (implemented in [bench skill-coverage](file:///c:/Users/markm/rust_swe_agent/docs/spec-skill-coverage.md))
	- Co-activation analytics across a corpus (implemented in [bench skill-coverage](spec-skill-coverage.md))

	let mut instance_data: Vec<InstanceSkillData> = Vec::new();
	let mut instance_data: Vec<InstanceSkillData> = Vec::new();
	let mut skill_descriptions = HashMap::new();

		if !active_skills.iter().any(\|s\| s.name == skill.name) {
		active_skills.push(skill);

		for traj_path in traj_paths {
		if let Ok(traj) = load_trajectory(&traj_path) {

		if in_scope {
		*per_bucket_total.entry(bname).or_default() += 1;

		for traj_path in traj_paths {
		let traj = load_trajectory(&traj_path)?;

		if let Some(instances) =
		val.get("instances").and_then(serde_json::Value::as_array)

		if let Ok(mut root_cfg) =
		toml::from_str::<crate::config::schema::RootCfg>(&manifest.config.resolved)

		let run_index = if let Some(idx_str) = filename.split("-run-").nth(1) {
		idx_str.parse::<u32>().unwrap_or(1)

Conversation

madmax983 commented Jun 6, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

codecov Bot commented Jun 6, 2026 •

edited

Loading