Skip to content

feat: implement bench skill-coverage#617

Open
madmax983 wants to merge 8 commits into
trunkfrom
feature/issue-494-skill-coverage
Open

feat: implement bench skill-coverage#617
madmax983 wants to merge 8 commits into
trunkfrom
feature/issue-494-skill-coverage

Conversation

@madmax983
Copy link
Copy Markdown
Owner

Implements bench skill-coverage command to analyze skill co-activation and drift analytics across completed sweeps. Closes #494.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the bench skill-coverage subcommand, which measures agent skill activation and correlates it with outcomes across a sweep. It includes the command specification, CLI argument parsing, report generation logic, audit redaction integration, and comprehensive integration tests. The reviewer feedback highlights several key areas for improvement: resolving an absolute local file URI used in a documentation link, optimizing trajectory file loading by merging redundant read passes, and caching skill descriptions in a map to avoid repetitive filesystem scans and manifest parsing.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/run/skill_coverage.rs Outdated
Comment on lines +423 to +455
let root_cfg_opt = if let Ok(traj) = load_trajectory(&traj_path) {
if let Some(manifest) = &traj.info.manifest {
serde_json::from_value::<crate::config::schema::RootCfg>(
manifest.config_redacted.clone(),
)
.ok()
} else {
None
}
} else {
None
};
if let Some(root_cfg) = root_cfg_opt {
instance_enabled = root_cfg.skills.enabled;
instance_skill_paths.clone_from(&root_cfg.skills.paths);
}

if let Ok(traj) = load_trajectory(&traj_path) {
// Parse active skills from trajectory other metadata
if let Some(raw_skills) = traj.info.other.get("active_skills") {
if let Ok(skills) = serde_json::from_value::<
Vec<crate::skills::ActiveSkillManifest>,
>(raw_skills.clone())
{
// Merge active skills by name uniquely
for skill in skills {
if !active_skills.iter().any(|s| s.name == skill.name) {
active_skills.push(skill);
}
}
}
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The trajectory file is loaded and parsed twice for each trajectory path. This is highly inefficient as trajectory files can be very large. Load the trajectory once and perform both the configuration extraction and active skills parsing on the same loaded instance.

            if let Ok(traj) = load_trajectory(&traj_path) {
                if let Some(manifest) = &traj.info.manifest {
                    if let Ok(root_cfg) = serde_json::from_value::<crate::config::schema::RootCfg>(
                        manifest.config_redacted.clone(),
                    ) {
                        instance_enabled = root_cfg.skills.enabled;
                        instance_skill_paths.clone_from(&root_cfg.skills.paths);
                    }
                }

                // Parse active skills from trajectory other metadata
                if let Some(raw_skills) = traj.info.other.get("active_skills") {
                    if let Ok(skills) = serde_json::from_value::<
                        Vec<crate::skills::ActiveSkillManifest>,
                    >(raw_skills.clone())
                    {
                        // Merge active skills by name uniquely
                        for skill in skills {
                            if !active_skills.iter().any(|s| s.name == skill.name) {
                                active_skills.push(skill);
                            }
                        }
                    }
                }
            }
References
  1. Before suggesting I/O optimizations such as merging multiple data processing passes, verify whether the functions actually perform redundant disk access. If a function processes data already loaded into memory by upstream components, merging it with an I/O-bound function is unnecessary and adds complexity without performance gain.

Comment thread docs/spec-skills-preview.md Outdated

- Editing, generating, or recommending skill manifests
- Co-activation analytics across a corpus (`bench skill-coverage` is future)
- Co-activation analytics across a corpus (implemented in [bench skill-coverage](file:///c:/Users/markm/rust_swe_agent/docs/spec-skill-coverage.md))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The link to spec-skill-coverage.md uses an absolute local file URI (file:///c:/Users/markm/...). This should be a relative link so that it works correctly when viewed on GitHub or other environments.

Suggested change
- Co-activation analytics across a corpus (implemented in [bench skill-coverage](file:///c:/Users/markm/rust_swe_agent/docs/spec-skill-coverage.md))
- Co-activation analytics across a corpus (implemented in [bench skill-coverage](spec-skill-coverage.md))

Comment thread src/run/skill_coverage.rs
let mut sorted_ids: Vec<String> = sweep.instances.keys().cloned().collect();
sorted_ids.sort();

let mut instance_data: Vec<InstanceSkillData> = Vec::new();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Define a skill_descriptions map to cache the descriptions of scanned skills during the first pass. This will allow us to avoid redundant filesystem scans and manifest parsing in the subsequent loop.

Suggested change
let mut instance_data: Vec<InstanceSkillData> = Vec::new();
let mut instance_data: Vec<InstanceSkillData> = Vec::new();
let mut skill_descriptions = HashMap::new();

Comment thread src/run/skill_coverage.rs Outdated
Comment on lines +465 to +469
if let Ok(registry) = crate::skills::SkillRegistry::scan_paths(expanded_paths) {
for manifest in registry.manifests() {
eligible_skills.insert(manifest.name.clone());
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Populate the skill_descriptions map with the skill names and descriptions during the initial registry scan.

            if let Ok(registry) = crate::skills::SkillRegistry::scan_paths(expanded_paths) {
                for manifest in registry.manifests() {
                    eligible_skills.insert(manifest.name.clone());
                    skill_descriptions.insert(manifest.name.clone(), manifest.description.clone());
                }
            }

Comment thread src/run/skill_coverage.rs
Comment on lines +507 to +528
for name in &d.eligible_skills {
all_raw_skills.entry(name.clone()).or_insert_with(|| {
let expanded_paths = d
.eligible_paths
.iter()
.map(crate::skills::expand_skill_path_pub)
.collect::<Vec<_>>();
let description = crate::skills::SkillRegistry::scan_paths(expanded_paths)
.ok()
.and_then(|r| {
r.manifests()
.iter()
.find(|m| &m.name == name)
.map(|m| m.description.clone())
})
.unwrap_or_else(|| "Configured skill".to_owned());
SkillUniverseEntry {
name: name.clone(),
description,
}
});
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Use the cached skill_descriptions map to retrieve descriptions instead of scanning the filesystem and parsing the skill manifests repeatedly for each eligible skill.

        for name in &d.eligible_skills {
            all_raw_skills.entry(name.clone()).or_insert_with(|| {
                let description = skill_descriptions
                    .get(name)
                    .cloned()
                    .unwrap_or_else(|| "Configured skill".to_owned());
                SkillUniverseEntry {
                    name: name.clone(),
                    description,
                }
            });
        }

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1905aa5b39

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cli/mod.rs Outdated
per_instance: t.per_instance,
})?;
if is_json {
println!("{}", serde_json::to_string_pretty(&report)?);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve the artifact header in JSON stdout

When --format json is used, this prints the bare SkillCoverageReport, so stdout omits the artifact_kind and schema_version fields promised by the new JSON schema in docs/spec-skill-coverage.md. The file artifact is written through artifact::to_writer_pretty, but CLI consumers that parse stdout cannot validate or route the report as a skill_coverage artifact; serialize stdout with the same artifact wrapper or include those fields in the report type.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
.entry(redacted.clone())
.or_insert_with(|| SkillUniverseEntry {
name: redacted,
description: entry.description.clone(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Redact skill descriptions before writing coverage artifacts

For configured-but-inactive skills this value comes from scanning the on-disk SKILL.md description and is copied into both stdout and skill-coverage.json without passing through the redactor. In sweeps where a skill description contains a token, key, or other sensitive literal, this re-exposes data that active skill provenance normally redacts in trajectories; redact the description with the same trajectory surface before inserting it into the report.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
Comment on lines +449 to +450
if !active_skills.iter().any(|s| s.name == skill.name) {
active_skills.push(skill);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Count repeated activations across retry trajectories

When an instance has multiple trajectory files (for example run-*.traj.json from retries or resumes) and the same skill activates in more than one run, this de-duplicates by skill name before aggregation. As a result total_activations and the reason counts report one activation for that instance even though the field is documented as the total number of activations; keep per-run activations for those counters while still counting instances_activated distinctly.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
// Group configurations to detect drift
let mut config_groups: HashMap<String, (Vec<String>, Vec<String>)> = HashMap::new(); // fingerprint -> (instance_ids, paths)
for d in &instance_data {
if d.skills_enabled {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include disabled skill configs in drift detection

When a sweep mixes instances with skills enabled and disabled, this condition skips the disabled instances entirely, so config_groups.len() can remain 1 and skill_set_drift is omitted even though the effective skill configuration changed. This hides a drift scenario that materially affects activation rates; include the enabled/disabled state in the fingerprint rather than dropping disabled configurations.

Useful? React with 👍 / 👎.

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 6, 2026

Codecov Report

❌ Patch coverage is 81.77215% with 144 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.43%. Comparing base (3e19815) to head (7c1aa50).
⚠️ Report is 2 commits behind head on trunk.

Files with missing lines Patch % Lines
src/run/skill_coverage.rs 84.28% 118 Missing ⚠️
src/cli/mod.rs 3.84% 25 Missing ⚠️
src/artifact.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            trunk     #617      +/-   ##
==========================================
+ Coverage   85.35%   85.43%   +0.08%     
==========================================
  Files         115      118       +3     
  Lines       66346    68039    +1693     
==========================================
+ Hits        56627    58127    +1500     
- Misses       9719     9912     +193     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6a75667898

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/run/skill_coverage.rs
Comment on lines +555 to +558
let mut sorted_paths = d.eligible_paths.clone();
sorted_paths.sort();
let paths_str = sorted_paths.join(",");
let fingerprint = format!("enabled={}:{}", d.skills_enabled, paths_str);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Redact skill paths before serializing drift groups

When a sweep has skill-set drift, these path strings come from the raw resolved [skills] config and are copied into the drift fingerprint (and the same unredacted vector is later emitted as groups[].paths). If an operator's configured skill directory contains a token or other sensitive literal in its path, bench skill-coverage writes it to stdout and skill-coverage.json, unlike the redacted active-skill provenance; run these paths through the trajectory redactor before building the fingerprint/output.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
Comment on lines +425 to +426
for traj_path in traj_paths {
if let Ok(traj) = load_trajectory(&traj_path) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail instead of silently dropping unreadable trajectories

When a trajectory file exists but cannot be parsed or classified, this if let Ok skips it and still pushes the instance as though it had no active skills. For completed sweeps with a truncated run-*.traj.json or schema mismatch, the command exits 0 while underreporting activations, activation rates, and drift; propagate the load error instead of treating corrupted trajectories as inactivity.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
}
}

let redactor = Redactor::default_enabled();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use the sweep's redaction policy for scanned skills

When the original sweep configured custom redaction literals or patterns, this fresh default redactor does not know about them. The fields later redacted from on-disk SKILL.md manifests, such as inactive skill names and descriptions, can therefore leak values that the trajectories' own redaction policy would have masked; construct the redactor from the sweep config before serializing coverage output.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
Comment on lines +607 to +609
for d in &instance_data {
let is_active = d.active_skills_unique.iter().any(|s| s.name == raw_name);
let is_eligible = d.eligible_skills.contains(&raw_name);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match active skills after applying trajectory redaction

For a skill name that is redacted in trajectory provenance, active_skills_unique contains the redacted marker while raw_name comes from the unredacted scanned skill registry. This comparison then treats the activated skill as inactive (or nondeterministically overwrites it with a duplicate redacted key), so secret-bearing skill names produce incorrect activation counts even though the report is supposed to aggregate by redacted name; compare against the redacted skill name or normalize both sides consistently.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8f5becae1b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/run/skill_coverage.rs Outdated
.iter()
.map(crate::skills::expand_skill_path_pub)
.collect::<Vec<_>>();
if let Ok(registry) = crate::skills::SkillRegistry::scan_paths(expanded_paths) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Propagate skill registry scan failures

When an enabled skill path contains an unreadable, malformed, or duplicate SKILL.md, SkillRegistry::scan_paths returns an error but this if let Ok treats that configuration as having no eligible skills. The command then exits successfully with configured-but-inactive skills missing from skill_universe/by_skill, which corrupts the coverage report instead of surfacing the broken skill library to the operator.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
Comment on lines +641 to +642
if in_scope {
*per_bucket_total.entry(bname).or_default() += 1;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Count every bucket instance in outcome totals

When skill-set drift means a skill is only configured for some instances, this denominator is incremented only for instances where the skill is active or eligible. The report schema documents by_outcome.instances_total as the total instances in the bucket, so excluding drifted-out instances overstates usage_rate and removes those not-active outcomes from resolved_rate_when_not_active, making the per-skill deltas misleading across mixed skill configurations.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs
let mut instance_skill_paths = global_skill_paths.clone();
let mut instance_enabled = enabled;

let traj_paths = resolve_trajectory_paths(&args.sweep_dir, id);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail when a listed instance has no trajectory

When results.json lists an instance but none of the supported trajectory paths exist, traj_paths is empty and the instance is still pushed with no active skills. For completed sweeps with a missing or unbundled trajectory file, the command exits 0 and underreports activations/rates instead of surfacing the missing artifact, so coverage can look like inactivity rather than incomplete input data.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
Comment on lines +219 to +223
m.total_activations.to_string(),
activated.to_string(),
format!("{usage:.4}"),
format!("{:.4}", m.share_of_all_activations),
reasons_str,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Scope text totals to the selected bucket

When --bucket resolved|unresolved|errored is used, these columns still render global total_activations, global activation share, and global reason counts while only instances_activated/rates come from the selected bucket. That makes the filtered text table internally inconsistent; for example an --bucket unresolved row can show activations and reasons that happened only in resolved instances, misleading operators who use the bucket view to analyze failures.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 06413bd6d1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/run/skill_coverage.rs
manifest.config_redacted.clone(),
) {
instance_enabled = root_cfg.skills.enabled;
instance_skill_paths.clone_from(&root_cfg.skills.paths);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't scan skills from redacted trajectory paths

When a sweep's redaction config masks part of [skills].paths, each trajectory's info.manifest.config_redacted contains the masked string rather than a real directory. Overwriting instance_skill_paths with that redacted value makes the later SkillRegistry::scan_paths see a nonexistent path and omit configured-but-inactive skills from skill_universe/by_skill (and can report fake drift), even though results.json still has the usable resolved path; keep the unredacted sweep config for scanning and use redacted trajectory config only for display/fingerprints.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
Comment on lines +448 to +449
for traj_path in traj_paths {
let traj = load_trajectory(&traj_path)?;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Attribute retry activations to their own run outcome

When an instance has multiple run-*.traj.json files, this loop merges active skills from every attempt but the instance has only the single aggregate bucket computed above. In a retry/pass@k sweep where a skill appears only in a failed first attempt and a later attempt resolves without it, those failed-attempt activations are still counted under the resolved bucket, which corrupts the by_outcome correlation rates; load each trajectory's outcome alongside its active skills or keep per-run buckets before aggregating.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
Comment on lines +462 to +464
if let Ok(skills) = serde_json::from_value::<Vec<crate::skills::ActiveSkillManifest>>(
raw_skills.clone(),
) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Report malformed active skill manifests

When info.active_skills is present but does not deserialize cleanly, this if let Ok falls through and treats the trajectory as if it had no active skills. For a completed sweep with a truncated active-skill entry or a schema mismatch, skill-coverage exits successfully while underreporting activations and reason counts; propagate the parse error so corrupt provenance is not reported as inactivity.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
.is_some_and(|n| n.starts_with("run-") && n.ends_with(".traj.json"))
})
.collect();
paths.sort();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Sort run trajectories by numeric slot

For instances with ten or more run-*.traj.json files, paths.sort() orders lexicographically (run-10 before run-2). Because the later loop overwrites instance_enabled and instance_skill_paths for each trajectory, the report can use an older run's skill configuration as the final one and omit inactive skills/drift from the real last attempt; sort by the parsed run index instead.

Useful? React with 👍 / 👎.

…sort run trajectories numerically, and handle redacted config paths
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f98870dc64

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/run/skill_coverage.rs Outdated
Comment on lines +537 to +541
let expanded_paths = global_skill_paths
.iter()
.map(crate::skills::expand_skill_path_pub)
.collect::<Vec<_>>();
let registry = crate::skills::SkillRegistry::scan_paths(expanded_paths)?;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Scan the per-instance skill paths for drifted configs

When an instance's trajectory has a different skills.paths value and those paths are not redacted, instance_skill_paths is used for drift reporting but ignored for discovery here. This scans only the sweep-level global_skill_paths, so configured-but-inactive skills that exist only in the drifted group are omitted from skill_universe/by_skill, even though the spec says drift should report the union of all configured skills across configurations.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
Comment on lines +389 to +390
if let Some(instances) =
val.get("instances").and_then(serde_json::Value::as_array)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle resolved_ids when loading run outcomes

For sb-cli reports that use the supported resolved_ids/submitted_ids shape instead of an instances array, this helper records no per-run outcomes and falls back to the heuristic below. Fresh evidence: this new loader only reads instances, while the existing evaluator parser handles the sb-cli resolved_ids/submitted_ids shape; in pass@k sweeps that can attribute a skill activation to the wrong run bucket whenever the resolved sample is not the guessed one.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
Comment on lines +453 to +454
if let Ok(mut root_cfg) =
toml::from_str::<crate::config::schema::RootCfg>(&manifest.config.resolved)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject malformed resolved sweep config

If results.json has a malformed manifest.config.resolved, this branch silently leaves skills disabled, paths empty, and the default redactor in place, so the command can exit successfully with inactive skills missing instead of treating the sweep artifact as corrupted. Since the report depends on that resolved config for the skill universe and redaction policy, propagate the parse error when the manifest is present.

Useful? React with 👍 / 👎.

…ape, and propagate resolved config parser errors
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 245f1b6747

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/run/skill_coverage.rs Outdated
Comment on lines +784 to +785
if is_active {
instances_activated_global += 1;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Count activated instances by original instance id

For pass@k/retry sweeps where the same skill activates in more than one run-*.traj.json for the same instance, instance_data contains one row per run slot, so this increment reports two instances_activated (and activation_rate later divides by the number of runs) instead of one distinct instance as promised by the schema. Fresh evidence is that this revision now pushes each run file as a separate InstanceSkillData (id-run-N), so retry attempts inflate instance-level rates while total_activations should be the per-run counter.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
Comment on lines +386 to +387
let run_index = if let Some(idx_str) = filename.split("-run-").nth(1) {
idx_str.parse::<u32>().unwrap_or(1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Parse run report suffix from the last marker

When bench evaluate was run with a custom run id that itself contains -run- (for example smoke-run-2026), the report file becomes smoke-run-2026-run-2.json but this parses the first marker and records the entries under run 2026 instead of run 2 (or run 1 if the middle segment is nonnumeric). Those explicit per-run outcomes then never match the run-2.traj.json slots, so skill activations fall back to the heuristic and can be bucketed under the wrong outcome; parse the suffix with the last -run- separator.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
Comment on lines +381 to +384
if let Ok(entries) = std::fs::read_dir(report_dir) {
for entry in entries.flatten() {
let path = entry.path();
if path.extension().and_then(|s| s.to_str()) == Some("json") {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Ignore stale sb-cli reports from older evaluations

When the same sweep is evaluated more than once, sb_cli_reports can contain report JSON for multiple run ids, but this loader reads every JSON file and inserts all of their per-run outcomes into the same map. If an older report disagrees with the current evaluation.json, whichever directory entry is visited last wins for a given instance/run slot, so retry skill activations can be bucketed nondeterministically using stale evaluator results; restrict the scan to the report paths/run id recorded in the current evaluation provenance.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 39a7fdd276

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/run/skill_coverage.rs Outdated
.active_skills_unique
.iter()
.any(|s| redactor.redact_text(&s.name, surface::TRAJECTORY).text == redacted_name);
let is_parent_resolved = resolved_set.contains(&d.parent_instance_id);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use run outcomes for bucket resolved rates

In retry/pass@k sweeps, this uses the parent instance's final resolution for every run row, so an earlier failed run-1.traj.json from an instance that later resolves is counted as resolved inside the unresolved bucket. The run is now correctly placed in d.bucket, but by_outcome.unresolved.resolved_rate_when_active/resolved_rate_when_not_active can still become non-zero (or 1.0) for those failed attempts, corrupting the per-bucket correlation metrics; bucket-specific resolved totals should be based on the run bucket rather than resolved_set for the parent.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
}
}

let _ = parse_report_outcomes(&path, run_index, &mut resolved_by_run);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Propagate sb-cli report parse failures

When a current sb_cli_reports/*.json file for a retry/pass@k evaluation is unreadable or malformed, this drops the error and leaves resolved_by_run incomplete, so the later heuristic can silently bucket run activations under the wrong outcome. Because these reports are the only precise source for per-run outcomes after bench evaluate, skill-coverage should fail on parse/read errors for matching reports instead of producing correlation metrics from guessed outcomes.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
let entry = config_groups
.entry(fingerprint)
.or_insert_with(|| (Vec::new(), d.eligible_paths.clone()));
entry.0.push(d.id.clone());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Count drift groups by parent instance

In retry/pass@k sweeps instance_data contains one row per run-*.traj.json, but this records the synthetic run id (inst-run-N) for drift groups. A single instance with two attempts and the same skill config is therefore reported as two instance_counts, so the drift summary can exceed the number of sweep instances; use distinct parent_instance_ids per fingerprint for the instance count.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs Outdated
) -> Result<(), Error> {
let text = std::fs::read_to_string(path)?;
let val = serde_json::from_str::<serde_json::Value>(&text)?;
if let Some(instances) = val.get("instances").and_then(serde_json::Value::as_array) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Parse array-shaped evaluator reports

The existing evaluator loader accepts sb-cli/generic reports that are a top-level array of {instance_id, resolved} rows, but this per-run loader only checks an instances object field and then the resolved_ids shape. For pass@k sweeps whose sb_cli_reports/*-run-N.json use the supported array shape, resolved_by_run stays empty and the command falls back to the heuristic, which can attribute skill activations to the wrong run outcome.

Useful? React with 👍 / 👎.

…te parse failures, count drift by parent instances, parse array reports
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c1aa5033b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/run/skill_coverage.rs
} else if run_index == 1 {
pass_at_1
} else {
resolved_count == runs || (run_index == runs)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid double-counting resolved retry slots

When per-run evaluator reports are unavailable and a pass@k sweep has pass_at_1=true, resolved_count=1, and runs>1, this fallback marks run 1 resolved via pass_at_1 and also marks the last run resolved via run_index == runs. That contradicts the aggregate resolved_count and puts activations from an actually unresolved later retry into the resolved bucket, corrupting the per-outcome skill correlation for sweeps that only have results.json metadata.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs
Comment on lines +312 to +313
if nested.exists() {
return vec![nested];
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include run files beside legacy trajectories

When an instance directory contains a legacy trajectory.json plus newer run-N.traj.json files, this early return loads only the legacy file and drops the retry/pass@k slots. The codebase supports that mixed layout elsewhere by treating trajectory.json as run 1 and run-N.traj.json as later slots, so skill activations and per-run outcome buckets are underreported for those sweeps.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs
Comment on lines +445 to +448
if let (Some(inst_id), Some(resolved)) = (
row.get("instance_id").and_then(serde_json::Value::as_str),
row.get("resolved").and_then(serde_json::Value::as_bool),
) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat missing per-run resolved flags as false

For current sb_cli_reports rows that use the generic evaluator shape with instance_id/eval_exit_reason but omit resolved, the evaluator parser treats the row as unresolved, while this loader drops it because resolved must be a bool. In pass@k sweeps that leaves resolved_by_run incomplete and the later heuristic can move activations from that failed run into the wrong outcome bucket.

Useful? React with 👍 / 👎.

Comment thread src/run/skill_coverage.rs

let run_bucket = if is_run_resolved {
OutcomeBucket::Resolved
} else if traj.info.outcome.as_deref() == Some(crate::trajectory::outcome::ERROR) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve errored buckets for legacy trajectories

When a single-run/legacy trajectory omits info.outcome but the corresponding results.json row has outcome: "error", this check falls through to Unresolved instead of using the result-row outcome as a fallback. Those errored instances then inflate the unresolved bucket and hide skill usage in the errored bucket for older or minimally populated trajectory artifacts.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add bench skill-coverage to attribute per-skill activation vs resolution

1 participant