diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md index 94e3dc9..fd4d5b3 100644 --- a/.github/pull_request_template.md +++ b/.github/pull_request_template.md @@ -55,7 +55,7 @@ Tessl-authenticated checks: - [ ] `bash scripts/check_publish_dry_run.sh .` - [ ] `tessl plugin publish --dry-run --bump patch .` -- [ ] `tessl skill review --threshold 100 skills/java-streams/SKILL.md`, if skill text or references changed +- [ ] `tessl review run --workspace martinfrancois --threshold 100 skills/java-streams/SKILL.md`, if skill text or references changed - [ ] Targeted main/reference `scripts/run_eval_suite.sh `, if skill behavior or those evals changed - [ ] Targeted regression `scripts/run_eval_suite.sh regression `, if regression evals changed - [ ] Every substantively changed eval scenario was rerun targeted and reached 100% with context, or the PR explains the Tessl blocker and remaining work @@ -65,7 +65,7 @@ Tessl-authenticated checks: - [ ] `scripts/classify_eval_result.py --scenario-dir `, if a scenario was added or moved between suites - [ ] Full/main `scripts/run_eval_suite.sh main`, if benchmark claims changed or targeted with-context results are clean -`bash scripts/check_publish_dry_run.sh .`, `tessl skill review`, and hosted Tessl evals require +`bash scripts/check_publish_dry_run.sh .`, `tessl review run`, and hosted Tessl evals require Tessl authentication. Hosted evals also require a linked Tessl project. If you can't run one of them, leave it unchecked and explain why in the details. diff --git a/.github/workflows/skill-review.yml b/.github/workflows/skill-review.yml index 18da655..d1b465a 100644 --- a/.github/workflows/skill-review.yml +++ b/.github/workflows/skill-review.yml @@ -44,4 +44,4 @@ jobs: - name: Review skill if: ${{ env.TESSL_TOKEN_AVAILABLE == 'true' }} - run: tessl skill review --threshold 100 skills/java-streams/SKILL.md + run: tessl review run --workspace martinfrancois --threshold 100 skills/java-streams/SKILL.md diff --git a/.gitignore b/.gitignore index e1cff3b..e6d15e0 100644 --- a/.gitignore +++ b/.gitignore @@ -4,6 +4,7 @@ __pycache__/ *.py[cod] .tessl/cache/ .tessl/tmp/ +.tessl/eval-evidence/ .codex/ .mcp.json .vscode/ diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 4a2e8c3..e4eaafc 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -78,7 +78,7 @@ tessl plugin lint . If you change the skill text or reference files, also run: ```bash -tessl skill review --threshold 100 skills/java-streams/SKILL.md +tessl review run --threshold 100 skills/java-streams/SKILL.md ``` If you have Tessl access, run the publish dry-run: @@ -88,10 +88,12 @@ bash scripts/check_publish_dry_run.sh . tessl plugin publish --dry-run --bump patch . ``` -Hosted evals require Tessl authentication and a linked Tessl project. Use Sonnet 4.6 for this -repository's main eval checks. Prefer `scripts/run_eval_suite.sh`; it runs from a temporary plugin -copy so with-context variants can see the skill bundle. Start with targeted runs when possible to -conserve Tessl daily rate-limit budget: +Hosted evals require Tessl authentication and a linked Tessl project. Prefer +`scripts/run_eval_suite.sh`; it runs from a temporary plugin copy so with-context variants can see +the skill bundle. By default, it uses Tessl's default solver so contributors without model-selection +entitlement can still run evals. If your Tessl plan allows explicit model selection, Sonnet 4.6 or a +better frontier model is recommended for a more representative real-world check. Start with targeted +runs when possible to conserve Tessl daily rate-limit budget: ```bash scripts/run_eval_suite.sh main @@ -99,12 +101,12 @@ scripts/run_eval_suite.sh main Run hosted eval variants by suite purpose: -- `evals/`: run both `without-context` and `with-context`; these runs support public lift +- `evals/`: run both baseline control and `with-context`; these runs support public lift reporting. Use `scripts/run_eval_suite.sh main`. -- `evals-reference/`: run both `without-context` and `with-context`; these runs decide whether a +- `evals-reference/`: run both baseline control and `with-context`; these runs decide whether a scenario has meaningful lift or should move suites. Use `scripts/run_eval_suite.sh reference`. - `evals-regression/`: run `with-context` only by default; these runs are safety checks, not lift - discovery. Run regression `without-context` only when deliberately checking whether a scenario + discovery. Run regression baseline control only when deliberately checking whether a scenario should move back to `evals-reference/`. Use `scripts/run_eval_suite.sh regression`. ## Commit Messages diff --git a/docs/agents/eval-risk-probes.txt b/docs/agents/eval-risk-probes.txt new file mode 100644 index 0000000..18e225b --- /dev/null +++ b/docs/agents/eval-risk-probes.txt @@ -0,0 +1,26 @@ +# suite:scenario entries that historically exposed scored with-context failures, ordered by +# observed failure count per hosted solution cost from the paginated Tessl/local run-history +# corpus. Missing-score/incomplete runs are excluded from this ranking. These are scheduling +# probes only. They do not change final evidence requirements. +reference:26-uppercase-side-effect-review +reference:08-primary-contact-review +main:02-delivery-appointments-mapconcurrent +regression:04-primary-address-review +regression:22-java8-version-scan +main:01-offer-availability-mapconcurrent +main:03-payment-screening-gatherer-review +regression:16-java11-report-review +regression:19-null-collector-review +reference:15-session-roster-indexes +reference:28-overdue-shipment-notices +reference:05-parallel-cpu-review +regression:09-order-collector-report +regression:23-collector-order-scan +regression:01-permission-and-orders +regression:13-training-and-packets +regression:14-parallel-mutation-review +regression:17-java8-optional-prefix-review +regression:24-mutable-batch-modernization +regression:25-hard-stop-scan-audit +main:04-invoice-bounds-and-temperature-windows +reference:27-uppercase-names-implementation diff --git a/docs/agents/evals.md b/docs/agents/evals.md index ca0a594..375ed30 100644 --- a/docs/agents/evals.md +++ b/docs/agents/evals.md @@ -8,8 +8,23 @@ benchmark claims, or scoring rules. ## Rules - Don't cheat. Don't leak the diagnosis or desired fix in eval prompts. +- Run quality review first, and if it is below 100%, stop and fix all quality issues before any new + hosted eval rerun. Then execute targeted evals for every changed scenario, and only progressively + broaden suites after targeted runs are clean. Preserve the daily budget by stopping at each stage + unless failures require another targeted rerun; only then proceed to broader hosted checks. If a + broad run shows any with-context below 100%, stop that run and return to targeted reruns for failed + scenarios only. +- Use the pre-submit gate before your first hosted command: + + ```bash + scripts/pre_submit_gate.sh --plan-only + ``` + + Then execute only the printed targeted stages. The gate now enforces that each stage reaches + 100% with-context before allowing expansion to the next stage. - Keep natural activation prompts neutral. Explicit invocation prompts may name `$java-streams`, but - should not leak the desired fix beyond invoking the skill. + should not leak the desired fix beyond invoking the skill. Mark prompts that name `$java-streams` + as `metadata.invocation: "explicit"` and do not report them as natural activation evidence. - The main eval should focus on realistic failure modes where this skill should change the answer: Java stream code that may materialize unnecessarily, count for existence, sort for one extreme, mishandle order, misuse collectors, @@ -40,7 +55,14 @@ benchmark claims, or scoring rules. - Every `criteria.json` must classify `metadata.invocation` and `metadata.task_type`. - Use `metadata.evidence_type` when scenario placement needs to be explicit: - `ordinary_lift`: an ordinary main or reference scenario where both variants are fair to compare. - This value is invalid in `evals-regression/`. + This value is invalid in `evals-regression/`, and it must not be used when the task overlaps + same-domain runtime skill references. + - `focused_main`: a main-suite scenario that intentionally covers a specific skill behavior and + may share bounded, documented runtime-reference overlap. Report it separately from ordinary + broad natural lift. + - `focused_reference`: a reference-suite scenario that intentionally emphasizes a specific skill + behavior or behavior delta. It may carry high weight on that focused behavior, but it is not + ordinary broad lift or unseen generalization evidence. - `solved_regression`: a regression scenario that hosted history shows both variants solve at 100%. This value is invalid outside `evals-regression/`. - `skill_context_dependent`: a regression scenario that requires exact skill-provided text, @@ -54,9 +76,17 @@ benchmark claims, or scoring rules. maintainability points per 100-point scenario unless a scenario has a documented reason to differ. - Runtime skill references must not contain eval inventories, expected answers, score rubrics, hosted run IDs, or fixed score claims. -- Active evals must not be near-copies of runtime reference examples. It is fine for an eval to test - `Gatherers.mapConcurrent`, `Collectors.teeing`, `takeWhile`, or `dropWhile`; it is not fine to - reuse the same domain class, method, record, constant, and carrier pattern from a runtime example. +- Don't cheat by leaking the desired fix in task prompts, and don't treat runtime-reference overlap + as ordinary lift. Same-domain or near-solution overlap between runtime skill references and eval + tasks is allowed only for explicitly classified focused main, focused reference, or regression + evidence. +- Metadata rationale documents why an overlap classification is honest; it does not bypass + validation by itself. Vague words such as "focused", "coverage", "kept", or "intentional" are not + enough unless the suite placement and `metadata.evidence_type` agree. +- Active main/general evals must stay neutral. It is fine for an eval to test `Gatherers.mapConcurrent`, + `Collectors.teeing`, `takeWhile`, or `dropWhile`; it is not fine for ordinary broad lift evidence + to reuse the same domain identifiers, record names, method names, constants, thresholds, long + phrasing, or near-solution helper shapes from a runtime example. - Active and reference/regression tasks must not be exact or near-exact duplicates. If a reference scenario is promoted or replaced by active coverage, delete or materially rewrite the reference copy and document the numbering gap. @@ -114,9 +144,12 @@ benchmark claims, or scoring rules. legitimate coverage just to improve lift. - Track raw score, percentage-point lift, raw score ratio, missed-point reduction, and the `stream_quality` subtotal when updating benchmark claims. -- Use `scripts/run_eval_suite.sh` for hosted evals. It runs from a temporary plugin copy so - with-context variants can see the skill bundle, and it enforces the suite variant policy. Use - Sonnet 4.6 unless intentionally comparing another model. +- Use `scripts/run_eval_suite.sh` for hosted evals. It runs from a temporary plugin copy, passes + `--skill java-streams` so with-context runs actually exercise this skill, passes `--force` so + post-fix checks cannot reuse stale hosted solutions, and enforces the suite variant policy. Use the + Tessl default solver unless intentionally comparing another model. If the account has + model-selection entitlement, Sonnet 4.6 or a better frontier model is recommended for a more + representative real-world check. ```bash scripts/run_eval_suite.sh main @@ -127,20 +160,30 @@ benchmark claims, or scoring rules. - Direct equivalent for this repository's main eval runs: ```bash - tessl eval run --agent claude:claude-sonnet-4-6 --variant without-context --variant with-context . + tessl eval run --skill java-streams --force . ``` - The workflow-pinned Tessl CLI version accepts this plugin eval workflow from the repository root. - Public docs may still show tile-oriented examples; for this repository, use the pinned CLI and - `scripts/run_eval_suite.sh` as the source of truth. + The Tessl CLI runs the baseline control by default when plugin context is present. Use + `--skip-baseline` only for context-only regression runs. Public docs may still show tile-oriented + examples; for this repository, use the pinned CLI and `scripts/run_eval_suite.sh` as the source of + truth. + Tessl's public changelog notes that model and agent selection are plan-entitlement-gated: + . Tessl also documents why the default eval solver is not pinned + to Sonnet 4.6 for ordinary skill-development checks: + . Check + `tessl eval run --list-agents` for the current default because Tessl can change it over time. - Run variants by suite purpose: - - `evals/` main: always run both `without-context` and `with-context`, because it supports public + - `evals/` main: always run both baseline control and `with-context`, because it supports public lift reporting. - - `evals-reference/`: always run both `without-context` and `with-context`, because it is used to + - `evals-reference/`: always run both baseline control and `with-context`, because it is used to find meaningful lift and promotion candidates. - `evals-regression/`: run `with-context` only by default, because it is safety coverage rather than lift discovery. Run `without-context` for regression only when intentionally checking whether a scenario should move back to reference. - Keep hosted eval usage minimal while preserving confidence and Tessl daily rate-limit budget: + - Freeze runtime skill text before hosted spending whenever possible. The expensive failure mode is + not the final all-suite requirement itself; it is rerunning required evidence after later edits to + `skills/java-streams/SKILL.md` or bundled runtime references change the skill fingerprint. Do the + local scenario/criteria crosswalk and obvious skill wording fixes before starting hosted runs. - A pure suite move does not require a hosted rerun when `task.md`, `criteria.json`, and `capability.txt` content are unchanged except for suite-placement metadata or numbering notes. Run local validators and update suite totals/numbering instead. If the move also changes task @@ -155,19 +198,43 @@ benchmark claims, or scoring rules. the blocker and exact remaining targeted runs in the PR; benchmark and release-readiness claims remain blocked until those runs pass. - For runtime skill text or runtime reference changes, start with the affected scenario - directories most likely to move, using the suite variant rule above. + directories most likely to move, using the suite variant rule above. Use the pre-submit gate's + impact-analysis suggestions for runtime-only changes when no maintainer-specified focus is + obvious, and keep historical risk probes in the early targeted pass. + - After quality is 100 and targeted probes are clean, switch to balanced broad chunks for the + remaining required evidence. Do not keep broadening one scenario at a time unless a fresh broad + failure is likely and conserving eval-solutions is more important than elapsed time. - If any affected with-context result is below 100%, keep rerunning only those targeted scenarios after fixes until they are clean. - - Then run `evals/` for the main score. + - Then run the remaining `evals/` scenarios for the main score, excluding scenarios already proven + clean after the last skill bundle change. - Run relevant `evals-reference/` scenarios with both variants when deciding promotion or checking nearby behavior. - Before final release/open-source-ready claims after a runtime skill change, run every reference - scenario with both variants and every regression scenario with context only. Do this after the - targeted failures are clean and the main suite has run. + scenario with both variants and every regression scenario with context only. This evidence may be + split across targeted and broad runs, and already-proven scenarios should be excluded from later + broadening as long as they passed after the last change to the skill bundle. - Run `evals-regression/` with context only as a final safety check before release or after broad changes, not on every tuning loop. - If a broad run exposes isolated failures, fix those exact scenarios and rerun them targeted - before spending rate-limit budget on another broad suite run. + before spending rate-limit budget on another broad suite run. Preserve successful scenario-level + evidence from the same final skill state instead of rerunning it only because another scenario in + the suite failed. + - Never rerun a hosted eval merely because it looks stuck, slow, pending, or temporarily missing + scoring. Tessl scoring can lag after scenario execution. Keep polling the existing run with + `tessl eval view --json` and wait for completion or a hard service failure; only rerun + after a completed scored failure and a relevant fix, or after Tessl reports a non-recoverable run + failure. + - Never poll hosted evals with an unbounded loop or a bare `tessl eval run`. Poll the specific + existing run ID with bounded attempts, visible output, and a stop condition; if unexpected + background eval work appears, inspect process ancestry and Codex session logs before explaining + where it came from. + - For budgeting sanity, run the full suite stages incrementally and confirm each stage with + 100% with-context before the next one: + budget-aware remaining suites from `scripts/pre_submit_gate.sh`, or a maintainer-approved + explicit order via `--broad-order`. + - If the change is runtime-wide and no scenario edit exists, prefer a scoped focus run first (via + `--focus :`) before any broad suite rerun. ## Current Suite Composition @@ -179,21 +246,26 @@ Update this section whenever active eval membership or scoring changes. - Java 24 bounded remote-call / `Gatherers.mapConcurrent` coverage: 3 scenarios, 1200 checklist points. This dominates the current main score because hosted evidence previously showed strong deltas in that family; do not over-read it as broad Java Streams coverage. -- Scenario `01-offer-availability-mapconcurrent` is intentionally focused Java 24 runtime-guidance - coverage. It should remain a different domain and result-carrier pattern from the bundled bounded - `Gatherers.mapConcurrent` example in `stream-examples.md`; report it as focused skill-use coverage - rather than broad independent lift evidence. +- Scenarios `01-offer-availability-mapconcurrent` and `02-delivery-appointments-mapconcurrent` are + intentionally focused Java 24 runtime-guidance coverage. They should remain different domains and + result-carrier patterns from the bundled bounded `Gatherers.mapConcurrent` example in + `stream-examples.md`; report them as focused skill-use coverage rather than broad independent lift + evidence. - Java 17 collector and prefix-operation coverage: 1 scenario, 200 checklist points. - Uppercase side-effect review moved from main number `07` back to reference number `26` because it - remains useful ordinary reference lift evidence for external mutation and parallelism advice, but + remains useful explicit reference lift evidence for external mutation and parallelism advice, but the main suite should stay focused on the strongest evidence-weighted coverage. Keep it in reference unless future current-suite evidence shows that it meets the 30 pp floor and improves main coverage. - Session roster indexing moved from main number `06` to reference number `15` because hosted - evidence showed the without-context result was already high while with-context was clean. It - remains useful natural Java 17 collector coverage, but it is weak main-lift evidence. + evidence showed the without-context result was already high while with-context was clean, and the + runtime references contain same-domain session-registration examples. It remains useful focused + natural Java 17 collector coverage, but it is not ordinary broad lift evidence. +- Overdue shipment notices is focused reference coverage for extracting non-trivial stream lambda + bodies into helpers. Its high multi-line-lambda criterion weight is intentional focused + behavior-delta coverage, not ordinary broad lift evidence. - Hard-stop scan audits: regression explicit workflow-use only. -- Reference suite: 5 scenarios, 460 total checklist points. Deleted reference number 12 and +- Reference suite: 6 scenarios, 560 total checklist points. Deleted reference number 12 and regression-moved scenarios are not counted. - Regression suite: 19 scenarios, 1820 total checklist points. - Hosted benchmark evidence is pending rerun for the current active suite. Do not publish exact @@ -212,3 +284,4 @@ the criteria JSON check listed there. - [Workflow](workflow.md) - [Skill Behavior](skill-behavior.md) +- [Pre-Submit Gate](pre-submit-gate.md) diff --git a/docs/agents/maintaining-agent-docs.md b/docs/agents/maintaining-agent-docs.md index a0b7916..c3a4080 100644 --- a/docs/agents/maintaining-agent-docs.md +++ b/docs/agents/maintaining-agent-docs.md @@ -34,6 +34,16 @@ Use this when changing `AGENTS.md` or files under `docs/agents/`. If a requested change conflicts with existing instructions, stop and ask the user which version to keep. Don't auto-resolve. +## CLI Deprecation Response + +- When a CLI command mentioned in this repo starts warning as deprecated, switch runtime workflows to the + replacement command path in `scripts/` and keep the deprecated usage only behind a compatibility shim or + deprecation-triggered fallback. +- Update agent-facing docs (`CONTRIBUTING.md`, `docs/agents/workflow.md`, `docs/agents/pre-submit-gate.md`) at the + same time so future contributors run the current command. +- If uncertain whether a fallback is needed, preserve current behavior first, and add `--help`-based detection or + one-time test coverage instead of deleting the previous behavior immediately. + ## Remove Vague Rules When editing these docs, delete or rewrite anything redundant, vague, or too obvious to be diff --git a/docs/agents/pre-submit-gate.md b/docs/agents/pre-submit-gate.md new file mode 100644 index 0000000..2f8417a --- /dev/null +++ b/docs/agents/pre-submit-gate.md @@ -0,0 +1,229 @@ +# Pre-Submit Gate + +## Scope + +This page captures the internal pre-submission procedure for skill and eval changes so contributors can +run the same budget-aware sequence every time. + +## Rules + +- Run `scripts/pre_submit_gate.sh` with `--plan-only` (or `scripts/internal_pr_readiness.sh --plan`) + before major edits to inspect the planned sequence. +- Run `tessl review run --threshold 100 skills/java-streams/SKILL.md` first for any content change in + runtime references, runtime checks, or eval logic. If this fails, stop immediately and fix quality. +- `scripts/pre_submit_gate.sh` executes eval runs with JSON capture and validates that every scenario is + 100% with-context before allowing the sequence to advance. If any scenario is below 100% or + missing with-context scoring, fix those scenarios and rerun only the failing scope. +- Wait for Tessl to mark the whole eval run `completed` before advancing, even if the with-context + variant has already scored 100%. Main/reference runs may still have baseline solves or scores in + flight, and starting the next stage early can create avoidable overlapping hosted work. +- The hard goal for runtime skill changes is: after the last change to `skills/java-streams/SKILL.md` + or any file in that skill bundle, quality review is 100 and every retained scenario in `evals/`, + `evals-reference/`, and `evals-regression/` has 100% with-context evidence for that exact skill + bundle state. Evidence can be split across targeted and suite runs. +- Track evidence per scenario, not per suite label. Once a scenario has already passed 100% + with-context for the current skill bundle fingerprint, do not rerun it merely because its suite is + being broadened. Run only the remaining scenarios that lack valid evidence. +- For runtime/skill-text changes, run only the affected scope first, then continue to full `main`, + `reference`, and `regression` suites once targeted checks pass. +- For non-runtime edits, use targeted scope only. +- Do not spend budget on broad suites while a targeted with-context failure is known. If a targeted + run is below 100% with-context, fix that issue and rerun only the failed targeted scenario(s). +- Run ordered targeted probes in tiny batches by default. One scenario per targeted hosted run is + usually the best eval-solution budget tradeoff: if the probe fails, the gate stops after the + smallest possible hosted surface; if it passes, the scenario still counts toward final evidence. + The default order is manual focus, history-ranked risk probes, impact-analysis probes, then + directly changed scenario directories that were not already covered. +- Only after targeted with-context is clean should broader checks be run: + 1. remaining `main` + 2. remaining `reference` + 3. remaining `regression` (context only). +- After each hosted run stage, require an explicit confirmation of 100% with-context before moving on. +- Treat pending or slow hosted eval runs as normal. Do not rerun a hosted eval because it appears + stuck, slow, or missing scoring; patiently wait and poll `tessl eval view --json` until + Tessl reports completed scoring or a hard service failure. A slow pending run can later produce + valid scoring, while a rerun burns duplicate eval budget and can create confusing overlapping + evidence. +- Never use unbounded loops for hosted evals, reviews, or other external/quota-consuming commands. + Poll a specific run ID with bounded attempts, a sleep interval, visible output, and a clear stop + condition. Do not suppress all output from long-running hosted commands, because hidden output can + mask runaway behavior. +- If unexpected background eval/review work appears, audit both local process ancestry and Codex + session logs before explaining the cause. When a Codex tool call started the process, state that + directly with the timestamp, command, and evidence path instead of softening it as merely + unintentional. +- If Tessl eval access is blocked (auth/rate-limit/service error), stop at that stage and record exact + blocked commands; do not claim release-readiness without completion evidence or a blocker note. + +## Cost-aware sequencing + +To avoid repeated broad failures: + +- Treat runtime skill text as frozen before hosted eval spending begins. The biggest budget waste is + collecting partial or broad evidence, then changing `skills/java-streams/SKILL.md` or bundled + runtime references and invalidating that evidence. Do local criteria review, obvious wording fixes, + and quality cleanup first; then start hosted runs. +- Start with the minimum targeted surface first. +- Before hosted evals, do a local criteria crosswalk for changed runtime guidance: read the affected + scenarios' `task.md` and `criteria.json`, verify the skill text explicitly names the required + artifact, Java baseline, API, and rejection pattern, and fix obvious gaps before spending hosted + budget. Hosted evals are for validating behavior, not discovering avoidable wording omissions. +- The default strategy is phase-based, not endlessly one-by-one: local crosswalk and quality first, + small targeted probes while failure risk is high, then balanced broad chunks once targeted + evidence is clean and the runtime fingerprint is intentionally frozen. +- Batch only one expansion step per correction cycle (targeted → main → reference → regression). +- Preserve the final code state between stages; if code changes again, restart the sequence. +- Reuse successful run IDs and skip duplicate scenario checks whenever they were produced after the + last skill bundle change. +- If targeted failures were already rerun and are now clean, subtract those scenarios from the later + broad suite stage. Example: if `reference/05` and `regression/22` were fixed and rerun at 100%, + and then full main passes, the next hosted command should run the remaining reference scenarios + except `05`; the regression stage should likewise exclude `22`. +- If a later stage fails and the fix changes skill bundle content, the fingerprint changes and prior + evidence is stale. Rerun targeted failures first, then rebuild final evidence from the new state. +- When a mixed broad run has isolated failures, the gate records passing scenarios from that run + before stopping, so the next cycle can rerun only the failed scenarios and later finish the still + missing coverage. +- Broad stages should use a middle-ground batch size after targeted probes are clean. The default + gate chunks remaining scenarios in batches of up to 6 per suite, so small main/reference remainders + run together and larger regression sweeps do not spend a separate long scoring wait per scenario. + This does not reduce the all-pass floor, but it balances eval-solution risk against wall-clock + time. Use `--broad-batch-mode progressive` only when a fresh broad failure is likely and eval + budget matters more than elapsed time. +- Broad-stage order is budget-aware and does not change the final requirement. The default order is + `main,reference,regression`; most known failures are already pulled forward by historical risk + probes, and the remaining broad stage is only for still-unproven scenarios. Use `--broad-order` + only when current evidence points to a different risk distribution. +- For runtime-only changes, use impact analysis to choose early targeted probes before broadening. + The analyzer ranks scenarios by overlap between changed skill-bundle text and scenario + task/capability/criteria text. These are only scheduling hints: they may reduce wasted broad runs, + but they do not reduce the final evidence floor. +- If the local evidence cache is missing but a valid hosted JSON run exists for the same final skill + bundle state, ingest it instead of rerunning: + + ```bash + python3 scripts/eval_evidence.py ingest \ + --file .tessl/eval-evidence/java-streams-pre-submit.json \ + --fingerprint "$(python3 scripts/eval_evidence.py fingerprint --skill-dir skills/java-streams)" \ + --repo-root . \ + --suite reference \ + --run-json /tmp/reference-run.json + ``` + +Budget estimate for this repo (useful for planning): + +- `scripts/run_eval_suite.sh main` usually submits two variants (baseline + with-context). +- `scripts/run_eval_suite.sh reference` usually submits two variants. +- `scripts/run_eval_suite.sh regression` is typically context-only. +- A full final sequence therefore should be expected to be larger than a single targeted run, so run it + only when earlier stages are clean. +- Do not optimize below the required final evidence floor. With the current suite composition, a + fully clean final state requires 39 scenario-solutions: `main` 4 scenarios times 2 variants, + `reference` 6 times 2, and `regression` 19 times 1. Budget optimizations may remove duplicate or + stale reruns, but must not remove required scenario coverage or required variants. +- Before a reset-day run, calculate the remaining floor from current evidence and compare it with + available budget. If the remaining floor is small and targeted evidence is already clean, prefer + balanced batches to finish quickly instead of continuing with single-scenario broad probes. + +Postmortem rule: + +- If a tuning cycle consumes far more than the final evidence floor, pause and audit before starting + more hosted work. Separate unavoidable cost from avoidable burn: + - unavoidable: final evidence for the current fingerprint, including required variants; + - avoidable: evidence invalidated by later runtime edits, duplicate runs for slow scoring, + unbounded loops, broad runs before quality is 100, and hosted runs that reveal wording gaps a + local criteria crosswalk should have caught. +- Update this process and the gate defaults when the audit finds a better strategy. Prefer durable + scripted or documented changes over relying on memory. + +Run-count examples: + +- Three failing scenarios in one suite: fix, rerun them in ordered tiny targeted batches unless + preserving wall-clock time matters more than solution budget, then run only scenarios still + missing evidence. +- Two failing scenarios split across reference and regression: fix, rerun one targeted reference run + and one targeted regression run. If both pass and main later passes, run remaining reference + scenarios excluding the already-proven reference scenario, then remaining regression scenarios + excluding the already-proven regression scenario. +- If quality review is below 100, run zero hosted evals until quality is fixed. +- Re-run the parameter sweep when the failure distribution changes materially. Prefer paginated + Tessl run history over the CLI's default one-page list: the JSON response includes `links.next`, + which can be followed with the Tessl access token from `~/.tessl/api-credentials.json` without + starting any new eval runs. Exclude missing-score/incomplete runs from scenario risk ranking; use + scored below-100 with-context failures as the risk signal. +- Impact-analysis probes save budget when they catch a failure before broad suites. When they pass, + they still count toward final scenario evidence, so they should not add scenario-solutions beyond + the required final floor. +- Keep a historical risk-probe list in `docs/agents/eval-risk-probes.txt`. These are scenarios that + previously had scored with-context failures during skill tuning, ordered by observed failure count + per hosted solution cost. Run the first six by default for runtime changes, then continue with + balanced broad chunks. Increase `--risk-limit` only when local criteria crosswalk or recent + failures indicate a high chance of another broad failure; otherwise too many one-scenario probes + save little eval budget and cost too much wall-clock time across repeated fingerprints. + +## Scenario Evidence Cache + +`scripts/pre_submit_gate.sh` stores local scenario evidence in: + +```bash +.tessl/eval-evidence/java-streams-pre-submit.json +``` + +The cache is keyed by a SHA-256 fingerprint of the files under `skills/java-streams/`, including +`SKILL.md` and bundled references. When the skill bundle changes, cached evidence for the previous +fingerprint is ignored automatically. + +Useful options: + +- `--evidence-file `: use a different evidence cache, useful for simulations. +- `--reset-evidence`: clear the cache before planning. +- `--ignore-evidence`: force all planned scenarios to rerun. +- `--broad-order `: override automatic budget-aware ordering. +- `--impact-limit `: cap runtime-change impact-analysis probes. +- `--no-impact-analysis`: disable impact-analysis probes. +- `--risk-limit `: cap historical risk probes; default is 6. +- `--risk-probe-file `: override the historical risk-probe list. +- `--no-risk-probes`: disable historical risk probes. +- `--target-batch-size `: increase targeted same-suite batch size when wall-clock time is more + important than minimizing wasted eval-solutions. +- `--broad-batch-mode balanced|progressive|suite`: use balanced broad chunks by default. Choose + `progressive` when conserving eval solutions matters more than wall-clock time, or `suite` when + current evidence is very strong and you intentionally want the old full-remaining-suite behavior. + +## Internal command + +The internal pre-submit tool is: + +```bash +scripts/pre_submit_gate.sh +``` + +Use this internal non-plugin skill wrapper for concise local workflow: + +```bash +scripts/internal_pr_readiness.sh --plan +``` + +Modes: + +- `--plan` (default): print the staged run plan. +- `--targeted`: run targeted eval checks. +- `--full`: run targeted checks plus required full `main`/`reference`/`regression` stages (goal mode). + +Common usage: + +```bash +scripts/pre_submit_gate.sh --plan-only +scripts/pre_submit_gate.sh --focus main:02-delivery-appointments-mapconcurrent +scripts/pre_submit_gate.sh --plan-only --run-broad +scripts/pre_submit_gate.sh --run-broad --base-ref origin/main +scripts/internal_pr_readiness.sh --targeted --focus reference:05-parallel-cpu-review +scripts/internal_pr_readiness.sh --full --base-ref origin/main --auto-continue +``` + +Add `--auto-continue` only for automation after you have manually verified each stage yourself. + +## References + +- [Eval Guidance](evals.md) +- [Workflow](workflow.md) diff --git a/docs/agents/skill-behavior.md b/docs/agents/skill-behavior.md index ad4524a..7a101f7 100644 --- a/docs/agents/skill-behavior.md +++ b/docs/agents/skill-behavior.md @@ -39,6 +39,9 @@ guidance, or auto-selection wording. that name. - The skill should not force streams over clear stateful loops. Stateful sequence output, checked IO, prompts, mutation-heavy code, or complex early exits can remain imperative. +- Keep stream lambdas as short glue. Prefer method references or one-expression lambdas whose body + stays on the same line as `->`, and extract named helpers for branching, loops, temporary + variables, formatting, merge rules, or nested stream chains that would continue on later lines. - Runtime guidance should keep internal workflow language out of ordinary user-facing reviews. Avoid terms such as "hard stop", "marker", "scan", "checklist", and skill names unless the user asked for an explicit skill workflow, audit, or scan command. diff --git a/docs/agents/workflow.md b/docs/agents/workflow.md index df44a13..fb6bd9e 100644 --- a/docs/agents/workflow.md +++ b/docs/agents/workflow.md @@ -31,9 +31,23 @@ release-readiness. patch-bump dry-run for PR safety. For skill, eval, package, or release changes that should publish a new version, let Release Please bump the version before the exact-version publish check. -- For skill behavior or eval changes, run hosted evals with Sonnet 4.6, but start with the smallest - useful set to conserve Tessl daily rate-limit budget. Use `scripts/run_eval_suite.sh` so the run - uses plugin context and the right variant policy. +- For skill behavior or eval changes, run hosted evals with Tessl's default solver by default, but + start with the smallest useful set to conserve Tessl daily rate-limit budget. Use + `scripts/run_eval_suite.sh` so the run uses plugin context and the right variant policy. If the + account has model-selection entitlement, Sonnet 4.6 or a better frontier model is recommended for + a more representative real-world check. + + Before any manual sequence, run the internal pre-submit gate plan: + + ```bash + scripts/pre_submit_gate.sh --plan-only + ``` + + The gate enforces the quality-first + targeted-first ordering, validates that each stage is 100% with-context, + and helps avoid accidental broad-suite reruns between code changes. + Before the first hosted run, freeze intended runtime skill edits and do a local crosswalk against + affected scenario `task.md` and `criteria.json` files. Hosted evals should validate behavior, not + discover obvious missing artifact, Java-version, API, or rejection-pattern wording. If any eval scenario's `task.md`, `criteria.json`, or `capability.txt` changed, run that exact scenario before finishing the PR. A pure move between `evals/`, `evals-reference/`, and @@ -68,13 +82,30 @@ release-readiness. For runtime skill text or runtime reference changes, progressively widen the hosted checks before calling the PR done: first affected scenarios, then the full main suite, then every reference scenario with both variants, then every regression scenario with context only. The final post-change - evidence must show 100% with context for every retained scenario in every suite. Run regression + evidence must show 100% with context for every retained scenario in every suite, but the evidence + is scenario-level: targeted scenarios that already passed after the last skill bundle change count + toward the final requirement and should be excluded from later broad stages. Run regression without-context only when intentionally checking whether a scenario should move back to reference. If a broad run finds isolated failures, fix and rerun those scenarios targeted after the fix before - spending rate-limit budget on another broad suite run; once targeted failures are clean, finish the - remaining broad suites that have not yet run against the final skill state. If Tessl hosted evals - are unavailable or rate-limited, document the exact missing runs and do not call the PR - release-ready. + spending rate-limit budget on another broad suite run; once targeted failures are clean, finish only + the scenarios that still lack evidence against the final skill state. If Tessl hosted evals are + unavailable or rate-limited, document the exact missing runs and do not call the PR release-ready. + + For a scripted execution of this order, use: + + ```bash + scripts/pre_submit_gate.sh --run-broad + ``` + + It pauses after each stage, enforces 100% with-context checks before proceeding, uses + impact-analysis probes plus historical risk probes for runtime-only changes, and uses the + run-history-tuned broad order while preserving the same final evidence requirement. The default + targeted stage runs ordered probes in tiny batches. Once those are clean, the broad stage uses + balanced chunks so small remainders run together and larger regression sweeps avoid one long + scoring wait per scenario. Override `--broad-order`, `--target-batch-size`, or + `--broad-batch-mode` only when current evidence points to a different risk or wall-clock tradeoff. + If a cycle consumes far more than the final evidence floor, pause to audit why before starting more + hosted work, then update the gate or agent docs with any better strategy found. Release evals only cover the published main suite. After any runtime skill text or runtime reference change, a successful publish run is not enough by itself: before saying the release or @@ -95,10 +126,10 @@ release-readiness. Follow the recommendation unless the pull request documents a maintainer-approved override. -- Run the Tessl skill review at threshold 100 when changing runtime skill content: +- Run the Tessl skill quality review at threshold 100 when changing runtime skill content: ```bash - tessl skill review --threshold 100 skills/java-streams/SKILL.md + tessl review run --threshold 100 skills/java-streams/SKILL.md ``` - Pull request titles and commits must use Conventional Commits. Release Please uses them to update @@ -199,3 +230,4 @@ release-readiness. - [Project Identity](project-identity.md) - [Eval Guidance](evals.md) - [Public Metadata And OSS Readiness](public-metadata.md) +- [Pre-Submit Gate](pre-submit-gate.md) diff --git a/evals-reference/05-parallel-cpu-review/criteria.json b/evals-reference/05-parallel-cpu-review/criteria.json index ae6116f..80e9366 100644 --- a/evals-reference/05-parallel-cpu-review/criteria.json +++ b/evals-reference/05-parallel-cpu-review/criteria.json @@ -17,18 +17,27 @@ { "name": "Mentions overhead and measurement", "category": "stream_quality", - "max_score": 20, + "max_score": 15, "description": "Says the performance should be measured because parallel stream overhead can outweigh benefits." }, { "name": "Mentions common pool", "category": "stream_quality", - "max_score": 15, + "max_score": 10, "description": "Notes that parallel streams use the common fork-join pool by default and may contend with other work." + }, + { + "name": "Avoids multi-line lambda examples", + "category": "maintainability", + "max_score": 10, + "description": "If the review includes replacement or optional example code, keeps lambda bodies on the same line as `->` or uses named helpers/method references instead of showing block lambdas, arrows whose body starts on the next line, or callback bodies that continue on later lines." } ], "metadata": { "invocation": "explicit", - "task_type": "review" + "task_type": "review", + "evidence_type": "focused_reference", + "reference_selection": "Focused reference coverage for CPU-heavy stateless parallel stream advice.", + "runtime_reference_overlap_rationale": "Allowed only because this is reference-suite focused coverage, not ordinary broad lift." } } diff --git a/evals-reference/08-primary-contact-review/criteria.json b/evals-reference/08-primary-contact-review/criteria.json index 31108f5..bfd6ca1 100644 --- a/evals-reference/08-primary-contact-review/criteria.json +++ b/evals-reference/08-primary-contact-review/criteria.json @@ -41,6 +41,9 @@ ], "metadata": { "invocation": "natural", - "task_type": "review" + "task_type": "review", + "evidence_type": "focused_reference", + "reference_selection": "Focused reference coverage for preserving priority ordering when reviewing findFirst versus findAny changes.", + "runtime_reference_overlap_rationale": "Allowed only because this is reference-suite focused coverage, not ordinary broad lift." } } diff --git a/evals-reference/15-session-roster-indexes/criteria.json b/evals-reference/15-session-roster-indexes/criteria.json index 41df991..769ee05 100644 --- a/evals-reference/15-session-roster-indexes/criteria.json +++ b/evals-reference/15-session-roster-indexes/criteria.json @@ -1,5 +1,5 @@ { - "context": "Reference natural implementation: Java 17 collector scenario for flattening nested data, grouping, duplicate-key handling, sorted unique downstream results, and short-circuit existence checks. Demoted from main because hosted history showed a high without-context score, so it is useful broad stream coverage but weak main-lift evidence.", + "context": "Reference natural implementation: focused Java 17 collector scenario for flattening nested data, grouping, duplicate-key handling, sorted unique downstream results, and short-circuit existence checks. Demoted from main because hosted history showed a high without-context score and runtime references contain same-domain session examples.", "type": "weighted_checklist", "checklist": [ { @@ -17,15 +17,21 @@ { "name": "Flattens nested conference data directly", "category": "stream_quality", - "max_score": 25, + "max_score": 20, "description": "Traverses conferences to sessions to registrations with direct flatMap-style composition or an equivalently clear stream chain, without building nested temporary collections or mutating shared result maps inside forEach." }, { "name": "Uses collector semantics for track email lists", "category": "stream_quality", - "max_score": 25, + "max_score": 20, "description": "Builds opted-in emails by track with collector semantics such as groupingBy plus downstream mapping/filtering/collection, producing sorted unique email lists without a post-hoc shared mutable accumulation pass." }, + { + "name": "Avoids multi-line lambdas", + "category": "maintainability", + "max_score": 10, + "description": "Keeps stream and collector lambdas as short same-line glue. Extracts nested registration streams, duplicate-room comparison logic, or other multi-step callback bodies into named helpers or method references instead of using block lambdas, arrows whose body starts on the next line, or lambda bodies that continue their stream chain on later lines." + }, { "name": "Handles duplicate rooms explicitly", "category": "stream_quality", @@ -48,7 +54,8 @@ "metadata": { "invocation": "natural", "task_type": "implementation", - "evidence_type": "ordinary_lift", - "reference_selection": "demoted from active main eval coverage because release run 019ea20b-cf1b-73da-955f-d782db861b86 scored 92/100 without context and 100/100 with context; keep as reference coverage unless future hosted evidence shows stronger delta" + "evidence_type": "focused_reference", + "reference_selection": "Focused reference coverage for nested session-registration collector behavior and helper extraction.", + "runtime_reference_overlap_rationale": "Allowed only because this is reference-suite focused coverage, not ordinary broad lift." } } diff --git a/evals-reference/26-uppercase-side-effect-review/criteria.json b/evals-reference/26-uppercase-side-effect-review/criteria.json index 16a08ef..72fd946 100644 --- a/evals-reference/26-uppercase-side-effect-review/criteria.json +++ b/evals-reference/26-uppercase-side-effect-review/criteria.json @@ -1,5 +1,5 @@ { - "context": "Reference natural review: performance advice for a stream that mutates an external ArrayList and tempts agents to suggest parallelStream for a 10-million-element input.", + "context": "Reference explicit review: performance advice for a stream that mutates an external ArrayList and tempts agents to suggest parallelStream for a 10-million-element input.", "type": "weighted_checklist", "checklist": [ { @@ -40,7 +40,7 @@ } ], "metadata": { - "invocation": "natural", + "invocation": "explicit", "task_type": "review", "evidence_type": "ordinary_lift", "issue": "https://github.com/martinfrancois/java-streams-skill/issues/4" diff --git a/evals-reference/26-uppercase-side-effect-review/task.md b/evals-reference/26-uppercase-side-effect-review/task.md index d9f35fb..e6e008e 100644 --- a/evals-reference/26-uppercase-side-effect-review/task.md +++ b/evals-reference/26-uppercase-side-effect-review/task.md @@ -1,6 +1,6 @@ # Review uppercase stream performance advice -Create `review.md`. Assume Java 17. +Use `$java-streams` to create `review.md`. Assume Java 17. A company has code like this in its codebase. The real `names` list has about 10 million items; the snippet uses a smaller list only for brevity. The team noticed this operation is slow and wants diff --git a/evals-reference/28-overdue-shipment-notices/capability.txt b/evals-reference/28-overdue-shipment-notices/capability.txt new file mode 100644 index 0000000..20cdfe1 --- /dev/null +++ b/evals-reference/28-overdue-shipment-notices/capability.txt @@ -0,0 +1,2 @@ +Implement a Java 17 stream transformation that filters overdue shipments and maps them to notices +while keeping stream lambdas concise and extracting non-trivial derivation logic. diff --git a/evals-reference/28-overdue-shipment-notices/criteria.json b/evals-reference/28-overdue-shipment-notices/criteria.json new file mode 100644 index 0000000..94d1610 --- /dev/null +++ b/evals-reference/28-overdue-shipment-notices/criteria.json @@ -0,0 +1,50 @@ +{ + "context": "Reference natural implementation: focused issue #36 behavior-delta coverage for overdue shipment notices with filtering and non-trivial mapping logic that should stay out of multi-line stream lambdas.", + "type": "weighted_checklist", + "checklist": [ + { + "name": "Creates coherent Java 17 artifact", + "category": "safety", + "max_score": 5, + "description": "Creates OverdueShipmentNotices.java with the requested overdueNotices(List shipments, Clock clock) method, minimal Shipment and ShipmentNotice records, necessary imports, and Java 17-compatible code." + }, + { + "name": "Preserves filtering behavior", + "category": "safety", + "max_score": 5, + "description": "Includes only shipments with no deliveredAt value and a dueDate before LocalDate.now(clock), and preserves the encounter order of the input list." + }, + { + "name": "Computes notice fields correctly", + "category": "safety", + "max_score": 5, + "description": "Each notice contains the shipment id, customer email, ChronoUnit.DAYS days-late value from dueDate to today, and severity critical for at least 14 days late or late otherwise." + }, + { + "name": "Uses direct stream result production", + "category": "stream_quality", + "max_score": 3, + "description": "Uses a direct stream chain that filters overdue shipments and collects or returns the mapped notices without external mutable accumulation, manual indexing, background workers, or unrelated caching." + }, + { + "name": "Avoids multi-line stream lambdas", + "category": "stream_quality", + "max_score": 80, + "description": "Keeps lambdas passed to stream operations as concise one-expression glue or method references, with no block lambdas using braces inside the stream chain. Non-trivial overdue checks, days-late calculation, severity selection, or notice construction are extracted to named helper methods or equivalent reusable methods." + }, + { + "name": "Keeps API focused", + "category": "maintainability", + "max_score": 2, + "description": "Keeps the solution limited to the requested class, method, records, and private helpers without broad public API redesign or unnecessary abstractions." + } + ], + "metadata": { + "invocation": "natural", + "task_type": "implementation", + "evidence_type": "focused_reference", + "issue": "https://github.com/martinfrancois/java-streams-skill/issues/36", + "reference_selection": "Focused issue #36 behavior-delta coverage for extracting non-trivial stream lambda bodies into helpers.", + "runtime_reference_overlap_rationale": "Allowed only because this is reference-suite focused coverage, not ordinary broad lift or main-suite generalization evidence." + } +} diff --git a/evals-reference/28-overdue-shipment-notices/task.md b/evals-reference/28-overdue-shipment-notices/task.md new file mode 100644 index 0000000..34b60ee --- /dev/null +++ b/evals-reference/28-overdue-shipment-notices/task.md @@ -0,0 +1,29 @@ +# Implement overdue shipment notices + +Create `OverdueShipmentNotices.java`. Assume Java 17. + +Implement: + +```java +List overdueNotices(List shipments, Clock clock) +``` + +Use these records in the same file: + +```java +record Shipment(String id, String customerEmail, LocalDate dueDate, Optional deliveredAt) {} +record ShipmentNotice(String id, String customerEmail, long daysLate, String severity) {} +``` + +Rules: + +- Include only shipments whose `deliveredAt` is empty and whose `dueDate` is before + `LocalDate.now(clock)`. +- Preserve the encounter order of `shipments`. +- For each overdue shipment, return a notice containing: + - the shipment id, + - the customer email, + - the number of days late, + - severity `"critical"` when the shipment is at least 14 days late, otherwise `"late"`. +- Do not mutate the input list, add external dependencies, start background work, or introduce new + public APIs beyond the requested class, method, and records. diff --git a/evals-reference/NUMBERING.md b/evals-reference/NUMBERING.md index 1469e44..c1fafa5 100644 --- a/evals-reference/NUMBERING.md +++ b/evals-reference/NUMBERING.md @@ -24,6 +24,11 @@ Number `27` covers high-volume uppercase implementation from context. Keep it in `evals-reference/` unless future hosted history shows it should move to main or regression. +Number `28` covers implementation readability for stream chains that need non-trivial mapping logic: +the stream should use short glue lambdas or method references while extracted helpers own +multi-step derivation. Keep it in `evals-reference/` until targeted hosted evidence shows whether it +belongs in main or regression. + Number `25` contains the explicit hard-stop scan workflow audit that was demoted from the main eval set and later moved to `evals-regression/`. It requires exact skill-provided text, so report it as with-context regression coverage rather than as main or reference Java stream reasoning lift. diff --git a/evals-regression/22-java8-version-scan/criteria.json b/evals-regression/22-java8-version-scan/criteria.json index 21e01c0..318a533 100644 --- a/evals-regression/22-java8-version-scan/criteria.json +++ b/evals-regression/22-java8-version-scan/criteria.json @@ -17,9 +17,15 @@ { "name": "Classifies Java 8 version drift", "category": "stream_quality", - "max_score": 60, + "max_score": 50, "description": "Flags Optional::stream, Stream.toList, Stream.ofNullable, takeWhile, dropWhile, Collectors.flatMapping, Collectors.teeing, and mapMulti as unavailable on Java 8 and gives Java 8-compatible directions." }, + { + "name": "Avoids multi-line lambda replacements", + "category": "maintainability", + "max_score": 10, + "description": "Keeps Java 8 replacement snippets free of multi-line lambdas. Uses helper methods, method references, or same-line one-expression lambdas instead of block lambdas, arrows whose body starts on the next line, or lambda bodies that continue a nested stream chain on later lines." + }, { "name": "Classifies allowed count usage", "category": "stream_quality", diff --git a/evals/01-offer-availability-mapconcurrent/criteria.json b/evals/01-offer-availability-mapconcurrent/criteria.json index d248a61..dd5ec1d 100644 --- a/evals/01-offer-availability-mapconcurrent/criteria.json +++ b/evals/01-offer-availability-mapconcurrent/criteria.json @@ -42,7 +42,9 @@ "metadata": { "invocation": "explicit", "task_type": "implementation", + "evidence_type": "focused_main", "main_eval_weight_multiplier": 5, - "main_eval_selection": "kept for bounded remote-check coverage; rerun hosted evals before updating benchmark claims" + "main_eval_selection": "kept for bounded remote-check coverage; rerun hosted evals before updating benchmark claims", + "runtime_reference_overlap_rationale": "Focused main-suite coverage for bounded mapConcurrent service-stub usage; not ordinary broad lift." } } diff --git a/evals/02-delivery-appointments-mapconcurrent/criteria.json b/evals/02-delivery-appointments-mapconcurrent/criteria.json index d5ebb21..8d430b8 100644 --- a/evals/02-delivery-appointments-mapconcurrent/criteria.json +++ b/evals/02-delivery-appointments-mapconcurrent/criteria.json @@ -42,7 +42,9 @@ "metadata": { "invocation": "natural", "task_type": "implementation", + "evidence_type": "focused_main", "main_eval_weight_multiplier": 4, - "main_eval_selection": "active replacement for a runtime-reference-overlapping bounded remote-check scenario; rerun hosted evals before updating benchmark claims" + "main_eval_selection": "active replacement for a runtime-reference-overlapping bounded remote-check scenario; rerun hosted evals before updating benchmark claims", + "runtime_reference_overlap_rationale": "Focused main-suite coverage for bounded mapConcurrent service-stub usage; not ordinary broad natural lift." } } diff --git a/scripts/assert_eval_with_context.py b/scripts/assert_eval_with_context.py new file mode 100755 index 0000000..2046731 --- /dev/null +++ b/scripts/assert_eval_with_context.py @@ -0,0 +1,164 @@ +#!/usr/bin/env python3 +"""Validate Tessl eval-run JSON and require with-context 100% across all scenarios.""" + +from __future__ import annotations + +import argparse +import json +import sys +from pathlib import Path +from typing import Any + + +def parse_json_payload(raw: str) -> dict[str, Any]: + try: + return json.loads(raw) + except json.JSONDecodeError: + pass + + # Fallback for mixed stdout where JSON is appended after text logs. + decoder = json.JSONDecoder() + for index, char in enumerate(raw): + if char != "{": + continue + try: + value, _ = decoder.raw_decode(raw[index:]) + except json.JSONDecodeError: + continue + if isinstance(value, dict): + return value + + raise ValueError("No valid JSON object found in Tessl output") + + +def get_scenarios(payload: dict[str, Any]) -> list[dict[str, Any]]: + attrs = None + data = payload.get("data") + if isinstance(data, dict): + attrs = data.get("attributes") + if isinstance(attrs, dict): + scenarios = attrs.get("scenarios") + if isinstance(scenarios, list): + return scenarios + + scenarios = payload.get("scenarios") + if isinstance(scenarios, list): + return scenarios + + raise ValueError("Could not locate eval scenarios in Tessl JSON") + + +def variant_solution(scenario: dict[str, Any], preferred: list[str]) -> dict[str, Any] | None: + solutions = scenario.get("solutions") + if not isinstance(solutions, list): + return None + + by_variant = {solution.get("variant"): solution for solution in solutions if isinstance(solution, dict)} + for variant in preferred: + candidate = by_variant.get(variant) + if candidate is not None: + return candidate + return None + + +def solved_with_context(solution: dict[str, Any]) -> tuple[float, float]: + if "score" in solution or "max_score" in solution or "maxScore" in solution: + score = solution.get("score", 0) + max_score = solution.get("max_score", solution.get("maxScore", 0)) + return float(score or 0), float(max_score or 0) + + assessment_results = solution.get("assessmentResults") + if isinstance(assessment_results, list): + score = 0.0 + max_score = 0.0 + for result in assessment_results: + if not isinstance(result, dict): + continue + score += float(result.get("score") or 0) + max_score += float(result.get("max_score", result.get("maxScore", 0)) or 0) + return score, max_score + + return 0.0, 0.0 + + +def with_context_score(scenario: dict[str, Any]) -> tuple[float, float] | None: + solution = variant_solution(scenario, ["usage-spec", "with-context"]) + if not solution: + return None + return solved_with_context(solution) + + +def scenario_name(scenario: dict[str, Any], index: int) -> str: + return ( + scenario.get("shortDescription") + or scenario.get("name") + or scenario.get("scenarioId") + or f"scenario-{index}" + ) + + +def format_pct(earned: float, maximum: float) -> str: + if maximum <= 0: + return "n/a" + return f"{(earned / maximum * 100):.2f}%" + + +def evaluate(payload: dict[str, Any], *, suite: str | None = None) -> int: + scenarios = get_scenarios(payload) + missing: list[str] = [] + failing: list[str] = [] + + for index, scenario in enumerate(scenarios, start=1): + if not isinstance(scenario, dict): + continue + name = scenario_name(scenario, index) + scores = with_context_score(scenario) + if scores is None: + missing.append(name) + continue + + earned, maximum = scores + if maximum <= 0: + missing.append(name) + continue + if earned < maximum: + failing.append(name) + + suite_label = f"{suite} " if suite else "" + if not scenarios: + print(f"FAIL: {suite_label}run payload contains no scenarios", file=sys.stderr) + return 1 + if missing: + print(f"FAIL: {suite_label}run is missing with-context scoring for {len(missing)} scenario(s):", file=sys.stderr) + for name in missing: + print(f" - {name}", file=sys.stderr) + return 1 + if failing: + print( + f"FAIL: {suite_label}with-context is below 100% for {len(failing)} scenario(s):", + file=sys.stderr, + ) + for name in failing: + print(f" - {name}", file=sys.stderr) + return 1 + + print(f"PASS: {suite_label}all with-context scores are 100% ({len(scenarios)} scenarios).") + return 0 + + +def main() -> int: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("run_json", type=Path, help="Path to `tessl eval run --json` output") + parser.add_argument("--suite", help="Suite label for logging clarity", default=None) + args = parser.parse_args() + + try: + payload = parse_json_payload(args.run_json.read_text(encoding="utf-8")) + return evaluate(payload, suite=args.suite) + except (OSError, ValueError, json.JSONDecodeError) as exc: + print(f"FAIL: unable to read eval JSON '{args.run_json}': {exc}", file=sys.stderr) + return 2 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/eval_evidence.py b/scripts/eval_evidence.py new file mode 100755 index 0000000..8b9e5b8 --- /dev/null +++ b/scripts/eval_evidence.py @@ -0,0 +1,314 @@ +#!/usr/bin/env python3 +"""Track hosted eval scenarios already proven clean for a skill bundle fingerprint.""" + +from __future__ import annotations + +import argparse +import hashlib +import json +from datetime import datetime, timezone +from pathlib import Path +from typing import Any + +from assert_eval_with_context import get_scenarios, parse_json_payload, with_context_score + + +SUITE_DIRS = { + "main": "evals", + "reference": "evals-reference", + "regression": "evals-regression", +} + + +def scenario_names(repo_root: Path, suite: str) -> list[str]: + suite_dir = repo_root / SUITE_DIRS[suite] + if not suite_dir.is_dir(): + raise SystemExit(f"Missing suite directory: {suite_dir}") + return sorted(path.name for path in suite_dir.iterdir() if path.is_dir()) + + +def fingerprint_scenario(repo_root: Path, suite: str, scenario: str) -> str | None: + scenario_dir = repo_root / SUITE_DIRS[suite] / scenario + if not scenario_dir.is_dir(): + return None + + digest = hashlib.sha256() + for path in sorted(scenario_dir.rglob("*")): + if not path.is_file(): + continue + relative = path.relative_to(scenario_dir).as_posix() + digest.update(relative.encode("utf-8")) + digest.update(b"\0") + digest.update(path.read_bytes()) + digest.update(b"\0") + return digest.hexdigest() + + +def scenario_fingerprints(repo_root: Path, suite: str, scenarios: list[str]) -> dict[str, str]: + result: dict[str, str] = {} + for scenario in normalize(scenarios): + fingerprint = fingerprint_scenario(repo_root, suite, scenario) + if fingerprint: + result[scenario] = fingerprint + return result + + +def scenario_task_map(repo_root: Path, suite: str) -> dict[str, str]: + mapping: dict[str, str] = {} + suite_dir = repo_root / SUITE_DIRS[suite] + if not suite_dir.is_dir(): + raise SystemExit(f"Missing suite directory: {suite_dir}") + + for scenario_dir in sorted(path for path in suite_dir.iterdir() if path.is_dir()): + task_path = scenario_dir / "task.md" + if task_path.is_file(): + mapping[hashlib.sha256(task_path.read_text(encoding="utf-8").encode("utf-8")).hexdigest()] = ( + scenario_dir.name + ) + return mapping + + +def fingerprint_skill(skill_dir: Path) -> str: + if not skill_dir.is_dir(): + raise SystemExit(f"Missing skill directory: {skill_dir}") + + digest = hashlib.sha256() + for path in sorted(skill_dir.rglob("*")): + if not path.is_file(): + continue + relative = path.relative_to(skill_dir).as_posix() + digest.update(relative.encode("utf-8")) + digest.update(b"\0") + digest.update(path.read_bytes()) + digest.update(b"\0") + return digest.hexdigest() + + +def load_state(path: Path) -> dict[str, Any]: + if not path.exists(): + return {} + try: + raw = json.loads(path.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError): + return {} + return raw if isinstance(raw, dict) else {} + + +def empty_state(fingerprint: str) -> dict[str, Any]: + return { + "fingerprint": fingerprint, + "validated": {suite: [] for suite in SUITE_DIRS}, + "scenario_fingerprints": {suite: {} for suite in SUITE_DIRS}, + "updated_at": None, + } + + +def state_for(path: Path, fingerprint: str) -> dict[str, Any]: + state = load_state(path) + if state.get("fingerprint") != fingerprint: + return empty_state(fingerprint) + + validated = state.setdefault("validated", {}) + fingerprints = state.setdefault("scenario_fingerprints", {}) + for suite in SUITE_DIRS: + values = validated.get(suite) + if not isinstance(values, list): + validated[suite] = [] + current = fingerprints.get(suite) + if not isinstance(current, dict): + fingerprints[suite] = {} + return state + + +def save_state(path: Path, state: dict[str, Any]) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + path.write_text(json.dumps(state, indent=2, sort_keys=True) + "\n", encoding="utf-8") + + +def normalize(values: list[str]) -> list[str]: + seen: set[str] = set() + result: list[str] = [] + for value in values: + if value and value not in seen: + seen.add(value) + result.append(value) + return result + + +def cmd_fingerprint(args: argparse.Namespace) -> int: + print(fingerprint_skill(args.skill_dir)) + return 0 + + +def cmd_list(args: argparse.Namespace) -> int: + for scenario in scenario_names(args.repo_root, args.suite): + print(scenario) + return 0 + + +def cmd_remaining(args: argparse.Namespace) -> int: + state = state_for(args.file, args.fingerprint) + universe = normalize(args.scenarios or scenario_names(args.repo_root, args.suite)) + validated = set(state["validated"].get(args.suite, [])) + fingerprints = state.get("scenario_fingerprints", {}).get(args.suite, {}) + for scenario in universe: + if scenario not in validated: + print(scenario) + continue + + current = fingerprint_scenario(args.repo_root, args.suite, scenario) + recorded = fingerprints.get(scenario) + if current and recorded and current == recorded: + continue + print(scenario) + return 0 + + +def cmd_mark(args: argparse.Namespace) -> int: + state = state_for(args.file, args.fingerprint) + validated = set(state["validated"].get(args.suite, [])) + scenarios = normalize(args.scenarios) + validated.update(scenarios) + state["validated"][args.suite] = sorted(validated) + state.setdefault("scenario_fingerprints", {})[args.suite] = { + **state.get("scenario_fingerprints", {}).get(args.suite, {}), + **scenario_fingerprints(args.repo_root, args.suite, scenarios), + } + state["updated_at"] = datetime.now(timezone.utc).isoformat() + save_state(args.file, state) + return 0 + + +def passing_scenarios(repo_root: Path, suite: str, run_json: Path) -> list[str]: + payload = parse_json_payload(run_json.read_text(encoding="utf-8")) + by_task_hash = scenario_task_map(repo_root, suite) + passed: list[str] = [] + + for scenario in get_scenarios(payload): + if not isinstance(scenario, dict): + continue + task = scenario.get("task") + if not isinstance(task, str): + continue + + scenario_name = by_task_hash.get(hashlib.sha256(task.encode("utf-8")).hexdigest()) + if not scenario_name: + continue + + scores = with_context_score(scenario) + if scores is None: + continue + + earned, maximum = scores + if maximum > 0 and earned >= maximum: + passed.append(scenario_name) + return normalize(passed) + + +def cmd_passing(args: argparse.Namespace) -> int: + for scenario in passing_scenarios(args.repo_root, args.suite, args.run_json): + print(scenario) + return 0 + + +def cmd_ingest(args: argparse.Namespace) -> int: + passed = passing_scenarios(args.repo_root, args.suite, args.run_json) + if not passed: + return 0 + + state = state_for(args.file, args.fingerprint) + validated = set(state["validated"].get(args.suite, [])) + validated.update(passed) + state["validated"][args.suite] = sorted(validated) + state.setdefault("scenario_fingerprints", {})[args.suite] = { + **state.get("scenario_fingerprints", {}).get(args.suite, {}), + **scenario_fingerprints(args.repo_root, args.suite, passed), + } + state["updated_at"] = datetime.now(timezone.utc).isoformat() + save_state(args.file, state) + + for scenario in passed: + print(scenario) + return 0 + + +def cmd_status(args: argparse.Namespace) -> int: + state = state_for(args.file, args.fingerprint) + print(f"fingerprint: {args.fingerprint}") + for suite in SUITE_DIRS: + all_scenarios = set(scenario_names(args.repo_root, suite)) + validated = set(state["validated"].get(suite, [])) + fingerprints = state.get("scenario_fingerprints", {}).get(suite, {}) + current_validated = { + scenario + for scenario in all_scenarios & validated + if fingerprints.get(scenario) == fingerprint_scenario(args.repo_root, suite, scenario) + } + valid_count = len(current_validated) + missing = sorted(all_scenarios - current_validated) + print(f"{suite}: {valid_count}/{len(all_scenarios)} validated") + if missing: + print(" missing: " + " ".join(missing)) + return 0 + + +def build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser(description=__doc__) + subparsers = parser.add_subparsers(dest="command", required=True) + + fingerprint = subparsers.add_parser("fingerprint") + fingerprint.add_argument("--skill-dir", type=Path, required=True) + fingerprint.set_defaults(func=cmd_fingerprint) + + list_cmd = subparsers.add_parser("list") + list_cmd.add_argument("--repo-root", type=Path, default=Path(".")) + list_cmd.add_argument("--suite", choices=sorted(SUITE_DIRS), required=True) + list_cmd.set_defaults(func=cmd_list) + + remaining = subparsers.add_parser("remaining") + remaining.add_argument("--file", type=Path, required=True) + remaining.add_argument("--fingerprint", required=True) + remaining.add_argument("--repo-root", type=Path, default=Path(".")) + remaining.add_argument("--suite", choices=sorted(SUITE_DIRS), required=True) + remaining.add_argument("scenarios", nargs="*") + remaining.set_defaults(func=cmd_remaining) + + mark = subparsers.add_parser("mark") + mark.add_argument("--file", type=Path, required=True) + mark.add_argument("--fingerprint", required=True) + mark.add_argument("--repo-root", type=Path, default=Path(".")) + mark.add_argument("--suite", choices=sorted(SUITE_DIRS), required=True) + mark.add_argument("scenarios", nargs="+") + mark.set_defaults(func=cmd_mark) + + passing = subparsers.add_parser("passing") + passing.add_argument("--run-json", type=Path, required=True) + passing.add_argument("--repo-root", type=Path, default=Path(".")) + passing.add_argument("--suite", choices=sorted(SUITE_DIRS), required=True) + passing.set_defaults(func=cmd_passing) + + ingest = subparsers.add_parser("ingest") + ingest.add_argument("--file", type=Path, required=True) + ingest.add_argument("--fingerprint", required=True) + ingest.add_argument("--run-json", type=Path, required=True) + ingest.add_argument("--repo-root", type=Path, default=Path(".")) + ingest.add_argument("--suite", choices=sorted(SUITE_DIRS), required=True) + ingest.set_defaults(func=cmd_ingest) + + status = subparsers.add_parser("status") + status.add_argument("--file", type=Path, required=True) + status.add_argument("--fingerprint", required=True) + status.add_argument("--repo-root", type=Path, default=Path(".")) + status.set_defaults(func=cmd_status) + + return parser + + +def main() -> int: + parser = build_parser() + args = parser.parse_args() + return args.func(args) + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/eval_impact.py b/scripts/eval_impact.py new file mode 100755 index 0000000..1cac536 --- /dev/null +++ b/scripts/eval_impact.py @@ -0,0 +1,256 @@ +#!/usr/bin/env python3 +"""Suggest targeted eval scenarios affected by skill bundle changes.""" + +from __future__ import annotations + +import argparse +import json +import math +import re +import subprocess +from collections import Counter +from dataclasses import dataclass +from pathlib import Path + + +SUITE_DIRS = { + "main": "evals", + "reference": "evals-reference", + "regression": "evals-regression", +} + +STOPWORDS = { + "about", + "after", + "again", + "against", + "also", + "and", + "any", + "are", + "because", + "before", + "being", + "between", + "but", + "can", + "cannot", + "code", + "does", + "each", + "example", + "file", + "for", + "from", + "has", + "have", + "into", + "its", + "java", + "keep", + "make", + "must", + "not", + "only", + "or", + "other", + "over", + "preserve", + "provided", + "requested", + "result", + "return", + "same", + "should", + "than", + "that", + "the", + "their", + "then", + "this", + "use", + "when", + "where", + "with", + "without", +} + + +@dataclass +class Scenario: + suite: str + name: str + path: Path + tokens: Counter[str] + + +def run_git(args: list[str], repo_root: Path) -> str: + completed = subprocess.run( + ["git", *args], + cwd=repo_root, + check=False, + text=True, + stdout=subprocess.PIPE, + stderr=subprocess.DEVNULL, + ) + return completed.stdout + + +def split_identifier(value: str) -> list[str]: + parts = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", value) + parts = re.sub(r"[^A-Za-z0-9]+", " ", parts) + return parts.lower().split() + + +def tokenize(text: str) -> Counter[str]: + tokens: Counter[str] = Counter() + for raw in re.findall(r"[A-Za-z][A-Za-z0-9_]*", text): + for token in split_identifier(raw): + if len(token) < 3 or token in STOPWORDS: + continue + tokens[token] += 1 + return tokens + + +def changed_files(repo_root: Path, base_ref: str, head_ref: str, skill_dir: str) -> list[Path]: + files = set() + diff_output = run_git(["diff", "--name-only", f"{base_ref}...{head_ref}", "--", skill_dir], repo_root) + for line in diff_output.splitlines(): + if line: + files.add(Path(line)) + + if head_ref == "HEAD": + status_output = run_git(["status", "--short", "--", skill_dir], repo_root) + for line in status_output.splitlines(): + if len(line) > 3: + files.add(Path(line[3:])) + + return sorted(files) + + +def git_file_text(repo_root: Path, ref: str, path: Path) -> str: + if ref == "HEAD": + full_path = repo_root / path + if full_path.exists(): + return full_path.read_text(encoding="utf-8") + return run_git(["show", f"{ref}:{path.as_posix()}"], repo_root) + + +def changed_text(repo_root: Path, base_ref: str, head_ref: str, paths: list[Path]) -> str: + chunks: list[str] = [] + for path in paths: + diff = run_git(["diff", "--unified=0", f"{base_ref}...{head_ref}", "--", str(path)], repo_root) + added_lines = [] + for line in diff.splitlines(): + if line.startswith("+++") or not line.startswith("+"): + continue + added_lines.append(line[1:]) + + if added_lines: + chunks.append(path.as_posix()) + chunks.extend(added_lines) + else: + chunks.append(path.as_posix()) + chunks.append(git_file_text(repo_root, head_ref, path)) + return "\n".join(chunks) + + +def scenario_text(path: Path) -> str: + parts: list[str] = [path.name] + for filename in ("task.md", "capability.txt", "criteria.json"): + file_path = path / filename + if not file_path.is_file(): + continue + if filename.endswith(".json"): + try: + parts.append(json.dumps(json.loads(file_path.read_text(encoding="utf-8")), sort_keys=True)) + except json.JSONDecodeError: + parts.append(file_path.read_text(encoding="utf-8")) + else: + parts.append(file_path.read_text(encoding="utf-8")) + return "\n".join(parts) + + +def load_scenarios(repo_root: Path) -> list[Scenario]: + scenarios: list[Scenario] = [] + for suite, directory in SUITE_DIRS.items(): + suite_dir = repo_root / directory + if not suite_dir.is_dir(): + continue + for scenario_dir in sorted(path for path in suite_dir.iterdir() if path.is_dir()): + scenarios.append( + Scenario( + suite=suite, + name=scenario_dir.name, + path=scenario_dir, + tokens=tokenize(scenario_text(scenario_dir)), + ) + ) + return scenarios + + +def idf_by_token(scenarios: list[Scenario]) -> dict[str, float]: + document_counts: Counter[str] = Counter() + for scenario in scenarios: + document_counts.update(scenario.tokens.keys()) + + total = len(scenarios) + return { + token: math.log((1 + total) / (1 + count)) + 1 + for token, count in document_counts.items() + } + + +def score_scenario(change_tokens: Counter[str], scenario: Scenario, idf: dict[str, float]) -> float: + score = 0.0 + for token, count in change_tokens.items(): + if token not in scenario.tokens: + continue + score += (1 + math.log(count)) * idf.get(token, 1.0) + return score + + +def main() -> int: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--repo-root", type=Path, default=Path(".")) + parser.add_argument("--base-ref", default="origin/main") + parser.add_argument("--head-ref", default="HEAD") + parser.add_argument("--skill-dir", default="skills/java-streams") + parser.add_argument("--limit", type=int, default=4) + parser.add_argument("--min-score", type=float, default=1.0) + parser.add_argument("--explain", action="store_true") + args = parser.parse_args() + + repo_root = args.repo_root.resolve() + files = changed_files(repo_root, args.base_ref, args.head_ref, args.skill_dir) + if not files: + return 0 + + change_tokens = tokenize(changed_text(repo_root, args.base_ref, args.head_ref, files)) + if not change_tokens: + return 0 + + scenarios = load_scenarios(repo_root) + idf = idf_by_token(scenarios) + ranked = sorted( + ( + (score_scenario(change_tokens, scenario, idf), scenario) + for scenario in scenarios + ), + key=lambda item: (-item[0], item[1].suite, item[1].name), + ) + + emitted = 0 + for score, scenario in ranked: + if emitted >= args.limit or score < args.min_score: + break + if args.explain: + print(f"{scenario.suite}:{scenario.name}\t{score:.2f}\t{scenario.path.relative_to(repo_root)}") + else: + print(f"{scenario.suite}:{scenario.name}") + emitted += 1 + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/internal_pr_readiness.sh b/scripts/internal_pr_readiness.sh new file mode 100755 index 0000000..ab88a77 --- /dev/null +++ b/scripts/internal_pr_readiness.sh @@ -0,0 +1,57 @@ +#!/usr/bin/env bash +set -euo pipefail +set -o pipefail + +usage() { + cat <<'USAGE' +Usage: + scripts/internal_pr_readiness.sh [mode] [pre_submit_gate options...] + +Modes: + --plan print the staged check plan only (default) + --targeted run the suggested targeted pre-submit checks + --full run targeted checks and required full suites in sequence + +This is an internal, non-plugin skill for maintainer workflow: +- it delegates to scripts/pre_submit_gate.sh +- keeps quality-first ordering intact +- avoids accidental broad hosted-ramp starts by default + +Examples: + scripts/internal_pr_readiness.sh --plan + scripts/internal_pr_readiness.sh --targeted --focus reference:05-... + scripts/internal_pr_readiness.sh --full +USAGE +} + +mode="plan" + +case "${1-}" in + --plan|--targeted|--full) + mode="$1" + shift + ;; + --help|-h) + usage + exit 0 + ;; + "") + ;; + *) + : + ;; +esac + +case "$mode" in + plan) + args=(--plan-only) + ;; + targeted) + args=(--auto-continue --targeted-only) + ;; + full) + args=(--auto-continue --run-broad) + ;; +esac + +exec bash "$(dirname "$0")/pre_submit_gate.sh" "${args[@]}" "$@" diff --git a/scripts/pre_submit_gate.sh b/scripts/pre_submit_gate.sh new file mode 100755 index 0000000..1c90fcd --- /dev/null +++ b/scripts/pre_submit_gate.sh @@ -0,0 +1,1289 @@ +#!/usr/bin/env bash +set -euo pipefail +set -o pipefail + +usage() { + cat <<'USAGE' +Usage: + scripts/pre_submit_gate.sh [options] + +Run the pre-submit hosted-eval gate used for final skill/runtime changes. + +Behavior: +- Always runs quality review first (or exits early if it is below threshold). +- Computes changed files against BASE_REF and builds targeted eval commands for changed + eval scenario directories. +- Uses lightweight impact analysis for runtime-only changes to choose early targeted probes. +- Runs targeted evals first. +- Records scenario-level passing evidence for the current skill bundle fingerprint and skips + already-proven scenarios when broadening. +- For runtime/reference changes, continues to full suite checks after targeted evals pass unless + explicitly constrained to targeted-only mode. + +Options: + --base-ref Git ref to diff against for changed files (default: + origin/main) + --skill-dir Skill directory for quality review + (default: skills/java-streams) + --plan-only Print the staged plan and do not run hosted eval commands. + --run-broad After targeted evals clean, run main then reference then + regression. + --targeted-only Skip full-suite expansion; run only explicitly targeted scope. + --impact-limit Maximum impact-analysis scenarios to add for runtime + changes (default: 4). + --no-impact-analysis Disable runtime-change impact-analysis focus suggestions. + --risk-limit Maximum historical risk probes to add for runtime + changes (default: 6). + --risk-probe-file + Historical risk probe list (default: + docs/agents/eval-risk-probes.txt). + --no-risk-probes Disable historical risk probes. + --target-batch-size + Maximum same-suite targeted scenarios per hosted run + (default: 1). Smaller batches discover failures with + fewer wasted eval-solutions. + --broad-batch-mode + Broad-stage batching: balanced, progressive, or suite + (default: balanced). + --broad-order Comma-separated broad-stage order, or auto + (default: auto). Example: regression,main,reference. + --evidence-file + Scenario evidence cache (default: + .tessl/eval-evidence/java-streams-pre-submit.json) + --reset-evidence Clear the evidence cache before planning. + --ignore-evidence Do not skip scenarios already recorded as passing. + --focus + Add a manual focus scenario, e.g. main:02-delivery... + Useful when runtime text changed without scenario-scoped + eval edits. + --auto-continue Skip confirm prompts between hosted stages. + -h, --help Show this help. + +Examples: + scripts/pre_submit_gate.sh --base-ref origin/main --plan-only + scripts/pre_submit_gate.sh --run-broad + scripts/pre_submit_gate.sh --focus main:02-delivery-appointments-mapconcurrent +USAGE +} + +need_confirmation=true +run_broad=false +targeted_only=false +plan_only=false +base_ref="origin/main" +skill_dir="skills/java-streams" +evidence_file=".tessl/eval-evidence/java-streams-pre-submit.json" +use_evidence=true +reset_evidence=false +runtime_fingerprint="" +broad_order_arg="auto" +impact_limit=4 +use_impact_analysis=true +risk_limit=6 +risk_probe_file="docs/agents/eval-risk-probes.txt" +use_risk_probes=true +target_batch_size=1 +broad_batch_mode="balanced" +review_workspace="${TESSL_REVIEW_WORKSPACE:-}" +focus_entries=() +impact_focus_main=() +impact_focus_reference=() +impact_focus_regression=() +risk_focus_main=() +risk_focus_reference=() +risk_focus_regression=() +explicit_focus_ordered=() +impact_focus_ordered=() +risk_focus_ordered=() +target_queue_suites=() +target_queue_scenarios=() + +while [[ $# -gt 0 ]]; do + case "$1" in + --base-ref) + if [[ $# -lt 2 ]]; then + echo "--base-ref requires an argument" >&2 + exit 2 + fi + base_ref="$2" + shift 2 + ;; + --skill-dir) + if [[ $# -lt 2 ]]; then + echo "--skill-dir requires an argument" >&2 + exit 2 + fi + skill_dir="$2" + shift 2 + ;; + --plan-only) + plan_only=true + shift + ;; + --run-broad) + run_broad=true + shift + ;; + --targeted-only) + targeted_only=true + shift + ;; + --impact-limit) + if [[ $# -lt 2 ]]; then + echo "--impact-limit requires an argument" >&2 + exit 2 + fi + if [[ ! "$2" =~ ^[0-9]+$ ]]; then + echo "--impact-limit must be a non-negative integer" >&2 + exit 2 + fi + impact_limit="$2" + shift 2 + ;; + --no-impact-analysis) + use_impact_analysis=false + shift + ;; + --risk-limit) + if [[ $# -lt 2 ]]; then + echo "--risk-limit requires an argument" >&2 + exit 2 + fi + if [[ ! "$2" =~ ^[0-9]+$ ]]; then + echo "--risk-limit must be a non-negative integer" >&2 + exit 2 + fi + risk_limit="$2" + shift 2 + ;; + --risk-probe-file) + if [[ $# -lt 2 ]]; then + echo "--risk-probe-file requires an argument" >&2 + exit 2 + fi + risk_probe_file="$2" + shift 2 + ;; + --no-risk-probes) + use_risk_probes=false + shift + ;; + --target-batch-size) + if [[ $# -lt 2 ]]; then + echo "--target-batch-size requires an argument" >&2 + exit 2 + fi + if [[ ! "$2" =~ ^[1-9][0-9]*$ ]]; then + echo "--target-batch-size must be a positive integer" >&2 + exit 2 + fi + target_batch_size="$2" + shift 2 + ;; + --broad-batch-mode) + if [[ $# -lt 2 ]]; then + echo "--broad-batch-mode requires an argument" >&2 + exit 2 + fi + case "$2" in + balanced|progressive|suite) + broad_batch_mode="$2" + ;; + *) + echo "--broad-batch-mode must be balanced, progressive, or suite" >&2 + exit 2 + ;; + esac + shift 2 + ;; + --broad-order) + if [[ $# -lt 2 ]]; then + echo "--broad-order requires an argument" >&2 + exit 2 + fi + broad_order_arg="$2" + shift 2 + ;; + --evidence-file) + if [[ $# -lt 2 ]]; then + echo "--evidence-file requires an argument" >&2 + exit 2 + fi + evidence_file="$2" + shift 2 + ;; + --reset-evidence) + reset_evidence=true + shift + ;; + --ignore-evidence) + use_evidence=false + shift + ;; + --focus) + if [[ $# -lt 2 ]]; then + echo "--focus requires a argument" >&2 + exit 2 + fi + focus_entries+=("$2") + shift 2 + ;; + --auto-continue) + need_confirmation=false + shift + ;; + -h|--help) + usage + exit 0 + ;; + *) + echo "Unknown option: $1" >&2 + usage >&2 + exit 2 + ;; + esac +done + +if ! command -v tessl >/dev/null 2>&1; then + echo "tessl CLI is required for pre-submit checks." >&2 + exit 127 +fi + +repo_root="$(git rev-parse --show-toplevel)" +cd "$repo_root" + +if [[ ! -d "$skill_dir" ]]; then + echo "Missing skill directory: $skill_dir" >&2 + exit 1 +fi + +detect_local_eval_runners() { + local proc pid cmd proc_cwd found=false + + for proc in /proc/[0-9]*; do + pid="${proc##*/}" + [[ "$pid" == "$$" || "$pid" == "$PPID" ]] && continue + + [[ -r "$proc/cmdline" ]] || continue + cmd="$( (tr '\0' ' ' < "$proc/cmdline") 2>/dev/null || true)" + [[ -z "$cmd" ]] && continue + case "$cmd" in + "tessl eval run"*|*"/tessl eval run"*|*"scripts/run_eval_suite.sh"*|*"scripts/pre_submit_gate.sh"*) + proc_cwd="$(readlink "$proc/cwd" 2>/dev/null || true)" + if [[ "$proc_cwd" == "$repo_root" || "$proc_cwd" == "$repo_root"/* ]]; then + if [[ "$found" == false ]]; then + echo "Another local eval runner is already active for this repo:" >&2 + found=true + fi + echo " pid=$pid cwd=$proc_cwd cmd=$cmd" >&2 + fi + ;; + esac + done + + if [[ "$found" == true ]]; then + echo "Stop the existing runner before starting a new pre-submit gate." >&2 + return 1 + fi +} + +detect_local_eval_runners + +if [[ -z "$review_workspace" && -f tessl.json ]]; then + review_workspace="$( + python3 - <<'PY' +import json +from pathlib import Path + +try: + name = json.loads(Path("tessl.json").read_text()).get("name", "") +except Exception: + name = "" +print(name.split("/", 1)[0] if "/" in name else "") +PY + )" +fi + +changed_main=() +changed_reference=() +changed_regression=() +explicit_focus_main=() +explicit_focus_reference=() +explicit_focus_regression=() +planned_passed_main=() +planned_passed_reference=() +planned_passed_regression=() + +collect_changed() { + local file scope scenario + local changed_list + + changed_list="$(mktemp)" + git diff --name-only "$base_ref"...HEAD -- > "$changed_list" || true + git diff --name-only -- >> "$changed_list" || true + git diff --cached --name-only -- >> "$changed_list" || true + git ls-files -o --exclude-standard >> "$changed_list" || true + + while IFS= read -r file; do + [[ -z "$file" ]] && continue + + case "$file" in + evals/*/*) + scope="${file%%/*}" + scenario="$(echo "$file" | awk -F/ '{print $2}')" + if [[ "$scope" == "evals" ]]; then + changed_main+=("$scenario") + fi + ;; + evals-reference/*/*) + scope="${file%%/*}" + scenario="$(echo "$file" | awk -F/ '{print $2}')" + if [[ "$scope" == "evals-reference" ]]; then + changed_reference+=("$scenario") + fi + ;; + evals-regression/*/*) + scope="${file%%/*}" + scenario="$(echo "$file" | awk -F/ '{print $2}')" + if [[ "$scope" == "evals-regression" ]]; then + changed_regression+=("$scenario") + fi + ;; + esac + done < "$changed_list" + + rm -f "$changed_list" +} + +normalize_targets() { + local src=("$@") + local uniq + local -A seen + + uniq=() + for item in "${src[@]}"; do + [[ -z "$item" ]] && continue + if [[ -n "${seen[$item]:-}" ]]; then + continue + fi + seen["$item"]=1 + uniq+=("$item") + done + printf '%s\n' "${uniq[@]}" +} + +suite_scenarios() { + local suite="$1" + python3 scripts/eval_evidence.py list --repo-root "$repo_root" --suite "$suite" +} + +remaining_scenarios() { + local suite="$1" + shift + local candidates=() + local planned=() + local scenario + local -A planned_seen=() + + if [[ "$use_evidence" != true ]]; then + if [[ $# -gt 0 ]]; then + candidates=("$@") + else + mapfile -t candidates < <(suite_scenarios "$suite") + fi + else + mapfile -t candidates < <(python3 scripts/eval_evidence.py remaining \ + --file "$evidence_file" \ + --fingerprint "$runtime_fingerprint" \ + --repo-root "$repo_root" \ + --suite "$suite" \ + "$@") + fi + + if [[ "$plan_only" == true && "$use_evidence" == true ]]; then + case "$suite" in + main) planned=("${planned_passed_main[@]}") ;; + reference) planned=("${planned_passed_reference[@]}") ;; + regression) planned=("${planned_passed_regression[@]}") ;; + esac + + for scenario in "${planned[@]}"; do + planned_seen["$scenario"]=1 + done + fi + + for scenario in "${candidates[@]}"; do + if [[ -n "${planned_seen[$scenario]:-}" ]]; then + continue + fi + printf '%s\n' "$scenario" + done +} + +remaining_targeted_scenarios() { + local suite="$1" + shift + + if [[ $# -eq 0 ]]; then + return 0 + fi + remaining_scenarios "$suite" "$@" +} + +mark_scenarios_passed() { + local suite="$1" + shift + + if [[ "$plan_only" == true || "$use_evidence" != true || $# -eq 0 ]]; then + return + fi + + python3 scripts/eval_evidence.py mark \ + --file "$evidence_file" \ + --fingerprint "$runtime_fingerprint" \ + --repo-root "$repo_root" \ + --suite "$suite" \ + "$@" +} + +mark_run_passing_scenarios() { + local suite="$1" + local run_log="$2" + local passed=() + + if [[ "$plan_only" == true || "$use_evidence" != true ]]; then + return + fi + + mapfile -t passed < <(python3 scripts/eval_evidence.py passing \ + --run-json "$run_log" \ + --repo-root "$repo_root" \ + --suite "$suite") + + if [[ "${#passed[@]}" -eq 0 ]]; then + return + fi + + echo "Recording passing scenario evidence for $suite: ${passed[*]}" + mark_scenarios_passed "$suite" "${passed[@]}" +} + +mark_scenarios_planned() { + local suite="$1" + shift + + if [[ "$plan_only" != true || $# -eq 0 ]]; then + return + fi + + case "$suite" in + main) + planned_passed_main+=("$@") + mapfile -t planned_passed_main < <(normalize_targets "${planned_passed_main[@]}") + ;; + reference) + planned_passed_reference+=("$@") + mapfile -t planned_passed_reference < <(normalize_targets "${planned_passed_reference[@]}") + ;; + regression) + planned_passed_regression+=("$@") + mapfile -t planned_passed_regression < <(normalize_targets "${planned_passed_regression[@]}") + ;; + esac +} + +run_quality_review() { + local skill_path="$1" + local threshold="${2:-100}" + local args=() + local output + local status + + if [[ -n "$review_workspace" ]]; then + args+=(--workspace "$review_workspace") + fi + + set +e + output="$(tessl review run "${args[@]}" --threshold "$threshold" "$skill_path" 2>&1)" + status=$? + set -e + + if [[ $status -eq 0 ]]; then + echo "$output" + return 0 + fi + + if echo "$output" | grep -qiE "unknown command|unknown flag|No help topic"; then + echo "Top-level tessl review command is unavailable. Falling back to legacy tessl skill review." + set +e + output="$(tessl skill review --threshold "$threshold" "$skill_path" 2>&1)" + status=$? + set -e + fi + + echo "$output" + return "$status" +} + +collect_focus() { + local entry scope scenario + + for entry in "${focus_entries[@]}"; do + scope="${entry%%:*}" + scenario="${entry#*:}" + [[ -z "$scope" || -z "$scenario" || "$scope" == "$entry" ]] && { + echo "Invalid --focus value '$entry' (expected :)" >&2 + exit 2 + } + + case "$scope" in + main) + explicit_focus_main+=("$scenario") + explicit_focus_ordered+=("$scope:$scenario") + ;; + reference) + explicit_focus_reference+=("$scenario") + explicit_focus_ordered+=("$scope:$scenario") + ;; + regression) + explicit_focus_regression+=("$scenario") + explicit_focus_ordered+=("$scope:$scenario") + ;; + *) + echo "Unknown focus scope '$scope' in '$entry'." >&2 + exit 2 + ;; + esac + done +} + +collect_impact_focus() { + local entry scope scenario + + if [[ "$changed_runtime" != true || "$use_impact_analysis" != true || "$impact_limit" -le 0 ]]; then + return + fi + + while IFS= read -r entry; do + [[ -z "$entry" ]] && continue + scope="${entry%%:*}" + scenario="${entry#*:}" + case "$scope" in + main) + impact_focus_main+=("$scenario") + impact_focus_ordered+=("$scope:$scenario") + ;; + reference) + impact_focus_reference+=("$scenario") + impact_focus_ordered+=("$scope:$scenario") + ;; + regression) + impact_focus_regression+=("$scenario") + impact_focus_ordered+=("$scope:$scenario") + ;; + esac + done < <(python3 scripts/eval_impact.py \ + --repo-root "$repo_root" \ + --base-ref "$base_ref" \ + --skill-dir "$skill_dir" \ + --limit "$impact_limit") +} + +collect_risk_probes() { + local entry scope scenario + local count=0 + + if [[ "$changed_runtime" != true || "$use_risk_probes" != true || "$risk_limit" -le 0 ]]; then + return + fi + if [[ ! -f "$risk_probe_file" ]]; then + return + fi + + while IFS= read -r entry; do + entry="${entry%%#*}" + entry="$(echo "$entry" | xargs)" + [[ -z "$entry" ]] && continue + + scope="${entry%%:*}" + scenario="${entry#*:}" + [[ -z "$scope" || -z "$scenario" || "$scope" == "$entry" ]] && continue + + case "$scope" in + main) + risk_focus_main+=("$scenario") + risk_focus_ordered+=("$scope:$scenario") + ;; + reference) + risk_focus_reference+=("$scenario") + risk_focus_ordered+=("$scope:$scenario") + ;; + regression) + risk_focus_regression+=("$scenario") + risk_focus_ordered+=("$scope:$scenario") + ;; + *) + continue + ;; + esac + + count=$((count + 1)) + if [[ "$count" -ge "$risk_limit" ]]; then + break + fi + done < "$risk_probe_file" +} + +run_command() { + local label="$1" + shift + echo + echo ">>> $label" + if [[ "$plan_only" == true ]]; then + echo "PLAN ONLY: $*" + return 0 + fi + "$@" +} + +extract_eval_run_id() { + local run_log="$1" + + python3 - "$run_log" <<'PY' +import json +import sys + +raw = open(sys.argv[1], encoding="utf-8").read() +decoder = json.JSONDecoder() +payload = None +for index, char in enumerate(raw): + if char not in "[{": + continue + try: + payload, _ = decoder.raw_decode(raw[index:]) + except json.JSONDecodeError: + continue + break + +if isinstance(payload, list): + for item in payload: + if isinstance(item, dict) and item.get("evalRunId"): + print(item["evalRunId"]) + raise SystemExit(0) +elif isinstance(payload, dict): + data = payload.get("data") + if isinstance(data, dict) and data.get("id"): + print(data["id"]) + raise SystemExit(0) +PY +} + +eval_view_state() { + local view_log="$1" + + python3 - "$view_log" <<'PY' +import json +import sys + +raw = open(sys.argv[1], encoding="utf-8").read() +decoder = json.JSONDecoder() +payload = None +for index, char in enumerate(raw): + if char != "{": + continue + try: + payload, _ = decoder.raw_decode(raw[index:]) + except json.JSONDecodeError: + continue + break + +if not isinstance(payload, dict): + print("pending") + raise SystemExit(0) + +attrs = payload.get("data", {}).get("attributes", {}) +status = attrs.get("status") +progress = attrs.get("progress", {}) +summary = progress.get("summary", {}) if isinstance(progress, dict) else {} +if status in {"failed", "error", "cancelled", "canceled"} or int(summary.get("failed") or 0) > 0: + print("failed") + raise SystemExit(0) +if status != "completed": + print("pending") + raise SystemExit(0) + +scenarios = attrs.get("scenarios") +if not isinstance(scenarios, list) or not scenarios: + print("pending") + raise SystemExit(0) + +def scored(solution: dict) -> bool: + if any(key in solution for key in ("score", "max_score", "maxScore")): + return True + return isinstance(solution.get("assessmentResults"), list) + +for scenario in scenarios: + if not isinstance(scenario, dict): + print("pending") + raise SystemExit(0) + solutions = scenario.get("solutions") + if not isinstance(solutions, list): + print("pending") + raise SystemExit(0) + by_variant = {solution.get("variant"): solution for solution in solutions if isinstance(solution, dict)} + solution = by_variant.get("usage-spec") or by_variant.get("with-context") + if not solution or not scored(solution): + print("pending") + raise SystemExit(0) + +print("ready") +PY +} + +wait_for_eval_view() { + local eval_run_id="$1" + local view_log="$2" + local attempt + local state + + for attempt in $(seq 1 180); do + if ! tessl eval view "$eval_run_id" --json > "$view_log"; then + return 1 + fi + + state="$(eval_view_state "$view_log")" + case "$state" in + ready) + return 0 + ;; + failed) + cat "$view_log" + return 1 + ;; + esac + + echo "Waiting for eval run $eval_run_id to finish scoring (attempt $attempt/180)..." + sleep 20 + done + + echo "Timed out waiting for eval run $eval_run_id to finish scoring." >&2 + return 1 +} + +run_suite_checked() { + local suite="$1" + shift + local scenarios=("$@") + local -a run_cmd + local eval_run_id + local run_log + local scored_log + local label + local scenario + local validated_scenarios=() + + if [[ "${#scenarios[@]}" -eq 0 ]]; then + label="scripts/run_eval_suite.sh $suite" + run_cmd=(scripts/run_eval_suite.sh "$suite" -- --json) + mapfile -t validated_scenarios < <(suite_scenarios "$suite") + else + label="scripts/run_eval_suite.sh $suite ${scenarios[*]}" + run_cmd=(scripts/run_eval_suite.sh "$suite") + for scenario in "${scenarios[@]}"; do + run_cmd+=("$scenario") + validated_scenarios+=("$scenario") + done + run_cmd+=(-- --json) + fi + + if [[ "$plan_only" == true ]]; then + echo + echo ">>> $label" + echo "PLAN ONLY: ${run_cmd[*]}" + mark_scenarios_planned "$suite" "${validated_scenarios[@]}" + return 0 + fi + + echo + echo ">>> $label" + run_log="$(mktemp)" + if ! ("${run_cmd[@]}" > "$run_log"); then + rm -f "$run_log" + return 1 + fi + + cat "$run_log" + scored_log="$run_log" + eval_run_id="$(extract_eval_run_id "$run_log" || true)" + if [[ -n "$eval_run_id" ]]; then + scored_log="$(mktemp)" + if ! wait_for_eval_view "$eval_run_id" "$scored_log"; then + rm -f "$run_log" "$scored_log" + return 1 + fi + cat "$scored_log" + fi + + if python3 scripts/assert_eval_with_context.py "$scored_log" --suite "$suite"; then + mark_run_passing_scenarios "$suite" "$scored_log" + rm -f "$run_log" + if [[ "$scored_log" != "$run_log" ]]; then + rm -f "$scored_log" + fi + return 0 + else + local status=$? + mark_run_passing_scenarios "$suite" "$scored_log" + rm -f "$run_log" + if [[ "$scored_log" != "$run_log" ]]; then + rm -f "$scored_log" + fi + return $status + fi +} + +confirm_checkpoint() { + local stage="$1" + if [[ "$need_confirmation" == false || "$plan_only" == true ]]; then + return + fi + echo + read -r -p "Confirm ${stage} with-context is 100% before continuing [y/N]: " ok + if [[ "$ok" != "y" && "$ok" != "Y" && "$ok" != "yes" && "$ok" != "YES" ]]; then + echo "Stopping. Document blocked run and rerun this stage before broadening." + exit 0 + fi +} + +run_suite() { + local suite="$1" + shift + local scenarios=("$@") + + run_suite_checked "$suite" "${scenarios[@]}" + if [[ "${#scenarios[@]}" -eq 0 ]]; then + confirm_checkpoint "$suite broad run" + else + confirm_checkpoint "$suite targeted run (${#scenarios[@]} scenario(s))" + fi +} + +run_if_nonempty() { + local suite="$1" + shift + local scenarios=("$@") + + if [[ "${#scenarios[@]}" -eq 0 ]]; then + return 0 + fi + + run_suite "$suite" "${scenarios[@]}" +} + +array_contains() { + local needle="$1" + shift + local item + + for item in "$@"; do + if [[ "$item" == "$needle" ]]; then + return 0 + fi + done + return 1 +} + +is_missing_target() { + local suite="$1" + local scenario="$2" + + case "$suite" in + main) + array_contains "$scenario" "${combined_main[@]}" + ;; + reference) + array_contains "$scenario" "${combined_reference[@]}" + ;; + regression) + array_contains "$scenario" "${combined_regression[@]}" + ;; + *) + return 1 + ;; + esac +} + +append_target_queue() { + local suite="$1" + local scenario="$2" + local key="$suite:$scenario" + + [[ -z "$suite" || -z "$scenario" ]] && return + if [[ -n "${target_queue_seen[$key]:-}" ]]; then + return + fi + if ! is_missing_target "$suite" "$scenario"; then + return + fi + + target_queue_seen["$key"]=1 + target_queue_suites+=("$suite") + target_queue_scenarios+=("$scenario") +} + +append_target_entry() { + local entry="$1" + local suite scenario + + suite="${entry%%:*}" + scenario="${entry#*:}" + [[ -z "$suite" || -z "$scenario" || "$suite" == "$entry" ]] && return + append_target_queue "$suite" "$scenario" +} + +build_target_queue() { + local scenario entry + declare -gA target_queue_seen=() + target_queue_suites=() + target_queue_scenarios=() + + for entry in "${explicit_focus_ordered[@]}"; do + append_target_entry "$entry" + done + for entry in "${risk_focus_ordered[@]}"; do + append_target_entry "$entry" + done + for entry in "${impact_focus_ordered[@]}"; do + append_target_entry "$entry" + done + + for scenario in "${changed_main[@]}"; do + append_target_queue main "$scenario" + done + for scenario in "${changed_reference[@]}"; do + append_target_queue reference "$scenario" + done + for scenario in "${changed_regression[@]}"; do + append_target_queue regression "$scenario" + done + + # Safety net for any targets not represented in the ordered source lists. + for scenario in "${combined_main[@]}"; do + append_target_queue main "$scenario" + done + for scenario in "${combined_reference[@]}"; do + append_target_queue reference "$scenario" + done + for scenario in "${combined_regression[@]}"; do + append_target_queue regression "$scenario" + done +} + +run_targeted_suites() { + local index=0 + local count="${#target_queue_suites[@]}" + local suite scenario + local batch=() + + if [[ "$count" -eq 0 ]]; then + return 0 + fi + + echo + echo "Targeted probe order:" + while [[ "$index" -lt "$count" ]]; do + echo " ${target_queue_suites[$index]}:${target_queue_scenarios[$index]}" + index=$((index + 1)) + done + + index=0 + while [[ "$index" -lt "$count" ]]; do + suite="${target_queue_suites[$index]}" + batch=() + + while [[ "$index" -lt "$count" && "${target_queue_suites[$index]}" == "$suite" && "${#batch[@]}" -lt "$target_batch_size" ]]; do + scenario="${target_queue_scenarios[$index]}" + batch+=("$scenario") + index=$((index + 1)) + done + + run_if_nonempty "$suite" "${batch[@]}" + done +} + +run_remaining_suite() { + local suite="$1" + local scenarios=() + local index=0 + local count + local batch_size=1 + local take remaining + local batch=() + local balanced_batch_size=6 + + mapfile -t scenarios < <(remaining_scenarios "$suite") + count="${#scenarios[@]}" + if [[ "$count" -eq 0 ]]; then + echo + echo "Skipping $suite broad stage: all scenarios already have 100% with-context evidence for this skill fingerprint." + return 0 + fi + + if [[ "$broad_batch_mode" == "suite" || "$count" -eq 1 ]]; then + run_suite "$suite" "${scenarios[@]}" + return + fi + + if [[ "$broad_batch_mode" == "balanced" ]]; then + while [[ "$index" -lt "$count" ]]; do + remaining=$((count - index)) + take="$remaining" + if [[ "$take" -gt "$balanced_batch_size" ]]; then + take="$balanced_batch_size" + fi + + batch=("${scenarios[@]:index:take}") + run_suite "$suite" "${batch[@]}" + index=$((index + take)) + done + return + fi + + while [[ "$index" -lt "$count" ]]; do + remaining=$((count - index)) + take="$batch_size" + if [[ "$remaining" -le $((batch_size * 2)) ]]; then + take="$remaining" + elif [[ "$take" -gt "$remaining" ]]; then + take="$remaining" + fi + + batch=("${scenarios[@]:index:take}") + run_suite "$suite" "${batch[@]}" + index=$((index + take)) + batch_size=$((batch_size * 2)) + done +} + +append_unique_suite() { + local suite="$1" + local existing + + for existing in "${broad_order[@]}"; do + if [[ "$existing" == "$suite" ]]; then + return + fi + done + broad_order+=("$suite") +} + +resolve_broad_order() { + local -n out="$1" + local raw suite + local normalized=() + out=() + + if [[ "$broad_order_arg" != "auto" ]]; then + IFS=',' read -r -a out <<< "$broad_order_arg" + if [[ "${#out[@]}" -ne 3 ]]; then + echo "--broad-order must contain main,reference,regression exactly once." >&2 + exit 2 + fi + else + broad_order=() + # The observed history for this skill is dominated by main/reference misses. Keep the broad + # order aligned with the history-derived expected-cost sweep unless explicitly overridden. + append_unique_suite main + append_unique_suite reference + append_unique_suite regression + out=("${broad_order[@]}") + fi + + local -A seen=() + for raw in "${out[@]}"; do + suite="$(echo "$raw" | xargs)" + case "$suite" in + main|reference|regression) + if [[ -n "${seen[$suite]:-}" ]]; then + echo "--broad-order contains duplicate suite '$suite'." >&2 + exit 2 + fi + seen["$suite"]=1 + normalized+=("$suite") + ;; + *) + echo "--broad-order contains unknown suite '$suite'." >&2 + exit 2 + ;; + esac + done + for suite in main reference regression; do + if [[ -z "${seen[$suite]:-}" ]]; then + echo "--broad-order must include $suite." >&2 + exit 2 + fi + done + out=("${normalized[@]}") +} + +changed_runtime=false +changed_runtime_paths="$( + git diff --name-only "$base_ref"...HEAD -- "$skill_dir" 2>/dev/null || true + git diff --name-only -- "$skill_dir" 2>/dev/null || true + git diff --cached --name-only -- "$skill_dir" 2>/dev/null || true + git ls-files -o --exclude-standard -- "$skill_dir" 2>/dev/null || true +)" +if [[ -n "$changed_runtime_paths" ]]; then + changed_runtime=true +fi + +collect_changed +collect_focus +collect_impact_focus +collect_risk_probes + +mapfile -t changed_main < <(normalize_targets "${changed_main[@]}") +mapfile -t changed_reference < <(normalize_targets "${changed_reference[@]}") +mapfile -t changed_regression < <(normalize_targets "${changed_regression[@]}") +mapfile -t explicit_focus_main < <(normalize_targets "${explicit_focus_main[@]}") +mapfile -t explicit_focus_reference < <(normalize_targets "${explicit_focus_reference[@]}") +mapfile -t explicit_focus_regression < <(normalize_targets "${explicit_focus_regression[@]}") +mapfile -t impact_focus_main < <(normalize_targets "${impact_focus_main[@]}") +mapfile -t impact_focus_reference < <(normalize_targets "${impact_focus_reference[@]}") +mapfile -t impact_focus_regression < <(normalize_targets "${impact_focus_regression[@]}") +mapfile -t risk_focus_main < <(normalize_targets "${risk_focus_main[@]}") +mapfile -t risk_focus_reference < <(normalize_targets "${risk_focus_reference[@]}") +mapfile -t risk_focus_regression < <(normalize_targets "${risk_focus_regression[@]}") + +if [[ "$changed_runtime" == false && "${#changed_main[@]}" -eq 0 && "${#changed_reference[@]}" -eq 0 && "${#changed_regression[@]}" -eq 0 && "${#explicit_focus_main[@]}" -eq 0 && "${#explicit_focus_reference[@]}" -eq 0 && "${#explicit_focus_regression[@]}" -eq 0 && "$run_broad" == false ]]; then + echo + echo "No changed files detected in skill/eval scope." + echo "No targeted eval scope detected for the current diff." + if [[ "$changed_runtime" == true ]]; then + echo "If this was a runtime change, run a scoped smoke first, then rerun this command:" + echo " scripts/pre_submit_gate.sh --focus :" + else + echo "Use --focus : or --run-broad if this change still requires hosted validation." + fi + echo + echo "If this is ready for a final hosted sweep, add --run-broad." + exit 0 +fi + +if [[ "$changed_runtime" == true && "$targeted_only" == false ]]; then + run_broad=true +fi + +runtime_fingerprint="$(python3 scripts/eval_evidence.py fingerprint --skill-dir "$skill_dir")" +if [[ "$reset_evidence" == true ]]; then + rm -f "$evidence_file" +fi + +echo "Running quality gate first (required by process)." +quality_display="tessl review run" +if [[ -n "$review_workspace" ]]; then + quality_display+=" --workspace $review_workspace" +fi +quality_display+=" --threshold 100 $skill_dir/SKILL.md" +run_command "$quality_display" \ + run_quality_review "$skill_dir/SKILL.md" 100 + +combined_main=("${changed_main[@]}" "${explicit_focus_main[@]}" "${impact_focus_main[@]}" "${risk_focus_main[@]}") +combined_reference=("${changed_reference[@]}" "${explicit_focus_reference[@]}" "${impact_focus_reference[@]}" "${risk_focus_reference[@]}") +combined_regression=("${changed_regression[@]}" "${explicit_focus_regression[@]}" "${impact_focus_regression[@]}" "${risk_focus_regression[@]}") +mapfile -t combined_main < <(normalize_targets "${combined_main[@]}") +mapfile -t combined_reference < <(normalize_targets "${combined_reference[@]}") +mapfile -t combined_regression < <(normalize_targets "${combined_regression[@]}") + +echo +echo "Detected targeted eval scope:" +echo " main: ${combined_main[*]:-}" +echo " reference: ${combined_reference[*]:-}" +echo " regression: ${combined_regression[*]:-}" +echo " impact: main=${impact_focus_main[*]:-} reference=${impact_focus_reference[*]:-} regression=${impact_focus_regression[*]:-}" +echo " risk: main=${risk_focus_main[*]:-} reference=${risk_focus_reference[*]:-} regression=${risk_focus_regression[*]:-}" + +if [[ "$use_evidence" == true ]]; then + echo + echo "Using scenario evidence cache: $evidence_file" + echo "Skill bundle fingerprint: $runtime_fingerprint" +fi + +_cleaned=() +for item in "${combined_main[@]}"; do + [[ -n "$item" ]] && _cleaned+=("$item") +done +combined_main=("${_cleaned[@]}") + +_cleaned=() +for item in "${combined_reference[@]}"; do + [[ -n "$item" ]] && _cleaned+=("$item") +done +combined_reference=("${_cleaned[@]}") + +_cleaned=() +for item in "${combined_regression[@]}"; do + [[ -n "$item" ]] && _cleaned+=("$item") +done +combined_regression=("${_cleaned[@]}") +unset _cleaned + +missing_main=() +missing_reference=() +missing_regression=() +mapfile -t missing_main < <(remaining_targeted_scenarios main "${combined_main[@]}") +mapfile -t missing_reference < <(remaining_targeted_scenarios reference "${combined_reference[@]}") +mapfile -t missing_regression < <(remaining_targeted_scenarios regression "${combined_regression[@]}") + +# Scenario evidence includes both the skill bundle fingerprint and the scenario file fingerprint. +# Changed eval definitions are rerun only when they lack current scenario-fingerprint evidence. +combined_main=("${missing_main[@]}") +combined_reference=("${missing_reference[@]}") +combined_regression=("${missing_regression[@]}") +mapfile -t combined_main < <(normalize_targets "${combined_main[@]}") +mapfile -t combined_reference < <(normalize_targets "${combined_reference[@]}") +mapfile -t combined_regression < <(normalize_targets "${combined_regression[@]}") + +echo +echo "Targeted eval scope still missing evidence:" +echo " main: ${combined_main[*]:-}" +echo " reference: ${combined_reference[*]:-}" +echo " regression: ${combined_regression[*]:-}" + +build_target_queue + +if [[ "${#combined_main[@]}" -eq 0 && "${#combined_reference[@]}" -eq 0 && "${#combined_regression[@]}" -eq 0 && "$run_broad" == false ]]; then + echo + echo "No targeted eval scope detected for the current diff." + if [[ "$changed_runtime" == true ]]; then + echo "This is a runtime/reference change and therefore requires full suite evidence before final PR" + echo "submission. Add --auto-continue for automated progression." + else + echo "If this was a runtime change, run a scoped smoke first, then rerun this command:" + echo " scripts/pre_submit_gate.sh --focus :" + echo + echo "If this is ready for a final hosted sweep, add --run-broad." + fi + exit 0 +fi + +run_targeted_suites + +if [[ "$run_broad" == false ]]; then + echo + echo "Plan complete for targeted-only mode. Add --run-broad for final full-suite readiness." + exit 0 +fi + +broad_order=() +resolve_broad_order broad_order + +echo +echo "Broad stage order: ${broad_order[*]}" + +for suite in "${broad_order[@]}"; do + run_remaining_suite "$suite" +done + +echo +echo "All planned hosted checks completed." diff --git a/scripts/run_eval_suite.sh b/scripts/run_eval_suite.sh index 5d0c191..ca03d1d 100755 --- a/scripts/run_eval_suite.sh +++ b/scripts/run_eval_suite.sh @@ -7,16 +7,32 @@ Usage: scripts/run_eval_suite.sh [scenario ...] [-- tessl eval run args...] Runs hosted Tessl evals with the repository's variant policy: - main -> without-context and with-context - reference -> without-context and with-context + main -> baseline control and with-context + reference -> baseline control and with-context regression -> with-context only Examples: scripts/run_eval_suite.sh main -- --label "main check" scripts/run_eval_suite.sh reference 05-parallel-cpu-review -- --label "targeted reference" scripts/run_eval_suite.sh regression -- --label "regression safety" - -Do not pass --variant. This script chooses variants from the suite purpose. + scripts/run_eval_suite.sh main -- --agent claude:claude-sonnet-4-6 --label "representative model check" + +Model-selection note: + Do not pin Sonnet in default commands; the script runs with the current Tessl default solver. + On accounts without model-selection entitlements (including many free plans), passing + `--agent` for a specific model (for example, `claude:claude-sonnet-4-6`) can return + a "Missing required entitlement" error. Prefer default commands for routine checks and + save explicit model pins for accounts where modelSelection is enabled. + If model-selection is available, Sonnet 4.6 or a better model is a good representative check. + See Tessl model-selection and default-model discussions: + - https://docs.tessl.io/changelog + - https://tessl.io/blog/why-were-changing-our-default-eval-model/ + +Do not pass --variant or --skip-baseline. This script chooses variants from the suite purpose. +The default Tessl solver is used unless an explicit --agent is passed after --. +The runner passes --skill java-streams so with-context runs exercise this skill instead of relying on +solver auto-selection for final readiness evidence. It also passes --force so runs after a skill or +runner fix cannot reuse stale hosted solutions. USAGE } @@ -42,15 +58,15 @@ shift case "$suite" in main) source_dir="evals" - variants=(--variant without-context --variant with-context) + variant_label="baseline control + with-context" ;; reference) source_dir="evals-reference" - variants=(--variant without-context --variant with-context) + variant_label="baseline control + with-context" ;; regression) source_dir="evals-regression" - variants=(--variant with-context) + variant_label="with-context only" ;; -h|--help|help) usage @@ -72,8 +88,8 @@ while [[ $# -gt 0 ]]; do extra_args=("$@") break ;; - --variant|--variant=*) - echo "Do not pass --variant; scripts/run_eval_suite.sh chooses variants by suite." >&2 + --variant|--variant=*|--skip-baseline|--skip-baseline=*) + echo "Do not pass --variant or --skip-baseline; scripts/run_eval_suite.sh chooses variants by suite." >&2 exit 2 ;; *) @@ -85,8 +101,8 @@ done for arg in "${extra_args[@]}"; do case "$arg" in - --variant|--variant=*) - echo "Do not pass --variant; scripts/run_eval_suite.sh chooses variants by suite." >&2 + --variant|--variant=*|--skip-baseline|--skip-baseline=*) + echo "Do not pass --variant or --skip-baseline; scripts/run_eval_suite.sh chooses variants by suite." >&2 exit 2 ;; esac @@ -97,36 +113,48 @@ if ! command -v tessl >/dev/null 2>&1; then exit 127 fi +eval_run_help="$(tessl eval run --help 2>&1 || true)" +if grep -q -- "--variant" <<<"$eval_run_help"; then + case "$suite" in + main|reference) + variant_args=(--variant without-context --variant with-context) + ;; + regression) + variant_args=(--variant with-context) + ;; + esac +elif grep -q -- "--skip-baseline" <<<"$eval_run_help"; then + case "$suite" in + main|reference) + variant_args=() + ;; + regression) + variant_args=(--skip-baseline) + ;; + esac +else + echo "Unsupported tessl eval run CLI: expected --variant or --skip-baseline support." >&2 + exit 2 +fi + repo_root="$(git rev-parse --show-toplevel 2>/dev/null || pwd)" source_path="$repo_root/$source_dir" +skill_args=(--skill java-streams) +freshness_args=(--force) if [[ ! -d "$source_path" ]]; then echo "Missing suite directory: $source_path" >&2 exit 1 fi -has_agent=false -for arg in "${extra_args[@]}"; do - case "$arg" in - --agent|--agent=*) - has_agent=true - ;; - esac -done - -agent_args=() -if [[ "$has_agent" == false ]]; then - agent_args=(--agent claude:claude-sonnet-4-6) -fi - if [[ "$suite" == "main" && "${#scenarios[@]}" -eq 0 ]]; then echo "Running main eval suite from the linked plugin path." echo "Scenarios:" print_suite_scenarios "$source_path" - echo "Variants: ${variants[*]}" + echo "Eval mode: $variant_label" ( cd "$repo_root" - tessl eval run "${agent_args[@]}" "${variants[@]}" "${extra_args[@]}" . + tessl eval run "${variant_args[@]}" "${skill_args[@]}" "${freshness_args[@]}" "${extra_args[@]}" . ) exit 0 fi @@ -194,9 +222,9 @@ fi echo "Running $suite eval suite from the linked plugin path with a temporary evals/ staging area." echo "Scenarios:" print_suite_scenarios "$staged_evals" -echo "Variants: ${variants[*]}" +echo "Eval mode: $variant_label" ( cd "$repo_root" - tessl eval run "${agent_args[@]}" "${variants[@]}" "${extra_args[@]}" . + tessl eval run "${variant_args[@]}" "${skill_args[@]}" "${freshness_args[@]}" "${extra_args[@]}" . ) diff --git a/scripts/test_validate_eval_criteria.py b/scripts/test_validate_eval_criteria.py new file mode 100644 index 0000000..4b08154 --- /dev/null +++ b/scripts/test_validate_eval_criteria.py @@ -0,0 +1,295 @@ +#!/usr/bin/env python3 +"""Fixture tests for validate_eval_criteria.py.""" + +from __future__ import annotations + +import json +import subprocess +import sys +import tempfile +import unittest +from pathlib import Path + + +REPO_ROOT = Path(__file__).resolve().parents[1] +VALIDATOR = REPO_ROOT / "scripts" / "validate_eval_criteria.py" + + +def write_runtime_reference(root: Path) -> None: + references = root / "skills" / "java-streams" / "references" + references.mkdir(parents=True) + (references / "stream-examples.md").write_text( + """# Runtime examples + +```java +List overdueNotices(List shipments, Clock clock) { + LocalDate today = LocalDate.now(clock); + return shipments.stream() + .filter(shipment -> isOverdue(shipment, today)) + .map(shipment -> toNotice(shipment, today)) + .toList(); +} + +private static ShipmentNotice toNotice(Shipment shipment, LocalDate today) { + long daysLate = ChronoUnit.DAYS.between(shipment.dueDate(), today); + return new ShipmentNotice(shipment.id(), shipment.customerEmail(), daysLate, + daysLate >= 14 ? "critical" : "late"); +} +``` + +```java +Map> emailsByTrack = conferences.stream() + .flatMap(conference -> conference.sessions().stream()) + .collect(Collectors.groupingBy( + Session::track, + Collectors.flatMapping(SessionReports::optedInEmails, Collectors.toList()))); + +private static Stream optedInEmails(Session session) { + return session.registrations().stream() + .filter(Registration::optedIn) + .map(Registration::email); +} +``` +""", + encoding="utf-8", + ) + + +def write_scenario( + root: Path, + suite: str, + name: str, + task: str, + *, + invocation: str = "natural", + task_type: str = "implementation", + evidence_type: str | None = "ordinary_lift", + rationale: str | None = None, + extra_metadata: dict[str, object] | None = None, + checklist: list[dict[str, object]] | None = None, +) -> Path: + scenario = root / suite / name + scenario.mkdir(parents=True) + (scenario / "task.md").write_text(task, encoding="utf-8") + (scenario / "capability.txt").write_text("java-streams\n", encoding="utf-8") + metadata: dict[str, object] = { + "invocation": invocation, + "task_type": task_type, + } + if evidence_type is not None: + metadata["evidence_type"] = evidence_type + if rationale is not None: + metadata["runtime_reference_overlap_rationale"] = rationale + if extra_metadata: + metadata.update(extra_metadata) + criteria = { + "context": "Fixture scenario.", + "type": "weighted_checklist", + "checklist": checklist + or [ + { + "name": "Creates artifact", + "category": "safety", + "max_score": 5, + "description": "Creates Example.java.", + }, + { + "name": "Preserves behavior", + "category": "safety", + "max_score": 5, + "description": "Returns the requested output.", + }, + { + "name": "Uses stream quality", + "category": "stream_quality", + "max_score": 90, + "description": "Uses clear stream code.", + }, + ], + "metadata": metadata, + } + (scenario / "criteria.json").write_text(json.dumps(criteria, indent=2) + "\n", encoding="utf-8") + return scenario + + +def shipment_task(prefix: str = "Create `OverdueShipmentNotices.java`.") -> str: + return f"""# Implement overdue shipment notices + +{prefix} Assume Java 17. + +Implement: + +```java +List overdueNotices(List shipments, Clock clock) +record Shipment(String id, String customerEmail, LocalDate dueDate, Optional deliveredAt) {{}} +record ShipmentNotice(String id, String customerEmail, long daysLate, String severity) {{}} +``` + +Severity is `"critical"` when `daysLate` is at least 14, otherwise `"late"`. +""" + + +def session_task() -> str: + return """# Implement session roster indexes + +Create `SessionRosterIndexes.java`. Assume Java 17. + +```java +Map> optedInEmailsByTrack(List conferences) +record Conference(List sessions) {} +record Session(String id, String room, String track, int minutes, List registrations) {} +record Registration(String email, boolean optedIn, boolean waitlisted) {} +``` +""" + + +class ValidateEvalCriteriaTests(unittest.TestCase): + def run_validator(self, root: Path, *paths: Path) -> subprocess.CompletedProcess[str]: + return subprocess.run( + [sys.executable, str(VALIDATOR), *(str(path.relative_to(root)) for path in paths)], + cwd=root, + text=True, + capture_output=True, + check=False, + ) + + def with_repo(self): + temp = tempfile.TemporaryDirectory() + root = Path(temp.name) + write_runtime_reference(root) + return temp, root + + def test_main_ordinary_lift_overlap_fails(self) -> None: + temp, root = self.with_repo() + with temp: + scenario = write_scenario(root, "evals", "01-overlap", shipment_task()) + result = self.run_validator(root, scenario) + self.assertNotEqual(result.returncode, 0) + self.assertIn("ordinary_lift is incompatible", result.stderr) + + def test_main_ordinary_lift_overlap_rationale_does_not_bypass(self) -> None: + temp, root = self.with_repo() + with temp: + scenario = write_scenario( + root, + "evals", + "01-overlap", + shipment_task(), + rationale="Focused coverage kept intentionally.", + ) + result = self.run_validator(root, scenario) + self.assertNotEqual(result.returncode, 0) + self.assertIn("ordinary_lift is incompatible", result.stderr) + + def test_reference_ordinary_lift_overlap_fails(self) -> None: + temp, root = self.with_repo() + with temp: + scenario = write_scenario(root, "evals-reference", "28-overdue", shipment_task()) + result = self.run_validator(root, scenario) + self.assertNotEqual(result.returncode, 0) + self.assertIn("ordinary_lift is incompatible", result.stderr) + + def test_reference_focused_overlap_passes_with_rationale(self) -> None: + temp, root = self.with_repo() + with temp: + scenario = write_scenario( + root, + "evals-reference", + "28-overdue", + shipment_task(), + evidence_type="focused_reference", + rationale="Allowed only as focused reference coverage.", + ) + result = self.run_validator(root, scenario) + self.assertEqual(result.returncode, 0, result.stderr) + + def test_regression_overlap_passes_when_classified_as_regression(self) -> None: + temp, root = self.with_repo() + with temp: + scenario = write_scenario( + root, + "evals-regression", + "20-shipment-review", + shipment_task(), + evidence_type="solved_regression", + ) + result = self.run_validator(root, scenario) + self.assertEqual(result.returncode, 0, result.stderr) + + def test_task_names_skill_but_metadata_says_natural_fails(self) -> None: + temp, root = self.with_repo() + with temp: + scenario = write_scenario( + root, + "evals-reference", + "26-explicit", + shipment_task("Use `$java-streams` to create `OverdueShipmentNotices.java`."), + invocation="natural", + evidence_type="focused_reference", + rationale="Allowed only as focused reference coverage.", + ) + result = self.run_validator(root, scenario) + self.assertNotEqual(result.returncode, 0) + self.assertIn("natural scenario explicitly invokes the skill", result.stderr) + + def test_task_names_skill_and_metadata_says_explicit_passes(self) -> None: + temp, root = self.with_repo() + with temp: + scenario = write_scenario( + root, + "evals-reference", + "26-explicit", + shipment_task("Use `$java-streams` to create `OverdueShipmentNotices.java`."), + invocation="explicit", + evidence_type="focused_reference", + rationale="Allowed only as focused reference coverage.", + ) + result = self.run_validator(root, scenario) + self.assertEqual(result.returncode, 0, result.stderr) + + def test_scenario_28_focused_overlap_keeps_80_point_lambda_criterion(self) -> None: + temp, root = self.with_repo() + checklist = [ + { + "name": "Creates artifact", + "category": "safety", + "max_score": 5, + "description": "Creates OverdueShipmentNotices.java.", + }, + { + "name": "Avoids multi-line stream lambdas", + "category": "stream_quality", + "max_score": 80, + "description": "Extracts non-trivial stream lambda bodies into helpers.", + }, + ] + with temp: + scenario = write_scenario( + root, + "evals-reference", + "28-overdue-shipment-notices", + shipment_task(), + evidence_type="focused_reference", + rationale="Allowed only as focused reference coverage.", + checklist=checklist, + ) + result = self.run_validator(root, scenario) + self.assertEqual(result.returncode, 0, result.stderr) + + def test_scenario_15_session_overlap_passes_only_when_focused(self) -> None: + temp, root = self.with_repo() + with temp: + scenario = write_scenario( + root, + "evals-reference", + "15-session-roster-indexes", + session_task(), + evidence_type="focused_reference", + rationale="Allowed only as focused reference coverage.", + ) + result = self.run_validator(root, scenario) + self.assertEqual(result.returncode, 0, result.stderr) + + +if __name__ == "__main__": + raise SystemExit(unittest.main()) diff --git a/scripts/validate_eval_criteria.py b/scripts/validate_eval_criteria.py index 29f9d82..758bdb9 100755 --- a/scripts/validate_eval_criteria.py +++ b/scripts/validate_eval_criteria.py @@ -45,7 +45,14 @@ "redact", ) CRITERION_CATEGORIES = {"safety", "stream_quality", "maintainability"} -EVIDENCE_TYPES = {"ordinary_lift", "solved_regression", "skill_context_dependent"} +EVIDENCE_TYPES = { + "ordinary_lift", + "focused_main", + "focused_reference", + "solved_regression", + "skill_context_dependent", +} +REGRESSION_EVIDENCE_TYPES = {"solved_regression", "skill_context_dependent"} INTERNAL_LABEL_ALLOW_PATTERNS = ( r"\bdo not deduct\b.{0,120}\b(?:hard[- ]stop|checklist|scan|marker|skill)\b", r"\baward full credit\b.{0,120}\b(?:hard[- ]stop|checklist|scan|marker|skill)\b", @@ -77,6 +84,7 @@ "hashmap", "integer", "intstream", + "foreach", "java", "list", "long", @@ -88,11 +96,15 @@ "parallel", "parallelstream", "predicate", + "private", + "public", "record", "runtimeexception", "set", "simpleimmutableentry", + "size", "sorted", + "static", "string", "stream", "streams", @@ -101,6 +113,8 @@ "tolist", "total", "null", + "return", + "this", "unsupportedoperationexception", "void", } @@ -356,10 +370,7 @@ def validate_scenario(scenario: Path, main_eval_root: Path | None) -> list[str]: failures.append( f"{criteria_file}: metadata.evidence_type must be one of {sorted(EVIDENCE_TYPES)}" ) - if scenario.parent.name == "evals-regression" and evidence_type not in { - "solved_regression", - "skill_context_dependent", - }: + if scenario.parent.name == "evals-regression" and evidence_type not in REGRESSION_EVIDENCE_TYPES: failures.append( f"{criteria_file}: regression scenarios must set metadata.evidence_type to " "solved_regression or skill_context_dependent" @@ -374,6 +385,14 @@ def validate_scenario(scenario: Path, main_eval_root: Path | None) -> list[str]: failures.append( f"{criteria_file}: metadata.evidence_type=solved_regression must live in evals-regression" ) + if evidence_type == "focused_reference" and scenario.parent.name != "evals-reference": + failures.append( + f"{criteria_file}: metadata.evidence_type=focused_reference must live in evals-reference" + ) + if evidence_type == "focused_main" and scenario.parent.name != "evals": + failures.append( + f"{criteria_file}: metadata.evidence_type=focused_main must live in evals" + ) if evidence_type == "ordinary_lift" and scenario.parent.name == "evals-regression": failures.append( f"{criteria_file}: metadata.evidence_type=ordinary_lift must live in evals or evals-reference" @@ -452,83 +471,86 @@ def validate_runtime_reference_overlap(dirs: list[Path]) -> list[str]: path.read_text(encoding="utf-8") for path in sorted(references_root.glob("*.md")) ) runtime_identifiers = domain_identifiers(runtime_text) - runtime_words = normalized_words(runtime_text) - runtime_grams = ngrams(runtime_words, 12) + runtime_grams = ngrams(normalized_words(runtime_text), 12) + runtime_api_markers = api_markers(runtime_text) for scenario in dirs: - if scenario.parent.name != "evals": + suite = scenario.parent.name + if suite not in {"evals", "evals-reference", "evals-regression"}: continue task_file = scenario / "task.md" + criteria_file = scenario / "criteria.json" if not task_file.exists(): continue task_text = task_file.read_text(encoding="utf-8") - if is_skill_context_dependent_text(f"{scenario.name}\n{task_text}"): - continue + metadata: dict[str, Any] = {} + if criteria_file.exists(): + data, _ = load_json(criteria_file) + if isinstance(data, dict) and isinstance(data.get("metadata"), dict): + metadata = data["metadata"] task_identifiers = domain_identifiers(task_text) shared_identifiers = sorted(task_identifiers & runtime_identifiers) task_words = normalized_words(task_text) task_grams = ngrams(task_words, 12) long_overlap_count = len(task_grams & runtime_grams) + shared_api_markers = sorted(api_markers(task_text) & runtime_api_markers) + has_distinctive_api_overlap = bool(shared_api_markers) and len(shared_identifiers) >= 2 - if len(shared_identifiers) >= 4 or (len(shared_identifiers) >= 3 and long_overlap_count): - failures.append( - f"{scenario}: task.md overlaps runtime references too closely; shared identifiers: " - f"{', '.join(shared_identifiers[:12])}" - ) - return failures - - -def runtime_reference_overlap_warnings(dirs: list[Path]) -> list[str]: - warnings: list[str] = [] - references_root = Path("skills/java-streams/references") - if not references_root.exists(): - return warnings - - runtime_text = "\n".join( - path.read_text(encoding="utf-8") for path in sorted(references_root.glob("*.md")) - ) - runtime_identifiers = domain_identifiers(runtime_text) - runtime_api_markers = api_markers(runtime_text) - runtime_words = normalized_words(runtime_text) - runtime_grams = ngrams(runtime_words, 12) - - for scenario in dirs: - if scenario.parent.name != "evals": - continue - task_file = scenario / "task.md" - criteria_file = scenario / "criteria.json" - if not task_file.exists() or not criteria_file.exists(): - continue - task_text = task_file.read_text(encoding="utf-8") - criteria_text = criteria_file.read_text(encoding="utf-8") - combined_text = f"{task_text}\n{criteria_text}" - if is_skill_context_dependent_text(f"{scenario.name}\n{combined_text}"): + has_overlap = ( + len(shared_identifiers) >= 4 + or (len(shared_identifiers) >= 3 and long_overlap_count) + or has_distinctive_api_overlap + ) + if not has_overlap: continue - shared_identifiers = sorted(domain_identifiers(combined_text) & runtime_identifiers) - shared_api_markers = sorted(api_markers(combined_text) & runtime_api_markers) - combined_grams = ngrams(normalized_words(combined_text), 12) - long_overlap_count = len(combined_grams & runtime_grams) - repeated_api_shape = ( - "blocking" in combined_text.lower() - and "bounded" in combined_text.lower() - and "Gatherers.mapConcurrent" in shared_api_markers - and "Map.entry" in shared_api_markers + evidence_type = metadata.get("evidence_type") + rationale = metadata.get("runtime_reference_overlap_rationale") + has_overlap_rationale = isinstance(rationale, str) and bool(rationale.strip()) + overlap_details = ( + f"shared identifiers: {', '.join(shared_identifiers[:12]) or '(none)'}; " + f"shared API markers: {', '.join(shared_api_markers[:12]) or '(none)'}" ) - if ( - len(shared_identifiers) >= 4 - or (len(shared_identifiers) >= 3 and long_overlap_count) - or repeated_api_shape - ): - warnings.append( - f"{scenario}: task.md plus criteria.json are close to runtime references; " - f"document a focused-coverage rationale if intentional. Shared identifiers: " - f"{', '.join(shared_identifiers[:12]) or '(none)'}; shared API markers: " - f"{', '.join(shared_api_markers[:12]) or '(none)'}" + + if evidence_type == "ordinary_lift": + failures.append( + f"{scenario}: metadata.evidence_type=ordinary_lift is incompatible with " + f"runtime-reference overlap ({overlap_details}); de-domain the task/reference or " + "reclassify the evidence so it is not reported as ordinary broad lift" ) + continue - return warnings + if suite == "evals": + if evidence_type != "focused_main": + failures.append( + f"{scenario}: main eval task overlaps runtime references ({overlap_details}); " + "de-domain the task/reference or set metadata.evidence_type=focused_main so " + "it is not reported as ordinary broad natural lift" + ) + elif not has_overlap_rationale: + failures.append( + f"{criteria_file}: focused main eval with runtime-reference overlap must set " + "metadata.runtime_reference_overlap_rationale" + ) + elif suite == "evals-reference": + if evidence_type != "focused_reference": + failures.append( + f"{scenario}: reference task overlaps runtime references ({overlap_details}); " + "de-domain the task/reference or set metadata.evidence_type=focused_reference " + "so it is not reported as ordinary broad lift" + ) + elif not has_overlap_rationale: + failures.append( + f"{criteria_file}: focused reference eval with runtime-reference overlap must set " + "metadata.runtime_reference_overlap_rationale" + ) + elif suite == "evals-regression" and evidence_type not in REGRESSION_EVIDENCE_TYPES: + failures.append( + f"{scenario}: regression task overlaps runtime references ({overlap_details}); " + "classify it as solved_regression or skill_context_dependent regression evidence" + ) + return failures def validate_scenario_path_references() -> list[str]: @@ -683,14 +705,10 @@ def main() -> int: failures.extend(validate_numbering(path)) failures.extend(validate_cross_suite_duplicates(dirs)) failures.extend(validate_runtime_reference_overlap(dirs)) - warnings = runtime_reference_overlap_warnings(dirs) failures.extend(validate_scenario_path_references()) failures.extend(validate_runtime_references()) failures.extend(validate_agent_docs_self_contained()) - for warning in warnings: - print(f"warning: {warning}", file=sys.stderr) - if failures: for failure in failures: print(f"error: {failure}", file=sys.stderr) diff --git a/scripts/validate_publish_ready.sh b/scripts/validate_publish_ready.sh index 3b09503..e0e4f1f 100755 --- a/scripts/validate_publish_ready.sh +++ b/scripts/validate_publish_ready.sh @@ -6,5 +6,5 @@ python3 scripts/validate_eval_criteria.py evals evals-reference evals-regression python3 -m py_compile scripts/*.py bash -n scripts/*.sh tessl plugin lint . -tessl skill review --threshold 100 skills/java-streams/SKILL.md +tessl review run --threshold 100 skills/java-streams/SKILL.md tessl plugin publish --dry-run . diff --git a/skills/java-streams/SKILL.md b/skills/java-streams/SKILL.md index 59252c9..a655f7f 100644 --- a/skills/java-streams/SKILL.md +++ b/skills/java-streams/SKILL.md @@ -1,14 +1,17 @@ --- name: java-streams license: MIT -description: Write, review, and refactor Java Stream and Collector code using best practices, improving readability and performance while avoiding common stream antipatterns such as materializing just to inspect, sorting before min/max, counting for existence, nested stream collections, unsafe null sorting, and careless findFirst/findAny or parallelStream changes. Use whenever writing, reviewing, or refactoring Java code that uses streams, collectors, primitive streams, Optional-producing stream terminal operations, map/flatMap/mapMulti, grouping, joining, distinct, sorted, limit, takeWhile/dropWhile, teeing, partitioningBy, summarizing, or parallel stream behavior. +description: Review Java stream performance advice, especially slow stream mappings, external collection mutation with forEach/add, and whether parallelStream is safe; clean up mutation and write or refactor Java Stream and Collector code. Avoid common stream antipatterns such as materializing just to inspect, sorting before min/max, counting for existence, nested stream collections, unsafe null sorting, multi-line lambdas, and careless findFirst/findAny changes. Use whenever writing, reviewing, or refactoring Java code that uses Java streams, collectors, stream pipelines, grouping, joining strings, first/any element lookup, sorting, limiting, distinct values, primitive totals, Optional values in streams, or parallel streams, including review prompts asking whether a lookup should use findFirst or findAny. --- # Java Streams Skill -Preserve behavior, requested public API, artifact structure, nested vs top-level type placement, -encounter order, exceptions, null handling, side effects, mutability, and Java-version -compatibility. +Preserve requested behavior, public API/artifact shape, encounter order, exceptions, null handling, +side effects, mutability, and Java-version compatibility. For implementation prompts, write the +requested Java source file before explaining. Keep provided helper/record/service types in that file, +and keep them nested when requested; do not create sibling source files, package-private top-level +replacements, hooks, test seams, delegate fields, overloads, caches, retries, or adapters unless +asked. ## Reference Bundle @@ -18,18 +21,14 @@ compatibility. | [stream-examples.md](references/stream-examples.md) | Worked before/after examples from the reference set | | [java-stream-api.md](references/java-stream-api.md) | Java-version compatibility for stream and collector APIs | -## Hard Stops - -Before finalizing touched stream flow, run the scan and apply the replacement rules in -[hard-stops.md](references/hard-stops.md). -Filtered-list first-element access has first-match semantics; use `findFirst()` unless all matches -are equivalent. - ## Core Workflow -0. Check the Java baseline before choosing APIs. Read build/toolchain docs; if unclear, use Java - 8-compatible code or state the assumption. Use [java-stream-api.md](references/java-stream-api.md) - for minimum Java versions. +When the prompt asks for a named artifact such as `review.md` or a Java source file, create that +exact file. Do not answer only in chat when a file artifact is requested. + +0. Check the Java baseline first. Use [java-stream-api.md](references/java-stream-api.md) for + minimum versions and fallbacks; do not emulate unavailable APIs with stateful or multi-line + lambdas. 1. Identify the requested result and pick the matching terminal or collector: | Goal | Preferred API | @@ -43,51 +42,71 @@ are equivalent. | Two aggregates over same input (Java 12+) | `Collectors.teeing` | | Grouping/indexing | `groupingBy`, `partitioningBy`, or `toMap` with merge/null handling | -2. Prefer terminal operations that encode intent directly: `anyMatch` for existence, `count` - for numeric counts, `joining` for text, `min`/`max` for a single extreme, `teeing` for a Java - 12+ min/max pair over the same input, and primitive stream terminal operations for primitive - totals. + Find-first rule: keep `findFirst()` when code takes element `0`, sorted order, or encounter order + chooses the winner. In these reviews, include the exception before performance claims: `findAny` + fits only if all matching values are equivalent and order does not choose the winner. Keep + `sorted(...).filter(...).findFirst()` when it defines the selected value; suggest `min`/`max` only + when it preserves that winner. + +2. Use intent-encoding terminals: `anyMatch`, `count`, `joining`, `min`/`max`, Java 12+ + `teeing`, and primitive terminals. Do not mutate external containers, arrays, counters, or + builders from `forEach`; let the stream produce the result directly. + + - Implementation prompts: for one-to-one transformations, write the direct stream result unless + mutable output or the Java baseline says otherwise; on Java 16+, prefer + `names.stream().map(String::toUpperCase).toList()` over a manual `ArrayList` loop. + - External mutation/performance reviews: show a sequential result-producing snippet first. For + million-item CPU maps, mention benchmarking a pure parallel variant and include: + "`parallelStream()` can be slower for small lists or call paths that are usually small." + - Parallel reviews: apply [hard-stops.md](references/hard-stops.md); no custom pool snippets + unless asked. + - Java 24 bounded blocking calls: use `Gatherers.mapConcurrent(limit, item -> carrier(item, + stub(item)))`, then filter/map/sort; no test hooks, overloads, `CompletableFuture` fan-out, or + null sentinels. ```java - // Before: counts all matches just to test existence - boolean hasLateOrders = orders.stream().filter(Order::late).count() > 0; - // After: short-circuits when the first match is found - boolean hasLateOrders = orders.stream().anyMatch(Order::late); + import java.util.List; + + final class OrderChecks { + boolean hasOverdue(List orders) { + return orders.stream().anyMatch(Order::isOverdue); + } + + record Order(boolean overdue) { + boolean isOverdue() { + return overdue; + } + } + } ``` +3. Flatten nested sources deliberately. Use `flatMap`, `flatMap(Optional::stream)` on Java 9+, + and `mapMulti` on Java 16+ when clearer. Use helpers when nested lambdas would wrap. For subtype + primitives, filter/cast first, then call `mapToInt`/`mapToLong`/`mapToDouble` directly. For + nested collector callbacks, extract a named `Stream` helper. - External mutation: do not create an `ArrayList`, `HashMap`, array, counter, holder object, or - `StringBuilder` and mutate it from `forEach`; let the stream produce the result directly. - Performance review: say direct collection is the safe baseline, not a guaranteed throughput win; - for large CPU-bound mapping, prominently recommend benchmarking a pure parallel version and warn - that mostly-small inputs can be slower. - -3. Flatten nested sources deliberately. Use `flatMap` for nested collections and - `flatMap(Optional::stream)` on Java 9+. On Java 16+, prefer `mapMulti` for small conditional - reference-value emission. For primitive values from a subtype, filter/cast first then - `mapToInt`/`mapToLong`/`mapToDouble` directly; do not box and immediately unbox. - -4. Use primitive streams for primitive aggregation. Use `reduce(identity, op)` for immutable - non-primitive accumulation such as `BigDecimal`. -5. Choose collectors by result semantics. For `toMap`, specify duplicate-key merge behavior; for - `groupingBy`, prove classifier keys are non-null or handle nulls first; for boolean splits, use - `partitioningBy`; when a later step needs an expensive result, carry `element + result`, never - null sentinels. - -6. Preserve ordering, mutability, and short-circuit behavior: - - **Top-N**: sort before `limit`. - - **Nullable sort keys**: filter nulls or use `Comparator.nullsFirst`/`nullsLast`. - - **Mutable results**: use `Collectors.toList()` or `Collectors.toCollection(ArrayList::new)`, not `Stream.toList()`. - - **Short-circuit**: omit stateful intermediate ops (e.g., `sorted`) before `findFirst`/`anyMatch` when order is irrelevant. - - **Lambda purity**: mapping/filtering lambdas should not mutate state visible outside the - lambda and should not depend on outside state that can change during the stream operation. -7. Keep imperative code when it is the clearer boundary. Prefer a loop for stateful sequence - output, checked IO, mutation-heavy logic, or complex early exits. Where a loop remains, use - stream helpers for real lookups or aggregates when that improves clarity. -8. Verify each changed branch: empty inputs, one element, duplicates, nulls, ordering, - parallel-safety, and Java-baseline compatibility. Run the marker scan from - [hard-stops.md](references/hard-stops.md); fix relevant hits and re-scan. - -Short reviews: decision first, direct stream issues only, one safer stream chain if useful. Avoid -internal workflow labels such as "hard stop", "marker", "scan", "checklist", or skill names in -user-facing output unless the task explicitly asks for that workflow. Omit scan details and -unchanged-code critiques unless asked. + ```java + // Java 9+: flatten Optional values instead of filter(Optional::isPresent).map(Optional::get) + optionals.stream().flatMap(Optional::stream).collect(Collectors.toList()); + ``` +4. Choose accumulation/collectors by result semantics: use `reduce(identity, op)` for immutable + non-primitives, `toMap` with merge behavior and deliberate null-key/value handling, non-null keys + for `groupingBy`, `partitioningBy` for boolean splits, and flattened nested indexes when clearer. + Extract duplicate-key merge rules into named helpers when ties, nulls, or ordering need more than + a same-line expression. Carry `element + result`, never null sentinels. +5. Preserve ordering, mutability, short-circuit behavior, and lambda readability. Keep stream lambdas + as short glue or method references; use named helpers for branching, merge logic, or nested stream + work. +6. Keep imperative code when it is the clearer boundary for stateful output, checked IO, + mutation-heavy logic, or complex early exits. +7. Verify changed branches for empty inputs, one element, duplicates, nulls, ordering, + parallel-safety, and baseline compatibility. Run the marker scan from + [hard-stops.md](references/hard-stops.md), fix hits, and re-scan. In scan audits, keep + hard-stop severities: required hits stay required unless explicitly acceptable. + +Review output: + +- Give a direct behavior-preserving decision plus one safe snippet. +- Create `review.md` when requested, even for rejection-only reviews. +- Explain code behavior, not internal workflow. +- Avoid internal workflow labels such as "per the skill", "hard stop", "marker", "scan", + "checklist", "rubric", or "criteria" unless the user asks about the workflow itself. diff --git a/skills/java-streams/references/hard-stops.md b/skills/java-streams/references/hard-stops.md index 8a354fe..e213847 100644 --- a/skills/java-streams/references/hard-stops.md +++ b/skills/java-streams/references/hard-stops.md @@ -12,7 +12,9 @@ Fix these before finalizing: to decide existence. Use `anyMatch`, `noneMatch`, `allMatch`, `findAny`, or `findFirst`. - A temporary filtered list followed by `get(0)`, `getFirst()`, or equivalent first-element access. Use `findFirst` to preserve encounter-order behavior unless the domain explicitly says all matches - are equivalent. + are equivalent and encounter order does not decide the result. In reviews that compare + `findFirst` and `findAny`, name that `findAny` exception explicitly before discussing parallelism + or performance. - `filter(...).count() > 0` for existence. Use `anyMatch`. - Plain `count()` is appropriate when the requested result is a numeric count; do not replace it with `anyMatch`. @@ -33,17 +35,36 @@ Fix these before finalizing: - `filter(Optional::isPresent).map(Optional::get)` on Java 9+. Use `flatMap(Optional::stream)`. - `toMap` without a merge function when duplicate keys are possible. - `groupingBy` where null classifier keys can reach the collector. Treat this as a required fix, - not a conditional caveat, unless the code already proves non-null before the collector. Also fix - `toMap` where null keys or values would change the existing null-handling contract. Default - `toMap` can preserve one null key in a `HashMap` result, but it rejects null values. + not a conditional caveat, unless the code already proves non-null before the collector. In scan + audits, use unambiguous wording such as "requires fix"; do not downgrade nullable `groupingBy` to + "requires review". +- Also fix `toMap` where null keys or values would change the existing null-handling contract. Do + not say `toMap` throws `NullPointerException` just because a key is null: the default `HashMap` + result may keep one null key, duplicate null keys fail as duplicate keys, and null values are + rejected. If the original loop skipped null keys, filter them before `toMap` so the behavior is + deliberate rather than accidental. In scan audits, classify marker hits against the task's domain notes. If the task states an invariant that makes a marker acceptable, such as globally unique `toMap` keys or non-null `groupingBy` classifiers, list it as acceptable with that invariant. Do not turn a proven invariant into a required fix unless the task asks for defensive hardening. - `sorted()` or `Comparator.naturalOrder()` where null elements or keys can reach the comparator. +- `sorted().distinct()` when the same result can be produced as `distinct().sorted()`. In scan + audits, classify this as a required ordering/efficiency fix rather than an acceptable minor note, + unless the code needs sort-before-de-duplication semantics. - `Stream.toList()` where a mutable result is required or later code mutates the list. Prefer a mutable collector; do not modernize this to `new ArrayList<>(stream.toList())` when the task or surrounding code says `Stream.toList()` is not valid. +- Multi-line stream lambdas: block lambdas (`-> { ... }`), arrows whose body starts on the next + line, lambdas that rely on more than one helperless boolean check, or lambdas where the nested + stream body continues on later lines (for example `item -> item.children().stream()` followed by + `.filter(...)`). Keep lambdas as glue: prefer method references, concise one-expression lambdas + whose body stays on the same line as `->`, or extracted named helpers. For mapping, predicates, + severity/status selection, and record construction, extract helper methods when the lambda does + non-trivial work instead of fitting it into a wrapped lambda. + +In reviews, avoid examples that put substantial work behind one lambda body, including ad-hoc helper +expressions like `-> input -> ...` or callbacks passed into custom parallel executors. If a non-trivial +decision is needed, extract to a named method and keep the stream chain as glue. - `stream().forEach(...)` or `parallelStream().forEach(...)` that mutates an external `Collection`, `Map`, array, counter, holder object, or `StringBuilder`. Make the stream produce the result directly with `toList()`, `collect(...)`, `toMap(...)`, `joining`, `sum`, or another @@ -60,12 +81,20 @@ Fix these before finalizing: - Java-version drift: `toList`, `mapMulti`, `teeing`, `takeWhile`, `dropWhile`, `Optional.stream`, `Collectors.flatMapping`, `Stream.ofNullable`, or gatherers used below their minimum Java version. For a version-drift audit, report these unavailable APIs and explicitly allowed markers only; do - not add unrelated `groupingBy` null-key or collector-safety caveats. + not add unrelated modernization suggestions, import cleanup, `groupingBy` null-key, or + collector-safety caveats. + When giving below-baseline replacement code, do not emulate the missing API with multi-line or + stateful block lambdas. Prefer a named helper or a clear loop for Java 8 replacements such as + `takeWhile`/`dropWhile` prefixes, downstream flat-mapping, or paired min/max aggregation. + For downstream flat-mapping, do not show `flatMap(c -> c.items().stream()` with a nested + `.map(...)`/`.filter(...)` chain continuing on later lines; extract that nested stream to a named + helper and call it with a method reference. + Java 8 replacement snippets must not use Java 9+ helpers such as `Map.entry`. When one stream chain contains multiple unavailable APIs, list each unavailable API separately. Example: `flatMap(Optional::stream).toList()` on a Java 8 baseline has two version-drift hits: `Optional::stream` requires Java 9 and `Stream.toList()` requires Java 16. - Missing imports for stream APIs introduced by the rewrite, such as `Comparator`, `Map`, - `Collectors`, or `Gatherers`. + `Collectors`, `Function`, `BinaryOperator`, or `Gatherers`. ## Ordering Rules @@ -73,10 +102,17 @@ Fix these before finalizing: or user-visible order matters. - For numeric priority sorted with `Comparator.comparing(...priority...)`, describe the contract precisely, for example "lowest priority number wins" when natural ascending order is used. -- Use `findAny()` only when all matches are equivalent. It is often fine after filtering a set of - equivalent flags, IDs, or permissions. -- `distinct().sorted()` is usually better than `sorted().distinct()` when duplicates can be removed - before sorting. +- When reviewing a proposed replacement of ordered selection with `findAny()` or `parallelStream()`, + recommend keeping the existing ordered `sorted(...).filter(...).findFirst()` chain first. Mention + `min(...)`/`max(...)` only as an optional refactor when the task asks for optimization advice. +- Use `findAny()` only when the domain explicitly says all matching domain values, such as + primary/default/preferred values, are equivalent and encounter order does not matter. Use + `findFirst()` for first configured, first listed, chronological, priority, fallback, or + user-visible results. In reviews, name the code's noun in the exception, e.g. "all matching + primary addresses are equivalent." Do not present `findAny()` as a performance shortcut unless + that semantic precondition is already satisfied. +- `distinct().sorted()` is the required rewrite for `sorted().distinct()` when duplicates can be + removed before sorting and no sort-before-de-duplication semantics are required. - `limit(n)` must come after sorting when computing top-N by an ordering. It may come before an expensive map/filter only when that preserves semantics. - `takeWhile` and `dropWhile` are prefix operations. They are not replacements for `filter`. @@ -89,10 +125,12 @@ Use parallel streams only after checking: 2. Operations are stateless and non-interfering. 3. Encounter order is not required, or the ordered stream terminal operation is still worth the cost. 4. The stream chain does not perform blocking IO or remote calls. For Java 24+ blocking per-element - calls, consider `Gatherers.mapConcurrent` only when the baseline supports it and virtual-thread - concurrency is the intended design. Preserve element/result association explicitly with - a baseline-compatible holder rather than null sentinels or side maps. For remote calls, call out - the concurrency limit, timeout handling for slow calls, and error propagation/retry policy. + calls with a requested concurrency limit, use `Gatherers.mapConcurrent` with that bound when the + baseline supports it and virtual-thread concurrency is the intended design. Preserve + element/result association explicitly with a baseline-compatible holder rather than null sentinels + or side maps. Do not replace bounded stream concurrency with unbounded `CompletableFuture` + fan-out. For remote calls, call out the concurrency limit, timeout handling for slow calls, and + error propagation/retry policy. 5. The stream terminal operation or collector is safe under parallel execution. For acceptable CPU-heavy parallel streams, state that the benefit should be measured or benchmarked @@ -100,14 +138,27 @@ because fork-join splitting, merging, and common-pool contention can outweigh th whose main problem is external mutation such as `stream().map(...).forEach(result::add)`, recommend the direct collector/toList form as the correctness/readability baseline. Do not claim that direct collection is guaranteed faster, and do not say `parallelStream()` will be faster merely because the -input is large. In reviews, show the sequential direct-collection fix before any parallel version. +input is large. Explain the rewrite in terms of ownership, correctness, and readability; treat +low-level allocation details as secondary unless measurements make them relevant. Never describe +ordinary `ArrayList` growth as `O(N^2)`; resizing is amortized O(N) total. In reviews, show the +sequential direct-collection fix before any parallel version. +When the workload is not clearly CPU-bound or the input is expected to be small/tiny, explicitly state +that parallelization is not justified here. For large CPU-bound transformations, strongly recommend benchmarking a pure parallel version after the stream chain is side-effect-free; make the benchmark requirement visible next to that -recommendation, and call out that small-list or mostly-small call paths can be slower. +recommendation and before any parallel snippet, and call out that small-list or mostly-small call +paths can be slower. +For external-mutation performance reviews, keep the main snippet sequential and do not add unrelated +style notes about logging, printing, imports, naming, or nearby code. +Do not say the speedup is expected, significant, core-count-scaled, or otherwise likely until +measurements support it. For simple cache/index construction, filtering, or map population, explicitly say when there is no CPU-heavy stateless work to justify `parallelStream()`. Do not suggest `toConcurrentMap` or another parallel collector as the main fix unless the task provides measured need or genuinely CPU-heavy -per-element work. Prefer sequential collector-owned accumulation. +per-element work. Prefer sequential collector-owned accumulation. In shared-map/list mutation +reviews, keep the decision scoped to the race/corruption risk, lack of CPU-heavy work or +measurement, and the collector-owned replacement; do not add encounter-order warnings unless the +task's behavior depends on encounter order. ## Scan Command @@ -122,7 +173,7 @@ multiline mode so it catches normally formatted fluent chains. Some markers are classify legitimate uses instead of deleting them mechanically. ```bash -rg -nUP "count\\(\\)\\s*>\\s*0|collect\\([^;]+\\)\\s*\\.\\s*(?:isEmpty|size|getFirst)\\(|collect\\([^;]+\\)\\s*\\.\\s*get\\(\\s*0\\s*\\)|sorted\\([^;]*\\)\\s*\\.\\s*findFirst\\(|sorted\\(\\)\\s*\\.\\s*findFirst\\(|limit\\([^;]+\\)\\s*\\.\\s*sorted\\(|sorted\\([^;]*\\)\\s*\\.\\s*distinct\\(|sorted\\(\\)\\s*\\.\\s*distinct\\(|String\\.join\\(|filter\\(Optional::isPresent\\)\\s*\\.\\s*map\\(Optional::get\\)|parallelStream\\(|\\.parallel\\(|\\.forEach\\(|Collectors\\.toMap\\(|Collectors\\.groupingBy\\(|Comparator\\.naturalOrder\\(\\)|(? +rg -nUP "count\\(\\)\\s*>\\s*0|collect\\([^;]+\\)\\s*\\.\\s*(?:isEmpty|size|getFirst)\\(|collect\\([^;]+\\)\\s*\\.\\s*get\\(\\s*0\\s*\\)|sorted\\([^;]*\\)\\s*\\.\\s*findFirst\\(|sorted\\(\\)\\s*\\.\\s*findFirst\\(|limit\\([^;]+\\)\\s*\\.\\s*sorted\\(|sorted\\([^;]*\\)\\s*\\.\\s*distinct\\(|sorted\\(\\)\\s*\\.\\s*distinct\\(|String\\.join\\(|filter\\(Optional::isPresent\\)\\s*\\.\\s*map\\(Optional::get\\)|parallelStream\\(|\\.parallel\\(|\\.forEach\\(|Collectors\\.toMap\\(|Collectors\\.groupingBy\\(|Comparator\\.naturalOrder\\(\\)|(?\\s*\\{|->\\s*$|->[^\\n]*\\.stream\\(\\)\\s*$" ``` For each hit, decide whether it is legitimate for the project Java baseline and behavior. Fix @@ -132,8 +183,8 @@ numeric result rather than a `count() > 0` existence check, and state that plain hit for the bundled scan regex. In ordinary code reviews, do not expose internal workflow labels such as "hard stop", "marker", -"scan", or "skill checklist" in the final user-facing recommendation. Use those terms only when the -task explicitly asks for a scan/workflow audit or exact skill-provided command. +"scan", or "skill checklist" in headings, rationale, or recommendations. Use those terms only when +the task explicitly asks for a scan/workflow audit or exact skill-provided command. When the requested audit is specifically about Java-version drift, keep the report scoped to APIs that are unavailable for the stated baseline and to explicitly allowed markers. Do not add unrelated diff --git a/skills/java-streams/references/java-stream-api.md b/skills/java-streams/references/java-stream-api.md index 32a0f6c..a4e5b1f 100644 --- a/skills/java-streams/references/java-stream-api.md +++ b/skills/java-streams/references/java-stream-api.md @@ -33,7 +33,7 @@ If the baseline is unclear, prefer Java 8-compatible stream code or state the as | --- | ---: | --- | | `Collectors.toList`, `toSet` | 8 | `toList` mutability is unspecified; use explicit collection if required. | | `Collectors.joining` | 8 | Join mapped text in one stream terminal operation. | -| `Collectors.toMap` | 8 | Provide a merge function when duplicate keys are possible. Default map results may preserve a null key, but null values are rejected; preserve the existing null contract explicitly. | +| `Collectors.toMap` | 8 | Provide a merge function when duplicate keys are possible. Default map results may preserve one null key in the resulting `HashMap`, duplicate null keys fail as duplicate keys, and null values are rejected; preserve the existing null contract explicitly. | | `Collectors.groupingBy` | 8 | Key maps to a list or downstream aggregate. Null classifier keys are not accepted. | | `Collectors.mapping` | 8 | Project values inside downstream collectors. | | `Collectors.counting` | 8 | Count elements, often downstream of `groupingBy`. | @@ -59,5 +59,61 @@ If the baseline is unclear, prefer Java 8-compatible stream code or state the as Natural sorting and collectors are null-sensitive. If a stream may contain null elements, null values, or null grouping keys, filter them out or use explicit null handling before sorting or -collecting. For `toMap`, preserve existing null-key behavior deliberately instead of filtering or -retaining null keys by accident. +collecting. For `toMap`, do not claim null keys alone cause `NullPointerException`; preserve +existing null-key behavior deliberately instead of filtering or retaining null keys by accident. + +## Java 8 Fallback Guidance + +Use this section only when the project baseline is Java 8 or the task asks for Java 8-compatible +replacement code. Identify APIs that are unavailable and give replacements that preserve behavior +without introducing multi-line lambdas: + +For Java-version drift reviews, keep the audit scoped to unavailable APIs and explicitly allowed +Java 8 stream usage. Do not add unrelated modernization advice such as `Collections.emptyList()` or +general cleanup notes unless the task asks for a broader review. + +- `takeWhile` / `dropWhile`: these are ordered prefix operations with no direct Java 8 stream + equivalent. Prefer a small loop helper that preserves prefix semantics. Do not show stateful + `filter(x -> { ... })` emulations. +- `Collectors.flatMapping`: when empty downstream groups matter, use a loop or helper that creates + the group before adding nested values. Pre-flatten with `flatMap(Type::helper)` before + `groupingBy` only when the contract intentionally omits groups whose nested stream is empty. Do + not put a nested stream chain inside a collector lambda or `flatMap(c -> c.items().stream()` + snippet whose nested chain continues on later lines. If you show pre-flattening, use a named + helper and a Java 8-compatible holder such as `AbstractMap.SimpleImmutableEntry`, not `Map.entry`; + implement the helper as a loop or other concise method body, not as a multi-line nested stream. +- `Collectors.teeing`: use two clear stream passes, a simple loop, or named helper aggregation. Do + not replace it with a complex `reducing` collector whose merge lambda spans multiple lines. +- `mapMulti`: use `map`, `flatMap`, or a helper method for one-to-few emission on Java 8. +- `Stream.toList()`: use `collect(Collectors.toList())`, and choose + `Collectors.toCollection(ArrayList::new)` when mutability is required. + +Example Java 8-compatible downstream flattening that preserves empty groups: + +```java +Map> namesByRegion = new HashMap<>(); +for (Customer customer : customers) { + List aliases = namesByRegion.computeIfAbsent(customer.region(), key -> new ArrayList<>()); + aliases.addAll(customer.aliases()); +} +``` + +Example Java 8-compatible pre-flattening when empty nested groups may be omitted: + +```java +Map> namesByRegion = customers.stream() + .flatMap(this::regionAliases) + .collect(Collectors.groupingBy( + AbstractMap.SimpleImmutableEntry::getKey, + Collectors.mapping( + AbstractMap.SimpleImmutableEntry::getValue, + Collectors.toList()))); + +private Stream> regionAliases(Customer customer) { + List> aliases = new ArrayList<>(); + for (String alias : customer.aliases()) { + aliases.add(new AbstractMap.SimpleImmutableEntry<>(customer.region(), alias)); + } + return aliases.stream(); +} +``` diff --git a/skills/java-streams/references/stream-examples.md b/skills/java-streams/references/stream-examples.md index 36bd13b..d94649c 100644 --- a/skills/java-streams/references/stream-examples.md +++ b/skills/java-streams/references/stream-examples.md @@ -6,15 +6,24 @@ using APIs from [java-stream-api.md](java-stream-api.md). ## Direct Terminals -- Primary address when order does not matter: +- Arbitrary match only when the domain explicitly says all matching domain values, such as + primary/default/preferred values, are equivalent and encounter order does not matter: ```java - Address primaryAddress = account.getAddresses().stream() - .filter(Address::isPrimary) + Address selectedAddress = account.getAddresses().stream() + .filter(Address::isEligible) .findAny() .orElse(null); ``` +When a review asks whether to use `findFirst` or `findAny`, answer the semantic question first: +`findAny` is valid only if the domain explicitly says all matching domain values, such as +`primary`, `default`, or `preferred` values, are equivalent and encounter order does not select the +winner. For those names, write the exception with the code's noun, such as "all matching primary +addresses are equivalent." Do not infer equivalence from the name; if existing code takes element +`0`, preserve encounter order with `findFirst()`. Do not lead with performance or parallel-stream +wording for `findAny`. + - First fallback when order matters: ```java @@ -105,6 +114,10 @@ List discountCodes = orders.stream() Pitfall: natural sorting throws if null reaches the comparator. Filter nulls first or use `Comparator.nullsFirst(...)` / `Comparator.nullsLast(...)`. +Pitfall: when de-duplication and sorting are both required, run `distinct()` before `sorted()` +unless the code depends on sorting before removing duplicates. In scan audits, treat +`sorted().distinct()` as a fix when `distinct().sorted()` preserves the result. + Pitfall: `Stream.toList()` returns an unmodifiable list. Keep `Collectors.toList()` or use `Collectors.toCollection(ArrayList::new)` when the result is sorted, appended to, or otherwise mutated later. If a task says `Stream.toList()` is not valid for that mutable result, do not wrap it @@ -112,6 +125,121 @@ in `new ArrayList<>(...)`; use the mutable collector directly. ## External Mutation And Lambda Purity +Keep lambdas as short glue. If a stream operation needs branching, loops, temporary variables, +formatting, a merge rule, or a nested stream chain that would continue after the lambda line, +extract that work into a named method and pass a method reference or a concise one-expression +lambda whose body stays on the same line as `->`. + +For predicate lambdas, any multi-check filter condition should be extracted to a named helper before +the stream boundary if readability is at stake: + +```java +List overdueNotices(List shipments, Clock clock) { + LocalDate today = LocalDate.now(clock); + return shipments.stream() + .filter(shipment -> isOverdue(shipment, today)) + .map(shipment -> toNotice(shipment, today)) + .toList(); +} + +private static boolean isOverdue(Shipment shipment, LocalDate today) { + return shipment.deliveredAt().isEmpty() && shipment.dueDate().isBefore(today); +} + +private static ShipmentNotice toNotice(Shipment shipment, LocalDate today) { + long daysLate = ChronoUnit.DAYS.between(shipment.dueDate(), today); + return new ShipmentNotice( + shipment.id(), + shipment.customerEmail(), + daysLate, + daysLate >= 14 ? "critical" : "late"); +} +``` + +Avoid burying derivation logic in a block lambda: + +```java +List escalations = tickets.stream() + .filter(Ticket::isOpen) + .map(ticket -> { + Duration age = Duration.between(ticket.openedAt(), now); + EscalationLevel level = levelFor(ticket.priority(), age); + return new TicketEscalation(ticket.id(), ticket.assignee(), level, age); + }) + .toList(); +``` + +Extract the logic so the stream chain stays readable and the helper can be tested directly: + +```java +List escalations = tickets.stream() + .filter(Ticket::isOpen) + .map(ticket -> escalationFor(ticket, now)) + .toList(); + +private static TicketEscalation escalationFor(Ticket ticket, Instant now) { + Duration age = Duration.between(ticket.openedAt(), now); + EscalationLevel level = levelFor(ticket.priority(), age); + return new TicketEscalation(ticket.id(), ticket.assignee(), level, age); +} +``` + +This is required even when the helper only builds one output record: temporary values and branching +inside `map(x -> { ... })` are not glue. + +Avoid wrapping a nested stream body inside a collector callback: + +```java +Map> openCaseIdsByOwner = accounts.stream() + .collect(Collectors.groupingBy( + Account::ownerTeam, + Collectors.flatMapping( + account -> account.supportCases().stream() + .filter(SupportCase::isOpen) + .map(SupportCase::caseId), + Collectors.toList()))); +``` + +Extract the nested stream so the collector reads as composition: + +```java +Map> openCaseIdsByOwner = accounts.stream() + .collect(Collectors.groupingBy( + Account::ownerTeam, + Collectors.flatMapping( + AccountCaseReports::openCaseIds, + Collectors.toList()))); + +private static Stream openCaseIds(Account account) { + return account.supportCases().stream() + .filter(SupportCase::isOpen) + .map(SupportCase::caseId); +} +``` + +The same rule applies to downstream `flatMapping` for sorted or de-duplicated nested data. Keep the +collector callback as a method reference: + +```java +Map> emailsByTrack = conferences.stream() + .flatMap(conference -> conference.sessions().stream()) + .filter(session -> session.track() != null) + .collect(Collectors.groupingBy( + Session::track, + Collectors.flatMapping( + SessionReports::optedInEmails, + Collectors.collectingAndThen( + Collectors.toCollection(TreeSet::new), + ArrayList::new)))); + +private static Stream optedInEmails(Session session) { + return session.registrations().stream() + .filter(Registration::optedIn) + .map(Registration::email) + .filter(Objects::nonNull); +} +``` + Do not build ordinary stream results by mutating external state from `forEach`: ```java @@ -132,7 +260,9 @@ List labels = orders.stream() Use `collect(Collectors.toList())` instead when the Java baseline is below 16 or the result must be mutable. The direct collector/toList form is the default correctness and readability fix, not a guaranteed throughput win. It may only marginally affect throughput by itself, but it removes shared -mutation and gives a safe baseline for benchmarking. +mutation and gives a safe baseline for benchmarking. Explain the rewrite in terms of ownership and +correctness first; discuss low-level allocation details only when measurements make them relevant. +Do not claim the original `ArrayList` resizing is `O(N^2)`; ordinary growth is amortized O(N) total. This parallel version is broken because it mutates a shared `ArrayList` from multiple workers: @@ -143,19 +273,31 @@ orders.parallelStream() .forEach(labels::add); ``` -In reviews, show the sequential direct-collection form first as the safe baseline. Keep the -performance decision separate: for large CPU-bound transformations, strongly recommend benchmarking -a side-effect-free parallel stream as the performance experiment before relying on it for speed. -Put that benchmark requirement next to the parallel recommendation, and warn that small lists or +In reviews, make the snippet the sequential direct-collection form unless the user explicitly asks +for parallel code. This removes shared mutation and creates a safe baseline; it is not proof that +the pipeline is faster. Keep the performance decision separate: for large CPU-bound transformations, +recommend benchmarking a side-effect-free parallel stream in prose before relying on it for speed. +Put that benchmark requirement next to the parallel mention, and warn that small lists or mostly-small call paths can be slower because splitting, merging, ordering, and common-pool -contention can outweigh the benefit: - -```java -List labels = orders.parallelStream() - .map(Order::label) - .toList(); +contention can outweigh the benefit. Do not call the speedup expected, significant, proportional to +available cores, or likely before measurements show it. Do not invent benchmark tables, ratios, +timing numbers, or a parallel code block; tell the reader to benchmark the real workload instead. +For "10 million item" examples, still use this same structure: correctness snippet first, +benchmark-only parallel prose second, and no scaling claims without measured results. + +Useful review wording for this pattern: + +```text +The first fix is to remove the external mutation; let the stream produce the list directly. +For throughput, benchmark the sequential version against a pure side-effect-free parallel stream on +the real workload before relying on parallelStream. `parallelStream()` can be slower for small lists +or mostly-small call paths. ``` +Do not include optional parallelism snippets for this pattern unless the user explicitly requests +parallel code. If requested, avoid multiline lambda bodies and prefer method references or +single-line helpers. + Only keep terminal `forEach` when the side effect is the operation's purpose, such as logging or calling an API, and the side effect is safe for the chosen stream mode. @@ -199,6 +341,36 @@ Map> partitionedProducts = products.stream() If a `groupingBy` classifier can return null, filter nulls or map them to an explicit non-null key before collecting. Do not treat possible null classifier keys as acceptable without proof. +If a loop skipped null keys before building a map, filter those null keys before `toMap`; the issue +is the behavior change, not a guaranteed null-key `NullPointerException`. +When a review includes a collector replacement snippet, include the supporting imports for helper +APIs such as `Function.identity()` or `BinaryOperator.minBy(...)`, and avoid tangential import +commentary unless missing imports are part of the stream issue. + +For nested data indexes with duplicate-key rules, prefer flattening first and expressing the merge +in the collector: + +```java +Map longestSessionByRoom = conferences.stream() + .flatMap(conference -> conference.sessions().stream()) + .filter(session -> session.room() != null) + .collect(Collectors.toMap( + Session::room, + Function.identity(), + SessionIndexes::longerSession)); +``` + +Extract the merge helper when tie-breaking takes more than a same-line expression, especially when +encounter order must decide equal values. + +```java +private static Session longerSession(Session first, Session second) { + return first.minutes() >= second.minutes() ? first : second; +} +``` + +When a task says to use nested records or helper types in the requested file, keep those types in +that file instead of splitting them into separate top-level files. Java 12+ `teeing` can combine two independent reductions over the same input. Prefer this for a min/max pair or price range instead of running two separate stream passes: @@ -218,13 +390,15 @@ Java 16+ `mapMulti` can be clearer for small conditional emission: ```java Set emailsWithoutTraining = companies.stream() .flatMap(company -> company.getEmployees().stream()) - .mapMulti((employee, consumer) -> { - if (employee instanceof Developer developer - && !developer.getSecureCodingTraining().isCompleted()) { - consumer.accept(developer.getEmail()); - } - }) + .mapMulti(TrainingReports::emitDeveloperEmailWithoutTraining) .collect(Collectors.toSet()); + +private static void emitDeveloperEmailWithoutTraining(Employee employee, Consumer emails) { + if (employee instanceof Developer developer + && !developer.getSecureCodingTraining().isCompleted()) { + emails.accept(developer.getEmail()); + } +} ``` For primitive subtype extraction, avoid emitting boxed primitives only to unbox them later: @@ -256,15 +430,15 @@ long result = LongStream.rangeClosed(1, 100_000) .sum(); ``` -For Java 24+ projects, `Gatherers.mapConcurrent` can be more appropriate than `parallelStream` for -blocking per-element calls when the project intentionally uses virtual-thread concurrency. Keep -concurrency bounded and call out timeout/error handling for remote API failures: +For Java 24+ projects, use bounded `Gatherers.mapConcurrent` for blocking per-element calls when the +task gives a concurrency limit and the project intentionally uses virtual-thread concurrency. Keep +existing helper methods that encode behavior: ```java List favoriteProducts = user.getFavoriteProducts().stream() .gather(Gatherers.mapConcurrent( 100, - product -> Map.entry(product, product.isInStock()))) + product -> Map.entry(product, isInStock(product)))) .filter(Map.Entry::getValue) .map(Map.Entry::getKey) .sorted(Comparator.comparing(Product::getName)) @@ -275,3 +449,8 @@ List favoriteProducts = user.getFavoriteProducts().stream() entry is null. Do not return `null` from a `mapConcurrent` mapper to mean "skip"; carry the element with an explicit boolean result, then filter and map afterward. If nulls can reach the carrier or the baseline is Java 8, use a null-tolerant project type or `AbstractMap.SimpleImmutableEntry`. +When the task provides a production service stub such as `AvailabilityApi.lookup(...)` or +`CalendarService.canSchedule(...)`, call that stub directly. Do not add delegate fields, alternate +overloads, package-private switches, or other test hooks unless the task explicitly asks for them. +Do not replace a bounded `mapConcurrent` stream with `CompletableFuture` fan-out when the task asks +for stream APIs.