This repo now uses an autoresearch-style loop for codec work.
Improve gopus with a libopus-parity-first mixed quality+feature loop while keeping the judge fixed.
The default target is:
- improve quality first, using libopus 1.6.1 parity as the reference
- close explicit libopus capability gaps when they are high-value, testable, and backed by a pinned judge
- close the fair gopus-vs-libopus speech encode throughput gap when the change is performance-facing
- preserve zero-allocation hot paths
Coordinate all researcher work under three top-level lanes:
- `performance`: measurable throughput, latency, or allocation improvements
- `libopus parity`: closer behavioral, quality, and supported-capability alignment with libopus 1.6.1
- `code quality / maintainability`: simpler structure, stronger tests, lower maintenance risk, and clearer ownership
The existing `autoresearch.sh` focus flags remain useful judge surfaces, but claiming, queueing, and merge coordination should use these three lanes.
For the lane-to-FOCUS mapping:
- the `performance` lane usually uses `FOCUS=performance`
- the `libopus parity` lane usually uses `FOCUS=quality`; use `FOCUS=mixed` only when the slice closes an explicit libopus capability gap with a pinned judge
- the `code quality / maintainability` lane should stay manual or tightly scoped unless the user provides a measurable target
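The lane-to-FOCUS mapping can be sketched as a small dispatch function. This is illustrative only: the `manual` value below is a placeholder for "no FOCUS; stay manual", not a real `autoresearch.sh` flag, and the capability-gap exception for `FOCUS=mixed` still has to be decided by the researcher.

```shell
# Map a top-level lane name to the FOCUS value it usually runs with.
# "manual" is a placeholder meaning "no automated focus flag".
lane_to_focus() {
  case "$1" in
    performance)                      echo "performance" ;;
    "libopus parity")                 echo "quality" ;;  # FOCUS=mixed only for pinned capability-gap slices
    "code quality / maintainability") echo "manual" ;;   # stays manual unless a measurable target exists
    *) echo "unknown lane: $1" >&2; return 1 ;;
  esac
}

lane_to_focus "performance"
lane_to_focus "libopus parity"
```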
Start each run on a fresh branch such as `autoresearch/<tag>`.
Then:
- Read `program.md`, `AGENTS.md`, and `README.md`.
- Run `make autoresearch-init`.
- Run `make autoresearch-preflight`.
- Open the shared draft PR claim before any editable code change:

  ```
  ./tools/prepare_claim_pr.sh --lane performance --surface encoder --tag perf-try-1 \
    --hypothesis "State the current idea here." --push --create-draft
  ```

  If GitHub requires branch history before opening the draft PR, the helper creates a single empty claim commit so other workers can see the branch before editable work starts.
- Run the baseline exactly once: `make autoresearch-eval DESCRIPTION=baseline`. The baseline row becomes the first successful row in the focus-specific results ledger.
To let Codex drive repeated iterations automatically, run `make autoresearch-loop MAX_ITERATIONS=5`. Omit `MAX_ITERATIONS` to keep looping until interrupted.
When useful, split the loop into two read-only scout agents:
- one quality lane to inspect parity, compliance, and ratchet evidence
- one feature lane to inspect allowlisted unimplemented work with an explicit judge
Keep the main loop on one editable surface at a time.
When more than one researcher is active, use a shared claim surface before editing. The preferred claim surface is an open draft PR.
Rules:
- Every editable branch must have exactly one active claim.
- The claim must name the lane, editable surface, owner, and current hypothesis.
- Only one active editable claim may own a given `(lane, editable surface)` pair at a time.
- If that pair is already claimed, switch to read-only scouting, review, or a different pair instead of starting overlapping edits.
- Keep one researcher to one editable branch at a time.
- If no shared claim surface exists, fall back to one editable researcher at a time.
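The claim rules above can be sketched as a pair check. This is illustrative mock logic only: the real claim surface is the open draft PR queue, not a local list, and the `lane|surface` encoding below is an assumption made up for the example.

```shell
# Decide whether a (lane, surface) pair is free, given the currently
# active claims as "lane|surface" strings. Returns nonzero if claimed.
claim_is_free() {
  pair="$1"; shift
  for active in "$@"; do
    [ "$active" = "$pair" ] && return 1  # claimed: scout or pick another pair
  done
  return 0
}

if claim_is_free "performance|celt" "performance|encoder" "quality|silk"; then
  echo "free: open a draft PR claim before editing"
else
  echo "claimed: switch to read-only scouting"
fi
```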
Use draft PRs as the coordination queue:
- create the draft PR before any editable code change; use an empty claim commit if the branch needs visible history first
- keep the title generic and change-focused
- record the current blocker, next action, and latest attempt/results in the PR body
- close or retarget stale claims quickly so the queue stays trustworthy
Unlike the original autoresearch repo, gopus is not a single-file project.
Keep each run to one editable surface:
- `celt/`
- `encoder/`
- `silk/`
- `container/ogg/`
- one narrowly-scoped root wrapper/control surface directly supporting that area
- one narrowly-scoped benchmark or test helper directly supporting that surface
Do not spread one experiment across multiple subsystems unless the change is structurally required.
Do not edit these during normal experiments:
- `program.md`
- `tools/autoresearch.sh`
- `tools/benchguard/main.go`
- `tools/bench_guardrails.json`
- `testvectors/testdata/`
- `tmp_check/`
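A hypothetical pre-edit guard for the frozen paths above might look like the sketch below. The function name and the idea of feeding it a file list are assumptions; in a real hook the list would come from `git diff --name-only`.

```shell
# Refuse an experiment that touches files frozen during normal experiments.
# Directory entries (trailing slash) match by prefix; files match exactly.
check_protected() {
  protected="program.md tools/autoresearch.sh tools/benchguard/main.go \
tools/bench_guardrails.json testvectors/testdata/ tmp_check/"
  for f in "$@"; do
    for p in $protected; do
      case "$f" in
        "$p"|"$p"*) echo "blocked: $f is frozen during experiments"; return 1 ;;
      esac
    done
  done
  echo "ok"
}

check_protected silk/encoder.go celt/bands.go       # prints: ok
check_protected tmp_check/opus-1.6.1/celt.c || true # prints a "blocked" line
```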
Each experiment is judged in this order:
- Focused quality/parity for `quality`, `mixed`, and `unimplemented`: `make test-quality`. For `performance`, keep using:

  ```
  GOWORK=off GOPUS_TEST_TIER=parity GOPUS_STRICT_LIBOPUS_REF=1 \
  go test ./testvectors -run 'TestSILKParamTraceAgainstLibopus|TestEncoderComplianceSummary' -count=1
  ```

- Hot-path guardrails for every lane: `make bench-guard`
- Allowlisted unimplemented-feature checks for `mixed` and `unimplemented`: `GOWORK=off go test ./container/ogg -count=1`. The current safe seed is `mix-arrivals-f32wav`.
- Fair throughput comparison for `performance`:

  ```
  GOWORK=off go run ./examples/bench-encode -in reports/autoresearch/speech.ogg \
    -iters 2 -warmup 1 -mode both -bitrate 64000 -complexity 10
  ```

Use `make verify-production` only before proposing merge-ready changes, not on every loop.
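The judging order above can be summarized as a small dispatch that prints the focused check for a given FOCUS value. This is a sketch: it prints commands instead of running them, abbreviates the performance command's flags, and leaves out the per-lane `make bench-guard` step, which runs for every lane regardless of focus.

```shell
# Print the focused quality/parity or throughput judge for a FOCUS value.
# Abbreviated and illustrative; see the full commands in the checklist above.
first_judge() {
  case "$1" in
    quality|mixed|unimplemented) echo "make test-quality" ;;
    performance) echo "GOWORK=off go run ./examples/bench-encode -mode both" ;;
    *) echo "unknown FOCUS: $1" >&2; return 1 ;;
  esac
}

first_judge quality      # prints: make test-quality
first_judge performance
```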
The results ledgers are intentionally local and untracked.
- `results.tsv`: performance lane
- `results.quality.tsv`: quality lane
- `results.unimplemented.tsv`: allowlisted unimplemented-feature lane
- `results.mixed.tsv`: mixed lane
Performance header:

`commit parity benchguard gopus_avg_rt libopus_avg_rt rt_ratio status description`

Quality-like header:

`commit quality benchguard quality_mean_gap_db quality_min_gap_db score status description`
Status values: `baseline`, `keep`, `discard`, `crash`.
`rt_ratio` is `gopus_avg_rt / libopus_avg_rt` for the performance lane.
For the quality-like lanes, higher score is better.
The quality-like score combines the encoder compliance gap summary with the minimum Hybrid->CELT transition SNR emitted by `make test-quality`.
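The `rt_ratio` column can be recomputed from any performance-ledger row, which is a quick sanity check when reading `results.tsv`. The row below is mock data in the header's column order; the commit hash and timings are made up.

```shell
# Recompute rt_ratio = gopus_avg_rt / libopus_avg_rt (columns 4 and 5)
# from a mock tab-separated performance-ledger row.
printf 'abc1234\tpass\tpass\t2.40\t1.60\t1.50\tkeep\tmock row\n' |
  awk -F'\t' '{ printf "rt_ratio=%.2f\n", $4 / $5 }'
# prints: rt_ratio=1.50
```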
For the top-level management lanes:
- `performance` should prefer hard metrics and ledger rows
- `libopus parity` should prefer explicit target tests, fixtures, or side-by-side evidence against libopus
- `code quality / maintainability` may use qualitative evidence, but the PR must still name the concrete simplification, risk reduction, or test improvement being claimed
Loop forever until the human stops you:
- Inspect active claims and choose an unclaimed `(lane, editable surface)` pair.
- Look at the current branch, HEAD commit, and the best successful row with `make autoresearch-best`.
- Refresh the draft PR claim with the current blocker and next action before editing.
- Make one idea-sized change inside the chosen surface.
- Commit the experiment before evaluation.
- Run `make autoresearch-eval DESCRIPTION='short experiment note'`.
- Read the appended row in the focus-specific results ledger.
- Update the draft PR claim with the attempt description, latest result row, blocker, and next action.
- If the status is `keep`, continue from that commit.
- If the status is `discard`, rewind to the prior successful commit and try the next idea.
- If the status is `crash`, fix only obvious mechanical mistakes; otherwise abandon the idea and move on.
If you want the repository to drive Codex directly instead of relying on a human-operated agent session, use `make autoresearch-loop`.

Keep a change only when all of these are true:
- the lane's required quality/parity checks pass
- `bench-guard` passes
- the lane's score improves:
  - `rt_ratio` for `performance`
  - quality score for `quality`
  - allowlisted feature score for `unimplemented`
  - mixed score for `mixed`
If results are effectively flat, prefer the simpler change.
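The keep/discard rule for the performance lane can be sketched as below. One assumption is hedged explicitly: this treats a lower `rt_ratio` (gopus time relative to libopus) as an improvement, which matches closing a throughput gap where gopus is the slower side; a tie counts as flat, so the change is discarded in favor of the simpler baseline.

```shell
# Performance-lane keep/discard sketch. Arguments: required-checks result,
# bench-guard result, new rt_ratio, best rt_ratio so far. Assumes lower
# rt_ratio is better; ties are treated as flat and discarded.
decide() {
  checks="$1"; benchguard="$2"; new_ratio="$3"; best_ratio="$4"
  [ "$checks" = pass ] && [ "$benchguard" = pass ] || { echo discard; return; }
  awk -v n="$new_ratio" -v b="$best_ratio" \
    'BEGIN { if (n < b) print "keep"; else print "discard" }'
}

decide pass pass 1.42 1.50   # prints: keep
decide pass pass 1.55 1.50   # prints: discard
decide fail pass 1.20 1.50   # prints: discard (required checks failed)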
Use a single merge steward or an explicit sequential merge queue.
Rules:
- Merge only one green experimental slice at a time.
- Prefer the oldest unblocked green PR unless a later PR is explicitly dependent on an earlier foundational slice.
- Before merge, rebase onto the current queue head and rerun the lane's named evidence:
  - `performance`: the relevant benchmark or ledger-backed judge
  - `libopus parity`: the targeted parity, capability, or compatibility checks against libopus
  - `code quality / maintainability`: the targeted tests plus the structural evidence named in the PR
- Run `make bench-guard` before merge when the change touches a hot path.
- After any merge, every open PR touching the same surface or shared helper must rebase and revalidate before it can merge.
- Do not batch-merge multiple experiments together.
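The "oldest unblocked green PR" rule can be sketched with mock queue data. The `number:state` encoding is invented for the example; a real steward would read state from the draft-PR queue and still has to apply the dependency exception by hand.

```shell
# Pick the oldest unblocked green PR from a queue ordered oldest-first.
# Entries are mock "number:state" strings; prints "none" if nothing is green.
pick_next() {
  for pr in "$@"; do
    case "$pr" in
      *:green) echo "${pr%%:*}"; return 0 ;;
    esac
  done
  echo "none"; return 1
}

pick_next 101:blocked 104:green 107:green   # prints: 104
```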
- libopus 1.6.1 in `tmp_check/opus-1.6.1/` remains the source of truth.
- Avoid heuristic codec tuning before source alignment.
- Do not treat raw `ErrUnimplemented` stubs as loop targets unless a pinned judge exists.
- Prefer short loops over large refactors.