Updated: 2026-03-12
This page tracks the benchmark sets that are most useful for the current product decisions.
A document is counted as release-ready only if it is:
- compliant
- fidelity-passed
- finalized as `complete`, not `manual_remediation`
Optional visible review items are advisory only.
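The release-ready gate above can be sketched as a single predicate. This is a hypothetical illustration; the field names (`compliant`, `fidelity_passed`, `finalized`) are assumptions, not the app's actual schema.

```python
# Hypothetical sketch of the release-ready predicate described above.
# Field names are assumptions, not the pipeline's real record schema.
def is_release_ready(doc: dict) -> bool:
    """Release-ready = compliant AND fidelity-passed AND finalized as
    complete (not manual_remediation). Visible review items are advisory
    and never block release."""
    return (
        doc.get("compliant", False)
        and doc.get("fidelity_passed", False)
        and doc.get("finalized") == "complete"
    )

print(is_release_ready({"compliant": True, "fidelity_passed": True, "finalized": "complete"}))  # True
print(is_release_ready({"compliant": True, "fidelity_passed": True, "finalized": "manual_remediation"}))  # False
```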
Artifact: backend/data/benchmarks/corpus_20260308_202258/corpus_report.md
- 25 / 25 successful outputs release-ready
- 2 failed inputs, both damaged PDFs
This is the main regression corpus for the current pipeline.
Current measured run:
- backend/data/benchmarks/corpus_20260311_121723/corpus_report.md
- backend/data/benchmarks/corpus_20260311_121723/corpus_summary.json
Corpus mix:
- articles and readings
- guides and admin documents
- syllabi/course materials
- scanned office documents
Current result:
- 7 / 7 complete
- 7 / 7 compliant
- 7 / 7 fidelity-passed
- 7 / 7 release-ready
- 0 manual remediation
Measured from recorded provider usage/cost fields, not a hand-built pricing estimate.
| Metric | Value |
|---|---|
| Total cost | $0.179212 |
| Average cost / PDF | $0.025602 |
| Median cost / PDF | $0.013667 |
| Average cost / page | $0.003144 |
| Average runtime / PDF | 76.45s |
| Median runtime / PDF | 54.91s |
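The table values come straight from recorded usage fields in `corpus_summary.json`. A minimal sketch of how such a summary could be derived, assuming a hypothetical per-document record layout (the field names and sample numbers here are illustrative, not the real corpus data):

```python
import statistics

# Illustrative per-document usage records; real values live in
# corpus_summary.json and these field names are assumptions.
docs = [
    {"cost_usd": 0.013667, "pages": 4, "runtime_s": 54.91},
    {"cost_usd": 0.047978, "pages": 8, "runtime_s": 76.45},
]

costs = [d["cost_usd"] for d in docs]
total_cost = sum(costs)
avg_cost_per_pdf = total_cost / len(docs)
median_cost_per_pdf = statistics.median(costs)
avg_cost_per_page = total_cost / sum(d["pages"] for d in docs)
median_runtime = statistics.median(d["runtime_s"] for d in docs)
```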
Interpretation:
- this is the current release-sanity corpus after the execution-first review cleanup
- ordinary (non-huge) CUNY documents now often cost only a few cents each on this representative set
- the remaining cost outliers are still figure-heavy or semantics-heavy guides
Acceptance run:
Result:
- 7 / 7 complete
- 7 / 7 compliant
- 7 / 7 fidelity-passed
This is a stress suite, not the normal CUNY audience baseline.
irs_ss4.pdf before semantic batching:
- source: backend/data/benchmarks/corpus_20260309_131058/corpus_summary.json
- 89 LLM requests, $0.331174, 125.70s
irs_ss4.pdf after page-scoped batching:
- source: backend/data/benchmarks/corpus_20260309_134239/corpus_summary.json
- 5 LLM requests, $0.047978, 56.84s
Delta:
- requests: -94.4%
- cost: -85.5%
- runtime: -54.8%
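The deltas follow directly from the two runs. A quick check, using the before/after numbers quoted above:

```python
# Before/after metrics for irs_ss4.pdf, copied from the two runs above.
before = {"requests": 89, "cost_usd": 0.331174, "runtime_s": 125.70}
after = {"requests": 5, "cost_usd": 0.047978, "runtime_s": 56.84}

def pct_delta(b: float, a: float) -> float:
    """Percentage change from b to a, rounded to one decimal place."""
    return round((a - b) / b * 100, 1)

for key in before:
    print(key, pct_delta(before[key], after[key]))
# requests -94.4 / cost_usd -85.5 / runtime_s -54.8
```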
The biggest clean wins so far come from:
- semantic-unit prompt caching
- page-scoped form batching with per-field fallback
- Gemini structured outputs instead of looser JSON prompting
- provider retry/backoff instead of rerunning whole workflows after transient failures
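The last item, provider retry/backoff, can be sketched as a small wrapper. This is a generic illustration of the technique, not the app's actual retry policy; the exception type, attempt limit, and jitter range are assumptions.

```python
import random
import time

def call_with_backoff(fn, max_attempts=4, base_delay=1.0):
    """Retry a provider call on transient failure with exponential backoff,
    instead of rerunning the whole workflow. Limits are illustrative."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter to avoid synchronized retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```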
The next likely cost target for the real CUNY audience is figure-heavy guide/admin documents.
There is now an explicit round-trip benchmark design for stronger verification than compliance plus fidelity alone:
- start from a gold accessible PDF
- strip benchmark-target accessibility semantics
- remediate the stripped file
- compare the output back to the gold file
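The four steps above can be sketched as one loop. The three callables are placeholders standing in for `scripts/strip_accessibility.py`, the remediation pipeline, and `scripts/roundtrip_compare.py`; this is a shape sketch, not the runner's actual code.

```python
# Sketch of the round-trip benchmark loop described above.
# strip / remediate / compare are placeholder callables.
def roundtrip_benchmark(gold_pdf, strip, remediate, compare):
    stripped = strip(gold_pdf)            # remove benchmark-target semantics
    candidate = remediate(stripped)       # run the pipeline on the stripped file
    return compare(gold_pdf, candidate)   # score recovery against the gold file
```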
Current stripping utility:
```shell
cd backend
PYTHONPATH=. uv run python scripts/strip_accessibility.py \
  --input /path/to/gold-accessible.pdf \
  --output data/benchmarks/roundtrip/mydoc_stripped.pdf
```

Comparison utility:
```shell
cd backend
PYTHONPATH=. uv run python scripts/roundtrip_compare.py \
  --gold /path/to/gold-accessible.pdf \
  --candidate /path/to/remediated-output.pdf \
  --manifest /path/to/mydoc.roundtrip.json
```

Corpus runner:
```shell
cd backend
PYTHONPATH=. uv run python scripts/roundtrip_corpus_benchmark.py
```

The round-trip runner defaults to the assistive-core workflow profile. That profile keeps the full downstream validation/fidelity/review loop and skips only the figure alt-text branch. Use `--workflow-profile full` when you want figure/alt-text behavior included as well.
The round-trip comparison now reports form field presence and field-type recovery separately from exact accessible-name replay, so assistive-core form checks can be written against name/role/value semantics rather than a single gold /TU string.
Adobe Accessibility Checker can be run locally against benchmark outputs when Acrobat-style evidence is needed. This is intentionally not part of app validation because it uploads PDFs to Adobe and spends one PDF Services transaction per document:
```shell
cd backend
uv run --with pdfservices-sdk python scripts/adobe_accessibility_check.py \
  /path/to/candidate.pdf \
  --credentials /path/to/PDFServicesAPI-Credentials.zip \
  --output-dir data/adobe-accessibility-checks \
  --confirm-spend
```

The helper records local usage in `~/.cache/pdf-accessibility-app/adobe_accessibility_usage.json` and defaults to a local cap of 100 transactions per month.
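The cap check can be sketched as a small guard over that usage file. The JSON layout (a month-keyed count) is an assumption about `adobe_accessibility_usage.json`, not its verified format.

```python
import json
from pathlib import Path

# Sketch of the local monthly transaction-cap guard. The month-keyed
# count layout is an assumption, not the helper's verified file format.
def under_monthly_cap(usage_path: Path, month: str, cap: int = 100) -> bool:
    """Return True while this month's recorded spend is below the cap."""
    if not usage_path.exists():
        return True  # no usage recorded yet
    usage = json.loads(usage_path.read_text())
    return usage.get(month, 0) < cap
```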