Skip to content

Latest commit

 

History

History
158 lines (113 loc) · 5.19 KB

File metadata and controls

158 lines (113 loc) · 5.19 KB

Benchmarks And Cost

Updated: 2026-03-12

This page tracks the benchmark sets that are most useful for the current product decisions.

Release gate reminder

A document is counted as release-ready only if it is:

  1. compliant
  2. fidelity-passed
  3. finalized as complete, not manual_remediation

Optional visible review items are advisory only.

Exact curated corpus

Artifact: backend/data/benchmarks/corpus_20260308_202258/corpus_report.md

  • 25 / 25 successful outputs release-ready
  • 2 failed inputs, both damaged PDFs

This is the main regression corpus for the current pipeline.

Representative non-huge corpus

Current measured run:

Corpus mix:

  • articles and readings
  • guides and admin documents
  • syllabi/course materials
  • scanned office documents

Current result:

  • 7 / 7 complete
  • 7 / 7 compliant
  • 7 / 7 fidelity-passed
  • 7 / 7 release-ready
  • 0 manual remediation

Cost summary

Measured from recorded provider usage/cost fields, not a hand-built pricing estimate.

Metric Value
Total cost $0.179212
Average cost / PDF $0.025602
Median cost / PDF $0.013667
Average cost / page $0.003144
Average runtime / PDF 76.45s
Median runtime / PDF 54.91s

Interpretation:

  • this is the current release-sanity corpus after the execution-first review cleanup
  • ordinary non-huge CUNY documents are now often only a few cents each on this representative set
  • the remaining cost outliers are still figure-heavy or semantics-heavy guides

Official form set (stress suite)

Acceptance run:

Result:

  • 7 / 7 complete
  • 7 / 7 compliant
  • 7 / 7 fidelity-passed

This is a stress suite, not the normal CUNY audience baseline.

Batching/caching win case

irs_ss4.pdf before semantic batching:

irs_ss4.pdf after page-scoped batching:

Delta:

  • requests: -94.4%
  • cost: -85.5%
  • runtime: -54.8%

What cost optimization currently means

The biggest clean wins so far come from:

  • semantic-unit prompt caching
  • page-scoped form batching with per-field fallback
  • Gemini structured outputs instead of looser JSON prompting
  • provider retry/backoff instead of rerunning whole workflows after transient failures

The next likely cost target for the real CUNY audience is figure-heavy guide/admin documents.

Stronger verification direction

There is now an explicit round-trip benchmark design for stronger verification than compliance plus fidelity alone:

  • start from a gold accessible PDF
  • strip benchmark-target accessibility semantics
  • remediate the stripped file
  • compare the output back to the gold file

See:

Current stripping utility:

cd backend
PYTHONPATH=. uv run python scripts/strip_accessibility.py \
  --input /path/to/gold-accessible.pdf \
  --output data/benchmarks/roundtrip/mydoc_stripped.pdf

Comparison utility:

cd backend
PYTHONPATH=. uv run python scripts/roundtrip_compare.py \
  --gold /path/to/gold-accessible.pdf \
  --candidate /path/to/remediated-output.pdf \
  --manifest /path/to/mydoc.roundtrip.json

Corpus runner:

cd backend
PYTHONPATH=. uv run python scripts/roundtrip_corpus_benchmark.py

The round-trip runner defaults to the assistive-core workflow profile. That profile keeps the full downstream validation/fidelity/review loop and skips only the figure alt-text branch. Use --workflow-profile full when you want figure/alt-text behavior included as well.

The round-trip comparison now reports form field presence and field-type recovery separately from exact accessible-name replay, so assistive-core form checks can be written against name/role/value semantics rather than a single gold /TU string.

Adobe Accessibility Checker can be run locally against benchmark outputs when Acrobat-style evidence is needed. This is intentionally not part of app validation because it uploads PDFs to Adobe and spends one PDF Services transaction per document:

cd backend
uv run --with pdfservices-sdk python scripts/adobe_accessibility_check.py \
  /path/to/candidate.pdf \
  --credentials /path/to/PDFServicesAPI-Credentials.zip \
  --output-dir data/adobe-accessibility-checks \
  --confirm-spend

The helper records local usage in ~/.cache/pdf-accessibility-app/adobe_accessibility_usage.json and defaults to a local cap of 100 transactions per month.