ML project for document understanding in the wild.
Given noisy document images (blur, low illumination, perspective distortions), this pipeline:
- Retrieves relevant document images/regions from text queries.
- Runs visual question answering (VQA) on document images.
- Evaluates QA quality with Exact Match (EM) and token-level F1.
- Reports robustness by condition (clean vs degraded).
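The EM and token-level F1 scoring can be sketched as follows. This assumes SQuAD-style answer normalization (lowercasing, punctuation/article stripping); the actual `metrics.py` implementation may differ in details:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style).
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    # 1.0 iff the normalized strings are identical.
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    # Harmonic mean of token precision and recall over the normalized answers.
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

Per-condition robustness numbers are then just these scores averaged within each degradation condition.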
Most real document workflows use phone captures and imperfect scans, not clean PDFs.
This project is designed around that reality and maps to practical use cases:
- invoice/receipt/form automation
- mobile document search assistants
- assistive document reading tools
- noisy OCR+LLM production pipelines
Key components:

- Data curation: WildDoc subset prep and local image/CSV generation.
- Retrieval: Brute-force embedding retrieval + Top-1/Top-K evaluation.
- VQA: OpenAI-backed VQA interface for document QA.
- Robustness: Blur/brightness/perspective perturbation helpers and per-condition metrics.
- Tested codebase: Unit tests for data, retrieval, metrics, VQA, robustness, and index build flow.
Tech stack:

- Python, PyTorch, NumPy, pandas
- OpenCV (image perturbations)
- Hugging Face Datasets (WildDoc loading)
- OpenAI API (VQA runtime)
- pytest (unit testing)
Retrieval benchmark on a local WildDoc subset (200 samples: 100 ChartQA + 100 DocVQA), evaluated with Top-1 and Top-5 accuracy.
| Method | Top-1 | Top-5 | Notes |
|---|---|---|---|
| CLIP retrieval pipeline (openai/clip-vit-base-patch32) | 0.055 | 0.130 | End-to-end model pipeline in scripts/test_retrieval_model.py |
| Type-aware heuristic baseline (100-trial mean) | 0.007 | 0.041 | Keyword-based chart/doc routing + random retrieval within type |
| Random baseline (100-trial mean) | 0.005 | 0.025 | Uniform random top-k from full image pool |
Interpretation: the CLIP-based retrieval pipeline consistently outperforms both random and heuristic baselines on the same evaluation subset.
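The brute-force embedding retrieval and Top-K scoring behind these numbers can be sketched in plain NumPy. Embeddings are assumed precomputed (e.g. CLIP text and image features); this mirrors the evaluation logic but is not the actual `retrieval.py` code:

```python
import numpy as np

def top_k_indices(query_emb: np.ndarray, image_embs: np.ndarray, k: int = 5) -> np.ndarray:
    # Cosine similarity via L2-normalized dot products, brute force over all images.
    q = query_emb / np.linalg.norm(query_emb)
    ims = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = ims @ q
    return np.argsort(-sims)[:k]

def top_k_accuracy(query_embs, image_embs, gold, k: int = 5) -> float:
    # Fraction of queries whose gold image index appears in the top-k retrieved set.
    hits = sum(gold[i] in top_k_indices(q, image_embs, k) for i, q in enumerate(query_embs))
    return hits / len(gold)
```

Top-1 and Top-5 in the table above correspond to `k=1` and `k=5` respectively.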
Repository layout:

- `src/wilddoc/`
  - `data.py` – WildDoc subset preparation and CSV/image generation.
  - `retrieval.py` – retrieval logic, retrieval metrics, index manifest helpers.
  - `vqa.py` – OpenAI-backed VQA wrapper and VQA evaluation.
  - `metrics.py` – EM/F1 metrics.
  - `robustness.py` – perturbations and robustness report aggregation.
- `scripts/`
  - `build_index.py` – build index manifest from `questions.csv` + images.
  - `run_retrieval_eval.py` – retrieval evaluation from a prediction CSV.
  - `run_vqa_eval.py` – VQA evaluation from a prediction CSV.
  - `run_robustness_eval.py` – robustness evaluation by condition from a prediction CSV.
- `tests/` – unit tests covering module and CLI-facing behavior.
Setup:

```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .
```

The VQA module makes real OpenAI API calls. Set this environment variable locally before running VQA-related code:
`OPENAI_API_KEY`
Do not commit secrets. A local `.env` file is ignored via `.gitignore`.
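A defensive pattern for reading the key at runtime is to fail fast with a clear message rather than surface a confusing auth error later. This is a sketch; the actual `vqa.py` may handle configuration differently:

```python
import os

def get_api_key() -> str:
    # Read the key from the environment and fail fast if it is missing.
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running VQA-related code."
        )
    return key
```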
Run the test suite:

```bash
python -m pytest
```

Ask a question about your own document image:

```bash
python scripts/ask_doc.py \
--image path/to/your_document.jpg \
--question "What is the due date?"
```

Build index manifest:
```bash
python scripts/build_index.py \
--questions-csv data/questions.csv \
--image-dir data/images \
--output-json data/index_manifest.json
```

Evaluate retrieval predictions:
```bash
python scripts/run_retrieval_eval.py \
--csv data/retrieval_predictions.csv \
--k 5 \
--output-json data/retrieval_metrics.json
```

Run the end-to-end retrieval model test (CLIP embeddings + Top-K metrics):
```bash
python scripts/test_retrieval_model.py \
--questions-csv data/questions.csv \
--image-dir data/images \
--model-name openai/clip-vit-base-patch32 \
--k 5 \
--max-rows 200 \
--output-pred-csv data/retrieval_predictions.csv \
--output-metrics-json data/retrieval_metrics.json
```

Evaluate VQA predictions:
```bash
python scripts/run_vqa_eval.py \
--csv data/vqa_predictions.csv \
--output-json data/vqa_metrics.json
```

Evaluate robustness by condition:
```bash
python scripts/run_robustness_eval.py \
--csv data/robustness_predictions.csv \
--condition-col condition \
--output-json data/robustness_metrics.json
```

This project's implementation and experimental framing are inspired by concepts and assignments from CISC451 (multimodal/document understanding course content).
The codebase and engineering structure in this repository are independently developed as a standalone portfolio project.