ML project for document understanding in the wild.
Given noisy document images (blur, low illumination, perspective distortions), this pipeline:
- Retrieves relevant document images/regions from text queries.
- Runs visual question answering (VQA) on document images.
- Evaluates QA quality with Exact Match (EM) and token-level F1.
- Reports robustness by condition (clean vs degraded).
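The EM and token-level F1 scoring can be sketched as follows. This assumes SQuAD-style answer normalization (lowercasing, punctuation/article stripping); the actual `metrics.py` implementation may differ in details:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style).
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    # 1.0 iff the normalized strings are identical.
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    # Harmonic mean of token precision and recall over the normalized answers.
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

Per-condition robustness numbers are then just these scores averaged within each degradation condition.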
Most real document workflows use phone captures and imperfect scans, not clean PDFs.
This project is designed around that reality and maps to practical use cases:
- invoice/receipt/form automation
- mobile document search assistants
- assistive document reading tools
- noisy OCR+LLM production pipelines
Key components:

- Data curation: WildDoc subset prep and local image/CSV generation.
- Retrieval: Brute-force embedding retrieval + Top-1/Top-K evaluation.
- VQA: OpenAI-backed VQA interface for document QA.
- Robustness: Blur/brightness/perspective perturbation helpers and per-condition metrics.
- Tested codebase: Unit tests for data, retrieval, metrics, VQA, robustness, and index build flow.
Tech stack:

- Python, PyTorch, NumPy, pandas
- OpenCV (image perturbations)
- Hugging Face Datasets (WildDoc loading)
- OpenAI API (VQA runtime)
- pytest (unit testing)
Retrieval benchmark on a local WildDoc subset (200 samples: 100 ChartQA + 100 DocVQA), evaluated with Top-1 and Top-5 accuracy.
| Method | Top-1 | Top-5 | Notes |
|---|---|---|---|
| CLIP retrieval pipeline (openai/clip-vit-base-patch32) | 0.055 | 0.130 | End-to-end model pipeline in scripts/test_retrieval_model.py |
| Type-aware heuristic baseline (100-trial mean) | 0.007 | 0.041 | Keyword-based chart/doc routing + random retrieval within type |
| Random baseline (100-trial mean) | 0.005 | 0.025 | Uniform random top-k from full image pool |
Interpretation: the CLIP-based retrieval pipeline consistently outperforms both random and heuristic baselines on the same evaluation subset.
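The brute-force embedding retrieval and Top-K scoring behind these numbers can be sketched in plain NumPy. Embeddings are assumed precomputed (e.g. CLIP text and image features); this mirrors the evaluation logic but is not the actual `retrieval.py` code:

```python
import numpy as np

def top_k_indices(query_emb: np.ndarray, image_embs: np.ndarray, k: int = 5) -> np.ndarray:
    # Cosine similarity via L2-normalized dot products, brute force over all images.
    q = query_emb / np.linalg.norm(query_emb)
    ims = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = ims @ q
    return np.argsort(-sims)[:k]

def top_k_accuracy(query_embs, image_embs, gold, k: int = 5) -> float:
    # Fraction of queries whose gold image index appears in the top-k retrieved set.
    hits = sum(gold[i] in top_k_indices(q, image_embs, k) for i, q in enumerate(query_embs))
    return hits / len(gold)
```

Top-1 and Top-5 in the table above correspond to `k=1` and `k=5` respectively.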
Repository layout:

- `src/wilddoc/`
  - `data.py` – WildDoc subset preparation and CSV/image generation.
  - `retrieval.py` – retrieval logic, retrieval metrics, index manifest helpers.
  - `vqa.py` – OpenAI-backed VQA wrapper and VQA evaluation.
  - `metrics.py` – EM/F1 metrics.
  - `robustness.py` – perturbations and robustness report aggregation.
- `scripts/`
  - `build_index.py` – build index manifest from `questions.csv` + images.
  - `run_retrieval_eval.py` – retrieval evaluation from a prediction CSV.
  - `run_vqa_eval.py` – VQA evaluation from a prediction CSV.
  - `run_robustness_eval.py` – robustness evaluation by condition from a prediction CSV.
- `tests/` – unit tests covering module and CLI-facing behavior.
Setup:

```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .
```

The VQA module makes real OpenAI API calls. Set this environment variable locally before running VQA-related code:
`OPENAI_API_KEY`
Do not commit secrets. A local `.env` file is ignored via `.gitignore`.
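A defensive pattern for reading the key at runtime is to fail fast with a clear message rather than surface a confusing auth error later. This is a sketch; the actual `vqa.py` may handle configuration differently:

```python
import os

def get_api_key() -> str:
    # Read the key from the environment and fail fast if it is missing.
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running VQA-related code."
        )
    return key
```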
Run the test suite:

```bash
python -m pytest
```

Ask a question about your own document image:

```bash
python scripts/ask_doc.py \
--image path/to/your_document.jpg \
--question "What is the due date?"
```

Build index manifest:
```bash
python scripts/build_index.py \
--questions-csv data/questions.csv \
--image-dir data/images \
--output-json data/index_manifest.json
```

Evaluate retrieval predictions:
```bash
python scripts/run_retrieval_eval.py \
--csv data/retrieval_predictions.csv \
--k 5 \
--output-json data/retrieval_metrics.json
```

Run the end-to-end retrieval model test (CLIP embeddings + Top-K metrics):
```bash
python scripts/test_retrieval_model.py \
--questions-csv data/questions.csv \
--image-dir data/images \
--model-name openai/clip-vit-base-patch32 \
--k 5 \
--max-rows 200 \
--output-pred-csv data/retrieval_predictions.csv \
--output-metrics-json data/retrieval_metrics.json
```

Evaluate VQA predictions:
```bash
python scripts/run_vqa_eval.py \
--csv data/vqa_predictions.csv \
--output-json data/vqa_metrics.json
```

Evaluate robustness by condition:
```bash
python scripts/run_robustness_eval.py \
--csv data/robustness_predictions.csv \
--condition-col condition \
--output-json data/robustness_metrics.json
```

This project's implementation and experimental framing are inspired by concepts and assignments from CISC451 (multimodal/document understanding course content).
The codebase and engineering structure in this repository are independently developed as a standalone portfolio project.