Ziqing110/wilddoc-robust-vqa
WildDoc-Robust: Multimodal Document Retrieval + VQA

ML project for document understanding in the wild.
Given noisy document images (blur, low illumination, perspective distortions), this pipeline:

  • Retrieves relevant document images/regions from text queries.
  • Runs visual question answering (VQA) on document images.
  • Evaluates QA quality with Exact Match (EM) and token-level F1.
  • Reports robustness by condition (clean vs degraded).
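For reference, EM and token-level F1 can be sketched as below. This is a minimal version; the repo's `metrics.py` may normalize differently (SQuAD-style metrics also strip punctuation and articles):

```python
from collections import Counter

def normalize(text: str) -> str:
    # Simple normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    pred_toks = normalize(pred).split()
    gold_toks = normalize(gold).split()
    if not pred_toks or not gold_toks:
        return float(pred_toks == gold_toks)
    # Multiset token overlap between prediction and gold answer.
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```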

Why this project matters

Most real document workflows use phone captures and imperfect scans, not clean PDFs.
This project is designed around that reality and maps to practical use cases:

  • invoice/receipt/form automation
  • mobile document search assistants
  • assistive document reading tools
  • noisy OCR+LLM production pipelines

Core Features

  • Data curation: WildDoc subset prep and local image/CSV generation.
  • Retrieval: Brute-force embedding retrieval + Top-1/Top-K evaluation.
  • VQA: OpenAI-backed VQA interface for document QA.
  • Robustness: Blur/brightness/perspective perturbation helpers and per-condition metrics.
  • Tested codebase: Unit tests for data, retrieval, metrics, VQA, robustness, and index build flow.
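The brute-force retrieval step above amounts to cosine similarity between a query embedding and every image embedding, followed by a Top-K accuracy check. A dependency-light sketch (toy NumPy arrays stand in for CLIP text/image embeddings; function names are illustrative, not the repo's API):

```python
import numpy as np

def top_k_indices(query_emb: np.ndarray, image_embs: np.ndarray, k: int) -> np.ndarray:
    """Brute-force retrieval: cosine similarity of one query vs. all images."""
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ q
    # Indices of the k highest similarities, best first.
    return np.argsort(-sims)[:k]

def top_k_accuracy(query_embs, image_embs, gold_indices, k: int) -> float:
    """Fraction of queries whose gold image appears in the top-k retrieved."""
    hits = sum(
        gold in top_k_indices(q, image_embs, k)
        for q, gold in zip(query_embs, gold_indices)
    )
    return hits / len(gold_indices)
```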

Tech Stack

  • Python, PyTorch, NumPy, pandas
  • OpenCV (image perturbations)
  • Hugging Face Datasets (WildDoc loading)
  • OpenAI API (VQA runtime)
  • pytest (unit testing)

Benchmark Snapshot

Retrieval benchmark on a local WildDoc subset (200 samples: 100 ChartQA + 100 DocVQA), evaluated with Top-1 and Top-5 accuracy.

| Method | Top-1 | Top-5 | Notes |
|---|---|---|---|
| CLIP retrieval pipeline (openai/clip-vit-base-patch32) | 0.055 | 0.130 | End-to-end model pipeline in scripts/test_retrieval_model.py |
| Type-aware heuristic baseline (100-trial mean) | 0.007 | 0.041 | Keyword-based chart/doc routing + random retrieval within type |
| Random baseline (100-trial mean) | 0.005 | 0.025 | Uniform random top-k from full image pool |

Interpretation: on the same evaluation subset, the CLIP-based retrieval pipeline outperforms both baselines by a wide margin (roughly 11x random at Top-1 and 5x at Top-5), though absolute accuracy remains low given a 200-image candidate pool.
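The random-baseline numbers are easy to sanity-check: with exactly one correct image per query in a pool of N candidates, expected random Top-k accuracy is k/N, which matches the measured 100-trial means above:

```python
def expected_random_top_k(pool_size: int, k: int) -> float:
    # Drawing k distinct candidates uniformly from a pool of pool_size,
    # the single correct item lands in the draw with probability k / pool_size.
    return k / pool_size

print(expected_random_top_k(200, 1))  # 0.005
print(expected_random_top_k(200, 5))  # 0.025
```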

Repository Layout

  • src/wilddoc/
    • data.py – WildDoc subset preparation and CSV/image generation.
    • retrieval.py – retrieval logic, retrieval metrics, index manifest helpers.
    • vqa.py – OpenAI-backed VQA wrapper and VQA evaluation.
    • metrics.py – EM/F1 metrics.
    • robustness.py – perturbations and robustness report aggregation.
  • scripts/
    • build_index.py – build index manifest from questions.csv + images.
    • run_retrieval_eval.py – retrieval evaluation from prediction CSV.
    • run_vqa_eval.py – VQA evaluation from prediction CSV.
    • run_robustness_eval.py – robustness evaluation by condition from prediction CSV.
  • tests/ – unit tests covering module and CLI-facing behavior.

Setup

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .

Environment Variables

The VQA module makes real OpenAI API calls. Set this locally before running VQA-related code:

  • OPENAI_API_KEY

Do not commit secrets. Local .env is ignored by .gitignore.

Run Tests

python -m pytest

CLI Usage

Ask a question about your own document image:

python scripts/ask_doc.py \
  --image path/to/your_document.jpg \
  --question "What is the due date?"

Build index manifest:

python scripts/build_index.py \
  --questions-csv data/questions.csv \
  --image-dir data/images \
  --output-json data/index_manifest.json
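The manifest schema is not documented here, but the build step above (questions.csv + image directory → JSON manifest) could plausibly look like this sketch. Column names (`question_id`, `image_file`, `question`) and the output shape are assumptions, not the repo's actual schema:

```python
import csv
import json
from pathlib import Path

def build_index_manifest(questions_csv: str, image_dir: str, output_json: str) -> dict:
    """Pair each question row with its image file and write a JSON manifest."""
    image_root = Path(image_dir)
    entries = []
    with open(questions_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            image_path = image_root / row["image_file"]
            entries.append({
                "question_id": row["question_id"],
                "question": row["question"],
                "image_path": str(image_path),
                # Flag missing files so downstream eval can skip or fail fast.
                "image_exists": image_path.is_file(),
            })
    manifest = {"num_entries": len(entries), "entries": entries}
    Path(output_json).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```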

Evaluate retrieval predictions:

python scripts/run_retrieval_eval.py \
  --csv data/retrieval_predictions.csv \
  --k 5 \
  --output-json data/retrieval_metrics.json

Run end-to-end retrieval model testing (CLIP embeddings + Top-K metrics):

python scripts/test_retrieval_model.py \
  --questions-csv data/questions.csv \
  --image-dir data/images \
  --model-name openai/clip-vit-base-patch32 \
  --k 5 \
  --max-rows 200 \
  --output-pred-csv data/retrieval_predictions.csv \
  --output-metrics-json data/retrieval_metrics.json

Evaluate VQA predictions:

python scripts/run_vqa_eval.py \
  --csv data/vqa_predictions.csv \
  --output-json data/vqa_metrics.json
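The VQA wrapper itself is OpenAI-backed; one common way to package a document image plus a question for the OpenAI chat API is an inline base64 data URL in a user message. The sketch below only builds the request payload (sending it requires OPENAI_API_KEY and a call such as `openai.OpenAI().chat.completions.create(model=..., messages=...)`); the helper name is illustrative, not the repo's actual interface:

```python
import base64

def build_vqa_messages(image_bytes: bytes, question: str) -> list:
    """Build an OpenAI chat 'messages' payload with an inline base64 JPEG."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            # The question as plain text, the document as an image part.
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }]
```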

Evaluate robustness by condition:

python scripts/run_robustness_eval.py \
  --csv data/robustness_predictions.csv \
  --condition-col condition \
  --output-json data/robustness_metrics.json
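The repo's perturbation helpers use OpenCV (`cv2.GaussianBlur`, brightness scaling, perspective warps). A dependency-free NumPy sketch of two such degradations, plus the per-condition aggregation that a robustness report performs (function names and the row schema are illustrative):

```python
import numpy as np

def adjust_brightness(img: np.ndarray, factor: float) -> np.ndarray:
    """Scale pixel intensities of a uint8 image (factor < 1 darkens)."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def box_blur(img: np.ndarray, k: int = 3) -> np.ndarray:
    """Naive k x k box blur on a 2-D grayscale image (cv2.blur equivalent)."""
    pad = k // 2
    padded = np.pad(img.astype(np.float32), pad, mode="edge")
    out = np.zeros(img.shape, dtype=np.float32)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return (out / (k * k)).astype(np.uint8)

def report_by_condition(rows) -> dict:
    """Mean EM per condition; rows are dicts with 'condition' and 'em' keys."""
    by_cond = {}
    for r in rows:
        by_cond.setdefault(r["condition"], []).append(r["em"])
    return {cond: sum(v) / len(v) for cond, v in by_cond.items()}
```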

Academic Attribution

This project implementation and experimental framing are inspired by concepts and assignments from CISC451 (multimodal/document understanding course content).
The codebase and engineering structure in this repository are independently developed as a standalone portfolio project.
