This solution processes PDF files and extracts a structured outline based on headings. It outputs one JSON file per PDF and adheres strictly to the challenge constraints:
- No machine learning
- Fully offline
- Sub-10s processing for 50-page PDFs
- ≤ 16 GB RAM
- Works on Linux (AMD64)
This is a logic-based, non-ML solution that:

- Extracts raw text plus font and position data from PDFs using `pdf.js-extract`
- Converts content into markdown-style blocks, assigning heading levels (H1–H3) based on:
  - Font size
  - Font weight (bold/italic)
  - Vertical position
  - Regex-based heuristics
  - Text block analysis
  - Visual/positional logic
  - Template heading filtering
- Identifies headings from the markdown and tags them with page numbers
- Applies heuristic filtering to remove:
  - Repetitive headers (e.g. "Page 1", "Confidential")
  - Common footer patterns
- Outputs a JSON structure matching the provided schema:

```json
{
  "title": "Document Title",
  "outline": [
    { "level": "H1", "text": "Heading Text", "page": 1 },
    { "level": "H2", "text": "Heading Text", "page": 2 },
    { "level": "H3", "text": "Heading Text", "page": 3 }
  ]
}
```
No external APIs or internet access are used at any point.
All tools are open-source and installed inside the Docker container:
| Library | Purpose |
|---|---|
| `pdf.js-extract` | Extracts structured data from PDFs |
| `marked` | Markdown parser for heading extraction |
| Node.js (v18) | Main runtime for logic |
```
Adobe_Challange_1_a/
├── input/              # Place your input PDFs here
├── output/             # JSON outputs appear here
├── sample_dataset/     # Sample input/output/testing
│   ├── pdfs/
│   ├── outputs/
│   └── schema/output_schema.json
├── Dockerfile          # Docker config (in root)
├── process_pdfs.js     # Main logic
├── package.json        # Node.js dependencies
└── README.md           # You're reading it
```
Your solution must work in an isolated, offline, CPU-only Docker container.
```bash
git clone https://github.com/harshcodesdev/Adobe_Challange_1_a.git
cd Adobe_Challange_1_a

docker build --platform linux/amd64 -t pdf-processor .

docker run --rm \
  -v $(pwd)/sample_dataset/pdfs:/app/input:ro \
  -v $(pwd)/sample_dataset/outputs:/app/output \
  --network none \
  pdf-processor
```

Output will appear in: `sample_dataset/outputs/`
- Place your PDFs in the `input/` folder
- Make sure the `output/` folder exists and is writable
- Then run:
```bash
docker run --rm \
  -v $(pwd)/input:/app/input:ro \
  -v $(pwd)/output:/app/output \
  --network none \
  pdf-processor
```

For every `filename.pdf`, you'll get `filename.json` in `output/`.