📄 PDF Processing Solution — Adobe India Hackathon 2025

🧠 Overview

This solution processes PDF files and extracts a structured outline based on headings. It outputs the result as a JSON file per PDF and adheres strictly to the challenge constraints:

No machine learning
Fully offline
Sub-10s processing for 50-page PDFs
≤ 16GB RAM
Works on Linux (AMD64)

🧰 Approach

This is a logic-based, non-ML solution that:

Extracts raw text + font/position data from PDFs using pdf.js-extract
Converts content into markdown-style blocks, assigning heading levels (H1–H3) based on:
- Font size
- Font weight and style (bold/italic)
- Vertical position
- Regex-based heuristics
- Text block analysis
- Visual/positional logic
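The font-size criterion above can be illustrated with a minimal sketch. It is not the code in process_pdfs.js: the function name `assignHeadingLevels`, the `bodySize` parameter, and the use of each text item's `height` as a stand-in for font size are all assumptions made for illustration.

```javascript
// Hypothetical sketch: rank distinct font sizes and map the three largest to H1, H2, H3.
// Each item mimics a text item as extracted from a PDF page (str, height, page);
// treating `height` as the font size is an assumption.
function assignHeadingLevels(items, bodySize) {
  // Collect distinct sizes strictly larger than body text, largest first.
  const sizes = [...new Set(items.map((it) => it.height))]
    .filter((h) => h > bodySize)
    .sort((a, b) => b - a)
    .slice(0, 3);

  // Keep only items whose size is one of the heading sizes and tag their level.
  return items
    .filter((it) => sizes.includes(it.height))
    .map((it) => ({
      level: `H${sizes.indexOf(it.height) + 1}`,
      text: it.str.trim(),
      page: it.page,
    }));
}

const sampleItems = [
  { str: 'Introduction', height: 24, page: 1 },
  { str: 'Body text here.', height: 11, page: 1 },
  { str: 'Background', height: 18, page: 2 },
  { str: 'Details', height: 14, page: 2 },
];
console.log(assignHeadingLevels(sampleItems, 11));
```

Font size alone misclassifies documents that set headings in bold at body size, which is presumably why the other criteria in the list are combined with it.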

Template heading filtering

Identifies headings from markdown and tags them with page numbers
Applies heuristic filtering to remove:
- Repetitive headers (e.g. "Page 1", "Confidential")
- Common footer patterns
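One way to approximate this kind of filtering is to drop candidates whose text repeats across a large share of pages. The sketch below assumes that approach; `filterRepeated`, `normalize`, and the `maxShare` and `minRepeats` thresholds are hypothetical names and values, not taken from the project.

```javascript
// Hypothetical sketch: remove heading candidates that repeat on many pages.
function normalize(text) {
  // Collapse digits so "Page 1" and "Page 2" count as the same template line.
  return text.toLowerCase().replace(/\d+/g, '#').replace(/\s+/g, ' ').trim();
}

function filterRepeated(headings, totalPages, maxShare = 0.5, minRepeats = 3) {
  // Record the set of distinct pages on which each normalized text appears.
  const pagesByText = new Map();
  for (const h of headings) {
    const key = normalize(h.text);
    if (!pagesByText.has(key)) pagesByText.set(key, new Set());
    pagesByText.get(key).add(h.page);
  }
  // Drop a heading only if its text appears on at least minRepeats pages
  // and on more than maxShare of all pages.
  return headings.filter((h) => {
    const n = pagesByText.get(normalize(h.text)).size;
    return !(n >= minRepeats && n / totalPages > maxShare);
  });
}

const candidates = [
  { text: 'Page 1', page: 1 },
  { text: 'Confidential', page: 1 },
  { text: 'Introduction', page: 1 },
  { text: 'Page 2', page: 2 },
  { text: 'Confidential', page: 2 },
  { text: 'Methods', page: 2 },
  { text: 'Page 3', page: 3 },
  { text: 'Confidential', page: 3 },
];
console.log(filterRepeated(candidates, 4).map((h) => h.text));
```

The `minRepeats` floor keeps short documents from losing every heading. Collapsing digits also merges lines such as "Chapter 1" and "Chapter 2", so a real filter would likely need extra rules for numbered headings.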

Outputs a JSON structure matching the provided schema:

{
  "title": "Document Title",
  "outline": [
    {
      "level": "H1",
      "text": "Heading Text",
      "page": 1
    },
    {
      "level": "H2",
      "text": "Heading Text",
      "page": 2
    },
    {
      "level": "H3",
      "text": "Heading Text",
      "page": 3
    }
  ]
}
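
As a rough sanity check on the shape shown above, an output object could be verified before it is written. This is only a sketch based on the example JSON, not on sample_dataset/schema/output_schema.json; `isValidOutline` is a hypothetical helper.

```javascript
// Hypothetical check mirroring the example JSON shape; it does not read the official schema file.
function isValidOutline(doc) {
  if (typeof doc !== 'object' || doc === null) return false;
  if (typeof doc.title !== 'string' || !Array.isArray(doc.outline)) return false;
  // Every outline entry needs a level of H1, H2, or H3, a text string, and an integer page.
  return doc.outline.every(
    (h) =>
      typeof h === 'object' &&
      h !== null &&
      ['H1', 'H2', 'H3'].includes(h.level) &&
      typeof h.text === 'string' &&
      Number.isInteger(h.page)
  );
}

const exampleDoc = {
  title: 'Document Title',
  outline: [{ level: 'H1', text: 'Heading Text', page: 1 }],
};
console.log(isValidOutline(exampleDoc));
```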

No external APIs or internet access are used at any point.

📦 Libraries & Tools Used

All tools are open-source and installed inside the Docker container:

| Library / Tool | Purpose |
|---|---|
| `pdf.js-extract` | Extracts structured data from PDFs |
| `marked` | Markdown parser for heading extraction |
| `Node.js (v18)` | Main runtime for logic |

📁 Project Structure

Adobe_Challange_1_a/
├── input/                      # 📥 Place your input PDFs here
├── output/                     # 📤 JSON outputs appear here
├── sample_dataset/             # 🧪 Sample input/output/testing
│   ├── pdfs/
│   ├── outputs/
│   └── schema/output_schema.json
├── Dockerfile                  # 🐳 Docker config (in root)
├── process_pdfs.js             # 🚀 Main logic
├── package.json                # 📦 Node.js dependencies
└── README.md                   # 📚 You're reading it

🛠️ How to Build & Run

🔒 The solution is built to run in an isolated, offline, CPU-only Docker container.

▶️ 1. Clone & Build Docker Image

git clone https://github.com/harshcodesdev/Adobe_Challange_1_a.git
cd Adobe_Challange_1_a

docker build --platform linux/amd64 -t pdf-processor .

🧪 2. Run with Sample Data (Test Run)

docker run --rm \
  -v "$(pwd)/sample_dataset/pdfs:/app/input:ro" \
  -v "$(pwd)/sample_dataset/outputs:/app/output" \
  --network none \
  pdf-processor

➡ Output will appear in: sample_dataset/outputs/

📁 3. Run with Your Own Data

Place your PDFs in the input/ folder
Make sure the output/ folder exists and is writable
Then run:

docker run --rm \
  -v "$(pwd)/input:/app/input:ro" \
  -v "$(pwd)/output:/app/output" \
  --network none \
  pdf-processor

➡ For every filename.pdf, you’ll get filename.json in output/
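
The filename mapping can be sketched with Node's built-in `path` module. The helper name `outputPathFor` is hypothetical, and the `/app/input` and `/app/output` paths are the container mount points from the commands above. `path.posix` is used because the container runs Linux.

```javascript
const path = require('path');

// Hypothetical helper: map an input PDF path to the JSON path written for it.
function outputPathFor(pdfPath, outputDir) {
  // Strip the directory and the extension (whatever its case), then append .json.
  const base = path.posix.basename(pdfPath, path.posix.extname(pdfPath));
  return path.posix.join(outputDir, `${base}.json`);
}

console.log(outputPathFor('/app/input/report.pdf', '/app/output'));
```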
