📄 PDF Processing Solution - Adobe India Hackathon 2025

🧠 Overview

This solution processes PDF files and extracts a structured outline based on headings. It outputs the result as a JSON file per PDF and adheres strictly to the challenge constraints:

  • No machine learning
  • Fully offline
  • Sub-10s processing for 50-page PDFs
  • ≤ 16 GB RAM
  • Works on Linux (AMD64)

🧰 Approach

This is a logic-based, non-ML solution that:

  1. Extracts raw text + font/position data from PDFs using pdf.js-extract

  2. Converts content into markdown-style blocks, assigning heading levels (H1–H3) based on:

    • Font size
    • Font weight (bold/italic)
    • Vertical position
    • Regex-based heuristics
    • Text block analysis
    • Visual/positional logic
    • Template heading filtering

  3. Identifies headings from the markdown and tags them with page numbers

  4. Applies heuristic filtering to remove:

    • Repetitive headers (e.g. "Page 1", "Confidential")
    • Common footer patterns

  5. Outputs a JSON structure matching the provided schema:

    {
      "title": "Document Title",
      "outline": [
        {
          "level": "H1",
          "text": "Heading Text",
          "page": 1
        },
        {
          "level": "H2",
          "text": "Heading Text",
          "page": 2
        },
        {
          "level": "H3",
          "text": "Heading Text",
          "page": 3
        }
      ]
    }
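
The font-size part of step 2 and the JSON shape above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in process_pdfs.js: the item shape `{ str, height, page }` is only loosely modeled on pdf.js-extract content items, and `mostCommonHeight` and `buildOutline` are hypothetical names. It treats the most frequent font height as body text and maps the three largest sizes above it to H1, H2, and H3.

```javascript
// Hypothetical sketch: map font sizes to heading levels and build the outline.
// Input items are ASSUMED to look like { str, height, page }.

function mostCommonHeight(items) {
  // Count how often each font height occurs.
  const counts = new Map();
  for (const { height } of items) {
    counts.set(height, (counts.get(height) || 0) + 1);
  }
  // Pick the height with the highest count (first one wins on ties).
  let best = null;
  let bestCount = -1;
  for (const [height, count] of counts) {
    if (count > bestCount) {
      best = height;
      bestCount = count;
    }
  }
  return best;
}

function buildOutline(title, items) {
  const bodyHeight = mostCommonHeight(items);
  // Distinct sizes larger than body text, largest first; keep at most three.
  const headingSizes = [...new Set(items.map((item) => item.height))]
    .filter((height) => height > bodyHeight)
    .sort((a, b) => b - a)
    .slice(0, 3);

  const outline = [];
  for (const item of items) {
    const rank = headingSizes.indexOf(item.height);
    if (rank === -1) continue; // body text or smaller: not a heading
    outline.push({ level: `H${rank + 1}`, text: item.str.trim(), page: item.page });
  }
  return { title, outline };
}

const sample = [
  { str: "Introduction", height: 24, page: 1 },
  { str: "Some body text.", height: 10, page: 1 },
  { str: "Background", height: 18, page: 2 },
  { str: "More body text.", height: 10, page: 2 },
  { str: "More body text again.", height: 10, page: 3 },
  { str: "Details", height: 14, page: 3 },
];

const result = buildOutline("Document Title", sample);
console.log(JSON.stringify(result, null, 2));
```

Font size alone is a deliberately narrow signal here; the other cues in step 2 (weight, position, regex heuristics) would refine or override this ranking.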

No external APIs or internet access are used at any point.
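
The repetitive-header filtering mentioned in the Approach steps could look something like the sketch below. It is an assumption about one reasonable way to do it, not a description of the actual implementation: the candidate shape `{ text, page }`, the function names, and the 50% page-share threshold are all hypothetical.

```javascript
// Hypothetical sketch: drop heading candidates that repeat across pages.
// Candidates are ASSUMED to look like { text, page }.

function normalize(text) {
  // Collapse digits and whitespace so "Page 1" and "Page 2" compare equal.
  return text.toLowerCase().replace(/\d+/g, "#").replace(/\s+/g, " ").trim();
}

function removeRepeatedLines(candidates, totalPages, maxShare = 0.5) {
  // Collect the set of pages on which each normalized text appears.
  const pagesByText = new Map();
  for (const { text, page } of candidates) {
    const key = normalize(text);
    if (!pagesByText.has(key)) pagesByText.set(key, new Set());
    pagesByText.get(key).add(page);
  }
  // Keep a candidate only if its text appears on at most maxShare of all pages.
  return candidates.filter(({ text }) => {
    const pageCount = pagesByText.get(normalize(text)).size;
    return pageCount / totalPages <= maxShare;
  });
}

const candidates = [
  { text: "Confidential", page: 1 },
  { text: "Introduction", page: 1 },
  { text: "Page 1", page: 1 },
  { text: "Confidential", page: 2 },
  { text: "Methods", page: 2 },
  { text: "Page 2", page: 2 },
  { text: "Confidential", page: 3 },
  { text: "Results", page: 3 },
  { text: "Page 3", page: 3 },
];

const kept = removeRepeatedLines(candidates, 3).map((c) => c.text);
console.log(kept);
```

One trade-off of normalizing digits this way is that genuinely numbered headings ("Chapter 1", "Chapter 2", ...) would also be treated as repeats, so a real filter would likely need an exception list or a stricter pattern.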


📦 Libraries & Tools Used

All tools are open-source and installed inside the Docker container:

| Library | Purpose |
|---|---|
| pdf.js-extract | Extracts structured data from PDFs |
| marked | Markdown parser for heading extraction |
| Node.js (v18) | Main runtime for logic |
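
To illustrate the heading-extraction step that marked is used for, here is a dependency-free sketch. It swaps in a plain regular expression for the markdown parser so that it runs on its own; it is not how marked works internally. The per-page array input, the 1-based page numbering, and the name `headingsFromPages` are assumptions made for the example.

```javascript
// Hypothetical sketch: pull H1-H3 headings out of per-page markdown.
// A regex stands in for a real markdown parser such as marked.

// One to three leading "#" characters, whitespace, then the heading text.
const ATX_HEADING = /^(#{1,3})[ \t]+(.+?)[ \t]*$/;

function headingsFromPages(pagesMarkdown) {
  const headings = [];
  pagesMarkdown.forEach((markdown, index) => {
    for (const line of markdown.split(/\r?\n/)) {
      const match = ATX_HEADING.exec(line);
      if (!match) continue; // not a heading, or deeper than H3
      headings.push({
        level: `H${match[1].length}`,
        text: match[2],
        page: index + 1, // ASSUMPTION: pages are numbered from 1
      });
    }
  });
  return headings;
}

const pages = [
  "# Overview\nSome text.\n## Scope",
  "Plain paragraph.\n### Details\n#### Too deep",
];

const headings = headingsFromPages(pages);
console.log(headings);
```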

πŸ“ Project Structure

Adobe_Challange_1_a/
├── input/                      # 📥 Place your input PDFs here
├── output/                     # 📤 JSON outputs appear here
├── sample_dataset/             # 🧪 Sample input/output/testing
│   ├── pdfs/
│   ├── outputs/
│   └── schema/output_schema.json
├── Dockerfile                  # 🐳 Docker config (in root)
├── process_pdfs.js             # 🚀 Main logic
├── package.json                # 📦 Node.js dependencies
└── README.md                   # πŸ“š You're reading it

πŸ› οΈ How to Build & Run

🔒 The solution must run in an isolated, offline, CPU-only Docker container.


▢️ 1. Clone & Build Docker Image

git clone https://github.com/harshcodesdev/Adobe_Challange_1_a.git
cd Adobe_Challange_1_a

docker build --platform linux/amd64 -t pdf-processor .

🧪 2. Run with Sample Data (Test Run)

docker run --rm \
  -v "$(pwd)/sample_dataset/pdfs:/app/input:ro" \
  -v "$(pwd)/sample_dataset/outputs:/app/output" \
  --network none \
  pdf-processor

➑ Output will appear in: sample_dataset/outputs/


πŸ“ 3. Run with Your Own Data

  1. Place your PDFs in the input/ folder
  2. Make sure output/ folder exists and is writable
  3. Then run:

docker run --rm \
  -v "$(pwd)/input:/app/input:ro" \
  -v "$(pwd)/output:/app/output" \
  --network none \
  pdf-processor

➑ For every filename.pdf, you'll get filename.json in output/
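
The filename.pdf to filename.json mapping can be pictured with a small sketch like the one below. The helper name `planOutputs` is hypothetical, and treating the `.pdf` extension case-insensitively is an assumption rather than confirmed behavior of process_pdfs.js.

```javascript
// Hypothetical sketch: derive output file names from input file names.

function planOutputs(fileNames) {
  return fileNames
    // ASSUMPTION: only files ending in .pdf (any letter case) are processed.
    .filter((name) => /\.pdf$/i.test(name))
    // Swap the .pdf extension for .json, keeping the base name unchanged.
    .map((name) => ({ input: name, output: name.replace(/\.pdf$/i, ".json") }));
}

const plan = planOutputs(["report.pdf", "notes.txt", "Scan.PDF"]);
console.log(plan);
```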