Parses a PDF file with 8 different libraries across Python, Go and JavaScript to give you an "idea" of how well your CV parses in an ATS

treejamie/maybe-ats-parse

parsepdf

A multi-library PDF text extraction tool to test how Applicant Tracking Systems (ATS) parse your resume/CV.

If you've created a fancy resume in Figma, Canva, or a similar design tool, there's a good chance an ATS can't read it properly. This tool shows you exactly what different parsing engines extract from your PDF.

Why?

Most job applications go through an ATS that extracts text from your resume. If the extraction is garbled, your resume might get rejected before a human ever sees it. This tool runs your PDF through 8 different parsing libraries used by real ATS software, so you can spot problems before applying.

Quick Start (Docker)

# Build the image
docker build -t parsepdf .

# Parse a PDF (outputs appear next to the input file)
docker run --rm -v ~/Desktop:/data parsepdf resume.pdf

This mounts your Desktop folder to /data in the container. The 8 output .txt files appear alongside your PDF.

You can mount any folder:

docker run --rm -v /path/to/folder:/data parsepdf myfile.pdf

Installation (without Docker)

git clone https://github.com/treejamie/maybe-ats-parse.git
cd maybe-ats-parse

# Python dependencies
pip install -r requirements.txt

# JavaScript dependencies
npm install

# Go binary (prebuilt; to rebuild it, run)
go build -o parsepdf-go parsepdf.go

# Optional: poppler for pdftotext (recommended)
brew install poppler  # macOS
# apt install poppler-utils  # Ubuntu/Debian

Usage

./parse resume.pdf

By default, a bare filename is looked up in ~/Desktop, and the output files are written there too:

./parse resume.pdf  # looks for ~/Desktop/resume.pdf

Use absolute paths for other locations:

./parse /path/to/resume.pdf
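
The lookup rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the tool's actual code: the name resolve_input and its default_dir parameter are made up here, and treating every non-absolute path (for example sub/resume.pdf) as relative to ~/Desktop is an assumption.

```python
from pathlib import Path


def resolve_input(arg: str, default_dir: Path = Path.home() / "Desktop") -> Path:
    """Return the PDF path that would presumably be opened for a given argument."""
    path = Path(arg).expanduser()
    # Absolute paths are used as given; anything else is looked up in default_dir.
    # (Assumption: the real script may handle relative sub-paths differently.)
    return path if path.is_absolute() else default_dir / path


print(resolve_input("resume.pdf"))           # somewhere under ~/Desktop
print(resolve_input("/path/to/resume.pdf"))  # unchanged
```

If in doubt about which file gets picked up, passing an absolute path sidesteps the question.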

Output

Running ./parse resume.pdf creates 8 text files:

| Library | Output File | Notes |
|---|---|---|
| pdfminer.six | resume.pdfminer.txt | Python - common in enterprise ATS |
| PyMuPDF | resume.pymupdf.txt | Python - fast and accurate |
| pdfplumber | resume.pdfplumber.txt | Python - good for tables |
| pdf-parse | resume.pdfparse.txt | JavaScript - built on pdf.js |
| pdfjs-dist | resume.pdfjs.txt | JavaScript - Mozilla's pdf.js |
| pdf2json | resume.pdf2json.txt | JavaScript |
| ledongthuc/pdf | resume.ledongthuc.txt | Go |
| poppler | resume.pdftotext.txt | CLI - very common in enterprise ATS |
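
If a parser fails silently, one of the 8 files may simply not appear. A small helper can list which ones are missing. This is a sketch, not part of the tool: the suffixes are copied from the table above, while the names SUFFIXES, expected_outputs and missing_outputs are hypothetical.

```python
from pathlib import Path

# Output suffixes copied from the table above, one per library.
SUFFIXES = {
    "pdfminer.six": "pdfminer",
    "PyMuPDF": "pymupdf",
    "pdfplumber": "pdfplumber",
    "pdf-parse": "pdfparse",
    "pdfjs-dist": "pdfjs",
    "pdf2json": "pdf2json",
    "ledongthuc/pdf": "ledongthuc",
    "poppler": "pdftotext",
}


def expected_outputs(pdf: Path) -> dict[str, Path]:
    """Map each library name to the text file expected next to `pdf`."""
    return {
        lib: pdf.with_name(f"{pdf.stem}.{suffix}.txt")
        for lib, suffix in SUFFIXES.items()
    }


def missing_outputs(pdf: Path) -> list[str]:
    """Return the libraries whose output file does not exist on disk."""
    return [lib for lib, out in expected_outputs(pdf).items() if not out.exists()]
```

For example, missing_outputs(Path.home() / "Desktop" / "resume.pdf") returns an empty list when all 8 files were written.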

What to look for

Open each output file and check:

  • Is the text in the right order? Multi-column layouts often get jumbled
  • Are there garbage characters? Custom fonts may not embed properly
  • Is anything missing? Text in images won't be extracted
  • Are words running together? Spacing issues from design tools

If the pdftotext (poppler) output looks bad, most enterprise ATS software will probably have the same problem.
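
Since the pdftotext output is the one to watch, a quick reading-order check on that file alone is a reasonable first test. The sketch below is an assumption-laden heuristic: out_of_order is a hypothetical helper, the headings are ones you supply yourself, and it only looks at the first occurrence of each heading, so a heading word that also appears in body text can cause a false alarm.

```python
def out_of_order(text: str, expected: list[str]) -> list[str]:
    """Report headings that are missing or appear before an earlier expected one."""
    lowered = text.lower()
    problems = []
    last_pos = -1
    for heading in expected:
        pos = lowered.find(heading.lower())  # first occurrence only
        if pos == -1:
            problems.append(f"{heading}: not found")
        elif pos < last_pos:
            problems.append(f"{heading}: appears earlier than expected")
        else:
            last_pos = pos
    return problems
```

Call it with the text of resume.pdftotext.txt and your section headings in the order they appear on the page, for example ["Experience", "Skills", "Education"]. An empty list means the headings came out in that order.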

Fixing common issues

  1. Text out of order: Flatten your design to a single-column layout, or ensure text boxes are created in reading order
  2. Garbage characters: Use standard fonts (Arial, Helvetica, Times) or ensure fonts are properly embedded
  3. Missing text: Don't put important info in images - use actual text
  4. Export settings: If you really must use Figma, be aware that its PDF export doesn't keep actual text; it converts the text to outlines. So you have to export as SVG, uncheck the outline text box, and then reconstruct the SVG files in another application. It's a proper faff. But who really knows, right? Maybe it's you and maybe it's the CV. There's no way to know.
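
When text has been converted to outlines there is nothing left to extract, so the output files should come out almost empty. A rough way to spot that, assuming the output naming shown earlier (the helper name near_empty_outputs and the 50-word threshold are made up for this sketch):

```python
from pathlib import Path


def near_empty_outputs(pdf: Path, min_words: int = 50) -> dict[str, int]:
    """Return {filename: word_count} for output files with fewer than min_words words."""
    flagged = {}
    # Output files sit next to the PDF and are named <stem>.<library>.txt
    for out in sorted(pdf.parent.glob(f"{pdf.stem}.*.txt")):
        words = len(out.read_text(encoding="utf-8", errors="replace").split())
        if words < min_words:
            flagged[out.name] = words
    return flagged
```

If every file is flagged, the PDF most likely contains outlined or image-only text, and no amount of layout tweaking will help until it is re-exported with real text.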

Libraries used

Python

  • pdfminer.six
  • PyMuPDF
  • pdfplumber

JavaScript

  • pdf-parse
  • pdfjs-dist
  • pdf2json

Go

  • ledongthuc/pdf

CLI

  • poppler - PDF rendering library (pdftotext)

License

MIT License - see LICENSE
