A multi-library PDF text extraction tool to test how Applicant Tracking Systems (ATS) parse your resume/CV.
If you've created a fancy resume in Figma, Canva, or similar design tools, there's a good chance ATS systems can't read it properly. This tool lets you see exactly what different parsing engines extract from your PDF.
Most job applications go through an ATS that extracts text from your resume. If the extraction is garbled, your resume might get rejected before a human ever sees it. This tool runs your PDF through 8 different parsing libraries used by real ATS systems, so you can spot problems before applying.
# Build the image
docker build -t parsepdf .
# Parse a PDF (outputs appear next to the input file)
docker run --rm -v ~/Desktop:/data parsepdf resume.pdf

This mounts your Desktop folder to /data in the container. The 8 output .txt files appear alongside your PDF.
You can mount any folder:
docker run --rm -v /path/to/folder:/data parsepdf myfile.pdf

git clone https://github.com/yourusername/parsepdf.git
cd parsepdf
# Python dependencies
pip install -r requirements.txt
# JavaScript dependencies
npm install
# Go binary (already built, or rebuild with)
go build -o parsepdf-go parsepdf.go
# Optional: poppler for pdftotext (recommended)
brew install poppler # macOS
# apt install poppler-utils # Ubuntu/Debian

./parse resume.pdf

By default, it looks for PDFs in ~/Desktop and outputs there too:

./parse resume.pdf # looks for ~/Desktop/resume.pdf

Use absolute paths for other locations:

./parse /path/to/resume.pdf

Running ./parse resume.pdf creates 8 text files:

| Library | Output File | Notes |
|---|---|---|
| pdfminer.six | resume.pdfminer.txt | Python - common in enterprise ATS |
| PyMuPDF | resume.pymupdf.txt | Python - fast and accurate |
| pdfplumber | resume.pdfplumber.txt | Python - good for tables |
| pdf-parse | resume.pdfparse.txt | JavaScript - built on pdf.js |
| pdfjs-dist | resume.pdfjs.txt | JavaScript - Mozilla's pdf.js |
| pdf2json | resume.pdf2json.txt | JavaScript |
| ledongthuc/pdf | resume.ledongthuc.txt | Go |
| poppler | resume.pdftotext.txt | CLI - very common in enterprise ATS |
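
If you want to confirm that every parser actually produced a file, a small script can derive the expected names from the table. This is only a sketch and not part of the tool: the suffix list is copied from the table above, and `expected_outputs` and `missing_outputs` are hypothetical helper names.

```python
from pathlib import Path

# Output suffixes copied from the table above (one per parsing library).
OUTPUT_SUFFIXES = [
    "pdfminer", "pymupdf", "pdfplumber", "pdfparse",
    "pdfjs", "pdf2json", "ledongthuc", "pdftotext",
]


def expected_outputs(pdf_path):
    """Return the eight .txt paths expected next to the given PDF."""
    pdf = Path(pdf_path)
    return [pdf.with_name(f"{pdf.stem}.{suffix}.txt") for suffix in OUTPUT_SUFFIXES]


def missing_outputs(pdf_path):
    """Return the expected output files that do not exist on disk."""
    return [path for path in expected_outputs(pdf_path) if not path.is_file()]
```

Calling `missing_outputs("/path/to/resume.pdf")` after a run returns an empty list when all eight files are present; any path it does return points at a parser that produced no output.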
Open each output file and check:
- Is the text in the right order? Multi-column layouts often get jumbled
- Are there garbage characters? Custom fonts may not embed properly
- Is anything missing? Text in images won't be extracted
- Are words running together? Spacing issues from design tools
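
Some of these checks can be roughly automated. The sketch below is a heuristic, not something the tool ships: `check_extracted_text` is a hypothetical helper, the 30-character threshold for "words running together" is an arbitrary assumption, and reading order still needs a human eye.

```python
import unicodedata


def check_extracted_text(text, expected_phrases=(), max_word_len=30):
    """Return rough warning signs found in one extracted .txt output."""
    # Replacement characters, private-use glyphs and stray control characters
    # often indicate fonts that did not map cleanly to Unicode.
    garbage = [
        ch for ch in text
        if ch == "\ufffd"
        or (unicodedata.category(ch) in ("Co", "Cc") and ch not in "\n\r\t\f")
    ]

    # Very long tokens suggest lost spaces; URLs and emails are skipped.
    long_words = [
        word for word in text.split()
        if len(word) > max_word_len and "://" not in word and "@" not in word
    ]

    # Phrases you know are in the resume but that never made it into the text.
    flattened = " ".join(text.split()).lower()
    missing = [p for p in expected_phrases if p.lower() not in flattened]

    return {
        "garbage_chars": len(garbage),
        "long_words": long_words,
        "missing_phrases": missing,
    }
```

Pass the contents of one output file plus a few phrases you expect to find (your name, a job title, a key skill). A non-zero `garbage_chars` count or a non-empty `missing_phrases` list is a hint to open that file and look more closely.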
If pdftotext (poppler) output looks bad, most enterprise ATS systems will have the same problem.
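
To see how far the other parsers drift from the pdftotext output, you could compare them word by word. The following is a minimal sketch using Python's standard `difflib`; `similarity_to_baseline` is a hypothetical helper, and treating pdftotext as the baseline is an assumption based on the note above, not a feature of the tool.

```python
import difflib
from pathlib import Path


def similarity_to_baseline(pdf_path, baseline="pdftotext"):
    """Score each output file against the baseline output (1.0 = same word sequence)."""
    pdf = Path(pdf_path)
    base_file = pdf.with_name(f"{pdf.stem}.{baseline}.txt")
    base_words = base_file.read_text(encoding="utf-8", errors="replace").split()

    scores = {}
    for out in sorted(pdf.parent.glob(f"{pdf.stem}.*.txt")):
        if out == base_file:
            continue
        words = out.read_text(encoding="utf-8", errors="replace").split()
        # Compare word sequences so whitespace differences are ignored
        # but reordered or missing text lowers the score.
        matcher = difflib.SequenceMatcher(None, base_words, words, autojunk=False)
        scores[out.name] = matcher.ratio()
    return scores
```

A low score does not say which parser is right, only that the two disagree; open both files to see whether the difference is reordered columns, missing text, or merged words.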
- Text out of order: Flatten your design to a single-column layout, or ensure text boxes are created in reading order
- Garbage characters: Use standard fonts (Arial, Helvetica, Times) or ensure fonts are properly embedded
- Missing text: Don't put important info in images - use actual text
- Export settings: If you really must use Figma, its PDF export outlines text rather than keeping it as actual text. So you have to export as SVG, uncheck the outline text box, and then reconstruct the SVG files in another application. It's a proper faff. But who really knows, right? Maybe it's you and maybe it's the CV. There's no way to know.
Python
- pdfminer.six - Pure Python PDF parser
- PyMuPDF - Python bindings for MuPDF
- pdfplumber - PDF parsing with table extraction
JavaScript
- pdf-parse - PDF parser built on pdf.js
- pdfjs-dist - Mozilla's PDF.js
- pdf2json - PDF to JSON converter
Go
- ledongthuc/pdf - Pure Go PDF reader
CLI
- poppler - PDF rendering library (pdftotext)
MIT License - see LICENSE