A fast, efficient, and minimal PDF parser for Node.js. Zero bloat. One dependency. Production-ready.

afpp

afpp — A modern, dependency-light PDF parser for Node.js.

Built for performance, reliability, and developer sanity.


Overview

afpp (Another PDF Parser, Properly) is a Node.js library for extracting text and images from PDF files without heavyweight native dependencies, event-loop blocking, or fragile runtime assumptions.

The project was created to address recurring problems encountered with existing PDF tooling in the Node.js ecosystem:

  • Excessive bundle sizes and transitive dependencies
  • Native build steps (canvas, ImageMagick, Ghostscript)
  • Browser-specific assumptions (window, DOM, canvas)
  • Poor TypeScript support
  • Unreliable handling of encrypted PDFs
  • Performance and memory inefficiencies

afpp focuses on predictable behavior, explicit APIs, and production-ready defaults.


Key Features

  • Zero native build dependencies
  • Fully asynchronous, non-blocking architecture
  • First-class TypeScript support
  • Supports local files, buffers, and remote URLs
  • Handles encrypted PDFs
  • Configurable concurrency and rendering scale
  • Minimal and auditable dependency graph

Requirements

  • Node.js >= 22.14.0

Installation

Install using your preferred package manager:

npm install afpp
# or
yarn add afpp
# or
pnpm add afpp

Quick Start

All parsing functions accept the same input types:

  • string (file path)
  • Buffer
  • URL
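
As a rough illustration of the three input kinds listed above, the sketch below classifies a value before it is handed to a parsing function. `describePdfInput` is a hypothetical helper written for this example; it is not part of afpp, and afpp's own validation may behave differently.

```javascript
// Hypothetical helper, not part of afpp: reports which of the three
// documented input kinds a value is, or throws for anything else.
function describePdfInput(input) {
  if (typeof input === 'string') return 'path';
  if (Buffer.isBuffer(input)) return 'buffer';
  if (input instanceof URL) return 'url';
  throw new TypeError('Expected a file path string, Buffer, or URL');
}

console.log(describePdfInput('./example.pdf')); // path
console.log(describePdfInput(Buffer.from('%PDF-1.7'))); // buffer
console.log(describePdfInput(new URL('https://pdfobject.com/pdf/sample.pdf'))); // url
```

A check like this can produce a clearer error at the call site when, for example, a URL is passed as a plain string and would otherwise be treated as a file path.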

Extract Text from a PDF

import { readFile } from 'fs/promises';
import path from 'path';

import { pdf2string } from 'afpp';

(async () => {
  const filePath = path.join('..', 'test', 'example.pdf');
  const buffer = await readFile(filePath);

  const pages = await pdf2string(buffer);
  console.log(pages); // ['Page 1 text', 'Page 2 text', ...]
})();

Render PDF Pages as Images

import { pdf2image } from 'afpp';

(async () => {
  const url = new URL('https://pdfobject.com/pdf/sample.pdf');
  const images = await pdf2image(url);

  console.log(images); // [Buffer, Buffer, ...]
})();
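
Since pdf2image resolves to an array of Buffers, a common next step is writing each one to disk. The sketch below only builds the file names; `pageFileName` is a hypothetical helper, and it assumes the array is ordered by page, which the example output suggests but the README does not state.

```javascript
// Hypothetical helper: builds a zero-padded file name for the page at
// `index` (0-based) in an array of `total` rendered pages.
function pageFileName(index, total, extension = 'png') {
  const width = String(total).length;
  const pageNumber = String(index + 1).padStart(width, '0');
  return `page-${pageNumber}.${extension}`;
}

console.log(pageFileName(0, 12)); // page-01.png
console.log(pageFileName(11, 12, 'jpeg')); // page-12.jpeg
```

Each Buffer could then be written with writeFile from fs/promises, for example `await writeFile(pageFileName(i, images.length), images[i])` inside a loop over the result.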

Streaming API (Large PDFs)

For large PDFs, use streaming functions to process pages incrementally without loading all results into memory:

import { writeFile } from 'fs/promises';

import { streamPdf2image, streamPdf2string } from 'afpp';

// Stream images - process each page as it's rendered
for await (const { pageNumber, pageCount, data } of streamPdf2image(
  './large.pdf',
)) {
  await writeFile(`page-${pageNumber}.png`, data);
  console.log(`Processed ${pageNumber}/${pageCount}`);
}

// Stream text - process each page as it's extracted
for await (const { pageNumber, data } of streamPdf2string('./large.pdf')) {
  console.log(`Page ${pageNumber}: ${data.substring(0, 100)}...`);
}

Benefits:

  • Lower peak memory usage
  • Faster time-to-first-result
  • Built-in progress tracking via pageNumber and pageCount
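
The progress tracking mentioned above can be turned into a readable label with a few lines. This is a minimal sketch: `formatProgress` is a hypothetical helper, and it assumes pageNumber starts at 1, as the log line in the streaming example suggests.

```javascript
// Hypothetical helper: turns the pageNumber/pageCount pair yielded by a
// streaming call into a short progress label.
function formatProgress({ pageNumber, pageCount }) {
  const percent = Math.round((pageNumber / pageCount) * 100);
  return `${pageNumber}/${pageCount} (${percent}%)`;
}

console.log(formatProgress({ pageNumber: 3, pageCount: 12 })); // 3/12 (25%)
```

Inside a `for await` loop over streamPdf2image, the yielded object can be passed to the helper directly, since it only reads the two progress fields.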

Extract PDF Metadata

import { getPdfMetadata } from 'afpp';

const metadata = await getPdfMetadata('./document.pdf');
console.log(metadata.pageCount); // e.g. 9
console.log(metadata.isEncrypted); // false
console.log(metadata.title); // 'My Document' or undefined
console.log(metadata.creationDate); // Date object or undefined

// Encrypted PDF
const meta = await getPdfMetadata('./secure.pdf', { password: 'secret' });
console.log(meta.isEncrypted); // true

Low-Level Parsing API

For advanced use cases, parsePdf accepts a callback that receives each page's content, giving page-level control over how results are transformed.

import { parsePdf } from 'afpp';

(async () => {
  const response = await fetch('https://pdfobject.com/pdf/sample.pdf');
  const buffer = Buffer.from(await response.arrayBuffer());

  const result = await parsePdf(buffer, {}, (pageContent) => pageContent);
  console.log(result);
})();

Configuration

All public APIs accept a shared options object.

const result = await parsePdf(buffer, {
  concurrency: 5,
  imageEncoding: 'jpeg',
  password: 'STRONG_PASS',
  scale: 4,
});
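
When options come from user input or a config file, it can help to check them before parsing. The sketch below is a hypothetical pre-flight check, not part of afpp. The accepted encodings are taken from the AfppParseOptions table; requiring a positive integer for concurrency and a positive number for scale are assumptions of this example, since the README only gives the types.

```javascript
// Hypothetical pre-flight check, not part of afpp. The accepted values
// mirror the AfppParseOptions table; the error messages are made up.
const IMAGE_ENCODINGS = ['png', 'jpeg', 'webp', 'avif'];

function checkParseOptions(options = {}) {
  const problems = [];
  const { concurrency, imageEncoding, password, scale } = options;

  // Assumption: a numeric concurrency must be an integer >= 1.
  if (
    concurrency !== undefined &&
    concurrency !== 'auto' &&
    !(Number.isInteger(concurrency) && concurrency >= 1)
  ) {
    problems.push("concurrency must be a positive integer or 'auto'");
  }
  if (imageEncoding !== undefined && !IMAGE_ENCODINGS.includes(imageEncoding)) {
    problems.push(`imageEncoding must be one of ${IMAGE_ENCODINGS.join(', ')}`);
  }
  if (password !== undefined && typeof password !== 'string') {
    problems.push('password must be a string');
  }
  // Assumption: scale must be a number greater than zero (NaN is rejected).
  if (scale !== undefined && !(typeof scale === 'number' && scale > 0)) {
    problems.push('scale must be a positive number');
  }
  return problems;
}

console.log(checkParseOptions({ concurrency: 5, imageEncoding: 'jpeg', scale: 4 })); // []
console.log(checkParseOptions({ imageEncoding: 'gif' }).length); // 1
```

Returning a list of problems rather than throwing on the first one lets the caller report every invalid option at once.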

AfppParseOptions

Option          Type                                Default   Description
concurrency     number | 'auto'                     1         Number of pages processed in parallel. Use 'auto' for CPU-based scaling.
imageEncoding   'png' | 'jpeg' | 'webp' | 'avif'    'png'     Output format for rendered images
password        string                              -         Password for encrypted PDFs
scale           number                              1.0       Rendering scale (1.0 = 72 DPI, 2.0 = 144 DPI)
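
The scale values in the table follow from the PDF convention that one point is 1/72 inch. The sketch below works through that arithmetic. Both functions are hypothetical helpers for illustration, and the rounding in `estimatePixelSize` is an assumption; the renderer may round differently.

```javascript
// 1 PDF point = 1/72 inch, so scale 1.0 corresponds to 72 DPI.
const POINTS_PER_INCH = 72;

function scaleToDpi(scale) {
  return scale * POINTS_PER_INCH;
}

// Hypothetical helper: estimated pixel size of a page given in points.
// Rounding is an assumption; the renderer may round differently.
function estimatePixelSize(widthPt, heightPt, scale) {
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}

console.log(scaleToDpi(2)); // 144
// US Letter is 612 x 792 points (8.5 x 11 inches).
console.log(estimatePixelSize(612, 792, 2)); // { width: 1224, height: 1584 }
```

Pixel count grows with the square of the scale, so going from scale 1 to scale 4 means roughly 16 times as many pixels per page to render and encode.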

PdfMetadata

Returned by getPdfMetadata. All fields except pageCount and isEncrypted are optional — absent metadata fields are undefined, never empty strings.

Field              Type      Description
pageCount          number    Total number of pages
isEncrypted        boolean   Whether the document required a password to open
title              string?   Document title
author             string?   Document author
subject            string?   Document subject
creator            string?   Application that created the document
producer           string?   PDF producer application
creationDate       Date?     Document creation date
modificationDate   Date?     Document last modification date
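
Because absent fields are undefined rather than empty strings, the nullish coalescing operator is enough to supply fallbacks. The sketch below formats a PdfMetadata-shaped object for logging. `summarizeMetadata` and its fallback strings are hypothetical and chosen only for this example.

```javascript
// Hypothetical helper: formats a PdfMetadata-shaped object for logging.
// Optional fields are undefined when absent, so `??` supplies fallbacks.
function summarizeMetadata(meta) {
  const created = meta.creationDate
    ? meta.creationDate.toISOString().slice(0, 10)
    : 'unknown date';
  return [
    `${meta.title ?? '(untitled)'} by ${meta.author ?? '(unknown author)'}`,
    `${meta.pageCount} page(s), created ${created}`,
    meta.isEncrypted ? 'encrypted' : 'not encrypted',
  ].join('; ');
}

console.log(summarizeMetadata({ pageCount: 9, isEncrypted: false }));
// (untitled) by (unknown author); 9 page(s), created unknown date; not encrypted
```

The object returned by getPdfMetadata could be passed to this helper as-is, since it reads only fields listed in the table.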

Design Principles

  • Node-first: No browser globals or DOM assumptions
  • Explicit over implicit: No magic configuration
  • Fail fast: Clear errors instead of silent corruption
  • Production-oriented: Optimized for long-running processes

License

MIT © Richard Solár
