PDF-Decomposer

A powerful TypeScript library for comprehensive PDF processing and content extraction. Optimized for production use with universal browser and Node.js support.

Core Features

PDF Decomposer Class

Load Once, Use Many Times - Initialize PDF once, perform multiple operations
Progress Tracking - Observable pattern with real-time progress callbacks
Error Handling - Comprehensive error reporting with page-level context
Memory Efficient - Built-in memory management and cleanup
Universal Support - Works in Node.js 16+ and all modern browsers

Main Operations

1. Content Decomposition (`decompose()`)

Extract structured text with positioning and formatting:

Smart element composition with elementComposer
Content area cleaning with cleanComposer
Page-level composition with pageComposer
Image extraction from embedded PDF objects
Link extraction from PDF annotations and text patterns
Smart URL detection with comprehensive email and domain pattern matching

2. Screenshot Generation (`screenshot()`)

High-quality page rendering to PNG/JPEG
Configurable resolution and quality
Batch processing with progress tracking
File output or base64 data URLs

3. PDF Data Generation (`data()`)

pwa-admin compatible data structure
Interactive area mapping with normalized coordinates
Widget ID generation following epub conventions
Article relationship management
skipScreenshots option for memory-constrained environments

4. PDF Slicing (`slice()`)

Extract specific page ranges
Generate new PDF documents
Replace the instance's internal document with the sliced result (numPages reflects the slice afterwards)
Preserve all metadata and formatting

Advanced Content Processing

Element Composer

Groups scattered text elements into coherent paragraphs
Font-size based header element recognition (h1, h2, h3, etc.)
Smart span merging for headers with same font-size/family but different colors
Content consolidation for multiple heading tags
Preserves reading order and text flow
Smart font and spacing analysis
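
As a rough illustration of font-size based header recognition, the sketch below treats the most common font size as body text and ranks larger sizes as h1, h2, h3. The `TextRun` type, the `classifyHeadings` name, and the heuristic itself are assumptions made for this example; they are not the library's actual implementation.

```typescript
// Hypothetical input shape: one run of text with its rendered font size.
interface TextRun {
  text: string
  fontSize: number
}

type Tag = 'h1' | 'h2' | 'h3' | 'p'

// Assumption: the font size covering the most characters is body text,
// and larger sizes are headings, ranked from largest (h1) downward.
function classifyHeadings(runs: TextRun[]): Array<TextRun & { tag: Tag }> {
  // Count characters per font size.
  const weight = new Map<number, number>()
  for (const run of runs) {
    weight.set(run.fontSize, (weight.get(run.fontSize) ?? 0) + run.text.length)
  }

  // Pick the size with the highest character count as the body size.
  let bodySize = 0
  let bodyWeight = -1
  weight.forEach((chars, size) => {
    if (chars > bodyWeight) {
      bodySize = size
      bodyWeight = chars
    }
  })

  // Distinct sizes above the body size, largest first.
  const headingSizes = Array.from(weight.keys())
    .filter((size) => size > bodySize)
    .sort((a, b) => b - a)

  const tags: Tag[] = ['h1', 'h2', 'h3']
  return runs.map((run) => {
    const rank = headingSizes.indexOf(run.fontSize)
    // Sizes ranked beyond the third heading level fall back to h3.
    const tag: Tag = rank >= 0 ? tags[Math.min(rank, tags.length - 1)] : 'p'
    return { ...run, tag }
  })
}
```

A real composer would also need to weigh font family, weight, and spacing, which this sketch ignores.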

Page Composer

Merges continuous content across pages
Detects article boundaries and section breaks
Interview and feature content recognition
Typography consistency analysis

Clean Composer

Filters out headers, footers, and page numbers
Content area detection with configurable margins
Image size validation and filtering
Control character removal
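
To make the margin-based content area concrete, here is a minimal sketch of filtering elements by position. It borrows the option names from `PdfCleanComposerOptions` and treats them as fractions of the page size; the `PageElement` shape (in particular `width` and `height` on the bounding box), the top-left origin, and the `filterToContentArea` helper are assumptions for illustration only.

```typescript
// Hypothetical element shape; `left`/`top` mirror the boundingBox fields
// used in the link example, while `width`/`height` are assumed.
interface Box { left: number; top: number; width: number; height: number }
interface PageElement { type: string; data: string; boundingBox: Box }

interface MarginOptions {
  topMarginPercent: number // fraction of page height, e.g. 0.1
  bottomMarginPercent: number // fraction of page height
  sideMarginPercent: number // fraction of page width
}

// Keep only elements whose box lies entirely inside the content area.
// Assumes a top-left origin with y increasing downward.
function filterToContentArea(
  elements: PageElement[],
  pageWidth: number,
  pageHeight: number,
  margins: MarginOptions
): PageElement[] {
  const minX = pageWidth * margins.sideMarginPercent
  const maxX = pageWidth * (1 - margins.sideMarginPercent)
  const minY = pageHeight * margins.topMarginPercent
  const maxY = pageHeight * (1 - margins.bottomMarginPercent)

  return elements.filter(({ boundingBox: b }) =>
    b.left >= minX &&
    b.left + b.width <= maxX &&
    b.top >= minY &&
    b.top + b.height <= maxY
  )
}
```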

Image Extraction

Universal browser-compatible processing
Multiple format support (RGB, RGBA, Grayscale)
Auto-scaling for memory safety
Duplicate detection and removal

Link Extraction (`extractLinks: true`)

PDF Annotations: Extract interactive link annotations with URLs and destinations
Text Pattern Matching: Detect URLs in text content (e.g., "GIA.edu/jewelryservices")
Email Detection: Find email addresses in document text with automatic mailto: prefix
Smart URL Recognition: Enhanced regex patterns for domain+path detection
Link Types: Support for external URLs, internal PDF destinations, and email links
No Duplicates: Intelligent handling prevents text/link element duplication
Position Data: Accurate bounding box coordinates for each link
Link Attributes: Rich metadata including link type, context text, and extraction method
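
The text-pattern side of link extraction can be pictured with a small regex sketch. The patterns, the `detectLinks` helper, and the choice to prefix bare domains with `https://` are all assumptions for illustration; the library's own patterns are not shown in this README and are likely more thorough.

```typescript
type LinkType = 'url' | 'email'

interface DetectedLink {
  type: LinkType
  text: string // the text as it appears in the document
  href: string // normalized target
}

// Illustrative patterns only; production URL matching needs more care.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g
const URL_OR_DOMAIN = /(?:https?:\/\/)?(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}(?:\/[^\s"'<>)]*)?/g

function detectLinks(text: string): DetectedLink[] {
  const links: DetectedLink[] = []

  // Find emails first and blank them out, so the domain part of an
  // address is not reported a second time as a URL.
  const withoutEmails = text.replace(EMAIL, (match) => {
    links.push({ type: 'email', text: match, href: `mailto:${match}` })
    return ' '.repeat(match.length)
  })

  for (const match of withoutEmails.match(URL_OR_DOMAIN) ?? []) {
    // Drop sentence punctuation that trails the URL.
    const trimmed = match.replace(/[.,;:!?]+$/, '')
    // Assumption: bare domains are given an https:// scheme.
    const href = /^https?:\/\//i.test(trimmed) ? trimmed : `https://${trimmed}`
    links.push({ type: 'url', text: trimmed, href })
  }
  return links
}
```

Blanking out matched emails before the URL pass is one simple way to get the "no duplicates" behaviour described above.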

Performance and Memory

Memory Manager - Adaptive cleanup and monitoring
Progress Callbacks - Real-time operation tracking
Background Processing - Non-blocking operations
Batch Processing - Efficient multi-page handling

Installation

npm install @febbyrg/pdf-decomposer

# For Node.js with canvas support (optional)
npm install canvas

# For browser usage
npm install pdfjs-dist

Quick Start

Class-Based API (Recommended)

import { PdfDecomposer } from '@febbyrg/pdf-decomposer'

// Load PDF once, use many times
const pdf = new PdfDecomposer(buffer) // Buffer, ArrayBuffer, or Uint8Array
await pdf.initialize()

// Multiple operations on same PDF
const pages = await pdf.decompose({
  elementComposer: true, // Group text into paragraphs
  pageComposer: true, // Merge continuous content across pages
  cleanComposer: true, // Clean headers/footers
  extractImages: true, // Extract embedded images
  extractLinks: true // Extract links and annotations from PDF
})

// Enhanced MinifyOptions with Element Attributes
const styledPages = await pdf.decompose({
  elementComposer: true,
  minify: true,
  minifyOptions: {
    format: 'html', // data field contains formatted HTML
    elementAttributes: true // Include styling information
  }
})

const screenshots = await pdf.screenshot({
  imageWidth: 1024,
  imageQuality: 90
})

const pdfData = await pdf.data({
  // pwa-admin compatible format
  imageWidth: 1024,
  elementComposer: true
})

const sliced = await pdf.slice({
  // Extract first 5 pages
  numberPages: 5
})

// Access PDF properties
console.log(`Pages: ${pdf.numPages}`)
console.log(`Fingerprint: ${pdf.fingerprint}`)

Factory Method (One-liner)

import { PdfDecomposer } from '@febbyrg/pdf-decomposer'

// Create and initialize in one step
const pdf = await PdfDecomposer.create(buffer)
const pages = await pdf.decompose({ elementComposer: true })

Progress Tracking

const pdf = new PdfDecomposer(buffer)

// Subscribe to progress updates
pdf.subscribe((state) => {
  console.log(`${state.progress}% - ${state.message}`)
})

await pdf.initialize()
const result = await pdf.decompose({
  startPage: 1,
  endPage: 10,
  elementComposer: true
})
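
For readers unfamiliar with the observable pattern behind `subscribe()`, the following is a minimal, library-independent sketch. The `ProgressEmitter` class, its `notify` method, and the returned unsubscribe function are hypothetical; the `progress` and `message` fields mirror what the subscriber above reads.

```typescript
// Hypothetical state shape, modelled on the fields read by the subscriber.
interface ProgressState {
  progress: number // percentage, 0-100
  message: string
}

type Listener = (state: ProgressState) => void

class ProgressEmitter {
  private listeners: Listener[] = []

  // Returns an unsubscribe function (an assumption for this sketch;
  // the library's subscribe() is documented as returning void).
  subscribe(listener: Listener): () => void {
    this.listeners.push(listener)
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener)
    }
  }

  // Report that `done` of `total` units of work are finished.
  notify(done: number, total: number, message: string): void {
    const progress = total === 0 ? 100 : Math.round((done / total) * 100)
    for (const listener of this.listeners) {
      listener({ progress, message })
    }
  }
}
```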

Browser Environment (Angular, React, Vue)

import { PdfDecomposer } from '@febbyrg/pdf-decomposer'

// In browser - use File API
async function processPdfFile(file: File) {
  const arrayBuffer = await file.arrayBuffer()
  const pdf = new PdfDecomposer(arrayBuffer)
  await pdf.initialize()

  return await pdf.decompose({
    elementComposer: true,
    extractImages: true
  })
}

// Configure PDF.js worker (once per app)
import { PdfWorkerConfig } from '@febbyrg/pdf-decomposer'
PdfWorkerConfig.configure() // Auto-configures worker URL

Advanced Usage Examples

Content Processing Pipeline

const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Step 1: Extract raw content with advanced processing
const pages = await pdf.decompose({
  startPage: 1,
  endPage: 10,
  elementComposer: true,
  pageComposer: true,
  cleanComposer: true,
  extractImages: true,
  minify: true,
  cleanComposerOptions: {
    topMarginPercent: 0.15,
    bottomMarginPercent: 0.1,
    minTextHeight: 8,
    removeControlCharacters: true
  }
})

// Step 2: Generate interactive data for web apps
const interactiveData = await pdf.data({
  startPage: 1,
  endPage: 10,
  imageWidth: 1024,
  elementComposer: true
})

// Step 3: Create high-quality screenshots
const screenshots = await pdf.screenshot({
  startPage: 1,
  endPage: 10,
  imageWidth: 1200,
  imageQuality: 95
})

PDF Slicing and Processing

const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

console.log(`Original PDF: ${pdf.numPages} pages`)

// Slice to first 5 pages (modifies internal PDF)
const sliceResult = await pdf.slice({
  numberPages: 5
})

console.log(`Sliced PDF: ${pdf.numPages} pages`) // Now shows 5
console.log(`Saved ${sliceResult.fileSize} bytes`)

// Process the sliced PDF
const pages = await pdf.decompose({
  elementComposer: true
})

Link Extraction

const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Extract links from PDF content
const pagesWithLinks = await pdf.decompose({
  extractLinks: true,
  elementComposer: true
})

// Process found links
pagesWithLinks.pages.forEach((page, pageIndex) => {
  const linkElements = page.elements.filter((el) => el.type === 'link')

  linkElements.forEach((link) => {
    console.log(`Page ${pageIndex + 1}: Found ${link.attributes.linkType}`)
    console.log(`  URL: ${link.data}`)
    console.log(`  Position: [${link.boundingBox.left}, ${link.boundingBox.top}]`)

    if (link.attributes.text) {
      console.log(`  Context: "${link.attributes.text}"`)
    }
  })
})

API Reference

PdfDecomposer Class

Constructor

new PdfDecomposer(input: Buffer | ArrayBuffer | Uint8Array)

Static Methods

// Factory method - create and initialize in one step
static async create(input: Buffer | ArrayBuffer | Uint8Array): Promise<PdfDecomposer>

Instance Methods

// Initialize PDF (required before other operations)
async initialize(): Promise<void>

// Extract content and structure
async decompose(options?: PdfDecomposerOptions): Promise<DecomposeResult>

// Generate page screenshots
async screenshot(options?: ScreenshotOptions): Promise<ScreenshotResult>

// Generate pwa-admin compatible data structure
async data(options?: DataOptions): Promise<DataResult>

// Slice PDF to specific page range
async slice(options?: SliceOptions): Promise<SliceResult>

// Subscribe to progress updates
subscribe(callback: (state: PdfDecomposerState) => void): void

// Get PDF and page fingerprints for caching
async getFingerprints(): Promise<{ pdfHash: string; pageHashes: string[]; total: number }>

Properties

readonly numPages: number           // Total number of pages
readonly fingerprint: string        // PDF fingerprint for caching
readonly initialized: boolean       // Initialization status

Options Interfaces

PdfDecomposerOptions

interface PdfDecomposerOptions {
  startPage?: number // First page (1-indexed, default: 1)
  endPage?: number // Last page (1-indexed, default: all)
  outputDir?: string // Output directory for files
  elementComposer?: boolean // Group text into paragraphs
  pageComposer?: boolean // Merge continuous content across pages
  extractImages?: boolean // Extract embedded images
  extractLinks?: boolean // Extract links and annotations from PDF
  minify?: boolean // Compact output format
  cleanComposer?: boolean // Remove headers/footers
  cleanComposerOptions?: PdfCleanComposerOptions
  minifyOptions?: {
    format?: 'plain' | 'html' // Data field format
    elementAttributes?: boolean // Include slim element attributes
  }
}

ScreenshotOptions

interface ScreenshotOptions {
  startPage?: number // First page (1-indexed)
  endPage?: number // Last page (1-indexed)
  outputDir?: string // Output directory for image files
  imageWidth?: number // Image width (default: 1200)
  imageQuality?: number // JPEG quality 1-100 (default: 90)
}

DataOptions

interface DataOptions {
  startPage?: number // First page (1-indexed)
  endPage?: number // Last page (1-indexed)
  outputDir?: string // Output directory
  extractImages?: boolean // Extract embedded images
  extractLinks?: boolean // Extract links and annotations
  elementComposer?: boolean // Group elements into paragraphs
  cleanComposer?: boolean // Clean content area
  imageWidth?: number // Screenshot width (default: 1024)
  imageQuality?: number // Screenshot quality (default: 90)
  skipScreenshots?: boolean // Skip page image generation (for memory-constrained environments)
}

SliceOptions

interface SliceOptions {
  numberPages?: number // Number of pages from start
  startPage?: number // Starting page (1-indexed, default: 1)
  endPage?: number // Ending page (1-indexed)
}

PdfCleanComposerOptions

interface PdfCleanComposerOptions {
  topMarginPercent?: number // Fraction of page height excluded at the top for headers (default: 0.1, i.e. 10%)
  bottomMarginPercent?: number // Fraction of page height excluded at the bottom for footers (default: 0.1)
  sideMarginPercent?: number // Fraction of page width excluded on each side (default: 0.05)
  minTextHeight?: number // Minimum text height (default: 8)
  minTextWidth?: number // Minimum text width (default: 10)
  minTextLength?: number // Minimum text length (default: 3)
  removeControlCharacters?: boolean // Remove non-printable chars (default: true)
  removeIsolatedCharacters?: boolean // Remove isolated chars (default: true)
  minImageWidth?: number // Minimum image width (default: 50)
  minImageHeight?: number // Minimum image height (default: 50)
  minImageArea?: number // Minimum image area (default: 2500)
  coverPageDetection?: boolean // Detect cover pages (default: true)
  coverPageThreshold?: number // Cover detection threshold (default: 0.8)
}

Result Interfaces

DecomposeResult

interface DecomposeResult {
  pages: PdfPageContent[]
}

interface PdfPageContent {
  pageIndex: number // 0-based page index
  pageNumber: number // 1-based page number
  width: number // Page width in points
  height: number // Page height in points
  title: string // Page title
  elements: PdfElement[] // Extracted elements
  metadata?: {
    composedFromPages?: number[] // Original page indices (for pageComposer)
    [key: string]: any
  }
}

ScreenshotResult

interface ScreenshotResult {
  totalPages: number
  screenshots: ScreenshotPageResult[]
}

interface ScreenshotPageResult {
  pageNumber: number // 1-based page number
  width: number // Image width in pixels
  height: number // Image height in pixels
  screenshot: string // Base64 data URL
  filePath?: string // File path if outputDir provided
  error?: string // Error message if failed
}

DataResult

interface DataResult {
  data: PdfData[]
}

interface PdfData {
  id: string // Unique page identifier
  index: number // 0-based page index
  image: string // Page screenshot URL
  thumbnail: string // Thumbnail URL
  areas: PdfArea[] // Interactive areas
}

interface PdfArea {
  id: string // Unique area identifier
  coords: number[] // [x1, y1, x2, y2] normalized 0-1
  articleId: number // Associated article ID
  widgetId: string // Widget identifier (P: or T:)
}
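
The `coords` field holds `[x1, y1, x2, y2]` normalized to the 0-1 range. As a sketch of how such values can be derived from a bounding box in page units: the `Box` shape and the `toNormalizedCoords` helper below are assumptions for illustration, not part of the library's API.

```typescript
// Hypothetical bounding box in page units, with a top-left origin.
interface Box { left: number; top: number; width: number; height: number }

// Convert a box to [x1, y1, x2, y2], each clamped to the 0-1 range.
function toNormalizedCoords(box: Box, pageWidth: number, pageHeight: number): number[] {
  const clamp = (v: number): number => Math.min(1, Math.max(0, v))
  return [
    clamp(box.left / pageWidth),
    clamp(box.top / pageHeight),
    clamp((box.left + box.width) / pageWidth),
    clamp((box.top + box.height) / pageHeight)
  ]
}
```

Normalized coordinates stay valid when the page image is rendered at a different `imageWidth`, which is the usual reason for storing them this way.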

SliceResult

interface SliceResult {
  pdfBytes: Uint8Array // Sliced PDF data
  originalPageCount: number // Original page count
  slicedPageCount: number // Sliced page count
  pageRange: {
    startPage: number
    endPage: number
  }
  fileSize: number // Size in bytes
}

Testing and Development

Run Tests

npm test                    # Comprehensive test suite
npm run test:screenshot     # Screenshot generation tests
npm run test:data          # PDF data generation tests

Build and Development

npm run build              # Build TypeScript to dist/
npm run build:watch        # Watch mode for development
npm run lint               # ESLint validation

Environment Support

| Feature | Node.js | Browser | Notes |
| --- | --- | --- | --- |
| Text Extraction | Yes | Yes | Full support both environments |
| Image Extraction | Yes | Yes | Universal canvas-based processing |
| Screenshots | Yes | Yes | Node.js uses canvas, browser Canvas API |
| PDF Slicing | Yes | Yes | Uses pdf-lib in both environments |
| Progress Tracking | Yes | Yes | Observable pattern with callbacks |
| Memory Management | Yes | Limited | Advanced in Node.js, basic in browser |
| File Output | Yes | No | Browser returns data URLs/blobs |
| Element Composer | Yes | Yes | Smart text grouping |
| Page Composer | Yes | Yes | Cross-page content merging |
| Clean Composer | Yes | Yes | Header/footer removal |

Browser Compatibility

Chrome 60+
Firefox 55+
Safari 11+
Edge 79+
Mobile browsers (iOS Safari, Chrome Mobile)

Node.js Requirements

Node.js 16+ required
Canvas optional for enhanced screenshot quality
TypeScript 4.9+ for development

Production Usage Examples

Memory Optimization

const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Process in smaller batches for large PDFs
const totalPages = pdf.numPages
const batchSize = 10

for (let start = 1; start <= totalPages; start += batchSize) {
  const end = Math.min(start + batchSize - 1, totalPages)

  const batch = await pdf.decompose({
    startPage: start,
    endPage: end,
    elementComposer: true
  })

  // Process batch results...
}
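
The batching arithmetic in the loop above can also be factored into a small helper that is easy to test on its own. `pageBatches` and `PageRange` are hypothetical names introduced for this sketch; only the 1-indexed `startPage`/`endPage` convention comes from the options documented earlier.

```typescript
interface PageRange {
  startPage: number // 1-indexed, inclusive
  endPage: number // 1-indexed, inclusive
}

// Split pages 1..totalPages into consecutive ranges of at most batchSize pages.
function pageBatches(totalPages: number, batchSize: number): PageRange[] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer')
  }
  const ranges: PageRange[] = []
  for (let start = 1; start <= totalPages; start += batchSize) {
    ranges.push({
      startPage: start,
      endPage: Math.min(start + batchSize - 1, totalPages)
    })
  }
  return ranges
}
```

Each range can then be spread into a call such as `pdf.decompose({ ...range, elementComposer: true })`, awaiting one batch at a time so that only one batch is held in memory.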

Built-in Memory Limits (v1.0.6+):

MAX_SAFE_PIXELS: 2M pixels per image
MAX_DIMENSION: 2000px max width/height
MAX_IMAGES_PER_PAGE: 20 images
Canvas size limits: 1200x1600 for screenshots
Sequential processing to reduce peak memory
Use skipScreenshots: true in data() to skip page image generation
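
To show how limits like these translate into auto-scaling, here is a minimal sketch that computes a downscale factor for an image. It assumes "2M pixels" means 2,000,000 and that both limits apply together; `safeScale` is a hypothetical helper, not the library's internal routine.

```typescript
// Limits quoted from the list above (2M read as 2,000,000; an assumption).
const MAX_SAFE_PIXELS = 2_000_000
const MAX_DIMENSION = 2000

// Return a factor in (0, 1] that brings an image within both limits.
// A factor of 1 means no scaling is needed.
function safeScale(width: number, height: number): number {
  // Limit the longer side to MAX_DIMENSION.
  const byDimension = Math.min(1, MAX_DIMENSION / Math.max(width, height))
  // Limit the total area; area scales with the square of the factor.
  const byPixels = Math.min(1, Math.sqrt(MAX_SAFE_PIXELS / (width * height)))
  return Math.min(byDimension, byPixels)
}
```

For example, a 4000x1000 image is limited by its longer side and gets a factor of 0.5, while a 2000x2000 image is within the dimension limit but is scaled down to respect the pixel budget.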

Error Handling

const pdf = new PdfDecomposer(buffer)

pdf.subscribe((state) => {
  console.log(`Progress: ${state.progress}%`)
})

try {
  await pdf.initialize()
  const result = await pdf.decompose()
} catch (error) {
  // Caught values are `unknown` under strict TypeScript, so narrow before reading fields
  const err = error instanceof Error ? error : new Error(String(error))
  if (err.name === 'InvalidPdfError') {
    console.error('Invalid PDF format:', err.message)
  } else if (err.name === 'MemoryError') {
    console.error('Memory limit exceeded:', err.message)
  } else {
    console.error('Processing failed:', err.message)
  }
}

Caching Strategy

const pdf = new PdfDecomposer(buffer)
await pdf.initialize()

// Use fingerprint for caching
const fingerprints = await pdf.getFingerprints()
const cacheKey = `pdf_${fingerprints.pdfHash}`

// Check cache before processing
// (`cache` is your application's own cache client; it is not part of this library)
let result = cache.get(cacheKey)
if (!result) {
  result = await pdf.decompose()
  cache.set(cacheKey, result, { ttl: 3600 }) // 1 hour
}
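
The `cache` object in this example is supplied by the application. If no cache client is at hand, a small in-memory stand-in with the same `get`/`set` shape could look like the sketch below. `TtlCache` is a hypothetical class, and reading `ttl` as seconds is an assumption based on the "1 hour" comment.

```typescript
// Minimal in-memory cache with per-entry expiry.
class TtlCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>()
  private now: () => number

  // `now` is injectable so expiry can be tested without waiting.
  constructor(now: () => number = Date.now) {
    this.now = now
  }

  get(key: string): T | undefined {
    const entry = this.entries.get(key)
    if (!entry) return undefined
    if (entry.expiresAt <= this.now()) {
      // Expired entries are removed lazily on read.
      this.entries.delete(key)
      return undefined
    }
    return entry.value
  }

  // ttl is in seconds, matching the `{ ttl: 3600 }` usage in the example.
  set(key: string, value: T, options: { ttl: number }): void {
    this.entries.set(key, { value, expiresAt: this.now() + options.ttl * 1000 })
  }
}
```

An in-memory cache only helps within one process; a shared store would be needed to reuse results across workers or restarts.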

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Development Guidelines

Use TypeScript for all new code
Add tests for new features
Update README for API changes
Follow existing code style
Test in both Node.js and browser environments

Publishing

Setup for Publishing

# Initial setup (run once)
npm run setup:publishing

# Verify configuration
npm run setup:verify

Publishing Commands

# Publish to NPM only
npm run publish:npm

# Publish to GitHub Packages only
npm run publish:github

# Publish to both registries
npm run publish:both

# Version bump + publish
npm version patch && npm run publish:both

License

PDF-Decomposer is dual-licensed:

Non-Commercial Use (Free)

Personal projects
Educational use
Research purposes
Open source projects

Commercial Use (Paid License Required)

Commercial applications
Revenue-generating products
Enterprise software
Distribution in commercial products

For commercial licensing, contact febby.rachmat@gmail.com

See LICENSE file for complete terms.
