
macos-vision

Apple Vision for Node.js — native, fast, offline, no API keys required.

Uses macOS's built-in Vision framework via a compiled Swift binary. Works completely offline. No cloud services, no API keys, no Python, zero runtime dependencies.

Requirements

  • macOS 12+
  • Node.js 18+
  • Xcode Command Line Tools
xcode-select --install

Installation

npm install macos-vision

The native Swift binary is compiled automatically on install.

What this is (and isn't)

macos-vision gives you raw Apple Vision results — text, coordinates, bounding boxes, labels.

It is not a document pipeline. It does not:

  • Convert PDFs or images to Markdown
  • Understand document structure (headings, tables, paragraphs)
  • Chain multiple detections into a final report

For those use cases, use the raw output as input to an LLM or a post-processing layer of your own.
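That handoff can be as simple as flattening the structured blocks into prompt text. A minimal sketch (the `VisionBlock` shape matches the interface documented under the API reference below; the sorting and formatting conventions are this example's own, not part of the library):

```typescript
// Hypothetical helper: flatten OCR blocks into plain-text context for an LLM.
// VisionBlock mirrors the shape returned by ocr(path, { format: 'blocks' }).
interface VisionBlock {
  text: string
  x: number; y: number; width: number; height: number
}

function blocksToPromptContext(blocks: VisionBlock[]): string {
  // Sort roughly top-to-bottom, then left-to-right, so the LLM
  // sees the text in approximate reading order.
  const sorted = [...blocks].sort((a, b) => a.y - b.y || a.x - b.x)
  return sorted
    .map(b => `[${b.x.toFixed(2)},${b.y.toFixed(2)}] ${b.text}`)
    .join('\n')
}

const context = blocksToPromptContext([
  { text: 'Total: $42.00', x: 0.1, y: 0.8, width: 0.3, height: 0.05 },
  { text: 'Invoice #123', x: 0.1, y: 0.1, width: 0.4, height: 0.05 },
])
// 'Invoice #123' is listed before 'Total: $42.00'
```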


CLI

# OCR — plain text (default)
npx macos-vision photo.jpg

# Structured OCR blocks with bounding boxes
npx macos-vision --blocks photo.jpg

# Detect faces
npx macos-vision --faces photo.jpg

# Detect barcodes and QR codes
npx macos-vision --barcodes photo.jpg

# Detect rectangular shapes
npx macos-vision --rectangles photo.jpg

# Find document boundary
npx macos-vision --document photo.jpg

# Classify image content
npx macos-vision --classify photo.jpg

# Run all detections at once
npx macos-vision --all photo.jpg

Multiple flags can be combined: npx macos-vision --blocks --faces --classify photo.jpg

Structured results are printed as JSON to stdout.


API

import { ocr, detectFaces, detectBarcodes, detectRectangles, detectDocument, classify, inferLayout } from 'macos-vision'

// OCR — plain text
const text = await ocr('photo.jpg')

// OCR — structured blocks with bounding boxes
const blocks = await ocr('photo.jpg', { format: 'blocks' })

// Detect faces
const faces = await detectFaces('photo.jpg')

// Detect barcodes and QR codes
const codes = await detectBarcodes('invoice.jpg')

// Detect rectangular shapes (tables, forms, cards)
const rects = await detectRectangles('document.jpg')

// Find document boundary in a photo
const doc = await detectDocument('photo.jpg') // DocumentBounds | null

// Classify image content
const labels = await classify('photo.jpg')

// Layout inference — unified reading-order-sorted representation
const layout = inferLayout({ textBlocks: blocks, faces, barcodes: codes })
// layout is LayoutBlock[] — ready to feed into a Markdown renderer or LLM context

Layout inference

inferLayout merges raw Vision results into a unified LayoutBlock[] sorted in reading order (top-to-bottom, left-to-right). Text blocks are grouped into lines and paragraphs using geometric heuristics.

import { ocr, detectFaces, detectBarcodes, inferLayout } from 'macos-vision';

const blocks   = await ocr('page.png', { format: 'blocks' });
const faces    = await detectFaces('page.png');
const barcodes = await detectBarcodes('page.png');

const layout = inferLayout({ textBlocks: blocks, faces, barcodes });

for (const block of layout) {
  if (block.kind === 'text') {
    console.log(`[p${block.paragraphId} l${block.lineId}] ${block.text}`);
  } else {
    console.log(`[${block.kind}] at (${block.x.toFixed(2)}, ${block.y.toFixed(2)})`);
  }
}
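The resulting array can be turned into Markdown with a small renderer of your own. A sketch (the `LayoutBlock` shape here mirrors the discriminated union documented below; the rendering rules, paragraph breaks, and blockquote convention are this example's own choices):

```typescript
// Sketch: render a LayoutBlock[] (as produced by inferLayout) to Markdown.
type LayoutBlock =
  | { kind: 'text'; text: string; lineId: number; paragraphId: number; x: number; y: number }
  | { kind: 'barcode'; value: string; type: string; x: number; y: number }
  | { kind: 'face' | 'rectangle' | 'document'; x: number; y: number }

function layoutToMarkdown(layout: LayoutBlock[]): string {
  const out: string[] = []
  let lastParagraph = -1
  for (const block of layout) {
    if (block.kind === 'text') {
      // Blank line between paragraphs; lines within a paragraph stay adjacent.
      if (block.paragraphId !== lastParagraph && out.length > 0) out.push('')
      out.push(block.text)
      lastParagraph = block.paragraphId
    } else if (block.kind === 'barcode') {
      out.push(`> Barcode (${block.type}): ${block.value}`)
      lastParagraph = -1
    }
    // faces, rectangles, and documents are skipped in this sketch
  }
  return out.join('\n')
}

const md = layoutToMarkdown([
  { kind: 'text', text: 'Hello', lineId: 0, paragraphId: 0, x: 0, y: 0 },
  { kind: 'text', text: 'World', lineId: 1, paragraphId: 1, x: 0, y: 0.2 },
  { kind: 'barcode', value: 'https://example.com', type: 'org.iso.QRCode', x: 0, y: 0.5 },
])
```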

LayoutBlock is a discriminated union — use block.kind to narrow the type:

kind          Extra fields
'text'        text, lineId, paragraphId
'barcode'     value, type
'face'        (none)
'rectangle'   (none)
'document'    (none)

Note: Layout inference is a heuristic layer. It does not understand multi-column layouts or rotated text. Treat it as structured input for downstream tools, not as ground truth.
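The kind of geometric heuristic involved can be illustrated as follows. This is a sketch of the general approach (blocks whose vertical centers fall within a tolerance are treated as one line), not the library's actual implementation:

```typescript
// Illustration of a reading-order sort: blocks whose vertical centers are
// within lineTolerance of each other count as one line and are ordered
// left-to-right; distinct lines are ordered top-to-bottom.
interface Box { text: string; x: number; y: number; width: number; height: number }

function readingOrder(blocks: Box[], lineTolerance = 0.01): Box[] {
  return [...blocks].sort((a, b) => {
    const dy = (a.y + a.height / 2) - (b.y + b.height / 2)
    if (Math.abs(dy) > lineTolerance) return dy  // different lines: top first
    return a.x - b.x                             // same line: left first
  })
}

const ordered = readingOrder([
  { text: 'right', x: 0.6, y: 0.100, width: 0.2, height: 0.04 },
  { text: 'below', x: 0.1, y: 0.300, width: 0.2, height: 0.04 },
  { text: 'left',  x: 0.1, y: 0.102, width: 0.2, height: 0.04 },
])
// → left, right, below
```

This simple version is exactly why multi-column layouts defeat it: a line-by-line sweep interleaves the columns.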

API reference

ocr(imagePath, options?)

Extracts text from an image.

Parameter        Type                Default      Description
imagePath        string              (required)   Path to image (PNG, JPG, JPEG, WEBP)
options.format   'text' | 'blocks'   'text'       Plain text or structured blocks with coordinates

Returns Promise<string> or Promise<VisionBlock[]>.

interface VisionBlock {
  text: string
  x: number       // 0–1 from left
  y: number       // 0–1 from top
  width: number   // 0–1
  height: number  // 0–1
}
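Because coordinates are normalized, converting a block to pixel coordinates only requires the image dimensions. A sketch (not part of the library):

```typescript
// Convert a normalized VisionBlock (0–1 coordinates, top-left origin)
// to pixel coordinates for a known image size.
interface VisionBlock { text: string; x: number; y: number; width: number; height: number }

function toPixels(block: VisionBlock, imageWidth: number, imageHeight: number) {
  return {
    text: block.text,
    x: Math.round(block.x * imageWidth),
    y: Math.round(block.y * imageHeight),
    width: Math.round(block.width * imageWidth),
    height: Math.round(block.height * imageHeight),
  }
}

const px = toPixels({ text: 'Hello', x: 0.25, y: 0.5, width: 0.5, height: 0.1 }, 800, 600)
// → { text: 'Hello', x: 200, y: 300, width: 400, height: 60 }
```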

detectFaces(imagePath)

Detects human faces and returns their bounding boxes.

interface Face {
  x: number; y: number; width: number; height: number
  confidence: number  // 0–1
}

detectBarcodes(imagePath)

Detects barcodes and QR codes and decodes their payload.

interface Barcode {
  type: string    // e.g. 'org.iso.QRCode', 'org.gs1.EAN-13'
  value: string   // decoded content
  x: number; y: number; width: number; height: number
}
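For example, picking out only the QR-code payloads from a result, using the 'org.iso.QRCode' type identifier shown above (the helper itself is a sketch, not part of the library):

```typescript
// Extract decoded QR payloads from a barcode detection result.
interface Barcode { type: string; value: string; x: number; y: number; width: number; height: number }

function qrPayloads(codes: Barcode[]): string[] {
  return codes.filter(c => c.type === 'org.iso.QRCode').map(c => c.value)
}

const payloads = qrPayloads([
  { type: 'org.iso.QRCode', value: 'https://example.com', x: 0, y: 0, width: 0.1, height: 0.1 },
  { type: 'org.gs1.EAN-13', value: '4006381333931', x: 0.5, y: 0.5, width: 0.2, height: 0.1 },
])
// → ['https://example.com']
```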

detectRectangles(imagePath)

Finds rectangular shapes (documents, tables, cards, forms).

interface Rectangle {
  topLeft: [number, number]; topRight: [number, number]
  bottomLeft: [number, number]; bottomRight: [number, number]
  confidence: number
}
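Since a Rectangle is four corner points rather than an axis-aligned box, deriving a size takes a small geometric step. A sketch computing the normalized area with the shoelace formula (an illustration; the `quadArea` helper is not part of the library):

```typescript
// Compute the normalized area of a detected quadrilateral via the
// shoelace formula. Corner fields match the Rectangle interface above.
type Point = [number, number]
interface Rectangle {
  topLeft: Point; topRight: Point
  bottomLeft: Point; bottomRight: Point
  confidence: number
}

function quadArea(r: Rectangle): number {
  // Corners in a consistent winding order.
  const pts = [r.topLeft, r.topRight, r.bottomRight, r.bottomLeft]
  let sum = 0
  for (let i = 0; i < pts.length; i++) {
    const [x1, y1] = pts[i]
    const [x2, y2] = pts[(i + 1) % pts.length]
    sum += x1 * y2 - x2 * y1
  }
  return Math.abs(sum) / 2
}

const area = quadArea({
  topLeft: [0.1, 0.1], topRight: [0.9, 0.1],
  bottomLeft: [0.1, 0.9], bottomRight: [0.9, 0.9],
  confidence: 0.95,
})
// → 0.64 (an 0.8 × 0.8 square)
```

Useful, for instance, to discard detections that cover too little of the frame.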

detectDocument(imagePath)

Finds the boundary of a document in a photo (e.g. paper on a desk). Returns null if no document is found.

interface DocumentBounds {
  topLeft: [number, number]; topRight: [number, number]
  bottomLeft: [number, number]; bottomRight: [number, number]
  confidence: number
}

classify(imagePath)

Returns top image classification labels with confidence scores.

interface Classification {
  identifier: string   // e.g. 'document', 'outdoor', 'animal'
  confidence: number   // 0–1
}
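A common follow-up is keeping only confident labels, sorted best-first. A sketch (threshold and helper are this example's own):

```typescript
// Keep classification labels above a confidence threshold, highest first.
interface Classification { identifier: string; confidence: number }

function topLabels(labels: Classification[], minConfidence = 0.5): string[] {
  return labels
    .filter(l => l.confidence >= minConfidence)
    .sort((a, b) => b.confidence - a.confidence)
    .map(l => l.identifier)
}

const picks = topLabels([
  { identifier: 'document', confidence: 0.92 },
  { identifier: 'outdoor', confidence: 0.31 },
  { identifier: 'paper', confidence: 0.77 },
])
// → ['document', 'paper']
```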

Why macos-vision?

                          macos-vision   Tesseract.js   Cloud APIs
Offline                   ✓              ✓              ✗
No API key                ✓              ✓              ✗
Native speed              ✓              ✗              ✗
Zero runtime deps         ✓              ✗              ✗
OCR with bounding boxes   ✓              ✓              ✓
Face detection            ✓              ✗              ✓
Barcode / QR              ✓              ✗              ✓
Document detection        ✓              ✗              ✓
Image classification      ✓              ✗              ✓
macOS only                ✓              ✗              ✗

Apple Vision is the same engine used by macOS Spotlight, Live Text, and Shortcuts — highly optimized and accurate.

License

MIT
