Apple Vision for Node.js — native, fast, offline, no API keys required.
Uses macOS's built-in Vision framework via a compiled Swift binary. Works completely offline. No cloud services, no API keys, no Python, zero runtime dependencies.
- macOS 12+
- Node.js 18+
- Xcode Command Line Tools
xcode-select --installnpm install macos-visionThe native Swift binary is compiled automatically on install.
macos-vision gives you raw Apple Vision results — text, coordinates, bounding boxes, labels.
It is not a document pipeline. It does not:
- Convert PDFs or images to Markdown
- Understand document structure (headings, tables, paragraphs)
- Chain multiple detections into a final report
For those use cases, use the raw output as input to an LLM or a post-processing layer of your own.
# OCR — plain text (default)
npx macos-vision photo.jpg
# Structured OCR blocks with bounding boxes
npx macos-vision --blocks photo.jpg
# Detect faces
npx macos-vision --faces photo.jpg
# Detect barcodes and QR codes
npx macos-vision --barcodes photo.jpg
# Detect rectangular shapes
npx macos-vision --rectangles photo.jpg
# Find document boundary
npx macos-vision --document photo.jpg
# Classify image content
npx macos-vision --classify photo.jpg
# Run all detections at once
npx macos-vision --all photo.jpgMultiple flags can be combined: npx macos-vision --blocks --faces --classify photo.jpg
Structured results are printed as JSON to stdout.
import { ocr, detectFaces, detectBarcodes, detectRectangles, detectDocument, classify } from 'macos-vision'
// OCR — plain text
const text = await ocr('photo.jpg')
// OCR — structured blocks with bounding boxes
const blocks = await ocr('photo.jpg', { format: 'blocks' })
// Detect faces
const faces = await detectFaces('photo.jpg')
// Detect barcodes and QR codes
const codes = await detectBarcodes('invoice.jpg')
// Detect rectangular shapes (tables, forms, cards)
const rects = await detectRectangles('document.jpg')
// Find document boundary in a photo
const doc = await detectDocument('photo.jpg') // DocumentBounds | null
// Classify image content
const labels = await classify('photo.jpg')
// Layout inference — unified reading-order-sorted representation
const layout = inferLayout({ textBlocks: blocks, faces, barcodes: codes })
// layout is LayoutBlock[] — ready to feed into a Markdown renderer or LLM contextinferLayout merges raw Vision results into a unified LayoutBlock[] sorted in reading order (top-to-bottom, left-to-right). Text blocks are grouped into lines and paragraphs using geometric heuristics.
import { ocr, detectFaces, detectBarcodes, inferLayout } from 'macos-vision';
const blocks = await ocr('page.png', { format: 'blocks' });
const faces = await detectFaces('page.png');
const barcodes = await detectBarcodes('page.png');
const layout = inferLayout({ textBlocks: blocks, faces, barcodes });
for (const block of layout) {
if (block.kind === 'text') {
console.log(`[p${block.paragraphId} l${block.lineId}] ${block.text}`);
} else {
console.log(`[${block.kind}] at (${block.x.toFixed(2)}, ${block.y.toFixed(2)})`);
}
}LayoutBlock is a discriminated union — use block.kind to narrow the type:
kind |
Extra fields |
|---|---|
'text' |
text, lineId, paragraphId |
'barcode' |
value, type |
'face' |
— |
'rectangle' |
— |
'document' |
— |
Note: Layout inference is a heuristic layer. It does not understand multi-column layouts or rotated text. Treat it as structured input for downstream tools, not as ground truth.
Extracts text from an image.
| Parameter | Type | Default | Description |
|---|---|---|---|
imagePath |
string |
— | Path to image (PNG, JPG, JPEG, WEBP) |
options.format |
'text' | 'blocks' |
'text' |
Plain text or structured blocks with coordinates |
Returns Promise<string> or Promise<VisionBlock[]>.
interface VisionBlock {
text: string
x: number // 0–1 from left
y: number // 0–1 from top
width: number // 0–1
height: number // 0–1
}Detects human faces and returns their bounding boxes.
interface Face {
x: number; y: number; width: number; height: number
confidence: number // 0–1
}Detects barcodes and QR codes and decodes their payload.
interface Barcode {
type: string // e.g. 'org.iso.QRCode', 'org.gs1.EAN-13'
value: string // decoded content
x: number; y: number; width: number; height: number
}Finds rectangular shapes (documents, tables, cards, forms).
interface Rectangle {
topLeft: [number, number]; topRight: [number, number]
bottomLeft: [number, number]; bottomRight: [number, number]
confidence: number
}Finds the boundary of a document in a photo (e.g. paper on a desk). Returns null if no document is found.
interface DocumentBounds {
topLeft: [number, number]; topRight: [number, number]
bottomLeft: [number, number]; bottomRight: [number, number]
confidence: number
}Returns top image classification labels with confidence scores.
interface Classification {
identifier: string // e.g. 'document', 'outdoor', 'animal'
confidence: number // 0–1
}| macos-vision | Tesseract.js | Cloud APIs | |
|---|---|---|---|
| Offline | ✅ | ✅ | ❌ |
| No API key | ✅ | ✅ | ❌ |
| Native speed | ✅ | ❌ | — |
| Zero runtime deps | ✅ | ❌ | ❌ |
| OCR with bounding boxes | ✅ | ✅ | ✅ |
| Face detection | ✅ | ❌ | ✅ |
| Barcode / QR | ✅ | ❌ | ✅ |
| Document detection | ✅ | ❌ | ✅ |
| Image classification | ✅ | ❌ | ✅ |
| macOS only | ✅ | ❌ | ❌ |
Apple Vision is the same engine used by macOS Spotlight, Live Text, and Shortcuts — highly optimized and accurate.
MIT