epublic8

A Go-based gRPC/HTTP service for converting documents to EPUB format with OCR support.

Features

OCR - Extract text from PDFs with garbled/CE-encoded fonts using Vision (macOS) or Tesseract
Figure detection - Automatically crops embedded figures from OCR'd pages and inserts them inline at the correct position in the EPUB, scaled proportionally to their PDF page width
EPUB Generation - Convert documents to EPUB with automatic chapter detection
gRPC API - High-performance gRPC interface for programmatic access
Web UI - Drag-and-drop browser interface with real-time conversion logs
Kubernetes Ready - Production-ready K8s manifests with HPA
Prometheus Metrics - Built-in metrics endpoint for monitoring
OpenTelemetry Tracing - Optional distributed tracing support
Security - Basic authentication and allowed hosts filtering

Architecture

The service exposes two interfaces on the same process: a gRPC server on :50051 for programmatic access and an HTTP server on :8080 for the web UI. Both interfaces feed into the same document processor, which extracts text from the uploaded file and passes it to the EPUB generator. The generated EPUB is written to disk and served back via a download URL.

OCR parallelism is bounded by a package-level semaphore sized to GOMAXPROCS, so one busy conversion can use all allocated cores while multiple concurrent conversions share them fairly.
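
The semaphore described above can be sketched as a buffered channel. This is a minimal illustration, not the service's actual code: the names `ocrSem` and `ocrPages` are hypothetical, and the per-page OCR call is replaced by a stub function.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// ocrSem is a hypothetical package-level semaphore: a buffered channel whose
// capacity equals GOMAXPROCS when the program starts.
var ocrSem = make(chan struct{}, runtime.GOMAXPROCS(0))

// ocrPages runs ocrPage for every page index. Because ocrSem is shared by the
// whole process, the number of OCR calls running at once never exceeds
// cap(ocrSem), no matter how many conversions are in flight.
func ocrPages(pages int, ocrPage func(page int) string) []string {
	out := make([]string, pages)
	var wg sync.WaitGroup
	for i := 0; i < pages; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			ocrSem <- struct{}{}        // acquire a slot
			defer func() { <-ocrSem }() // release it when the page is done
			out[i] = ocrPage(i)
		}(i)
	}
	wg.Wait()
	return out
}

func main() {
	// The stub stands in for a real Vision or Tesseract invocation.
	texts := ocrPages(4, func(p int) string { return fmt.Sprintf("page %d", p+1) })
	fmt.Println(len(texts), "pages processed with", cap(ocrSem), "slots")
}
```

A single conversion with many pages can fill every slot, while pages from several conversions simply queue on the same channel.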

On macOS the service compiles and uses the Vision framework (ocr/vision_ocr.swift) for OCR, which returns per-word bounding boxes in addition to text. Those bounding boxes drive figure detection: vertical gaps followed by a "Fig. N" caption are cropped from the page image and embedded in the EPUB at the paragraph where the caption appears. Figure images carry a width fraction (crop width ÷ page width) so they render at their original proportional size rather than being stretched to the full column width. On Linux, Vision is unavailable and Tesseract is used; figure detection is skipped.
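
As a rough illustration of the width fraction, the sketch below turns a crop width and page width into a percentage width on an image element. The function name `figureImg` and the exact markup are assumptions; the README does not show what the EPUB generator emits.

```go
package main

import "fmt"

// figureImg renders a hypothetical <img> element for a cropped figure.
// The width is the crop width divided by the page width, as a percentage;
// invalid or oversized inputs fall back to the full column width.
func figureImg(src string, cropWidthPx, pageWidthPx int) string {
	frac := 1.0
	if pageWidthPx > 0 && cropWidthPx > 0 && cropWidthPx < pageWidthPx {
		frac = float64(cropWidthPx) / float64(pageWidthPx)
	}
	return fmt.Sprintf(`<img src=%q style="width:%.0f%%"/>`, src, frac*100)
}

func main() {
	// A crop half as wide as the page renders at half the column width.
	fmt.Println(figureImg("images/fig1.png", 1240, 2480))
}
```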

Quick Start

Local Development

# Navigate to the cloned project
cd epublic8

# Install dependencies
go mod tidy

# Generate protobuf code (only needed if you modify document.proto)
protoc --go_out=. --go_opt=paths=source_relative \
  --go-grpc_out=. --go-grpc_opt=paths=source_relative \
  pb/document.proto

# Build
go build -o bin/epublic8 ./cmd/server

# Run (uses all CPUs for parallel OCR by default)
./bin/epublic8

# Run with limited parallelism
GOMAXPROCS=2 OMP_NUM_THREADS=1 ./bin/epublic8

# Stop
./stop

# Open browser
# http://localhost:8080

Docker

# Build image
docker build -t document-service:latest .

# Run container
# OMP_NUM_THREADS=1 prevents each Tesseract process from spawning extra threads;
# without it, GOMAXPROCS=2 workers each try to claim all available CPU threads.
# GOMAXPROCS should match --cpus so the OCR semaphore is sized correctly.
docker run -p 8080:8080 -p 50051:50051 \
  --cpus 2 --memory 2g \
  -e GOMAXPROCS=2 \
  -e OMP_NUM_THREADS=1 \
  document-service:latest

Kubernetes (minikube)

The manifest uses imagePullPolicy: Never, so the image must exist inside minikube's Docker daemon before deploying. Build it there directly:

# Point your shell at minikube's Docker daemon, build, then restore
eval $(minikube docker-env)
docker build -t document-service:latest .
eval $(minikube docker-env -u)

Deploy and verify:

kubectl apply -f deploy/k8s.yaml

kubectl get pods -l app=document-service   # wait for 1/1 Running
kubectl get svc document-service-lb        # EXTERNAL-IP will be <pending> until tunnel

The manifest creates a LoadBalancer service. On macOS with the Docker driver the node IP is not reachable from the host, so run minikube tunnel in a dedicated terminal to assign 127.0.0.1 as the external IP:

minikube tunnel   # keep this running; may prompt for sudo

The service is then available at http://localhost:8080 and gRPC at localhost:50051.

To rebuild and redeploy after code changes:

eval $(minikube docker-env)
docker build -t document-service:latest .
eval $(minikube docker-env -u)
kubectl rollout restart deployment/document-service

API Reference

gRPC

service DocumentService {
  rpc ProcessDocument(DocumentRequest) returns (DocumentResponse);
  rpc StreamProcessDocument(stream DocumentChunk) returns (stream DocumentChunkResponse);
  rpc ExtractEntities(EntityRequest) returns (EntityResponse);
  rpc Health(HealthRequest) returns (HealthResponse);
}

ExtractEntities uses regex-based pattern matching to identify PERSON, LOCATION, ORGANIZATION, DATE, EMAIL, and PHONE entities. It accepts at most 1 MB of input text and returns deduplicated results with confidence scores.
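
The general shape of regex-based extraction with deduplication might look like the sketch below. The two patterns, the confidence values, and the names `Entity` and `extractEntities` are illustrative assumptions; the service's real expressions and scoring are not shown in this README.

```go
package main

import (
	"fmt"
	"regexp"
)

// Entity is a hypothetical result type: what was found, its kind, and a score.
type Entity struct {
	Type       string
	Text       string
	Confidence float64
}

const maxTextBytes = 1 << 20 // 1 MB input limit

// patterns holds illustrative expressions for two of the six entity types.
var patterns = []struct {
	typ  string
	re   *regexp.Regexp
	conf float64
}{
	{"EMAIL", regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`), 0.95},
	{"DATE", regexp.MustCompile(`\b\d{4}-\d{2}-\d{2}\b`), 0.90},
}

// extractEntities rejects oversized input, then returns each distinct
// (type, text) match once, in pattern order.
func extractEntities(text string) ([]Entity, error) {
	if len(text) > maxTextBytes {
		return nil, fmt.Errorf("text exceeds %d bytes", maxTextBytes)
	}
	seen := map[string]bool{}
	var out []Entity
	for _, p := range patterns {
		for _, m := range p.re.FindAllString(text, -1) {
			key := p.typ + "\x00" + m
			if seen[key] {
				continue // duplicate match, already reported
			}
			seen[key] = true
			out = append(out, Entity{p.typ, m, p.conf})
		}
	}
	return out, nil
}

func main() {
	ents, err := extractEntities("Write to ana@example.com by 2024-01-31; ana@example.com is monitored.")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, e := range ents {
		fmt.Printf("%s %s %.2f\n", e.Type, e.Text, e.Confidence)
	}
}
```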

Health returns the current service status and number of active requests for readiness probes.

HTTP

Method	Endpoint	Description
GET	`/`	Upload page
POST	`/api/upload`	Upload and convert document (streams SSE log events)
GET	`/download?file=filename`	Download generated EPUB
GET	`/metrics`	Prometheus metrics endpoint

Rate limiting note: per-IP upload rate limiting reads X-Forwarded-For / X-Real-IP to key on the real client address. The service must be deployed behind a trusted proxy (K8s Ingress, cloud load balancer) that strips or overwrites these headers before forwarding. Direct exposure to untrusted traffic would allow clients to spoof the header and bypass rate limits.
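
To make the spoofing risk concrete, here is a minimal sketch of how a rate-limit key could be derived from these headers. The function `clientKey` and the choice of the first X-Forwarded-For entry are assumptions, not the service's confirmed logic.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// clientKey picks the address used as the rate-limit key. xff and realIP are
// the raw X-Forwarded-For and X-Real-IP header values; remoteAddr is the TCP
// peer in "host:port" form, as found in http.Request.RemoteAddr.
func clientKey(xff, realIP, remoteAddr string) string {
	// X-Forwarded-For is a comma-separated chain. This sketch trusts the
	// first entry, which is only safe when a trusted proxy wrote the header.
	if first := strings.TrimSpace(strings.Split(xff, ",")[0]); first != "" {
		return first
	}
	if ip := strings.TrimSpace(realIP); ip != "" {
		return ip
	}
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr // not host:port, use it unchanged
	}
	return host
}

func main() {
	fmt.Println(clientKey("203.0.113.7, 10.0.0.1", "", "10.0.0.1:54321"))
	fmt.Println(clientKey("", "", "10.0.0.1:54321"))
}
```

Because the header wins over the socket address, a client that reaches the service directly can send a different X-Forwarded-For value on every request and get a fresh rate-limit bucket each time.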

`POST /api/upload`

Upload form fields:

file — the document to convert (max 200 MB)
smart_ocr — true (default) to run pdffonts before extraction and skip straight to OCR when Custom-encoded fonts are detected

Response: text/event-stream. Each SSE event is a JSON object on the data: line:

{"type":"log","message":"..."}

Progress line, streamed in real time.

{"type":"done","download_url":"/download?file=...","filename":"...","chapters":4,"chars":85000,"epub_kb":142.3,"processing_ms":3210}

Conversion complete.

{"type":"error","message":"..."}

Conversion failed.
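
A client can decode these events by reading the data: lines and unmarshalling each payload. The sketch below assumes the field names shown in the examples above; the `Event` type and `parseEvents` function are hypothetical, and the stream is passed as a string to keep the example self-contained.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// Event mirrors the three JSON shapes shown above; unused fields stay zero.
type Event struct {
	Type         string  `json:"type"`
	Message      string  `json:"message,omitempty"`
	DownloadURL  string  `json:"download_url,omitempty"`
	Filename     string  `json:"filename,omitempty"`
	Chapters     int     `json:"chapters,omitempty"`
	Chars        int     `json:"chars,omitempty"`
	EpubKB       float64 `json:"epub_kb,omitempty"`
	ProcessingMS int64   `json:"processing_ms,omitempty"`
}

// parseEvents decodes every "data:" line of an SSE stream and skips
// everything else (blank separators, comments, other SSE fields).
func parseEvents(stream string) ([]Event, error) {
	var events []Event
	sc := bufio.NewScanner(strings.NewReader(stream))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "data:") {
			continue
		}
		payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
		var ev Event
		if err := json.Unmarshal([]byte(payload), &ev); err != nil {
			return nil, err
		}
		events = append(events, ev)
	}
	return events, sc.Err()
}

func main() {
	stream := "data: {\"type\":\"log\",\"message\":\"OCR page 1\"}\n\n" +
		"data: {\"type\":\"done\",\"download_url\":\"/download?file=book.epub\",\"chapters\":4}\n\n"
	events, err := parseEvents(stream)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, ev := range events {
		fmt.Println(ev.Type, ev.Message, ev.DownloadURL)
	}
}
```

Against a live server the same loop would wrap the HTTP response body in the scanner instead of a string reader, so log events can be shown as they arrive.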

Configuration

The service can be configured through a YAML config file, environment variables, or command-line flags. Environment variables take precedence over config file values.

Config File

# config.yaml example
server:
  grpcPort: "50051"
  httpPort: "8080"

ocr:
  concurrency: 2
  languages:
    - srp_latn+hrv
    - srp_latn
    - eng

epub:
  chapterWords: 1500
  outputDir: "/tmp/epubs"

cleanup:
  enabled: true
  retentionHours: 24
  intervalHours: 1

security:
  basicAuth: "admin:secret"  # or bcrypt hash
  allowedHosts:
    - example.com

tracing:
  enabled: false
  serviceName: "epublic8"
  consoleExporter: true

metrics:
  enabled: true
  path: "/metrics"

Use the -config flag or the CONFIG_PATH environment variable to load a config file:

./bin/epublic8 -config config.yaml

Environment Variables

Variable	Default	Description
`CONFIG_PATH`	-	Path to YAML config file
`GRPC_PORT`	`50051`	gRPC server port
`HTTP_PORT`	`8080`	HTTP server port
`OUTPUT_DIR`	(temp dir)	Directory for generated EPUBs. If unset, a temp dir is created and removed on exit. If set, files persist and are cleaned up after 24 hours by an in-process loop.
`GOMAXPROCS`	all CPUs	Go scheduler parallelism. Set to match container CPU limit.
`OMP_NUM_THREADS`	all CPUs	Threads per Tesseract process (OpenMP). Set to `1` to prevent each concurrent Tesseract worker from spawning extra threads and saturating the CPU limit.
`OCR_CONCURRENCY`	`GOMAXPROCS`	Max concurrent OCR page workers. Defaults to the Go scheduler parallelism.
`OCR_LANGUAGES`	`srp_latn+hrv, srp_latn, eng`	OCR language codes (comma-separated)
`EPUB_CHAPTER_WORDS`	`1500`	Word count target per chapter when no headings are detected.
`EPUB_CLEANUP_ENABLED`	`false`	Enable automatic EPUB cleanup
`EPUB_RETENTION_HOURS`	`24`	Hours before generated EPUBs are deleted by the cleanup loop
`EPUB_CLEANUP_INTERVAL_HOURS`	`1`	How often the cleanup loop runs (hours)
`BASIC_AUTH`	-	Basic auth credentials (format: `username:password` or bcrypt hash)
`ALLOWED_HOSTS`	-	Comma-separated list of allowed Host headers
`TRACING_ENABLED`	`false`	Enable OpenTelemetry tracing
`TRACING_SERVICE_NAME`	`epublic8`	Service name for tracing
`TRACING_CONSOLE_EXPORTER`	`true`	Reserved for future console exporter wiring; currently has no effect
`METRICS_ENABLED`	`true`	Enable Prometheus metrics
`METRICS_PATH`	`/metrics`	Metrics endpoint path
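
The precedence rule (environment variable over config file value, then the built-in default) can be sketched as below. The function `resolve` is hypothetical, and where command-line flags sit in the order is not stated in this README, so they are left out.

```go
package main

import (
	"fmt"
	"os"
)

// resolve returns the effective value of one setting. lookup has the same
// signature as os.LookupEnv so tests can substitute a fake environment.
func resolve(lookup func(string) (string, bool), envKey, fileValue, def string) string {
	if v, ok := lookup(envKey); ok && v != "" {
		return v // environment variable wins
	}
	if fileValue != "" {
		return fileValue // then the value from config.yaml
	}
	return def // finally the documented default
}

func main() {
	// With HTTP_PORT unset and no file value, this prints the default 8080.
	fmt.Println(resolve(os.LookupEnv, "HTTP_PORT", "", "8080"))
}
```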

Supported Formats

Input: PDF, plain text (.txt), Markdown (.md), HTML (.html)

Output: EPUB 2.0

PDF handling

pdftotext is used for text extraction. If garbled Central European encoding is detected (common in Bosnian/Croatian/Serbian PDFs where the font maps CE characters to ASCII positions), the service falls back to OCR at 300 DPI. On macOS, Apple Vision OCR is used and returns per-word bounding boxes used for figure detection. On Linux, Tesseract is the fallback. Language priority for both engines: srp_latn+hrv → srp_latn → eng.

For native-text PDFs (no OCR needed), pdfimages extracts any embedded raster images directly.

Footnotes embedded at the bottom of pages (separated by form feeds) are detected and stripped from the extracted text.

Figure detection (macOS / Vision OCR only)

Two strategies are used to locate figures on each OCR'd page:

Gap-based — a vertical gap larger than 4% of page height followed by a "Fig. N" caption. The figure region spans full page width.
Side-by-side — all text above a "Fig. N" caption sits in a right column (x > 40% of page width), indicating the figure occupies the left column.

Cropped figure images are embedded in the EPUB at the paragraph where the caption appears. Each image carries its original width as a fraction of the page width so it renders at proportional size (e.g. a half-page figure renders at 50% column width).
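
The gap-based strategy can be sketched as a single pass over text lines sorted top to bottom. This is an assumption-laden illustration: the `Line` and `Region` types, the normalized coordinates (0 at the top of the page, 1 at the bottom), and the caption pattern are all hypothetical, and the side-by-side strategy is not covered.

```go
package main

import (
	"fmt"
	"regexp"
)

// Line is a hypothetical OCR text line whose vertical extent is expressed as
// fractions of the page height (0 = top edge, 1 = bottom edge).
type Line struct {
	Text        string
	Top, Bottom float64
}

// Region is a vertical band of the page to crop, plus the width fraction
// used to scale the image in the EPUB.
type Region struct {
	Top, Bottom   float64
	WidthFraction float64
}

var captionRe = regexp.MustCompile(`^Fig\.\s*\d+`)

const minGap = 0.04 // gap threshold: 4% of page height

// gapFigures reports every vertical gap larger than minGap that is
// immediately followed by a "Fig. N" caption line.
func gapFigures(lines []Line) []Region {
	var out []Region
	prevBottom := 0.0
	for _, l := range lines {
		if l.Top-prevBottom > minGap && captionRe.MatchString(l.Text) {
			// Gap-based figures span the full page width.
			out = append(out, Region{Top: prevBottom, Bottom: l.Top, WidthFraction: 1.0})
		}
		prevBottom = l.Bottom
	}
	return out
}

func main() {
	lines := []Line{
		{"Some body text.", 0.05, 0.08},
		{"Fig. 3 Cross-section of the valve", 0.40, 0.43},
		{"More body text.", 0.44, 0.47},
	}
	fmt.Println(gapFigures(lines))
}
```

Requiring the caption after the gap keeps ordinary whitespace, such as the space before a new section, from being cropped as a figure.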

Chapter detection

Chapters are split on headings matching keywords like Glava, Poglavlje, Chapter, standalone Roman numerals, or known section names (UVOD, PREDGOVOR, POGOVOR, etc.). If no headings are found, the text is split every 1500 words.
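
A simplified version of this splitting logic is sketched below. The regular expression only approximates the heading rules listed above (three keywords, Roman numerals, three section names), and `splitChapters` is a hypothetical name rather than the service's function.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// headingRe is an illustrative approximation: a keyword at the start of the
// line, a standalone Roman numeral, or one of a few known section names.
var headingRe = regexp.MustCompile(
	`^(?:(?i:glava|poglavlje|chapter)\b.*|[IVXLCDM]+\.?|UVOD|PREDGOVOR|POGOVOR)$`)

// splitChapters starts a new chapter at every heading line. If the text has
// no headings at all, it falls back to chunks of chunkWords words.
func splitChapters(text string, chunkWords int) []string {
	var chapters, cur []string
	found := false
	for _, line := range strings.Split(text, "\n") {
		if headingRe.MatchString(strings.TrimSpace(line)) {
			found = true
			if len(cur) > 0 {
				chapters = append(chapters, strings.Join(cur, "\n"))
				cur = nil
			}
		}
		cur = append(cur, line)
	}
	if found {
		return append(chapters, strings.Join(cur, "\n"))
	}
	if chunkWords <= 0 {
		return []string{text}
	}
	// Fallback: fixed-size word chunks (line breaks are not preserved here).
	words := strings.Fields(text)
	for start := 0; start < len(words); start += chunkWords {
		end := start + chunkWords
		if end > len(words) {
			end = len(words)
		}
		chapters = append(chapters, strings.Join(words[start:end], " "))
	}
	return chapters
}

func main() {
	text := "UVOD\nuvodni tekst\nGlava 1\nprvi dio\nII\ndrugi dio"
	for i, ch := range splitChapters(text, 1500) {
		fmt.Printf("chapter %d: %q\n", i+1, ch)
	}
}
```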

Metrics

The service exposes Prometheus metrics at /metrics (configurable via METRICS_PATH):

Metric	Type	Description
`http_requests_total`	Counter	Total HTTP requests by method, path, status
`http_request_duration_seconds`	Histogram	HTTP request duration
`documents_processed_total`	Counter	Documents processed (success/error)
`documents_in_progress`	Gauge	Documents currently being processed
`ocr_calls_total`	Counter	Total OCR API calls
`ocr_processing_duration_seconds`	Histogram	OCR processing duration
`http_active_requests`	Gauge	Currently active HTTP requests

Kubernetes

Resource limits are derived from actual workload measurements:

Resource	Value	Reasoning
CPU request	1000m	GOMAXPROCS=2 OCR workers + Go scheduler at steady state
CPU limit	2000m	2 Tesseract processes × 1 OMP thread = 2 CPUs peak
Memory request	768Mi	~250 MB heap per in-flight request × 2 concurrent
Memory limit	2Gi	Headroom for 3 concurrent max-size (200 MB) requests
`/tmp` sizeLimit	1Gi	200 MB upload + ~50 MB pdftoppm PNGs per request × ~4 concurrent

Why OMP_NUM_THREADS=1 matters: Tesseract defaults to using all available CPU threads via OpenMP. Without this setting, each of the two concurrent Tesseract workers (GOMAXPROCS=2) tries to claim every CPU, which causes severe CFS throttling and 2–3× slower OCR.

Why /tmp is disk-backed: On cgroupsv2 (all modern k8s nodes), medium: Memory tmpfs usage counts against the container memory cgroup. A memory-backed /tmp with a 2 Gi sizeLimit and a 2 Gi container memory limit leaves almost nothing for the Go heap, causing OOM kills under realistic load. The disk-backed emptyDir keeps temp file I/O off the memory budget entirely.

Other K8s features:

Horizontal Pod Autoscaler — scales 1–20 replicas; CPU target 70% of 2000m (~1400m, triggers when a second OCR request arrives while the first is active)
Liveness/Readiness Probes — HTTP GET / on port 8080; terminationGracePeriodSeconds: 40 gives the 15 s gRPC + 10 s HTTP shutdown sequence room to complete
In-process cleanup — EPUBs older than 24 hours are removed automatically by a background goroutine
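
One pass of the in-process cleanup could look like the sketch below. The names `removeOlderThan` and `demo` are hypothetical; in the service the pass would presumably run from a background goroutine on a timer, which is omitted here to keep the example short.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// removeOlderThan deletes regular files in dir whose modification time is
// more than retention before now, and returns how many were removed.
func removeOlderThan(dir string, retention time.Duration, now time.Time) (int, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, err
	}
	removed := 0
	for _, e := range entries {
		info, err := e.Info()
		if err != nil || !info.Mode().IsRegular() {
			continue // skip directories and entries that vanished meanwhile
		}
		if now.Sub(info.ModTime()) > retention {
			if err := os.Remove(filepath.Join(dir, e.Name())); err == nil {
				removed++
			}
		}
	}
	return removed, nil
}

// demo builds a throwaway directory with one stale and one fresh file and
// runs a single cleanup pass with a 24-hour retention.
func demo() (int, error) {
	dir, err := os.MkdirTemp("", "epub-cleanup-")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	stale := filepath.Join(dir, "old.epub")
	fresh := filepath.Join(dir, "new.epub")
	for _, p := range []string{stale, fresh} {
		if err := os.WriteFile(p, []byte("x"), 0o600); err != nil {
			return 0, err
		}
	}
	old := time.Now().Add(-25 * time.Hour)
	if err := os.Chtimes(stale, old, old); err != nil {
		return 0, err
	}
	return removeOlderThan(dir, 24*time.Hour, time.Now())
}

func main() {
	n, err := demo()
	fmt.Println("removed:", n, "error:", err)
}
```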

Project Structure

.
├── cmd/
│   └── server/
│       └── main.go           # Entry point, server startup, cleanup loop
├── ocr/
│   └── vision_ocr.swift      # macOS Vision OCR binary (compiled at startup)
├── pb/
│   └── document.proto        # Protocol Buffer definitions
├── internal/
│   ├── config/
│   │   └── config.go         # YAML config, env vars, CLI flags
│   ├── model/
│   │   ├── document.go       # PDF extraction, OCR, footnote processing, EPUB generation
│   │   └── figures.go        # Vision OCR bounding-box figure detection and cropping
│   ├── handler/
│   │   ├── handler.go        # gRPC handlers
│   │   ├── web.go            # HTTP handlers, SSE streaming, EPUB zip writer, web UI
│   │   └── middleware/
│   │       └── auth.go       # Basic auth middleware
│   ├── metrics/
│   │   └── metrics.go        # Prometheus metrics
│   ├── tracing/
│   │   └── tracing.go        # OpenTelemetry tracing
│   └── errors/
│       └── errors.go         # Error types and handling
├── deploy/
│   └── k8s.yaml              # Kubernetes manifests (Deployment, Services, HPA, ConfigMap)
├── Dockerfile
├── Makefile
├── .golangci.yml
├── go.mod
├── go.sum
├── stop                      # Stop script (sends SIGTERM via PID file)
└── README.md

Tech Stack

Go 1.26 - Programming language
gRPC / Protocol Buffers - RPC interface
pdftotext / pdftoppm / pdfimages - PDF text extraction and rendering (poppler-utils)
Apple Vision - Primary OCR engine on macOS (provides bounding boxes for figure detection)
Tesseract OCR - OCR fallback on Linux / when Vision is unavailable
Prometheus - Metrics collection
OpenTelemetry - Distributed tracing
Kubernetes - Container orchestration

Development

# Run tests
make test

# Lint code
make lint

# Build binary
make build

# Full build (lint + test + build)
make all

License

MIT
