feat(ocr): serialize uploads via a persistent on-disk queue#85

Open
PhilippMundhenk wants to merge 4 commits into master from feature/ocr-queue-worker

Conversation

@PhilippMundhenk
Owner

Summary

  • Today each scan spawns a backgrounded curl upload to the OCR microservice. When several scans arrive in quick succession, the uploads compete for the same egress bandwidth and per-stream throughput collapses (logs show MB/s → ~100 KB/s), which makes OCR look hung for many minutes. The OCR backend is also single-threaded, so the fan-out gains nothing.
  • Replaces the inline curl blocks in scantofile-0.2.4-1.sh and scanRear.sh with an ocr_enqueue.sh helper that writes a small job file via a hidden-temp + atomic rename.
  • Adds ocr_worker.sh, a single-consumer poller that claims the oldest pending job via mv pending → in_progress and increments ATTEMPTS on disk before uploading. After the upload it does one of three things: deletes the job on success, requeues it with linear backoff on failure, or moves it to failed/ once OCR_MAX_ATTEMPTS (default 5) is reached. On startup it first sweeps any stranded in_progress/ files back to pending/ so a container restart loses nothing.
  • Wires the worker into runScanner.sh under a tiny bash supervisor (while true; do worker; sleep 5; done) so a crashed worker auto-restarts and inherits the existing OCR_* / FTP_* / notification env vars.
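
For reviewers, the enqueue step described above can be sketched roughly as follows. This is a minimal illustration, not the contents of ocr_enqueue.sh: the function name `enqueue_job`, the `INPUT` key, the exact temp-file name, and the one-second timestamp resolution are all assumptions made for the example.

```shell
# Sketch only: enqueue_job, the INPUT key and the file naming are hypothetical.
enqueue_job() {
  local queue_dir="$1" input_pdf="$2"
  local pending="$queue_dir/pending"
  mkdir -p "$pending" || return 1

  # Sortable timestamp prefix so lexical order matches enqueue order.
  # (Assumed format; one-second resolution is a simplification.)
  local name
  name="$(date +%Y%m%d%H%M%S)_${input_pdf##*/}.job"

  # Stage under a hidden name; a worker that globs pending/* never sees it.
  local tmp="$pending/.${name}.tmp"
  printf 'INPUT=%s\nATTEMPTS=0\n' "$input_pdf" > "$tmp" || return 1

  # rename(2) inside one directory: the job shows up fully written or not at all.
  mv "$tmp" "$pending/$name" || return 1
  printf '%s\n' "$pending/$name"
}
```

Because the temp file and the final job live in the same directory, the final `mv` never crosses a filesystem boundary, which is what keeps the rename a single step.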

Design constraints honoured

  • No inotify, no flock — coordination is rename(2) only, so the queue is safe on SMB-mounted /scans and other filesystems with limited semantics.
  • Survives restarts — queue lives at /scans/.ocr_queue/{pending,in_progress,failed}, which is on the persistent volume. Recovery sweep on worker start.
  • Atomic — every state transition is a single mv within the same directory tree; enqueue stages to a hidden .…tmp file before renaming into pending/.
  • Retries — linear backoff (attempts * 30 s), max controlled by OCR_MAX_ATTEMPTS; ATTEMPTS is persisted before the upload so a poison job can't loop forever.
  • Supervised — bash while true; sleep 5 wrapper in runScanner.sh.
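
To make the claim and retry rules above concrete, here is a hedged sketch of how they could fit together. The helper names (`claim_oldest`, `bump_attempts`, `backoff_seconds`, `settle_failure`) are hypothetical and the real ocr_worker.sh may be structured differently; only the directory names, the `attempts * 30 s` backoff and `OCR_MAX_ATTEMPTS` come from this PR.

```shell
# Sketch only: all function names here are hypothetical.

# Claim the oldest pending job by renaming it into in_progress/.
claim_oldest() {
  local queue_dir="$1" job
  mkdir -p "$queue_dir/pending" "$queue_dir/in_progress"
  # Glob results are sorted, and * skips hidden staging files.
  for job in "$queue_dir"/pending/*; do
    [ -f "$job" ] || continue
    if mv "$job" "$queue_dir/in_progress/" 2>/dev/null; then
      printf '%s\n' "$queue_dir/in_progress/${job##*/}"
      return 0
    fi
  done
  return 1
}

# Persist ATTEMPTS+1 before the upload, again via temp file + rename.
bump_attempts() {
  local job="$1" n tmp
  n="$(sed -n 's/^ATTEMPTS=//p' "$job")"
  n=$(( ${n:-0} + 1 ))
  tmp="${job%/*}/.${job##*/}.tmp"
  { grep -v '^ATTEMPTS=' "$job"; printf 'ATTEMPTS=%s\n' "$n"; } > "$tmp" || return 1
  mv "$tmp" "$job" || return 1
  printf '%s\n' "$n"
}

# Linear backoff: attempts * 30 seconds.
backoff_seconds() {
  printf '%s\n' $(( $1 * 30 ))
}

# After a failed upload: park the job in failed/ or hand it back to pending/.
settle_failure() {
  local queue_dir="$1" job="$2" attempts="$3" max="${OCR_MAX_ATTEMPTS:-5}"
  if [ "$attempts" -ge "$max" ]; then
    mkdir -p "$queue_dir/failed"
    mv "$job" "$queue_dir/failed/" && printf 'failed\n'
  else
    mv "$job" "$queue_dir/pending/" && printf 'requeue %s\n' "$(backoff_seconds "$attempts")"
  fi
}
```

Writing the incremented counter before the upload means a crash during the upload still counts as an attempt, which is what bounds a poison job.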

Tunables (env, all optional)

| Var | Default | Notes |
| --- | --- | --- |
| OCR_QUEUE_DIR | /scans/.ocr_queue | Anywhere on the persistent volume |
| OCR_QUEUE_POLL_SECONDS | 5 | How often the worker checks pending/ |
| OCR_MAX_ATTEMPTS | 5 | Then the job moves to failed/ |
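
As an illustration of how these optional variables might be resolved, the sketch below applies the documented defaults with standard parameter expansion. The function name `load_ocr_tunables` and the integer check are assumptions added for the example, not something the PR states.

```shell
# Sketch only: load_ocr_tunables and the validation step are hypothetical.
load_ocr_tunables() {
  # ':=' assigns the default only when the variable is unset or empty.
  : "${OCR_QUEUE_DIR:=/scans/.ocr_queue}"
  : "${OCR_QUEUE_POLL_SECONDS:=5}"
  : "${OCR_MAX_ATTEMPTS:=5}"

  # Reject non-numeric values early instead of failing inside the poll loop.
  local v
  for v in "$OCR_QUEUE_POLL_SECONDS" "$OCR_MAX_ATTEMPTS"; do
    case "$v" in
      ''|*[!0-9]*) echo "expected a non-negative integer, got '$v'" >&2; return 1 ;;
    esac
  done
}
```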

Test plan

  • Local smoke: enqueue script writes the expected KEY=VALUE file via .tmp + atomic rename.
  • Local smoke: worker fails repeatedly against an unreachable OCR endpoint, increments ATTEMPTS, lands the job in failed/ after OCR_MAX_ATTEMPTS.
  • Local smoke: orphaned in_progress/ file is recovered to pending/ on worker start.
  • Container test: scan 5+ documents in quick succession, observe single in-flight upload at a time, watch queue drain.
  • Container test: kill the worker mid-upload, verify supervisor relaunches and recovery moves the job back to pending/.
  • Container test: restart container with items in pending/ and in_progress/, verify everything drains after startup.
  • Container test: confirm post-OCR notifications (trigger_inotify, trigger_telegram, sendtoftps, REMOVE_ORIGINAL_AFTER_OCR) still fire correctly from the worker.
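
The recovery case in the plan above can be reproduced by hand with something like the following. It is a sketch under assumptions: `recover_in_progress` is a hypothetical stand-in for the worker's startup sweep, and the sandbox path is whatever `mktemp -d` returns.

```shell
# Sketch only: recover_in_progress is a hypothetical name for the startup sweep.
recover_in_progress() {
  local queue_dir="$1" job count=0
  mkdir -p "$queue_dir/pending" "$queue_dir/in_progress"
  for job in "$queue_dir"/in_progress/*; do
    [ -f "$job" ] || continue
    # Same directory tree, so this is a plain rename back into pending/.
    mv "$job" "$queue_dir/pending/" && count=$((count + 1))
  done
  # Report how many stranded jobs were put back.
  printf '%s\n' "$count"
}
```

A manual check would create a file under `in_progress/` in a sandbox directory, call `recover_in_progress` on that directory, and confirm the file now sits in `pending/`.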

Notes

🤖 Generated with Claude Code

PhilippMundhenk and others added 2 commits May 17, 2026 18:07
Concurrent scans previously each spawned a backgrounded `curl` to the OCR
microservice. With N nearly-simultaneous scans the uploads compete for
the same egress bandwidth and effective per-stream throughput collapses
(MB/s -> ~100 KB/s in observed logs), so OCR appears to "hang" for many
minutes. The OCR backend itself is also single-threaded, so the fan-out
never wins anything.

This change introduces a single-consumer queue:

* `script/ocr_enqueue.sh` writes a small KEY=VALUE job file via a
  hidden temp + atomic rename, so a partial write can never be picked
  up. `scantofile` and `scanRear` now call this instead of spawning
  curl themselves.

* `script/ocr_worker.sh` polls `pending/`, atomically claims the
  oldest job via `mv pending -> in_progress`, increments ATTEMPTS on
  disk *before* uploading (so a crash mid-upload is accounted for),
  then either deletes the job on success, requeues with linear
  backoff, or moves to `failed/` after `OCR_MAX_ATTEMPTS` (default 5).
  On startup the worker first sweeps any stranded `in_progress/`
  files back to `pending/` so a container restart loses nothing.

* `files/runScanner.sh` launches the worker under a simple bash
  supervisor (`while true; do worker; sleep 5; done`) so a worker
  crash auto-restarts and `OCR_*` / `FTP_*` / notification env vars
  are inherited.

Coordination uses only `rename(2)`; no `inotify`, no `flock`. That
keeps the queue safe when `/scans` is mounted from SMB or another
filesystem with limited semantics. Queue state lives under
`/scans/.ocr_queue/` and therefore survives container restarts.

Tunables (all env vars, all optional):
  OCR_QUEUE_DIR             default /scans/.ocr_queue
  OCR_QUEUE_POLL_SECONDS    default 5
  OCR_MAX_ATTEMPTS          default 5

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PhilippMundhenk and others added 2 commits May 17, 2026 19:36
ocr_enqueue.bats (5 cases): argument validation, job file format,
atomic-rename hygiene (no leftover .tmp), sortable timestamp prefix,
FIFO ordering of two enqueues.

ocr_worker.bats (6 cases): startup recovery sweeps in_progress/ back
to pending/; successful upload removes the job and writes the output
PDF; REMOVE_ORIGINAL_AFTER_OCR removes the source PDF on success;
repeated curl failures land in failed/ with ATTEMPTS=MAX; missing
OCR_SERVER/PORT/PATH fails the job rather than looping forever;
missing input PDF fails the job. Worker tests use a copy of the
script with the linear backoff sed'd down to zero and a short
OCR_QUEUE_POLL_SECONDS so a full failure cycle completes in seconds.

To make the worker testable without a writable /scans, add an
OCR_OUTPUT_DIR env var (defaults to /scans, override in tests to a
sandbox dir).

tests/helpers/common.bash is identical to the file added in #86 so a
later merge of either branch is a clean no-op.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>