feat(ocr): serialize uploads via a persistent on-disk queue #85
Open
PhilippMundhenk wants to merge 4 commits into
Concurrent scans previously each spawned a backgrounded `curl` to the OCR microservice. With N nearly-simultaneous scans the uploads compete for the same egress bandwidth and effective per-stream throughput collapses (MB/s -> ~100 KB/s in observed logs), so OCR appears to "hang" for many minutes. The OCR backend itself is also single-threaded, so the fan-out never wins anything.

This change introduces a single-consumer queue:

* `script/ocr_enqueue.sh` writes a small KEY=VALUE job file via a hidden temp + atomic rename, so a partial write can never be picked up. `scantofile` and `scanRear` now call this instead of spawning curl themselves.
* `script/ocr_worker.sh` polls `pending/`, atomically claims the oldest job via `mv pending -> in_progress`, increments ATTEMPTS on disk *before* uploading (so a crash mid-upload is accounted for), then either deletes the job on success, requeues with linear backoff, or moves to `failed/` after `OCR_MAX_ATTEMPTS` (default 5). On startup the worker first sweeps any stranded `in_progress/` files back to `pending/` so a container restart loses nothing.
* `files/runScanner.sh` launches the worker under a simple bash supervisor (`while true; do worker; sleep 5; done`) so a worker crash auto-restarts and `OCR_*` / `FTP_*` / notification env vars are inherited.

Coordination uses only `rename(2)`; no `inotify`, no `flock`. That keeps the queue safe when `/scans` is mounted from SMB or another filesystem with limited semantics.

Queue state lives under `/scans/.ocr_queue/` and therefore survives container restarts.

Tunables (all env vars, all optional):

  OCR_QUEUE_DIR            default /scans/.ocr_queue
  OCR_QUEUE_POLL_SECONDS   default 5
  OCR_MAX_ATTEMPTS         default 5

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
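The rename-only coordination described in this commit message can be sketched in a few lines of bash. This is a minimal illustration under stated assumptions, not the code in `script/ocr_enqueue.sh` or `script/ocr_worker.sh`: the function names (`queue_init`, `enqueue_job`, `claim_oldest`, `recover_stranded`), the `.job` suffix, the `INPUT` key and the file-name format are all hypothetical.

```bash
# Sketch only: every function name, the .job suffix and the INPUT key are
# assumptions made for illustration; they are not taken from the PR's scripts.
QUEUE_DIR="${OCR_QUEUE_DIR:-/scans/.ocr_queue}"

queue_init() {
  mkdir -p "$QUEUE_DIR/pending" "$QUEUE_DIR/in_progress" "$QUEUE_DIR/failed"
}

# Stage the job under a hidden temp name, then rename it into place.
# The claim loop below only globs *.job, so it never sees a half-written file.
enqueue_job() {
  local input_pdf="$1"
  local name tmp
  name="$(date +%Y%m%dT%H%M%S)_$$_${RANDOM}.job"   # sortable timestamp prefix
  tmp="$QUEUE_DIR/pending/.${name}.tmp"
  printf 'INPUT=%s\nATTEMPTS=0\n' "$input_pdf" > "$tmp"
  mv "$tmp" "$QUEUE_DIR/pending/$name"
  printf '%s\n' "$name"
}

# Claim the oldest pending job by moving it into in_progress/.
# Glob results are sorted, so the earliest timestamp prefix is tried first.
# Prints the claimed path, or returns 1 when nothing is pending.
claim_oldest() {
  local job
  for job in "$QUEUE_DIR"/pending/*.job; do
    [ -e "$job" ] || continue        # an unmatched glob stays literal
    if mv "$job" "$QUEUE_DIR/in_progress/" 2>/dev/null; then
      printf '%s\n' "$QUEUE_DIR/in_progress/$(basename "$job")"
      return 0
    fi
  done
  return 1
}

# Startup sweep: anything still in in_progress/ was interrupted, so requeue it.
recover_stranded() {
  local job
  for job in "$QUEUE_DIR"/in_progress/*.job; do
    [ -e "$job" ] || continue
    mv "$job" "$QUEUE_DIR/pending/"
  done
}
```

A worker loop built on this sketch would call `recover_stranded` once at startup and then alternate between `claim_oldest` and a sleep of `OCR_QUEUE_POLL_SECONDS`.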
ocr_enqueue.bats (5 cases): argument validation, job file format, atomic-rename hygiene (no leftover .tmp), sortable timestamp prefix, FIFO ordering of two enqueues.

ocr_worker.bats (6 cases):

* startup recovery sweeps in_progress/ back to pending/;
* successful upload removes the job and writes the output PDF;
* REMOVE_ORIGINAL_AFTER_OCR removes the source PDF on success;
* repeated curl failures land in failed/ with ATTEMPTS=MAX;
* missing OCR_SERVER/PORT/PATH fails the job rather than looping forever;
* missing input PDF fails the job.

Worker tests use a copy of the script with the linear backoff sed'd down to zero and a short OCR_QUEUE_POLL_SECONDS so a full failure cycle completes in seconds.

To make the worker testable without a writable /scans, add an OCR_OUTPUT_DIR env var (defaults to /scans, override in tests to a sandbox dir).

tests/helpers/common.bash is identical to the file added in #86 so a later merge of either branch is a clean no-op.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
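The failure cycle that the worker tests exercise (ATTEMPTS bumped on disk before the upload, requeue with linear backoff, then `failed/` at the limit) can be sketched roughly as follows. It is an illustration under assumptions rather than the worker's actual code: `process_job`, `bump_attempts`, `upload_job` and `BACKOFF_UNIT_SECONDS` are hypothetical names, the upload is a stub that always fails, and sleeping inside the worker before requeueing is only one possible way to apply the backoff.

```bash
# Sketch only: process_job, bump_attempts, upload_job and BACKOFF_UNIT_SECONDS
# are hypothetical names, not identifiers from script/ocr_worker.sh.
QUEUE_DIR="${OCR_QUEUE_DIR:-/scans/.ocr_queue}"
MAX_ATTEMPTS="${OCR_MAX_ATTEMPTS:-5}"
BACKOFF_UNIT_SECONDS="${BACKOFF_UNIT_SECONDS:-30}"   # a test would set this to 0

# Stand-in for the real curl upload; it always fails so the retry path runs.
upload_job() { return 1; }

# Increment the ATTEMPTS line of a job file via temp file + rename and print
# the new value. Assumes the job file already contains an ATTEMPTS= line.
bump_attempts() {
  local job="$1" attempts
  attempts="$(sed -n 's/^ATTEMPTS=//p' "$job")"
  attempts=$(( ${attempts:-0} + 1 ))
  sed "s/^ATTEMPTS=.*/ATTEMPTS=${attempts}/" "$job" > "${job}.tmp"
  mv "${job}.tmp" "$job"
  printf '%s\n' "$attempts"
}

# Handle one job that has already been claimed into in_progress/.
process_job() {
  local job="$1" attempts
  attempts="$(bump_attempts "$job")"   # persisted before the upload starts
  if upload_job "$job"; then
    rm -f "$job"                       # success: the job is done
    return 0
  fi
  if [ "$attempts" -ge "$MAX_ATTEMPTS" ]; then
    mv "$job" "$QUEUE_DIR/failed/"     # give up and keep the job for inspection
    return 1
  fi
  sleep $(( attempts * BACKOFF_UNIT_SECONDS ))   # linear backoff
  mv "$job" "$QUEUE_DIR/pending/"      # requeue for another attempt
  return 1
}
```

Because the counter is written before `upload_job` runs, a crash in the middle of an upload still counts as an attempt, which is what keeps a poison job from looping forever.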
Summary
* Concurrent scans previously each spawned a backgrounded `curl` upload to the OCR microservice. With several scans in quick succession the uploads compete for the same egress bandwidth and per-stream throughput collapses (logs show MB/s → ~100 KB/s), which makes OCR look hung for many minutes. The OCR backend is also single-threaded, so the fan-out wins nothing.
* Replaces the `curl` blocks in `scantofile-0.2.4-1.sh` and `scanRear.sh` with an `ocr_enqueue.sh` helper that writes a small job file via a hidden-temp + atomic rename.
* Adds `ocr_worker.sh`, a single-consumer poller that claims the oldest pending job via `mv pending → in_progress`, increments `ATTEMPTS` on disk before uploading, then either deletes on success, requeues with linear backoff, or moves to `failed/` after `OCR_MAX_ATTEMPTS` (default 5). On startup it first sweeps any stranded `in_progress/` files back to `pending/` so a container restart loses nothing.
* Launches the worker from `runScanner.sh` under a tiny bash supervisor (`while true; do worker; sleep 5; done`) so a crashed worker auto-restarts and inherits the existing `OCR_*` / `FTP_*` / notification env vars.

Design constraints honoured
* No `inotify`, no `flock`: coordination is `rename(2)` only, so the queue is safe on SMB-mounted `/scans` and other filesystems with limited semantics.
* Queue state lives under `/scans/.ocr_queue/{pending,in_progress,failed}`, which is on the persistent volume. Recovery sweep on worker start.
* Every handoff is a `mv` within the same directory tree; enqueue stages to a hidden `.…tmp` file before renaming into `pending/`.
* Linear backoff (`attempts * 30 s`), max controlled by `OCR_MAX_ATTEMPTS`; ATTEMPTS is persisted before the upload so a poison job can't loop forever.
* The worker is supervised by a `while true; sleep 5` wrapper in `runScanner.sh`.

Tunables (env, all optional)
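| Variable | Default | Notes |
| --- | --- | --- |
| `OCR_QUEUE_DIR` | `/scans/.ocr_queue` | queue root directory |
| `OCR_QUEUE_POLL_SECONDS` | `5` | poll interval for `pending/` |
| `OCR_MAX_ATTEMPTS` | `5` | attempts before a job moves to `failed/` |

Test plan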
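* Enqueue writes the job via `.tmp` + atomic rename.
* A repeatedly failing job moves to `failed/` after `OCR_MAX_ATTEMPTS`.
* A stranded `in_progress/` file is recovered to `pending/` on worker start.
* … `pending/`.
* With jobs left in `pending/` and `in_progress/`, verify everything drains after startup.
* Downstream steps (`trigger_inotify`, `trigger_telegram`, `sendtoftps`, `REMOVE_ORIGINAL_AFTER_OCR`) still fire correctly from the worker.

Notes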
* This branch is based on `master` and is independent of fix(scanRear): guard kill against missing or stale scan_pid #84 (`fix/scanrear-pid-kill-guard`). Both touch `scanRear.sh` in non-overlapping regions so they should merge cleanly in either order.

🤖 Generated with Claude Code