> **Note:** This doc is generated by AI because what else can we use.
ledoxide is a specialized, client-polling HTTP server that implements a Vision-Language Model (VLM) based bookkeeping and expense-extraction workflow. Its primary goal is to process images of receipts, invoices, or screenshotted transaction records (e.g., social media purchase notifications) and autonomously extract structured billing data: descriptions (notes), exact monetary amounts, and an appropriate expense category.
The application is containerized and readily available via Docker.
Because ledoxide runs heavy Vision-Language Models through llama.cpp, running the container with NVIDIA GPU support (`--gpus all`) is highly recommended.
```shell
docker run -p 3100:3100 \
  --gpus all \
  -e HF_TOKEN="your_huggingface_token" \
  -e AUTH_KEY="your_secret_bearer_token" \
  -v $HOME/.cache/huggingface:/huggingface \
  zhufucdev/ledoxide:latest
```

| Variable | Description |
|---|---|
| `AUTH_KEY` | Used as the Bearer token to protect endpoints. If not provided via flag or env var, a random key is generated and logged on startup. |
| `HF_HOME` | Directory for the Hugging Face cache (defaults to `/huggingface` inside the Docker image). Crucial for caching the heavy LLM/VLM models between container restarts. |
| `HF_TOKEN` | Required for downloading gated models from Hugging Face, and useful for avoiding rate limits. |
| `HF_ENDPOINT` | Sets a custom Hugging Face proxy endpoint. |
| `RUST_LOG` | Set to `debug` to enable verbose logging, including the underlying llama.cpp inference logs. |
When running natively or overriding the Docker command, the following arguments are supported:
- `-b, --bind <BIND>`: The address to bind to (default: `127.0.0.1:3100`).
- `-c, --categories <CATEGORIES>`: A list of valid categories for expenses (defaults: Groceries, Transport, Rent, Entertainment, Shopping, Drink, Food).
- `--max-concurrency <N>`: Maximum number of models to run simultaneously (default: 4).
- `--large-model`: Instructs the server to use a larger vision model configuration.
- `--model-timeout-minutes <MINS>`: Time before an inactive model is evicted from RAM/VRAM to save resources (default: 5).
- `--offline`: Prevents reaching out to Hugging Face; forces the use of locally cached models only.
The server exposes a simple REST API:
- `GET /`: Returns the server package name and version string.
- `POST /create_task`: Accepts a `multipart/form-data` payload containing an image file (key: `image`) and, optionally, `lm_sampling` and `vlm_sampling` JSON parameters. Requires an `Authorization: Bearer <AUTH_KEY>` header. Returns a JSON `TaskControlBlock` containing a unique task ID, indicating the task is pending.
- `GET /get_task/{task_id}`: Checks the status of a specific task by ID. Requires an `Authorization: Bearer <AUTH_KEY>` header. Returns the task state (`pending`, `running`, or `finished`). If `finished`, the response includes the extracted structured data: `notes`, `amount`, and `category`.
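The create-then-poll flow can be sketched with a minimal stdlib-only Python client. Note the assumptions here: the base URL points at a local deployment, the `TaskControlBlock` is assumed to expose the task ID under an `id` field, and the task state is assumed to live in a `state` field; the docs above confirm only the endpoints, the `image` form key, and the Bearer header.

```python
# Minimal polling client for the ledoxide task API (a sketch, not the
# official client). Field names "id" and "state" are assumptions.
import json
import time
import urllib.request
import uuid

BASE_URL = "http://127.0.0.1:3100"        # assumed local deployment
AUTH_KEY = "your_secret_bearer_token"

def encode_multipart(image_bytes: bytes, filename: str = "receipt.jpg"):
    """Build a multipart/form-data body with the image under the "image" key."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="image"; filename="{filename}"\r\n'
        f"Content-Type: image/jpeg\r\n\r\n"
    ).encode() + image_bytes + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def create_task(image_bytes: bytes) -> dict:
    """POST the image to /create_task and return the TaskControlBlock JSON."""
    body, content_type = encode_multipart(image_bytes)
    req = urllib.request.Request(
        f"{BASE_URL}/create_task",
        data=body,
        headers={
            "Authorization": f"Bearer {AUTH_KEY}",
            "Content-Type": content_type,
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def poll_task(task_id: str, interval: float = 2.0) -> dict:
    """Poll /get_task/{task_id} until the task reports a finished state."""
    req = urllib.request.Request(
        f"{BASE_URL}/get_task/{task_id}",
        headers={"Authorization": f"Bearer {AUTH_KEY}"},
    )
    while True:
        with urllib.request.urlopen(req) as resp:
            task = json.load(resp)
        if task.get("state") == "finished":
            return task
        time.sleep(interval)

# Usage (assumes a running server and a local receipt image):
#   with open("receipt.jpg", "rb") as f:
#       tcb = create_task(f.read())
#   result = poll_task(tcb["id"])
```

Polling with a short sleep matches the server's client-polling design: the task is queued immediately, and inference latency is absorbed by the `/get_task` loop rather than a long-lived HTTP request.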
- **Architecture**: The application is written in Rust, leveraging `tokio` for its async runtime and `axum` for HTTP routing.
- **Inference Engine**: It uses `llama-cpp-2` for efficient local inference and `hf-hub` for model distribution management. It heavily utilizes LLM grammar constraints (`llguidance` and `.lark` schema files) to strictly enforce output formatting, ensuring numbers are extracted cleanly and categories strictly match the configured list.
- **Model Pipeline**: The standard pipeline involves a vision model (defaulting to `Qwen3-VL-4B-Instruct-GGUF`) that extracts a highly detailed text description of the uploaded image. This description is then piped into a smaller text model (defaulting to `gemma-3-1b-it-qat-q4_0-gguf`) that runs targeted prompts to extract the summary notes, numeric amount, and category.
- **Model Memory Timeout**: To preserve system RAM and GPU VRAM, ledoxide wraps its loaded models in a `TimedModel` construct. If a model remains unused for the configurable timeout period (default: 5 minutes), it is automatically dropped from memory and seamlessly reloaded from disk on the next request.
- **Task Swapping**: To prevent the server's memory from bloating with historical task data over long uptimes, the internal `Scheduler` implements an on-disk swap queue. When the in-memory finished queue exceeds `--max-memory-size` (default: 468,000 items), older finished tasks are serialized using `postcard` and flushed to a temporary swap file on disk. The `/get_task` endpoint streams over both active memory and the disk swap seamlessly.
- **Hugging Face Mount**: To avoid redownloading multi-gigabyte `.gguf` files, it is vital to mount the `HF_HOME` cache to persistent host storage when using Docker.
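The task-swapping behaviour can be illustrated with a simplified, hypothetical sketch. The actual `Scheduler` is written in Rust and serializes with `postcard`; here a Python class with JSON lines in a temp file stands in, and the class and method names are invented for illustration only.

```python
# Illustration of a bounded finished-task queue that swaps overflow to disk,
# loosely modeled on ledoxide's Scheduler. All names here are hypothetical;
# JSON stands in for the postcard serialization used by the real server.
import json
import tempfile
from collections import OrderedDict

class SwappingQueue:
    def __init__(self, max_memory_size: int):
        self.max_memory_size = max_memory_size
        self.memory: "OrderedDict[str, dict]" = OrderedDict()
        self.swap = tempfile.NamedTemporaryFile(mode="w+", suffix=".swap")
        self.swap_index: dict = {}  # task_id -> byte offset in the swap file

    def push(self, task_id: str, task: dict) -> None:
        self.memory[task_id] = task
        # Once the in-memory queue exceeds its bound, flush the oldest
        # finished tasks to the swap file and remember their offsets.
        while len(self.memory) > self.max_memory_size:
            old_id, old_task = self.memory.popitem(last=False)
            self.swap.seek(0, 2)              # append at end of file
            self.swap_index[old_id] = self.swap.tell()
            self.swap.write(json.dumps(old_task) + "\n")
            self.swap.flush()

    def get(self, task_id: str):
        # Mirror /get_task: check active memory first, then the disk swap.
        if task_id in self.memory:
            return self.memory[task_id]
        offset = self.swap_index.get(task_id)
        if offset is None:
            return None
        self.swap.seek(offset)
        return json.loads(self.swap.readline())
```

For example, with `max_memory_size=2`, pushing a third finished task evicts the first to disk, yet `get` still returns it transparently, which is the property the real `/get_task` endpoint preserves across memory and swap.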
- **Task Removal**: Finished tasks remain in memory or in the on-disk swap file indefinitely. There is currently no API to delete or acknowledge a task to free its disk footprint once retrieved, so over extreme uptimes on busy servers the swap file could grow continuously.
- **CUDA Optimization**: Depending on your GPU architecture, the Docker image may log a warning about an unsupported `UPSCALE` operator in the `MTL0` backend during CLIP execution, though inference typically falls back gracefully. Flash Attention is enabled by default to reduce the memory footprint.
- **Gated Model**: Attempts to download Gemma 3 will fail without a valid `HF_TOKEN`. Visit a Gemma repository on Hugging Face to check whether you have been granted access.