
ledoxide

Note

This doc is AI-generated, because what else would we use?

ledoxide is a specialized, client-pull HTTP server that implements a Vision-Language Model (VLM) based bookkeeping and expense-extraction workflow. Its primary job is to process images of receipts, invoices, or screenshotted transaction records (e.g., social-media purchase notifications) and autonomously extract structured billing data: a description (notes), the exact monetary amount, and an appropriate expense category.

Usage

The application is containerized and readily available via Docker.

Running with Docker

Because ledoxide relies on heavy Vision-Language Models executing through llama.cpp, running the container with NVIDIA GPU support (--gpus all) is highly recommended.

docker run -p 3100:3100 \
  --gpus all \
  -e HF_TOKEN="your_huggingface_token" \
  -e AUTH_KEY="your_secret_bearer_token" \
  -v $HOME/.cache/huggingface:/huggingface \
  zhufucdev/ledoxide:latest

Environment Variables

  • AUTH_KEY: Used as the Bearer token to protect endpoints. If not provided via flag or env var, a random key is generated and logged on startup.
  • HF_HOME: Directory for the Hugging Face cache (defaults to /huggingface inside the Docker image). Crucial for keeping the heavy LLM/VLM models cached between container restarts.
  • HF_TOKEN: Required for downloading gated models from Hugging Face, and useful for avoiding rate limits.
  • HF_ENDPOINT: Sets a custom Hugging Face proxy endpoint.
  • RUST_LOG: Set to debug to enable verbose logging, including the underlying llama.cpp inference logs.

CLI Arguments

When running natively or overriding the Docker command, the following arguments are supported:

  • -b, --bind <BIND>: The address to bind to (default: 127.0.0.1:3100).
  • -c, --categories <CATEGORIES>: A list of valid categories for expenses (defaults: Groceries, Transport, Rent, Entertainment, Shopping, Drink, Food).
  • --max-concurrency <N>: Maximum number of models to run simultaneously (default: 4).
  • --large-model: Instructs the server to use a larger vision model configuration.
  • --model-timeout-minutes <MINS>: Time before an inactive model is evicted from RAM/VRAM to save resources (default: 5).
  • --offline: Prevents reaching out to Hugging Face; forces the use of locally cached models only.

API Endpoints

The server exposes a simple REST API:

  • GET /: Returns the server package name and version string.

  • POST /create_task: Accepts a multipart/form-data payload containing an image file (key: image) and, optionally, lm_sampling and vlm_sampling JSON parameters. Requires an Authorization: Bearer <AUTH_KEY> header. Returns a JSON TaskControlBlock containing a unique task ID; the task starts in the pending state.

  • GET /get_task/{task_id}: Checks the status of a specific task by ID. Requires an Authorization: Bearer <AUTH_KEY> header. Returns the task state (pending, running, or finished); once finished, the response includes the extracted structured data: notes, amount, and category.
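A typical client flow is: POST the image to /create_task, then poll /get_task/{task_id} until the task is finished. A minimal Python sketch follows; the `get_task` helper and the `"state"` response key are illustrative assumptions about the JSON shape, not the server's documented schema, and the polling helper takes an injected fetch callable so any HTTP client can be used.

```python
import json
import time
import urllib.request


def get_task(base_url, task_id, auth_key):
    """Hypothetical helper: fetch task state from GET /get_task/{task_id}."""
    req = urllib.request.Request(
        f"{base_url}/get_task/{task_id}",
        headers={"Authorization": f"Bearer {auth_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def poll_task(fetch_status, interval=1.0, max_attempts=30):
    """Poll a task until it reports 'finished' (assumed state field name).

    fetch_status: zero-argument callable returning the task-state dict,
    e.g. lambda: get_task(base_url, task_id, auth_key).
    """
    for _ in range(max_attempts):
        task = fetch_status()
        if task.get("state") == "finished":
            return task  # expected to carry notes, amount, category
        time.sleep(interval)
    raise TimeoutError("task did not finish in time")
```

Injecting the fetch callable keeps the retry logic independent of the HTTP library, which also makes it easy to exercise offline.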

Implementation Details

  • Architecture: The application is written in Rust, leveraging tokio for its async runtime and axum for HTTP routing.
  • Inference Engine: It uses llama-cpp-2 for efficient local inference and hf-hub for model distribution management. It heavily utilizes LLM grammar constraints (llguidance and .lark schema files) to strictly enforce output formatting (ensuring numbers are extracted cleanly and categories strictly match the configured list).
  • Model Pipeline: The standard pipeline involves a Vision Model (defaulting to Qwen3-VL-4B-Instruct-GGUF) that extracts a highly detailed text description of the uploaded image. This description is then piped into a smaller Text Model (defaulting to gemma-3-1b-it-qat-q4_0-gguf) that runs targeted prompts to extract the summary notes, numeric amount, and category.
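The two-stage pipeline above can be sketched as follows. Both models are injected as plain callables; the prompts, the fallback rule, and the post-hoc category check are illustrative assumptions (ledoxide itself enforces output shape with llguidance grammar constraints rather than post-processing).

```python
def extract_expense(image_bytes, vision_model, text_model, categories):
    """Two-stage extraction: a VLM describes the image, then a small LM
    answers targeted questions about that description."""
    # Stage 1: vision model produces a detailed text description.
    description = vision_model(image_bytes)

    # Stage 2: targeted prompts against the text model. The real server
    # constrains each answer with a grammar; here we just coerce types.
    notes = text_model(f"Summarize this purchase in one line:\n{description}")
    amount = float(
        text_model(f"Extract the total amount as a number:\n{description}")
    )
    category = text_model(f"Pick one category from {categories}:\n{description}")
    if category not in categories:
        category = categories[-1]  # illustrative fallback bucket
    return {"notes": notes, "amount": amount, "category": category}
```

Splitting the work this way lets the large vision model run once per image, while the cheap text model handles the three structured-extraction prompts.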

Caching Strategies & Resource Management

  • Model Memory Timeout: To preserve system RAM and GPU VRAM, ledoxide wraps its loaded models in a TimedModel construct. If a model remains unused for the configurable timeout period (default 5 minutes), it is automatically dropped from memory and will be seamlessly reloaded from disk on the next request.
  • Task Swapping: To prevent the server's memory from bloating with historical task data over long uptimes, the internal Scheduler implements an on-disk swap queue. When the in-memory finished queue exceeds --max-memory-size (default: 468,000 items), older finished tasks are serialized using postcard and flushed to a temporary swap file on disk. The /get_task endpoint streams over both active memory and the disk swap seamlessly.
  • Hugging Face Mount: To prevent redownloading multi-gigabyte .gguf files, it is vital to mount the HF_HOME cache to persistent host storage when using Docker.
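The TimedModel eviction behaviour can be sketched as a wrapper that records the last use and drops the inner model once the timeout elapses. This is a Python sketch of the idea only; the method names and the injected clock are assumptions, and the real TimedModel is Rust running under tokio.

```python
import time


class TimedModel:
    """Lazily loaded model that is dropped after a period of inactivity."""

    def __init__(self, load_fn, timeout_secs=300, clock=time.monotonic):
        self._load = load_fn        # reloads the model from the disk cache
        self._timeout = timeout_secs
        self._clock = clock         # injectable for testing
        self._model = None
        self._last_used = None

    def get(self):
        # Transparently reload if evicted (or never loaded).
        if self._model is None:
            self._model = self._load()
        self._last_used = self._clock()
        return self._model

    def evict_if_idle(self):
        """Called periodically; frees the model once idle past the timeout."""
        if self._model is not None and self._clock() - self._last_used > self._timeout:
            self._model = None  # freed from RAM/VRAM; next get() reloads
            return True
        return False
```

The key property is that eviction is invisible to callers: `get()` always returns a usable model, paying only a reload-from-disk cost after an idle period.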

Minor Caveats

  • Task Removal: Finished tasks remain in memory or the on-disk swap file indefinitely. There is currently no API to "delete" or "acknowledge" a task to free its disk footprint once retrieved. Over extreme uptimes on busy servers, the swap file could grow continuously.
  • CUDA Optimization: Depending on your GPU architecture, the Docker image may log a warning about an unsupported UPSCALE operator in the MTL0 backend during CLIP execution; inference typically falls back gracefully. Flash Attention is enabled by default to reduce the memory footprint.
  • Gated Model: Gemma 3 is a gated model, so downloads will fail unless a valid HF_TOKEN with access is provided. Visit a Gemma repository on Hugging Face to check whether you have access.
