Experiment on fine-tuning a Qwen model for data extraction with dynamic forms and short, natural speech.
For this kind of task, LLMs with prompt engineering are usually enough. For small setups, a small language model (SLM) fine-tuned for extraction can be cheaper and easier to self-host. This experiment was done to see how well the idea translates to real use and to other domains later on.
- Frontend: React (Vite, Tailwind) chat UI, mostly vibe-coded, with some pieces (e.g. WebSocket client, config) reused from earlier projects for a cleaner setup. Form fields are defined in the sidebar; text is sent via chat over a WebSocket (MessagePack) to the API.
- Backend: Redis pub/sub with FastAPI as gateway; API on port 4000, WebSocket at `/ws` (see the gateway sketch after this list).
- Extraction service: Loads Qwen2.5-3B + LoRA adapter from `artifacts/lora_formfill`, runs inference on extraction tasks from Redis, publishes results. Uses snake_case for field keys; stops generation at the first complete JSON object.
- Training service: LoRA fine-tuning (PEFT, bitsandbytes 4-bit) on Qwen; saves adapter to `artifacts/lora_formfill`. Optional; uncomment in `docker-compose.yml` and run when (re)training.
- STT service: Speech-to-text via Redis; optional for the extraction flow.
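As a sketch of how the gateway bridges WebSocket clients and Redis pub/sub, assuming hypothetical channel names (`extract:requests`, `extract:results`) and a simplified one-request-one-result loop (the actual service likely matches results to sessions):

```python
import msgpack
import redis.asyncio as redis
from fastapi import FastAPI, WebSocket

app = FastAPI()
rdb = redis.Redis(host="redis", port=6379)

@app.websocket("/ws")
async def ws_endpoint(ws: WebSocket):
    await ws.accept()
    pubsub = rdb.pubsub()
    await pubsub.subscribe("extract:results")  # hypothetical channel name
    while True:
        # Forward the client's MessagePack-encoded message to the extraction service.
        data = msgpack.unpackb(await ws.receive_bytes())
        await rdb.publish("extract:requests", msgpack.packb(data))
        # Relay the next extraction result back to the client (simplified pairing).
        msg = await pubsub.get_message(ignore_subscribe_messages=True, timeout=30)
        if msg:
            await ws.send_bytes(msg["data"])
```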
Session state (form fields + chat history) is stored in Redis and restored on reload; a clear-chat button resets the conversation.
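A minimal sketch of that round-trip, assuming one JSON blob per session under a hypothetical `session:<id>` key (the real key layout and client may differ):

```python
import json
import redis  # assumption: sync client for brevity; the service may use redis.asyncio

r = redis.Redis(host="redis", port=6379, decode_responses=True)

def save_session(session_id: str, fields: dict, history: list) -> None:
    # Persist form fields and chat history as one JSON blob per session.
    r.set(f"session:{session_id}", json.dumps({"fields": fields, "history": history}))

def load_session(session_id: str) -> dict:
    # Restore on reload; empty defaults if the session does not exist yet.
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else {"fields": {}, "history": []}
```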
Requires Docker (and NVIDIA Container Toolkit if using GPU for extraction/training).
```sh
docker compose up --build
```

- API: http://localhost:4000
- Frontend: http://localhost:3000 (WebSocket target `ws://localhost:4000`)
Place the LoRA adapter under `./artifacts/lora_formfill` (e.g. by running the training service once); otherwise extraction returns empty forms.
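Loading the adapter on top of the 4-bit base model follows the standard transformers/PEFT pattern; a sketch, where the exact model ID and quantization settings are assumptions:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

BASE = "Qwen/Qwen2.5-3B-Instruct"  # assumption: instruct variant
ADAPTER = "artifacts/lora_formfill"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)  # raises if the adapter directory is absent
model.eval()
```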
- Chat UI with configurable form fields (left panel), message list, and filled-form blocks with edit/save.
- WebSocket API: `chat` (extraction), `save_form`, `transcribe` (STT); session GET/PUT for persistence.
- Extraction pipeline: prompt + 4-bit base model + LoRA, JSON output, snake_case normalization for field names.
- Training pipeline: dataset from CSV, same prompt format, LoRA training, adapter written to `artifacts/lora_formfill` (see the sketch after this list).
- Backend logging (API and extraction service) to stdout and log files; no prompt/model output sent to the frontend.
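A condensed sketch of that training pipeline; the hyperparameters, CSV path, and `text` column name are illustrative, not the repo's actual values:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "Qwen/Qwen2.5-3B-Instruct"  # assumption: instruct variant
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(BASE)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

model = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Assumes each CSV row carries a fully formatted prompt+target in a `text` column.
ds = load_dataset("csv", data_files="data/train.csv")["train"]
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=1024))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="artifacts/lora_formfill", num_train_epochs=3,
                           per_device_train_batch_size=1, learning_rate=2e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
model.save_pretrained("artifacts/lora_formfill")  # writes only the LoRA adapter
```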
Notes:
- Stopping criteria: The model often produced valid JSON, but as a small model it frequently continued with extra dialogue (e.g. follow-up turns). Generation is therefore stopped after the first complete top-level JSON object: a simple brace-matching stack detects the closing `}`, and only that first object is kept and parsed (a sketch follows below).
- STT: Speech-to-text wiring exists (channel, payloads) but is not fully implemented yet.
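The extraction step looks roughly like this; a sketch that also skips braces inside string literals, while the service's actual version may be simpler:

```python
import json
from typing import Optional

def first_json_object(text: str) -> Optional[dict]:
    """Return the first complete top-level JSON object in `text`, else None."""
    depth, start = 0, None
    in_string = escaped = False
    for i, ch in enumerate(text):
        if in_string:
            # Inside a string literal: honor backslash escapes, watch for the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[start : i + 1])
                except json.JSONDecodeError:
                    return None
    return None

# e.g. first_json_object('{"name": "Ada"} Anything else I can help with?')
# -> {'name': 'Ada'}  (the trailing dialogue is discarded)
```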
Benchmarking (manual and otherwise) indicates that the fine-tuned model performs better than the base Qwen model on this extraction task. Results are good, with the usual caveats: occasional bugs and clear room for improvement (data quality, prompt tweaks, more training). The stack is runnable end to end (chat, forms, session persistence, training pipeline) and works as a small experimental baseline to build on or adapt for other domains.
