makschernetskyi/SLM_form_info_extraction
Data extraction experiment

Experiment on fine-tuning a Qwen model for data extraction with dynamic forms and short, natural speech.

Form filler UI: sidebar with field definitions, chat with extracted form values

Motivation

For this kind of task, LLMs with prompt engineering are usually enough. For small setups, however, a small language model (SLM) fine-tuned for extraction can be cheaper and easier to self-host. This experiment explores how well the idea holds up in real use and how it might translate to other domains later on.

Structure

  • Frontend: React (Vite, Tailwind) chat UI, mostly vibe-coded, with some pieces (e.g. the WebSocket client and config) taken from earlier projects for a cleaner setup. Define form fields in the sidebar and send text via chat; communication with the API runs over WebSocket (MessagePack).
  • Backend: Redis pub/sub with FastAPI as gateway; API on port 4000, WebSocket at /ws.
  • Extraction service: Loads Qwen2.5-3B + LoRA adapter from artifacts/lora_formfill, runs inference on extraction tasks from Redis, and publishes results. Uses snake_case for field keys; stops generation at the first complete JSON object.
  • Training service: LoRA fine-tuning (PEFT, bitsandbytes 4-bit) on Qwen; saves adapter to artifacts/lora_formfill. Optional; uncomment in docker-compose.yml and run when (re)training.
  • STT service: Speech-to-text via Redis; optional for the extraction flow.

Session state (form fields + chat history) is stored in Redis and restored on reload. Clear-chat button resets the conversation.
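The session payload stored in Redis might look roughly like the sketch below. The key names and shape here are illustrative assumptions, not the project's actual schema; the point is that the whole session serializes to one JSON blob that a GET can restore on reload.

```python
import json

# Hypothetical session payload: form field definitions plus chat history.
# Field and key names are assumptions for illustration only.
session = {
    "fields": [
        {"name": "full_name", "label": "Full name"},
        {"name": "appointment_date", "label": "Appointment date"},
    ],
    "history": [
        {"role": "user", "text": "I'm Jane Doe, Tuesday at 3pm please."},
        {"role": "form", "values": {"full_name": "Jane Doe"}},
    ],
}

# Stored as a single JSON string under a per-session Redis key;
# restoring on page reload is then just a GET followed by json.loads.
blob = json.dumps(session)
restored = json.loads(blob)
```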

Run

Requires Docker (and NVIDIA Container Toolkit if using GPU for extraction/training).

docker compose up --build

Place the LoRA adapter under ./artifacts/lora_formfill (e.g. by running the training service once); otherwise extraction returns empty forms.
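A sketch of how the docker-compose.yml services might be laid out, with the training service commented out as described above. Service names, build paths, and volume mounts here are illustrative assumptions, not the project's actual file:

```yaml
services:
  redis:
    image: redis:7
  api:
    build: ./backend
    ports:
      - "4000:4000"          # gateway API; WebSocket exposed at /ws
    depends_on:
      - redis
  extraction:
    build: ./extraction
    volumes:
      - ./artifacts:/app/artifacts   # LoRA adapter under artifacts/lora_formfill
    depends_on:
      - redis
  # Uncomment to (re)train; writes the adapter to ./artifacts/lora_formfill
  # training:
  #   build: ./training
  #   volumes:
  #     - ./artifacts:/app/artifacts
```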

Execution

What was done

  • Chat UI with configurable form fields (left panel), message list, and filled-form blocks with edit/save.
  • WebSocket API: chat (extraction), save_form, transcribe (STT); session GET/PUT for persistence.
  • Extraction pipeline: prompt + 4-bit base model + LoRA, JSON output, snake_case normalization for field names.
  • Training pipeline: dataset from CSV, same prompt format, LoRA training, adapter written to artifacts/lora_formfill.
  • Backend logging (API and extraction-service) to stdout and log files; no prompt/model output sent to the frontend.
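The snake_case normalization for field names mentioned above can be sketched as follows. This is a minimal version under common conventions; the project's actual normalization rules may differ:

```python
import re

def to_snake_case(name: str) -> str:
    """Normalize a field label to a snake_case key, e.g. 'Full Name' -> 'full_name'."""
    # Insert an underscore at camelCase boundaries, then collapse
    # runs of non-alphanumeric characters into single underscores.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return name.strip("_").lower()
```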

Notes:

  • Stopping criteria: The model often produced valid JSON, but as a small model it frequently continued with extra dialogue (e.g. follow-up turns). Generation is stopped after the first complete top-level JSON object. A simple brace-matching stack is used to detect the closing } so we only keep and parse that first object.
  • STT: Speech-to-text wiring exists (channel, payloads) but is not fully implemented; to be done.

Conclusion & Results

Benchmarking (manual and otherwise) indicates that the fine-tuned model performs better than the base Qwen model on this extraction task. Results are good, with the usual caveats: occasional bugs and clear room for improvement (data quality, prompt tweaks, more training). The stack is runnable end to end (chat, forms, session persistence, training pipeline) and works as a small experimental baseline to build on or adapt to other domains.

About

Fine-tuning of the Qwen2.5-3B small language model for information extraction.
