This project implements Named Entity Recognition (NER) and intent classification using small language models. It includes tools for dataset generation, model training, and inference for both classification and causal language modeling tasks. The project now features a Streamlit web interface for easy interaction with all functionality.
- ✨ Web Interface: Interactive Streamlit application for all operations
- 🤖 Dual Model Support: Both BERT-based classification and causal language models
- 📝 Dataset Generation: Automated dataset creation using OpenAI's GPT models
- 🔍 Real-time Inference: Web-based inference for trained models
- 📦 Package Structure: Proper Python package with setup.py
- 🌐 Remote LLM Support: Integration with cloud-based language models
Install the package and dependencies:
```bash
# Clone the repository
git clone <your-repo-url>
cd small_lm

# Install the package
pip install -e .
```

Create a `.env` file in the project root:

```bash
echo "OPENAI_API_KEY=your_api_key_here" > .env
```

You will need:
- OpenAI API Key: For dataset generation (uses cost-effective gpt-4.1-mini)
- Hugging Face Account: For model downloads and uploads
- Generate access tokens: HF Security Tokens
- Login via the CLI:

```bash
huggingface-cli login
```
Start the Streamlit application:
```bash
streamlit run app.py
```

This opens a web interface with the following sections:
- Home: Project overview and dataset management
- Dataset: Generate and view training datasets
- Train: Train SLM and BERT models
- Inference: Run inference on trained models
```
small_lm/
├── app.py                    # Main Streamlit application
├── streamlit_pages/          # Web interface components
│   ├── dataset_app.py        # Dataset generation and viewing
│   ├── train_app.py          # Model training interface
│   └── inference_app.py      # Inference interface
├── small_lm/                 # Core package
│   ├── classification/       # BERT-based classification models
│   │   ├── train_bert.py     # Training script for classification
│   │   ├── infer_bert.py     # Inference script for classification
│   │   ├── data_loader.py    # Data loading utilities
│   │   └── default_bert.py   # Default BERT model configuration
│   ├── causal_slms/          # Causal language models
│   │   ├── train_slm.py      # Training script for causal LM
│   │   ├── infer_slm.py      # Inference script for causal LM
│   │   ├── data_loader.py    # Data loading utilities
│   │   └── default_lm.py     # Default LM configuration
│   └── generate_dataset.py   # Dataset generation script
├── remote_llms/              # Remote LLM integration
│   └── infer.py              # Remote inference utilities
├── datasets/                 # Generated and processed datasets
├── trained_models/           # Saved model checkpoints
├── configs/                  # Configuration files
├── docs/                     # Detailed documentation
├── setup.py                  # Package configuration
└── requirements.txt          # Project dependencies
```
- Launch the app:

```bash
streamlit run app.py
```

- Generate Dataset: Use the Dataset tab to create training data
- Train Models: Use the Train tab for SLM or BERT training
- Run Inference: Use the Inference tab to test trained models
```bash
# Generate a dataset
python small_lm/generate_dataset.py
```

```bash
# Train BERT classification model
python small_lm/classification/train_bert.py

# Train causal language model
python small_lm/causal_slms/train_slm.py
```

```bash
# BERT inference
python small_lm/classification/infer_bert.py path/to/model

# SLM inference
python small_lm/causal_slms/infer_slm.py path/to/model
```

- Base Model: dslim/distilbert-NER
- Task: Token classification and intent recognition
- Features: Fast inference, lightweight (60M parameters)
- Output: BIO tags for entities (DATE, TIME, NAME, EMAIL)
- Base Model: SmolLM2-135M-Instruct
- Task: Text generation for entity recognition
- Features: One of the smallest instruction-tuned causal LMs (135M parameters), chat-based format
- Output: Structured entity recognition responses
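Since the causal model works with a chat-based format, a prompt is typically built as a list of role-tagged messages. The sketch below shows this message structure; the system prompt and helper name here are illustrative assumptions, not the project's actual prompt.

```python
# Sketch: build a chat-format prompt for an instruction-tuned causal SLM
# such as SmolLM2-135M-Instruct. The system prompt wording is a hypothetical
# example, not the one used by this project's training scripts.
def build_messages(text):
    return [
        {"role": "system",
         "content": "Extract DATE, TIME, NAME and EMAIL entities and the "
                    "intent from the user's message. Reply as JSON."},
        {"role": "user", "content": text},
    ]

msgs = build_messages("Cancel my 3 PM meeting with Dana")
print([m["role"] for m in msgs])  # -> ['system', 'user']
```

A list like this can then be passed to a chat template (e.g. a tokenizer's `apply_chat_template`) before generation.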
The dataset uses BIO (Begin-Inside-Outside) tagging for named entities:
```json
{
  "text": "Schedule a meeting at 2 PM on Friday with john@example.com",
  "labels": [
    {"word": "2", "label": "B-TIME"},
    {"word": "PM", "label": "I-TIME"},
    {"word": "Friday", "label": "B-DATE"},
    {"word": "john@example.com", "label": "B-EMAIL"}
  ],
  "intent": "inquiry"
}
```

Supported Entities:

- DATE: Dates and day references
- TIME: Time expressions
- NAME: Person names
- EMAIL: Email addresses

Intents:

- inquiry: Meeting scheduling requests
- cancel: Meeting cancellation requests
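For token-classification training, the per-word labels in a record need to be aligned with the tokens of the text, with unlabeled words tagged `O` (Outside). A minimal sketch of that alignment, using the record format shown above (the helper name `bio_tags_for` is illustrative, not part of the project's API):

```python
# Sketch: align a record's word-level labels with its whitespace-split
# tokens, defaulting unlabeled tokens to "O". Simplification: a word
# that appears twice in the text would receive the same label both times.
def bio_tags_for(record):
    labeled = {item["word"]: item["label"] for item in record["labels"]}
    return [labeled.get(tok, "O") for tok in record["text"].split()]

record = {
    "text": "Schedule a meeting at 2 PM on Friday with john@example.com",
    "labels": [
        {"word": "2", "label": "B-TIME"},
        {"word": "PM", "label": "I-TIME"},
        {"word": "Friday", "label": "B-DATE"},
        {"word": "john@example.com", "label": "B-EMAIL"},
    ],
    "intent": "inquiry",
}

print(bio_tags_for(record))
# -> ['O', 'O', 'O', 'O', 'B-TIME', 'I-TIME', 'O', 'B-DATE', 'O', 'B-EMAIL']
```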
Key hyperparameters to adjust:
- learning_rate: Start with 2e-5 for BERT, 5e-5 for SLM
- num_train_epochs: 3-5 epochs typically sufficient
- batch_size: Adjust based on GPU memory
- max_length: 128 tokens for most email scenarios
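As a starting point, the guidance above can be written down as plain config dicts. The key names mirror Hugging Face `TrainingArguments` parameters, but this is a sketch of sensible defaults, not the project's actual configuration files:

```python
# Illustrative starting hyperparameters based on the guidance above.
bert_config = {
    "learning_rate": 2e-5,               # BERT starting point
    "num_train_epochs": 3,               # 3-5 is typically sufficient
    "per_device_train_batch_size": 16,   # reduce if GPU memory is tight
    "max_length": 128,                   # enough for most email scenarios
}

# The SLM shares most settings but starts from a higher learning rate.
slm_config = {**bert_config, "learning_rate": 5e-5}

print(bert_config["learning_rate"], slm_config["learning_rate"])  # -> 2e-05 5e-05
```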
Customize in generate_dataset.py:
- Number of examples
- Entity types and frequency
- Intent distribution
- Complexity levels
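One way to expose those knobs is through a parameterised prompt builder. The function below is a hypothetical sketch of how such options could feed into the generation prompt; the parameter names are illustrative, not the actual flags of `generate_dataset.py`:

```python
# Hypothetical sketch: turn dataset-generation options into a prompt
# for the OpenAI model. Not the project's actual implementation.
def build_generation_prompt(n_examples=5,
                            entities=("DATE", "TIME", "NAME", "EMAIL"),
                            intents=("inquiry", "cancel")):
    return (
        f"Generate {n_examples} short meeting-related messages. "
        f"Tag the entities {', '.join(entities)} in BIO format and label "
        f"each example with one intent from: {', '.join(intents)}."
    )

prompt = build_generation_prompt(n_examples=3)
```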
For detailed information, see:
- docs/README.md - Comprehensive technical documentation
- Model architecture details
- Training strategies and PEFT/LoRA explanations
- Dataset format specifications
Core dependencies:
- transformers==4.51.3 - Hugging Face Transformers
- torch==2.6.0 - PyTorch framework
- peft==0.15.2 - Parameter-efficient fine-tuning
- streamlit==1.45.1 - Web interface
- openai==1.78.1 - Dataset generation
- python-dotenv - Environment management
The project is packaged as a proper Python package:
```bash
# Install in development mode
pip install -e .

# Install with development dependencies
pip install -e ".[dev]"
```

- ✅ Web interface for all operations
- ✅ Package structure and setup
- ✅ Remote LLM integration
- 🔜 Support for additional entity types
- 🔜 Local model server deployment
- 🔜 Batch inference capabilities
- 🔜 Model performance metrics dashboard
- 🔜 Export trained models to different formats
Common Issues:
- GPU Memory: Reduce batch size if CUDA out of memory
- API Limits: Check OpenAI API quota for dataset generation
- HF Authentication: Ensure `huggingface-cli login` has been completed
- Model Loading: Verify model paths in the trained_models directory
Performance Tips:
- Use PEFT/LoRA for memory-efficient training
- Start with smaller datasets for prototyping
- Monitor validation loss to prevent overfitting
- Use the web interface for easier debugging
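For the PEFT/LoRA tip above, a minimal configuration sketch using the pinned `peft` release might look like this. The rank, scaling, and target modules are illustrative defaults for DistilBERT, not the project's actual training configuration:

```python
from peft import LoraConfig, get_peft_model  # peft==0.15.2 is in requirements.txt

# Illustrative LoRA hyperparameters; tune r / lora_alpha to your GPU budget.
lora_config = LoraConfig(
    r=8,                                 # low-rank adapter dimension
    lora_alpha=16,                       # scaling factor
    lora_dropout=0.05,
    target_modules=["q_lin", "v_lin"],   # DistilBERT attention projections
    task_type="TOKEN_CLS",               # token classification (NER)
)

# model = get_peft_model(base_model, lora_config)  # wraps a loaded HF model
```

Training the wrapped model then updates only the small adapter matrices, which is what keeps memory usage low.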