A production-ready Q&A chatbot built with NVIDIA NeMo 2.0 and LLAMA3, featuring high-quality data curation and efficient LoRA fine-tuning.
This project implements a comprehensive Question-Answering chatbot using NVIDIA's NeMo 2.0 framework with LLAMA3 8B and Low-Rank Adaptation (LoRA) fine-tuning. It addresses the critical importance of high-quality training data through a robust curation pipeline and provides explainability features to enhance user trust.
This is the second part of a 6-part workshop series on practical LLM implementation, building on the foundation established in Part 1: Practical Guide to Fine-Tuning LLMs with NVIDIA NeMo and LoRA.
- Data Quality Pipeline: Comprehensive data curation with document processing, QA generation, and multi-dimensional quality filtering
- Efficient LoRA Fine-tuning: Parameter-efficient adaptation of LLAMA3 8B with optimized hyperparameters
- Explainable AI: Interface with attention visualization and confidence metrics
- Production-Ready Deployment: FastAPI backend, Gradio UI, and NeMo Inference Microservice integration
- Comprehensive Testing: Test suite covering all modules for quality assurance
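To make the "parameter-efficient" claim concrete, here is a minimal arithmetic sketch. LoRA freezes a weight matrix of shape d x k and trains two low-rank factors of shapes d x r and r x k instead. The 4096 x 4096 shape and the rank of 8 are illustrative assumptions, not values taken from this project's configuration.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    full = d * k        # parameters updated by full fine-tuning
    lora = r * (d + k)  # parameters in the two low-rank factors (d x r and r x k)
    return full, lora


# Assumed shapes for illustration only: a 4096 x 4096 projection and rank 8.
full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
# prints: full: 16,777,216  lora: 65,536  ratio: 0.3906%
```

Under these assumptions, the adapter trains well under 1% of the parameters of the matrix it adapts, which is where the memory and speed savings come from.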
The project follows a modular design with clear separation of concerns:
- Configuration: Python-based configuration (not YAML) as per NeMo 2.0 recommendations
- Data Curation: Document processing, QA generation, and quality filtering
- Modeling: Model loading, LoRA integration, and evaluation metrics
- Recipes: Reusable workflows for data curation, training, and inference
- API: FastAPI-based REST API for serving the model
- UI: Gradio-based user interface with explainability features
- NIM: NeMo Inference Microservice for containerized deployment
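As a rough illustration of how Python-based configuration and quality filtering can fit together, the sketch below defines thresholds in a dataclass and applies them to QA pairs. Every name here (`QualityFilterConfig`, `passes_filters`, the `question`/`answer` keys, and the threshold values) is hypothetical and is not taken from this project's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityFilterConfig:
    """Hypothetical thresholds; the project's real config names may differ."""
    min_question_chars: int = 10
    min_answer_chars: int = 5
    max_answer_chars: int = 2000
    require_question_mark: bool = True


def passes_filters(pair: dict, cfg: QualityFilterConfig) -> bool:
    """Return True if a {'question': ..., 'answer': ...} pair meets every threshold."""
    question = pair.get("question", "").strip()
    answer = pair.get("answer", "").strip()
    if len(question) < cfg.min_question_chars:
        return False
    if not cfg.min_answer_chars <= len(answer) <= cfg.max_answer_chars:
        return False
    if cfg.require_question_mark and not question.endswith("?"):
        return False
    return True


cfg = QualityFilterConfig()
pairs = [
    {"question": "What does LoRA stand for?", "answer": "Low-Rank Adaptation"},
    {"question": "LoRA", "answer": "n/a"},  # too short on both fields
]
kept = [p for p in pairs if passes_filters(p, cfg)]
print(len(kept))  # prints: 1
```

Keeping thresholds in a typed, immutable object means a typo in a field name fails at construction time rather than silently falling back to a default, which is one common argument for Python configuration over YAML.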
Before starting, ensure you have:
- NVIDIA GPU with CUDA 12.8 support
- Docker and NVIDIA Container Toolkit installed
- Sufficient disk space (at least 20GB)
- NGC account and API token
The project uses NVIDIA's NeMo container for a consistent development environment. The setup is managed through a Makefile for simplicity.
- Clone the repository:
git clone https://github.com/T-DevH/nemo-qa-chatbot.git
cd nemo-qa-chatbot
- Build and run the container:
# Build and start the container
make run
# To access the container shell later
make shell
The Makefile handles:
- Pulling the NVIDIA NeMo container (nvcr.io/nvidia/nemo:25.02)
- Setting up the development environment
- Installing system dependencies
- Installing Poetry for dependency management
- Installing PyTorch with CUDA 12.8 support
- Installing project dependencies
- Installing mamba-ssm for efficient training
Note: If you encounter issues with mamba-ssm installation, it may be due to PyTorch not being properly installed first. In this case, you can try installing it manually inside the container:
# Access the container shell
make shell
# Install PyTorch first
pip install torch==2.7.0+cu128 torchaudio==2.7.0+cu128 --index-url https://download.pytorch.org/whl/cu128
# Then install mamba-ssm
pip install mamba-ssm==2.2.2
If you prefer to set up without containers:
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install Poetry
pip install poetry
# Install dependencies
poetry install
# Install PyTorch with CUDA support
poetry run pip install torch==2.7.0+cu128 torchaudio==2.7.0+cu128 --index-url https://download.pytorch.org/whl/cu128
# Install mamba-ssm
poetry run pip install mamba-ssm==2.2.2
# Download NGC CLI
wget https://ngc.nvidia.com/downloads/ngccli_linux.zip
unzip ngccli_linux.zip
# Move to a system directory
sudo mv ngc /usr/local/bin/
# Verify installation
ngc --version
# Login to NGC
ngc config set
# Enter your NGC API key when prompted
# You can find your API key at: https://ngc.nvidia.com/setup/api-key
# Basic download
python scripts/download_model.py
# With specific options
python scripts/download_model.py \
--model_name llama3-8b \
--output_dir models/base
# Force redownload if model exists
python scripts/download_model.py --force
The download script will:
- Verify sufficient disk space
- Check NGC CLI availability
- Download the model using NGC CLI
- Show download progress in real-time
- Handle any download errors gracefully
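The first two checks in this list can be sketched with the standard library alone. This is not the code in `scripts/download_model.py`; the function name `preflight` and its parameters are assumptions, and the 20 GB default simply mirrors the disk-space prerequisite stated earlier.

```python
import shutil
from pathlib import Path


def preflight(output_dir: str, required_gb: float = 20.0, cli: str = "ngc") -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # Walk up to the nearest existing directory so disk_usage gets a valid path.
    path = Path(output_dir).resolve()
    while not path.exists():
        path = path.parent

    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < required_gb:
        problems.append(f"only {free_gb:.1f} GB free, need {required_gb:.0f} GB")

    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(cli) is None:
        problems.append(f"'{cli}' not found on PATH")

    return problems


for problem in preflight("models/base"):
    print("preflight failed:", problem)
```

Collecting problems into a list, rather than raising on the first one, lets a script report every missing prerequisite in a single run.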
Available models:
- llama3-8b: LLAMA3 8B model
- llama3.1-8b: LLAMA3.1 8B model
After downloading, verify that the model files are present and correctly structured:
# Check the model directory structure
ls -la models/base/llama3-8b/llama-3_1-8b-nemo_v1.0
# Expected output should show:
# - llama3_1_8b.nemo (main model file, ~16GB)
# - Other configuration files
python scripts/curate_data.py --input_dir data/raw --output_dir data/processed
python scripts/train.py \
--model_path models/base/llama3-8b \
--train_data data/datasets/train.jsonl \
--val_data data/datasets/val.jsonl \
--output_dir models/finetuned
python scripts/evaluate.py \
--model_path models/finetuned/final_model \
--test_data data/datasets/test.jsonl \
--output_path evaluation_results.json
python scripts/deploy.py \
--model_path models/finetuned/final_model
python scripts/export_nim.py \
--model_path models/finetuned/final_model \
--output_dir nim/export
cd nim/export
docker build -t llama3-qa-chatbot-nim .
docker run -p 8000:8000 --gpus all llama3-qa-chatbot-nim
The project includes several Jupyter notebooks for exploration and demonstration:
- Data Exploration: Explore the data curation process
- Model Analysis: Analyze model outputs and performance
- Interactive Demo: Try out the chatbot in an interactive environment
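When analyzing model outputs, two metrics commonly used for question answering are exact match and token-level F1. This README does not state which metrics the project's evaluation code computes, so the sketch below is a generic illustration with hypothetical function names, not a description of `scripts/evaluate.py`.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace; real pipelines often also strip punctuation."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over overlapping tokens."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Low-Rank Adaptation", "low-rank adaptation"))  # prints: True
print(round(token_f1("low-rank adaptation of models", "low-rank adaptation"), 3))  # prints: 0.667
```

Exact match is strict and easy to interpret, while token F1 gives partial credit to answers that contain the reference plus extra words.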
nemo-qa-chatbot/
├── nemo_qa/ # Main package
│ ├── config/ # Python-based configuration
│ ├── curator/ # Data curation module
│ ├── recipes/ # NeMo 2.0 recipes
│ ├── modeling/ # Model implementation
│ ├── api/ # FastAPI implementation
│ └── ui/ # Gradio UI implementation
├── nim/ # NeMo Inference Microservice
├── scripts/ # Command-line scripts
├── notebooks/ # Jupyter notebooks
├── tests/ # Unit tests
└── data/ # Data directory
This project is part of a 6-part workshop series:
- ✅ Foundation: Fine-tuning basics with NVIDIA NeMo and LoRA
- ✅ Quality: Data curation and optimized fine-tuning (this project)
- 🔜 Advanced Reasoning: Chain-of-thought implementation
- 🔜 Alignment: RLHF and alignment techniques
- 🔜 Multimodal: Multi-modal capabilities and RAG
- 🔜 Deployment: Enterprise deployment and monitoring
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If you use this project in your research or work, please consider citing:
@misc{hammadou2025nemoqachatbot,
  author = {Hammadou, Tarik},
  title = {NeMo QA Chatbot: Production-Ready Q&A with LLAMA3 and NeMo 2.0},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/T-DevH/nemo-qa-chatbot}}
}
- NVIDIA NeMo team for the amazing framework
- LLAMA3 for the base model
- All contributors who have helped shape this project
We use Docker for development to ensure a consistent environment across all developers and to leverage NVIDIA's pre-built NeMo container. This approach has several advantages:
- Pre-built Dependencies: The NVIDIA NeMo container (nvcr.io/nvidia/nemo:25.02) comes with:
  - CUDA and cuDNN pre-installed and configured
  - PyTorch with CUDA support
  - Pre-built transformer-engine optimized for the container's environment
  - Other NVIDIA-specific optimizations
- Isolated Environment: Docker provides an isolated environment that:
  - Prevents conflicts with system-level dependencies
  - Ensures consistent behavior across different machines
  - Makes it easy to switch between different versions of dependencies
- Simplified Setup: New developers can start working with just:
  make run
The make run command orchestrates the following process:
- Container Pull: Downloads the NVIDIA NeMo base image (nvcr.io/nvidia/nemo:25.02)
- Volume Mounting:
  - Your local project directory is mounted at /workspace/nemo_qa_chatbot
  - Changes made locally are immediately reflected in the container
  - Your code runs in the container but is edited on your host machine
- Dependency Installation:
  - Installs Poetry inside the container
  - Runs poetry install --without transformer-engine
  - This installs all dependencies from pyproject.toml except transformer-engine
  - The pre-built transformer-engine from the container is used instead
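The volume-mounting step can be pictured as a `docker run` invocation with a bind mount. The Makefile itself is not reproduced in this README, so the exact flags below are assumptions; only the image name and the mount path come from the text above, and `docker_run_command` is a hypothetical helper.

```python
from pathlib import Path

IMAGE = "nvcr.io/nvidia/nemo:25.02"         # image named in this README
MOUNT_POINT = "/workspace/nemo_qa_chatbot"  # mount path named in this README


def docker_run_command(project_dir: str) -> list[str]:
    """Build a docker run argument list that bind-mounts the project into the container."""
    host_path = Path(project_dir).resolve()
    return [
        "docker", "run", "--rm", "-it",
        "--gpus", "all",                     # expose GPUs via the NVIDIA Container Toolkit
        "-v", f"{host_path}:{MOUNT_POINT}",  # live bind mount of the host checkout
        "-w", MOUNT_POINT,                   # start in the mounted project directory
        IMAGE,
        "bash",
    ]


# Print the command instead of running it, so the sketch has no side effects.
print(" ".join(docker_run_command(".")))
```

Because the mount is a bind mount rather than a copy, edits made on the host appear inside the container without rebuilding the image.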
| File | Used? | Role |
|---|---|---|
| pyproject.toml | ✅ | Defines all your app's dependencies |
| Makefile | ✅ | Orchestrates container setup and launch |
| nemo-toolkit | ✅ | Installed from TOML via Poetry |
| transformer-engine | ❌ | Pre-built in container (skipped in TOML) |
| Your code/scripts | ✅ | Mounted live from your host machine |
We intentionally skip installing transformer-engine via Poetry because:
- The NVIDIA container already includes a pre-built, optimized version
- Building from source can be complex and error-prone
- The pre-built version is guaranteed to work with the container's CUDA setup
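A quick way to confirm that the container's pre-built package is visible to your interpreter is to look up its import spec without importing it. This is a small sketch, assuming the package is importable as `transformer_engine`; the helper name `has_module` is hypothetical.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


if has_module("transformer_engine"):
    print("transformer-engine found (expected inside the NeMo container)")
else:
    print("transformer-engine missing; are you running outside the container?")
```

Running this inside and outside the container is a cheap sanity check that Poetry has not shadowed or removed the pre-built installation.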
- Start the container:
  make run
- Access the container shell (in a new terminal):
  make shell
- Your code is mounted at /workspace/nemo_qa_chatbot
  - Edit files on your host machine
  - Run code inside the container
  - Changes are immediately reflected
- When done, simply exit the container:
  exit