NeMo QA Chatbot

A production-ready Q&A chatbot built with NVIDIA NeMo 2.0 and LLAMA3, featuring high-quality data curation and efficient LoRA fine-tuning.

Overview

This project implements a comprehensive Question-Answering chatbot using NVIDIA's NeMo 2.0 framework with LLAMA3 8B and Low-Rank Adaptation (LoRA) fine-tuning. It addresses the critical importance of high-quality training data through a robust curation pipeline and provides explainability features to enhance user trust.

This is the second part of a 6-part workshop series on practical LLM implementation, building on the foundation established in Part 1: Practical Guide to Fine-Tuning LLMs with NVIDIA NeMo and LoRA.

Key Features

Data Quality Pipeline: Comprehensive data curation with document processing, QA generation, and multi-dimensional quality filtering
Efficient LoRA Fine-tuning: Parameter-efficient adaptation of LLAMA3 8B with optimized hyperparameters
Explainable AI: Interface with attention visualization and confidence metrics
Production-Ready Deployment: FastAPI backend, Gradio UI, and NeMo Inference Microservice integration
Comprehensive Testing: Test suite covering all modules for quality assurance

Architecture

The project follows a modular design with clear separation of concerns:

Configuration: Python-based configuration (not YAML) as per NeMo 2.0 recommendations
Data Curation: Document processing, QA generation, and quality filtering
Modeling: Model loading, LoRA integration, and evaluation metrics
Recipes: Reusable workflows for data curation, training, and inference
API: FastAPI-based REST API for serving the model
UI: Gradio-based user interface with explainability features
NIM: NeMo Inference Microservice for containerized deployment

Installation

Prerequisites

Before starting, ensure you have:

NVIDIA GPU with CUDA 12.8 support
Docker and NVIDIA Container Toolkit installed
Sufficient disk space (at least 20GB)
NGC account and API token

Container-based Setup

The project uses NVIDIA's NeMo container for a consistent development environment. The setup is managed through a Makefile for simplicity.

Clone the repository:

git clone https://github.com/T-DevH/nemo-qa-chatbot.git
cd nemo-qa-chatbot

Build and Run the Container:

# Build and start the container
make run

# To access the container shell later
make shell

The Makefile handles:

Pulling the NVIDIA NeMo container (nvcr.io/nvidia/nemo:25.02)
Setting up the development environment
Installing system dependencies
Installing Poetry for dependency management
Installing PyTorch with CUDA 12.8 support
Installing project dependencies
Installing mamba-ssm for efficient training

Note: If you encounter issues with mamba-ssm installation, it may be due to PyTorch not being properly installed first. In this case, you can try installing it manually inside the container:

# Access the container shell
make shell

# Install PyTorch first
pip install torch==2.7.0+cu128 torchaudio==2.7.0+cu128 --index-url https://download.pytorch.org/whl/cu128

# Then install mamba-ssm
pip install mamba-ssm==2.2.2

Manual Setup (Alternative)

If you prefer to set up without containers:

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install Poetry
pip install poetry

# Install dependencies
poetry install

# Install PyTorch with CUDA support
poetry run pip install torch==2.7.0+cu128 torchaudio==2.7.0+cu128 --index-url https://download.pytorch.org/whl/cu128

# Install mamba-ssm
poetry run pip install mamba-ssm==2.2.2

Usage

Prerequisites

Before starting, ensure you have:

NVIDIA GPU with CUDA 12.8 support
Docker and NVIDIA Container Toolkit installed
Sufficient disk space (at least 20GB)
NGC account and API token

1. Install NGC CLI (if not already installed)

# Download NGC CLI
wget https://ngc.nvidia.com/downloads/ngccli_linux.zip
unzip ngccli_linux.zip

# Move to a system directory
sudo mv ngc /usr/local/bin/

# Verify installation
ngc --version

2. Configure NGC CLI

# Login to NGC
ngc config set

# Enter your NGC API key when prompted
# You can find your API key at: https://ngc.nvidia.com/setup/api-key

3. Download Base Model

# Basic download
python scripts/download_model.py

# With specific options
python scripts/download_model.py \
  --model_name llama3-8b \
  --output_dir models/base

# Force redownload if model exists
python scripts/download_model.py --force

The download script will:

Verify sufficient disk space
Check NGC CLI availability
Download the model using NGC CLI
Show download progress in real-time
Handle any download errors gracefully

Available models:

llama3-8b: LLAMA3 8B model
llama3.1-8b: LLAMA3.1 8B model

4. Verify Model Download

After downloading, verify that the model files are present and correctly structured:

# Check the model directory structure
ls -la models/base/llama3-8b/llama-3_1-8b-nemo_v1.0

# Expected output should show:
# - llama3_1_8b.nemo (main model file, ~16GB)
# - Other configuration files

5. Curate Training Data

python scripts/curate_data.py --input_dir data/raw --output_dir data/processed

6. Fine-tune with LoRA

python scripts/train.py \
  --model_path models/base/llama3-8b \
  --train_data data/datasets/train.jsonl \
  --val_data data/datasets/val.jsonl \
  --output_dir models/finetuned

7. Evaluate the Model

python scripts/evaluate.py \
  --model_path models/finetuned/final_model \
  --test_data data/datasets/test.jsonl \
  --output_path evaluation_results.json

8. Deploy Chatbot

python scripts/deploy.py \
  --model_path models/finetuned/final_model

9. Export as NIM (Optional)

python scripts/export_nim.py \
  --model_path models/finetuned/final_model \
  --output_dir nim/export

10. Build and Run NIM Container (Optional)

cd nim/export
docker build -t llama3-qa-chatbot-nim .
docker run -p 8000:8000 --gpus all llama3-qa-chatbot-nim

Jupyter Notebooks

The project includes several Jupyter notebooks for exploration and demonstration:

Data Exploration: Explore the data curation process
Model Analysis: Analyze model outputs and performance
Interactive Demo: Try out the chatbot in an interactive environment

Project Structure

nemo-qa-chatbot/
├── nemo_qa/               # Main package
│   ├── config/            # Python-based configuration
│   ├── curator/           # Data curation module
│   ├── recipes/           # NeMo 2.0 recipes
│   ├── modeling/          # Model implementation
│   ├── api/               # FastAPI implementation
│   └── ui/                # Gradio UI implementation
├── nim/                   # NeMo Inference Microservice
├── scripts/               # Command-line scripts
├── notebooks/             # Jupyter notebooks
├── tests/                 # Unit tests
└── data/                  # Data directory

Coming Soon in the Workshop Series

This project is part of a 6-part workshop series:

✅ Foundation: Fine-tuning basics with NVIDIA NeMo and LoRA
✅ Quality: Data curation and optimized fine-tuning (this project)
🔜 Advanced Reasoning: Chain-of-thought implementation
🔜 Alignment: RLHF and alignment techniques
🔜 Multimodal: Multi-modal capabilities and RAG
🔜 Deployment: Enterprise deployment and monitoring

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Citation

if you use this project in your reaserch or work, please consider citing: @misc{hammadou2025nemoqachatbot, author = {Hammadou, Tarik}, title = {NeMo QA Chatbot: Production-Ready Q&A with LLAMA3 and NeMo 2.0}, year = {2025}, publisher = {GitHub}, journal = {GitHub repository}, howpublished = {\url{https://github.com/T-DevH/nemo-qa-chatbot}} }

Acknowledgments

NVIDIA NeMo team for the amazing framework
LLAMA3 for the base model
All contributors who have helped shape this project

Development Setup with Docker

We use Docker for development to ensure a consistent environment across all developers and to leverage NVIDIA's pre-built NeMo container. This approach has several advantages:

Pre-built Dependencies: The NVIDIA NeMo container (nvcr.io/nvidia/nemo:25.02) comes with:
- CUDA and cuDNN pre-installed and configured
- PyTorch with CUDA support
- Pre-built transformer-engine optimized for the container's environment
- Other NVIDIA-specific optimizations
Isolated Environment: Docker provides an isolated environment that:
- Prevents conflicts with system-level dependencies
- Ensures consistent behavior across different machines
- Makes it easy to switch between different versions of dependencies
Simplified Setup: New developers can start working with just:
```
make run
```

What Happens When You Run `make run`

The make run command orchestrates the following process:

Container Pull: Downloads the NVIDIA NeMo base image (nvcr.io/nvidia/nemo:25.02)
Volume Mounting:
- Your local project directory is mounted at /workspace/nemo_qa_chatbot
- Changes made locally are immediately reflected in the container
- Your code runs in the container but is edited on your host machine
Dependency Installation:
- Installs Poetry inside the container
- Runs poetry install --without transformer-engine
- This installs all dependencies from pyproject.toml except transformer-engine
- The pre-built transformer-engine from the container is used instead

Key Files and Their Roles

File	Used?	Role
`pyproject.toml`	✅	Defines all your app's dependencies
`Makefile`	✅	Orchestrates container setup and launch
`nemo-toolkit`	✅	Installed from TOML via Poetry
`transformer-engine`	❌	Pre-built in container (skipped in TOML)
Your code/scripts	✅	Mounted live from your host machine

Why We Skip `transformer-engine` in Poetry

We intentionally skip installing transformer-engine via Poetry because:

The NVIDIA container already includes a pre-built, optimized version
Building from source can be complex and error-prone
The pre-built version is guaranteed to work with the container's CUDA setup

Development Workflow

Start the container:
```
make run
```
Access the container shell (in a new terminal):
```
make shell
```
Your code is mounted at /workspace/nemo_qa_chatbot
- Edit files on your host machine
- Run code inside the container
- Changes are immediately reflected
When done, simply exit the container:
```
exit
```

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
data		data
nemo_qa		nemo_qa
nim		nim
notebooks		notebooks
scripts		scripts
tests		tests
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.py		setup.py

Folders and files

Latest commit

History

Repository files navigation

NeMo QA Chatbot

Overview

Key Features

Architecture

Installation

Prerequisites

Container-based Setup

Manual Setup (Alternative)

Usage

Prerequisites

1. Install NGC CLI (if not already installed)

2. Configure NGC CLI

3. Download Base Model

4. Verify Model Download

5. Curate Training Data

6. Fine-tune with LoRA

7. Evaluate the Model

8. Deploy Chatbot

9. Export as NIM (Optional)

10. Build and Run NIM Container (Optional)

Jupyter Notebooks

Project Structure

Coming Soon in the Workshop Series

Contributing

License

Citation

Acknowledgments

Development Setup with Docker

What Happens When You Run make run

Key Files and Their Roles

Why We Skip transformer-engine in Poetry

Development Workflow

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

What Happens When You Run `make run`

Why We Skip `transformer-engine` in Poetry

Packages