TableRAG is an advanced question-answering framework that combines structured tabular data (CSV files) and unstructured text documents (PDF, DOCX, TXT, MD) using Retrieval-Augmented Generation (RAG). Ask natural language questions and get intelligent answers that leverage both your data tables and text content.
✅ Multi-modal Document Support: CSV tables, PDF documents, Word files, Markdown, and plain text
✅ Hybrid RAG Architecture: Combines SQL execution (precise) + vector search (semantic)
✅ Interactive Streamlit UI: Drag-and-drop uploads with real-time processing
✅ Intelligent Query Processing: LLM-powered query decomposition and answer synthesis
✅ Advanced Data Handling: Auto-encoding detection, CSV dialect sniffing, column type inference
✅ Comprehensive Error Handling: Graceful fallbacks and detailed debug information
✅ In-Memory Processing: Fast iteration without persistent storage requirements
✅ CLI Support: Command-line interface for batch processing
▶️ Watch Full Demo Video - Complete walkthrough of TableRAG features
The interface demonstrates the clean, intuitive design with:
- 📁 Drag-and-drop file upload (CSV, PDF, DOCX, TXT, MD)
- ⚡ Real-time processing with progress indicators
- 🔧 Debug mode with SQL query inspection
- 💬 Interactive Q&A with comprehensive answers
```mermaid
graph TD
    A[📁 File Upload] --> B{File Type?}
    B -->|CSV| C[🗃️ CSV Parser]
    B -->|PDF/DOCX/TXT| D[📄 Text Extractor]
    C --> E[🧠 SQL Schema Generation]
    E --> F[💾 SQLite In-Memory DB]
    D --> G[✂️ Text Chunking]
    G --> H[🔤 Sentence Transformers]
    H --> I[🔍 FAISS Vector Index]
    J[❓ User Query] --> K[🤖 Query Decomposition<br/>Groq LLM]
    K --> L[🔍 Vector Search]
    I --> L
    L --> M[📚 Retrieved Chunks]
    K --> N[💬 NL2SQL Generation]
    F --> N
    N --> O[⚙️ SQL Execution]
    O --> P[📊 Query Results]
    M --> Q[🎯 Answer Synthesis<br/>Groq LLM]
    P --> Q
    Q --> R[✅ Final Answer]
```
Core Components:
- Document Ingestion: Multi-format file processing with validation
- Dual Storage: SQLite tables + FAISS vector embeddings
- Query Intelligence: LLM-powered query understanding and decomposition
- Hybrid Retrieval: SQL precision + semantic search
- Answer Generation: Context-aware response synthesis
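The hybrid retrieval at the heart of this flow can be illustrated with a toy, self-contained sketch: SQLite answers the precise tabular part, while a simple keyword-overlap score stands in for FAISS embedding search over text chunks. The LLM decomposition and synthesis steps are omitted, and none of these names are TableRAG's actual API:

```python
import sqlite3

# Table side: precise SQL over an in-memory SQLite DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 150000), ("South", 90000)])
rows = conn.execute(
    "SELECT region, SUM(revenue) FROM sales GROUP BY region "
    "ORDER BY region").fetchall()

# Text side: real semantic search goes through FAISS embeddings; here a
# crude keyword-overlap score stands in for vector similarity.
chunks = ["Revenue grew strongly in the North region.",
          "Hiring slowed in Q3."]
query = "revenue by region"
scored = sorted(
    chunks,
    key=lambda c: -len(set(query.split()) & set(c.lower().split())))

# Both evidence sources would then be handed to the LLM for synthesis.
print(rows)       # tabular evidence
print(scored[0])  # best-matching text chunk
```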
```
TableRAG/
├── 🎯 Core Application
│   ├── streamlit_app.py          # Main Streamlit UI (268 lines)
│   ├── run.py                    # CLI interface
│   └── app/                      # Core logic modules
│       ├── config.py             # Environment configuration
│       ├── pipeline/
│       │   └── rag_pipeline.py   # Main RAG orchestration (227 lines)
│       ├── llm/
│       │   ├── groq_client.py    # Groq API integration
│       │   └── prompts.py        # LLM prompt templates
│       ├── database/
│       │   └── sql_executor.py   # SQLite operations (269 lines)
│       ├── embeddings/
│       │   └── embedder.py       # Sentence Transformers wrapper
│       ├── retrieval/
│       │   └── faiss_index.py    # FAISS vector operations
│       └── utils/
│           ├── ingest.py         # Multi-format file processing (321 lines)
│           ├── chunking.py       # Text segmentation
│           └── logging.py        # Centralized logging
├── 📁 Data & Storage
│   ├── data/                     # User data directories
│   │   ├── tables/               # CSV files (persistent)
│   │   └── texts/                # Text documents (persistent)
│   ├── db/                       # SQLite databases
│   │   └── tables.db             # Persistent database (optional)
│   └── index/                    # FAISS index files
│       └── faiss.index           # Vector index (persistent)
├── 🎬 Assets & Documentation
│   ├── assets/
│   │   ├── Screenshot 2025-10-09 230417.png        # UI demo
│   │   └── Screen Recording 2025-10-09 225828.mp4  # Video demo
│   ├── test_assets/              # Sample files for testing
│   │   ├── test.csv
│   │   ├── report.pdf
│   │   └── report.html
│   └── README.md                 # This documentation
├── ⚙️ Configuration
│   ├── requirements.txt          # Python dependencies
│   ├── .env                      # Environment variables (create this)
│   ├── .gitignore                # Git exclusions
│   └── helper.py                 # Development utilities
└── 🐍 Virtual Environment
    └── trag/                     # Python virtual environment
```
- Python 3.12+ (recommended)
- Groq API Key (for LLM access)
- Git (for cloning)
```bash
git clone https://github.com/HemaKumar0077/TableRAG
cd TableRAG

# Windows
python -m venv trag
trag\Scripts\activate

# macOS/Linux
python3 -m venv trag
source trag/bin/activate

pip install -r requirements.txt
```

Create a `.env` file in the project root:
```env
# ===== REQUIRED CONFIGURATION =====
GROQ_API_KEY=gsk_your_groq_api_key_here

# ===== OPTIONAL CONFIGURATION =====
# Embedding Model (Hugging Face)
EMBEDDING_MODEL_NAME=sentence-transformers/all-MiniLM-L6-v2

# Database Settings
DB_TYPE=sqlite
SQLITE_DB_PATH=db/tables.db

# FAISS Index Configuration
FAISS_INDEX_PATH=index/faiss.index

# Retrieval Parameters
TOP_K_RETRIEVAL=5
MAX_ITERATIONS=1

# Logging
LOG_LEVEL=INFO
```

📋 How to get a Groq API Key:
- Visit console.groq.com
- Sign up/login with your account
- Navigate to "API Keys" section
- Create a new API key
- Copy and paste it into your `.env` file
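The app's `config.py` presumably reads these values at startup. As a rough stdlib-only illustration of that pattern (a real implementation would more likely use python-dotenv; the defaults mirror the optional settings shown above):

```python
import os

def load_env(path=".env"):
    """Tiny .env reader for illustration; real code would use python-dotenv."""
    values = {}
    if os.path.exists(path):
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                # Skip blanks and comments; split on the first '=' only.
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    values[key.strip()] = value.strip()
    return values

# Real environment variables take precedence over .env defaults.
env = {**load_env(), **os.environ}
TOP_K_RETRIEVAL = int(env.get("TOP_K_RETRIEVAL", "5"))
EMBEDDING_MODEL_NAME = env.get(
    "EMBEDDING_MODEL_NAME", "sentence-transformers/all-MiniLM-L6-v2")
```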
Create the data directories, then launch the app:

```bash
mkdir -p data/tables data/texts db index
streamlit run streamlit_app.py
```

Features:
- 🖱️ Drag & Drop: Upload CSV, PDF, DOCX, TXT, MD files
- ⚡ Real-time Processing: See upload progress and validation
- 🔧 Debug Mode: Inspect SQL queries and execution details
- 📊 Interactive Results: View data tables and text chunks
- ⚠️ Error Handling: Clear feedback on processing issues
Workflow:
- Upload Files: Drag CSV files (→ tables) and text files (→ chunks)
- Process Documents: Click "🚀 Process Documents"
- Ask Questions: Type natural language queries
- Get Answers: View synthesized responses with debug info
```bash
python run.py
```

Example Session:
```
🔍 TableRAG CLI
Ask a question based on your text and table knowledge base.

🧠 Enter your question: What was the total revenue by region?
✅ Answer: Based on the sales data, the total revenue by region is...

--- Debug Info ---
📚 Retrieved Chunks: [relevant text excerpts]
📄 SQL Query: SELECT region, SUM(revenue) FROM sales_data GROUP BY region
🧾 SQL Result: [{"region": "North", "revenue": 150000}, ...]
```
📊 Table Analysis:
- "What is the total sales revenue across all regions?"
- "Which product had the highest growth rate?"
- "Show me all customers with orders above $10,000"
- "What is the average age of customers by location?"
📄 Document Search:
- "What are the key findings from the uploaded reports?"
- "Summarize the main recommendations in the documents"
- "What challenges were mentioned in the analysis?"
🔗 Hybrid Queries:
- "Based on the sales data, what do the reports say about market trends?"
- "Compare the revenue figures with the strategic recommendations"
- Model: Llama-3.3-70B-Versatile
- API: OpenAI-compatible REST interface
- Functions: Query decomposition, SQL generation, answer synthesis
- Timeout: 30-second request limit with retry logic
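Because the API is OpenAI-compatible, a minimal client needs only an HTTP POST. This sketch shows the timeout-plus-retry pattern described above; it is not the project's actual `groq_client.py`, and the backoff policy is an assumption:

```python
import time
import requests

def groq_chat(api_key, messages, model="llama-3.3-70b-versatile",
              timeout=30, retries=3):
    """Call Groq's OpenAI-compatible chat endpoint with retries."""
    url = "https://api.groq.com/openai/v1/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {"model": model, "messages": messages}
    for attempt in range(retries):
        try:
            resp = requests.post(url, json=payload, headers=headers,
                                 timeout=timeout)
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff between retries
```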
- Algorithm: Inner Product (IP) for cosine similarity
- Embeddings: Sentence Transformers (384-dim by default)
- Storage: In-memory with optional persistence
- Performance: Sub-second search on 100K+ chunks
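Inner product equals cosine similarity only when the vectors are L2-normalized first. A small NumPy sketch of the math that faiss's `IndexFlatIP` applies at scale (the data here is random, for shape illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 384)).astype("float32")    # 4 chunk embeddings
query = rng.normal(size=(1, 384)).astype("float32")  # 1 query embedding

def l2_normalize(x):
    # Divide each row by its L2 norm so dot product == cosine similarity.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

emb_n, query_n = l2_normalize(emb), l2_normalize(query)

# Inner product on normalized vectors is cosine similarity in [-1, 1].
scores = (query_n @ emb_n.T).ravel()
top_k = np.argsort(-scores)[:2]  # indices of the 2 most similar chunks
```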
- Connection: Thread-safe, in-memory primary storage
- Features: Auto-schema inference, type detection, sanitization
- Safety: SQL injection protection, transaction management
- Validation: Comprehensive error handling and rollback
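Auto-schema inference can be sketched roughly like this: try casting each column to INTEGER, then REAL, falling back to TEXT. This is a simplified stand-in for what `sql_executor.py`'s type detection might do; the helper names are hypothetical:

```python
import sqlite3

def infer_type(values):
    """Crude column type inference: INTEGER -> REAL -> TEXT fallback."""
    for sql_type, caster in (("INTEGER", int), ("REAL", float)):
        try:
            for v in values:
                caster(v)
            return sql_type
        except ValueError:
            continue
    return "TEXT"

def load_csv_rows(conn, table, rows):
    """Create a table from header + data rows with inferred column types."""
    header, data = rows[0], rows[1:]
    types = [infer_type([r[i] for r in data]) for i in range(len(header))]
    cols = ", ".join(f'"{h}" {t}' for h, t in zip(header, types))
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    placeholders = ", ".join("?" * len(header))
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)

conn = sqlite3.connect(":memory:")
load_csv_rows(conn, "sales", [["region", "revenue"],
                              ["North", "150000"], ["South", "90000"]])
```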
```python
# Supported formats and processing
SUPPORTED_FORMATS = {
    'CSV': 'Parsed → SQLite tables with type inference',
    'PDF': 'Text extraction → chunked → vectorized',
    'DOCX': 'Content extraction → chunked → vectorized',
    'TXT/MD': 'Direct chunking → vectorized',
```
```python
}
```

| Variable | Default | Description |
|---|---|---|
| `GROQ_API_KEY` | Required | Your Groq API authentication key |
| `EMBEDDING_MODEL_NAME` | `all-MiniLM-L6-v2` | Hugging Face model for embeddings |
| `SQLITE_DB_PATH` | `db/tables.db` | Persistent SQLite database location |
| `FAISS_INDEX_PATH` | `index/faiss.index` | FAISS vector index file path |
| `TOP_K_RETRIEVAL` | `5` | Number of text chunks to retrieve |
| `LOG_LEVEL` | `INFO` | Logging verbosity (DEBUG/INFO/WARNING/ERROR) |
❌ "Failed to load embedding model"
```bash
# Solution: Install/update transformers
pip install --upgrade sentence-transformers torch
```

❌ "Groq API authentication failed"

```bash
# Check your .env file has the correct API key
echo $GROQ_API_KEY  # Should show your key
```

❌ "CSV parsing errors"
- Cause: Encoding issues or malformed CSV
- Solution: Check file encoding, verify CSV structure
- Debug: Enable "Show Debug Information" in UI
❌ "Empty query results"
- Cause: No relevant data found
- Solution: Verify files were processed successfully
- Check: File Information sidebar shows loaded tables/chunks
- Large CSVs: Files auto-process in 1000-row batches
- Memory Usage: Consider smaller `TOP_K_RETRIEVAL` values
- Response Time: Use more specific queries for faster results
Log Location: app.log (rotating, 5MB max)
Log Levels Available:
```env
LOG_LEVEL=DEBUG    # Detailed query and processing info
LOG_LEVEL=INFO     # Standard operational messages
LOG_LEVEL=WARNING  # Issues that don't break functionality
LOG_LEVEL=ERROR    # Critical errors requiring attention
```

Key Metrics Logged:
- File processing times and success rates
- SQL query execution and results
- Vector search performance
- LLM API response times and errors
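A rotating-file logger matching the `app.log` / 5 MB description above might be set up like this; the handler settings, logger name, and format string are illustrative, not necessarily what `app/utils/logging.py` does:

```python
import logging
import os
from logging.handlers import RotatingFileHandler

def get_logger(name="tablerag"):
    """Return a logger writing to app.log, rotating at 5 MB."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on re-import
        logger.setLevel(os.getenv("LOG_LEVEL", "INFO").upper())
        handler = RotatingFileHandler(
            "app.log", maxBytes=5 * 1024 * 1024, backupCount=3)
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s %(name)s: %(message)s"))
        logger.addHandler(handler)
    return logger
```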
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Commit changes: `git commit -m 'Add amazing feature'`
- Push to branch: `git push origin feature/amazing-feature`
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Groq - Fast LLM inference
- Hugging Face - Transformer models and embeddings
- FAISS - Efficient similarity search
- Streamlit - Rapid web app development
- SQLite - Embedded database engine
Built with ❤️ for intelligent document analysis