100% Local Inference · Zero Data Leakage · Built for Developers
An industrial-grade Retrieval-Augmented Generation (RAG) system that turns your
entire codebase, technical docs, and architecture diagrams into a conversational second brain.
- All embeddings and reranking run on your local GPU. Code snippets, documents, and conversations never leave your machine. The ChromaDB multi-collection architecture provides physical isolation per project space.
- BM25 (code-aware tokenizer) + Vector (BGE) + RRF fusion. The custom tokenizer keeps code identifiers such as getUserById intact.
- Tuned for NVIDIA RTX 5060 · PyTorch 2.6 · CUDA 12.8. Concurrent batch ingestion with thread-pool workers, incremental mtime-based scanning, and BM25 instance caching that avoids O(N) vocabulary rebuilds on every chat turn.
- PyMuPDF for PDF, python-docx for Word, Gemini Vision for architecture diagrams/screenshots, and tree-sitter for structural code splitting.
- LLM quota exhausted? The system fails over automatically from Gemini to Qwen (and back) mid-conversation without losing context. BM25 init failure? It falls back to pure vector retrieval. Resilient by design.
- Watchdog monitors workspace directories with configurable debounce. Files are incrementally re-indexed on change; deleted files are pruned from the vector store. Ghost node cleanup runs on every startup.
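The debounced re-indexing described above can be illustrated with a small sketch. It is not the project's `watcher.py`: the class name `DebouncedBatch`, its methods, and the 2-second delay are all hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow. A real handler would presumably take them from a clock such as `time.monotonic()` inside a Watchdog event callback.

```python
class DebouncedBatch:
    """Collect file events and release a path only after it has been quiet for `delay` seconds."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self._pending: dict[str, float] = {}  # path -> time of the latest event

    def record(self, path: str, now: float) -> None:
        # Repeated events for the same path just push its deadline forward.
        self._pending[path] = now

    def flush_ready(self, now: float) -> list[str]:
        # A path is ready once no new event arrived for at least `delay` seconds.
        ready = sorted(p for p, t in self._pending.items() if now - t >= self.delay)
        for path in ready:
            del self._pending[path]
        return ready


batch = DebouncedBatch(delay=2.0)  # hypothetical debounce window
for timestamp in (0.0, 0.5, 1.0):  # an editor saving a.py three times in a row
    batch.record("a.py", now=timestamp)
batch.record("b.py", now=0.2)

first = batch.flush_ready(now=2.5)   # b.py has been quiet long enough, a.py has not
second = batch.flush_ready(now=3.0)  # a.py is indexed once, not three times
```

Collapsing bursts of events this way means a file saved several times in quick succession is re-indexed once.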
| Layer | Technology | Role |
|---|---|---|
| Orchestration | LlamaIndex 0.12+ | Index pipeline, Condense+Context chat engine, RRF fusion |
| LLM | Gemini 2.5 Flash · Qwen-Plus/Max | Conversational generation with seamless failover |
| Embedding | BAAI/bge-large-zh-v1.5 (HuggingFace) | Local GPU vector embeddings |
| Reranker | BAAI/bge-reranker-v2-m3 (SentenceTransformer) | Local GPU cross-encoder reranking |
| Vector Store | ChromaDB (Persistent Client) | On-disk persistence, multi-collection isolation |
| Hybrid Search | BM25 + Vector + Reciprocal Rank Fusion | Code-aware tokenizer + semantic dual-pathway |
| UI | Streamlit 1.40+ | Streaming chat, source trace panel, file upload, ZIP extraction |
| File Watching | Watchdog 6.0+ | Debounced incremental indexing on filesystem events |
| Doc Parsing | PyMuPDF · python-docx · tree-sitter | PDF, Word, structural code splitting |
| Tokenization | Jieba + custom regex | Chinese semantic + code identifier extraction |
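Reciprocal Rank Fusion, listed in the table above, merges the BM25 and vector rankings by summing `1 / (k + rank)` for each document across the lists. The sketch below is a minimal standalone illustration, not the LlamaIndex implementation the project relies on. The function name, the file names, and `k = 60` (a commonly used constant) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into a single scored ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks contribute more.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, then by ID so ties are deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from the two retrieval pathways.
bm25_hits = ["auth.py", "user_service.py", "db.py"]
vector_hits = ["user_service.py", "README.md", "auth.py"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
fused_ids = [doc_id for doc_id, _score in fused]
```

Documents that appear in both lists accumulate two contributions, so they tend to outrank documents found by only one pathway.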
bendiRAG/
├── main.py # Entry point — loads .env, sets cache paths, launches Streamlit
├── app.py # Streamlit UI — streaming chat, space management, file upload
├── config.py # Configuration — .env loading, workspace persistence
├── rag_engine.py # RAG core — indexing, BM25+Vector retrieval, chat engine
├── watcher.py # Watchdog daemon — debounced incremental file indexing
├── requirements.txt # Python dependencies
├── .env.example # Environment variable template (safe to commit)
├── .env # Local secrets (excluded from Git)
├── .gitignore
├── .chroma_second_brain/ # ChromaDB persistent storage
├── dynamic_workspace/ # Uploaded files & ZIP extraction scratch space
├── workspaces.json # Per-space workspace path registry
└── .index_state_*.json # Per-space mtime index state (auto-generated)
| File | Responsibility |
|---|---|
| main.py | Boot: loads .env, syncs HF_HOME / cache env vars, delegates to run_app() |
| app.py | Full UI lifecycle: space switching, streaming output, source trace expander, chat persistence, file upload/ZIP extraction, workspace CRUD |
| config.py | AppConfig dataclass, dotenv loading, workspaces.json persistence, HTTP proxy injection |
| rag_engine.py | RAG infrastructure: GPU embedding/reranker init, BM25+Vector hybrid retrieval, ChromaDB collection management, ghost node cleanup, incremental scan, code-aware chunking, image captioning |
| watcher.py | Watchdog event handler: debounced batch upsert, file deletion sync to vector store |
- Python 3.10+ · CUDA 12.8 · NVIDIA GPU (RTX 3060 or above recommended)
- Google AI Studio API Key (free tier available)
- (Optional) Aliyun DashScope API Key for Qwen failover
# Create and activate conda environment
conda create -n bendirag python=3.11 -y
conda activate bendirag
# Install PyTorch 2.6 with CUDA 12.8 support
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
# Install project dependencies
pip install -r requirements.txt
# Copy the template
cp .env.example .env
# Edit .env with your API keys
# GEMINI_API_KEY=... (required)
# DASHSCOPE_API_KEY=... (optional — Qwen failover)
# WORKSPACE_PATHS=... (directories to index)
streamlit run main.py
# or: python main.py
Open http://localhost:8501 and start conversing with your codebase.
On startup, the app scans all WORKSPACE_PATHS directories, chunks every indexable file (code, docs, images), and ingests them into the vector store. A progress bar shows batch insertion status in real time.
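A startup scan that skips unchanged files can be sketched by comparing each file's modification time with the value recorded in the index state. This is a simplified illustration rather than the project's code: `find_changed_files` and the flat `{path: mtime}` layout of the state are assumptions, and the real `.index_state_*.json` format may differ.

```python
import os
import tempfile


def find_changed_files(root: str, index_state: dict[str, float]) -> list[str]:
    """Return files under `root` whose mtime differs from the recorded one."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # A missing entry (new file) or a different mtime both trigger re-indexing.
            if index_state.get(path) != os.path.getmtime(path):
                changed.append(path)
    return sorted(changed)


# Demo in a throwaway directory: one file is already indexed, one is new.
with tempfile.TemporaryDirectory() as root:
    indexed_file = os.path.join(root, "indexed.py")
    new_file = os.path.join(root, "new.py")
    for path in (indexed_file, new_file):
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("pass\n")
    state = {indexed_file: os.path.getmtime(indexed_file)}
    changed = find_changed_files(root, state)
```

Only `new.py` is reported, so a restart with an up-to-date index state would have nothing to re-ingest.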
| Action | How | Effect |
|---|---|---|
| Switch space | Sidebar dropdown | Switches to an isolated ChromaDB collection + chat history |
| Create space | + button | Spawns a blank collection |
| Destroy space | 🗑️ button | Physically deletes collection + index state + chat history |
- File upload: Drag & drop files or ZIP archives. ZIPs auto-extract (with Chinese filename encoding fix).
- Watch directory: Enter a local folder path — one-click full scan + continuous watchdog monitoring.
- Remove workspace: Click ✕ to purge all vector nodes under a directory.
Ask natural-language questions like:
- "Explain the overall architecture and tech stack of this project."
- "What does the recent change to UserService do?"
- "Map out the database table dependencies."
Each response includes a source trace panel showing the originating file path and a 300-character code snippet preview.
| Model | Provider | Required Key |
|---|---|---|
| gemini-2.5-flash | Google AI Studio | GEMINI_API_KEY |
| qwen-plus | Aliyun DashScope | DASHSCOPE_API_KEY |
| qwen-max | Aliyun DashScope | DASHSCOPE_API_KEY |
Auto-failover: If the active model returns a quota/resource error (HTTP 429), the system automatically retries with the alternate provider — no manual intervention needed.
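The failover behavior can be pictured as a loop that tries the active provider first and moves to the alternate one on a quota error, passing the same chat history along so context is preserved. Everything in this sketch is hypothetical: `QuotaExceededError` stands in for whatever exception the real SDKs raise on HTTP 429, and the two stub functions replace actual Gemini and Qwen calls.

```python
from typing import Callable


class QuotaExceededError(Exception):
    """Stand-in for a provider's quota / HTTP 429 error (hypothetical)."""


Provider = Callable[[str, list[str]], str]


def chat_with_failover(prompt: str, history: list[str],
                       providers: dict[str, Provider], active: str) -> tuple[str, str]:
    """Try the active provider first, then the others; return (provider_name, reply)."""
    order = [active] + [name for name in providers if name != active]
    last_error = None
    for name in order:
        try:
            # The same history goes to every provider, so no context is lost.
            return name, providers[name](prompt, history)
        except QuotaExceededError as exc:
            last_error = exc
    raise RuntimeError("all providers are out of quota") from last_error


def gemini_stub(prompt: str, history: list[str]) -> str:
    raise QuotaExceededError("429: quota exhausted")


def qwen_stub(prompt: str, history: list[str]) -> str:
    return f"answer using {len(history)} prior turns"


providers = {"gemini-2.5-flash": gemini_stub, "qwen-plus": qwen_stub}
history = ["What does config.py do?", "It loads .env and workspaces.json."]
used, reply = chat_with_failover("And watcher.py?", history, providers, active="gemini-2.5-flash")
```

Returning the name of the provider that answered lets the caller mark it as active for the next turn, which is one way the "and back" direction could work.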
import re
import jieba

def code_aware_tokenize(text: str) -> list[str]:
    code_tokens = re.findall(r"[a-zA-Z0-9_]+", text)  # whole identifiers: camelCase / snake_case
    chinese_tokens = [w for w in jieba.lcut(text) if ...]  # Chinese semantics; filter condition elided in the source
    return code_tokens + chinese_tokens

BM25 keeps identifiers like getUserById intact instead of splitting them into ["get", "User", "By", "Id"], which matters for code search, where queries are often exact identifier names.
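To make the difference concrete, the sketch below runs the identifier regex from the tokenizer next to a camelCase splitter on the same query. The splitter and its regex are a hypothetical contrast added here for illustration; they are not part of the project, and the Jieba step is left out so the example needs only the standard library.

```python
import re

# Same pattern as the tokenizer above: a whole identifier is one token.
IDENTIFIER_RE = re.compile(r"[a-zA-Z0-9_]+")
# Hypothetical contrast: break identifiers at case boundaries and underscores.
CAMEL_PART_RE = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")


def identifier_tokens(text: str) -> list[str]:
    return IDENTIFIER_RE.findall(text)


def camel_split_tokens(text: str) -> list[str]:
    return CAMEL_PART_RE.findall(text)


query = "getUserById(user_id)"
whole = identifier_tokens(query)   # identifiers stay intact
parts = camel_split_tokens(query)  # identifiers are broken into generic words
```

With whole identifiers, a BM25 query for `getUserById` matches only text containing that exact name, whereas the split form would also match any text that mentions "get", "User", "By", or "Id".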
File created → incremental_upsert_file → update index_state
File modified → delete_ref_doc → re-insert → update index_state mtime
File deleted → watchdog on_deleted → delete_ref_doc
Workspace rm → delete_workspace_nodes → batch purge via index_state
Space destroy → delete_collection → wipe Chroma + state + chat history
Ghost cleanup → cleanup_ghost_nodes → scan index_state for vanished files
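The ghost cleanup step in the last row can be sketched as a pass over the index state that drops entries whose files no longer exist. The function below is an assumption-laden illustration: the name `cleanup_ghost_nodes_sketch`, the `{path: mtime}` state layout, and the `delete_ref_doc` callback signature are guesses, and the real functions in `rag_engine.py` may look different.

```python
import os
import tempfile
from typing import Callable


def cleanup_ghost_nodes_sketch(index_state: dict[str, float],
                               delete_ref_doc: Callable[[str], None]) -> list[str]:
    """Remove index entries (and their vector nodes) for files that vanished from disk."""
    ghosts = sorted(path for path in index_state if not os.path.exists(path))
    for path in ghosts:
        delete_ref_doc(path)   # purge the document's nodes from the vector store
        del index_state[path]  # forget the file so it is not reported again
    return ghosts


with tempfile.TemporaryDirectory() as root:
    kept = os.path.join(root, "kept.py")
    with open(kept, "w", encoding="utf-8") as handle:
        handle.write("pass\n")
    gone = os.path.join(root, "gone.py")  # recorded in the state but absent on disk
    state = {kept: 1.0, gone: 2.0}
    deleted: list[str] = []               # stand-in for the vector store deletion
    ghosts = cleanup_ghost_nodes_sketch(state, deleted.append)
    remaining = sorted(state)
```

Running this at startup covers files that were deleted while the watcher was not running.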
Constructing a BM25 retriever requires an O(N) scan over the full document vocabulary. The system caches (index_state_mtime, BM25Retriever) globally and rebuilds only when the space's index state file changes, so the retriever is reused across chat turns with no per-message rebuild cost.
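A minimal sketch of this caching pattern follows. It keeps one `(mtime, retriever)` pair per space and calls the expensive builder only when the recorded mtime changes. Keying the cache by space name, the function name `get_cached_retriever`, and the string stand-in for a real retriever are assumptions made for illustration.

```python
from typing import Any, Callable

# Global cache: space name -> (index state mtime, retriever built for that state).
_bm25_cache: dict[str, tuple[float, Any]] = {}


def get_cached_retriever(space: str, state_mtime: float, build: Callable[[], Any]) -> Any:
    """Return the cached retriever unless the index state changed since it was built."""
    cached = _bm25_cache.get(space)
    if cached is not None and cached[0] == state_mtime:
        return cached[1]  # index unchanged: skip the O(N) rebuild
    retriever = build()
    _bm25_cache[space] = (state_mtime, retriever)
    return retriever


build_log: list[int] = []


def build_retriever() -> str:
    # Stand-in for the expensive BM25 construction.
    build_log.append(1)
    return f"retriever-v{len(build_log)}"


first = get_cached_retriever("default", 100.0, build_retriever)
second = get_cached_retriever("default", 100.0, build_retriever)  # same mtime: cache hit
third = get_cached_retriever("default", 200.0, build_retriever)   # state changed: rebuild
```

In the real system the mtime would presumably come from the space's index state file, for example via `os.path.getmtime`.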
MIT License — use freely. Your data never leaves your machine.
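Core positioning: an industrial-grade RAG system for software engineers. All embedding and rerank inference runs on the local GPU, and code snippets never leave your machine.
Five highlights:
- Privacy isolation: ChromaDB multi-collection physical isolation; destroying a space erases it completely
- Hybrid retrieval: BM25 (code-aware tokenization) + Vector (BGE semantics) + RRF (Reciprocal Rank Fusion)
- Hardware optimization: tuned for RTX 5060 / PyTorch 2.6 / CUDA 12.8, with concurrent batch writes and BM25 cache reuse
- Multi-format parsing: PyMuPDF for PDF, python-docx for Word, the Gemini vision model for describing screenshots, tree-sitter for structural splitting of Java/Vue
- Automatic failover: when the Gemini quota is exhausted, the system switches to Qwen automatically and the conversation continues uninterrupted
Quick install: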
conda create -n bendirag python=3.11 -y && conda activate bendirag
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
pip install -r requirements.txt
cp .env.example .env # edit and fill in GEMINI_API_KEY
streamlit run main.py
Tech stack: LlamaIndex orchestration · ChromaDB vector store · Streamlit UI · HuggingFace BGE local embedding/rerank · Gemini 2.5 Flash + Qwen Plus/Max dual engines · Watchdog file watching · Jieba Chinese tokenization
Built with ❤️ for developers who care about privacy and performance.