A multi-domain RAG (Retrieval-Augmented Generation) ingestion pipeline that syncs documents from Microsoft SharePoint/OneDrive, generates embeddings, and stores them in FAISS indexes.
This project is a document ingestion pipeline that:
- Connects to Microsoft SharePoint/OneDrive via Microsoft Graph API using OAuth2 client credentials
- Monitors multiple domain folders (configurable via the `DOMAINS` environment variable)
- Downloads documents (PDF, DOCX, TXT) from SharePoint
- Extracts and cleans text from those documents
- Chunks text using a parent-child splitting strategy:
  - Parent chunks: 1200 chars with 100 char overlap
  - Child chunks: 300 chars with 50 char overlap
- Generates embeddings using the `nomic-ai/nomic-embed-text-v1.5` model
- Stores vectors in FAISS indexes, one index per domain
- Runs continuously, polling for file changes at regular intervals (default 300s)
- Change detection: Detects new, modified, and deleted files and rebuilds indexes accordingly
- Multi-tenant: Uses `TENANT_ID`, `CLIENT_ID`, and `CLIENT_SECRET` for Microsoft Graph auth
- CPU-optimized: Runs on CPU by default (GPU support can be re-enabled)
- Per-domain isolation: Each domain has its own FAISS index, metadata, and chunks files
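The parent-child splitting strategy listed above can be sketched in plain Python. This is an illustration of the 1200/100 and 300/50 character scheme, not the actual code in `app/ingest.py` (the helper names `split_text` and `parent_child_chunks` are hypothetical):

```python
def split_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character chunks, where consecutive
    chunks share `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

def parent_child_chunks(text: str):
    """Yield (parent, children) pairs: 1200-char parents with 100-char
    overlap, each split into 300-char children with 50-char overlap."""
    for parent in split_text(text, 1200, 100):
        yield parent, split_text(parent, 300, 50)
```

Child chunks are typically embedded for retrieval, while the larger parent chunk is what gets handed to the LLM for context.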
- Python 3.10
- Microsoft SharePoint/OneDrive with app registration
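With an app registration in place, the pipeline authenticates to Microsoft Graph using the standard OAuth2 client credentials grant against the Microsoft identity platform token endpoint. A minimal sketch (function names are illustrative, not taken from `app/ingest.py`):

```python
import requests

def build_token_request(tenant_id: str, client_id: str, client_secret: str):
    """Build the client-credentials token request for Microsoft Graph."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    data = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # .default requests all application permissions granted to the app
        "scope": "https://graph.microsoft.com/.default",
    }
    return url, data

def get_graph_token(tenant_id: str, client_id: str, client_secret: str) -> str:
    """POST the token request and return the app-only access token."""
    url, data = build_token_request(tenant_id, client_id, client_secret)
    resp = requests.post(url, data=data)
    resp.raise_for_status()
    return resp.json()["access_token"]
```

The returned token is then sent as a `Bearer` header on Graph drive/item requests.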
Environment variables (see .env.example):
| Variable | Description |
|---|---|
| `TENANT_ID` | Microsoft Azure AD tenant ID |
| `CLIENT_ID` | Application (client) ID |
| `CLIENT_SECRET` | Client secret for authentication |
| `SITE_ID` | SharePoint site ID |
| `DRIVE_ID` | SharePoint drive ID |
| `DOMAINS` | Comma-separated list of folder names to monitor (e.g., `Councilreports,Humanresources`) |
| `POLL_INTERVAL` | How often to check for changes, in seconds (default: 300) |
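A rough sketch of how these variables might be read at startup (illustrative only; `load_config` is a hypothetical helper and the real parsing in `app/ingest.py` may differ):

```python
import os

def load_config(env=os.environ):
    """Read pipeline settings from environment variables."""
    # DOMAINS is comma-separated; strip whitespace and drop empty entries
    domains = [d.strip() for d in env.get("DOMAINS", "").split(",") if d.strip()]
    return {
        "tenant_id": env["TENANT_ID"],        # required for Graph auth
        "client_id": env["CLIENT_ID"],
        "client_secret": env["CLIENT_SECRET"],
        "site_id": env["SITE_ID"],
        "drive_id": env["DRIVE_ID"],
        "domains": domains,                   # folders to monitor
        "poll_interval": int(env.get("POLL_INTERVAL", "300")),  # seconds
    }
```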
- Copy the example environment file and fill in your credentials:

  ```bash
  cp .env.example .env
  ```

- Build and run with Docker:

  ```bash
  docker-compose up -d
  ```
```
├── app/
│   └── ingest.py         # Main ingestion script
├── Dockerfile            # Docker image definition
├── docker-compose.yml    # Docker Compose configuration
├── requirements.txt      # Python dependencies
├── .env.example          # Example environment variables
└── README.md             # This file
```
For each domain, the following files are created:
- `{domain}/faiss_index.bin` - FAISS vector index
- `{domain}/metadata.json` - File metadata (names, modification dates)
- `{domain}/chunks.json` - Text chunks with source information
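On each poll, change detection works by comparing the stored `metadata.json` against the current SharePoint listing. A hedged sketch of that diff (the `diff_files` function and its name-to-timestamp shape are illustrative, not taken from `app/ingest.py`):

```python
def diff_files(stored: dict, current: dict):
    """Compare stored metadata against the current file listing.

    Both dicts map file name -> last-modified timestamp.
    Returns (new, modified, deleted) sets of file names.
    """
    new = set(current) - set(stored)
    deleted = set(stored) - set(current)
    modified = {name for name in set(stored) & set(current)
                if stored[name] != current[name]}
    return new, modified, deleted
```

New and modified files are re-downloaded, re-chunked, and re-embedded; deletions trigger a rebuild of that domain's index so stale vectors are dropped.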
```bash
docker build -t rag-ingest-multidomain .
docker run -d --env-file .env rag-ingest-multidomain
```

The project is configured for CPU by default. To enable GPU:
- Uncomment `torch` in `requirements.txt`
- In `Dockerfile`, comment out the CPU-only PyTorch install line
- In `app/ingest.py`, uncomment the CUDA device detection and model movement lines
MIT