
RAG Ingest Multi-Domain

Multi-domain RAG (Retrieval-Augmented Generation) ingestion pipeline that syncs documents from Microsoft SharePoint/OneDrive, generates embeddings, and stores them in per-domain FAISS indexes.

Overview

This project is a document ingestion pipeline that:

  1. Connects to Microsoft SharePoint/OneDrive via the Microsoft Graph API, using the OAuth2 client credentials flow
  2. Monitors multiple domain folders (configurable via DOMAINS environment variable)
  3. Downloads documents (PDF, DOCX, TXT) from SharePoint
  4. Extracts and cleans text from those documents
  5. Chunks text using a parent-child splitting strategy:
    • Parent chunks: 1200 chars with 100 char overlap
    • Child chunks: 300 chars with 50 char overlap
  6. Generates embeddings using the nomic-ai/nomic-embed-text-v1.5 model
  7. Stores vectors in FAISS indexes - one index per domain
  8. Runs continuously, polling for file changes at regular intervals (default 300s)
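The parent-child splitting in step 5 can be sketched as follows. This is an illustrative reconstruction using the sizes and overlaps listed above; the function names and exact windowing are assumptions, and the real implementation lives in app/ingest.py.

```python
def split(text: str, size: int, overlap: int) -> list[str]:
    """Slide a window of `size` chars over the text, stepping by `size - overlap`."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def parent_child_chunks(text: str) -> list[dict]:
    """Split into 1200/100 parent chunks, then 300/50 children within each parent."""
    chunks = []
    for p_id, parent in enumerate(split(text, 1200, 100)):
        for child in split(parent, 300, 50):
            # In a parent-child scheme the small child chunk is what gets
            # embedded and searched; the larger parent supplies context.
            chunks.append({"parent_id": p_id, "parent": parent, "child": child})
    return chunks
```

The small children improve retrieval precision, while the parent text gives the generator enough surrounding context at answer time.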

Key Features

  • Change detection: Detects new, modified, and deleted files and rebuilds indexes accordingly
  • Multi-tenant: Uses TENANT_ID, CLIENT_ID, CLIENT_SECRET for Microsoft Graph auth
  • CPU-optimized: Runs on CPU by default; GPU support can be re-enabled (see GPU Support below)
  • Per-domain isolation: Each domain has its own FAISS index, metadata, and chunks files
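Conceptually, change detection amounts to diffing the metadata saved after the previous run against a fresh folder listing. A minimal sketch, assuming both listings are maps of file name to last-modified timestamp (the actual comparison in app/ingest.py may use different fields):

```python
def diff_files(previous: dict[str, str], current: dict[str, str]):
    """Compare two {file_name: last_modified} maps and classify changes."""
    added    = [f for f in current if f not in previous]
    deleted  = [f for f in previous if f not in current]
    modified = [f for f in current
                if f in previous and current[f] != previous[f]]
    return added, modified, deleted
```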

Requirements

  • Python 3.10
  • Docker and Docker Compose (for the containerized setup)
  • Microsoft SharePoint/OneDrive with an app registration

Configuration

Environment variables (see .env.example):

| Variable | Description |
| --- | --- |
| `TENANT_ID` | Microsoft Azure AD tenant ID |
| `CLIENT_ID` | Application (client) ID |
| `CLIENT_SECRET` | Client secret for authentication |
| `SITE_ID` | SharePoint site ID |
| `DRIVE_ID` | SharePoint drive ID |
| `DOMAINS` | Comma-separated list of folder names to monitor (e.g., `Councilreports,Humanresources`) |
| `POLL_INTERVAL` | How often to check for changes, in seconds (default: `300`) |
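Reading these variables could look like the sketch below. The variable names match the table; the validation and defaults (other than `POLL_INTERVAL`'s 300 seconds) are illustrative assumptions, not the repository's actual loader.

```python
import os


def load_config() -> dict:
    """Read the pipeline's environment variables, failing fast on missing ones."""
    required = ["TENANT_ID", "CLIENT_ID", "CLIENT_SECRET", "SITE_ID", "DRIVE_ID"]
    missing = [v for v in required if not os.environ.get(v)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return {
        **{v: os.environ[v] for v in required},
        # DOMAINS is a comma-separated list of folder names.
        "DOMAINS": [d.strip() for d in os.environ.get("DOMAINS", "").split(",")
                    if d.strip()],
        # Poll interval in seconds, defaulting to 300 as documented.
        "POLL_INTERVAL": int(os.environ.get("POLL_INTERVAL", "300")),
    }
```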

Quick Start

  1. Copy the example environment file and fill in your credentials:

     cp .env.example .env

  2. Build and run with Docker:

     docker-compose up -d

Project Structure

.
├── app/
│   └── ingest.py          # Main ingestion script
├── Dockerfile             # Docker image definition
├── docker-compose.yml     # Docker Compose configuration
├── requirements.txt       # Python dependencies
├── .env.example           # Example environment variables
└── README.md              # This file

Output

For each domain, the following files are created:

  • {domain}/faiss_index.bin - FAISS vector index
  • {domain}/metadata.json - File metadata (names, modification dates)
  • {domain}/chunks.json - Text chunks with source information

Docker

Build

docker build -t rag-ingest-multidomain .

Run

docker run -d --env-file .env rag-ingest-multidomain

GPU Support

The project is configured for CPU by default. To enable GPU:

  1. Uncomment torch in requirements.txt
  2. In Dockerfile, comment out the CPU-only PyTorch install line
  3. In app/ingest.py, uncomment the CUDA device detection and model movement lines

License

MIT
