This project currently uses a unified CocoIndex-based ingestion pipeline as the primary path.
- Package: `src/ingestion/unified/`
- CLI entrypoint: `python -m src.ingestion.unified.cli`
- Pipeline version in code: `v3.2.1`

- Reads files from `GDRIVE_SYNC_DIR` (LocalFile source).
- Computes stable file identity via manifest/content hash.
- Parses and chunks files through Docling.
- Generates embeddings:
  - default: local BGE-M3 (`USE_LOCAL_DENSE_EMBEDDINGS=true`)
  - optional: Voyage dense + BGE-M3 sparse
- Upserts/deletes points in Qdrant.
- Tracks file state/retries/DLQ in PostgreSQL.
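
The content-hash half of the file identity step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `content_hash` is hypothetical, and the choice of SHA-256 is an assumption, since the list above only says "manifest/content hash".

```python
import hashlib
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file's bytes.

    Hypothetical helper: SHA-256 is assumed here, the real pipeline
    may use a different algorithm or combine the hash with a manifest.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in fixed-size blocks so large files are not loaded into memory at once.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Because the digest depends only on the bytes, a file that is renamed or moved keeps the same identity, while any edit to its content produces a new one.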
# Validate dependencies, env, and source directory
make ingest-unified-preflight
# Create or validate the runtime collection schema
make ingest-unified-bootstrap
# One-shot run
make ingest-unified
# Continuous watch mode
make ingest-unified-watch
# State/DLQ status
make ingest-unified-status
# Reprocess files in error state
make ingest-unified-reprocess
# Container logs
make ingest-unified-logs

Direct CLI equivalents:
uv run python -m src.ingestion.unified.cli preflight
uv run python -m src.ingestion.unified.cli bootstrap
uv run python -m src.ingestion.unified.cli run --watch
uv run python -m src.ingestion.unified.cli status
uv run python -m src.ingestion.unified.cli reprocess --errors

- CLI root spans:
  - `ingestion-cli-run`
  - `ingestion-cli-preflight`
- Runtime stage spans:
  - `ingestion-flow-run-once`
  - `ingestion-flow-watch`
  - `ingestion-qdrant-upsert-chunks`
  - `ingestion-qdrant-delete-file`

Validate coverage together with API/voice traces:
make validate-traces-fast

Service connection variables:
- `INGESTION_DATABASE_URL`
- `QDRANT_URL`
- `DOCLING_URL`
- `BGE_M3_URL`

Commonly used:
- `GDRIVE_SYNC_DIR`
- `GDRIVE_COLLECTION_NAME`
- `RCLONE_CONFIG_FILE`
- `RCLONE_REMOTE`
- `USE_LOCAL_DENSE_EMBEDDINGS`
- `BGE_M3_TIMEOUT`
- `BGE_M3_CONCURRENCY`
- `MANIFEST_DIR`

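
Before starting a run, it can help to confirm that the service URL variables are set. The sketch below is illustrative only: `missing_vars` is a hypothetical helper, and treating all four URL variables as mandatory is an assumption rather than something the `preflight` command is documented to do.

```python
import os
from collections.abc import Mapping
from typing import Optional

# Names taken from the variable list above; treating all four as
# mandatory is an assumption made for this sketch.
SERVICE_VARS = ("INGESTION_DATABASE_URL", "QDRANT_URL", "DOCLING_URL", "BGE_M3_URL")


def missing_vars(
    env: Optional[Mapping[str, str]] = None,
    names: tuple = SERVICE_VARS,
) -> list:
    """Return the names that are unset or blank in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.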
compose.yml + compose.dev.yml include the ingestion service under profile ingest.
make docker-ingest-up
make ingest-unified-logs

The service mounts `GDRIVE_SYNC_DIR` into `/data/drive-sync` with fail-fast bind-mount semantics.
If the host path is missing, docker compose now fails instead of silently creating an empty directory.

- `preflight` fails on Qdrant: confirm the collection exists or run `bootstrap`.
- `preflight` fails on sync dir: confirm `GDRIVE_SYNC_DIR` exists, is a directory, and contains supported files.
- `status` shows only errors: run `reprocess --errors`, then inspect Docling/BGE-M3 logs.
- No files processed: verify the `GDRIVE_SYNC_DIR` mount/path and allowed file extensions.
- Collection exists but has 0 points: verify the Google Drive sync host path is populated before debugging Qdrant.
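
The sync directory checks above (exists, is a directory, contains supported files) can be reproduced by hand with a short script. This is a rough sketch under stated assumptions: `check_sync_dir` is a hypothetical helper, and the extension set is a placeholder, since the pipeline's actual list of supported extensions is not given here.

```python
from pathlib import Path

# Placeholder extension set for illustration only; the pipeline's
# real list of supported extensions may differ.
SUPPORTED_EXTENSIONS = frozenset({".pdf", ".docx", ".md"})


def check_sync_dir(path, extensions=SUPPORTED_EXTENSIONS) -> list:
    """Return a list of problems found with the sync directory (empty if none)."""
    root = Path(path)
    if not root.exists():
        return [f"{root} does not exist"]
    if not root.is_dir():
        return [f"{root} is not a directory"]
    # Walk the tree and stop at the first file with a supported extension.
    has_supported = any(
        entry.is_file() and entry.suffix.lower() in extensions
        for entry in root.rglob("*")
    )
    if not has_supported:
        return [f"{root} contains no files with supported extensions"]
    return []
```

Running it against the host path that backs `GDRIVE_SYNC_DIR` separates an empty or missing sync directory from a genuine Qdrant problem.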