RAG integration – Data & Embedding pipeline architecture #152

@Kadajett

Design a pipeline that turns Semfora's existing outputs (toon, sqlite, jsonl) into embeddings for Retrieval‑Augmented Generation (RAG).

Goals:

  • Generate embeddings from the lightweight outputs directly on client machines, whose hardware capabilities are unknown in advance.
  • Handle massive codebases via chunking, on‑disk vector stores, and incremental updates.
  • Keep embeddings up‑to‑date when files change or re‑indexing occurs.
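
To make the chunking goal concrete, here is a minimal sketch of how JSONL output might be split into bounded, overlapping chunks before embedding. The record fields `path` and `text`, the function name `chunk_jsonl`, and the size limits are all assumptions for illustration; Semfora's actual JSONL schema may differ.

```python
import json
from typing import Dict, Iterable, Iterator


def chunk_jsonl(
    lines: Iterable[str], max_chars: int = 1000, overlap: int = 100
) -> Iterator[Dict[str, str]]:
    """Yield overlapping, size-bounded text chunks from JSONL records.

    Assumes (hypothetically) that each record has "path" and "text" fields.
    """
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    step = max_chars - overlap
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL stream
        record = json.loads(line)
        path, text = record["path"], record["text"]
        if not text:
            continue  # nothing to embed for this record
        start, index = 0, 0
        while True:
            yield {
                "id": f"{path}#{index}",  # stable ID: source path + chunk index
                "path": path,
                "text": text[start:start + max_chars],
            }
            if start + max_chars >= len(text):
                break  # this window already reached the end of the text
            start += step
            index += 1
```

Because the function is a generator over an iterable of lines, it can stream a large file without loading it into memory, which matters on low-powered clients. Character counts are only a stand-in for a token budget; a real implementation would likely measure chunks with the chosen embedding model's tokenizer.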

Deliverables:

  • Architecture diagram (Mermaid) linking Semfora indexing, chunking, embedding model, and vector DB.
  • Recommended embedding models (open‑source sentence‑transformers, OpenAI embeddings, etc.) and fallback strategies.
  • Strategy for incremental updates (hash‑based change detection, delta indexing).
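
As a starting point for the incremental-update strategy, hash-based change detection could look roughly like the sketch below. It compares SHA-256 digests of files against a previously stored manifest and reports what needs re-embedding or deletion. The names `file_digest` and `plan_delta` and the manifest shape (relative path to hex digest) are hypothetical, not part of Semfora.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            hasher.update(block)
    return hasher.hexdigest()


def plan_delta(
    root: Path, manifest: Dict[str, str]
) -> Tuple[List[str], List[str], Dict[str, str]]:
    """Compare files under root with a stored manifest (hypothetical format).

    Returns (changed, removed, new_manifest):
      changed      - new or modified paths whose chunks need re-embedding
      removed      - paths whose vectors should be deleted from the store
      new_manifest - path -> digest mapping to persist after the update
    """
    current = {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    changed = [path for path, digest in current.items() if manifest.get(path) != digest]
    removed = sorted(path for path in manifest if path not in current)
    return changed, removed, current
```

A caller would re-chunk and re-embed only the `changed` paths, delete vectors for the `removed` paths, and then persist `new_manifest` (for example as a JSON file or a SQLite table) for the next run. Hashing whole files is the simplest option; hashing individual chunks or symbols would allow finer-grained delta indexing at the cost of a larger manifest.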
