This project converts structured CSV data into a knowledge graph and builds a vector index over node descriptions for Retrieval-Augmented Generation (RAG).
- Auto-infers simple relations from columns ending with `_id` (excluding the primary id column)
- Builds a typed multi-directed graph (nodes/edges) using NetworkX
- Generates text representations of nodes and embeds them with `sentence-transformers`
- Provides a lightweight retriever (nearest neighbors) over node embeddings
- Optional FastAPI service for querying
- Create and activate a virtual environment (Windows PowerShell):

      python -m venv .venv
      . .venv/Scripts/Activate.ps1
      pip install -r requirements.txt

- Try the example dataset:

      python -m knowledge_graph.cli build-kg --csv .\examples\employees.csv --entity-type Employee --id-column id --output-dir .\artifacts
      python -m knowledge_graph.cli build-index --graph-json .\artifacts\graph.json --output-dir .\artifacts --model all-MiniLM-L6-v2
      python -m knowledge_graph.cli query --index .\artifacts --graph .\artifacts\graph.json --q "Who manages Alice?" --k 5

- Optional: Run the API

      uvicorn knowledge_graph.app:app --reload

  Then POST a query to http://127.0.0.1:8000/query with body:

      {"query": "Who manages Alice?", "k": 5}

- The builder creates one node per CSV row, with the type given by `--entity-type` and the key taken from `--id-column`.
- For any other column that ends with `_id`, an edge is created from the row's node to a target node with the same type (unless configured otherwise). If the target node doesn't appear as a row, it is created as a stub node so the edge is still valid.
- All other columns are kept as node attributes.
- A node text description is composed from its attributes and neighbors and embedded with `sentence-transformers` for retrieval.
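
The inference rules above can be sketched without any dependencies. This is a minimal illustration, not the project's implementation: the `build_graph` function, the `stub` flag, the relation name derived by stripping `_id`, and the inline CSV sample are all assumptions made for this example, and plain dicts and lists stand in for the NetworkX multigraph.

```python
import csv
import io


def build_graph(csv_text, entity_type, id_column):
    """Infer nodes and edges from one CSV export (illustrative only)."""
    nodes = {}  # node key -> attribute dict
    edges = []  # (source key, target key, relation name)

    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[id_column]
        attrs = {"type": entity_type, "stub": False}
        for column, value in row.items():
            if column == id_column:
                continue  # the primary id is the node key, not a relation
            if column.endswith("_id"):
                if value:  # an empty cell is treated as "no relation"
                    # Assumed naming: "manager_id" -> relation "manager"
                    edges.append((key, value, column[: -len("_id")]))
            else:
                attrs[column] = value  # every other column is an attribute
        nodes[key] = attrs

    # Targets that never appeared as a row become stub nodes,
    # so every edge still points at an existing node.
    for _, target, _ in edges:
        nodes.setdefault(target, {"type": entity_type, "stub": True})
    return nodes, edges


# Hypothetical sample data; employee 9 is referenced but has no row.
SAMPLE = "id,name,manager_id\n1,Alice,2\n2,Bob,9\n"
nodes, edges = build_graph(SAMPLE, entity_type="Employee", id_column="id")
print(edges)
print(nodes["9"])
```

In this sample, Bob's manager has no row of their own, so the sketch adds that node as a stub of the same entity type rather than dropping the edge.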
- `build-kg`: Build a graph from a CSV.
- `build-index`: Build embeddings/index from a graph JSON.
- `query`: Query the index and return the top-k nodes and a small subgraph context for RAG.

Run `python -m knowledge_graph.cli --help` for full options.
- Inputs: CSV with a header row. Required: one id-like column (e.g., `id`). Optional: relation columns `*_id`.
- Outputs:
  - `artifacts/graph.json`: nodes/edges JSON
  - `artifacts/node_texts.jsonl`: node textual representations
  - `artifacts/vectors.npy`: node embeddings
  - `artifacts/metadata.json`: node ids, mapping, and model info
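
To show how an embedding matrix and a list of node ids fit together at query time, here is a small nearest-neighbor lookup in plain Python. It is a sketch under assumptions: the `top_k` helper, the three-dimensional toy vectors, and the `Employee:1`-style node ids are invented for illustration, cosine similarity is assumed as the metric, and a real index would hold the embeddings stored in `vectors.npy` rather than hand-written lists.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, node_ids, vectors, k):
    """Return the k (node id, score) pairs most similar to the query."""
    scored = [(cosine(query_vec, vec), nid) for nid, vec in zip(node_ids, vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(nid, score) for score, nid in scored[:k]]


# Toy stand-ins for the node ids and their embeddings (hypothetical values).
node_ids = ["Employee:1", "Employee:2", "Employee:3"]
vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]

results = top_k([1.0, 0.1, 0.0], node_ids, vectors, k=2)
print([nid for nid, _ in results])
```

A brute-force scan like this is usually adequate for a single-table export; an approximate index only becomes worthwhile once the node count grows large.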
- You can pass `--text-columns` to explicitly indicate which columns form the node description; otherwise all non-id columns are used.
- For multi-table setups (multiple CSVs), run `build-kg` once per CSV and then merge the resulting graphs (merging is a future enhancement). For now, the auto-inference focuses on single-table exports with foreign-key-like columns.
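
The effect of `--text-columns` on the node description can be pictured with a short sketch. The `node_text` function, the `column: value` output format, and the sample row are hypothetical, and the sketch assumes that "non-id columns" excludes both the primary id and the `*_id` relation columns; the project's actual text template may differ.

```python
def node_text(row, id_column, text_columns=None):
    """Compose a node description from selected columns (illustrative only)."""
    if text_columns is None:
        # Assumed default: skip the primary id and any *_id relation column.
        text_columns = [
            col for col in row if col != id_column and not col.endswith("_id")
        ]
    # Skip columns that are missing or empty in this row.
    parts = [f"{col}: {row[col]}" for col in text_columns if row.get(col)]
    return "; ".join(parts)


# Hypothetical row from an employees export.
row = {"id": "1", "name": "Alice", "title": "Engineer", "manager_id": "2"}

default_text = node_text(row, id_column="id")
explicit_text = node_text(row, id_column="id", text_columns=["name"])
print(default_text)
print(explicit_text)
```

Restricting the columns keeps noisy fields (timestamps, internal codes) out of the embedded text, which tends to matter more for retrieval quality than the choice of separator.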