Knowledge Graph from CSV for RAG

This project converts structured CSV data into a knowledge graph and builds a vector index over node descriptions for Retrieval-Augmented Generation (RAG).

Features

Auto-infers simple relations from columns ending with _id (excluding the primary id column)
Builds a typed, directed multigraph (nodes and edges) with NetworkX
Generates text representations of nodes and embeds them with sentence-transformers
Provides a lightweight retriever (nearest neighbors) over node embeddings
Optional FastAPI service for querying

Quickstart

Create and activate a virtual environment (Windows PowerShell):

python -m venv .venv
. .venv/Scripts/Activate.ps1
pip install -r requirements.txt

Try the example dataset:

python -m knowledge_graph.cli build-kg --csv .\examples\employees.csv --entity-type Employee --id-column id --output-dir .\artifacts
python -m knowledge_graph.cli build-index --graph-json .\artifacts\graph.json --output-dir .\artifacts --model all-MiniLM-L6-v2
python -m knowledge_graph.cli query --index .\artifacts --graph .\artifacts\graph.json --q "Who manages Alice?" --k 5

Optional: Run the API

uvicorn knowledge_graph.app:app --reload

Then POST a query to http://127.0.0.1:8000/query with a JSON body:

{"query": "Who manages Alice?", "k": 5}
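The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names build_query_request and send_query are hypothetical, and because the response schema is not described in this README, the reply is returned as parsed JSON without further interpretation.

```python
import json
import urllib.request


def build_query_request(query, k=5, base_url="http://127.0.0.1:8000"):
    """Build (but do not send) a POST request for the /query endpoint."""
    payload = json.dumps({"query": query, "k": k}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_query(query, k=5):
    """Send the request; requires the uvicorn server above to be running."""
    with urllib.request.urlopen(build_query_request(query, k)) as response:
        # Response shape is an assumption-free passthrough: just parse the JSON.
        return json.loads(response.read().decode("utf-8"))


# Usage (with the server running):
#     result = send_query("Who manages Alice?", k=5)
```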

How it works

The builder creates one node per CSV row with type --entity-type and key from --id-column.
For any other column that ends with _id, an edge is created from the row's node to a target node with the same type (unless configured otherwise). If the target node doesn't appear as a row, it is created as a stub node so the edge is still valid.
All other columns are kept as node attributes.
A node text description is composed from its attributes and neighbors and embedded with sentence-transformers for retrieval.
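The row-to-node and _id-to-edge rules above can be sketched as follows. This is an illustration, not the project's code: it uses plain dictionaries instead of NetworkX to stay dependency-free, the function name build_graph and the "stub" flag are hypothetical, the sample rows are made up (they are not the contents of examples/employees.csv), and it assumes that relation columns are not also stored as node attributes.

```python
import csv
import io


def build_graph(csv_text, entity_type, id_column):
    """Return {"nodes": {key: attrs}, "edges": [edge, ...]} for one CSV table."""
    nodes = {}
    edges = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[id_column]
        # Relation columns: every *_id column except the primary id column.
        relation_columns = [c for c in row if c.endswith("_id") and c != id_column]
        # Remaining columns become node attributes (assumption: relations excluded).
        attrs = {c: v for c, v in row.items()
                 if c != id_column and c not in relation_columns}
        # A stub created earlier for this key is replaced by the full node.
        nodes[key] = {"type": entity_type, "stub": False, **attrs}
        for column in relation_columns:
            target = row[column]
            if not target:  # empty cell: no relation for this row
                continue
            # Create a stub target so the edge always has a valid endpoint.
            nodes.setdefault(target, {"type": entity_type, "stub": True})
            edges.append({"source": key, "target": target,
                          "relation": column[: -len("_id")]})
    return {"nodes": nodes, "edges": edges}


# Made-up sample data for illustration only.
SAMPLE = """id,name,title,manager_id
1,Alice,Engineer,2
2,Bob,Manager,3
"""

graph = build_graph(SAMPLE, entity_type="Employee", id_column="id")
# Node "3" never appears as a row, so it exists only as a stub.
```

In this sketch the relation name is derived by stripping the _id suffix (manager_id becomes manager); the actual naming used by build-kg may differ.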

CLI

build-kg: Build a graph from a CSV.
build-index: Build embeddings/index from a graph JSON.
query: Query the index and return the top-k nodes and a small subgraph context for RAG.

Run python -m knowledge_graph.cli --help for full options.
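To make the query step more concrete, here is a rough sketch of nearest-neighbor retrieval over node embeddings followed by collecting a small subgraph around the hits. It assumes cosine similarity and a one-hop context, neither of which is confirmed by this README; the names top_k and one_hop_context are hypothetical, and the toy 2-D vectors stand in for real sentence-transformers embeddings.

```python
import numpy as np


def top_k(query_vector, node_vectors, node_ids, k=5):
    """Return the k (node_id, cosine_similarity) pairs closest to the query."""
    # Normalize so that the dot product equals cosine similarity.
    q = query_vector / np.linalg.norm(query_vector)
    m = node_vectors / np.linalg.norm(node_vectors, axis=1, keepdims=True)
    scores = m @ q
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(node_ids[i], float(scores[i])) for i in order]


def one_hop_context(edges, hit_ids):
    """Keep the edges that touch at least one retrieved node."""
    return [e for e in edges if e["source"] in hit_ids or e["target"] in hit_ids]


# Toy data: three nodes with 2-D "embeddings".
node_ids = ["a", "b", "c"]
node_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
edges = [
    {"source": "a", "target": "b", "relation": "manager"},
    {"source": "b", "target": "c", "relation": "manager"},
]

hits = top_k(np.array([1.0, 0.1]), node_vectors, node_ids, k=2)
hit_ids = [node_id for node_id, _ in hits]  # ["a", "c"]
context = one_hop_context(edges, set(hit_ids))
```

A brute-force dot product like this is usually adequate for small graphs; larger indexes would typically call for an approximate nearest-neighbor library.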

Inputs/Outputs

Inputs: CSV with a header row. Required: one id-like column (e.g., id). Optional: relation columns *_id.
Outputs:
- artifacts/graph.json: nodes/edges JSON
- artifacts/node_texts.jsonl: node textual representations
- artifacts/vectors.npy: node embeddings
- artifacts/metadata.json: node ids, mapping, and model info

Notes

You can pass --text-columns to explicitly indicate which columns form the node description; otherwise all non-id columns are used.
For multi-table setups (multiple CSVs), run build-kg once per CSV; merging the resulting graphs is a future enhancement. For now, auto-inference focuses on single-table exports with foreign-key-like columns.
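As an illustration of how --text-columns might shape a node description, the sketch below composes text from selected attributes plus neighbor relations. The function name describe_node and the output format are hypothetical, and "non-id columns" is interpreted here as every attribute that is neither id nor a *_id column; the real builder may format descriptions differently.

```python
def describe_node(entity_type, key, attrs, neighbors, text_columns=None):
    """Compose a one-line description from attributes and (relation, target) pairs."""
    if text_columns is None:
        # Default (assumed): use every attribute that is not an id-like column.
        text_columns = [c for c in attrs if c != "id" and not c.endswith("_id")]
    # Skip columns that are missing or empty for this node.
    parts = [f"{c}: {attrs[c]}" for c in text_columns if attrs.get(c)]
    parts += [f"{relation}: {target}" for relation, target in neighbors]
    header = f"{entity_type} {key}"
    return f"{header}. " + "; ".join(parts) if parts else header


attrs = {"name": "Alice", "title": "Engineer"}
neighbors = [("manager", "2")]

full_text = describe_node("Employee", "1", attrs, neighbors)
# "Employee 1. name: Alice; title: Engineer; manager: 2"

name_only = describe_node("Employee", "1", attrs, neighbors, text_columns=["name"])
# "Employee 1. name: Alice; manager: 2"
```

Restricting the text columns keeps noisy or high-cardinality fields out of the embedded text, which can matter for retrieval quality.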
