Knowledge Graph from CSV for RAG

This project converts structured CSV data into a knowledge graph and builds a vector index over node descriptions for Retrieval-Augmented Generation (RAG).

Features

Auto-infers simple relations from columns ending with _id (excluding the primary id column)
Builds a typed, directed multigraph (nodes/edges) using NetworkX
Generates text representations of nodes and embeds them with sentence-transformers
Provides a lightweight retriever (nearest neighbors) over node embeddings
Optional FastAPI service for querying

Quickstart

Create and activate a virtual environment (Windows PowerShell):

python -m venv .venv
. .venv/Scripts/Activate.ps1
pip install -r requirements.txt

Try the example dataset:

python -m knowledge_graph.cli build-kg --csv .\examples\employees.csv --entity-type Employee --id-column id --output-dir .\artifacts
python -m knowledge_graph.cli build-index --graph-json .\artifacts\graph.json --output-dir .\artifacts --model all-MiniLM-L6-v2
python -m knowledge_graph.cli query --index .\artifacts --graph .\artifacts\graph.json --q "Who manages Alice?" --k 5

Optional: Run the API + UI

uvicorn knowledge_graph.app:app --host 127.0.0.1 --port 8000

Open the UI at http://127.0.0.1:8000/.

UI Walkthrough

Upload CSV and Build Graph
- Choose a CSV file
- Entity type: type name for nodes (e.g., Device)
- Primary ID column (optional): if you have a unique id column (e.g., node_id), set it here
- Composite ID columns (optional): comma-separated columns to combine into a unique id (e.g., oem,model)
- Options:
  - Autogenerate IDs: if there is no id, assign row numbers
  - Make duplicate IDs unique: append suffixes to repeated ids
- Text columns (optional): columns to embed for retrieval (e.g., model,oem,product_line); leave blank to use every column except the id column and the *_id relation columns
Query: type a text query and get top-k matching nodes
Chat (Ollama): ask a question; by default the UI shows a concise answer and citations; enable Verbose to see full retrieval context
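
The ID options in the upload form can be pictured with a small sketch. This is not the project's code: the function name resolve_ids, the "|" separator for composite ids, the 1-based row numbers, and the "_2", "_3" suffix format are all assumptions made for illustration.

```python
from collections import Counter


def resolve_ids(rows, id_column=None, composite_columns=None,
                autogenerate=False, uniqueify=False, sep="|"):
    """Return one id string per row (hypothetical helper, not the project's API)."""
    ids = []
    for row_number, row in enumerate(rows, start=1):
        if id_column and row.get(id_column):
            # Primary ID column: use the value as-is.
            ids.append(str(row[id_column]))
        elif composite_columns:
            # Composite ID: join the chosen columns (separator is an assumption).
            ids.append(sep.join(str(row.get(col, "")) for col in composite_columns))
        elif autogenerate:
            # Autogenerate IDs: fall back to the row number (1-based here).
            ids.append(str(row_number))
        else:
            raise ValueError(f"row {row_number} has no usable id")
    if uniqueify:
        seen = Counter()
        unique = []
        for node_id in ids:
            seen[node_id] += 1
            # The first occurrence keeps its id; repeats get a numeric suffix.
            unique.append(node_id if seen[node_id] == 1 else f"{node_id}_{seen[node_id]}")
        ids = unique
    return ids


rows = [
    {"oem": "Acme", "model": "X1"},
    {"oem": "Acme", "model": "X1"},
    {"oem": "Acme", "model": "X2"},
]
print(resolve_ids(rows, composite_columns=["oem", "model"], uniqueify=True))
# prints ['Acme|X1', 'Acme|X1_2', 'Acme|X2']
```

A suffix scheme like this can still collide with an id that already ends in "_2", so a real implementation would need to check the suffixed value as well.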

You can also POST a query directly to http://127.0.0.1:8000/query with the JSON body:

{"query": "Who manages Alice?", "k": 5}
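
For scripted use, the same request can be sent from Python with only the standard library. The sketch below assumes the server from the Quickstart is listening on 127.0.0.1:8000; the helper names are hypothetical, and the response is returned as parsed JSON without assuming its fields, since the response schema is not described here.

```python
import json
import urllib.request


def build_query_request(query, k=5, base_url="http://127.0.0.1:8000"):
    """Build a POST /query request with the JSON body shown above."""
    body = json.dumps({"query": query, "k": k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query, k=5):
    """Send the request; this only works while the uvicorn server is running."""
    with urllib.request.urlopen(build_query_request(query, k), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_query_request("Who manages Alice?", k=5)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling run_query("Who manages Alice?") performs the actual round trip once the API is up.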

RAG Chat with Ollama (Qwen3:4B)

This project can chat with a local LLM served by Ollama. Ensure Ollama is running and you have the model:

ollama pull qwen3:4b

The API exposes additional endpoints:

POST /upload-csv (multipart form): Upload a CSV and build a new graph and index
- form fields: entity_type, optional id_column, optional composite_id_columns, autogenerate_ids (bool), uniqueify_duplicates (bool), optional text_columns
POST /chat (json): Ask a question with retrieval-augmented context using an Ollama model
- default response: { "answer": "...", "citations": ["Type:Id", ...] }
- verbose mode: include "verbose": true to get { "query", "answer", "retrieval" }
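
A client consuming the default /chat response might split each "Type:Id" citation back into its parts. The sketch below is illustrative only: the helper names are hypothetical and the sample response is made up to match the documented shape, not captured from a real run.

```python
def parse_citation(citation):
    """Split a 'Type:Id' citation into (node_type, node_id)."""
    # Split on the first colon only, in case the id itself contains colons.
    node_type, sep, node_id = citation.partition(":")
    if not sep:
        raise ValueError(f"not a Type:Id citation: {citation!r}")
    return node_type, node_id


def summarize_chat_response(response):
    """Return (answer, [(node_type, node_id), ...]) from a /chat response dict."""
    # Verbose responses carry no "citations" key, so default to an empty list.
    citations = [parse_citation(c) for c in response.get("citations", [])]
    return response["answer"], citations


# Hypothetical response in the documented default shape.
sample = {"answer": "Bob manages Alice.", "citations": ["Employee:1", "Employee:2"]}
answer, cited_nodes = summarize_chat_response(sample)
print(answer)
print(cited_nodes)
```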

CLI also includes a chat command:

python -m knowledge_graph.cli chat --index .\artifacts --graph .\artifacts\graph.json --q "Who manages Alice?" --k 5 --model qwen3:4b

Environment variables:

OLLAMA_HOST to override the Ollama base URL (default http://127.0.0.1:11434)
OLLAMA_TIMEOUT_SECS to extend chat timeout (default 300)
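
As a rough illustration of how these two variables and their defaults could be resolved, here is a minimal sketch. The function name ollama_settings is hypothetical, and the project's own parsing (for example, whether the timeout is read as an integer or a float) may differ.

```python
import os


def ollama_settings(environ=None):
    """Return (base_url, timeout_seconds) using the documented defaults."""
    env = os.environ if environ is None else environ
    # Strip a trailing slash so endpoint paths can be appended safely.
    host = env.get("OLLAMA_HOST", "http://127.0.0.1:11434").rstrip("/")
    timeout = float(env.get("OLLAMA_TIMEOUT_SECS", "300"))
    return host, timeout


print(ollama_settings({}))
print(ollama_settings({"OLLAMA_HOST": "http://gpu-box:11434/", "OLLAMA_TIMEOUT_SECS": "600"}))
```

The host gpu-box in the second call is a placeholder, not a real machine name from this project.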

How it works

The builder creates one node per CSV row with type --entity-type and key from --id-column.
For any other column that ends with _id, an edge is created from the row's node to a target node with the same type (unless configured otherwise). If the target node doesn't appear as a row, it is created as a stub node so the edge is still valid.
All other columns are kept as node attributes.
A node text description is composed from its attributes and neighbors and embedded with sentence-transformers for retrieval.
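
The steps above can be sketched in a few lines. To stay dependency-free, this sketch uses plain dicts and lists rather than NetworkX, and it is not the project's builder: the function names, the "Type:id" node keys, the relation name derived by dropping the _id suffix, the text format, and the sample CSV are all assumptions.

```python
import csv
import io

# Hypothetical sample data; the bundled examples/employees.csv may differ.
SAMPLE_CSV = (
    "id,name,title,manager_id\n"
    "1,Alice,Engineer,2\n"
    "2,Bob,Director,\n"
    "3,Carol,Analyst,9\n"
)


def build_graph(csv_text, entity_type, id_column):
    """Return (nodes, edges) built from CSV text, one node per row."""
    nodes = {}  # node key -> {"type", "attrs", "stub"}
    edges = []  # (source key, target key, relation name)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = f"{entity_type}:{row[id_column]}"
        # Columns that are neither the id nor a *_id relation become attributes.
        attrs = {c: v for c, v in row.items()
                 if c != id_column and not c.endswith("_id")}
        # A real row replaces any stub created earlier for the same key.
        nodes[key] = {"type": entity_type, "attrs": attrs, "stub": False}
        for column, value in row.items():
            if column == id_column or not column.endswith("_id") or not value:
                continue
            target = f"{entity_type}:{value}"
            # Create a stub if the target has not appeared as a row (yet).
            nodes.setdefault(target, {"type": entity_type, "attrs": {}, "stub": True})
            edges.append((key, target, column[:-3]))
    return nodes, edges


def node_text(key, nodes, edges):
    """Compose a description from a node's attributes and outgoing neighbors."""
    node = nodes[key]
    parts = [f"{node['type']} {key.split(':', 1)[1]}"]
    parts += [f"{col}: {val}" for col, val in node["attrs"].items()]
    parts += [f"{rel} -> {dst}" for src, dst, rel in edges if src == key]
    return "; ".join(parts)


nodes, edges = build_graph(SAMPLE_CSV, "Employee", "id")
print(node_text("Employee:1", nodes, edges))
# prints Employee 1; name: Alice; title: Engineer; manager -> Employee:2
print(nodes["Employee:9"])
# prints {'type': 'Employee', 'attrs': {}, 'stub': True}
```

Employee:9 never appears as a row, so it exists only as a stub that keeps Carol's manager edge valid. The resulting text strings are what would then be passed to the embedding model.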

CLI

build-kg: Build a graph from a CSV.
build-index: Build embeddings/index from a graph JSON.
query: Query the index and return the top-k nodes and a small subgraph context for RAG.
chat: Ask a question with retrieval-augmented context using an Ollama model.

Run python -m knowledge_graph.cli --help for full options.
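
To give a feel for what the query command's nearest-neighbor lookup involves, here is a minimal top-k search over node embeddings in pure Python. It assumes cosine similarity as the metric, which this README does not confirm, and it uses toy 3-dimensional vectors in place of real sentence-transformers output.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, node_vecs, k=5):
    """Return the k (node_id, score) pairs most similar to the query vector."""
    scored = [(cosine(query_vec, vec), node_id) for node_id, vec in node_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(node_id, score) for score, node_id in scored[:k]]


# Toy vectors for illustration; real embeddings have hundreds of dimensions.
node_vecs = {
    "Employee:1": [1.0, 0.0, 0.0],
    "Employee:2": [0.8, 0.6, 0.0],
    "Employee:3": [0.0, 0.0, 1.0],
}
result = top_k([1.0, 0.0, 0.0], node_vecs, k=2)
print([node_id for node_id, _ in result])
# prints ['Employee:1', 'Employee:2']
```

With many nodes, the same ranking is usually computed as a single matrix product over the stored embedding array instead of a Python loop.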

Inputs/Outputs

Inputs: CSV with a header row. Required: one id-like column (e.g., id). Optional: relation columns *_id.
Outputs:
- artifacts/graph.json: nodes/edges JSON
- artifacts/node_texts.jsonl: node textual representations
- artifacts/vectors.npy: node embeddings
- artifacts/metadata.json: node ids, mapping, and model info

Notes

You can pass --text-columns to explicitly indicate which columns form the node description; otherwise every column except the id column and the *_id relation columns is used.
For multi-table setups (multiple CSVs), run build-kg once per CSV; merging the resulting graphs is a future enhancement. For now, auto-inference focuses on single-table exports with foreign-key-like columns.
