This project converts structured CSV data into a knowledge graph and builds a vector index over node descriptions for Retrieval-Augmented Generation (RAG).
- Auto-infers simple relations from columns ending with `_id` (excluding the primary id column)
- Builds a typed multi-directed graph (nodes/edges) using NetworkX
- Generates text representations of nodes and embeds them with `sentence-transformers`
- Provides a lightweight retriever (nearest neighbors) over node embeddings
- Optional FastAPI service for querying
- Create and activate a virtual environment (Windows PowerShell):

      python -m venv .venv
      . .venv/Scripts/Activate.ps1
      pip install -r requirements.txt

- Try the example dataset:

      python -m knowledge_graph.cli build-kg --csv .\examples\employees.csv --entity-type Employee --id-column id --output-dir .\artifacts
      python -m knowledge_graph.cli build-index --graph-json .\artifacts\graph.json --output-dir .\artifacts --model all-MiniLM-L6-v2
      python -m knowledge_graph.cli query --index .\artifacts --graph .\artifacts\graph.json --q "Who manages Alice?" --k 5

- Optional: Run the API + UI

      uvicorn knowledge_graph.app:app --host 127.0.0.1 --port 8000

Open the UI at http://127.0.0.1:8000/.
- Upload CSV and Build Graph
  - Choose a CSV file
  - Entity type: type name for nodes (e.g., Device)
  - Primary ID column (optional): if you have a unique id column (e.g., `node_id`), set it here
  - Composite ID columns (optional): comma-separated columns to combine into a unique id (e.g., `oem,model`)
  - Options:
    - Autogenerate IDs: if there is no id, assign row numbers
    - Make duplicate IDs unique: append suffixes to repeated ids
  - Text columns (optional): columns to embed for retrieval (e.g., `model,oem,product_line`); leave blank to use non-id, non-`*_id` columns
- Query: type a text query and get top-k matching nodes
- Chat (Ollama): ask a question; by default the UI shows a concise answer and citations; enable Verbose to see full retrieval context
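The ID options in the upload form can be pictured with a small, dependency-free sketch. It is not the project's implementation: the function name `resolve_ids`, the `|` separator for composite ids, and the `_2`, `_3` suffix format are all assumptions made for illustration.

```python
from collections import Counter


def resolve_ids(rows, id_column=None, composite_id_columns=None,
                autogenerate_ids=False, uniqueify_duplicates=False):
    """Return one id string per row (hypothetical helper, not the project's API)."""
    ids = []
    for row_number, row in enumerate(rows):
        if id_column:
            ids.append(str(row[id_column]))
        elif composite_id_columns:
            # Join the chosen columns; the "|" separator is an assumption.
            ids.append("|".join(str(row[c]) for c in composite_id_columns))
        elif autogenerate_ids:
            # No id available: fall back to the row number.
            ids.append(str(row_number))
        else:
            raise ValueError("no id source configured")

    if uniqueify_duplicates:
        seen = Counter()
        unique = []
        for value in ids:
            seen[value] += 1
            # The first occurrence keeps its id; repeats get a numeric suffix.
            unique.append(value if seen[value] == 1 else f"{value}_{seen[value]}")
        ids = unique
    return ids


rows = [
    {"oem": "Acme", "model": "X1"},
    {"oem": "Acme", "model": "X1"},
    {"oem": "Bolt", "model": "Z9"},
]
print(resolve_ids(rows, composite_id_columns=["oem", "model"], uniqueify_duplicates=True))
# ['Acme|X1', 'Acme|X1_2', 'Bolt|Z9']
```

A real implementation would also need to guard against a suffixed id colliding with an id that already exists in the data; the sketch leaves that out.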
You can still POST a query to http://127.0.0.1:8000/query with body:

    {"query": "Who manages Alice?", "k": 5}

This project can chat with a local LLM served by Ollama. Ensure Ollama is running and you have the model:

    ollama pull qwen3:4b

The API exposes additional endpoints:
- `POST /upload-csv` (multipart form): Upload a CSV and build a new graph and index
  - form fields: `entity_type`, optional `id_column`, optional `composite_id_columns`, `autogenerate_ids` (bool), `uniqueify_duplicates` (bool), optional `text_columns`
- `POST /chat` (json): Ask a question with retrieval-augmented context using an Ollama model
  - default response: `{ "answer": "...", "citations": ["Type:Id", ...] }`
  - verbose mode: include `"verbose": true` to get `{ "query", "answer", "retrieval" }`
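For scripting against the API, a standard-library client could look roughly like the sketch below. The helper names are hypothetical. The `/query` body follows the example above; the `/chat` body is an assumption that reuses the same `query` and `k` fields plus the documented `verbose` flag, so check it against the service before relying on it.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # host and port from the uvicorn command


def build_json_request(path, payload):
    """Build (but do not send) a JSON POST request for the local API."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path, payload, timeout=300):
    """Send the request and decode the JSON response body."""
    request = build_json_request(path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Documented body for /query.
query_request = build_json_request("/query", {"query": "Who manages Alice?", "k": 5})

# Assumed body for /chat: same fields as /query, plus the verbose flag.
chat_request = build_json_request(
    "/chat", {"query": "Who manages Alice?", "k": 5, "verbose": True}
)

# With the server running, send either one, for example:
#   print(post_json("/query", {"query": "Who manages Alice?", "k": 5}))
```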
CLI also includes a chat command:

    python -m knowledge_graph.cli chat --index .\artifacts --graph .\artifacts\graph.json --q "Who manages Alice?" --k 5 --model qwen3:4b

Environment variables:
- `OLLAMA_HOST` to override the Ollama base URL (default `http://127.0.0.1:11434`)
- `OLLAMA_TIMEOUT_SECS` to extend chat timeout (default 300)
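As an illustration of how these two variables and their defaults fit together, here is a minimal sketch. The function name `ollama_settings` is hypothetical, and details such as stripping a trailing slash or parsing the timeout as a float are assumptions rather than the project's documented behavior.

```python
import os

DEFAULT_OLLAMA_HOST = "http://127.0.0.1:11434"
DEFAULT_TIMEOUT_SECS = 300.0


def ollama_settings(environ=None):
    """Return (base_url, timeout_seconds), falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    # Stripping a trailing slash is an assumption, to keep URL joins predictable.
    host = env.get("OLLAMA_HOST", DEFAULT_OLLAMA_HOST).rstrip("/")
    timeout = float(env.get("OLLAMA_TIMEOUT_SECS", DEFAULT_TIMEOUT_SECS))
    return host, timeout


print(ollama_settings({}))
print(ollama_settings({"OLLAMA_HOST": "http://192.168.1.20:11434/", "OLLAMA_TIMEOUT_SECS": "600"}))
```

Passing a plain dict instead of reading `os.environ` directly keeps the helper easy to exercise without touching the real environment.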
- The builder creates one node per CSV row with type `--entity-type` and key from `--id-column`.
- For any other column that ends with `_id`, an edge is created from the row's node to a target node with the same type (unless configured otherwise). If the target node doesn't appear as a row, it is created as a stub node so the edge is still valid.
- All other columns are kept as node attributes.
- A node text description is composed from its attributes and neighbors and embedded with `sentence-transformers` for retrieval.
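The steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's builder: it uses dicts and tuples instead of a NetworkX graph so it runs without dependencies, and the `Type:Id` key format, the relation name derived by dropping `_id`, the skipping of empty values, and the text layout are all assumptions.

```python
import csv
import io

CSV_TEXT = """id,name,title,manager_id
1,Alice,Engineer,2
2,Bob,Manager,9
"""


def build_graph(csv_text, entity_type, id_column):
    """Build nodes and edges from one CSV, inferring edges from *_id columns."""
    nodes, edges = {}, []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = f"{entity_type}:{row[id_column]}"
        # Columns that are neither the id nor a *_id relation become attributes.
        attrs = {c: v for c, v in row.items()
                 if c != id_column and not c.endswith("_id")}
        # Reuse a stub created by an earlier row and fill in its attributes.
        node = nodes.setdefault(key, {})
        node.update(attrs)
        node["stub"] = False
        for column, value in row.items():
            if column == id_column or not column.endswith("_id") or not value:
                continue  # skipping empty values is an assumption
            target = f"{entity_type}:{value}"
            # Create a stub so the edge always points at an existing node.
            nodes.setdefault(target, {"stub": True})
            edges.append((key, target, column[: -len("_id")]))
    return nodes, edges


def node_text(key, nodes, edges):
    """Compose a text description from a node's attributes and outgoing edges."""
    attrs = ", ".join(f"{k}={v}" for k, v in nodes[key].items() if k != "stub")
    links = ", ".join(f"{rel} -> {dst}" for src, dst, rel in edges if src == key)
    return f"{key} | {attrs} | {links}"


nodes, edges = build_graph(CSV_TEXT, entity_type="Employee", id_column="id")
print(node_text("Employee:1", nodes, edges))
# Employee:1 | name=Alice, title=Engineer | manager -> Employee:2
```

Here `Employee:9` never appears as a row, so it exists only as a stub node, while `Employee:2` starts as a stub and is completed when its own row is read.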
- `build-kg`: Build a graph from a CSV.
- `build-index`: Build embeddings/index from a graph JSON.
- `query`: Query the index and return the top-k nodes and a small subgraph context for RAG.

Run `python -m knowledge_graph.cli --help` for full options.
- Inputs: CSV with a header row. Required: one id-like column (e.g., `id`). Optional: relation columns `*_id`.
- Outputs:
  - `artifacts/graph.json`: nodes/edges JSON
  - `artifacts/node_texts.jsonl`: node textual representations
  - `artifacts/vectors.npy`: node embeddings
  - `artifacts/metadata.json`: node ids, mapping, and model info
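To make the role of the embeddings and node ids concrete, the sketch below ranks nodes by similarity to a query vector. It is a toy: the 2-dimensional vectors stand in for rows of `vectors.npy`, the id list stands in for the ids in `metadata.json`, and cosine similarity is an assumption, since the project only states that it uses nearest neighbors.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, node_ids, vectors, k=5):
    """Return the k (node_id, score) pairs most similar to the query vector."""
    scored = [(cosine(query_vec, vec), node_id)
              for node_id, vec in zip(node_ids, vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(node_id, score) for score, node_id in scored[:k]]


# Toy stand-ins for the saved ids and embeddings.
node_ids = ["Employee:1", "Employee:2", "Employee:3"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

results = top_k([1.0, 0.0], node_ids, vectors, k=2)
print([node_id for node_id, _ in results])
# ['Employee:1', 'Employee:3']
```

In practice the query vector would come from embedding the question with the same model that produced the node embeddings.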
- You can pass `--text-columns` to explicitly indicate which columns form the node description; otherwise all non-id columns are used.
- For multi-table setups (multiple CSVs), run `build-kg` per CSV and then merge graphs (future enhancement). For now, the autoinference focuses on single-table exports with foreign-key-like columns.
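Since merging is listed as a future enhancement, the following is only one possible shape for it. The schema is an assumption: it treats each graph as a dict with a `nodes` list (each node carrying an `id`) and an `edges` list (each edge carrying `source`, `target`, and `relation`), which may not match the actual layout of `graph.json`. The function name `merge_graphs` and the sample data are hypothetical.

```python
def merge_graphs(*graphs):
    """Union nodes and edges; later graphs update attributes of shared node ids."""
    nodes, edges, seen_edges = {}, [], set()
    for graph in graphs:
        for node in graph["nodes"]:
            # A stub from one CSV is completed by the full row from another.
            nodes.setdefault(node["id"], {}).update(node)
        for edge in graph["edges"]:
            edge_key = (edge["source"], edge["target"], edge.get("relation"))
            # Dropping exact duplicates is a choice; a multigraph could keep them.
            if edge_key not in seen_edges:
                seen_edges.add(edge_key)
                edges.append(edge)
    return {"nodes": list(nodes.values()), "edges": edges}


employees = {
    "nodes": [
        {"id": "Employee:1", "name": "Alice", "stub": False},
        {"id": "Department:7", "stub": True},
    ],
    "edges": [
        {"source": "Employee:1", "target": "Department:7", "relation": "department"},
    ],
}
departments = {
    "nodes": [{"id": "Department:7", "name": "R&D", "stub": False}],
    "edges": [],
}

merged = merge_graphs(employees, departments)
print(len(merged["nodes"]), len(merged["edges"]))
# 2 1
```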