
VickM29-bit/knowledge-graph

 
 


Knowledge Graph from CSV for RAG

This project converts structured CSV data into a knowledge graph and builds a vector index over node descriptions for Retrieval-Augmented Generation (RAG).

Features

  • Auto-infers simple relations from columns ending with _id (excluding the primary id column)
  • Builds a typed, directed multigraph (nodes/edges) using NetworkX
  • Generates text representations of nodes and embeds them with sentence-transformers
  • Provides a lightweight retriever (nearest neighbors) over node embeddings
  • Optional FastAPI service for querying
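The nearest-neighbor retrieval mentioned above can be illustrated with a minimal, standard-library-only sketch. It is not the project's implementation: the function names `cosine` and `top_k`, the toy 3-dimensional vectors, and the choice of cosine similarity are all assumptions made for illustration. Real embeddings would come from sentence-transformers.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    node_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k node ids whose vectors are most similar to query_vec."""
    scored = [(node_id, cosine(query_vec, vec)) for node_id, vec in node_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values, not real model output).
toy_vectors = {
    "Employee:1": [1.0, 0.0, 0.0],
    "Employee:2": [0.9, 0.1, 0.0],
    "Employee:3": [0.0, 1.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
```

A brute-force scan like this is usually adequate for graphs with a few thousand nodes; larger graphs would normally use a vectorized or approximate index instead.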

Quickstart

  1. Create and activate a virtual environment, then install the dependencies (Windows PowerShell):
python -m venv .venv
. .venv/Scripts/Activate.ps1
pip install -r requirements.txt
  2. Try the example dataset:
python -m knowledge_graph.cli build-kg --csv .\examples\employees.csv --entity-type Employee --id-column id --output-dir .\artifacts
python -m knowledge_graph.cli build-index --graph-json .\artifacts\graph.json --output-dir .\artifacts --model all-MiniLM-L6-v2
python -m knowledge_graph.cli query --index .\artifacts --graph .\artifacts\graph.json --q "Who manages Alice?" --k 5
  3. Optional: Run the API + UI
uvicorn knowledge_graph.app:app --host 127.0.0.1 --port 8000

Open the UI at http://127.0.0.1:8000/.

UI Walkthrough

  • Upload CSV and Build Graph

    • Choose a CSV file
    • Entity type: type name for nodes (e.g., Device)
    • Primary ID column (optional): if you have a unique id column (e.g., node_id), set it here
    • Composite ID columns (optional): comma-separated columns to combine into a unique id (e.g., oem,model)
    • Options:
      • Autogenerate IDs: if the CSV has no id column, use row numbers as ids
      • Make duplicate IDs unique: append suffixes to repeated ids
    • Text columns (optional): columns to embed for retrieval (e.g., model,oem,product_line); leave blank to use every column that is neither the id column nor a *_id column
  • Query: type a text query and get top-k matching nodes

  • Chat (Ollama): ask a question; by default the UI shows a concise answer and citations; enable Verbose to see full retrieval context
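The ID options in the upload form (primary id column, composite id columns, autogenerated ids, and making duplicates unique) could interact roughly as in the sketch below. This is an illustration under stated assumptions, not the project's code: the function name `assign_ids`, the `|` separator for composite ids, the `_2`, `_3` suffix format, and the precedence order are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def assign_ids(
    rows: List[Dict[str, str]],
    id_column: Optional[str] = None,
    composite_columns: Sequence[str] = (),
    autogenerate: bool = False,
    uniqueify: bool = False,
) -> List[str]:
    """Pick one id per row; precedence and formats are assumptions."""
    ids: List[str] = []
    seen: Dict[str, int] = {}
    for row_number, row in enumerate(rows, start=1):
        if id_column:
            raw = row[id_column]
        elif composite_columns:
            # Assumed separator; the real project may join differently.
            raw = "|".join(row[c] for c in composite_columns)
        elif autogenerate:
            raw = str(row_number)
        else:
            raise ValueError("no id column, composite columns, or autogeneration set")
        count = seen.get(raw, 0)
        seen[raw] = count + 1
        if count and uniqueify:
            # Second occurrence becomes "<id>_2", third "<id>_3", and so on.
            # This sketch does not guard against a suffixed id colliding
            # with an id that already exists in the data.
            raw = f"{raw}_{count + 1}"
        # Without uniqueify, duplicates are left unchanged in this sketch.
        ids.append(raw)
    return ids


devices = [
    {"oem": "Acme", "model": "X1"},
    {"oem": "Acme", "model": "X1"},
    {"oem": "Bolt", "model": "Z"},
]
composite_ids = assign_ids(devices, composite_columns=("oem", "model"), uniqueify=True)
row_ids = assign_ids(devices, autogenerate=True)
```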

You can also POST a query directly to http://127.0.0.1:8000/query with a JSON body:

{"query": "Who manages Alice?", "k": 5}
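For scripted access, the same request can be built with Python's standard library. The sketch below only assumes what is stated above (the URL and the `query`/`k` fields); the helper names `build_query_request` and `send` are hypothetical, and the response is assumed to be JSON because its exact shape is not documented here.

```python
import json
import urllib.request


def build_query_request(
    query: str,
    k: int = 5,
    base_url: str = "http://127.0.0.1:8000",
) -> urllib.request.Request:
    """Build (but do not send) a POST /query request with a JSON body."""
    body = json.dumps({"query": query, "k": k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0):
    """Send the request and decode the reply, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_query_request("Who manages Alice?", k=5)
# With the API running locally, call: send(request)
```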

RAG Chat with Ollama (Qwen3:4B)

This project can chat with a local LLM served by Ollama. Make sure Ollama is running and the model has been pulled:

ollama pull qwen3:4b

The API exposes additional endpoints:

  • POST /upload-csv (multipart form): Upload a CSV and build a new graph and index
    • form fields: entity_type, optional id_column, optional composite_id_columns, autogenerate_ids (bool), uniqueify_duplicates (bool), optional text_columns
  • POST /chat (json): Ask a question with retrieval-augmented context using an Ollama model
    • default response: { "answer": "...", "citations": ["Type:Id", ...] }
    • verbose mode: include "verbose": true to get { "query", "answer", "retrieval" }
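A client consuming the default /chat response might split each `Type:Id` citation back into its parts, as sketched below. The helper names and the sample payload are made up for illustration, and splitting on the first colon assumes that type names never contain one.

```python
import json
from typing import List, Tuple


def split_citation(citation: str) -> Tuple[str, str]:
    """Split "Type:Id" at the first colon (assumes the type has no colon)."""
    node_type, sep, node_id = citation.partition(":")
    if not sep:
        raise ValueError(f"citation is not in Type:Id form: {citation!r}")
    return node_type, node_id


def parse_chat_response(raw: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return the answer and the citations as (type, id) pairs."""
    data = json.loads(raw)
    citations = [split_citation(c) for c in data.get("citations", [])]
    return data["answer"], citations


# Made-up sample in the documented default shape, not real model output.
sample = '{"answer": "Bob manages Alice.", "citations": ["Employee:1", "Employee:2"]}'
answer, cited_nodes = parse_chat_response(sample)
```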

CLI also includes a chat command:

python -m knowledge_graph.cli chat --index .\artifacts --graph .\artifacts\graph.json --q "Who manages Alice?" --k 5 --model qwen3:4b

Environment variables:

  • OLLAMA_HOST to override the Ollama base URL (default http://127.0.0.1:11434)
  • OLLAMA_TIMEOUT_SECS to extend chat timeout (default 300)
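A client or wrapper script could resolve these two variables as in the following sketch, falling back to the documented defaults. The function name `ollama_settings` is hypothetical, and stripping a trailing slash from the host is a convenience assumed here rather than documented behavior.

```python
import os
from typing import Mapping, Tuple

DEFAULT_OLLAMA_HOST = "http://127.0.0.1:11434"
DEFAULT_TIMEOUT_SECS = 300.0


def ollama_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, float]:
    """Read the Ollama base URL and chat timeout, using the documented defaults."""
    host = env.get("OLLAMA_HOST", DEFAULT_OLLAMA_HOST).rstrip("/")
    timeout = float(env.get("OLLAMA_TIMEOUT_SECS", DEFAULT_TIMEOUT_SECS))
    return host, timeout


defaults = ollama_settings({})
overridden = ollama_settings(
    {"OLLAMA_HOST": "http://gpu-box:11434/", "OLLAMA_TIMEOUT_SECS": "600"}
)
```

Passing the environment in as a mapping keeps the function easy to test without modifying the real process environment.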

How it works

  • The builder creates one node per CSV row; the node's type is the --entity-type value and its key is taken from the --id-column column.
  • For any other column that ends with _id, an edge is created from the row's node to a target node with the same type (unless configured otherwise). If the target node doesn't appear as a row, it is created as a stub node so the edge is still valid.
  • All other columns are kept as node attributes.
  • A node text description is composed from its attributes and neighbors and embedded with sentence-transformers for retrieval.
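The steps above can be sketched with the standard library alone. This is a simplified illustration, not the project's builder (which uses NetworkX): the `Type:Id` key format is borrowed from the citation format shown earlier, while naming each relation after its column minus the `_id` suffix, the dictionary layout, and the text template in `describe` are assumptions.

```python
import csv
import io
from typing import Any, Dict, List, Tuple

Node = Dict[str, Any]
Edge = Tuple[str, str, str]  # (source key, relation, target key)


def build_graph(
    csv_text: str, entity_type: str, id_column: str
) -> Tuple[Dict[str, Node], List[Edge]]:
    """One node per row, one edge per non-empty *_id column, stubs for missing targets."""
    nodes: Dict[str, Node] = {}
    edges: List[Edge] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = f"{entity_type}:{row[id_column]}"
        # Columns that are neither the id nor a relation become attributes.
        attrs = {
            c: v for c, v in row.items()
            if c != id_column and not c.endswith("_id")
        }
        nodes[key] = {"type": entity_type, "stub": False, "attrs": attrs}
        for column, value in row.items():
            if column == id_column or not column.endswith("_id") or not value:
                continue
            # Assumed relation name: "manager_id" -> "manager".
            edges.append((key, column[:-3], f"{entity_type}:{value}"))
    # Targets that never appeared as a row become stub nodes.
    for _, _, target in edges:
        nodes.setdefault(target, {"type": entity_type, "stub": True, "attrs": {}})
    return nodes, edges


def describe(key: str, nodes: Dict[str, Node], edges: List[Edge]) -> str:
    """Compose the text to embed from a node's attributes and outgoing edges."""
    parts = [f"{name}: {value}" for name, value in nodes[key]["attrs"].items()]
    parts += [f"{rel} -> {dst}" for src, rel, dst in edges if src == key]
    return f"{key}. " + "; ".join(parts)


# Made-up sample data for illustration.
sample_csv = "id,name,title,manager_id\n1,Alice,Engineer,2\n2,Bob,Manager,3\n"
nodes, edges = build_graph(sample_csv, entity_type="Employee", id_column="id")
alice_text = describe("Employee:1", nodes, edges)
```

In this sample, id 3 is referenced by Bob's `manager_id` but has no row of its own, so it ends up as a stub node with no attributes.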

CLI

  • build-kg: Build a graph from a CSV.
  • build-index: Build embeddings/index from a graph JSON.
  • query: Query the index and return the top-k nodes and a small subgraph context for RAG.
  • chat: Ask a question with retrieval-augmented context using an Ollama model.

Run python -m knowledge_graph.cli --help for full options.

Inputs/Outputs

  • Inputs: CSV with a header row. Required: one id-like column (e.g., id). Optional: relation columns *_id.
  • Outputs:
    • artifacts/graph.json: nodes/edges JSON
    • artifacts/node_texts.jsonl: node textual representations
    • artifacts/vectors.npy: node embeddings
    • artifacts/metadata.json: node ids, mapping, and model info

Notes

  • You can pass --text-columns to specify which columns form the node description; otherwise the columns that are neither the id column nor *_id relation columns are used.
  • Multi-table setups (multiple CSVs) are not supported directly yet: the intended workflow is to run build-kg once per CSV and then merge the graphs, but merging is a future enhancement. For now, auto-inference targets single-table exports with foreign-key-like columns.
