✨ A Multi-Modal Knowledge Graph RAG Framework ✨
From documents to multi-modal knowledge graphs — an all-in-one MMGraphRAG solution
This diagram illustrates the complete workflow of MMGraphRAG.
This project is based on modifications to nano-graphrag to support multi-modal inputs (community-related code removed). The image processing component uses YOLO and Multi-modal Large Language Models (MLLM) to convert images into scene graphs. The fusion component then uses spectral clustering to select candidate entities, combining the textual knowledge graph and the image knowledge graph to construct a multi-modal knowledge graph.
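To make the fusion step concrete, here is a minimal, self-contained sketch of spectral-clustering-based candidate selection. The toy embeddings and the `image_entity` index are illustrative assumptions, not the project's actual pipeline:

```python
# Sketch: shortlist candidate entities for cross-modal linking by clustering
# entity embeddings and keeping the entities that share a cluster with a
# given image entity. Toy data only; the real pipeline differs.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(12, 384))  # e.g. sentence-transformer vectors

labels = SpectralClustering(
    n_clusters=3, affinity="nearest_neighbors", n_neighbors=5, random_state=0
).fit_predict(embeddings)

image_entity = 0  # hypothetical index of an image-graph entity
candidates = [i for i, lab in enumerate(labels)
              if lab == labels[image_entity] and i != image_entity]
print(candidates)  # entities considered for linking against image_entity
```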
Our Cross-Modal Entity Linking (CMEL) dataset is available here:
https://github.com/wanxueyao/CMEL-dataset
```bash
pip install openai                 # LLM API calls
pip install sentence-transformers  # Text embeddings
pip install networkx               # Graph storage
pip install numpy                  # Numerical computation
pip install scikit-learn           # Vector similarity calculation
pip install Pillow                 # Image processing
pip install tqdm                   # Progress bar
pip install tiktoken               # Text chunking token calculation
pip install ultralytics            # YOLO image segmentation
pip install opencv-python          # Image processing (cv2)
pip install flask                  # Web server framework
pip install flask-cors             # Cross-origin support
```

This project supports two PDF parsing options. Install at least one:
| Option | Installation Command | Features |
|---|---|---|
| MinerU (Recommended) | `pip install -U "mineru[all]"` | Higher parsing quality, supports complex layouts, better image context extraction |
| PyMuPDF | `pip install pymupdf` | Lightweight, easy installation, suitable for simple PDFs |
- Switching: Set `USE_MINERU = True/False` in `src/parameter.py`
- Fallback: If MinerU is unavailable, the system automatically falls back to PyMuPDF
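A minimal sketch of what the availability check behind this fallback can look like (illustrative only; the project's real logic lives in its own modules):

```python
# Sketch: prefer MinerU when requested and installed, else fall back to PyMuPDF.
import importlib.util

USE_MINERU = True  # mirrors the switch in src/parameter.py

def pick_pdf_backend() -> str:
    """Return the name of the PDF parsing backend to use."""
    if USE_MINERU and importlib.util.find_spec("mineru") is not None:
        return "mineru"
    return "pymupdf"  # lightweight fallback

print(pick_pdf_backend())
```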
This project requires three types of models, all configured in src/parameter.py:
Used for text entity extraction, relationship building, etc. Requires an OpenAI-compatible API:
```python
API_KEY = "your-api-key"
API_BASE = "https://your-api-endpoint/v1"
MODEL_NAME = "qwen3-max"  # or other text models
```
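These settings work with any OpenAI-compatible client. As a quick sanity check, a sketch reusing the variables above (the prompt is just an example):

```python
from openai import OpenAI  # same package installed earlier

client = OpenAI(api_key=API_KEY, base_url=API_BASE)  # variables from the block above
resp = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "Reply with OK."}],
)
print(resp.choices[0].message.content)
```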
Used for image understanding, visual entity extraction, etc. Requires an API that supports image input:

```python
MM_API_KEY = "your-api-key"
MM_API_BASE = "https://your-api-endpoint/v1"
MM_MODEL_NAME = "qwen-vl-max"  # or other multi-modal models
```

Used for entity vectorization and semantic retrieval. Configure in `src/parameter.py`:
```python
from sentence_transformers import SentenceTransformer

EMBEDDING_MODEL_DIR = './models/all-MiniLM-L6-v2'
EMBED_MODEL = SentenceTransformer(EMBEDDING_MODEL_DIR, device="cpu")
```

Tip: The embedding model can be auto-downloaded using the model name, or manually downloaded and configured with a local path.
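For the auto-download route mentioned in the tip, passing the Hub model name instead of a local path is enough (a quick sketch; the model is fetched and cached on first use):

```python
from sentence_transformers import SentenceTransformer

# Downloads from the Hugging Face Hub on first use, then caches locally.
EMBED_MODEL = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")
print(EMBED_MODEL.encode(["a tiny smoke test"]).shape)  # (1, 384)
```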
If you choose to use MinerU:
- Install: `pip install -U "mineru[all]"`
- Configure: See the MinerU official documentation for model file downloads
- Verify: Ensure MinerU runs independently before proceeding
All core parameters are defined in src/parameter.py:
| Parameter | Description | Default |
|---|---|---|
| `INPUT_PDF_PATH` | Input PDF file path | - |
| `CACHE_PATH` | LLM response cache directory | `cache` |
| `WORKING_DIR` | Intermediate processing files directory | `working` |
| `OUTPUT_DIR` | Final graph output directory | `output` |
| `MMKG_NAME` | Output graph name | `mmkg_timestamp` |
| Parameter | Description | Default |
|---|---|---|
| `USE_MINERU` | Whether to use MinerU for PDF preprocessing | `True` |
| `ENTITY_EXTRACT_MAX_GLEANING` | Max iterations for text entity extraction | `0` |
| `ENTITY_SUMMARY_MAX_TOKENS` | Max tokens for entity summaries | `500` |
| `SUMMARY_CONTEXT_MAX_TOKENS` | Max tokens for summary context | `10000` |
| Parameter | Description | Default |
|---|---|---|
| `QueryParam.top_k` | Number of entities to retrieve | `5` |
| `QueryParam.response_type` | Response style type | Detailed System-like Response |
| `QueryParam.local_max_token_for_local_context` | Max tokens for local context | `4000` |
| `QueryParam.number_of_mmentities` | Number of multi-modal entities | `3` |
| `QueryParam.local_max_token_for_text_unit` | Max tokens for text unit | `4000` |
| `RETRIEVAL_THRESHOLD` | Retrieval similarity threshold | `0.2` |
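For orientation, the query-time defaults above can be pictured as a dataclass. This mirror is hypothetical: the field names come from the table, but the project's actual class may differ:

```python
from dataclasses import dataclass

@dataclass
class QueryParam:  # hypothetical mirror of the defaults in the table above
    top_k: int = 5
    response_type: str = "Detailed System-like Response"
    local_max_token_for_local_context: int = 4000
    number_of_mmentities: int = 3
    local_max_token_for_text_unit: int = 4000

RETRIEVAL_THRESHOLD = 0.2  # minimum similarity for retrieved entities
```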
```bash
# 1️⃣ Build knowledge graph
python main.py -i path/to/your/document.pdf

# 2️⃣ Query
python main.py -q "Your question"

# 3️⃣ Launch visualization ✨
python main.py -s
# 🌐 Visit http://localhost:8080 to explore the interactive graph
```

```bash
# Build graph from specified PDF file
python main.py -i path/to/your/document.pdf

# Specify working and output directories
python main.py -i document.pdf -w ./working -o ./output

# Use PyMuPDF for PDF processing (instead of MinerU)
python main.py -i document.pdf -m pymupdf

# Force rebuild (clear working directory)
python main.py -i document.pdf -f

# Show verbose debug logs
python main.py -i document.pdf -v
```

```bash
# Query the built graph
python main.py -q "Your question"

# Specify retrieval parameters
python main.py -q "Your question" --top_k 10 --response_type "Concise answer"

# If the graph doesn't exist, it will be built first
python main.py -i document.pdf -q "Your question"
```

The built-in Web visualization server lets you intuitively explore the knowledge graph:

```bash
# Start knowledge graph visualization server
python main.py -s

# Specify port and graph file
python main.py -s --port 8888 --graph path/to/graph.graphml
```

Visualization Highlights:
- 🔮 Force-Directed Layout: Automatically optimizes node positions for clear graph structure
- 🔍 Real-Time Search: Quickly locate entities of interest
- 🎯 Subgraph Highlighting: Enter a question to highlight relevant entities and connections
- 📋 Details Panel: Click nodes to view entity descriptions, types, and more
- 🎨 Type Coloring: Different entity types use different colors for easy identification
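The exported `.graphml` files can also be inspected outside the web UI. A small networkx sketch (the path assumes the bundled example output described later; adjust to your own output directory):

```python
import networkx as nx

# Load a fused multi-modal knowledge graph exported by the pipeline.
g = nx.read_graphml("examples/example_output/example_mmkg.graphml")
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")

# Spring (force-directed) layout, analogous to the server's view.
pos = nx.spring_layout(g, seed=42)
```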
| Argument | Short | Description |
|---|---|---|
| `--input` | `-i` | PDF file path |
| `--working` | `-w` | Intermediate working directory |
| `--output` | `-o` | Final output directory |
| `--method` | `-m` | PDF preprocessing method (mineru/pymupdf) |
| `--force` | `-f` | Force clear working directory and rebuild |
| `--verbose` | `-v` | Show verbose debug logs |
| `--query` | `-q` | Execute RAG query |
| `--top_k` | - | Number of entities to retrieve |
| `--response_type` | - | Response style |
| `--server` | `-s` | Start visualization server |
| `--port` | - | Server port (default: 8080) |
| `--graph` | - | Specify graph file path |
The examples/ directory contains complete usage examples, demonstrating the full workflow from PDF input to knowledge graph construction and Q&A evaluation:
```
examples/
├── example_input/               # 📥 Input files
│   ├── 2020.acl-main.45.pdf     # Sample PDF: An NLP academic paper
│   └── 13_qa.jsonl              # Q&A dataset: 13 questions (Text/Multimodal) with ground truth
│
├── example_working/             # ⚙️ Intermediate results (auto-generated)
│   ├── 2020.acl-main.45/        # PDF preprocessing output (Markdown, layout info)
│   ├── images/                  # Extracted images from PDF
│   ├── graph_*.graphml          # Intermediate graphs (text graph, image graph)
│   └── kv_store_*.json          # Key-value storage (Text Chunks, Image Descriptions, etc.)
│
├── example_output/              # 📤 Final output
│   ├── example_mmkg.graphml     # Final fused multi-modal knowledge graph
│   ├── example_mmkg_emb.npy     # Graph node embeddings
│   ├── example_mmkg_report.md   # Build statistics report (node count, entity distribution)
│   └── retrieval_log.md         # Detailed RAG query logs
│
├── cache/                       # 💾 Cache data
│   └── *.json                   # LLM API response cache for faster re-runs
│
├── paper/                       # 📄 Project materials
│   ├── framework.png            # System architecture diagram
│   └── mmgraphrag.pdf           # Project-related paper/documentation
│
├── docqa_example.py             # 🧪 Q&A evaluation script
└── docqa_results.md             # 📊 Evaluation results report
```
- Sample Document (`2020.acl-main.45.pdf`): Demonstrates the system's ability to process academic papers with rich text and charts.
- Evaluation Script (`docqa_example.py`): A one-click evaluation tool (sketched below) that:
  - Automatically reads the sample PDF and builds a knowledge graph
  - Loads questions from `13_qa.jsonl` (covering text-only and multi-modal chart Q&A)
  - Performs RAG retrieval and answering using the built graph
  - Generates a detailed evaluation report, `docqa_results.md`, comparing model answers with ground truth
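A hypothetical sketch of the loop's overall shape (the jsonl field names and the `answer_fn` hook are assumptions, not the script's actual API):

```python
# Hypothetical shape of a docqa_example.py-style evaluation loop.
import json

def load_qa(path: str) -> list[dict]:
    """Read one JSON object per line; field names assumed to be question/answer."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def evaluate(qa_path: str, answer_fn) -> list[dict]:
    """Run answer_fn (e.g. a RAG query over the built graph) on each question."""
    rows = []
    for item in load_qa(qa_path):
        rows.append({
            "question": item["question"],
            "ground_truth": item["answer"],
            "predicted": answer_fn(item["question"]),
        })
    return rows

# Usage: evaluate("examples/example_input/13_qa.jsonl", my_rag_query)
```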
Run evaluation:

```bash
python examples/docqa_example.py
```

The `eval_reference/` directory contains reference code for document QA evaluation on two benchmark datasets:
- DocBench: https://github.com/Anni-Zou/DocBench
- MMLongBench: https://github.com/EdinburghNLP/MMLongBench
> [!CAUTION]
> This code is for reference only and cannot be used directly.
MMGraphRAG has undergone a major refactoring that:
- Fixed compatibility issues caused by MinerU updates
- Enhanced robustness for resumable execution
- Removed redundant functionality
Even with the previous version, reproducing the results would be challenging due to its more complex MinerU configuration requirements.
If you wish to reproduce the evaluation results, we recommend rewriting based on the refactored codebase, using:
- `eval_reference/` as a reference for the evaluation logic
- `examples/docqa_example.py` as a template for building the QA pipeline
```
eval_reference/
├── docbench_eval/                       # DocBench dataset evaluation
│   ├── QA.py                            # Main QA script (MMGraphRAG, GraphRAG, LLM, MMLLM, NaiveRAG)
│   ├── evaluate.py                      # Evaluation metrics calculation
│   ├── eval_llm.py                      # LLM-based evaluation
│   ├── mineru_docbench.py               # MinerU preprocessing for DocBench
│   ├── naive_rag.py                     # Naive RAG baseline
│   ├── check.py                         # Result-checking utilities for MinerU preprocessing
│   ├── result.py                        # Result aggregation
│   └── evaluation_prompt.txt            # Evaluation prompts
│
└── mmlongbench_eval/                    # MMLongBench dataset evaluation
    ├── run.py                           # Main evaluation script (supports multiple methods)
    ├── eval_score.py                    # Scoring functions
    ├── extract_answer.py                # Answer extraction utilities
    ├── mineru_mmlongbench.py            # MinerU preprocessing for MMLongBench
    └── prompt_for_answer_extraction.md  # Answer extraction prompts
```
| File | Purpose |
|---|---|
| `QA.py` / `run.py` | Main entry points for running different QA methods (MMGraphRAG, GraphRAG, LLM, MMLLM, NaiveRAG) |
| `evaluate.py` / `eval_score.py` | Evaluation metrics (accuracy, F1, etc.) |
| `mineru_*.py` | MinerU-based PDF preprocessing for each dataset |
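As an illustration of the kind of metric involved, here is a token-level F1 commonly used for QA scoring; it is a sketch, and the actual scripts may compute their metrics differently:

```python
# Illustrative token-overlap F1 between a predicted and a gold answer.
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())  # shared token count
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the answer is 42", "42"))  # 0.4
```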
> [!NOTE]
> Honest Disclaimer: This evaluation code has not been polished for the research community and may appear somewhat messy. We warmly welcome contributions to improve this section of the codebase!
The refactored codebase demonstrates improved performance in small-scale testing (e.g., examples from the DocBench dataset). This improvement may be attributed to:
- Enhanced parsing accuracy from MinerU updates
- Performance improvements in the models used compared to the original experiments
When the paper is published, if the codebase remains unchanged, we plan to conduct a more thorough cleanup of this evaluation code.
Letting hues quietly weave through the knowledge graph 🎨
a small graph with big dreams ✨
