4 changes: 4 additions & 0 deletions pages/ai-ecosystem.mdx
@@ -33,5 +33,9 @@ This section of Memgraph’s documentation is your guide to using Memgraph for AI
- [Agents in Memgraph](/ai-ecosystem/agents): Discover how you can leverage AI
agents to automate graph modeling and migration tasks.

- [**Unstructured2Graph**](/ai-ecosystem/unstructured2graph): Learn how to use
  Unstructured2Graph to transform unstructured data into structured graph data
  within Memgraph.


<CommunityLinks/>
3 changes: 2 additions & 1 deletion pages/ai-ecosystem/_meta.ts
@@ -2,5 +2,6 @@ export default {
"graph-rag": "GraphRAG",
"integrations": "Integrations",
"machine-learning": "Machine learning",
"agents": "Agents"
"agents": "Agents",
"unstructured2graph": "Unstructured2Graph"
}
243 changes: 243 additions & 0 deletions pages/ai-ecosystem/unstructured2graph.mdx
@@ -0,0 +1,243 @@
---
title: Unstructured2Graph
description: Learn how to use Unstructured2Graph to transform unstructured data into structured graph data within Memgraph.
---
import { Callout } from 'nextra/components'
import { Steps, Tabs } from 'nextra/components'
import { CommunityLinks } from '/components/social-card/CommunityLinks'

# Unstructured2Graph

Every company sits on a pile of unstructured documents—reports, PDFs, research
papers, policies, or meeting notes. They contain valuable knowledge, but little
of it is connected or searchable.

With **Unstructured2Graph**, part of the Memgraph AI Toolkit, you can turn that
unstructured text into a connected knowledge graph that LLMs can query and
reason over.

Unstructured2Graph combines two powerful components:

- **Unstructured.io**: extracts, cleans, and chunks documents in various
  formats, such as PDF, DOCX, or TXT.
- **LightRAG**: a graph-based reasoning layer that handles prompt engineering
  and entity extraction automatically, mapping entities and relationships into
  Memgraph.

Together, they convert raw text into a knowledge graph with nodes, edges, and
embeddings ready for retrieval.


## Getting started

In this guide, you'll learn how to use Unstructured2Graph step by step.
You'll quickly go from setting up your project to creating your first entity
graph.

<Steps>

### Start Memgraph

Start by preparing your workspace and running Memgraph locally using Docker:

```bash
docker run -p 7687:7687 -p 7444:7444 --name memgraph memgraph/memgraph-mage
```

Open your terminal, VS Code, Cursor, or any other development environment you
prefer. This is where you'll run Python scripts connected to your Memgraph
instance.

You are now ready to start building your graph.
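
Before moving on, you can confirm the container is up by checking its logs
(plain Docker, nothing Memgraph-specific):

```bash
# The log output should show Memgraph accepting Bolt connections on port 7687
docker logs memgraph
```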

### Clone the Memgraph AI Toolkit

Next, clone the AI Toolkit repository, which contains the Unstructured2Graph
module:

```bash
git clone https://github.com/memgraph/ai-toolkit.git
cd ai-toolkit/unstructured2graph
```

### Install dependencies

First make sure `uv`, the package manager used by the AI Toolkit, is
installed. Then install the module's dependencies:

```bash
# Install dependencies using uv
uv pip install -e .
```

Detailed installation options for `uv` are described in the [uv
documentation](https://docs.astral.sh/uv/).
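
If `uv` isn't installed yet, the standalone installer from the uv
documentation is one common way to get it on macOS or Linux:

```bash
# Standalone installer, as documented at docs.astral.sh/uv
curl -LsSf https://astral.sh/uv/install.sh | sh
```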

### Configure environment variables

Create a `.env` file to configure your OpenAI API key for LLM-based entity
extraction:

```bash
# Required for LLM-based entity extraction
OPENAI_API_KEY=your_api_key_here
```
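
The scripts below expect `OPENAI_API_KEY` to be present in the process
environment. If you keep it in a `.env` file as shown, one way to load it is
with `python-dotenv` (a common helper, assumed here rather than bundled with
the toolkit):

```python
from dotenv import load_dotenv

# Reads OPENAI_API_KEY (and any other variables) from ./.env into os.environ
load_dotenv()
```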

### Ingest documents

Start by selecting the documents you want to process. Unstructured2Graph
supports multiple file types through Unstructured.io, including PDF, DOCX, TXT,
and HTML. It extracts readable text, removes unwanted elements such as headers
or page numbers, and divides the content into structured chunks based on
document layout. Each chunk is then ready for LightRAG to perform entity and
relationship extraction.

Here is a complete example of how to ingest documents and create a knowledge graph:

```python
import asyncio
import logging

from memgraph_toolbox.api.memgraph import Memgraph
from lightrag_memgraph import MemgraphLightRAGWrapper
from unstructured2graph import (
    from_unstructured,
    create_index,
    compute_embeddings,
    create_vector_search_index,
)


async def ingest_documents():
    # Connect to Memgraph and clear existing data
    memgraph = Memgraph()
    memgraph.query("MATCH (n) DETACH DELETE n;")
    create_index(memgraph, "Chunk", "hash")

    # Initialize LightRAG for entity extraction
    lrag = MemgraphLightRAGWrapper()
    await lrag.initialize()

    # Define your document sources
    sources = [
        "docs/paper.pdf",                # local file
        "https://example.com/page.html"  # remote URL
    ]

    # Process documents and extract entities
    await from_unstructured(
        sources=sources,
        memgraph=memgraph,
        lightrag_wrapper=lrag,
        only_chunks=False,  # create chunks and extract entities
        link_chunks=True    # link chunks sequentially with NEXT edges
    )

    await lrag.afinalize()

    # Create embeddings and a vector index for semantic search
    compute_embeddings(memgraph, "Chunk")
    create_vector_search_index(memgraph, "Chunk", "embedding")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    asyncio.run(ingest_documents())
```

Here's what happens step by step:

1. Text is extracted, cleaned, and chunked by Unstructured.io.
2. Each chunk becomes a `Chunk` node in Memgraph with properties like `hash` and
`text`.
3. LightRAG performs entity recognition and relationship extraction, creating
`base` nodes.
4. Entities are linked to chunks with `MENTIONED_IN` edges.
5. Chunks are connected sequentially with `NEXT` edges for traversal.
6. Embeddings are generated and a vector index is created for semantic search.

After processing, your Memgraph instance will hold a complete, queryable
knowledge graph ready for GraphRAG.
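
As a quick sanity check, you can count what was created, using the labels
described above (run the statements one at a time if your client doesn't
accept multiple queries):

```cypher
// Chunks produced from your documents
MATCH (c:Chunk) RETURN count(c) AS chunks;

// Entities extracted by LightRAG
MATCH (e:base) RETURN count(e) AS entities;
```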

### Query with GraphRAG

Once your data is ingested, you can perform GraphRAG retrieval directly inside
Memgraph with a single query. This combines semantic search with graph traversal
to retrieve the most relevant context for your questions.

```python
import os

from memgraph_toolbox.api.memgraph import Memgraph
from openai import OpenAI


def graphrag_query(prompt: str):
    memgraph = Memgraph()

    # Retrieve relevant chunks using vector search + graph traversal.
    # Note: 'vs_name' must match the name of the vector index created
    # during ingestion.
    retrieved_chunks = []
    for row in memgraph.query(
        f"""
        CALL embeddings.text(['{prompt}']) YIELD embeddings, success
        CALL vector_search.search('vs_name', 5, embeddings[0]) YIELD distance, node, similarity
        MATCH (node)-[r*bfs]-(dst:Chunk)
        WITH DISTINCT dst, degree(dst) AS degree ORDER BY degree DESC
        RETURN dst LIMIT 5;
        """
    ):
        if "text" in row["dst"]:
            retrieved_chunks.append(row["dst"]["text"])

    if not retrieved_chunks:
        print("No chunks retrieved.")
        return

    # Send the retrieved context to an LLM for summarization
    context = "\n\n".join(retrieved_chunks)
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer the question based on the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {prompt}"},
        ],
        temperature=0.1,
    )

    answer = completion.choices[0].message.content
    print(f"Question: {prompt}")
    print(f"Answer: {answer}")


if __name__ == "__main__":
    graphrag_query("What are the key findings in the document?")
```

Here's what the GraphRAG query does:

1. Converts the input prompt into an embedding.
2. Searches for the most semantically relevant chunks using vector search.
3. Expands context through connected nodes in the graph using BFS traversal.
4. Sends the retrieved text to an LLM for summarization or question answering.
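
Note that the example interpolates the prompt directly into the Cypher text,
so a question containing a single quote would break the query. A minimal
guard, shown here as a plain-Python sketch rather than a toolkit feature, is
to escape the prompt before embedding it:

```python
def escape_for_cypher(text: str) -> str:
    # Escape backslashes first, then single quotes, so the prompt can sit
    # safely inside the single-quoted Cypher string literal used above.
    return text.replace("\\", "\\\\").replace("'", "\\'")
```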

### Visualize the graph in Memgraph Lab

Open [Memgraph Lab](/memgraph-lab) and connect to your local instance. Then run:

```cypher
MATCH (n)-[r]->(m) RETURN n, r, m;
```

You'll see:

- `Chunk` nodes for text sections
- `base` nodes for extracted entities
- `MENTIONED_IN` edges linking entities to their source chunks
- `NEXT` edges connecting sequential chunks

Explore this graph visually to understand how your content has been transformed
into a connected network of knowledge.
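
If the full graph is too dense to read, you can narrow the view to entities
and the chunks that mention them, using the edge type listed above (direction
as described in the ingestion step):

```cypher
// Entities and the chunks they were extracted from
MATCH (e:base)-[m:MENTIONED_IN]->(c:Chunk)
RETURN e, m, c LIMIT 50;
```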

</Steps>

<Callout type="info">
**Try it in Memgraph Cloud**

Want to skip local setup? You can also use Unstructured2Graph directly with
[Memgraph Cloud](https://cloud.memgraph.com/). Sign up, create a new project,
and start building your knowledge graph in minutes.
</Callout>

<CommunityLinks />