Index and search documents from multiple sources using local vector embeddings. Named after Odin's raven of thought — the companion to Muninn.
Huginn fetches documents from Confluence, Notion, Jira, YouTube, X/Twitter, and local files, chunks them, generates vector embeddings locally, and stores them in a FAISS index. You can then search across all your knowledge sources via CLI, HTTP API, or MCP (Model Context Protocol) for AI agents.
```mermaid
graph LR
    subgraph Sources
        S1[Confluence]
        S2[Notion]
        S3[Jira]
        S4[YouTube]
        S5[X/Twitter]
        S6[Local Files]
        S7[Claude Sessions]
    end
    subgraph Huginn
        F[Fetch & Convert to Markdown]
        C[Chunk Text]
        E[Embed Locally<br/>multilingual-e5-base]
        I[FAISS + BM25 Index]
    end
    subgraph Search
        CLI[CLI]
        API[HTTP API]
        MCP[MCP for AI Agents]
    end
    S1 & S2 & S3 & S4 & S5 & S6 & S7 --> F --> C --> E --> I
    I --> CLI & API & MCP
```
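The fetch → chunk → embed → index pipeline above starts with chunking. As an illustrative sketch only (not Huginn's actual chunker), a typical approach is fixed-size chunks with overlap, so content that straddles a boundary still appears intact in one chunk:

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks of chunk_size words,
    overlapping by `overlap` words between consecutive chunks."""
    words = text.split()
    if not words:
        return []
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the document
    return chunks
```

Each chunk is then embedded and stored in the index, so search can return passages rather than whole documents.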
Key features:
- Fully local — no data leaves your machine (embeddings, indexing, search all run locally)
- Multilingual search (100+ languages via intfloat/multilingual-e5-base)
- Hybrid search: FAISS vector + BM25 keyword
- Cross-encoder reranking for precision
- Knowledge graph support for entity-aware search
- LLM-powered document tagging with constrained taxonomies
- MCP integration for AI coding assistants (Claude Code, Cursor, etc.)
- HTTP API server for low-latency search (<50ms)
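One detail worth knowing about the e5 embedding family: the models are trained with role prefixes, so queries and passages are embedded as `"query: …"` and `"passage: …"` respectively. A minimal sketch of that convention (whether and where Huginn applies the prefixes internally is not shown here):

```python
def e5_inputs(query: str, passages: list[str]) -> tuple[str, list[str]]:
    """Prepend the role prefixes that multilingual-e5 models expect
    before texts are passed to the embedding model."""
    return "query: " + query, ["passage: " + p for p in passages]

# Example: a Danish query against an English passage — the model's
# multilingual training makes cross-language matches like this work.
q, ps = e5_inputs("hvordan virker auth", ["Auth uses OIDC."])
```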
```shell
git clone https://github.com/RuneLind/huginn.git
cd huginn
uv sync  # install uv first: https://docs.astral.sh/uv/
```

Pick a source and run the setup script:

```shell
# Local files (markdown, PDF, DOCX, etc.) — simplest, no auth needed
./examples/setup-local-files.sh /path/to/docs my-docs

# Notion workspace
NOTION_TOKEN="secret_..." ./examples/setup-notion.sh my-notion

# YouTube channel transcripts
./examples/setup-youtube.sh "https://www.youtube.com/@ChannelName/videos" my-channel

# Claude Code session transcripts
./examples/setup-claude-sessions.sh

# Confluence space (requires Playwright: uv run playwright install chromium)
CONF_TOKEN="Bearer ..." ./examples/setup-confluence.sh MYSPACE my-confluence

# Jira project (requires Playwright: uv run playwright install chromium)
JIRA_TOKEN="Bearer ..." ./examples/setup-jira.sh MYPROJECT my-jira
```

Then search:

```shell
# CLI search
uv run collection_search_cmd_adapter.py --collection my-notion --query "how does auth work"

# Start the API server
uv run knowledge_api_server.py --collections my-notion --port 8321
# Then: curl "http://localhost:8321/api/search?q=auth&collection=my-notion"
```

Your data lives in data/ (gitignored) — source markdown in data/sources/, indexes in data/collections/.
Serve multiple collections from one server:

```shell
uv run knowledge_api_server.py --collections my-notion my-confluence --port 8321
```

The search pipeline:

```mermaid
graph LR
    Q[Query] --> EMB[Embed Query]
    EMB --> HS{Hybrid Search}
    HS --> FAISS[FAISS Vector]
    HS --> BM25[BM25 Keyword]
    FAISS & BM25 --> RRF[Reciprocal Rank Fusion]
    RRF --> RE[Cross-Encoder Rerank]
    RE --> R[Results + Citations]
```
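Reciprocal Rank Fusion merges the FAISS and BM25 rankings using only rank positions, so the two scoring scales never need to be normalized against each other. A minimal sketch (the constant k=60 is the value from the original RRF paper; Huginn's actual fusion parameters may differ):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids: each doc scores
    the sum of 1 / (k + rank) over every list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

faiss_hits = ["doc-a", "doc-b", "doc-c"]   # vector ranking
bm25_hits  = ["doc-b", "doc-d", "doc-a"]   # keyword ranking
fused = rrf_fuse([faiss_hits, bm25_hits])  # docs ranked by both lists rise to the top
```

The fused list is then handed to the cross-encoder reranker for the final ordering.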
Endpoints:
- `GET /api/search?q=...&collection=...&limit=10` — Hybrid search with reranking
- `GET /api/collections` — List loaded collections
- `GET /api/document/{collection}/{doc_id}` — Full document
- `GET /api/graph/{node_id}` — Knowledge graph node
- `GET /health` — Health check
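Any HTTP client can call the search endpoint. A sketch of building a properly encoded request URL (parameter names taken from the endpoint list above; the base URL is whatever host/port you started the server on):

```python
from urllib.parse import urlencode

def search_url(base: str, query: str, collection: str, limit: int = 10) -> str:
    """Build the /api/search URL with URL-encoded query parameters."""
    params = urlencode({"q": query, "collection": collection, "limit": limit})
    return f"{base}/api/search?{params}"

url = search_url("http://localhost:8321", "how does auth work", "my-notion")
# http://localhost:8321/api/search?q=how+does+auth+work&collection=my-notion&limit=10
```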
```mermaid
sequenceDiagram
    participant User
    participant AI as AI Assistant<br/>(Claude Code / Cursor)
    participant MCP as Huginn MCP Server
    participant API as Huginn API Server
    User->>AI: Ask a question
    AI->>MCP: search_knowledge(query)
    MCP->>API: GET /api/search?q=...
    API-->>MCP: Top results + URLs
    MCP-->>AI: Context documents
    AI-->>User: Answer with citations
```
Use with Claude Code, Cursor, or any MCP-compatible client. Start the API server, then add to your MCP config:
```json
{
  "mcpServers": {
    "huginn": {
      "command": "uv",
      "args": ["--directory", "/path/to/huginn", "run", "knowledge_api_mcp_adapter.py"],
      "env": {
        "KNOWLEDGE_API_URL": "http://localhost:8321",
        "KNOWLEDGE_COLLECTIONS": "my-notion,my-confluence",
        "KNOWLEDGE_DESCRIPTION": "Search my team's Notion workspace and Confluence docs"
      }
    }
  }
}
```

The `KNOWLEDGE_DESCRIPTION` tells the AI agent what it's searching, so it knows when to use the tool.
Set `HUGINN_TRACE_POINTER=1` on both the API server and the MCP adapter to attach a per-search trace (FAISS / BM25 / RRF / cross-encoder ranks, timings, expansion terms) to every tool result via a short `huginn-trace-url:` pointer line. Orchestrators (e.g. Muninn) fetch the full trace from `GET /api/trace/<id>` and render it in their span UI; LLMs never see the trace. See docs/search-tracing-plan.md for the full env-var matrix and rollout plan.
Incremental updates (only fetches new/modified documents):

```shell
uv run collection_update_cmd_adapter.py --collection my-collection
```

The setup scripts above handle everything automatically. If you prefer manual control:
Notion (manual steps)

```shell
# Download (requires NOTION_TOKEN env var)
uv run notion_collection_create_cmd_adapter.py --downloadOnly --saveMd ./data/sources/my-notion

# Clean up stubs
uv run notion_cleanup_md.py --saveMd ./data/sources/my-notion

# Index
uv run files_collection_create_cmd_adapter.py \
  --basePath ./data/sources/my-notion --collection my-notion \
  --excludePatterns "^\.excluded/.*"
```

Confluence (manual steps)
```shell
# Requires: uv run playwright install chromium
uv run scripts/confluence/fetchers/confluence_fetcher_hierarchical.py \
  --space MYSPACE --saveMd ./data/sources/my-confluence

uv run confluence_cleanup_md.py --saveMd ./data/sources/my-confluence --minWordCount 30 --sanitize

uv run files_collection_create_cmd_adapter.py \
  --basePath ./data/sources/my-confluence --collection my-confluence \
  --excludePatterns "^\.excluded/.*"
```

Jira (manual steps)
```shell
# Requires: uv run playwright install chromium
uv run scripts/jira/fetchers/jira_fetcher.py --saveMd ./data/sources/my-jira --project MYPROJECT

uv run jira_cleanup_md.py --saveMd ./data/sources/my-jira --minWordCount 30

uv run files_collection_create_cmd_adapter.py \
  --basePath ./data/sources/my-jira --collection my-jira \
  --excludePatterns "^\.excluded/.*"
```

YouTube (manual steps)
```shell
uv run youtube_fetch_cmd_adapter.py \
  --channelUrl "https://www.youtube.com/@ChannelName/videos" \
  --channelName my-channel

uv run youtube_preprocess_md.py --saveMd ./data/sources/my-channel/markdown/my-channel

uv run files_collection_create_cmd_adapter.py \
  --basePath ./data/sources/my-channel/markdown/my-channel --collection my-channel
```

Build a knowledge graph from your documents for entity-aware search:
```mermaid
graph LR
    subgraph "Graph Extraction"
        D[Documents] --> GE[Graph Extractor]
        GE --> N1[Entities / Nodes]
        GE --> E1[Relationships / Edges]
    end
    subgraph "Entity-Aware Search"
        Q[Query] --> S[Search]
        S --> KG[Knowledge Graph]
        KG --> Related[Related Entities<br/>+ Context]
    end
    N1 & E1 --> KG
```
```shell
# Extract graph from Jira issues (epics + cross-references)
uv run scripts/knowledge_graph/extract_jira_graph.py \
  --source ./data/sources/my-jira --output ./my_jira_graph.json

# Start server with graph
KNOWLEDGE_GRAPH_PATH=./my_graph.json JIRA_GRAPH_PATH=./my_jira_graph.json \
  uv run knowledge_api_server.py --collections my-collection
```

Write your own graph extractors for domain-specific entity extraction.
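To make the idea concrete, here is a hypothetical sketch of the nodes-and-edges shape a custom extractor might emit, and how entity-aware search can walk it. The schema (`nodes`/`edges`, `src`/`dst`/`rel` keys) is illustrative only — check the extractor output for the format Huginn actually uses:

```python
# Hypothetical graph schema for illustration purposes.
graph = {
    "nodes": {
        "EPIC-1":  {"type": "epic",  "title": "Auth overhaul"},
        "PROJ-42": {"type": "issue", "title": "Migrate to OIDC"},
    },
    "edges": [
        {"src": "PROJ-42", "dst": "EPIC-1", "rel": "belongs_to"},
    ],
}

def related_entities(graph: dict, node_id: str) -> list[str]:
    """Return ids of nodes directly connected to node_id, in either direction."""
    out = []
    for e in graph["edges"]:
        if e["src"] == node_id:
            out.append(e["dst"])
        elif e["dst"] == node_id:
            out.append(e["src"])
    return out
```

At query time the graph lets search pull in context from related entities (e.g. an epic's sibling issues) rather than only the chunks that matched the query text.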
When to run the LLM extractor (and when not to): see docs/knowledge-graph-when-to-use-what.md for the asymmetry between hand-curated wiki collections (don't run it) and raw-chunk collections (do run it), plus the cache-invalidation gotcha when you swap models. For the alternative pattern of building a hand-curated wiki collection instead of relying on machine extraction, see docs/wiki-collection-pattern.md.
Tag documents with LLM-generated topic tags from constrained taxonomies:
```shell
# Discover tags (free-form exploration)
uv run scripts/tagging/discover_tags.py --source data/sources/my-docs \
  --description "my domain" --sample 200 --output discovery.json

# Tag with taxonomy
uv run scripts/tagging/tag_documents.py --source data/sources/my-docs \
  --taxonomy my_taxonomy.json
```

| Source | Environment Variables |
|---|---|
| Confluence/Jira Server | `CONF_TOKEN` / `JIRA_TOKEN` (Bearer) or `CONF_LOGIN` + `CONF_PASSWORD` |
| Confluence/Jira Cloud | `ATLASSIAN_EMAIL` + `ATLASSIAN_TOKEN` |
| Notion | `NOTION_TOKEN` |
See .env.example for a template.
For organizing private, domain-specific collections (work projects, team knowledge, etc.), create gitignored folders inside huginn with their own git repos:
```
huginn/
├── main/            # open-source core
├── examples/
├── data/            # gitignored — your indexed data
├── my-work/         # gitignored — private git repo
│   ├── taxonomies/  # domain-specific tag taxonomies
│   ├── graphs/      # domain knowledge graphs
│   └── start.sh     # start with work collections
├── my-personal/     # gitignored — private git repo
│   └── start.sh     # start with personal collections
└── start.sh         # gitignored — your combined start script
```
Each private folder can be its own git repo, pushed to a private remote. Add them to .gitignore:
```
my-work/
my-personal/
start.sh
```
Use --data-path when data lives outside the default ./data/collections:
```shell
uv run knowledge_api_server.py \
  --data-path /path/to/shared/data/collections \
  --collections my-work-docs my-personal-notes
```

```mermaid
graph TD
    subgraph "Adapters (CLI / API / MCP)"
        CMD[CLI Adapters<br/>collection_*_cmd_adapter.py]
        API_SRV[API Server<br/>knowledge_api_server.py]
        MCP_A[MCP Adapters<br/>knowledge_api_mcp_adapter.py]
    end
    subgraph "main/core"
        Creator[Collection Creator]
        Searcher[Collection Searcher]
    end
    subgraph "main/sources"
        Conf[Confluence]
        Not[Notion]
        Jira[Jira]
        Files[Files]
    end
    subgraph "main/indexes"
        Embed[Embeddings<br/>multilingual-e5-base]
        FAISS_I[FAISS Indexer]
        BM25_I[BM25 Indexer]
        Hybrid[Hybrid Search + RRF]
        Rerank[Cross-Encoder Reranker]
    end
    subgraph "main/graph"
        KG[Knowledge Graph]
    end
    subgraph "scripts/"
        Fetch[Fetchers<br/>Confluence, Jira, YouTube, X]
        Graph_E[Graph Extractors]
        Tag[Document Tagging]
    end
    CMD & API_SRV & MCP_A --> Creator & Searcher
    Creator --> Conf & Not & Jira & Files
    Creator --> Embed --> FAISS_I & BM25_I
    Searcher --> Hybrid --> FAISS_I & BM25_I
    Hybrid --> Rerank
    Searcher --> KG
    Fetch --> Creator
    Graph_E --> KG
```
```
main/
  core/        — Collection creator + searcher
  sources/     — Source adapters (Confluence, Jira, Notion, Files)
  indexes/     — FAISS, BM25, hybrid search, embeddings, reranking
  graph/       — Knowledge graph engine
  persisters/  — Disk storage
  utils/       — Batching, logging, progress
scripts/       — Fetchers, graph extractors, tagging tools
examples/      — Setup scripts and templates
```
See docs/HOW_IT_WORKS.md for the full architecture walkthrough.
MIT