Vatuta - Virtual Assistant for Task Understanding, Tracking & Automation



Vatuta is a portfolio PoC that demonstrates how RAG and LLMs can act as a single source of truth for engineering managers. A manager's constant burden is knowing the status of every task, issue, and decision — instantly, accurately, and in full detail. Vatuta ingests knowledge from Jira, Confluence, Slack, and other sources into a unified vector knowledge base, making that information queryable in plain language from a single CLI.

Why Vatuta? Being a manager means being responsible for knowing the state of everything, at all times, with full context. Vatuta is a proof of concept showing how a RAG system can reduce that mental load — replacing frantic tab-switching with a single, honest, up-to-date answer.


Table of Contents

  • Features
  • Architecture
  • Tech Stack
  • Quick Start
  • Installation
  • Configuration
  • Integrations
  • Qdrant Vector Database
  • Project Structure
  • Development
  • Future Improvements
  • Contributing
  • License


Features

Data ingestion & knowledge base

  • 🎫 Jira — tickets, comments, history, and changelogs with semantic chunking strategies
  • 📄 Confluence — spaces and pages with full HTML-to-Markdown conversion
  • 💬 Slack — channels, threads, DMs and group DMs with rate-limit-aware ingestion
  • 🦊 GitLab — issues and MRs (PoC)
  • 📅 Google Calendar — events and schedules (work in progress)
  • ⚡ Incremental updates with checkpointing and caching (no full re-ingestion)
  • 👤 Global Entity Manager — resolves the same entity (currently users) across Jira, Confluence, and Slack (work in progress)

Query & retrieval

  • 🔍 Vector-based semantic search over all your sources simultaneously
  • 🗂️ Metadata filtering and source selection (see the retrieval sketch after this list)
  • 🔀 Dynamic routing that applies filters or gathers documents according to how the query is interpreted
  • 🧠 LangGraph-powered RAG agent with tool-based retrieval
  • 🌐 Multiple LLM backends: AWS Bedrock, Google Gemini, Anthropic Claude
  • 📊 Configurable k parameter and source display for transparent answers
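
As a rough illustration of the retrieval features above, the sketch below runs a metadata-filtered semantic search directly against Qdrant. The collection name, payload key, and embedding model are taken from the example configuration later in this README; the project's actual retrieval path lives in src/rag/ and may differ.

```python
# Sketch only: filtered semantic search over the shared Qdrant collection.
# The "source" payload key is an assumption; the real schema is defined in src/rag/.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
client = QdrantClient(url="http://localhost:6333")

question = "What tickets are blocking the release?"
hits = client.query_points(
    collection_name="vatuta_documents",
    query=model.encode(question).tolist(),
    query_filter=Filter(
        must=[FieldCondition(key="source", match=MatchValue(value="jira"))]
    ),
    limit=20,  # the configurable k parameter exposed by the CLI
).points

for hit in hits:
    print(f"{hit.score:.3f}", hit.payload)
```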

Observability

  • 📈 Prometheus metrics for every source and retrieval operation
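
The metric names below are invented for illustration (the real collectors live in src/metrics/); they only show the kind of instrumentation, via prometheus_client, that ingestion and retrieval expose.

```python
# Sketch only: the style of Prometheus instrumentation used around ingestion/retrieval.
from prometheus_client import Counter, Histogram, start_http_server

DOCS_INGESTED = Counter(
    "vatuta_documents_ingested_total", "Documents ingested", ["source"]
)
RETRIEVAL_LATENCY = Histogram(
    "vatuta_retrieval_seconds", "Time spent in vector retrieval"
)

start_http_server(8000)  # expose /metrics for Prometheus to scrape

DOCS_INGESTED.labels(source="jira").inc()
with RETRIEVAL_LATENCY.time():
    ...  # run the Qdrant query here
```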

Developer experience

  • 🏗️ Typed codebase with strict mypy, Ruff, and Black enforcement
  • 🔒 Security checks: Bandit, Semgrep, pip-audit, detect-secrets
  • 🪝 Pre-commit hooks and GitHub Actions CI pipeline

Architecture

graph TB
    subgraph CLI["CLI (Typer + Rich)"]
        UQ[User Query]
        IC[Ingestion command]
    end

    subgraph Agent["RAG Agent (LangGraph)"]
        AG[Agent Router]
        LLM[LLM Backend<br>Bedrock · Gemini · Anthropic]
        AG --> LLM
    end

    subgraph Retrieval["Retrieval Layer"]
        RE[RAG Engine]
        DM[Document Manager]
        RE --> DM
    end

    subgraph Storage["Storage Layer"]
        QD[(Qdrant<br>Vector DB)]
        EN[(entities.json<br>Global Identities)]
        RW[(JSONL.gz<br>Raw Source Cache)]
    end

    subgraph Ingestion["Ingestion Layer (Sources)"]
        EM[Entity Manager]
        JR[Jira Source]
        CF[Confluence Source]
        SL[Slack Source]
    end

    UQ --> AG
    IC --> Ingestion
    AG --> RE
    DM --> QD
    JR --> EM
    CF --> EM
    SL --> EM
    EM --> EN
    JR --> RW
    CF --> RW
    SL --> RW
    JR --> DM
    CF --> DM
    SL --> DM

Three main layers:

| Layer | Components | Purpose |
|---|---|---|
| Ingestion | Sources + Entity Manager + Document Manager | Fetch, parse, resolve identities, chunk, and store sources of data |
| Storage | Qdrant (vectors) + JSONL cache + entities.json | Persistent vector index and raw data for queries |
| Query | RAG Engine + LangGraph Agent + LLM | Semantic search and natural language answers |

Tech Stack

| Category | Technology | Role |
|---|---|---|
| AI Framework | LangChain, LangGraph | Agent orchestration and RAG chains |
| Prompt Engineering | DSPy | Prompt formalization and optimization |
| LLM Providers | AWS Bedrock, Google Gemini, Anthropic Claude | Language model backends |
| Vector Database | Qdrant | Semantic document storage and search |
| Embeddings | Sentence Transformers | Local embedding generation (no API cost) |
| NLP | spaCy | Intelligent text chunking |
| Data Sources | Jira, Confluence, Slack, GitLab, Google APIs | Knowledge ingestion |
| CLI | Typer + Rich | Interactive command-line interface |
| Observability | Prometheus | Metrics for ingestion and retrieval |
| Dependency Mgmt | Poetry | Package and virtualenv management |
| Task Automation | Just | Developer task runner |
| Linting | Ruff, mypy | Code quality and type checking |
| Security | Bandit, Semgrep, pip-audit, detect-secrets | Static analysis and vulnerability scanning |
| CI/CD | GitHub Actions | Automated testing, linting, and SBOM generation |
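
To make the Embeddings and NLP rows concrete, here is a minimal sketch of sentence-aware chunking with spaCy followed by local embedding with Sentence Transformers. The spaCy model name and the exact chunking logic are assumptions; Vatuta's real chunking strategies live in src/sources/ and src/models/.

```python
# Sketch only: sentence-aware chunking (spaCy) + local embeddings (Sentence Transformers).
# en_core_web_sm is an assumption; the embedding model matches the example config.
import spacy
from sentence_transformers import SentenceTransformer

nlp = spacy.load("en_core_web_sm")
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def chunk_text(text: str, max_tokens: int = 256) -> list[str]:
    """Group sentences into chunks that respect the embedding model's token limit."""
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for sent in nlp(text).sents:
        n_tokens = len(model.tokenizer.tokenize(sent.text))
        if current and used + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sent.text)
        used += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


vectors = model.encode(chunk_text("A long Confluence page, converted to Markdown..."))
```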

Quick Start

These steps get Vatuta running as a user — querying your tools from the CLI. If you want to contribute or develop, see Installation instead.

Prerequisites

| Tool | Version | Notes |
|---|---|---|
| Python | >= 3.12 | Tested up to 3.14 |
| Poetry | >= 1.8 | Dependency and virtualenv management |
| Just | any | Task runner |
| Docker | any | Optional — only needed to run Qdrant locally (see qdrant_setup.md) |

You will also need credentials for:

  • Data sources — API keys for the tools you want to ingest (Jira, Confluence, Slack, etc.)
  • LLM backend — at least one of: AWS Bedrock credentials, Google Gemini API key, or Anthropic API key

See Configuration for the full list of required variables.

Steps

# 1. Clone the repository
git clone git@github.com:franjuan/vatuta.git
cd vatuta

# 2. Configure your credentials
#    Copy the template and fill in your API keys (Jira, Confluence, Slack, LLM backend, etc.)
cp env.example .env
# Edit .env — just loads it automatically, no manual source needed

# 3. Install runtime dependencies
#    Creates the virtualenv, installs packages, and downloads the spaCy NLP model
just install

# 4. Start the vector database
#    Skip this step if you already have a Qdrant instance running elsewhere
just qdrant-start

# 5. Configure your data sources
#    Tell Vatuta which Jira projects, Confluence spaces, and Slack channels to ingest
cp config/vatuta.yaml.example config/vatuta.yaml
# Edit config/vatuta.yaml to enable and configure your sources

# 6. Ingest data from your sources
#    Fetches documents, chunks them, embeds them, and stores vectors in Qdrant
just load-sources

# 7. Ask a question
#    Runs the assistant with --show-sources and --show-stats flags (defined in justfile)
just assistant "What tickets are blocking the release?" 20

Installation

These steps set up a full development environment for contributing to or extending Vatuta. If you just want to run the app, follow the Quick Start instead.

Additional prerequisites

On top of the Quick Start prerequisites:

| Tool | Version | Notes |
|---|---|---|
| direnv | any | Optional — auto-activates the Poetry virtualenv on cd |
| pre-commit | any | Optional — only needed if you plan to contribute |

Development setup

# Clone the repository
git clone git@github.com:franjuan/vatuta.git
cd vatuta

# (Optional) Auto-activate the Poetry virtualenv when entering the directory
# Note: .env is already loaded by just — direnv only adds virtualenv auto-activation
direnv allow

# Install all dependencies including dev tools (linters, type checkers, test suite)
just setup

# (Optional) Install pre-commit hooks
#    Runs linters, type checks, and security scans automatically before each commit
just pre-commit-install

Configuration

Vatuta uses two configuration files that work together:

1. .env — Credentials and secrets

Copy env.example to .env and fill in your credentials:

cp env.example .env

| Variable | Required | Description |
|---|---|---|
| LLM backends | | |
| GEMINI_API_KEY | For Gemini | Google AI API key |
| AWS_REGION | For Bedrock | AWS region (e.g. us-east-1) |
| AWS_PROFILE | For Bedrock | AWS named profile — used when no bearer token is set |
| AWS_BEARER_TOKEN_BEDROCK | For Bedrock | Bearer token — alternative to profile auth |
| Data sources | | |
| JIRA_USER | For Jira/Confluence | Atlassian account email |
| JIRA_API_TOKEN | For Jira/Confluence | Atlassian API token |
| SLACK_BOT_TOKEN | For Slack | Bot token (xoxb-...) |
| Infrastructure | | |
| QDRANT_API_KEY | Recommended | Leave empty for unauthenticated local dev |
| Application | | |
| LOG_LEVEL | No | Log verbosity. Default: INFO |

Note: Source URLs, project keys, spaces, and channel lists live in config/vatuta.yaml, not in .env. See the next section for details.

2. config/vatuta.yaml — Application configuration

Copy the example and edit:

cp config/vatuta.yaml.example config/vatuta.yaml
# LLM backend selection
rag:
  llm_backend:
    bedrock:
      model_id: "bedrock/us.anthropic.claude-3-7-sonnet-20250219-v1:0"
      temperature: 0.2
    gemini:
      model_id: "gemini/gemini-2-flash"
      temperature: 1.0

# Cross-source identity resolution storage
entities_manager:
  storage_path: "data/entities.json"

# Qdrant vector database
qdrant:
  url: "http://localhost:6333"
  collection_name: "vatuta_documents"
  embeddings_model: "sentence-transformers/all-MiniLM-L6-v2"

# Sources to ingest
sources:
  jira:
    jira-main:
      enabled: true
      url: "https://your-domain.atlassian.net"
      projects: ["PROJECT1", "PROJECT2"]

  confluence:
    confluence-main:
      enabled: true
      url: "https://your-domain.atlassian.net"
      spaces: ["SPACE1"]
      initial_lookback_days: 30

  slack:
    slack-main:
      enabled: true
      workspace_domain: "https://your-workspace.slack.com"
      channel_types: ["public_channel", "private_channel", "im", "mpim"]
      initial_lookback_days: 30

For detailed per-source configuration, see docs/integrations.md.
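
As an illustration of how this file can be consumed from Python, the sketch below reads it with PyYAML. The project's actual configuration objects are typed Pydantic models under src/models/, so treat this only as a picture of the file's shape.

```python
# Sketch only: reading config/vatuta.yaml with PyYAML; keys mirror the example above.
import yaml

with open("config/vatuta.yaml") as fh:
    cfg = yaml.safe_load(fh)

qdrant_url = cfg["qdrant"]["url"]              # "http://localhost:6333"
collection = cfg["qdrant"]["collection_name"]  # "vatuta_documents"
enabled_jira = {
    name: src for name, src in cfg["sources"]["jira"].items() if src.get("enabled")
}
print(qdrant_url, collection, list(enabled_jira))
```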


Integrations

Vatuta ingests data from multiple sources into a shared Qdrant vector collection. All sources support:

  • ✅ Vector-based semantic search
  • ✅ Incremental updates with checkpointing (no full re-ingestion); see the sketch after the table below
  • ✅ Cross-source identity resolution via the Entity Manager
  • ✅ Configurable date ranges and filters
  • ✅ Prometheus metrics

| Integration | Status | Doc |
|---|---|---|
| Jira | ✅ Stable | docs/sources/jira.md |
| Confluence | ✅ Stable | docs/sources/confluence.md |
| Slack | ✅ Stable | docs/sources/slack.md |
| GitLab | 🧪 PoC | |
| Google Calendar | 🚧 In Progress | |

For full setup and usage instructions, see docs/integrations.md.
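
The checkpointing mentioned above can be pictured roughly as follows. This is a purely illustrative sketch: the checkpoint file name, location, and structure are invented here, and each connector in src/sources/ handles this internally.

```python
# Sketch only: the idea behind incremental ingestion with a per-source checkpoint.
import json
from datetime import datetime, timezone
from pathlib import Path

CHECKPOINT = Path("data/checkpoints/jira-main.json")  # hypothetical path


def load_last_sync() -> datetime | None:
    if CHECKPOINT.exists():
        return datetime.fromisoformat(json.loads(CHECKPOINT.read_text())["last_sync"])
    return None  # first run: fall back to the configured initial_lookback_days


def save_last_sync(ts: datetime) -> None:
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_text(json.dumps({"last_sync": ts.isoformat()}))


since = load_last_sync()
# fetch_documents(updated_after=since)  # hypothetical: only changed items are re-embedded
save_last_sync(datetime.now(timezone.utc))
```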

Qdrant Vector Database

Vatuta requires a running Qdrant instance.

You can use your own instance or the Docker-managed one controlled through these just commands:

just qdrant-start      # Start the container
just qdrant-status     # Check status
just qdrant-dashboard  # Open the web UI at http://localhost:6333/dashboard
just qdrant-stop       # Stop the container

See docs/qdrant_setup.md for full setup and troubleshooting.
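
A quick way to verify the instance from Python (the collection name comes from the example configuration and is an assumption):

```python
# Sketch only: check that Qdrant is reachable and the collection is populated.
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print([c.name for c in client.get_collections().collections])
print(client.count(collection_name="vatuta_documents"))  # number of stored vectors
```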


Project Structure

vatuta/
├── src/
│   ├── client/        # CLI entry point (Typer app)
│   ├── entities/      # Global Entity Manager (cross-source identity resolution)
│   ├── metrics/       # Prometheus metrics collectors
│   ├── models/        # Pydantic data models (Document, Chunk, Config)
│   ├── rag/           # RAG engine, LangGraph agent, Qdrant manager, tools
│   ├── sources/       # Data source connectors (Jira, Confluence, Slack, GitLab)
│   └── utils/         # Shared utilities
├── tests/             # Unit tests (pytest)
├── docs/              # Documentation
│   └── sources/       # Per-source design decisions and configuration reference
├── config/            # vatuta.yaml and vatuta.yaml.example
├── pocs/              # Proof-of-concept and experimental scripts
├── data/              # Local data storage (Qdrant, JSONL cache, entities)
├── logs/              # Application logs
├── .github/workflows/ # GitHub Actions CI pipeline
├── justfile           # Task automation commands
├── pyproject.toml     # Poetry configuration, tool settings
├── poetry.lock        # Locked dependency versions
├── .pre-commit-config.yaml # Pre-commit hooks configuration
├── .secrets.baseline  # Baseline for detect-secrets
├── .pip-audit-ignore  # Ignored vulnerabilities for pip-audit
├── .markdownlint.json # Markdown style rules
├── .envrc             # direnv environment loading script
├── pyrefly.toml       # LSP settings
├── AGENTS.md          # AI Assistant coding instructions
├── CONTRIBUTING.md    # Contribution guidelines
├── LICENSE            # Open source license
├── THIRD_PARTY.md     # Third-party licenses
└── env.example        # Environment variables template

Development

Common commands

# Setup & Environment
just setup                            # Complete dev setup (deps + spaCy)
just clean                            # Clean up temporary/cache files
just reset                            # Full reset (removes venv + temp files)

# Running & Ingestion
just vatuta-help                      # Show vatuta help
just assistant query="..." k="20"     # Query with explicit parameters
just load-sources                     # Ingest data from all enabled sources
just qdrant-start                     # Start Qdrant local vector db instance

# Testing & Code Quality
just check                            # Run all checks (lint + format-check + test)
just test                             # Run the test suite
just test-coverage                    # Run tests with HTML coverage report
just format                           # Auto-format code (Black + Ruff + isort)
just lint                             # Run linters (Ruff, mypy, detect-secrets)
just pre-commit                       # Run all pre-commit hooks manually
just cc                               # Generate cyclomatic complexity report

# Tooling & Info
just dep-path [dep]                   # Analyze dependency chains
just --list                           # See all available project commands

Environment management (direnv)

With direnv enabled, the .env file and Poetry virtualenv are activated automatically when you cd into the project — no manual poetry shell or source .env needed.

direnv allow   # Enable once after cloning

Code quality pipeline

The project enforces quality through pre-commit hooks and GitHub Actions:

| Tool | Purpose |
|---|---|
| Black | Code formatting |
| Ruff | Fast linting (E, W, F, I, C, B rules) |
| isort | Import ordering |
| mypy | Static type checking (strict mode) |
| pydocstyle | Docstring validation (Google convention) |
| Bandit | Security static analysis (SAST) |
| Semgrep | Advanced SAST rules |
| pip-audit | Dependency CVE scanning |
| Radon / Xenon | Cyclomatic complexity enforcement |
| jscpd | Code duplication detection |
| detect-secrets | Secret leak prevention |
| CodeQL | GitHub Advanced Security scanning |

Running tests

just test                # All tests
just test -v             # Verbose
just test-coverage       # With HTML coverage report (htmlcov/)

Future Improvements

The current project is a proof of concept. The following areas represent key opportunities for future enhancement:

  • Embeddings Token Limit: The all-MiniLM-L6-v2 model is efficient but capped at 256 tokens. Chunk sizes must be strictly managed to prevent truncation during embedding generation.

  • Real-Time Ingestion: Transitioning from scheduled batch processing to event-driven streaming would enable proactive, real-time responses to new data.

  • Source Citations: Enforcing direct linking to original sources in the LLM prompt would improve response auditability and reduce hallucinations.

  • Reranking Over Filtering: Instead of dropping chunks with missing metadata prior to search, implement a reranking step to preserve potentially relevant documents in the context.

  • Hybrid Retrieval: Combine dense semantic vectors with sparse lexical representations in Qdrant to improve searches for exact terms, acronyms, and ticket IDs.

  • GraphRAG Implementation: Explore GraphRAG to better handle broad queries, relationships between entities, and aggregations that currently overwhelm the standard vector RAG approach.

  • Unsupervised Topic Classification: Automatically categorize queries and document chunks by topic to enhance filtering and reranking accuracy.

  • Testing & Validation: Expand unit test coverage and implement robust validation mechanisms across the codebase.

  • Automated Prompt Optimization: Leverage DSPy to automatically optimize prompts using defined metrics and evaluation samples (a rough sketch follows this list).

  • Proactive Agents: Evolve the assistant from a reactive query-responder to a proactive agent capable of responding to system triggers or scheduled events.

  • Comprehensive Security Analysis: Supplement existing static and dependency scans with a thorough evaluation against the OWASP Top 10 for LLMs and Generative AI.

  • Extended LLM Support: Validate and integrate additional language models once a robust evaluation framework is established.

  • Slack Channel Filtering: Expose the existing internal support for filtering Slack ingestion by specific channel_ids through the application configuration and CLI.

  • Entity Manager Overhaul: Completely review and refactor the cross-source identity resolution architecture.


Contributing

Contributions are welcome! Please read CONTRIBUTING.md before opening a pull request.


License

  • Code is licensed under the Apache License 2.0. See LICENSE.
  • Documentation & content (docs, diagrams, and original images) are licensed under CC BY 4.0. See LICENSE-docs.
  • Third-party materials (dependencies, icons, fonts) may have their own licenses. See THIRD_PARTY.md.
