Lethe

Autonomous executive assistant with persistent memory and a multi-agent architecture.

Lethe is a 24/7 AI assistant that you communicate with via Telegram. It remembers everything — your preferences, your projects, conversations from months ago. The more you use it, the more useful it becomes.

Local-first architecture — runs on your hardware with a local LLM, or with any cloud LLM API.

See CHANGELOG.md for release notes.

Architecture

User (Telegram) <-> Cortex (principal actor, user-facing)
                     │
              Brainstem (supervisor)
                     │
          ┌──────────┼──────────┬──────────┐
          ↓          ↓          ↓          ↓
        DMN       Hippocampus  Subagents   Runtime
     (background) (recall+notes) (workers)  health
          │          │          │
          └──────────┴──────────┘
                     │
                     ↓
              Actor Registry + Event Bus
                     │
                     ↓
               Memory (LanceDB)
               ├── blocks (workspace — persona, user context)
               ├── notes (~/lethe/notes/ — skills, conventions)
               ├── archival (vector + FTS)
               └── messages (conversation history)

Actor Model

Lethe uses a neuroscience-inspired actor system:

Actor	Role
Brainstem	Boot supervisor. Checks resources, releases, sends structured findings to cortex.
Cortex	Principal actor. ONLY actor that talks to the user. Handles quick tasks directly, delegates complex work to subagents.
DMN (Default Mode Network)	Periodic background cognition: scans goals, updates state, writes reflections, escalates insights.
Hippocampus	Autoassociative recall: searches notes, archival memory, and conversation history on each message.
Subagents	Spawned on demand for focused tasks. Report to parent actors. No direct user channel.

Prompt Architecture

System prompt content is split by update lifecycle:

Content	Location	Updates
Persona (identity, character, purpose)	`workspace/memory/identity.md`	User-editable. Never overwritten by updates.
System instructions (action discipline, output format, communication style)	`config/prompts/agent_instructions.md`	Always current after `git pull`.
Tools documentation (available tools, notes tags)	`config/prompts/agent_tools.md`	Always current after `git pull`.
Actor rules (preamble, rules, heartbeat prompts)	`config/prompts/actor_*.md`	Always current after `git pull`.

This ensures updates to system behavior propagate to all users without overwriting their persona customizations.

Quick Start

1. One-Line Install

curl -fsSL https://lethe.gg/install | bash

2. Manual Install

git clone https://github.com/atemerev/lethe.git
cd lethe
uv sync
cp .env.example .env
# Edit .env with your credentials
uv run lethe

3. Update

curl -fsSL https://lethe.gg/update | bash

Running Locally with Gemma 4

Lethe runs well with Google Gemma 4 31B on consumer GPUs via llama.cpp. Tested on 4x RTX 4090 (~51 tok/s).

Prerequisites

4x RTX 4090 (or equivalent ~96GB total VRAM) for Q8_0 quantization
2x RTX 4090 for Q4_K_M quantization
llama.cpp built with CUDA support
Gemma 4 31B GGUF model (e.g. from bartowski)

Build llama.cpp

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --target llama-server -j$(nproc)

Start the server

./build/bin/llama-server \
    --model /path/to/gemma-4-31B-it-Q8_0.gguf \
    --host 0.0.0.0 --port 8090 \
    --n-gpu-layers 999 \
    --split-mode tensor \
    --ctx-size 98304 \
    --flash-attn on \
    --parallel 2 \
    --cache-ram 32768 \
    --slot-save-path /path/to/slots \
    --jinja \
    --reasoning-budget 4096 \
    --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 \
    --metrics \
    -fit off

Key flags:

--split-mode tensor — true tensor parallelism across GPUs (~51 tok/s vs ~25 with layer split). Requires -fit off.
--jinja — required for Gemma 4's native tool calling format (peg-gemma4 parser).
--reasoning-budget 4096 — enables thinking mode for better tool selection accuracy.
--parallel 2 — 2 concurrent slots (cortex + aux). Use 4 only if VRAM allows (~20GB free after model).
--cache-ram 32768 — 32GB prompt cache so different prompts don't evict each other.
--spec-type ngram-mod — lightweight speculative decoding, shared across all slots.

Note: --split-mode tensor does not support KV cache quantization (-ctk/-ctv). Use f16 KV cache (default).

Configure Lethe

# In your .env
LLM_PROVIDER=openai
LLM_MODEL=openai/gemma-4-31B-it-Q8_0.gguf
LLM_MODEL_AUX=openai/gemma-4-31B-it-Q8_0.gguf
LLM_API_BASE=http://localhost:8090/v1
LLM_CONTEXT_LIMIT=96000
OPENAI_API_KEY=local

Performance tips

Tool count matters: Gemma 4 works best with <15 tools. Lethe's two-tier tool system registers 15 core tools, with extended tools available via request_tool().
Thinking improves tool selection: --reasoning-budget 4096 lets the model reason before choosing tools. Costs ~100-500 extra tokens per response but significantly improves tool calling accuracy.
Prompt cache warms over time: The 32GB cache and 4 parallel slots mean each actor's prompt stays warm. First request is slower.
Speculative decoding improves with use: The ngram pool fills as the model generates, benefiting from repeated patterns (tool schemas, JSON structures).

LLM Providers

Provider	Env Variable	Default Model
Local (llama.cpp)	`LLM_API_BASE` + `OPENAI_API_KEY=local`	(your GGUF)
OpenRouter	`OPENROUTER_API_KEY`	`moonshotai/kimi-k2.5-0127`
Anthropic (API key)	`ANTHROPIC_API_KEY`	`claude-opus-4-5-20251101`
Anthropic (subscription)	`ANTHROPIC_AUTH_TOKEN`	`claude-opus-4-5-20251101`
OpenAI	`OPENAI_API_KEY`	`gpt-5.2`

Set LLM_PROVIDER to force a specific provider, or let it auto-detect from available keys.

Multi-model support: Set LLM_MODEL_AUX for a cheaper/faster model used in summarization and hippocampus analysis.

Memory System

Notes (Persistent Knowledge)

Tagged markdown files in ~/lethe/notes/ — the primary store for procedural knowledge:

~/lethe/notes/
├── unige_email_via_graph_api.md   # tags: [skill, email, graph-api]
├── use_uv_not_pip.md              # tags: [convention, python]
└── phd_defense_requirements.md    # tags: [education, PhD]

Skills: procedures for external systems (APIs, services, auth flows)
Conventions: how things should be done (user preferences, toolchain choices)
Searched by hippocampus during recall and via note_search tool
Auto-extracted from archival memory by the memory organizer on startup

Memory Blocks (Core Memory)

Always in context. Stored in workspace/memory/:

identity.md — Agent persona (user-customizable)
human.md — What the agent knows about you
project.md — Current project context (agent updates this)

Archival Memory

Long-term semantic storage with hybrid search (vector + full-text). The memory organizer runs on startup to extract valuable entries into notes and clean out noise.

Message History

Full conversation history stored locally. Searchable via conversation_search tool. Hippocampus searches this during recall.

Tools

Two-Tier Tool System

Gemma 4 works best with fewer tools. Lethe registers ~15 core tools with full schemas, with additional tools available on demand via request_tool().

Core tools (always available): bash, read_file, write_file, edit_file, note_search, note_create, note_list, telegram_send_message, telegram_react, conversation_search, spawn_actor, send_message, discover_actors, kill_actor, request_tool

Extended tools (via request_tool("name")): list_directory, grep_search, web_search, fetch_webpage, memory_read, memory_update, memory_append, archival_search, archival_insert, browser_open, browser_snapshot, browser_click, browser_fill, telegram_send_file, and more.

Web Search

Web search uses Exa with a subagent synthesis pattern — raw results are processed in a separate LLM call and only a concise summary enters the conversation context, preserving window space for conversation history.

Hippocampus (Autoassociative Memory)

On each message, the hippocampus automatically searches for relevant context:

LLM decides whether recall would help (skips greetings, simple questions)
Generates concise 2-5 word search queries
Searches notes first (pre-distilled, highest signal)
Searches archival memory and past conversations
Filters for relevance (LLM-based)
Summarizes and reviews for stale state before injection

Disable with HIPPOCAMPUS_ENABLED=false.

Configuration

Environment Variables

Variable	Description	Default
`TELEGRAM_BOT_TOKEN`	Bot token from BotFather	(required)
`TELEGRAM_ALLOWED_USER_IDS`	Comma-separated user IDs	(required)
`LLM_PROVIDER`	Force provider (`openrouter`, `anthropic`, `openai`)	(auto-detect)
`LLM_MODEL`	Main model	(provider default)
`LLM_MODEL_AUX`	Aux model for summarization/analysis	(same as main)
`LLM_API_BASE`	Custom API URL (for local llama.cpp)	(none)
`LLM_CONTEXT_LIMIT`	Context window size	`128000`
`EXA_API_KEY`	Exa web search API key	(optional)
`HIPPOCAMPUS_ENABLED`	Enable memory recall	`true`
`ACTORS_ENABLED`	Enable actor model	`true`
`HEARTBEAT_INTERVAL`	Main heartbeat interval (seconds)	`900`
`LETHE_CONSOLE`	Enable web console	`false`

Persona Configuration

Edit workspace/memory/identity.md to customize the agent's personality, purpose, and background. This file is never overwritten by updates.

System instructions (communication style, action discipline, output format) are in config/prompts/agent_instructions.md — edit if you need different behavior rules.

Run as Service

mkdir -p ~/.config/systemd/user
cat > ~/.config/systemd/user/lethe.service << EOF
[Unit]
Description=Lethe Autonomous AI Agent
After=network.target

[Service]
Type=simple
WorkingDirectory=$(pwd)
ExecStart=$(which uv) run lethe
Restart=on-failure
RestartSec=10

[Install]
WantedBy=default.target
EOF

systemctl --user daemon-reload
systemctl --user enable --now lethe

Development

uv run pytest
uv run pytest tests/test_notes.py -v

Project Structure

Source in src/lethe/: actor/ (actor model), agent/ (init + tools), memory/ (LanceDB, notes, hippocampus, organizer, LLM client), tools/ (bash, files, web, browser, notes), telegram/, main.py.

Config: config/blocks/ (persona seeds, copied to workspace once), config/prompts/ (system instructions, always loaded fresh — updates propagate).

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 641 Commits
config		config
examples		examples
gateway		gateway
scripts		scripts
src/lethe		src/lethe
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
Dockerfile		Dockerfile
README.md		README.md
docker-compose.gateway.yml		docker-compose.gateway.yml
docker-compose.steel.yml		docker-compose.steel.yml
entrypoint.sh		entrypoint.sh
install.sh		install.sh
pyproject.toml		pyproject.toml
resolve_session.session		resolve_session.session
uninstall.sh		uninstall.sh
update.sh		update.sh
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Lethe

Architecture

Actor Model

Prompt Architecture

Quick Start

1. One-Line Install

2. Manual Install

3. Update

Running Locally with Gemma 4

Prerequisites

Build llama.cpp

Start the server

Configure Lethe

Performance tips

LLM Providers

Memory System

Notes (Persistent Knowledge)

Memory Blocks (Core Memory)

Archival Memory

Message History

Tools

Two-Tier Tool System

Web Search

Hippocampus (Autoassociative Memory)

Configuration

Environment Variables

Persona Configuration

Run as Service

Development

Project Structure

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 49

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages