Dynamic episodic memory for LLMs. Not a search index — a brain.
"Every memory solution right now is just an MCP server — a thin wrapper around vector search. Store some chunks, cosine similarity, return top-k. That's a search index pretending to be a brain."
Flashback is a self-contained memory microservice that gives any LLM genuine episodic memory: dynamic retrieval that updates within a conversation, temporal history that preserves how beliefs evolved, and a four-tier memory hierarchy modeled on how human cognition actually works.
It speaks REST. It plugs into any stack. The complexity is inside.
Current AI memory systems are all variations on the same pattern: chunk some text, embed it, store the vector, return top-k at query time. It works. It's also the wrong model.
Static RAG treats memory as a frozen index. The system retrieves the same way on your first message as on your tenth — oblivious to everything you established in between. By turn five of a conversation, a human has already filed away your preferences, made inferences, updated their mental model. A static retriever is still looking at the same snapshot from before you said hello.
Consider how a real conversation evolves:
- "Help me debug this issue."
- "I'm using the Acme framework — here's the error."
- "Actually, this is the same pattern as last week's problem with the database layer."
- "Right, and that was caused by the migration we ran in February."
At message 4, a static RAG system is still retrieving from an index that was built before this conversation started. It doesn't know messages 2 and 3 exist as context. It can't connect "the migration" to the thread you just established — unless that context happened to be in a prior session that was already indexed.
A human with good memory would have been updating their mental model continuously. By message 3, they've already cross-referenced last week's problem. By message 4, they're reaching back further and surfacing the root cause.
"The memory isn't dynamic or grows as you converse more — that's what we need to add to make every LLM technically a dynamic RAG... you start generic in one thread then as memories pop up you tailor"
The fix is to close the loop within the conversation. After each exchange, new context is ingested immediately — so the next query runs against an index that already knows what was just established.
"Human speech is dynamic RAG"
The flow isn't:
Query memory → Talk → Ingest after conversation ends
It's:
User says something
→ Query memory
→ LLM responds
→ Ingest that exchange immediately
→ Next message retrieves against an updated index
The context window is alive. Not a snapshot.
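A minimal sketch of that loop, with a toy in-memory store standing in for the real service (`ToyMemory`, `converse`, and the echo model below are all hypothetical stand-ins, not Flashback's API):

```python
class ToyMemory:
    """Toy stand-in for a dynamic memory service: the index grows
    within the conversation, not after it ends."""
    def __init__(self):
        self.index = []

    def assemble(self, query):
        # Naive retrieval: any stored exchange sharing a word with the query.
        words = set(query.lower().split())
        return [e for e in self.index if words & set(e.lower().split())]

    def ingest(self, user_turn, assistant_turn):
        self.index.append(f"{user_turn} | {assistant_turn}")

def converse(turns, memory, llm):
    replies = []
    for user_msg in turns:
        context = memory.assemble(user_msg)  # query memory first
        reply = llm(context, user_msg)       # respond with context injected
        memory.ingest(user_msg, reply)       # ingest immediately, closing the loop
        replies.append(reply)
    return replies

mem = ToyMemory()
echo = lambda ctx, msg: f"({len(ctx)} memories) {msg}"
out = converse(["debug the login issue", "that login issue is back"], mem, echo)
print(out)  # turn 2 already retrieves turn 1
```

The second turn retrieves against an index that already contains the first exchange, which is exactly what a static, index-once retriever cannot do.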
"We actively modify our context based upon what we are currently working on... with AI there's multiple parts that have to work together"
Human cognition and computer architecture solved the same problem independently. Both landed on the same answer: a tiered hierarchy with distinct latency, capacity, and persistence tradeoffs at each level.
"There's long term memory... there's short term memory... there's 'subconscious' memory... even computers — short term (working memory), long term (hard drive) and fast access (instinct, cache) and subconscious (ROMs... built-in things that are just known)"
Flashback maps this hierarchy directly onto its architecture:
```
┌───────────────────────────────────────────────────────────────────┐
│ TIER 1 — ROM / Subconscious                                       │
│ The LLM's training weights. Language, world knowledge, reasoning. │
│ Baked in. We don't store this — we build on top of it.            │
├───────────────────────────────────────────────────────────────────┤
│ TIER 2 — Cache / Instinct                                         │
│ Core memory. Always injected. User prefs, active project,         │
│ behavioral rules. Zero retrieval cost. Always in context.         │
├───────────────────────────────────────────────────────────────────┤
│ TIER 3 — RAM / Working Memory                                     │
│ Active context with TTL. In-progress tasks, recent turns,         │
│ things mentioned this session. High relevance, fast decay.        │
├───────────────────────────────────────────────────────────────────┤
│ TIER 4 — Disk / Long-Term Memory                                  │
│ Episodic + semantic store. Facts, decisions, project history,     │
│ supersede chains. Searched via hybrid retrieval on demand.        │
└───────────────────────────────────────────────────────────────────┘
```
Most memory systems today collapse this entire hierarchy into a single tier: the disk. There's no cache. There's no RAM. There's no concept of what's immediately relevant versus what might be dug up later. Everything is equally distant.
The subtler problem: most memory systems treat storage as a database of current truths. Old facts get deleted when new ones supersede them.
But human memory isn't a current-facts database. It's a web of episodes. You don't just know that you use a particular tool — you remember the moment you chose it, the alternatives you considered, the pivot you made six months in. Dead ends matter. Pivots matter. The decision trail is part of the memory.
"The evolution of an idea is not simply a replace — there's a temporal component"
Flashback never deletes. Old records are marked superseded; new records carry a pointer back. A query for "what's current" returns the latest node. A query for "how did this evolve" walks the chain. The narrative is preserved.
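The supersede chain can be sketched in a few lines; the field names (`fact`, `supersedes`) and the in-memory dict below are assumptions for illustration, not Flashback's actual schema:

```python
# Each new record points back at the one it replaces; old records
# are marked superseded, never deleted.
records = {
    1: {"fact": "deploy target: staging",    "supersedes": None},
    2: {"fact": "deploy target: canary",     "supersedes": 1},
    3: {"fact": "deploy target: production", "supersedes": 2},
}

def current():
    """'What's current': the one record nothing else supersedes."""
    superseded = {r["supersedes"] for r in records.values()}
    return next(i for i in records if i not in superseded)

def lineage(leaf_id):
    """'How did this evolve': walk the chain back, return oldest first."""
    chain = []
    node = leaf_id
    while node is not None:
        chain.append(records[node]["fact"])
        node = records[node]["supersedes"]
    return chain[::-1]

print(records[current()]["fact"])  # deploy target: production
print(lineage(current()))          # staging -> canary -> production
```

The same two queries map onto the adjacency table with recursive CTEs in the real store: a leaf lookup for "current", a chain walk for "evolution".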
- Episodic memory — conversation snapshots with entities, timestamps, and session context, not just extracted facts
- Supersede-not-delete — full temporal history of how beliefs evolved, queryable at any point in time
- Five memory types — episodic, semantic, working, document, and procedural, each with distinct decay behavior
- Hybrid retrieval — weighted combination of semantic similarity, BM25 keyword match, recency, importance, project context, and entity overlap
- Two retrieval modes — answer mode for questions, manager mode for reconstructing "what's going on" at session start
- Consolidation pipeline — scheduled promotion from working memory to long-term; episodic records distilled into semantic facts over time
- Decay model — retrieval scores degrade by half-life (not deletion); pinned memories never decay
- Automatic task extraction — candidate tasks identified from conversation text without explicit commands
- 3D memory visualization — UMAP-projected interactive graph with time scrubbing, type filtering, and supersede chain highlighting
- Self-contained — REST API, no external dependencies beyond PostgreSQL and the Python sidecar
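The decay model in the list above can be made concrete with a small sketch. The constants here are illustrative, not Flashback's shipped values, and the working tier's 48h TTL is treated as a half-life purely for demonstration:

```python
# Retrieval scores degrade by half-life; nothing is deleted,
# and pinned memories are exempt from decay entirely.
HALF_LIFE_HOURS = {
    "working": 48,          # fast decay (illustrative)
    "episodic": 14 * 24,    # medium
    "semantic": 90 * 24,    # slow
    "procedural": 90 * 24,  # slow
}

def decayed_score(base_score, age_hours, memory_type, pinned=False):
    """Degrade a retrieval score by half-life; pinned memories never decay."""
    if pinned:
        return base_score
    return base_score * 0.5 ** (age_hours / HALF_LIFE_HOURS[memory_type])

# A 48-hour-old working memory scores exactly half; pinning exempts it.
print(decayed_score(1.0, 48, "working"))               # 0.5
print(decayed_score(1.0, 48, "working", pinned=True))  # 1.0
```

Because decay only scales the retrieval score, an old memory can still surface when other signals (entity overlap, importance) are strong enough, which a hard TTL deletion would make impossible.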
```
┌──────────────────────────────────────┐
│ Rust / Axum (port 8080)              │
│ Core API — memory CRUD, hybrid       │
│ retrieval, context assembly,         │
│ consolidation, visualization API     │
└──────────────┬───────────────────────┘
               │
               ▼
┌──────────────────────────────────────┐
│ Python Sidecar (port 8081)           │
│ Embedding, NER, LLM summarization    │
│ for consolidation + extraction       │
└──────────────┬───────────────────────┘
               │
               ▼
┌──────────────────────────────────────┐
│ PostgreSQL + pgvector                │
│ Primary store — vector similarity,   │
│ BM25 full-text, temporal graph via   │
│ adjacency table + recursive CTEs     │
└──────────────────────────────────────┘
```
Stack: Rust (Axum, SQLx, Tokio) · Python (FastAPI) · PostgreSQL 16 + pgvector · Docker Compose
```sh
git clone https://github.com/your-org/flashback
cd flashback

# Set your embedding API key
export OPENAI_API_KEY=sk-...

# Start everything
docker compose up
```

The API is live at http://localhost:8080, PostgreSQL at 5432, and the Python sidecar at 8081.
Run migrations on first boot:

```sh
docker compose exec server flashback migrate
```

Flashback is designed around a two-call pattern that wraps your existing LLM calls.
```
POST /context/assemble
{
  "session_id": "sess_abc123",
  "user_id": "user_xyz",
  "project_id": "proj_flashback",
  "query": "what's the current deploy target?"
}
```

Returns fully assembled layered context — procedural patterns, project state, retrieved memories, relevant document chunks, recent conversation — ready to inject into your system prompt. Not top-k chunks. A structured prompt layer.
```
POST /memory/ingest
{
  "session_id": "sess_abc123",
  "user_id": "user_xyz",
  "project_id": "proj_flashback",
  "user_turn": "what's the current deploy target?",
  "assistant_turn": "The current deploy target is production..."
}
```

That's it. The exchange is chunked, embedded, entity-extracted, and written to working memory. The next /context/assemble call retrieves against an index that already includes this exchange. The loop is closed.
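Assuming an HTTP helper of your choice, the two-call pattern wraps an existing LLM call like this. `chat_with_memory` and the injected `post` function are illustrative; only the endpoint paths and payload fields come from the examples above:

```python
def chat_with_memory(post, llm, session_id, user_id, project_id, user_msg):
    """Two-call wrapper: /context/assemble before the model call,
    /memory/ingest right after. `post(path, payload)` is an injected
    HTTP helper (requests, httpx, urllib, ...) so the sketch stays
    transport-agnostic and runs without a live server."""
    ids = {"session_id": session_id, "user_id": user_id, "project_id": project_id}
    context = post("/context/assemble", {**ids, "query": user_msg})
    reply = llm(context, user_msg)
    post("/memory/ingest", {**ids, "user_turn": user_msg, "assistant_turn": reply})
    return reply

# Demo with a fake transport that records calls instead of hitting the API.
calls = []
def fake_post(path, payload):
    calls.append((path, payload))
    return {}

reply = chat_with_memory(fake_post, lambda ctx, msg: "ack: " + msg,
                         "sess_abc123", "user_xyz", "proj_flashback",
                         "what's the current deploy target?")
print([path for path, _ in calls])  # ['/context/assemble', '/memory/ingest']
```

Injecting the transport keeps the memory loop testable; in production you would point `post` at http://localhost:8080 and pass your real model call as `llm`.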
```
# Pin a fact that appears in every prompt — no retrieval required
POST /memory/core
{
  "content": "Always use TypeScript. Never suggest full rewrites.",
  "user_id": "user_xyz"
}
```

```
POST /memory/search
{
  "query": "auth middleware changes",
  "mode": "answer",
  "project_id": "proj_flashback",
  "top_k": 10
}
```

```
# Walk the supersede chain for any entity — see how a belief evolved
GET /lineage/auth_middleware
```

| Type | Analogy | Decay | Description |
|---|---|---|---|
| `working` | RAM | Fast (48h TTL) | Active session context; promotes to episodic on close |
| `episodic` | Short-term recall | Medium (14d) | Conversation snapshots; raw material for consolidation |
| `semantic` | Long-term facts | Slow (90d) | Distilled beliefs extracted from multiple episodes |
| `document` | Reference shelf | Slow / versioned | Ingested files, chunked and re-indexed on change |
| `procedural` | Muscle memory | Slow (90d) | Learned workflows extracted from repeated patterns |
Mid-conversation, something triggers a memory. The system surfaces it — not because you queried for it explicitly, but because the current context activated it. A flashback.
That's the experience the system is trying to produce. Not "here are the top 5 relevant documents." Not "here's what you told me to remember." But: the right memory, at the right moment, because the context demanded it.
Human conversations aren't transactions. They're journeys. Flashback is memory that travels with you.
- docs/VISION.md — the dynamic RAG thesis; why current memory is broken and what fixing it looks like
- docs/ARCHITECTURE.md — full system design: memory hierarchy, five types, decay model, consolidation pipeline, hybrid retrieval, supersede chains, 3D visualization, database schema
Early development / alpha. The architecture is fully specified; core infrastructure is being built out. Not production-ready. API surfaces may change.
If the thesis resonates, watch the repo. Contributions and feedback welcome — open an issue.
Apache 2.0 — see LICENSE.