A TypeScript implementation of an adaptive memory compression system for Large Language Models (LLMs). This system automatically compresses conversation history to keep memory under a token limit without hurting answer quality.
Problem: Conversational agents store every turn in their memory. After prolonged use, this storage can grow past hundreds of thousands of tokens, causing:
- Slow retrieval
- Decreased search quality
- LLM context window overflow
- Higher operational costs
Goal: Build an automatic compression layer that keeps memory under a configurable token limit without hurting answer quality. The layer intelligently decides when to compress, what to keep verbatim, and how to summarize content.
Success Metrics:
- ≥ 60% token-count reduction
- ≤ 5% drop in retrieval F1 on held-out QA benchmarks
If you want to quickly see the memory compressor in action:

```bash
# Clone the repository
git clone https://github.com/berkdurmus/adaptive-llm-memory-compressor.git
cd adaptive-llm-memory-compressor

# Install dependencies
npm install

# Create a .env file with your OpenAI API key
echo "OPENAI_API_KEY=your_api_key_here" > .env

# Run the demo with the sample conversation
npm run demo
```

This will run a demonstration that:
- Loads a sample conversation
- Adds all messages to memory
- Triggers compression
- Shows before/after token counts
- Tests retrieval with sample queries
```
┌───────── user/agent turns ─────────┐
│                                    │
│  1. Ingestion                      │
│     • normalize message            │
│     • store raw text + embeddings  │
│                                    │
│  2. Adaptive Compression Job       │  (runs async)
│     • trigger = token_count > T    │
│     • select least-salient blocks  │
│     • LLM summarizes ↓             │
│     • replace with {summary, vec}  │
│                                    │
│  3. Retrieval                      │
│     • hybrid search (BM25 + vec)   │
│     • returns raw or summary text  │
└────────────────────────────────────┘
```
- PostgreSQL for metadata (id, timestamp, role, tokenCount, salienceScore, isSummary)
- ChromaDB for vector embeddings
- Fastify server providing REST endpoints
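The PostgreSQL metadata fields listed above can be mirrored as a TypeScript type. This is a sketch: the field names come from the list above, but the exact column types are assumptions.

```typescript
// Sketch of one metadata row stored in PostgreSQL.
// Field names match the README; types are assumed.
interface MemoryRow {
  id: string;
  timestamp: number;      // epoch milliseconds (assumed representation)
  role: "user" | "agent";
  tokenCount: number;
  salienceScore: number;
  isSummary: boolean;     // true for rows produced by compression
}

const row: MemoryRow = {
  id: "msg-001",
  timestamp: Date.now(),
  role: "user",
  tokenCount: 42,
  salienceScore: 0.73,
  isSummary: false,
};
```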
The compression algorithm works as follows:

1. **Rank**: Compute salience scores for messages based on:
   - Cosine similarity to recent queries
   - Recency factor `1/(1+ageDays)`
2. **Select**: Pick the oldest 20% of messages with salience scores below a threshold.
3. **Merge**: Chunk selected messages into blocks of 1–2k tokens.
4. **Summarize**: Use an LLM to summarize each block with a prompt that emphasizes preserving facts, names, and numbers.
5. **Replace**: Delete the original messages and insert a summary row (flagged `isSummary = true`).
6. **Audit**: Keep a hash of the deleted text for verification and potential rollback.
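The Rank and Select steps can be sketched in TypeScript. Note the hedges: the README names similarity and recency as the salience inputs but not how they are combined, so the product below is an assumption, as are the helper names.

```typescript
interface ScoredMessage {
  id: string;
  ageDays: number;     // age of the message in days
  similarity: number;  // precomputed cosine similarity to recent queries
}

// Salience combines query similarity with the recency factor 1/(1+ageDays).
// Multiplying the two is an assumption; the README does not specify the formula.
function salience(m: ScoredMessage): number {
  return m.similarity * (1 / (1 + m.ageDays));
}

// Select step: take the oldest 20% of messages, then keep only those
// whose salience falls below the threshold.
function selectForCompression(
  msgs: ScoredMessage[],
  threshold: number
): ScoredMessage[] {
  const oldestFirst = [...msgs].sort((a, b) => b.ageDays - a.ageDays);
  const window = oldestFirst.slice(0, Math.ceil(msgs.length * 0.2));
  return window.filter((m) => salience(m) < threshold);
}

const msgs: ScoredMessage[] = [
  { id: "a", ageDays: 10, similarity: 0.9 },
  { id: "b", ageDays: 5, similarity: 0.1 },
  { id: "c", ageDays: 0, similarity: 0.8 },
];
const picked = selectForCompression(msgs, 0.1); // picks only message "a"
```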
Prerequisites:
- Node.js 18+
- PostgreSQL
- ChromaDB (local or remote instance)
- OpenAI API key
1. Clone the repository:

   ```bash
   git clone https://github.com/berkdurmus/adaptive-llm-memory-compressor.git
   cd adaptive-llm-memory-compressor
   ```

2. Install dependencies:

   ```bash
   npm install
   ```

3. Create a `.env` file with your configuration:

   ```
   OPENAI_API_KEY=your_openai_api_key
   DATABASE_URL=postgres://user:password@localhost:5432/memory_compressor
   CHROMA_HOST=localhost
   CHROMA_PORT=8000
   MAX_MEMORY_TOKENS=10000
   COMPRESSION_THRESHOLD=8000
   COMPRESSION_TARGET=5000
   SUMMARY_MODEL=gpt-4o-mini
   EMBEDDING_MODEL=text-embedding-3-small
   ```

4. Build the project:

   ```bash
   npm run build
   ```
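The compression bounds in the configuration (`COMPRESSION_THRESHOLD` and `COMPRESSION_TARGET`) can be read from the environment like this. This is a sketch: the default values mirror the sample `.env` above, and the variable handling is an assumption.

```typescript
// Read compression settings from the environment, falling back to the
// defaults shown in the sample .env (assumed behavior, not the project's code).
const config = {
  maxMemoryTokens: Number(process.env.MAX_MEMORY_TOKENS ?? 10000),
  compressionThreshold: Number(process.env.COMPRESSION_THRESHOLD ?? 8000),
  compressionTarget: Number(process.env.COMPRESSION_TARGET ?? 5000),
};

// Compression fires once memory grows past the threshold; the job then
// summarizes until the total is back near the target.
function shouldCompress(totalTokens: number): boolean {
  return totalTokens > config.compressionThreshold;
}
```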
Start the API server:

```bash
npm start
```

Run the demo with a sample conversation:

```bash
npm run demo
```

You can also provide your own conversation file:

```bash
npm run demo -- path/to/your/conversation.json
```

- POST /initialize - Initialize the memory manager
- POST /message - Add a new message to memory
- POST /retrieve - Retrieve messages based on a query
- POST /conversation - Get all messages for a conversation
- POST /compress - Force compression of a conversation
- GET /health - Health check endpoint
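A client can call these endpoints with Node 18's built-in `fetch`. The payload field names below (`conversationId`, `role`, `content`) are assumptions for illustration; check the server code for the actual request schema.

```typescript
// Hypothetical request body for POST /message (field names assumed).
function buildMessagePayload(
  conversationId: string,
  role: "user" | "agent",
  content: string
) {
  return { conversationId, role, content };
}

const payload = buildMessagePayload("demo-1", "user", "What is the token limit?");

// Send it to a locally running server (port assumed):
// await fetch("http://localhost:3000/message", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(payload),
// });
```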
The project includes an evaluation script that measures:
- Token reduction percentage
- F1 score before and after compression
- Retrieval latency
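The first two metrics reduce to simple formulas, sketched here (function names are illustrative, not the evaluation script's API):

```typescript
// Percentage of tokens removed by compression.
function tokenReductionPct(tokensBefore: number, tokensAfter: number): number {
  return ((tokensBefore - tokensAfter) / tokensBefore) * 100;
}

// F1 score from precision and recall, used to compare retrieval quality
// before and after compression.
function f1(precision: number, recall: number): number {
  if (precision + recall === 0) return 0;
  return (2 * precision * recall) / (precision + recall);
}
```

For example, shrinking 10,000 tokens to 4,000 is a 60% reduction, meeting the ≥ 60% success metric above.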
Run the evaluation:
```bash
npm run evaluate
```

This project is licensed under the ISC License.
- Inspired by research on long-term memory in LLM applications
- Uses OpenAI's embedding and completion APIs