A TypeScript implementation of an adaptive memory compression system for Large Language Models (LLMs). This system automatically compresses conversation history to keep memory under a token limit without hurting answer quality.
Problem: Conversational agents store every turn in their memory. After prolonged use, this storage can grow past hundreds of thousands of tokens, causing:
- Slow retrieval
- Decreased search quality
- LLM context window overflow
- Higher operational costs
Goal: Build an automatic compression layer that keeps memory under a configurable token limit without hurting answer quality. The layer intelligently decides when to compress, what to keep verbatim, and how to summarize content.
Success Metrics:
- ≥ 60% token-count reduction
- ≤ 5% drop in retrieval F1 on held-out QA benchmarks
If you want to quickly see the memory compressor in action:

```bash
# Clone the repository
git clone https://github.com/berkdurmus/adaptive-llm-memory-compressor.git
cd adaptive-llm-memory-compressor

# Install dependencies
npm install

# Create a .env file with your OpenAI API key
echo "OPENAI_API_KEY=your_api_key_here" > .env

# Run the demo with the sample conversation
npm run demo
```

This will run a demonstration that:
- Loads a sample conversation
- Adds all messages to memory
- Triggers compression
- Shows before/after token counts
- Tests retrieval with sample queries
```
┌───────── user/agent turns ─────────┐
│                                    │
│  1. Ingestion                      │
│     • normalize message            │
│     • store raw text + embeddings  │
│                                    │
│  2. Adaptive Compression Job       │  (runs async)
│     • trigger = token_count > T    │
│     • select least-salient blocks  │
│     • LLM summarizes ↓             │
│     • replace with {summary, vec}  │
│                                    │
│  3. Retrieval                      │
│     • hybrid search (BM25 + vec)   │
│     • returns raw or summary text  │
└────────────────────────────────────┘
```
- PostgreSQL for metadata (id, timestamp, role, tokenCount, salienceScore, isSummary)
- ChromaDB for vector embeddings
- Fastify server providing REST endpoints
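The PostgreSQL metadata fields listed above can be mirrored as a TypeScript type. This is a sketch: the field names come from the list above, but the exact column types are assumptions.

```typescript
// Sketch of one metadata row stored in PostgreSQL.
// Field names match the README; types are assumed.
interface MemoryRow {
  id: string;
  timestamp: number;      // epoch milliseconds (assumed representation)
  role: "user" | "agent";
  tokenCount: number;
  salienceScore: number;
  isSummary: boolean;     // true for rows produced by compression
}

const row: MemoryRow = {
  id: "msg-001",
  timestamp: Date.now(),
  role: "user",
  tokenCount: 42,
  salienceScore: 0.73,
  isSummary: false,
};
```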
The compression algorithm works as follows:

1. **Rank**: Compute salience scores for messages based on:
   - Cosine similarity to recent queries
   - Recency factor `1/(1+ageDays)`
2. **Select**: Pick the oldest 20% of messages with salience scores below a threshold.
3. **Merge**: Chunk selected messages into blocks of 1–2k tokens.
4. **Summarize**: Use an LLM to summarize each block with a prompt that emphasizes preserving facts, names, and numbers.
5. **Replace**: Delete the original messages and insert a summary row (flagged `isSummary = true`).
6. **Audit**: Keep a hash of the deleted text for verification and potential rollback.
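The Rank and Select steps can be sketched in TypeScript. Note the hedges: the README names similarity and recency as the salience inputs but not how they are combined, so the product below is an assumption, as are the helper names.

```typescript
interface ScoredMessage {
  id: string;
  ageDays: number;     // age of the message in days
  similarity: number;  // precomputed cosine similarity to recent queries
}

// Salience combines query similarity with the recency factor 1/(1+ageDays).
// Multiplying the two is an assumption; the README does not specify the formula.
function salience(m: ScoredMessage): number {
  return m.similarity * (1 / (1 + m.ageDays));
}

// Select step: take the oldest 20% of messages, then keep only those
// whose salience falls below the threshold.
function selectForCompression(
  msgs: ScoredMessage[],
  threshold: number
): ScoredMessage[] {
  const oldestFirst = [...msgs].sort((a, b) => b.ageDays - a.ageDays);
  const window = oldestFirst.slice(0, Math.ceil(msgs.length * 0.2));
  return window.filter((m) => salience(m) < threshold);
}

const msgs: ScoredMessage[] = [
  { id: "a", ageDays: 10, similarity: 0.9 },
  { id: "b", ageDays: 5, similarity: 0.1 },
  { id: "c", ageDays: 0, similarity: 0.8 },
];
const picked = selectForCompression(msgs, 0.1); // picks only message "a"
```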
Prerequisites:
- Node.js 18+
- PostgreSQL
- ChromaDB (local or remote instance)
- OpenAI API key
1. Clone the repository:

   ```bash
   git clone https://github.com/berkdurmus/adaptive-llm-memory-compressor.git
   cd adaptive-llm-memory-compressor
   ```

2. Install dependencies:

   ```bash
   npm install
   ```

3. Create a `.env` file with your configuration:

   ```
   OPENAI_API_KEY=your_openai_api_key
   DATABASE_URL=postgres://user:password@localhost:5432/memory_compressor
   CHROMA_HOST=localhost
   CHROMA_PORT=8000
   MAX_MEMORY_TOKENS=10000
   COMPRESSION_THRESHOLD=8000
   COMPRESSION_TARGET=5000
   SUMMARY_MODEL=gpt-4o-mini
   EMBEDDING_MODEL=text-embedding-3-small
   ```

4. Build the project:

   ```bash
   npm run build
   ```
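The compression bounds in the configuration (`COMPRESSION_THRESHOLD` and `COMPRESSION_TARGET`) can be read from the environment like this. This is a sketch: the default values mirror the sample `.env` above, and the variable handling is an assumption.

```typescript
// Read compression settings from the environment, falling back to the
// defaults shown in the sample .env (assumed behavior, not the project's code).
const config = {
  maxMemoryTokens: Number(process.env.MAX_MEMORY_TOKENS ?? 10000),
  compressionThreshold: Number(process.env.COMPRESSION_THRESHOLD ?? 8000),
  compressionTarget: Number(process.env.COMPRESSION_TARGET ?? 5000),
};

// Compression fires once memory grows past the threshold; the job then
// summarizes until the total is back near the target.
function shouldCompress(totalTokens: number): boolean {
  return totalTokens > config.compressionThreshold;
}
```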
Start the API server:

```bash
npm start
```

Run the demo with a sample conversation:

```bash
npm run demo
```

You can also provide your own conversation file:

```bash
npm run demo -- path/to/your/conversation.json
```

- POST /initialize - Initialize the memory manager
- POST /message - Add a new message to memory
- POST /retrieve - Retrieve messages based on a query
- POST /conversation - Get all messages for a conversation
- POST /compress - Force compression of a conversation
- GET /health - Health check endpoint
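A client can call these endpoints with Node 18's built-in `fetch`. The payload field names below (`conversationId`, `role`, `content`) are assumptions for illustration; check the server code for the actual request schema.

```typescript
// Hypothetical request body for POST /message (field names assumed).
function buildMessagePayload(
  conversationId: string,
  role: "user" | "agent",
  content: string
) {
  return { conversationId, role, content };
}

const payload = buildMessagePayload("demo-1", "user", "What is the token limit?");

// Send it to a locally running server (port assumed):
// await fetch("http://localhost:3000/message", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(payload),
// });
```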
The project includes an evaluation script that measures:
- Token reduction percentage
- F1 score before and after compression
- Retrieval latency
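The first two metrics reduce to simple formulas, sketched here (function names are illustrative, not the evaluation script's API):

```typescript
// Percentage of tokens removed by compression.
function tokenReductionPct(tokensBefore: number, tokensAfter: number): number {
  return ((tokensBefore - tokensAfter) / tokensBefore) * 100;
}

// F1 score from precision and recall, used to compare retrieval quality
// before and after compression.
function f1(precision: number, recall: number): number {
  if (precision + recall === 0) return 0;
  return (2 * precision * recall) / (precision + recall);
}
```

For example, shrinking 10,000 tokens to 4,000 is a 60% reduction, meeting the ≥ 60% success metric above.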
Run the evaluation:
```bash
npm run evaluate
```

This project is licensed under the ISC License.
- Inspired by research on long-term memory in LLM applications
- Uses OpenAI's embedding and completion APIs