Adaptive LLM Memory Compressor

A TypeScript implementation of an adaptive memory compression system for Large Language Models (LLMs). This system automatically compresses conversation history to keep memory under a token limit without hurting answer quality.

Problem & Goal

Problem: Conversational agents store every turn in their memory. After prolonged use, this storage can grow past hundreds of thousands of tokens, causing:

  • Slow retrieval
  • Decreased search quality
  • LLM context window overflow
  • Higher operational costs

Goal: Build an automatic compression layer that keeps memory under a configurable token limit without hurting answer quality. The layer intelligently decides when to compress, what to keep verbatim, and how to summarize content.

Success Metrics:

  • ≥ 60% token-count reduction
  • ≤ 5% drop in retrieval F1 on held-out QA benchmarks

Quick Start

If you want to quickly see the memory compressor in action:

# Clone the repository
git clone https://github.com/berkdurmus/adaptive-llm-memory-compressor.git
cd adaptive-llm-memory-compressor

# Install dependencies
npm install

# Create a .env file with your OpenAI API key
echo "OPENAI_API_KEY=your_api_key_here" > .env

# Run the demo with the sample conversation
npm run demo

This will run a demonstration that:

  1. Loads a sample conversation
  2. Adds all messages to memory
  3. Triggers compression
  4. Shows before/after token counts
  5. Tests retrieval with sample queries

Architecture

┌───────── user/agent turns ─────────┐
│                                    │
│ 1. Ingestion                       │
│    • normalize message             │
│    • store raw text + embeddings   │
│                                    │
│ 2. Adaptive Compression Job        │  (runs async)
│    • trigger = token_count > T     │
│    • select least-salient blocks   │
│    • LLM summarizes ↓              │
│    • replace with {summary, vec}   │
│                                    │
│ 3. Retrieval                       │
│    • hybrid search (BM25 + vec)    │
│    • returns raw or summary text   │
└────────────────────────────────────┘
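The hybrid search in stage 3 can be sketched as a weighted blend of a lexical (BM25-style) score and vector cosine similarity. This is an illustrative sketch only; the function names, the normalization, and the `alpha` weight are assumptions, not the repository's actual API.

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Hypothetical hybrid ranking: squash the unbounded BM25 score into
// [0, 1) so it is comparable to cosine similarity, then mix with alpha.
function hybridScore(bm25: number, cosine: number, alpha = 0.5): number {
  const lexical = bm25 / (bm25 + 1);
  return alpha * lexical + (1 - alpha) * cosine;
}
```

With `alpha = 0.5` both signals contribute equally; retrieval can return either raw messages or summary rows ranked by this score.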

Storage

  • PostgreSQL for metadata (id, timestamp, role, tokenCount, salienceScore, isSummary)
  • ChromaDB for vector embeddings
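The metadata columns listed above map naturally onto a row type like the following. This is a hypothetical shape mirroring the listed fields, not the project's actual schema definition.

```typescript
// Hypothetical row type for the PostgreSQL metadata table described above.
interface MemoryRow {
  id: string;
  timestamp: Date;
  role: "user" | "assistant";
  tokenCount: number;
  salienceScore: number;
  isSummary: boolean; // true when the row holds an LLM summary, not a raw turn
}

const example: MemoryRow = {
  id: "msg-001",
  timestamp: new Date(),
  role: "user",
  tokenCount: 42,
  salienceScore: 0.7,
  isSummary: false,
};
```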

API

  • Fastify server providing REST endpoints

Compression Algorithm

The compression algorithm works as follows:

  1. Rank: Compute a salience score for each message based on:

     • Cosine similarity to recent queries
     • Recency factor 1/(1+ageDays)
  2. Select: Pick the oldest 20% of messages whose salience scores fall below a threshold.

  3. Merge: Chunk the selected messages into blocks of 1-2k tokens.

  4. Summarize: Use an LLM to summarize each block, with a prompt that emphasizes preserving facts, names, and numbers.

  5. Replace: Delete the original messages and insert a summary row (flagged isSummary = true).

  6. Audit: Keep a hash of the deleted text for verification and potential rollback.
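Steps 1 and 2 can be sketched as follows, assuming each message already carries its maximum cosine similarity to recent queries. The field names and the salience threshold are illustrative; only the 1/(1+ageDays) recency factor and the oldest-20% rule come from the description above.

```typescript
interface ScoredMessage {
  id: string;
  ageDays: number;         // age of the message in days
  querySimilarity: number; // cosine similarity to recent queries, in [0, 1]
}

// Step 1 (Rank): similarity weighted by the recency factor 1/(1+ageDays).
function salience(m: ScoredMessage): number {
  return m.querySimilarity * (1 / (1 + m.ageDays));
}

// Step 2 (Select): among the oldest 20% of messages, keep only those whose
// salience falls below `threshold` as candidates for compression.
function selectForCompression(
  messages: ScoredMessage[],
  threshold: number
): ScoredMessage[] {
  const byAge = [...messages].sort((a, b) => b.ageDays - a.ageDays); // oldest first
  const oldest = byAge.slice(0, Math.ceil(byAge.length * 0.2));
  return oldest.filter((m) => salience(m) < threshold);
}
```

Messages that survive selection stay verbatim; the selected blocks continue through steps 3-6.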

Getting Started

Prerequisites

  • Node.js 18+
  • PostgreSQL
  • ChromaDB (local or remote instance)
  • OpenAI API key

Installation

  1. Clone the repository:

    git clone https://github.com/berkdurmus/adaptive-llm-memory-compressor.git
    cd adaptive-llm-memory-compressor
  2. Install dependencies:

    npm install
  3. Create a .env file with your configuration:

    OPENAI_API_KEY=your_openai_api_key
    DATABASE_URL=postgres://user:password@localhost:5432/memory_compressor
    CHROMA_HOST=localhost
    CHROMA_PORT=8000
    MAX_MEMORY_TOKENS=10000
    COMPRESSION_THRESHOLD=8000
    COMPRESSION_TARGET=5000
    SUMMARY_MODEL=gpt-4o-mini
    EMBEDDING_MODEL=text-embedding-3-small
    
  4. Build the project:

    npm run build
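The .env settings from step 3 can be read at startup roughly as below. This is a minimal sketch with parsing fallbacks; the variable names and defaults follow the example .env above, but the project may load its configuration differently.

```typescript
interface CompressorConfig {
  maxMemoryTokens: number;
  compressionThreshold: number;
  compressionTarget: number;
  summaryModel: string;
}

// Parse numeric env vars with a fallback to the defaults shown in the README.
function loadConfig(env: Record<string, string | undefined>): CompressorConfig {
  const num = (key: string, fallback: number): number => {
    const raw = env[key];
    const parsed = raw === undefined ? NaN : Number(raw);
    return Number.isFinite(parsed) ? parsed : fallback;
  };
  return {
    maxMemoryTokens: num("MAX_MEMORY_TOKENS", 10000),
    compressionThreshold: num("COMPRESSION_THRESHOLD", 8000),
    compressionTarget: num("COMPRESSION_TARGET", 5000),
    summaryModel: env["SUMMARY_MODEL"] ?? "gpt-4o-mini",
  };
}
```

In the server itself the argument would be `process.env`.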

Running the Server

Start the API server:

npm start

Running the Demo

Run the demo with a sample conversation:

npm run demo

You can also provide your own conversation file:

npm run demo -- path/to/your/conversation.json

API Endpoints

  • POST /initialize - Initialize the memory manager
  • POST /message - Add a new message to memory
  • POST /retrieve - Retrieve messages based on a query
  • POST /conversation - Get all messages for a conversation
  • POST /compress - Force compression of a conversation
  • GET /health - Health check endpoint
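A call to POST /retrieve might look like the following. The request-body fields (conversationId, query, topK) and the port are assumptions for illustration and may not match the server's actual schema.

```typescript
// Hypothetical request options for POST /retrieve.
const retrieveRequest = {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    conversationId: "demo",
    query: "What budget did we agree on?",
    topK: 5,
  }),
};

// Assuming the Fastify server listens on localhost:3000:
// const res = await fetch("http://localhost:3000/retrieve", retrieveRequest);
// const hits = await res.json();
```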

Evaluation

The project includes an evaluation script that measures:

  • Token reduction percentage
  • F1 score before and after compression
  • Retrieval latency

Run the evaluation:

npm run evaluate
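The first two metrics reduce to simple arithmetic, sketched below against the success targets from the Problem & Goal section. This is back-of-the-envelope math, not the project's evaluation script; the sample numbers are invented.

```typescript
// Fraction of tokens removed by compression, e.g. 100k -> 35k tokens = 0.65.
function tokenReduction(before: number, after: number): number {
  return (before - after) / before;
}

// Standard F1 from precision and recall, guarding against division by zero.
function f1(precision: number, recall: number): number {
  return precision + recall === 0
    ? 0
    : (2 * precision * recall) / (precision + recall);
}

// Targets: >= 60% token reduction and <= 0.05 drop in retrieval F1.
// (Hypothetical before/after numbers for illustration.)
const meetsTargets =
  tokenReduction(100_000, 35_000) >= 0.6 &&
  f1(0.8, 0.8) - f1(0.78, 0.76) <= 0.05;
```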

License

This project is licensed under the ISC License.

Acknowledgments

  • Inspired by research on long-term memory in LLM applications
  • Uses OpenAI's embedding and completion APIs
