ICRL

In-Context Reinforcement Learning for LLM Agents

ICRL implements the In-Context Reinforcement Learning algorithm, enabling LLM agents to bootstrap their own performance by learning from successful trajectories. The agent accumulates successful experiences and retrieves relevant examples at each decision point to improve future task completion.

Installation

Install from PyPI

pip install icrl-py
# or with uv
uv add icrl-py

Install from source

git clone https://github.com/SuperAce100/icrl.git
cd icrl

# Create & activate a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # macOS/Linux
# .venv\Scripts\activate   # Windows (PowerShell)

# Install in editable mode
pip install -e .

If you use uv:

git clone https://github.com/SuperAce100/icrl.git
cd icrl
uv sync
# or: uv pip install -e .

Verify the install:

python -c "import icrl; print(icrl.__version__)"
# If the CLI entrypoint is installed:
icrl --help

Dependencies: pydantic, litellm, sentence-transformers, faiss-cpu, aiofiles, rich, python-dotenv

Quick Start

import asyncio
from icrl import Agent, LiteLLMProvider

# Create the agent
agent = Agent(
    llm=LiteLLMProvider(model="gpt-4o-mini"),
    db_path="./trajectories",
    plan_prompt="Goal: {goal}\n\nExamples:\n{examples}\n\nCreate a plan:",
    reason_prompt="Goal: {goal}\nPlan: {plan}\nObservation: {observation}\nThink step by step:",
    act_prompt="Goal: {goal}\nPlan: {plan}\nReasoning: {reasoning}\nNext action:",
    k=3,           # number of examples to retrieve
    max_steps=30,  # max steps per episode
)

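# env is any object implementing the Environment protocol (see below)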
# Training: successful trajectories are stored for future use
trajectory = asyncio.run(agent.train(env, goal="Complete the task"))

# Inference: uses stored examples but doesn't add new ones
trajectory = asyncio.run(agent.run(env, goal="Complete another task"))

Core Concepts

The ICRL Algorithm

  1. Bootstrap Phase: The agent attempts tasks, storing successful trajectories
  2. Retrieval: At each decision point, semantically similar examples are retrieved
  3. Generation: The LLM generates plans/reasoning/actions informed by examples
  4. Curation: Low-utility trajectories are automatically pruned over time

ReAct Loop

Each episode plans once, then loops through Reason → Act → Observe until done:

┌─────────────────────────────────────────────────────────┐
│  1. PLAN: Generate high-level strategy using examples   │
├─────────────────────────────────────────────────────────┤
│  2. REASON: Analyze observation with retrieved context  │
├─────────────────────────────────────────────────────────┤
│  3. ACT: Execute action based on reasoning              │
├─────────────────────────────────────────────────────────┤
│  4. OBSERVE: Get environment feedback                   │
│     └─→ Loop back to REASON until done                  │
└─────────────────────────────────────────────────────────┘
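For orientation, here is a simplified sketch of one episode under the Environment and LLMProvider protocols described below. This is illustrative only: the library's actual loop lives in loop.py and also retrieves examples and tracks history at each step.

from icrl import Message

async def episode_sketch(llm, env, goal: str, examples: str, max_steps: int = 30) -> bool:
    # 1. PLAN once per episode, conditioned on retrieved examples
    observation = env.reset(goal)
    plan = await llm.complete([Message(
        role="user",
        content=f"Goal: {goal}\n\nExamples:\n{examples}\n\nCreate a plan:",
    )])
    for _ in range(max_steps):
        # 2. REASON about the current observation
        reasoning = await llm.complete([Message(
            role="user",
            content=f"Goal: {goal}\nPlan: {plan}\nObservation: {observation}\nThink step by step:",
        )])
        # 3. ACT based on the reasoning
        action = await llm.complete([Message(
            role="user",
            content=f"Goal: {goal}\nPlan: {plan}\nReasoning: {reasoning}\nNext action:",
        )])
        # 4. OBSERVE environment feedback; loop back to REASON until done
        observation, done, success = env.step(action)
        if done:
            return success
    return False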

API Reference

Agent

The main class for training and running the ICRL agent.

from icrl import Agent

agent = Agent(
    llm: LLMProvider,              # LLM for generating completions
    db_path: str,                  # Path to trajectory database
    plan_prompt: str,              # Template with {goal}, {examples}
    reason_prompt: str,            # Template with {goal}, {plan}, {observation}, {history}, {examples}
    act_prompt: str,               # Template with {goal}, {plan}, {reasoning}, {history}, {examples}
    k: int = 3,                    # Number of examples to retrieve
    max_steps: int = 30,           # Maximum steps per episode
    seed_trajectories: list[Trajectory] | None = None,  # Initial examples
    on_step: Callable[[Step, StepContext], None] | None = None,  # Step callback
    curation_threshold: float = 0.3,      # Utility threshold for pruning
    curation_min_retrievals: int = 5,     # Min retrievals before pruning
    verify_trajectory: Callable[[Trajectory], bool] | None = None,  # Verification callback
)
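For example, verify_trajectory can gate which successful episodes actually get stored; a purely illustrative check:

from icrl import Trajectory

def verify(trajectory: Trajectory) -> bool:
    # Hypothetical criterion: keep only successful episodes that finished quickly
    return trajectory.success and len(trajectory.steps) <= 10

agent = Agent(
    ...,
    verify_trajectory=verify,
)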

Methods

Method                                        Description
await agent.train(env, goal)                  Run a training episode; store successful trajectories (with optional verification)
await agent.run(env, goal)                    Run an inference episode (database frozen)
agent.train_sync(env, goal)                   Synchronous wrapper for train
agent.run_sync(env, goal)                     Synchronous wrapper for run
await agent.train_batch(env_factory, goals)   Train on multiple goals
await agent.run_batch(env_factory, goals)     Run inference on multiple goals
agent.get_stats()                             Get database statistics
agent.database                                Access the underlying TrajectoryDatabase

LiteLLMProvider

Built-in LLM provider supporting 100+ models via LiteLLM.

from icrl import LiteLLMProvider

llm = LiteLLMProvider(
    model: str = "gpt-4o-mini",    # Model identifier
    temperature: float = 0.7,      # Sampling temperature
    max_tokens: int | None = None, # Max tokens (None for model default)
    **kwargs,                      # Additional LiteLLM arguments
)

Supported models include:

  • OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo
  • Anthropic: claude-3-5-sonnet-20241022, claude-3-opus-20240229
  • Google: gemini/gemini-pro, gemini/gemini-1.5-pro
  • Azure, Cohere, Replicate, and many more
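For example, to target one of the Anthropic models listed above (a usage sketch; assumes the matching API key is set in your environment):

llm = LiteLLMProvider(model="claude-3-5-sonnet-20241022", temperature=0.2)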

Environment Protocol

Implement this protocol for your custom environment:

from icrl import Environment

class MyEnvironment:
    def reset(self, goal: str) -> str:
        """Reset environment and return initial observation.
        
        Args:
            goal: The goal description for this episode.
        
        Returns:
            Initial observation as a string.
        """
        self._goal = goal  # Store for use in step()
        return "Initial state description"

    def step(self, action: str) -> tuple[str, bool, bool]:
        """Execute an action.
        
        Args:
            action: The action string to execute.
        
        Returns:
            Tuple of (observation, done, success):
            - observation: Result of the action
            - done: Whether episode has ended
            - success: Whether goal was achieved
        """
        # Execute action and check if goal is met
        observation = execute(action)
        success = check_goal(self._goal)
        done = success or max_steps_reached
        return observation, done, success
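For concreteness, here is a minimal toy environment that satisfies the protocol (purely illustrative, not part of the library):

class ToyDoorEnvironment:
    # Toy task: the goal is met once the agent issues the action "open door"
    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps

    def reset(self, goal: str) -> str:
        self._goal = goal
        self._steps = 0
        return "You are in a room with a closed door."

    def step(self, action: str) -> tuple[str, bool, bool]:
        self._steps += 1
        success = action.strip().lower() == "open door"
        done = success or self._steps >= self.max_steps
        observation = "The door is open." if success else "Nothing happened."
        return observation, done, success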

LLMProvider Protocol

Implement for custom LLM integrations:

from icrl import LLMProvider, Message

class MyLLMProvider:
    async def complete(self, messages: list[Message]) -> str:
        """Generate completion from messages.
        
        Args:
            messages: List of Message(role, content) objects.
        
        Returns:
            Generated text as a string.
        """
        # Call your LLM
        return await my_llm_call(messages)
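As a concrete example, here is an offline provider that returns canned responses, handy for tests without API calls (illustrative, not part of the library; the repository ships a similar MockLLMProvider in examples/mock_llm.py):

from icrl import Message

class CannedLLMProvider:
    def __init__(self, responses: list[str]):
        self._responses = responses
        self._index = 0

    async def complete(self, messages: list[Message]) -> str:
        # Cycle through the canned responses, ignoring the actual messages
        response = self._responses[self._index % len(self._responses)]
        self._index += 1
        return response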

Data Models

All models are Pydantic BaseModel classes for type safety and serialization.

Trajectory

A complete episode trajectory:

from icrl import Trajectory, Step

trajectory = Trajectory(
    id: str,                    # Auto-generated UUID
    goal: str,                  # Goal description
    plan: str,                  # Generated plan
    steps: list[Step],          # List of steps taken
    success: bool,              # Whether goal was achieved
    metadata: dict[str, Any],   # Custom metadata
)

# Convert to example string for prompts
example_str = trajectory.to_example_string()
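Because Trajectory is a Pydantic model, it also round-trips through JSON; a quick sketch, assuming the Pydantic v2 API:

# Serialize to JSON and back (Pydantic v2 method names)
json_str = trajectory.model_dump_json()
restored = Trajectory.model_validate_json(json_str)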

Step

A single step in a trajectory:

from icrl import Step

step = Step(
    observation: str,  # What the agent observed
    reasoning: str,    # Agent's reasoning
    action: str,       # Action taken
)

StepContext

Context available during prompt formatting:

from icrl import StepContext

context = StepContext(
    goal: str,
    plan: str,
    observation: str,
    reasoning: str = "",
    history: list[Step] = [],
    examples: list[Trajectory] = [],
)

# Format for prompts
context.format_examples()  # → "Goal: ...\nPlan: ...\nSteps: ..."
context.format_history()   # → "Step 1: action -> observation\n..."

Message

A chat message:

from icrl import Message

message = Message(role="user", content="Hello")

Prompt Templates

Prompts use Python format strings with these placeholders:

Placeholder     Available In   Description
{goal}          All prompts    The current goal
{examples}      All prompts    Formatted retrieved trajectories
{plan}          reason, act    The generated plan
{observation}   reason, act    Current observation
{reasoning}     act            Generated reasoning
{history}       reason, act    Previous steps in episode

Example Prompts

PLAN_PROMPT = """You are a helpful agent.

Goal: {goal}

Here are examples of similar tasks that were completed successfully:
{examples}

Create a step-by-step plan to accomplish the goal."""

REASON_PROMPT = """Goal: {goal}
Plan: {plan}

Previous steps:
{history}

Current observation:
{observation}

Examples of similar situations:
{examples}

Think step by step about what you observe and what to do next."""

ACT_PROMPT = """Goal: {goal}
Plan: {plan}

Steps so far:
{history}

Current observation: {observation}
Your reasoning: {reasoning}

What is the next action? Respond with only the action."""

Step Callbacks

Monitor agent progress with step callbacks:

from icrl import Step, StepContext

def my_callback(step: Step, context: StepContext) -> None:
    print(f"Observation: {step.observation[:100]}...")
    print(f"Reasoning: {step.reasoning}")
    print(f"Action: {step.action}")
    print(f"Using {len(context.examples)} examples")

agent = Agent(
    ...,
    on_step=my_callback,
)

Trajectory Database

The agent stores trajectories on disk with FAISS-based semantic search.

# Access the database directly
db = agent.database

# Search for similar trajectories
similar = db.search("find config files", k=3)

# Get all trajectories
all_trajs = db.get_all()

# Get a specific trajectory
traj = db.get("trajectory-id")

# Remove a trajectory
db.remove("trajectory-id")

Database Structure

./trajectories/
├── trajectories/
│   ├── <uuid-1>.json
│   ├── <uuid-2>.json
│   └── ...
├── index.faiss         # FAISS vector index
├── index_ids.json      # ID mapping
└── curation.json       # Utility tracking

Curation

The agent automatically prunes low-utility trajectories. A trajectory is pruned when:

  1. It has been retrieved at least curation_min_retrievals times
  2. Its utility score (the success rate of episodes in which it was retrieved) falls below curation_threshold

For example, with the settings below, a trajectory that has been retrieved 5 or more times but appeared in successful episodes less than 30% of the time is pruned.

agent = Agent(
    ...,
    curation_threshold=0.3,       # Prune if utility < 30%
    curation_min_retrievals=5,    # After at least 5 retrievals
)

Advanced Usage

Seed Trajectories

Initialize with pre-existing examples:

from icrl import Trajectory, Step

seed = Trajectory(
    goal="Example task",
    plan="1. Do A\n2. Do B",
    steps=[
        Step(observation="Started", reasoning="Need to do A", action="do_a"),
        Step(observation="A done", reasoning="Now do B", action="do_b"),
    ],
    success=True,
)

agent = Agent(
    ...,
    seed_trajectories=[seed],
)

Batch Training

Train on multiple tasks efficiently:

def make_env():
    return MyEnvironment()

goals = ["Task 1", "Task 2", "Task 3"]

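# Note: these are coroutines; await them inside an async function,
# or wrap each call with asyncio.run(...) at top level.
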
# Training mode - learns from each successful episode
trajectories = await agent.train_batch(make_env, goals)

# Inference mode - frozen database
trajectories = await agent.run_batch(make_env, goals)

Custom Embeddings

The database uses sentence-transformers with all-MiniLM-L6-v2 by default (as used in the paper). For custom embeddings, pass a different embedder to the database:

from icrl.embedder import SentenceTransformerEmbedder
from icrl.database import TrajectoryDatabase

embedder = SentenceTransformerEmbedder(model_name="your-model")
db = TrajectoryDatabase(path="./trajectories", embedder=embedder)

Examples

Demo scripts are in examples/ (see examples/README.md). Mock/offline verification scripts and test-focused walkthroughs are in tests/ (see tests/README.md).

Minimal OpenAI Demo

export OPENAI_API_KEY=your-key
uv run python examples/basic_openai_demo.py

Minimal Anthropic Demo

export ANTHROPIC_API_KEY=your-key
uv run python examples/basic_anthropic_demo.py

File System Navigation Agent

See examples/demo_with_real_llm.py for a complete example of an agent that navigates a virtual file system:

# Set your API key
export OPENAI_API_KEY=your-key

# Run the demo
uv run python examples/demo_with_real_llm.py

Mock LLM for Testing

Use the mock provider for fast iteration without API calls:

from examples.mock_llm import MockLLMProvider

llm = MockLLMProvider(success_rate=1.0)
agent = Agent(llm=llm, ...)

Run the full offline mock demo:

uv run python tests/test_with_mock.py

Agent API Walkthrough (Offline)

Deterministic walkthrough of Agent APIs:

  • train / run
  • train_sync / run_sync
  • train_batch / run_batch
  • seed_trajectories
  • verify_trajectory

uv run python tests/agent_api_walkthrough.py

Database API Walkthrough (Offline)

Deterministic walkthrough of storage/retrieval/curation/validation APIs:

  • TrajectoryDatabase CRUD/search
  • TrajectoryRetriever
  • CurationManager
  • HashEmbedder
  • extract_code_artifacts and validation helpers

uv run python tests/database_api_walkthrough.py

Harbor Coding Agent (Terminal-Bench 2.0 Compatible)

See examples/harbor_coding_agent.py for a coding agent example compatible with Harbor and Terminal-Bench 2.0. This demonstrates:

  • A sandboxed coding environment with shell commands (ls, cat, grep, sed, etc.)
  • Realistic software engineering tasks (debugging, refactoring, testing)
  • Performance improvement tracking before/after ICRL training

export OPENAI_API_KEY=your-key
uv run python examples/harbor_coding_agent.py

The Harbor example shows how ICRL improves agent performance on coding tasks:

  1. Baseline Evaluation: Agent attempts tasks without learned examples
  2. Training Phase: Agent learns from successful coding task trajectories
  3. Improved Evaluation: Re-test shows performance gains from trajectory learning

This pattern integrates with Harbor's agent evaluation framework, allowing you to:

  • Benchmark coding agents on Terminal-Bench 2.0 tasks
  • Use ICRL's self-generated examples to improve agent performance
  • Track improvements across training iterations

Architecture

icrl/
├── agent.py        # Main Agent class
├── loop.py         # ReAct loop implementation
├── database.py     # FAISS-backed trajectory storage
├── retriever.py    # Semantic example retrieval
├── curation.py     # Automatic trajectory pruning
├── embedder.py     # Sentence transformer embeddings
├── models.py       # Pydantic data models
├── protocols.py    # Environment and LLMProvider protocols
└── providers/
    └── litellm.py  # LiteLLM integration

Reference

This implementation is based on the algorithm described in:

Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks

The key insight is that LLM agents can bootstrap their own performance by:

  1. Attempting tasks and recording successful trajectories
  2. Using semantic retrieval to find relevant examples at each decision point
  3. Automatically curating the example database to retain high-utility examples

License

MIT
