Unload Model from Memory on Agent Switch for Local Providers #2636

@k33g

Feature Request: Unload Model from Memory on Agent Switch for Local Providers

Summary

When using local model providers (such as dmr or ollama) with multi-agent configurations, switching between agents that use different models can cause significant memory pressure because the previously loaded model is not evicted from the inference engine's memory. This feature request proposes a mechanism to automatically unload the current model when switching to a different agent that uses a different model.

Problem

docker-agent supports multi-agent architectures where different agents can be configured with different models. When using local providers like dmr (Docker Model Runner), each model is loaded into GPU/CPU memory when first used. If two agents use different models, both models end up resident in memory simultaneously, even though only one is active at any given time.

On machines with limited VRAM or RAM (common for local inference setups), this leads to:

  • Memory pressure: two or more large models competing for the same memory pool.
  • Degraded performance: thrashing, swapping, or OOM conditions.
  • Unnecessary resource usage: the idle model occupies memory that the active model could use.

DMR already exposes an API endpoint to explicitly unload a model from memory:

POST /engines/unload

This endpoint exists precisely to allow consumers to free memory when a model is no longer needed, but docker-agent currently has no way to trigger it automatically when switching between agents.
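For illustration, a client-side unload call could look like the sketch below. Only the POST /engines/unload path comes from the description above; the JSON payload naming the model to evict is an assumption and would need to be checked against the DMR API reference:

package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical payload; the real DMR unload request body may differ.
	body := bytes.NewBufferString(`{"models": ["ai/qwen3:14B"]}`)
	resp, err := http.Post("http://localhost:12434/engines/unload", "application/json", body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("unload status:", resp.Status)
}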

Proposed Solution

New provider_opts key: unload_on_switch

Add a boolean provider_opts key that tells docker-agent to call the engine's unload API before loading a different model.

models:
  local-large:
    provider: dmr
    model: ai/qwen3:14B
    provider_opts:
      unload_on_switch: true   # unload this model from memory when switching away

  local-small:
    provider: dmr
    model: ai/qwen3:0.6B
    provider_opts:
      unload_on_switch: true

When unload_on_switch: true is set on a model, docker-agent will call the engine's unload endpoint before activating a different model on the same provider.
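A rough sketch of the check this implies at switch time is shown below; every type and helper name here is hypothetical and only meant to make the intended behaviour concrete:

package agent

import "fmt"

// ModelConfig and onAgentSwitch are illustrative names, not docker-agent's real API.
type ModelConfig struct {
	Provider     string
	Model        string
	ProviderOpts map[string]any
}

// onAgentSwitch evicts the outgoing model before the incoming agent's model
// is activated, but only when both agents share the same provider, the models
// differ, and the outgoing model opted in via unload_on_switch.
func onAgentSwitch(current, next ModelConfig, unload func(model string) error) error {
	wantsUnload, _ := current.ProviderOpts["unload_on_switch"].(bool)
	if wantsUnload && current.Provider == next.Provider && current.Model != next.Model {
		if err := unload(current.Model); err != nil {
			return fmt.Errorf("unload %s: %w", current.Model, err)
		}
	}
	return nil
}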

New provider key: unload_api

To make the mechanism generic and reusable with other local-inference engines (e.g. Ollama), expose the unload endpoint as a configurable field on the provider configuration:

providers:
  my-local-runner:
    provider: dmr
    base_url: http://localhost:12434/engines/llama.cpp/v1
    unload_api: /engines/unload   # POST to {base_url_root}{unload_api} to unload a model

  my-ollama:
    provider: ollama
    base_url: http://localhost:11434/v1
    unload_api: /api/delete       # Ollama's equivalent endpoint

When unload_api is set on a provider, and the active agent switches to an agent using a different model on the same provider, docker-agent will issue a POST (or DELETE, configurable) request to that endpoint before initializing the new model.
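A sketch of how that request could be assembled from the provider configuration follows. Deriving the base URL root as scheme plus host, the UnloadMethod field, and the request payload are all assumptions made for illustration:

package providers

import (
	"bytes"
	"fmt"
	"net/http"
	"net/url"
)

// ProviderConfig mirrors the proposed provider keys; field names are illustrative.
type ProviderConfig struct {
	BaseURL      string // e.g. http://localhost:12434/engines/llama.cpp/v1
	UnloadAPI    string // e.g. /engines/unload
	UnloadMethod string // "POST" (default) or "DELETE"
}

// unloadModel issues {UnloadMethod} {base_url_root}{unload_api} to evict a model.
func unloadModel(cfg ProviderConfig, model string) error {
	if cfg.UnloadAPI == "" {
		return nil // provider did not opt in to explicit unloading
	}
	u, err := url.Parse(cfg.BaseURL)
	if err != nil {
		return err
	}
	root := u.Scheme + "://" + u.Host // assumed meaning of {base_url_root}
	method := cfg.UnloadMethod
	if method == "" {
		method = http.MethodPost
	}
	// Payload shape is illustrative; engines differ in what they expect.
	body := bytes.NewBufferString(fmt.Sprintf(`{"model": %q}`, model))
	req, err := http.NewRequest(method, root+cfg.UnloadAPI, body)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("unload %s: unexpected status %s", model, resp.Status)
	}
	return nil
}

One design consideration: it may be preferable to treat the unload call as best-effort (log and continue on failure) so that a failed eviction never blocks activating the next agent's model.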

Interaction with keep_alive

DMR already supports keep_alive via provider_opts (values like "0" to unload immediately, "-1" to keep forever). The unload_on_switch option is complementary: keep_alive controls TTL-based eviction, while unload_on_switch triggers an explicit eviction at agent-switch time regardless of the TTL.

Setting keep_alive: "0" currently unloads the model after each request, which is too aggressive for normal use. unload_on_switch would provide a middle ground: keep the model loaded during a session with a given agent, but release it when the runtime moves to a different agent with a different model.
