RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation

A package for building an MCP server that applies retrieval-augmented generation (RAG) principles to tool selection. Instead of flooding the LLM with every MCP description, it maintains an external vector index of all available MCP metadata and retrieves only the entries relevant to each query.

Key Points from Research Paper

Problem Statement

Prompt Bloat: As the number of available tools (MCPs) grows, including all their descriptions in a single prompt leads to context window saturation and poor tool selection
Decision Complexity: LLMs struggle with choosing the correct tool when presented with many similar options
Performance Degradation: Tool selection accuracy drops from over 90% with a small tool pool to 13.62% with many tools

RAG-MCP Solution

RAG-MCP applies Retrieval-Augmented Generation principles to tool selection:

Retrieval Phase: A lightweight retriever encodes the user query and performs semantic search over the MCP index, returning top-k candidate MCPs
Validation Phase: For each retrieved MCP, RAG-MCP can optionally generate few-shot example queries and test the MCP's responses to check compatibility
Invocation Phase: Only the single best MCP description is injected into the LLM prompt
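
The three phases above can be sketched in a few lines of Python. This is a minimal illustration, not the package's implementation: `MCPEntry`, `retrieve`, `select_mcp`, and `build_prompt` are hypothetical names, word overlap stands in for the embedding-based retriever, and `validate` is a placeholder for whatever few-shot compatibility check is used.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MCPEntry:
    """Hypothetical index record: a tool name plus its description."""
    name: str
    description: str


def _tokens(text: str) -> set[str]:
    # Crude tokenizer; a real retriever would encode text with an embedding model.
    return set(text.lower().replace("?", " ").replace(",", " ").split())


def retrieve(query: str, index: list[MCPEntry], top_k: int = 3) -> list[MCPEntry]:
    """Retrieval phase: rank entries by word overlap (a stand-in for semantic similarity)."""
    q = _tokens(query)
    scored = [(len(q & _tokens(e.description)), e) for e in index]
    scored = [(score, e) for score, e in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:top_k]]


def select_mcp(
    query: str,
    index: list[MCPEntry],
    validate: Callable[[MCPEntry], bool],
    top_k: int = 3,
) -> Optional[MCPEntry]:
    """Validation phase: return the best-ranked candidate that passes the check."""
    for candidate in retrieve(query, index, top_k):
        if validate(candidate):
            return candidate
    return None


def build_prompt(query: str, mcp: MCPEntry) -> str:
    """Invocation phase: only the selected MCP description enters the prompt."""
    return f"Tool: {mcp.name}\nDescription: {mcp.description}\n\nUser: {query}"


# Example usage with a three-entry toy index.
index = [
    MCPEntry("weather_api", "Get the current weather forecast for a city"),
    MCPEntry("calculator", "Evaluate arithmetic expressions"),
    MCPEntry("file_reader", "Read the contents of a file"),
]
query = "What's the weather in Tokyo?"
chosen = select_mcp(query, index, validate=lambda entry: True)
if chosen is not None:
    print(build_prompt(query, chosen))
```

Passing the validator in as a callable keeps the selection loop independent of how compatibility is actually tested.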

Key Benefits

Reduced Prompt Size: Cuts prompt tokens roughly in half (from 2133.84 to 1084 tokens, a reduction of about 49%)
Improved Accuracy: More than triples tool selection accuracy (43.13% vs 13.62% baseline)
Lower Cognitive Load: LLM no longer needs to sift through hundreds of distractors
Resource Efficiency: Only activates the selected MCP server, not all registered ones
Extensibility: New tools can be added to the index without retraining the LLM
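
As a quick check on the headline numbers, the token and accuracy figures quoted above can be turned into a relative reduction and a ratio. The helper name `relative_reduction` is only for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return 1 - after / before


# Figures quoted in the Key Benefits list above.
baseline_tokens, rag_tokens = 2133.84, 1084.0
baseline_accuracy, rag_accuracy = 13.62, 43.13

print(f"Prompt token reduction: {relative_reduction(baseline_tokens, rag_tokens):.1%}")  # 49.2%
print(f"Accuracy ratio: {rag_accuracy / baseline_accuracy:.2f}x")  # 3.17x
```

The prompt shrinks to just over half its original size, and accuracy improves by a factor slightly above three.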

Experimental Results

The MCP stress test shows:

High success rates (>90%) when candidate pool is small (<30 MCPs)
Performance degradation becomes significant beyond ~100 MCPs
RAG-MCP maintains better performance even with large tool registries

Architecture Overview

RAG-MCP operates in three core steps:

User query → Encoded and submitted to retriever
Retriever → Searches vector index, ranks candidates by semantic similarity
LLM → Receives only selected MCP schema and executes task
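
A minimal sketch of the vector index behind step 2 is shown below, assuming a toy term-frequency embedding in place of a neural encoder. `MCPIndex`, `embed`, and `cosine` are hypothetical names chosen for this example; a production setup would more likely use an embedding model and a dedicated vector store.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a term-frequency vector over lowercase words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class MCPIndex:
    """In-memory index mapping MCP descriptions to their schemas."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, dict, Counter]] = []

    def add(self, name: str, description: str, schema: dict) -> None:
        # Registering a tool only touches the index; the LLM is not retrained.
        self._entries.append((name, schema, embed(description)))

    def search(self, query: str, top_k: int = 3) -> list[tuple[str, dict]]:
        # Rank every entry by similarity to the query and keep the top_k.
        q = embed(query)
        ranked = sorted(self._entries, key=lambda e: cosine(q, e[2]), reverse=True)
        return [(name, schema) for name, schema, _ in ranked[:top_k]]


# Example usage: only the top-ranked schema would be passed to the LLM.
mcp_index = MCPIndex()
mcp_index.add("weather_api", "current weather forecast for a city", {"params": ["city"]})
mcp_index.add("calculator", "evaluate arithmetic math expressions", {"params": ["expression"]})
mcp_index.add("file_reader", "read a file from disk", {"params": ["path"]})
print(mcp_index.search("weather forecast for tokyo", top_k=1))
```

The linear scan in `search` is fine for a few hundred tools; larger registries would typically swap in an approximate nearest-neighbor index.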

RAG-MCP Pipeline Diagram

graph LR
    A[Query<br/>Encoding] --> B[Vector Search<br/>& Validation]
    B --> C[LLM<br/>Invocation]
    
    A1[User Query] --> A
    A --> A2[Qwen Retriever]
    
    B1[Vector Index] --> B
    B --> B2[Top-k MCPs]
    
    C --> C1[Use selected MCP]
    C --> C2[f&#40;&#41;]
    
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#bfb,stroke:#333,stroke-width:2px

The pipeline demonstrates how RAG-MCP:

Encodes user queries with the Qwen-max retriever
Retrieves & validates top-k MCPs from the vector index
Invokes the chosen MCP with the LLM using only the selected tool's schema

Goal

The main goal is to enable LLMs to effectively utilize large numbers of external tools without suffering from prompt bloat or decision fatigue.

How It Works

Traditional Approach (Without RAG-MCP)

User Query → LLM sees ALL tools → LLM selects tools → Execute tools
              (15+ tools, 2000+ tokens)

RAG-MCP Approach

User Query → RAG-MCP filters tools → LLM sees ONLY relevant tools → Execute tools
              (using embeddings)        (3-5 tools, ~1000 tokens)

Usage Example

Imagine you have an LLM assistant with 20+ tools available:

Web search tools
Database query tools
Calculator tools
File manipulation tools
API integration tools
And many more...

Without RAG-MCP

# Current approach - ALL tools sent to LLM every time
response = llm.query(
    user_input="What's the weather in Tokyo?",
    available_tools=[
        weather_api,
        web_search,
        database_query,
        calculator,
        file_reader,
        file_writer,
        api_caller,
        ... # 15+ more tools
    ]
)
# Result: LLM processes 2000+ tokens of tool descriptions
# Often selects wrong tools or gets confused

With RAG-MCP

# RAG-MCP approach - intelligent pre-filtering
relevant_tools = rag_mcp.get_relevant_tools(
    query="What's the weather in Tokyo?",
    all_tools=tool_registry,
    top_k=3
)
# Returns: [weather_api, web_search, api_caller]

response = llm.query(
    user_input="What's the weather in Tokyo?", 
    available_tools=relevant_tools  # Only 3 tools!
)
# Result: LLM processes ~1000 tokens
# Higher accuracy in tool selection

Real-World Use Cases

1. Multi-Tool AI Assistants

If your AI assistant has tools for:

Code analysis
Documentation search
Database queries
API calls
File operations

RAG-MCP ensures that a query like "analyze this Python function" only gets code analysis tools, not database or API tools.

2. Trading Bots (Trade-Copilot Specific Example)

Trade-Copilot has 17 specialized tools, including:

Trading Memory: get_current_bias, store_trading_decision, check_consistency
Market Analysis: market_structure, equity_order_flow, volume_profile
Options Analysis: options_order_flow, options_greeks_v2, greeks_metrics
Intraday Tools: ORB_analysis, FVG_analysis, technical_zones

Current Problem: Every query, even a simple one, loads ALL 17 tools (~3000 tokens).

Real Examples with RAG-MCP:

# Example 1: Simple Market Check
query = "What's the market structure for SPY?"
# Without RAG-MCP: All 17 tools loaded
# With RAG-MCP: Only 3 tools selected
selected_tools = ["get_current_bias", "market_structure_tool", "equity_order_flow_tool"]
# Result: 82% reduction in tool overhead

# Example 2: Options Analysis
query = "Show me Greeks and volatility skew for SPY 0DTE"
# Without RAG-MCP: All 17 tools (including unneeded equity tools)
# With RAG-MCP: Only options-specific tools
selected_tools = ["get_current_bias", "options_greeks_tool_v2", "greeks_metrics_tool", "greeks_insights_tool"]
# Result: 76% reduction, faster Greeks calculations

# Example 3: Opening Range Breakout
query = "Is SPY breaking out of opening range?"
# Without RAG-MCP: All 17 tools (including all options tools)
# With RAG-MCP: Only intraday tools
selected_tools = ["get_current_bias", "financial_orb_analysis", "financial_fvg_analysis"]
# Result: 82% reduction, critical for fast 0DTE decisions

Tool Dependencies Handled: RAG-MCP understands that greeks_metrics_tool requires options_greeks_tool_v2 to run first, ensuring proper tool selection.
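
The README does not say how dependencies are tracked, so the following is only one possible approach: after retrieval, expand the selected tools with their prerequisites and order them so that each prerequisite runs first. The function `expand_with_dependencies` and the shape of the `dependencies` mapping are assumptions made for this sketch.

```python
def expand_with_dependencies(selected: list[str], dependencies: dict[str, list[str]]) -> list[str]:
    """Return the selected tools plus prerequisites, each prerequisite before its dependent."""
    ordered: list[str] = []
    visiting: set[str] = set()

    def visit(tool: str) -> None:
        if tool in ordered:
            return  # already scheduled
        if tool in visiting:
            raise ValueError(f"dependency cycle at {tool}")
        visiting.add(tool)
        for dep in dependencies.get(tool, []):
            visit(dep)  # schedule prerequisites first
        visiting.discard(tool)
        ordered.append(tool)

    for tool in selected:
        visit(tool)
    return ordered


# Hypothetical dependency map using the tool names from the example above.
dependencies = {"greeks_metrics_tool": ["options_greeks_tool_v2"]}
print(expand_with_dependencies(["get_current_bias", "greeks_metrics_tool"], dependencies))
# ['get_current_bias', 'options_greeks_tool_v2', 'greeks_metrics_tool']
```

Even when the retriever returns only `greeks_metrics_tool`, the expansion step pulls in `options_greeks_tool_v2` and places it earlier in the run order.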

When to Use RAG-MCP

Use RAG-MCP when you have:

✅ More than 10 tools/functions available to your LLM
✅ Tools with overlapping or similar functionalities
✅ Performance concerns due to large prompt sizes
✅ Cost concerns from excessive token usage
✅ Users reporting that the LLM picks wrong tools

Expected Improvements

Based on research and real-world implementations:

Token Usage: Roughly 50% reduction in prompt tokens (2133.84 → 1084 in the paper's benchmark)
Tool Selection Accuracy: 3x improvement (from ~14% to ~43%)
Response Time: 30-40% faster due to reduced processing
Cost Savings: Proportional to token reduction
Scalability: Can handle 100+ tools with far less degradation than sending every tool description to the LLM
