This repo aims to answer the question: "Can we improve the tool selection process of an LLM-based agent using search techniques?"
This repository explores efficient tool selection for LLM-based agents, with a focus on narrowing the available tool set before an agent decides which tools to call. The project evaluates embedding-based retrieval and LLM-based verification strategies for selecting the correct tools from a tool catalog, emphasizing multi-tool requests and agent control-flow correctness. The scope is intentionally limited to tool reasoning and selection, not tool execution.
As LLM agents are integrated with an increasing number of tools, exposing the full tool catalog to the model becomes:
- Inefficient (larger prompts, higher inference cost)
- Error-prone (irrelevant tool calls and hallucinations)
- Difficult to control (weak separation of responsibilities)
This project investigates whether pre-selecting a small, relevant subset of tools—based on tool descriptions and user intent—can:
- Improve agent reliability
- Reduce unnecessary tool consideration
- Preserve correctness for both single-tool and multi-tool requests
Key components of the approach:
- Embedding-based tool retrieval with high recall (see the sketch after this list)
- Middleware-level enforcement of tool visibility in LangChain agents
- Evaluation without executing real APIs
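The retrieval step can be pictured with a minimal sketch. The model choice, helper names, and tool-dictionary layout below are illustrative assumptions, not the exact code in `embeddings.py`: each tool description is embedded once, the user prompt is embedded per request, and the top-k most similar tools become the allowed set.

```python
# Illustrative sketch of embedding-based tool retrieval (assumed names, not the repo's code).
import numpy as np
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Assumption: any sentence-embedding model works; BAAI/bge-small-en-v1.5 is just an example choice.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

def build_tool_index(tools: list[dict]) -> np.ndarray:
    """Embed 'name: description' for each tool and L2-normalize the vectors."""
    texts = [f"{t['name']}: {t['description']}" for t in tools]
    vectors = np.array(embed_model.get_text_embedding_batch(texts))
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def retrieve_tools(prompt: str, tools: list[dict], tool_vectors: np.ndarray, k: int = 5) -> list[dict]:
    """Return the k tools whose descriptions are closest to the prompt (cosine similarity)."""
    query = np.array(embed_model.get_text_embedding(prompt))
    query /= np.linalg.norm(query)
    scores = tool_vectors @ query
    top_k = np.argsort(scores)[::-1][:k]
    return [tools[i] for i in top_k]
```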
This project uses xLAM-function-calling-60k, a synthetic open-source dataset for evaluating function calling and tool usage.
- Tools are abstract and cost-free
- Tool definitions include names, descriptions, and schemas
- Ground-truth tool calls are provided
- Supports both single-tool and multi-tool prompts
- The dataset enables deterministic and reproducible evaluation without requiring any API credentials.
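A minimal sketch of loading and inspecting the dataset with the `datasets` package (the dataset id and field names are assumptions based on the public Hugging Face release, which may require accepting the dataset's terms; the repo's own preprocessing lives in `test/generate_data.py`):

```python
# Assumed dataset id and schema: each record carries a natural-language "query"
# plus JSON-encoded "tools" and "answers" fields.
import json
from datasets import load_dataset

ds = load_dataset("Salesforce/xlam-function-calling-60k", split="train")

example = ds[0]
tools = json.loads(example["tools"])      # tool names, descriptions, parameter schemas
answers = json.loads(example["answers"])  # ground-truth tool calls (assumed shape: [{"name": ..., "arguments": ...}])
expected_tools = [call["name"] for call in answers]

print(example["query"])
print([t["name"] for t in tools], "->", expected_tools)
```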
Among the important data points collected are:
- Accuracy - calculated as the overlap between the expected tools and the tools actually selected, divided by the maximum of the two list lengths:

overlap = len(set(expected_tools).intersection(set(selected_tools)))
result_single_test["accuracy"] = overlap / max(
    len(expected_tools), len(selected_tools)
)

- Test pass - set to True if at least one of the expected tools was selected.
The results were measured by:
- The average accuracy.
- The ratio of tests that passed to all tests.
- The ratio of tests with an accuracy score above 0.5 to all tests.
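For concreteness, a minimal sketch of how these three summary numbers can be derived from the per-test results (the field names "accuracy" and "passed" are assumptions, not necessarily the ones used in `test_runner.py`):

```python
# Aggregate per-test results into the three reported summary metrics.
def summarize(results: list[dict], threshold: float = 0.5) -> dict:
    n = len(results)
    return {
        "mean_accuracy": sum(r["accuracy"] for r in results) / n,
        "pass_ratio": sum(1 for r in results if r["passed"]) / n,
        "above_threshold_ratio": sum(1 for r in results if r["accuracy"] > threshold) / n,
    }
```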
- Baseline (Normal Agent with no Middleware)
Normal Agent Mean Accuracy: 67.91%
Normal Agent test pass to all tests ratio: 92.00%
Normal Agent above threshold accuracy ratio: 51.60%
Normal Agent Mean Total tokens: 24369.10
Normal Agent Mean Time per Prompt: 81.25 sec
- Embeddings Search:
Filtering Agent Mean Accuracy: 83.87%
Filtering Agent test pass to all tests ratio: 95.28%
Filtering Agent above threshold accuracy ratio: 74.34%
- BM25 Search (Filtering Agent)
Filtering Agent Mean Accuracy: 78.12%
Filtering Agent test pass to all tests ratio: 91.20%
Filtering Agent above threshold accuracy ratio: 68.80%
Filtering Agent Mean Total tokens: 7477.96
Filtering Agent Mean Time per Prompt: 82.21 sec
As the results show, the filtering agent (i.e., the agent that uses BM25 search-based tool selection) can indeed improve the agent's accuracy when using tools.
- An accuracy improvement of about 10 percentage points and an increase of about 17 percentage points in the above-threshold (0.5) accuracy ratio show that tool selection can work, and by improving the search technique we might achieve even better results.
- An additional test that counted total tokens shows a dramatic decrease of ~69% (from roughly 24,369 to 7,478 mean total tokens).
- I ran this program locally using LM Studio with the meta-llama-3.1-8b-instruct-128k model on an M4 Mac.
- Due to hardware limitations, I used the first 50 tools and tested only the queries that included them.
User Prompt
↓
Embedding-Based Tool Retrieval
↓
Top-K Candidate Tools
↓
Final Allowed Tool Set
↓
LangChain Agent (No Tool Execution)
↓
Evaluation Against Ground Truth
├── README.md
├── embeddings.py
├── prototype
│ ├── intialize_agent.py
│ ├── main.py
│ └── tools.py
├── pyproject.toml
├── test
│ ├── agent_tool_calling_init.py
│ ├── data
│ │ ├── test_suite.json
│ │ └── xlam_processed.json
│ ├── generate_data.py
│ ├── generated_tools.py
│ ├── post_processing.py
│ ├── results
│ │ ├── backup
│ ├── test_runner.py
│ └── tools_embeddings.py
└── uv.lock
This project integrates with LangChain’s agent middleware, including:
- LLMToolSelectorMiddleware
- wrap_model_call for tool visibility restriction
Agents are restricted to the filtered tool set per request, enforcing correctness by construction.
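A rough, hedged sketch of what the tool-visibility restriction can look like with `wrap_model_call`. The helpers `retrieve_tools`, `TOOLS`, and `TOOL_VECTORS` come from the illustrative retrieval sketch earlier and are not names from this repo, and the exact `ModelRequest` API, in particular how the filtered tool list is set back on the request, may differ between LangChain versions:

```python
# Hedged sketch of per-request tool restriction via LangChain agent middleware.
from langchain.agents.middleware import wrap_model_call

@wrap_model_call
def restrict_tools(request, handler):
    # Assumption: the latest message carries the user prompt.
    prompt = request.messages[-1].content
    allowed_names = {t["name"] for t in retrieve_tools(prompt, TOOLS, TOOL_VECTORS, k=5)}
    # Assumption: request.override(tools=...) is how the filtered list is applied;
    # adjust to your LangChain version if the API differs.
    filtered = [t for t in request.tools if t.name in allowed_names]
    return handler(request.override(tools=filtered))
```

The middleware is then supplied through the agent's middleware list (e.g. `create_agent(model, tools=..., middleware=[restrict_tools])`), so each request only ever sees the filtered subset.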
Requirements:
- Python 3.11
- uv package manager
tools-selector v0.1.0
├── datasets v4.4.1
├── docx2txt v0.9
├── langchain v1.0.8
├── langchain-core v1.1.0
├── langchain-openai v1.0.3
├── llama-index v0.14.8
├── llama-index-embeddings-huggingface v0.6.1
├── llama-index-llms-openai v0.6.9
├── llama-index-llms-openai-like v0.5.3
├── loguru v0.7.3
├── numpy v2.3.5
├── openai v2.8.1
├── pandas v2.2.3
├── pydantic v2.12.4
├── torch v2.9.1
├── tqdm v4.67.1
└── transformers v4.57.1