Kuan-Yi Lee, Tsung-En Lin, Hung-yi Lee
📝 Paper: https://arxiv.org/abs/2510.11454
This repository contains the official implementation of Audio-Maestro, a framework that enables Large Audio-Language Models (LALMs) to autonomously call external tools for audio understanding and reasoning.
Our work extends tool-augmented reasoning from text-based systems to the audio domain, allowing models to dynamically analyze, transform, and interpret audio signals via structured, timestamp-aware tool invocation.
Audio-Maestro bridges audio-language understanding and tool-based reasoning through a modular two-phase framework:
- Phase 1: The LALM processes input speech and decides whether to answer directly or call one or more external tools.
- Phase 2: When a tool is invoked, it is executed externally and returns structured, timestamp-aware outputs. These results are fed back into the LALM to produce the final, tool-informed response.
This design enables interpretable, low-redundancy, and extensible audio reasoning without retraining large models.
The Audio-Maestro framework consistently outperforms baselines (Text Only + Tool, Audio Without Tool) across all tested models, including DeSTA-2.5, Gemini-2.5-flash, and GPT-4o on the MMAU benchmark.
You must have a Conda installation.
- Clone this repository (or download the `environment.yml` file).
- Open your terminal or Anaconda Prompt and navigate to the project directory.
- Run the following command to create the environment:

  ```shell
  conda env create -f environment.yml
  ```

This will create a new Conda environment named `audio`, downloading and installing all the specific package versions listed in the file. This may take several minutes.
Once the creation is complete, activate the new environment:

```shell
conda activate audio
```

You need your own Google Gemini API key. Set the environment variable `GEMINI_API_KEY` before running the script.

Running 1000 test samples from the MMAU benchmark with the Gemini-2.5-flash model should cost no more than $10 USD.
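For example, on Linux or macOS the key can be exported for the current shell session (the key string below is a placeholder; substitute your own key):

```shell
# Make the Gemini API key visible to the script for this shell session.
# Replace the placeholder with your own key.
export GEMINI_API_KEY="your-api-key-here"
```

On Windows PowerShell the equivalent is `$env:GEMINI_API_KEY = "your-api-key-here"`.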
To use the scripts provided in this repository, run them directly within the activated Conda environment:

```shell
python script/tool_execute_gemini.py
```

The following components will be released:
- Tool Interfaces: integration templates for Whisper, CosyVoice, TitaNet, emotion2vec, etc.
- Script for Gemini
- Script for DeSTA & GPT
- Evaluation Scripts: ablation studies and related metric scripts
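To illustrate what a timestamp-aware tool output might look like before it is fed back to the LALM, here is a small sketch. The field names and the serialization format are assumptions for illustration, not the released tool-interface templates.

```python
from dataclasses import dataclass

# Hypothetical timestamp-aware tool result
# (field names are assumptions, not the released Audio-Maestro format).
@dataclass
class ToolSegment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    label: str    # tool-specific payload: transcript text, speaker ID, emotion, ...

def format_for_model(tool_name: str, segments: list[ToolSegment]) -> str:
    """Serialize tool output into a text block the LALM can condition on."""
    lines = [f"[{tool_name}]"]
    for seg in segments:
        lines.append(f"{seg.start:.2f}-{seg.end:.2f}: {seg.label}")
    return "\n".join(lines)

segments = [ToolSegment(0.0, 1.2, "hello"), ToolSegment(1.2, 2.0, "world")]
print(format_for_model("whisper_asr", segments))
```

Keeping tool outputs in a uniform, timestamped text form like this is one simple way to let heterogeneous tools (ASR, speaker ID, emotion recognition) share a single feedback channel into the model.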
If you find this repository useful for your research, please consider citing our paper:
@article{lee2025audio,
title={Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning},
author={Lee, Kuan-Yi and Lin, Tsung-En and Lee, Hung-Yi},
journal={arXiv preprint arXiv:2510.11454},
year={2025}
}

