Kuan-Yi Lee, Tsung-En Lin, Hung-yi Lee
📝 Paper: https://arxiv.org/abs/2510.11454
This repository contains the official implementation of Audio-Maestro, a framework that enables Large Audio-Language Models (LALMs) to autonomously call external tools for audio understanding and reasoning.
Our work extends tool-augmented reasoning from text-based systems to the audio domain, allowing models to dynamically analyze, transform, and interpret audio signals via structured, timestamp-aware tool invocation.
Audio-Maestro bridges audio-language understanding and tool-based reasoning through a modular two-phase framework:
- Phase 1: The LALM processes input speech and decides whether to answer directly or call one or more external tools.
- Phase 2: When a tool is invoked, it is executed externally and returns structured, timestamp-aware outputs. These results are fed back into the LALM to produce the final, tool-informed response.
This design enables interpretable, low-redundancy, and extensible audio reasoning without retraining large models.
The Audio-Maestro framework consistently outperforms baselines (Text Only + Tool, Audio Without Tool) across all tested models, including DeSTA-2.5, Gemini-2.5-flash, and GPT-4o on the MMAU benchmark.
You must have a Conda installation.
- Clone this repository (or download the `environment.yml` file).
- Open your terminal or Anaconda Prompt and navigate to the project directory.
- Run the following command to create the environment:

  ```shell
  conda env create -f environment.yml
  ```

This will create a new Conda environment named `audio`, downloading and installing all the specific package versions listed in the file. This may take several minutes.
Once the creation is complete, activate the new environment:

```shell
conda activate audio
```

You need your own Google Gemini API key. Set the environment variable `GEMINI_API_KEY` before running the script.

Running 1000 test samples from the MMAU benchmark with the Gemini-2.5-flash model should cost no more than $10 USD.
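For example, on Linux or macOS the key can be exported for the current shell session (the key string below is a placeholder; substitute your own key):

```shell
# Make the Gemini API key visible to the script for this shell session.
# Replace the placeholder with your own key.
export GEMINI_API_KEY="your-api-key-here"
```

On Windows PowerShell the equivalent is `$env:GEMINI_API_KEY = "your-api-key-here"`.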
To use the scripts provided in this repository, run them directly within the activated Conda environment:

```shell
python script/tool_execute_gemini.py
```

The following components will be released:
- Tool Interfaces: integration templates for Whisper, CosyVoice, TitaNet, emotion2vec, etc.
- Script for Gemini
- Script for DeSTA & GPT
- Evaluation Scripts: ablation studies and related metric scripts
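To illustrate what a timestamp-aware tool output might look like before it is fed back to the LALM, here is a small sketch. The field names and the serialization format are assumptions for illustration, not the released tool-interface templates.

```python
from dataclasses import dataclass

# Hypothetical timestamp-aware tool result
# (field names are assumptions, not the released Audio-Maestro format).
@dataclass
class ToolSegment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    label: str    # tool-specific payload: transcript text, speaker ID, emotion, ...

def format_for_model(tool_name: str, segments: list[ToolSegment]) -> str:
    """Serialize tool output into a text block the LALM can condition on."""
    lines = [f"[{tool_name}]"]
    for seg in segments:
        lines.append(f"{seg.start:.2f}-{seg.end:.2f}: {seg.label}")
    return "\n".join(lines)

segments = [ToolSegment(0.0, 1.2, "hello"), ToolSegment(1.2, 2.0, "world")]
print(format_for_model("whisper_asr", segments))
```

Keeping tool outputs in a uniform, timestamped text form like this is one simple way to let heterogeneous tools (ASR, speaker ID, emotion recognition) share a single feedback channel into the model.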
If you find this repository useful for your research, please consider citing our paper:
@article{lee2025audio,
title={Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning},
author={Lee, Kuan-Yi and Lin, Tsung-En and Lee, Hung-Yi},
journal={arXiv preprint arXiv:2510.11454},
year={2025}
}

