
gary920209/Audio-Maestro


Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning

Kuan-Yi Lee, Tsung-En Lin, Hung-yi Lee
📝 Paper: https://arxiv.org/abs/2510.11454

This repository contains the official implementation of Audio-Maestro, a framework that enables Large Audio-Language Models (LALMs) to autonomously call external tools for audio understanding and reasoning.
Our work extends tool-augmented reasoning from text-based systems to the audio domain, allowing models to dynamically analyze, transform, and interpret audio signals via structured, timestamp-aware tool invocation.


Overview

Audio-Maestro bridges audio-language understanding and tool-based reasoning through a modular two-phase framework:

  • Phase 1: The LALM processes input speech and decides whether to answer directly or call one or more external tools.
  • Phase 2: When a tool is invoked, it is executed externally and returns structured, timestamp-aware outputs. These results are fed back into the LALM to produce the final, tool-informed response.

This design enables interpretable, low-redundancy, and extensible audio reasoning without retraining large models.
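The two-phase flow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: `Decision`, `ToolSegment`, `run_two_phase`, and the `fake_*` stubs are hypothetical names introduced here, and the model calls are replaced by plain Python callables so the control flow can be read on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ToolSegment:
    """One timestamp-aware piece of tool output (times in seconds)."""
    start: float
    end: float
    text: str


@dataclass
class Decision:
    """Phase 1 outcome: a direct answer, or the names of tools to call."""
    answer: Optional[str] = None
    tool_calls: Optional[List[str]] = None


Tool = Callable[[str], List[ToolSegment]]


def run_two_phase(
    audio_path: str,
    question: str,
    decide: Callable[[str, str], Decision],
    finalize: Callable[[str, str, str], str],
    tools: Dict[str, Tool],
) -> str:
    # Phase 1: the model either answers directly or requests tools.
    decision = decide(audio_path, question)
    if decision.answer is not None:
        return decision.answer

    # Phase 2: run each requested tool externally and collect its
    # timestamped output as text that is fed back to the model.
    lines = []
    for name in decision.tool_calls or []:
        for seg in tools[name](audio_path):
            lines.append(f"[{name}] {seg.start:.2f}-{seg.end:.2f}s: {seg.text}")
    return finalize(audio_path, question, "\n".join(lines))


# Hypothetical stand-ins for the LALM and a transcription tool.
def fake_decide(audio_path: str, question: str) -> Decision:
    if "say" in question:
        return Decision(tool_calls=["asr"])
    return Decision(answer="direct answer")


def fake_asr(audio_path: str) -> List[ToolSegment]:
    return [ToolSegment(0.0, 1.5, "hello world")]


def fake_finalize(audio_path: str, question: str, tool_context: str) -> str:
    return f"Based on tools:\n{tool_context}"


if __name__ == "__main__":
    print(run_two_phase("a.wav", "What does the speaker say?",
                        fake_decide, fake_finalize, {"asr": fake_asr}))
```

In this sketch the tool registry is a plain dictionary, so adding a tool means adding one entry; the model itself is untouched, which mirrors the "without retraining" point above.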

Figure: Audio-Maestro Framework


Key Results

The Audio-Maestro framework consistently outperforms baselines (Text Only + Tool, Audio Without Tool) across all tested models, including DeSTA-2.5, Gemini-2.5-flash, and GPT-4o on the MMAU benchmark.

Figure: Key Results

Environment Setup

1. Prerequisites

You must have a Conda installation.

2. Create the Environment

  1. Clone this repository (or download the environment.yml file).

  2. Open your terminal or Anaconda Prompt and navigate to the project directory.

  3. Run the following command to create the environment:

    conda env create -f environment.yml

    This will create a new Conda environment named audio, downloading and installing all the specific package versions listed in the file. This may take several minutes.

3. Activate the Environment

Once the creation is complete, activate the new environment:

conda activate audio
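To confirm that the activation took effect before running anything, a small check like the one below may help. It relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets; the helper names `active_conda_env` and `in_audio_env` are hypothetical and not part of this repository.

```python
import os
from typing import Mapping, Optional


def active_conda_env(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the active Conda environment, or None if unset."""
    env = os.environ if env is None else env
    return env.get("CONDA_DEFAULT_ENV")


def in_audio_env(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when the active environment is the one named 'audio'."""
    return active_conda_env(env) == "audio"


if __name__ == "__main__":
    if not in_audio_env():
        print("Run 'conda activate audio' first; current env:", active_conda_env())
```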

Script Usage

Prerequisites

You need your own Google Gemini API key. Set the environment variable GEMINI_API_KEY before running the script. Running 1000 test samples from the MMAU benchmark with the Gemini-2.5-flash model should cost no more than $10 USD.
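A pre-flight check along these lines can catch a missing key before any API call is made. It is only a sketch: `require_gemini_api_key` and `cost_upper_bound_usd` are hypothetical helpers, and the cost function assumes the rough bound quoted above (at most $10 per 1000 samples) scales linearly with the number of samples, which may not hold exactly.

```python
import os
from typing import Mapping, Optional


def require_gemini_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the key from GEMINI_API_KEY, or raise with a clear message."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it before running the script."
        )
    return key


def cost_upper_bound_usd(
    num_samples: int,
    budget_usd: float = 10.0,
    budget_samples: int = 1000,
) -> float:
    """Scale the rough bound linearly (an assumption, not a measured rate)."""
    return budget_usd * num_samples / budget_samples


if __name__ == "__main__":
    require_gemini_api_key()
    print(f"Estimated upper bound for 250 samples: ${cost_upper_bound_usd(250):.2f}")
```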

Script Execution

To use the scripts provided in this repository, you can directly run them within the activated Conda environment:

python script/tool_execute_gemini.py

🚧 TODO (Planned Open-Source Release)

The following components will be released:

  • Tool Interfaces: integration templates for Whisper, CosyVoice, TitaNet, emotion2vec, etc.
  • Script for Gemini
  • Script for DeSTA & GPT
  • Evaluation Scripts: scripts for ablation studies and related metrics

Citation

If you find this repository useful for your research, please consider citing our paper:

@article{lee2025audio,
  title={Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning},
  author={Lee, Kuan-Yi and Lin, Tsung-En and Lee, Hung-Yi},
  journal={arXiv preprint arXiv:2510.11454},
  year={2025}
}
