SpaAudioLM

Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding


Overview

Existing environmental sound classification (ESC) methods treat sounds as isolated signals, ignoring where they occur. SpaAudioLM bridges this gap by jointly reasoning over audio and geospatial Point-of-Interest (POI) metadata, enabling spatially grounded sound understanding across 28 environmental sound categories.

We fine-tune Qwen2.5-Omni-7B through a three-phase pipeline:

  1. Difficulty Profiling - derive per-class difficulty weights from zero-shot recall
  2. Constrained CoT SFT - supervised fine-tuning with answer-weighted loss
  3. Difficulty-Aware GRPO - reinforcement learning with composite reward (weighted F1 + format + POI consistency)
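The difficulty-profiling step can be sketched as follows. This is an illustrative reimplementation, not the repo's exact formula: the inverse-recall mapping, the `floor` guard against zero recall, and the mean-1 normalization are all assumptions.

```python
# Hypothetical sketch of Phase 1 difficulty profiling: classes with low
# zero-shot recall receive larger training weights.
def difficulty_weights(zero_shot_recall, floor=0.05):
    # `floor` avoids an unbounded weight for classes with zero recall;
    # weights are normalized so their mean is 1.0 (both are assumptions).
    raw = [1.0 / max(r, floor) for r in zero_shot_recall]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]
```

A class recalled 10% of the time in zero-shot evaluation thus ends up with roughly 9x the weight of a class recalled 90% of the time.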

(Figure: SpaAudioLM three-phase training pipeline)

Results

Comparison on multi-label audio event classification (mean over 5 runs, %):

| Model | F1-Micro | F1-Macro | F1-Weighted | Jaccard | Exact Match |
|---|---|---|---|---|---|
| Qwen2-Audio-7B | 4.73 | 2.86 | 5.27 | 1.96 | 0.00 |
| Qwen2.5-Omni-7B | 34.36 | 25.90 | 37.35 | 18.31 | 9.97 |
| Qwen3-Omni-30B | 29.66 | 20.26 | 28.80 | 14.81 | 14.02 |
| GPT-4o Audio | 30.09 | 26.47 | 34.07 | 17.18 | 9.43 |
| Gemini 2.5 Pro | 44.24 | 40.35 | 47.65 | 28.04 | 15.58 |
| SpaAudioLM (Ours) | 73.36 | 63.48 | 72.98 | 53.57 | 54.47 |

Quick Start

1. Clone and Set Up Environment

git clone https://github.com/yushiran/SpaAudioLM.git
cd SpaAudioLM

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv sync
source .venv/bin/activate

2. Download Dataset

# Clone from HuggingFace (includes audio files, ~2GB)
git clone https://huggingface.co/datasets/shiran-yu/SpaAudioLM-Dataset data

This places the dataset at data/, which is the default DATASET_ROOT. The directory contains:

  • audio/ - 3,854 WAV files
  • sft_train.json - SFT training data with Chain-of-Thought annotations (2,697 samples)
  • grpo_train.json - GRPO training data (2,697 samples)
  • test.json - Test set (579 samples)
  • validation.json - Validation set (578 samples)
  • poi_features.json - POI metadata for all audio samples
  • class_labels.json - List of 28 sound event categories
  • splits/ - Train/test/validation CSV splits

3. Configure Environment

cp .env.example .env
# Edit .env with your paths:
#   MODEL_PATH          - HuggingFace model cache directory
#   R1_STRENGTH_SFT_MODEL_PATH  - SFT output directory
#   R1_STRENGTH_GRPO_MODEL_PATH - GRPO output directory

4. Download Pre-trained Model (Optional)

To run inference without training:

# Download SpaAudioLM weights from HuggingFace
huggingface-cli download shiran-yu/SpaAudioLM --local-dir models/SpaAudioLM

5. Training

Phase 1: Supervised Fine-Tuning (SFT)

# Single-node multi-GPU training (requires 4x GPUs, 32GB+ VRAM each)
bash app/src/sft/GeoOmniR1Strength-sft.sh

# Or submit to SLURM cluster
sbatch app/src/sft/GeoOmniR1Strength-sft-sbatch.batch

Key SFT hyperparameters (configured in the script):

  • Base model: Qwen/Qwen2.5-Omni-7B
  • Epochs: 6, LR: 1e-5, Batch size: 4/GPU
  • DeepSpeed Zero-2, full parameter fine-tuning
  • Custom loss: answer tokens weighted 5x over thinking tokens
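The answer-weighted loss amounts to a weighted average of per-token losses. A minimal sketch of the idea (not the repo's `loss_scale.py` plugin), where `token_losses` would be per-token cross-entropy values from the model:

```python
# Sketch: average per-token losses with answer tokens scaled 5x over
# chain-of-thought ("thinking") tokens, per the 5x factor stated above.
def answer_weighted_loss(token_losses, answer_mask, answer_weight=5.0):
    # answer_mask[i] is True where token i belongs to the final answer.
    weights = [answer_weight if is_answer else 1.0 for is_answer in answer_mask]
    return sum(w * l for w, l in zip(weights, token_losses)) / sum(weights)
```

Normalizing by the weight sum keeps the loss scale comparable across samples with different answer lengths.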

Phase 2: GRPO Reinforcement Learning

# Requires SFT checkpoint from Phase 1
# Ensure R1_STRENGTH_SFT_MODEL_PATH in .env points to your SFT checkpoint
bash app/src/grpo/GeoOmniR1-grpo-strength.sh

# Or submit to SLURM cluster
sbatch app/src/grpo/GeoOmniR1-grpo-strength-sbatch.batch

Key GRPO hyperparameters:

  • Epochs: 3, LR: 1e-6, Group size: 8
  • Rewards: weighted F1 (1.0) + format (0.1) + POI consistency (0.3)
  • KL coefficient: 0.05
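Putting the reward coefficients above together, the composite reward can be sketched as follows. This is an illustration, not the repo's `plugin.py`; in particular, the `<think>`/`<answer>` layout checked by `format_reward` is an assumption about the expected output format.

```python
import re

# Sketch of the composite GRPO reward with the coefficients listed above:
# weighted F1 (x1.0) + format (x0.1) + POI consistency (x0.3).
def format_reward(completion: str) -> float:
    # Assumed layout: a <think> block followed by an <answer> block.
    pattern = r"(?s)\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, completion) else 0.0

def composite_reward(weighted_f1: float, completion: str, poi_consistency: float) -> float:
    return 1.0 * weighted_f1 + 0.1 * format_reward(completion) + 0.3 * poi_consistency
```

With these coefficients, classification quality dominates the reward, while the format and POI terms act as smaller shaping signals.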

6. Inference

# SFT model inference
bash app/src/sft/GeoOmniR1Strength-sft-infer.sh

# GRPO model inference
bash app/src/grpo/GeoOmniR1-grpo-strength-infer.sh

# 5-run statistical evaluation
bash app/src/grpo/5timesInfer.sh

7. Evaluation

# Evaluate a single inference output
uv run app/src/GeoOmniR1Strength_evaluate.py --output_dir <path_to_output.json> --save_results

# Aggregate 5-run results (mean +/- std)
uv run app/src/evaluateAverageScore.py --base_dir <path_to_5runs_dir>
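For reference, two of the reported metrics can be reimplemented in a few lines. This is a minimal sketch on multi-hot label vectors (one slot per sound category), not the repo's evaluator in `app/utils/evaluate.py`:

```python
# Micro-F1 pools true/false positives and negatives across all
# (sample, class) pairs; exact match requires the full label set to agree.
def micro_f1(y_true, y_pred):
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t and p)
            fp += int(not t and p)
            fn += int(t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def exact_match(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

The large gap between Jaccard and Exact Match in the results table reflects how strict exact match is: one wrong class on an otherwise correct sample zeroes that sample's contribution.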

Project Structure

SpaAudioLM/
├── app/
│   ├── config.py                 # Central configuration (env vars, paths)
│   ├── prompts/                  # Jinja2 prompt templates
│   │   ├── prompt_manager.py
│   │   └── scripts/GeoOmniR1/   # Audio+POI, SFT, GRPO, ablation prompts
│   ├── utils/
│   │   ├── evaluate.py           # Multi-label metrics & visualization
│   │   └── poi_filter.py         # POI filtering + acoustic scene generation
│   └── src/
│       ├── sft/                  # SFT training & inference scripts
│       │   ├── plugin/loss_scale.py  # Token-level weighted loss
│       │   └── *.sh / *.batch
│       ├── grpo/                 # GRPO training & inference scripts
│       │   ├── plugin.py        # Reward functions (weighted F1, format, POI)
│       │   └── *.sh / *.batch
│       ├── dataset/              # Dataset generation pipeline
│       ├── GeoOmniR1Strength_evaluate.py
│       └── evaluateAverageScore.py
├── pyproject.toml
├── uv.lock
├── .env.example
└── LICENSE

Configuration

All paths are configured via environment variables (see .env.example):

| Variable | Default | Description |
|---|---|---|
| DATASET_ROOT | data | Root directory of the dataset |
| DATASET_PATH | data/audio | Audio files directory |
| MODEL_PATH | - | HuggingFace model cache |
| R1_STRENGTH_SFT_MODEL_PATH | - | SFT training output |
| R1_STRENGTH_GRPO_MODEL_PATH | - | GRPO training output |
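A sketch of how these defaults might be resolved (the repo's `app/config.py` is the actual source of truth; deriving `DATASET_PATH` from `DATASET_ROOT` is an assumption):

```python
# Resolve dataset/model paths from an environment-variable mapping.
def resolve_paths(env):
    root = env.get("DATASET_ROOT", "data")
    return {
        "DATASET_ROOT": root,
        "DATASET_PATH": env.get("DATASET_PATH", f"{root}/audio"),
        "MODEL_PATH": env.get("MODEL_PATH"),  # no default: must be set in .env
    }
```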

Citation

@article{hou2025spaaudioLM,
  title={SpaAudioLM: Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding},
  author={Hou, Yuanbo and Yu, Shiran and Zhi, Zhuo},
  year={2025}
}
