# SpaAudioLM: Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding
Existing environmental sound classification (ESC) methods treat sounds as isolated signals, ignoring where they occur. SpaAudioLM bridges this gap by jointly reasoning over audio and geospatial Point-of-Interest (POI) metadata, enabling spatially grounded sound understanding across 28 environmental sound categories.
We fine-tune Qwen2.5-Omni-7B through a three-phase pipeline:
- **Difficulty Profiling**: derive per-class difficulty weights from zero-shot recall
- **Constrained CoT SFT**: supervised fine-tuning with an answer-weighted loss
- **Difficulty-Aware GRPO**: reinforcement learning with a composite reward (weighted F1 + format + POI consistency)
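A minimal sketch of the difficulty-profiling step: weights grow as zero-shot recall falls. The exact recall-to-weight mapping and the `floor` parameter are illustrative assumptions, not taken from the repo:

```python
def difficulty_weights(recall_per_class, floor=0.1):
    """Map zero-shot recall to per-class difficulty weights.

    Classes the base model already recalls well get small weights;
    hard classes (low recall) get large ones. The mapping below
    (1 - recall, floored, then normalized to mean 1) is one
    plausible choice -- the actual mapping may differ.
    """
    raw = {c: max(floor, 1.0 - r) for c, r in recall_per_class.items()}
    total = sum(raw.values())
    n = len(raw)
    # Normalize so weights average to 1 across classes.
    return {c: w * n / total for c, w in raw.items()}
```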
Comparison on multi-label audio event classification (mean +/- std over 5 runs, %):
| Model | F1-Micro | F1-Macro | F1-Weighted | Jaccard | Exact Match |
|---|---|---|---|---|---|
| Qwen2-Audio-7B | 4.73 | 2.86 | 5.27 | 1.96 | 0.00 |
| Qwen2.5-Omni-7B | 34.36 | 25.90 | 37.35 | 18.31 | 9.97 |
| Qwen3-Omni-30B | 29.66 | 20.26 | 28.80 | 14.81 | 14.02 |
| GPT-4o Audio | 30.09 | 26.47 | 34.07 | 17.18 | 9.43 |
| Gemini 2.5 Pro | 44.24 | 40.35 | 47.65 | 28.04 | 15.58 |
| SpaAudioLM (Ours) | 73.36 | 63.48 | 72.98 | 53.57 | 54.47 |
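For reference, the two set-based metrics in the table (Jaccard and Exact Match) can be computed per sample and averaged as below. The function name and formulation are our own sketch; the repo's metrics live in `app/utils/evaluate.py`:

```python
def multilabel_scores(y_true, y_pred):
    """Sample-averaged Jaccard and exact-match for multi-label predictions.

    y_true and y_pred are parallel lists of label sets. An empty
    union counts as a perfect Jaccard match by convention.
    """
    jac, exact = [], []
    for t, p in zip(y_true, y_pred):
        union = t | p
        jac.append(len(t & p) / len(union) if union else 1.0)
        exact.append(t == p)
    return sum(jac) / len(jac), sum(exact) / len(exact)
```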
## Installation

```bash
git clone https://github.com/yushiran/SpaAudioLM.git
cd SpaAudioLM

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv sync
source .venv/bin/activate
```

## Dataset

```bash
# Clone from HuggingFace (includes audio files, ~2GB)
git clone https://huggingface.co/datasets/shiran-yu/SpaAudioLM-Dataset data
```

This places the dataset at `data/`, which is the default `DATASET_ROOT`. The directory contains:
- `audio/` - 3,854 WAV files
- `sft_train.json` - SFT training data with Chain-of-Thought annotations (2,697 samples)
- `grpo_train.json` - GRPO training data (2,697 samples)
- `test.json` - Test set (579 samples)
- `validation.json` - Validation set (578 samples)
- `poi_features.json` - POI metadata for all audio samples
- `class_labels.json` - List of 28 sound event categories
- `splits/` - Train/test/validation CSV splits
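A small sketch of how a split and its POI metadata might be joined. The `audio_id` field name is a guess; check the actual JSON keys in the dataset:

```python
import json
from pathlib import Path


def load_split(root, name):
    """Load one dataset split and attach POI metadata to each sample.

    Assumes each sample carries an "audio_id" that keys into
    poi_features.json; the exact field name may differ.
    """
    root = Path(root)
    samples = json.loads((root / f"{name}.json").read_text())
    poi = json.loads((root / "poi_features.json").read_text())
    for s in samples:
        s["poi"] = poi.get(s.get("audio_id", ""), {})
    return samples
```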
## Configuration

```bash
cp .env.example .env
# Edit .env with your paths:
# MODEL_PATH - HuggingFace model cache directory
# R1_STRENGTH_SFT_MODEL_PATH - SFT output directory
# R1_STRENGTH_GRPO_MODEL_PATH - GRPO output directory
```

To run inference without training:
```bash
# Download SpaAudioLM weights from HuggingFace
huggingface-cli download shiran-yu/SpaAudioLM --local-dir models/SpaAudioLM
```

## SFT Training

```bash
# Single-node multi-GPU training (requires 4x GPUs, 32GB+ VRAM each)
bash app/src/sft/GeoOmniR1Strength-sft.sh

# Or submit to SLURM cluster
sbatch app/src/sft/GeoOmniR1Strength-sft-sbatch.batch
```

Key SFT hyperparameters (configured in the script):
- Base model: `Qwen/Qwen2.5-Omni-7B`
- Epochs: 6, LR: 1e-5, Batch size: 4/GPU
- DeepSpeed ZeRO-2, full-parameter fine-tuning
- Custom loss: answer tokens weighted 5x over thinking tokens
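The answer-weighted loss can be illustrated on per-token losses. The real implementation lives in `app/src/sft/plugin/loss_scale.py` and operates inside the training loop; this is only a sketch of the weighting scheme:

```python
def weighted_token_loss(token_nll, answer_mask, answer_weight=5.0):
    """Average per-token NLL with answer tokens up-weighted.

    token_nll: per-token negative log-likelihoods.
    answer_mask: 1 for tokens in the final answer span, 0 for
    chain-of-thought ("thinking") tokens. With answer_weight=5.0,
    answer tokens contribute 5x to the normalized loss.
    """
    scales = [answer_weight if m else 1.0 for m in answer_mask]
    return sum(l * s for l, s in zip(token_nll, scales)) / sum(scales)
```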
## GRPO Training

```bash
# Requires SFT checkpoint from Phase 1
# Ensure R1_STRENGTH_SFT_MODEL_PATH in .env points to your SFT checkpoint
bash app/src/grpo/GeoOmniR1-grpo-strength.sh

# Or submit to SLURM cluster
sbatch app/src/grpo/GeoOmniR1-grpo-strength-sbatch.batch
```

Key GRPO hyperparameters:
- Epochs: 3, LR: 1e-6, Group size: 8
- Rewards: weighted F1 (1.0) + format (0.1) + POI consistency (0.3)
- KL coefficient: 0.05
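The composite reward with the weights listed above can be sketched as follows. The actual reward functions live in `app/src/grpo/plugin.py`; here `poi_labels` stands in for the set of sound classes plausible at the sample's POI, and the POI-consistency formula is an assumption:

```python
def composite_reward(pred_labels, true_labels, poi_labels, format_ok,
                     class_weights, w_f1=1.0, w_fmt=0.1, w_poi=0.3):
    """Composite GRPO reward: weighted F1 + format + POI consistency.

    class_weights are the per-class difficulty weights from profiling
    (missing classes default to 1.0). format_ok is True when the
    model's output parses into the expected answer format.
    """
    tp = sum(class_weights.get(c, 1.0) for c in pred_labels & true_labels)
    p_den = sum(class_weights.get(c, 1.0) for c in pred_labels)
    r_den = sum(class_weights.get(c, 1.0) for c in true_labels)
    prec = tp / p_den if p_den else 0.0
    rec = tp / r_den if r_den else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    # Fraction of predicted classes consistent with the POI context.
    poi_ok = len(pred_labels & poi_labels) / len(pred_labels) if pred_labels else 0.0
    return w_f1 * f1 + w_fmt * float(format_ok) + w_poi * poi_ok
```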
## Inference & Evaluation

```bash
# SFT model inference
bash app/src/sft/GeoOmniR1Strength-sft-infer.sh

# GRPO model inference
bash app/src/grpo/GeoOmniR1-grpo-strength-infer.sh

# 5-run statistical evaluation
bash app/src/grpo/5timesInfer.sh
```

```bash
# Evaluate a single inference output
uv run app/src/GeoOmniR1Strength_evaluate.py --output_dir <path_to_output.json> --save_results

# Aggregate 5-run results (mean +/- std)
uv run app/src/evaluateAverageScore.py --base_dir <path_to_5runs_dir>
```

## Project Structure

```
SpaAudioLM/
├── app/
│   ├── config.py                       # Central configuration (env vars, paths)
│   ├── prompts/                        # Jinja2 prompt templates
│   │   ├── prompt_manager.py
│   │   └── scripts/GeoOmniR1/          # Audio+POI, SFT, GRPO, ablation prompts
│   ├── utils/
│   │   ├── evaluate.py                 # Multi-label metrics & visualization
│   │   └── poi_filter.py               # POI filtering + acoustic scene generation
│   └── src/
│       ├── sft/                        # SFT training & inference scripts
│       │   ├── plugin/loss_scale.py    # Token-level weighted loss
│       │   └── *.sh / *.batch
│       ├── grpo/                       # GRPO training & inference scripts
│       │   ├── plugin.py               # Reward functions (weighted F1, format, POI)
│       │   └── *.sh / *.batch
│       ├── dataset/                    # Dataset generation pipeline
│       ├── GeoOmniR1Strength_evaluate.py
│       └── evaluateAverageScore.py
├── pyproject.toml
├── uv.lock
├── .env.example
└── LICENSE
```
All paths are configured via environment variables (see `.env.example`):

| Variable | Default | Description |
|---|---|---|
| `DATASET_ROOT` | `data` | Root directory of the dataset |
| `DATASET_PATH` | `data/audio` | Audio files directory |
| `MODEL_PATH` | - | HuggingFace model cache |
| `R1_STRENGTH_SFT_MODEL_PATH` | - | SFT training output |
| `R1_STRENGTH_GRPO_MODEL_PATH` | - | GRPO training output |
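A sketch of how these variables might be resolved in code, with the defaults from the table. The dictionary keys are illustrative; the actual attribute names in `app/config.py` may differ:

```python
import os
from pathlib import Path


def load_config():
    """Resolve paths from environment variables, falling back to the
    documented defaults where one exists."""
    return {
        "dataset_root": Path(os.environ.get("DATASET_ROOT", "data")),
        "dataset_path": Path(os.environ.get("DATASET_PATH", "data/audio")),
        "model_path": os.environ.get("MODEL_PATH"),        # no default
        "sft_out": os.environ.get("R1_STRENGTH_SFT_MODEL_PATH"),
        "grpo_out": os.environ.get("R1_STRENGTH_GRPO_MODEL_PATH"),
    }
```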
## Citation

```bibtex
@article{hou2025spaaudioLM,
  title={SpaAudioLM: Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding},
  author={Hou, Yuanbo and Yu, Shiran and Zhi, Zhuo},
  year={2025}
}
```