9 changes: 8 additions & 1 deletion .gitignore
@@ -83,4 +83,11 @@ baseline*.json

.vscode/

*results.json
benchmark_results_*.json

.vscode/

node_modules/

.venv/
70 changes: 55 additions & 15 deletions CLAUDE.md
@@ -6,7 +6,7 @@ FluidAudioSwift is a speaker diarization system for Apple platforms using Core M
## Current Performance Baseline (AMI Benchmark)
- **Dataset**: AMI SDM (Single Distant Microphone)
- **Current Results**: DER: 81.0%, JER: 24.4%, RTF: 0.02x
- **Research Benchmarks**:
- Powerset BCE (2023): 18.5% DER
- EEND (2019): 25.3% DER
- x-vector clustering: 28.7% DER
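For orientation, DER combines three error types over the total reference speech duration; a minimal sketch of the standard formula (function and parameter names are illustrative, not the project's API):

```swift
// Diarization Error Rate (DER), as used by AMI-style benchmarks:
// DER = (falseAlarm + missedSpeech + speakerConfusion) / totalSpeech
// All inputs are durations in seconds; names here are illustrative.
func diarizationErrorRate(falseAlarm: Double,
                          missedSpeech: Double,
                          speakerConfusion: Double,
                          totalSpeech: Double) -> Double {
    guard totalSpeech > 0 else { return 0 }
    return (falseAlarm + missedSpeech + speakerConfusion) / totalSpeech
}

// Example: 30 s false alarm, 45 s missed, 102 s confused over 1000 s
// of reference speech gives a DER of 0.177 (17.7%).
```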
@@ -69,7 +69,7 @@ The CLI needs to be extended to support:
- Run 3-5 iterations of same config to measure stability
- Calculate mean ± std deviation for DER, JER, RTF
- **RED FLAG**: If std deviation > 5%, investigate non-deterministic behavior

2. **Deep error analysis** (act like forensic ML engineer):
- **If DER > 60%**: Likely clustering failure - speakers being confused
- **If JER > DER**: Timeline alignment issues - check duration parameters
@@ -92,19 +92,19 @@ The CLI needs to be extended to support:
- **Test parameter extremes first**: (0.3, 0.9) for clusteringThreshold
- **CONSISTENCY CHECK**: If extreme values give identical results → INVESTIGATE
- **SANITY CHECK**: If threshold=0.9 gives same DER as threshold=0.3 → MODEL ISSUE

3. **Expert troubleshooting triggers**:
```
IF (same DER across 3+ different parameter values):
→ Check if parameters are actually being used
→ Verify model isn't using cached/default values
→ Add debug logging to confirm parameter propagation

IF (DER increases when it should decrease):
→ Analyze what type of errors increased
→ Check if we're optimizing the wrong bottleneck
→ Verify ground truth data integrity

IF (improvement then sudden degradation):
→ Look for parameter interaction effects
→ Check if we hit a threshold/boundary condition
@@ -137,7 +137,7 @@ The CLI needs to be extended to support:
- Are we creating too many micro-clusters?
- Is the similarity metric broken?
- Are we hitting edge cases in clustering algorithm?

IF (longer minDurationOn → worse performance):
THEN check:
- Are we filtering out too much real speech?
@@ -162,10 +162,10 @@ The CLI needs to be extended to support:
```
IF (DER variance > 10% across files):
→ Need more robust parameters, not just lowest DER

IF (no improvement after 5 tests):
→ Switch to different parameter or try combinations

IF (improvements < 2% but consistent):
→ Continue fine-tuning in smaller steps
```
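The decision rules in the block above can be encoded directly; a sketch with illustrative names and thresholds matching the pseudocode:

```swift
// Swift sketch of the tuning heuristics above (names are illustrative).
enum NextStep { case robustify, switchParameter, fineTune, continueSweep }

func decideNextStep(derVariancePercent: Double,
                    testsWithoutImprovement: Int,
                    lastImprovementPercent: Double) -> NextStep {
    if derVariancePercent > 10 { return .robustify }          // prefer robust params over lowest DER
    if testsWithoutImprovement >= 5 { return .switchParameter } // try a different parameter
    if lastImprovementPercent > 0 && lastImprovementPercent < 2 {
        return .fineTune                                       // continue in smaller steps
    }
    return .continueSweep
}
```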
@@ -192,7 +192,7 @@ The CLI needs to be extended to support:

```
START optimization iteration:
├── Results identical to previous?
│ ├── YES → INVESTIGATE: Parameter not being used / Model caching
│ └── NO → Continue
├── Results worse than expected?
@@ -208,7 +208,7 @@ START optimization iteration:

**Immediately investigate if you see:**
- Same DER across 4+ different parameter values
- DER improvement then sudden 20%+ degradation
- RTF varying by >50% with same parameters
- JER > DER consistently (suggests timeline issues)
- Parameters having opposite effect than expected
@@ -236,7 +236,7 @@ START optimization iteration:
DiarizerConfig(
clusteringThreshold: 0.7, // Optimal value: 17.7% DER
minDurationOn: 1.0, // Default working well
minDurationOff: 0.5, // Default working well
minActivityThreshold: 10.0, // Default working well
debugMode: false
)
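One of the troubleshooting triggers above is confirming that config values actually propagate into the pipeline; a minimal debug-logging sketch (the field names match the `DiarizerConfig` above, the helper itself is hypothetical):

```swift
// Hypothetical helper: log the config the pipeline actually received, so a
// "same DER across 3+ parameter values" symptom can be traced to propagation.
func logEffectiveConfig(_ config: DiarizerConfig) {
    print("""
    Effective config:
      clusteringThreshold  = \(config.clusteringThreshold)
      minDurationOn        = \(config.minDurationOn)
      minDurationOff       = \(config.minDurationOff)
      minActivityThreshold = \(config.minActivityThreshold)
    """)
}
```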
@@ -254,7 +254,7 @@ DiarizerConfig(

### Clustering Threshold Impact (ES2004a):
- **0.1**: 75.8% DER - Over-clustering (153+ speakers), severe speaker confusion
- **0.5**: 20.6% DER - Still too many speakers
- **0.7**: 17.7% DER - **OPTIMAL** - Good balance, ~9 speakers
- **0.8**: 18.0% DER - Nearly optimal, slightly fewer speakers
- **0.9**: 40.2% DER - Under-clustering, too few speakers
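A sweep like the one above reduces to picking the threshold with minimum DER; a small sketch using the ES2004a numbers (the data literal mirrors the list above, the selection code is illustrative):

```swift
// Select the best clustering threshold from a parameter sweep.
// Values mirror the ES2004a results above: (threshold, DER %).
let sweep: [(threshold: Double, der: Double)] = [
    (0.1, 75.8), (0.5, 20.6), (0.7, 17.7), (0.8, 18.0), (0.9, 40.2)
]
if let best = sweep.min(by: { $0.der < $1.der }) {
    // Prints: optimal threshold 0.7 at 17.7% DER
    print("optimal threshold \(best.threshold) at \(best.der)% DER")
}
```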
@@ -267,7 +267,7 @@ DiarizerConfig(

## Final Recommendations

### 🎉 MISSION ACCOMPLISHED!

**Target Achievement**: ✅ DER < 30% → **Achieved 17.7% DER**
**Research Competitive**: ✅ Better than EEND (25.3%) and x-vector (28.7%)
Expand Down Expand Up @@ -297,7 +297,7 @@ DiarizerConfig(

### Architecture Insights:
- **Online diarization works well** for benchmarking with proper clustering
- **Chunk-based processing** (10-second chunks) doesn't hurt performance significantly
- **Speaker tracking across chunks** is effective with current approach

## Instructions for Claude Code
@@ -363,4 +363,44 @@ The CLI now provides **beautiful tabular output** that's easy to read and parse:

- DER improvements < 1% for 3 consecutive parameter tests
- DER reaches target of < 30% (✅ **ACHIEVED: 17.7%**)
- All parameter combinations in current phase tested

## Benchmarking

### Metal Acceleration Benchmarks

The project includes comprehensive benchmarks to measure Metal vs Accelerate performance:

```bash
# Run complete benchmark suite
swift test --filter MetalAccelerationBenchmarks

# Run specific benchmark categories
swift test --filter testCosineDistanceBatchSizeBenchmark
swift test --filter testEndToEndDiarizationBenchmark
swift test --filter testMemoryUsageBenchmark

# Use the convenience script
./scripts/run-benchmarks.sh
```

**Benchmark categories:**
- **Cosine distance calculations**: Batch size optimization (8-128 embeddings)
- **Powerset conversion operations**: GPU vs CPU compute kernels
- **End-to-end diarization**: Real-world performance comparison
- **Memory usage analysis**: Peak memory consumption comparison
- **Scalability testing**: Performance across different matrix sizes
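The cosine-distance benchmark compares a CPU reference against the batched Metal path; a plain-Swift sketch of the per-pair CPU computation (not the project's actual implementation):

```swift
// Cosine distance between two speaker embeddings: 1 - cosine similarity.
// Plain-Swift CPU reference; the Metal path batches many pairs at once.
func cosineDistance(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "embeddings must have equal dimension")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denom = normA.squareRoot() * normB.squareRoot()
    // Degenerate (zero) embeddings are treated as maximally distant.
    return denom > 0 ? 1 - dot / denom : 1
}
```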

**CI Integration:**
- Automated benchmarks run on all PRs
- Performance regression detection
- Automated PR comments with results
- Baseline comparison against main branch

## Troubleshooting

- Model downloads may fail in test environments; this is expected behavior
- First-time initialization requires network access for model downloads
- Models are cached in `~/Library/Application Support/SpeakerKitModels/coreml/`
- Enable debug mode in config for detailed logging
- Metal acceleration may be slower for small operations due to GPU overhead
67 changes: 47 additions & 20 deletions README.md
@@ -31,9 +31,11 @@ FluidAudioSwift is a high-performance Swift framework for on-device speaker diar
- **State-of-the-Art Diarization**: Research-competitive speaker separation with optimal speaker mapping
- **Speaker Embedding Extraction**: Generate speaker embeddings for voice comparison and clustering
- **CoreML Integration**: Native Apple CoreML backend for optimal performance on Apple Silicon and iOS support
- **Metal Performance Shaders**: GPU-accelerated computations with 3-8x speedup for batch operations
- **Real-time Processing**: Support for streaming audio processing with minimal latency
- **Cross-platform**: Full support for macOS 13.0+ and iOS 16.0+
- **Comprehensive CLI**: Professional benchmarking tools with beautiful tabular output
- **Comprehensive Benchmarking**: Built-in performance testing and optimization tools

## Installation

@@ -75,44 +77,69 @@ let config = DiarizerConfig(
minDurationOn: 1.0, // Minimum speech duration (seconds)
minDurationOff: 0.5, // Minimum silence between speakers (seconds)
numClusters: -1, // Number of speakers (-1 = auto-detect)
useMetalAcceleration: true, // Enable GPU acceleration (recommended)
metalBatchSize: 32, // Optimal batch size for GPU operations
debugMode: false
)
```
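A hedged sketch of using the config above: `DiarizerManager` is the framework's main class (per the API reference below), but the initializer and method shown here are assumptions — check the actual API before copying:

```swift
// Sketch: wiring the config into the diarization pipeline.
// `DiarizerManager` is the framework's main class; the exact initializer
// and `diarize` signature here are assumptions, not confirmed API.
let diarizer = DiarizerManager(config: config)

// let segments = try await diarizer.diarize(audioFileURL)
// Expected shape of results: per-speaker segments with start/end times.
```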

## Command Line Interface (CLI)

FluidAudioSwift includes a powerful CLI tool for benchmarking and processing audio files:

```bash
# Build the CLI
swift build

# Run AMI corpus benchmarks
swift run fluidaudio benchmark --dataset ami-sdm
swift run fluidaudio benchmark --dataset ami-ihm --threshold 0.8 --output results.json

# Process individual audio files
swift run fluidaudio process meeting.wav --output results.json
```

### CLI Commands

- **`benchmark`**: Run standardized research benchmarks on AMI Meeting Corpus
- **`process`**: Process individual audio files with speaker diarization
- **`help`**: Show detailed usage information and examples

### Supported Benchmark Datasets

- **AMI-SDM**: Single Distant Microphone (Mix-Headset.wav files) - realistic meeting conditions
- **AMI-IHM**: Individual Headset Microphones (Headset-0.wav files) - clean audio conditions

See [docs/CLI.md](docs/CLI.md) for complete CLI documentation and examples.

## Performance & Benchmarking

FluidAudioSwift includes comprehensive benchmarking tools to measure and optimize performance:

```bash
# Run complete benchmark suite
swift test --filter MetalAccelerationBenchmarks

# Run benchmarks with detailed reporting
./scripts/run-benchmarks.sh

# Research-standard AMI corpus evaluation
swift run fluidaudio benchmark --dataset ami-sdm --output benchmark-results.json
```

### Metal Acceleration

The framework automatically leverages Metal Performance Shaders for GPU acceleration:

- **3-8x speedup** for batch embedding calculations
- **Automatic fallback** to Accelerate framework when Metal unavailable
- **Optimal batch sizes** determined through continuous benchmarking
- **Memory efficient** GPU operations with smart caching
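The fallback decision described above can be sketched as a simple gate: use Metal only when a GPU device exists and the batch is large enough to amortize dispatch overhead (the batch-size threshold here is illustrative, not the framework's tuned value):

```swift
import Metal

// Sketch of the automatic Metal-vs-Accelerate fallback described above.
// The minimum batch size is an illustrative placeholder; the framework
// determines its own optimal value through benchmarking.
func shouldUseMetal(batchSize: Int, minBatchForGPU: Int = 32) -> Bool {
    // No Metal device (e.g. some CI/VM environments) -> Accelerate path.
    guard MTLCreateSystemDefaultDevice() != nil else { return false }
    // Small batches lose more to GPU dispatch overhead than they gain.
    return batchSize >= minBatchForGPU
}
```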

See [docs/BENCHMARKING.md](docs/BENCHMARKING.md) for detailed performance analysis and optimization guidelines.

For technical implementation details, see [docs/METAL_ACCELERATION.md](docs/METAL_ACCELERATION.md).

## API Reference

- **`DiarizerManager`**: Main diarization class