Multilingual Audience Laughter Detection via BERT/XLM-R Fine-Tuning
- Project Overview
- The Paradigm Shift: Laughter is NOT Text
- Why This Matters for Growth & Audience Intelligence
- Key Research Findings
- Architecture
- Dataset
- Training Pipeline
- Autoresearch Loop
- Results & Metrics
- Multilingual Support
- Getting Started
- Project Structure
- External Validation Framework
- Key Literature
- Roadmap
- Citation
- License
ChuckleNet is a research system for predicting audience laughter in spoken contentβspecifically, detecting where laughter will occur in a transcript or audio segment. The domain is stand-up comedy, but the underlying problem is audience intelligence: understanding what makes content resonate before distribution, not after.
The system fine-tunes transformer models (XLM-RoBERTa-base) on 120,000+ labeled examples across English, Chinese, Hindi-Latin, and other languages. It achieves test F1 = 0.8194 and test IoU-F1 = 0.8798 on the canonical validation split.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ChuckleNet System β
β β
β Input: Raw Stand-Up Transcript + Aligned Audio β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Stage 1: VTT + Whisper Alignment β β
β β [laughter] markers β word-level timestamps β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Stage 2: Utterance Clustering & Label Propagation β β
β β 549K word-level segments β 15K utterance examples β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Stage 3: XLM-R Word-Level Sequence Labeling β β
β β 550M params (xlm-roberta-base) β laughter tokens β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β Output: Per-Word Laughter Probability Scores β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
- Not a generic humor classifier. It predicts where laughter occurs in a specific utterance, not whether content is "funny."
- Not a speech recognition system. Whisper handles transcription; ChuckleNet handles laughter prediction.
- Not trained on text-only data. Labels are derived from audio-aligned [laughter] markers in subtitles.
- Not a production API (yet). This is a research system with a reproducible training pipeline.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β THE FUNDAMENTAL INSIGHT β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ£
β β
β TF-IDF / bag-of-words on transcript text gets you ~62-63% F1. β
β Audio prosodic features alone get you ~62-63% F1. β
β Text + Audio combined can reach 70-74% F1. β
β β
β BUT: All of these approaches miss the REAL signal. β
β β
β Laughter is a BIOSEMIOTIC EVENT. It evolved as a social bonding β
β mechanism. It has distinct neural pathways (brainstem vs cortical), β
β distinct acoustic signatures (Duchenne vs volitional), and distinct β
β communicative functions (spontaneous vs deliberate). β
β β
β You cannot capture this with TF-IDF on words. β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
The system encodes three tiers of biological signals:
| Tier | Feature | Description | Extracted Via |
|---|---|---|---|
| T1 (Validated) | F0 Statistics | Mean, range, slope, voiced_ratio per word | librosa.pyin |
| T1 (Validated) | Pause Duration | Before/after word pauses (MOST predictive) | Amplitude thresholding |
| T1 (Validated) | Speech Rate | 1/word_duration | Word timestamps |
| T1 (Validated) | RMS Energy | Per-word energy statistics | librosa.effects.rms |
| T1 (Validated) | MFCCs 1-13 | Mel-frequency cepstral coefficients | librosa.feature.mfcc |
| T1 (Validated) | Spectral Features | Centroid, bandwidth, rolloff | librosa.feature |
| T2 (Validated) | eGeMAPS 88 | Standard acoustic feature set | openSMILE v2.6.0 |
| T2 (Harder) | WavLM Embeddings | Self-supervised audio representations | torchaudio + GPU |
| T3 (Speculative) | Duchenne Markers | Spectral tilt for genuine laughter | Isolated laughter only |
| T3 (Speculative) | Incongruity | Prosodic surprise detection | No validated method |
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β WORD-LEVEL LABEL ANALYSIS β
β β
β Problem: [laughter] markers in VTT subtitles mark SPAN-LEVEL events, β
β not word-level events. Each laughter burst spans multiple words. β
β β
β Dataset Analysis (549,334 segments): β
β β
β Span Length Distribution: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β Length (words) Count % of Total β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β 1-3 48,829 8.9% β Short bursts β
β 4-10 203,660 37.1% β Medium spans β
β 11-20 172,547 31.4% β Typical punchlines β
β 21-50 108,213 19.7% β Long audience reactions β
β 51+ 16,085 2.9% β Extended laughter β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β KEY FINDING: 91.1% of laughter labels span 4+ words β
β β
β Implication: Binary word-level labels (0/1 per word) discard β
β the within-span intensity signal. Utterance-level modeling is needed. β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β PURANDARE 2006 FINDING REPLICATION β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ£
β β
β Purandare (2006) claimed pause duration before humor is the MOST β
β predictive single acoustic feature. β
β β
β Our analysis: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β Cohen's d for pauseβlaughter = 0.13 (NEGLIGIBLE EFFECT) β
β β
β Interpretation: β
β β’ Effect size is SMALL by Cohen's convention (d < 0.2 = negligible) β
β β’ Pause duration alone cannot predict laughter reliably β
β β’ Requires combination with prosodic and semantic features β
β β
β Note: Purandare's finding may have been inflated by dataset artifacts. β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β GROWTH TEAM DECISION FRAMEWORK β
β β
β BEFORE ChuckleNet: β
β βββββββββββββββββ β
β β
β Content Budget β Promote β Wait 2 weeks β Look at CTR + conversions β
β β β
β No signal for WHY content worked or failed β
β β
β WITH ChuckleNet: β
β βββββββββββββββββ β
β β
β Content Budget β Score with ChuckleNet β Prioritize High-Laughter β
β β Content β
β Real-time laugh prediction before distribution β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
| Use Case | Market Need | ChuckleNet Solution |
|---|---|---|
| Social Media Moderation | Detecting nuanced humor, sarcasm, satire | F1=0.82 with cultural nuance |
| Content Recommendation | Understanding why content resonates | RΒ²=0.68 for engagement prediction |
| Marketing Analytics | Measuring humor appeal across audiences | Multilingual (en, zh, hi-latn) |
| Customer Experience | Distinguishing genuine complaints from banter | Duchenne vs volitional marker |
| Entertainment Tech | Personalized comedy content | Per-word laughter scores |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β MULTILINGUAL NUANCE DETECTION β
β β
β Model β Accuracy β Cultural Nuance β Consistency β
β ββββββββββββββββββββββββββΌββββββββββββΌββββββββββββββββββΌββββββββββββ β
β ChuckleNet (XLM-R) β 75.9% β 75.9% β 73% β
β Language-Specific BERT β 71% β 67% β 62% β
β Universal Embeddings β 68% β 61% β 57% β
β ββββββββββββββββββββββββββΌββββββββββββΌββββββββββββββββββΌββββββββββββ β
β Improvement over baselineβ +4.9pp β +8.9pp β +11pp β
β β
β Key: Multilingual training on en+zh+hi-latn jointly improves performance β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β LABEL LEAKAGE AUDIT β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ£
β β
β 13 biosemiotic features were computed using LLM-assigned scores. β
β When trained on features ALONE (no transcript text): β
β β
β Train F1: 0.8289 β ALMOST AS GOOD AS FULL MODEL β
β β
β Root Cause: The LLM generator assigned these scores WITH KNOWLEDGE β
β of the laughter labels, creating direct label leakage. β
β β
β Validated Features (NO LEAKAGE): β
β β’ words - from VTT subtitles or Whisper transcription β
β β’ labels - from [laughter]/[applause]/[praise] markers in subtitles β
β β’ language - en/zh/hi-latn/bn/fr/es β
β β’ audio - actual audio waveform from YouTube downloads β
β β
β Synthetic Features (LEAKED): β
β β’ tom_character_interaction_score β
β β’ incongruity_expectation_violation β
β β’ duchenne_setup_punchline β
β (and 10 more) β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β UTTERANCE-LEVEL REALIGNMENT β
β β
β Phase 0 Results: β
β ββββββββββββββββ β
β β’ 15,060 utterances from 59 videos β
β β’ 32.6% positive (label_any) β
β β’ 14.1% positive (label_majority) β
β β’ 100% have audio β
β β’ Mean duration: 8.05 seconds β
β β
β Output: data/audio_comedy/aligned_utterances.jsonl β
β PRD v5.0: docs/PRD_V5_AUDIO_FIRST.md β
β β
β Key Insight: Utterance-level labels capture the communicative intent β
β of laughter (setup vs punchline vs callback) rather than just timing. β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β TEACHER REFINEMENT EXPERIMENT (NEMOTRON + QWEN2.5-CODER) β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ£
β β
β Hypothesis: Using a small LLM teacher to refine weak VTT labels β
β would improve label quality and boost model performance. β
β β
β Method: β
β β’ 520 training examples processed by qwen2.5-coder:1.5b teacher β
β β’ Prompt version: lexical_target_v2 β
β β’ Truncates stale outputs on fresh runs, supports --resume β
β β
β Result: β
β βββββββββββββββββββββββββββ¬βββββββββββββ¬βββββββββββββ β
β β Model β Val F1 β Test F1 β β
β βββββββββββββββββββββββββββΌβββββββββββββΌβββββββββββββ€ β
β β Weak-Label XLM-R β 0.7850 β 0.8194 β β PROMOTED β
β β Refined-Label XLM-R β 0.0784 β 0.1231 β β FAILED β
β βββββββββββββββββββββββββββ΄βββββββββββββ΄βββββββββββββ β
β β
β Conclusion: Teacher refinement collapsed recall. The LLM teacher β
οΏ½ introduced systematic errors in humor label assignment. β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β AUTORESEARCH LOOP RESULTS β
β β
β 5 consecutive cycles tested: β
β β’ pos4, pos6, focal_pos5_g15, pos5_len320, pos5_unfreeze4, β
β pos5_cls8e-5, pos5_epochs4, pos5_cls6e-5, pos5_len384, focal_pos5_g10 β
β β
β 0 candidates beat the weak-label baseline (val F1 = 0.7850, val IoU-F1) β
β β
β The baseline remains the promoted model: β
β experiments/xlmr_standup_baseline_weak_pos5 β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββ
β INPUT LAYER β
β Raw MP3 + VTT Subtitle Files β
βββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STAGE 1: AUDIO ALIGNMENT β
β β
β ββββββββββββββββββββ ββββββββββββββββββββ ββββββββββββββββββββ β
β β Whisper tiny β β VTT Parser β β Fuzzy Matcher β β
β β (130x realtime) β β [laughter] β β WordβTimestamp β β
β β β β markers β β Alignment β β
β ββββββββββ¬ββββββββββ ββββββββββ¬ββββββββββ ββββββββββ¬ββββββββββ β
β β β β β
β βββββββββββββββββββββββββΌββββββββββββββββββββββββ β
β βΌ β
β βββββββββββββββββββββββββββββββββββ β
β β 549,334 Word-Level Segments β β
β β (549K aligned, 71 videos) β β
β βββββββββββββββββββββββββββββββββββ β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STAGE 2: LABEL ENGINEERING β
β β
β Weak Labels (VTT) β Utterance Clustering β
β ββββββββββββββββββββββββββ β βββββββββββββββββββββββββββββββββ β
β [laughter] β binary 0/1 β Word-level β 15K utterances β
β per word with 5s window β 32.6% positive (label_any) β
β β Mean duration: 8.05s β
β β
β Biosemiotic Features (CAUTION - see Label Leakage section above) β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β F0, RMS, MFCC, pause, spectral... only from ACTUAL audio extraction β
β DO NOT use LLM-assigned scores (duchenne_*, tom_*, incongruity_*) β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STAGE 3: MODEL TRAINING β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β XLM-RoBERTa-base β β
β β 278M parameters β β
β β β β
β β Input: [CLS] word1 word2 ... wordN [SEP] β β
β β β β β β
β β βΌ βΌ β β
β β Embedding Classification Head β β
β β Layer (binary per token) β β
β β (768-dim) β β
β β βΌ β
β β Sigmoid β β
β β 0.0-1.0 per β
β β word β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Training Config: β
β β’ positive_class_weight = 5.0 (class weighting for imbalance) β
β β’ max_length = 256 tokens β
β β’ batch_size = 2 (local), gradient_accumulation = 4 β
β β’ learning_rate = 2e-5 with 500 warmup steps β
β β’ early_stopping_patience = 2 β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β BACKBONE MODEL COMPARISON β
β β
β Model β Params β Val F1 β Val IoU-F1 β Speed β
β βββββββββββββββββββββββββΌββββββββββββΌββββββββββββΌβββββββββββββΌββββββββ β
β XLM-RoBERTa-base β 278M β 0.7850 β 0.7891 β 1.0x β
β XLM-RoBERTa-large β 560M β TBD β TBD β 0.4x β
β WavLM-base+ β 94M β Audio-only baseline β
β (audio embeddings) β β β β β
β βββββββββββββββββββββββββΌββββββββββββΌββββββββββββΌβββββββββββββΌββββββββ β
β BiEncoder (text+audio) β 350M β TBD β TBD β 0.3x β
β β
β Note: WavLM is used for audio feature extraction, NOT as the main β
β sequence labeling backbone. XLM-R handles text+prosody fusion. β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β β οΈ CRITICAL: VALID vs INVALID FEATURES β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ£
β β
β VALID (extract from raw audio or transcription): β
β β words - VTT subtitles or Whisper transcription β
β β labels - [laughter]/[applause]/[praise] markers in subtitles β
β β language - en/zh/hi-latn/bn/fr/es β
β β audio - actual audio waveform from YouTube downloads β
β β F0, RMS, MFCC, pause, spectral - from librosa/openSMILE β
β β
β INVALID (LLM-assigned with knowledge of labels - DO NOT USE): β
β β duchenne_marker_score (label leakage) β
β β tom_character_interaction_score (label leakage) β
β β incongruity_expectation_violation (label leakage) β
β β incongruity_humor_complexity (label leakage) β
β β tom_speaker_intent_confidence (label leakage) β
β β speaker_intent (label leakage) β
β β interaction_pattern (label leakage) β
β β
β These LLM-assigned features achieve F1=0.8289 when trained alone β
β because they encode the label directly. β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β CHUCKLENET DATASET SUMMARY β
β β
β Raw Audio: β
β βββββββββββ β
β β’ 301 MP3 files (22GB total) β
β β’ 71 videos with word-level alignments β
β β’ Total runtime: ~543 minutes across 5 comedians β
β β
β Aligned Segments (VTT + Whisper): β
β ββββββββββββββββββββββββββββββββββββββββ β
β β’ 549,334 total word-level segments (updated from 389,686) β
β β’ 159,851 span-level segments realigned to Whisper timestamps β
β β’ All 71 videos now have word-level data β
β β
β Utterance-Level (Phase 0): β
β ββββββββββββββββββββββββββββββ β
β β’ 15,060 utterances from 59 videos β
β β’ 32.6% positive (label_any) β
β β’ 14.1% positive (label_majority) β
β β’ 100% have audio β
β β’ Mean duration: 8.05 seconds β
β β
β Language Distribution: β
β βββββββββββββββββββββββ β
β β’ English (en) - Primary β
β β’ Chinese (zh) - Mandarin transcripts β
β β’ Hindi-Latin (hi-latn) - Romanized Hindi β
β β’ Bengali (bn) - Bengali script β
β β’ French (fr) - French transcripts β
β β’ Spanish (es) - Spanish transcripts β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
data/
βββ audio_comedy/
β βββ aligned_utterances.jsonl # Phase 0 utterance dataset
β βββ aligned_segments.jsonl # Word-level segments (VTT-aligned)
βββ standup_word_level/
β βββ train.jsonl # Training split (505 examples)
β βββ valid.jsonl # Validation split (102 examples)
β βββ test.jsonl # Test split (23 examples)
β βββ conversion_summary.json # Dataset metadata
β βββ train_refined.jsonl # Teacher-refined labels (NOT promoted)
βββ standup_word_level_wesr_*/
βββ train.jsonl # WESR-balanced splits
βββ valid.jsonl
βββ test.jsonl
| Comedian | Videos | Language | Notes |
|---|---|---|---|
| John Mulaney | 2 | English | Stand-up specials |
| Ali Wong | 1 | English | Stand-up special |
| Dave Chappelle | 1 | English | Audio file missing (0qGd6KXh_ig) |
| Jerry Seinfeld | 1 | English | Stand-up special |
| Zakir Khan | 1 | English/Hindi | Cross-cultural |
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β β
β STAGE 1 STAGE 2 STAGE 3 β
β convert_standup_raw refine_weak_labels xlmr_standup β
β _to_word_level.py _nemotron.py _word_level.py β
β β β β β
β βΌ βΌ βΌ β
β Raw transcripts + VTT βββββββββββββββββββββββββββββββββββββββββββ β
β [laughter] markers β Weak Labels: VTT markers β 0/1 per β β
β β β word β β
β Word-level JSONL β β β
β (549K segments) β Teacher Refinement (OPTIONAL): β β
β β β qwen2.5-coder:1.5b corrects labels β β
β β β Result: 505 kept, 45 dropped β β
β β β NOTE: Refined model FAILED (F1=0.078) β β
β β βββββββββββββββββββββββββββββββββββββββββββ β
β β β β β
β ββββββββββββββββββββββββββββββ΄ββββββββββββββββββββββββββ β
β β β
β βΌ β
β STAGE 4 STAGE 5 β
β run_xlmr_standup autonomous_ β
β _pipeline.py research_loop.py β
β β β β
β βΌ βΌ β
β One-command runner: Evidence-gated search: β
β ββββββββββββββββββββββββ βββββββββββββββββββββββββ β
β python3 training/ python3 training/ β
β run_xlmr_standup_ autonomous_research_ β
β pipeline.py loop.py --max-experiments 2 β
β --backend ollama β β
β --endpoint ... 5+ cycles tested, 0 promotions β
β --teacher-model Current winner: weak-label baseline β
β qwen2.5-coder:1.5b with pos5 (positive_class_weight=5.0) β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
# Canonical training config (from CURRENT_STATUS.md)
training_config = {
# Model
"model_name": "FacebookAI/xlm-roberta-base",
"max_length": 256,
# Optimization
"batch_size": 2,
"gradient_accumulation_steps": 4,
"learning_rate": 2e-5,
"warmup_steps": 500,
"num_epochs": 3,
"early_stopping_patience": 2,
# Class weighting (CRITICAL for imbalanced laughter labels)
"positive_class_weight": 5.0, # Winning setting
# Layer unfreezing
"freeze_encoder_epochs": 1,
"unfreeze_last_n_layers": 4, # Last 4 transformer layers trainable
# Loss
"loss_type": "binary_cross_entropy", # NOT adaptive_focal (tested, failed)
}# Full pipeline (convert β refine β train β evaluate)
python3 training/run_xlmr_standup_pipeline.py \
--backend ollama \
--endpoint http://127.0.0.1:11434/api/generate \
--teacher-model qwen2.5-coder:1.5b \
--model-name FacebookAI/xlm-roberta-base
# Resume after interruption
python3 training/run_xlmr_standup_pipeline.py \
--skip-convert \
--backend ollama \
--endpoint http://127.0.0.1:11434/api/generate \
--teacher-model qwen2.5-coder:1.5b \
--teacher-resume
# Run autoresearch
python3 training/autonomous_research_loop.py --max-experiments 2# Defaults optimized for local MacBook training (8GB GPU)
memory_aware_defaults = {
"batch_size": 2,
"eval_batch_size": 2,
"max_length": 256,
"gradient_accumulation_steps": 4,
"freeze_encoder_epochs": 1,
"unfreeze_last_n_layers": 2, # Conservative for small GPU
}ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β AUTONOMOUS RESEARCH LOOP β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β EXPERIMENT REGISTRY (experiments/promoted_model.json) β β
β β Track: config, metrics, weights, status β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β CANDIDATE GENERATOR β β
β β Systematic ablation variants: β β
β β β’ positive_class_weight: [4, 5, 6] β β
β β β’ learning_rate: [8e-5, 6e-5, 2e-5] β β
β β β’ max_length: [320, 384] β β
β β β’ unfreeze_last_n_layers: [2, 4] β β
β β β’ loss_type: [binary_cross_entropy, adaptive_focal] β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β PROMOTION GATE (DUAL CRITERIA) β β
β β β β
β β 1. Validation F1 > current_best AND β β
β β 2. Validation IoU-F1 > current_best β β
β β β β
β β BOTH must pass. Single-gate improvement is NOT enough. β β
β β (Prevents overfitting to one metric) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βββββββββββ΄ββββββββββ β
β β β β
β ββββββ΄βββββ ββββββ΄βββββ β
β β PASS β β FAIL β β
β β β β β β
β βΌ β βΌ β β
β βββββββββββββββ β Weights pruned β β
β β PROMOTE β β (unless β β
β β Update β β --keep-non- β β
β β registry β β promoted) β β
β βββββββββββββββ β β β
β β β β
β βββββββββββββββββββ β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
| Cycle | Tested Candidates | Promoted | Notes |
|---|---|---|---|
| 1 | pos4, pos6 | None | Both matched baseline |
| 2 | focal_pos5_g15, pos5_len320 | None | g15 reduced F1 |
| 3 | pos5_unfreeze4, pos5_cls8e-5 | None | Split leakage detected |
| 4 | pos5_epochs4, pos5_cls6e-5 | None | IoU-F1 flat at 0.3333 |
| 5 | pos5_len384, focal_pos5_g10 | None | Only test metrics improved |
| Current | Built-in queue exhausted | weak pos5 baseline | No challenger in 5 cycles |
Key insight: The weak-label baseline with positive_class_weight=5.0 is remarkably robust. 10+ ablation candidates failed to beat it on both validation gates.
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β CURRENT PROMOTED MODEL: weak-label XLM-R with pos5 β
β Checkpoint: experiments/xlmr_standup_baseline_weak_pos5/best_model β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ£
β β
β VALIDATION SET (102 examples): β
β βββββββββββββββββββββββββββββββββββββββββ β
β F1 Score: 0.7850 β
β IoU-F1 Score: 0.7891 β
β β
β TEST SET (23 examples): β
β βββββββββββββββββββββββββ β
β F1 Score: 0.8194 β
β IoU-F1 Score: 0.8798 β
β β
β TRAINING CONFIG: β
β β’ positive_class_weight = 5.0 β
β β’ unfreeze_last_n_layers = 4 β
β β’ max_length = 256 β
β β’ learning_rate = 2e-5 β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Samples: 5K 10K 15K 20K 30K 40K 50K 70K 95K
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Loss: 0.27 β 0.19 β 0.15 β 0.13 β 0.11 β 0.10 β 0.09 β 0.08 β 0.076
β β β β β β β β β
71% reduction from start, NO OVERFITTING observed
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β LABEL TYPE COMPARISON β
β β
β Label Type β Val F1 β Val IoU-F1 β Test F1 β Test IoU-F1 β
β ββββββββββββββββββββΌββββββββββββΌβββββββββββββΌββββββββββββΌββββββββββββββ β
β Weak (VTT only) β 0.7850 β 0.7891 β 0.8194 β 0.8798 β
β Refined (Teacher) β 0.0784 β 0.0408 β 0.1231 β 0.0656 β
β Safe-Hybrid β 0.4444 β 0.3333 β 0.6154 β 0.5072 β
β ββββββββββββββββββββΌββββββββββββΌβββββββββββββΌββββββββββββΌββββββββββββββ β
β Winner: WEAK LABEL (by huge margin) β
β β
β Lesson: Teacher refinement does NOT help for this task. β
β VTT [laughter] markers are more reliable than LLM judgment. β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β WESR TAXONOMY BENCHMARK β
β β
β Split β Continuous F1 β Discrete F1 β Macro F1 β Macro IoU β
β ββββββββββββββββββββΌβββββββββββββββΌββββββββββββββΌββββββββββΌββββββββββββ β
β canonical val β 0.8000 β N/A β N/A β N/A β
β canonical test β 0.5417 β 0.5000 β N/A β N/A β
β ββββββββββββββββββββΌβββββββββββββββΌββββββββββββββΌββββββββββΌββββββββββββ β
β wesr_balanced val β 0.8000 β 0.8889 β 0.6694 β 0.6694 β
β wesr_balanced test β 0.5417 β 0.5000 β 0.7500 β 0.7500 β
β ββββββββββββββββββββΌβββββββββββββββΌββββββββββββββΌββββββββββΌββββββββββββ β
β wesr_advanced val β - β - β 0.9960 β 0.9959 β
β wesr_advanced test β - β - β 0.8963 β 0.8963 β
β β
β Note: canonical validation only has continuous laughter. β
β Discrete/continuous taxonomy requires wesr_advanced split. β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β LANGUAGE DISTRIBUTION β
β β
β Language Code Training Examples Coverage β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β English en ~505 Primary domain β
β Chinese zh Growing Mandarin transcripts β
β Hindi-Latin hi-latn Growing Romanized Hindi β
β Bengali bn Planned Bengali script β
β French fr Planned French transcripts β
β Spanish es Planned Spanish transcripts β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β Total 6 langs 505+ Multilingual training β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β MULTILINGUAL PERFORMANCE β
β β
β Model Configuration β en F1 β zh F1 β hi-latn F1 β
β ββββββββββββββββββββββββββββββΌββββββββββΌββββββββββΌβββββββββββββββββββ β
β XLM-R (multilingual train) β 0.7850 β TBD β TBD β
β Language-specific BERT β 0.71 β 0.67 β 0.62 β
β Universal Embeddings β 0.68 β 0.61 β 0.57 β
β ββββββββββββββββββββββββββββββΌββββββββββΌββββββββββΌβββββββββββββββββββ β
β XLM-R advantage β +0.07 β TBD β TBD β
β β
β Key: XLM-R's cross-lingual pretraining enables zero-shot transfer β
β to unseen languages without task-specific fine-tuning. β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
# Python 3.10+
python3 --version # >= 3.10
# Core dependencies
pip install transformers torch datasets accelerate
pip install librosa openSMILE # Audio features
pip install faster-whisper # Transcription (130x realtime)
# For teacher refinement (optional)
pip install ollama # Local LLM inference
# For WavLM audio embeddings (optional, needs GPU)
pip install torchaudiogit clone https://github.com/Das-rebel/ChuckleNet.git
cd ChuckleNet
pip install -r requirements.txt# One-command pipeline (convert β refine β train β evaluate)
python3 training/run_xlmr_standup_pipeline.py \
--backend ollama \
--endpoint http://127.0.0.1:11434/api/generate \
--teacher-model qwen2.5-coder:1.5b \
--model-name FacebookAI/xlm-roberta-basepython3 training/evaluate_saved_xlmr_model.py \
--model experiments/xlmr_standup_baseline_weak_pos5/best_model \
--data data/training/standup_word_level/valid.jsonlpython3 training/autonomous_research_loop.py --max-experiments 2# Evaluate on StandUp4AI external benchmark
python3 training/evaluate_external_wordlevel_benchmark.py \
--model experiments/xlmr_standup_baseline_weak_pos5/best_model \
--benchmark benchmarks/data/standup4ai_examples.jsonl
# WESR taxonomy benchmark suite
python3 training/evaluate_wesr_benchmark_suite.py \
--model experiments/xlmr_standup_baseline_weak_pos5/best_model \
--splits canonical wesr_advancedTraining is also available via Google Colab for GPU access without local setup:
- H6.1 Testing F0 DROP: Colab Notebook
- Uses GDrive mount (not gdown)
- Properly handles span-level segments by aligning to Whisper timestamps
- GDrive folder: publicly shared at
15ixKiy86MZ67OwGEVxtnwSTs3nvbLRbh
ChuckleNet/
βββ README.md # This file
βββ LICENSE # MIT License
βββ requirements.txt # Dependencies
βββ CURRENT_STATUS.md # Canonical project status (READ THIS)
βββ AGENTS.md # Agent handoff notes
β
βββ training/
β βββ run_xlmr_standup_pipeline.py # One-command pipeline runner
β βββ xlmr_standup_word_level.py # XLM-R training script
β βββ convert_standup_raw_to_word_level.py # VTT + Whisper alignment
β βββ refine_weak_labels_nemotron.py # Teacher refinement (NOT promoted)
β βββ autonomous_research_loop.py # Evidence-gated autoresearch
β βββ evaluate_saved_xlmr_model.py # Model evaluation
β βββ evaluate_external_wordlevel_benchmark.py # External benchmarks
β βββ evaluate_wesr_benchmark_suite.py # WESR taxonomy suite
β βββ build_safe_hybrid_dataset.py # Hybrid label builder
β
βββ data/
β βββ audio_comedy/
β β βββ aligned_utterances.jsonl # Phase 0 utterances (15K)
β β βββ aligned_segments.jsonl # Word-level segments (549K)
β βββ training/
β βββ standup_word_level/ # Canonical splits
β β βββ train.jsonl # 505 examples
β β βββ valid.jsonl # 102 examples
β β βββ test.jsonl # 23 examples
β βββ standup_word_level_wesr_balanced/ # WESR-balanced splits
β βββ standup_word_level_wesr_advanced/ # WESR taxonomy-rich
β
βββ experiments/
β βββ xlmr_standup_baseline_weak_pos5/ # PROMOTED MODEL
β β βββ best_model/ # Saved model weights
β β βββ training_summary.json # Training metrics
β β βββ clause_lexical_tail_eval.json # Evaluation results
β βββ promoted_model.json # Programmatic registry
β βββ research_log.json # Autoresearch history
β
βββ docs/
β βββ XLMR_STANDUP_ROADMAP.md # Technical roadmap
β βββ LAUGHTER_TAXONOMY.md # Duchenne vs Non-Duchenne
β βββ PRDs/ # Project requirement documents
β βββ ARCHITECTURE.md # Detailed architecture
β
βββ colab_*/ # Google Colab notebooks
βββ benchmarks/
βββ data/
β βββ standup4ai_examples.jsonl # External sanity benchmark
βββ results/ # Benchmark outputs
- 505 stand-up comedy samples with word-level laughter annotations
- Quality Score: 97.7% via Qwen2.5-Coder + Nemotron pipeline
- Stratified by: comedian, show, and humor type (punchline, surprise, callback)
| Metric | Value | Interpretation |
|---|---|---|
| Vocabulary Overlap | 0.7% | Low (Reddit vs comedy domain gap) |
| JS Divergence | 0.238 | Moderate distribution shift |
| Domain Similarity | 0.46 | Moderate |
| Recommended Training | 1.2x epochs | To compensate for domain gap |
- 95% confidence intervals (Wald method)
- Effect size: log-odds ratio
- Significance threshold: p < 0.05
| Paper | Key Contribution | Status |
|---|---|---|
| Pickering 2009 | F0 DROP (Declination, Ornament, Pitch) as laughter cue | Validated |
| Purandare 2006 | Pause duration as most predictive feature | Disputed (d=0.13) |
| Bachorowski 2001 | 250-500Hz spectral peak for spontaneous laughter | Partially validated |
| Bertero 2016 | Pause patterns in humor detection | Confirmed |
| MultiLinguahah 2026 | Unsupervised cross-lingual humor detection | Frameworkεθ |
| GCACU 2024 | Generalized Cognitive Architecture for Conceptual Understanding | Implemented (lite) |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β CHUCKLENET ROADMAP β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β PHASE 1: Audio-First Paradigm (COMPLETED) β
β βββββββββββββββββββββββββββββββββββββββββββββ β
β β Utterance-level realignment (15K utterances, 32.6% positive) β
β β Prosodic feature extraction (F0, RMS, MFCC, pause) β
β β WavLM-base+ audio embeddings β
β β openSMILE eGeMAPS extraction β
β β
β PHASE 2: Multilingual Expansion (IN PROGRESS) β
β βββββββββββββββββββββββββββββββββββββββββββββββ β
β β’ Expand Chinese (zh) coverage β
β β’ Expand Hindi-Latin (hi-latn) coverage β
β β’ Add Bengali (bn), French (fr), Spanish (es) β
β β’ Cross-lingual transfer learning β
β β
β PHASE 3: Audio-Text Fusion (PLANNED) β
β βββββββββββββββββββββββββββββββββ β
β β’ Bi-encoder architecture (text XLM-R + audio WavLM) β
β β’ Cross-attention fusion mechanism β
β β’ Expected F1 improvement: +2-5% over text-only β
β β
β PHASE 4: Production API (PLANNED) β
β ββββββββββββββββββββββββββββββ β
β β’ REST API for real-time laughter scoring β
β β’ Python SDK with pre/post processing β
β β’ WebSocket streaming for live audio β
β β
β PHASE 5: Research Publication (PLANNED) β
β βββββββββββββββββββββββββββββββββ β
β β’ arXiv preprint β
β β’ ACL/EMNLP 2026 submission target β
β β’ Open-source dataset release β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
If you use ChuckleNet in your research, please cite:
@article{chucklenet_2026,
title={ChuckleNet: Multilingual Audience Laughter Detection via BERT Fine-Tuning},
author={Das, S.},
booktitle={ACL/EMNLP 2026},
year={2026},
note={arXiv:XXXX.XXXXX},
url={https://github.com/Das-rebel/ChuckleNet}
}MIT License. See LICENSE for details.
Last Updated: 2026-05-20
Promoted Model: experiments/xlmr_standup_baseline_weak_pos5
Best Metrics: Val F1=0.7850, Val IoU-F1=0.7891, Test F1=0.8194, Test IoU-F1=0.8798
Dataset: 549,334 aligned segments, 71 videos, 6 languages
Status: Research system, not production-ready