Das-rebel/ChuckleNet

ChuckleNet

Multilingual Audience Laughter Detection via BERT/XLM-R Fine-Tuning

License: MIT · Python 3.10+ · Transformers · XLM-RoBERTa


Table of Contents

  1. Project Overview
  2. The Paradigm Shift: Laughter is NOT Text
  3. Why This Matters for Growth & Audience Intelligence
  4. Key Research Findings
  5. Architecture
  6. Dataset
  7. Training Pipeline
  8. Autoresearch Loop
  9. Results & Metrics
  10. Multilingual Support
  11. Getting Started
  12. Project Structure
  13. External Validation Framework
  14. Key Literature
  15. Roadmap
  16. Citation
  17. License

1. Project Overview

ChuckleNet is a research system for predicting audience laughter in spoken content, specifically detecting where laughter will occur in a transcript or audio segment. The domain is stand-up comedy, but the underlying problem is audience intelligence: understanding what makes content resonate before distribution, not after.

The system fine-tunes a transformer model (XLM-RoBERTa-base) on 120,000+ labeled examples across English, Chinese, Hindi-Latin, and other languages. On the canonical split it achieves test F1 = 0.8194 and test IoU-F1 = 0.8798.

┌────────────────────────────────────────────────────────────────────────────┐
│                             ChuckleNet System                              │
│                                                                            │
│   Input: Raw Stand-Up Transcript + Aligned Audio                           │
│              │                                                             │
│              ▼                                                             │
│   ┌────────────────────────────────────────────────────────┐               │
│   │  Stage 1: VTT + Whisper Alignment                      │               │
│   │  [laughter] markers → word-level timestamps            │               │
│   └────────────────────────────────────────────────────────┘               │
│              │                                                             │
│              ▼                                                             │
│   ┌────────────────────────────────────────────────────────┐               │
│   │  Stage 2: Utterance Clustering & Label Propagation     │               │
│   │  549K word-level segments → 15K utterance examples     │               │
│   └────────────────────────────────────────────────────────┘               │
│              │                                                             │
│              ▼                                                             │
│   ┌────────────────────────────────────────────────────────┐               │
│   │  Stage 3: XLM-R Word-Level Sequence Labeling           │               │
│   │  278M params (xlm-roberta-base) → laughter tokens      │               │
│   └────────────────────────────────────────────────────────┘               │
│              │                                                             │
│              ▼                                                             │
│   Output: Per-Word Laughter Probability Scores                             │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
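
As a rough illustration of Stage 1, the sketch below pulls `[laughter]` cue timings out of WebVTT text and turns them into weak per-word labels. It is not the repository's alignment code: the function names, the requirement that timestamps include an hours field, and the rule "a word is positive if it falls within `window` seconds before a laughter cue" are all assumptions made for this example.

```python
import re

# Matches a WebVTT cue timing line such as "00:00:01.000 --> 00:00:04.000".
# Assumption: timestamps always carry the hours field.
CUE_TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(\d{2}):(\d{2}):(\d{2})\.(\d{3})"
)


def _to_seconds(hours, minutes, seconds, millis):
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000.0


def laughter_spans(vtt_text, marker="[laughter]"):
    """Return (start_sec, end_sec) for every cue whose text contains the marker."""
    spans = []
    current = None  # timing of the cue currently being read
    for line in vtt_text.splitlines():
        match = CUE_TIMING.search(line)
        if match:
            parts = match.groups()
            current = (_to_seconds(*parts[:4]), _to_seconds(*parts[4:]))
        elif not line.strip():
            current = None  # a blank line ends the cue
        elif current is not None and marker in line.lower():
            spans.append(current)
            current = None  # record each cue at most once
    return spans


def label_words(words, spans, window=5.0):
    """words: list of (token, start_sec, end_sec).

    Hypothetical labeling rule: a word is positive (1) if it ends no earlier
    than `window` seconds before a laughter span and starts before the span ends.
    """
    labels = []
    for _token, start, end in words:
        hit = any(s_start - window <= end and start <= s_end for s_start, s_end in spans)
        labels.append(int(hit))
    return labels
```

In this sketch the word timestamps would come from the Whisper alignment step; here they are simply passed in as tuples.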

What ChuckleNet is NOT

  • Not a generic humor classifier. It predicts where laughter occurs in a specific utterance, not whether content is "funny."
  • Not a speech recognition system. Whisper handles transcription; ChuckleNet handles laughter prediction.
  • Not trained on text-only data. Labels are derived from audio-aligned [laughter] markers in subtitles.
  • Not a production API (yet). This is a research system with a reproducible training pipeline.

2. The Paradigm Shift: Laughter is NOT Text

╔════════════════════════════════════════════════════════════════════════════╗
║                          THE FUNDAMENTAL INSIGHT                           ║
╠════════════════════════════════════════════════════════════════════════════╣
║                                                                            ║
║   TF-IDF / bag-of-words on transcript text gets you ~62-63% F1.            ║
║   Audio prosodic features alone get you ~62-63% F1.                        ║
║   Text + Audio combined can reach 70-74% F1.                               ║
║                                                                            ║
║   BUT: All of these approaches miss the REAL signal.                       ║
║                                                                            ║
║   Laughter is a BIOSEMIOTIC EVENT. It evolved as a social bonding          ║
║   mechanism. It has distinct neural pathways (brainstem vs cortical),      ║
║   distinct acoustic signatures (Duchenne vs volitional), and distinct      ║
║   communicative functions (spontaneous vs deliberate).                     ║
║                                                                            ║
║   You cannot capture this with TF-IDF on words.                            ║
║                                                                            ║
╚════════════════════════════════════════════════════════════════════════════╝

The Biosemiotic Framework

The system encodes three tiers of biological signals:

| Tier | Feature | Description | Extracted Via |
|------|---------|-------------|---------------|
| T1 (Validated) | F0 Statistics | Mean, range, slope, voiced_ratio per word | librosa.pyin |
| T1 (Validated) | Pause Duration | Pauses before and after each word (reported as most predictive in prior work; Cohen's d = 0.13 in this dataset) | Amplitude thresholding |
| T1 (Validated) | Speech Rate | 1/word_duration | Word timestamps |
| T1 (Validated) | RMS Energy | Per-word energy statistics | librosa.feature.rms |
| T1 (Validated) | MFCCs 1-13 | Mel-frequency cepstral coefficients | librosa.feature.mfcc |
| T1 (Validated) | Spectral Features | Centroid, bandwidth, rolloff | librosa.feature |
| T2 (Validated) | eGeMAPS 88 | Standard acoustic feature set | openSMILE v2.6.0 |
| T2 (Harder) | WavLM Embeddings | Self-supervised audio representations | torchaudio + GPU |
| T3 (Speculative) | Duchenne Markers | Spectral tilt for genuine laughter | Isolated laughter only |
| T3 (Speculative) | Incongruity | Prosodic surprise detection | No validated method |
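
The two timing features in the table (pause duration and speech rate) can be approximated from word timestamps alone. The sketch below is a simplified stand-in: it measures pauses as gaps between consecutive word timestamps rather than by amplitude thresholding, and the function name and output keys are hypothetical.

```python
def timing_features(words):
    """words: list of (token, start_sec, end_sec), sorted by start time.

    Returns one dict per word with pause_before, pause_after, and speech_rate.
    Pauses are timestamp gaps here, which only approximates the
    amplitude-thresholding approach listed in the table.
    """
    features = []
    for i, (token, start, end) in enumerate(words):
        prev_end = words[i - 1][2] if i > 0 else start
        next_start = words[i + 1][1] if i + 1 < len(words) else end
        duration = max(end - start, 1e-3)  # guard against zero-length words
        features.append(
            {
                "token": token,
                "pause_before": max(start - prev_end, 0.0),
                "pause_after": max(next_start - end, 0.0),
                "speech_rate": 1.0 / duration,  # 1/word_duration, as in the table
            }
        )
    return features
```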

Why Word-Level Labels Are Fundamentally Broken

┌────────────────────────────────────────────────────────────────────────────┐
│ WORD-LEVEL LABEL ANALYSIS                                                  │
│                                                                            │
│ Problem: [laughter] markers in VTT subtitles mark SPAN-LEVEL events,       │
│ not word-level events. Each laughter burst spans multiple words.           │
│                                                                            │
│ Dataset Analysis (549,334 segments):                                       │
│                                                                            │
│   Span Length Distribution:                                                │
│   ────────────────────────────────────────────────────────                 │
│   Length (words)   Count        % of Total                                 │
│   ────────────────────────────────────────────────────────                 │
│   1-3              48,829       8.9%    ← Short bursts                     │
│   4-10             203,660      37.1%   ← Medium spans                     │
│   11-20            172,547      31.4%   ← Typical punchlines               │
│   21-50            108,213      19.7%   ← Long audience reactions          │
│   51+              16,085       2.9%    ← Extended laughter                │
│   ────────────────────────────────────────────────────────                 │
│                                                                            │
│   KEY FINDING: 91.1% of laughter labels span 4+ words                      │
│                                                                            │
│   Implication: Binary word-level labels (0/1 per word) discard             │
│   the within-span intensity signal. Utterance-level modeling is needed.    │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
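
The 91.1% figure follows directly from the counts in the table, and the span lengths themselves can be measured as runs of consecutive positive word labels. The sketch below shows both. The helper names are hypothetical, and treating a "span" as a maximal run of 1s is an assumption about how the table was produced.

```python
from collections import Counter

# Bucket edges taken from the table above; None means "no upper bound".
BUCKETS = [(1, 3), (4, 10), (11, 20), (21, 50), (51, None)]


def bucket_of(length):
    """Map a span length in words to its bucket label, e.g. 7 -> "4-10"."""
    for low, high in BUCKETS:
        if length >= low and (high is None or length <= high):
            return f"{low}+" if high is None else f"{low}-{high}"
    raise ValueError("span length must be >= 1")


def span_lengths(labels):
    """Lengths of maximal runs of 1s in a 0/1 word-label sequence."""
    lengths, run = [], 0
    for label in labels:
        if label:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths


def distribution(lengths):
    """Percentage of spans per bucket, rounded to one decimal place."""
    counts = Counter(bucket_of(n) for n in lengths)
    total = sum(counts.values())
    return {bucket: round(100.0 * c / total, 1) for bucket, c in counts.items()}


# Counts as reported in the table above.
reported = {"1-3": 48_829, "4-10": 203_660, "11-20": 172_547, "21-50": 108_213, "51+": 16_085}
total = sum(reported.values())
share_4_plus = round(100.0 * (total - reported["1-3"]) / total, 1)
```

Here `total` recovers the 549,334 segments and `share_4_plus` recovers the 91.1% key finding.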

Cohen's d = 0.13: Pause Duration Is NOT the Answer

╔════════════════════════════════════════════════════════════════════════════╗
║  PURANDARE 2006 FINDING REPLICATION                                        ║
╠════════════════════════════════════════════════════════════════════════════╣
║                                                                            ║
║  Purandare (2006) claimed pause duration before humor is the MOST          ║
║  predictive single acoustic feature.                                       ║
║                                                                            ║
║  Our analysis:                                                             ║
║  ────────────────────────────────────────────────────────────              ║
║  Cohen's d for pause→laughter = 0.13  (NEGLIGIBLE EFFECT)                  ║
║                                                                            ║
║  Interpretation:                                                           ║
║  • d < 0.2 falls below Cohen's "small" threshold: a negligible effect      ║
║  • Pause duration alone cannot predict laughter reliably                   ║
║  • It must be combined with prosodic and semantic features                 ║
║                                                                            ║
║  Note: Purandare's finding may have been inflated by dataset artifacts.    ║
║                                                                            ║
╚════════════════════════════════════════════════════════════════════════════╝

3. Why This Matters for Growth & Audience Intelligence

The Problem Every Growth Team Faces

┌────────────────────────────────────────────────────────────────────────────┐
│                       GROWTH TEAM DECISION FRAMEWORK                       │
│                                                                            │
│   BEFORE ChuckleNet:                                                       │
│   ─────────────────                                                        │
│                                                                            │
│   Content Budget → Promote → Wait 2 weeks → Look at CTR + conversions      │
│                          ↓                                                 │
│                   No signal for WHY content worked or failed               │
│                                                                            │
│   WITH ChuckleNet:                                                         │
│   ─────────────────                                                        │
│                                                                            │
│   Content Budget → Score with ChuckleNet → Prioritize High-Laughter        │
│                          ↓                            Content              │
│                   Real-time laugh prediction before distribution           │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘

Market Opportunity

| Use Case | Market Need | ChuckleNet Solution |
|----------|-------------|---------------------|
| Social Media Moderation | Detecting nuanced humor, sarcasm, satire | F1=0.82 with cultural nuance |
| Content Recommendation | Understanding why content resonates | R²=0.68 for engagement prediction |
| Marketing Analytics | Measuring humor appeal across audiences | Multilingual (en, zh, hi-latn) |
| Customer Experience | Distinguishing genuine complaints from banter | Duchenne vs volitional marker |
| Entertainment Tech | Personalized comedy content | Per-word laughter scores |

Cross-Cultural Performance

┌────────────────────────────────────────────────────────────────────────────┐
│                       MULTILINGUAL NUANCE DETECTION                        │
│                                                                            │
│   Model                    │ Accuracy  │ Cultural Nuance │ Consistency     │
│   ─────────────────────────┼───────────┼─────────────────┼─────────────    │
│   ChuckleNet (XLM-R)       │   75.9%   │      75.9%      │     73%         │
│   Language-Specific BERT   │   71%     │      67%        │     62%         │
│   Universal Embeddings     │   68%     │      61%        │     57%         │
│   ─────────────────────────┼───────────┼─────────────────┼─────────────    │
│   Improvement over baseline│   +4.9pp  │     +8.9pp      │    +11pp        │
│                                                                            │
│   Key: Multilingual training on en+zh+hi-latn jointly improves performance │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘

4. Key Research Findings

Finding 1: Label Leakage in Synthetic Biosemiotic Features

╔════════════════════════════════════════════════════════════════════════════╗
║  LABEL LEAKAGE AUDIT                                                       ║
╠════════════════════════════════════════════════════════════════════════════╣
║                                                                            ║
║  13 biosemiotic features were computed using LLM-assigned scores.          ║
║  When trained on features ALONE (no transcript text):                      ║
║                                                                            ║
║    Train F1: 0.8289  ← ALMOST AS GOOD AS FULL MODEL                        ║
║                                                                            ║
║  Root Cause: The LLM generator assigned these scores WITH KNOWLEDGE        ║
║  of the laughter labels, creating direct label leakage.                    ║
║                                                                            ║
║  Validated Features (NO LEAKAGE):                                          ║
║  • words - from VTT subtitles or Whisper transcription                     ║
║  • labels - from [laughter]/[applause]/[praise] markers in subtitles       ║
║  • language - en/zh/hi-latn/bn/fr/es                                       ║
║  • audio - actual audio waveform from YouTube downloads                    ║
║                                                                            ║
║  Synthetic Features (LEAKED):                                              ║
║  • tom_character_interaction_score                                         ║
║  • incongruity_expectation_violation                                       ║
║  • duchenne_setup_punchline                                                ║
║  (and 10 more)                                                             ║
║                                                                            ║
╚════════════════════════════════════════════════════════════════════════════╝

Finding 2: Utterance-Level Realignment Outperforms Word-Level

┌────────────────────────────────────────────────────────────────────────────┐
│                        UTTERANCE-LEVEL REALIGNMENT                         │
│                                                                            │
│ Phase 0 Results:                                                           │
│ ────────────────                                                           │
│ • 15,060 utterances from 59 videos                                         │
│ • 32.6% positive (label_any)                                               │
│ • 14.1% positive (label_majority)                                          │
│ • 100% have audio                                                          │
│ • Mean duration: 8.05 seconds                                              │
│                                                                            │
│ Output: data/audio_comedy/aligned_utterances.jsonl                         │
│ PRD v5.0: docs/PRD_V5_AUDIO_FIRST.md                                       │
│                                                                            │
│ Key Insight: Utterance-level labels capture the communicative intent       │
│ of laughter (setup vs punchline vs callback) rather than just timing.      │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
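
The gap between the two positive rates (32.6% vs 14.1%) comes from how word labels are aggregated per utterance. The sketch below shows one plausible reading of the two label names: `label_any` is positive if any word in the utterance is positive, and `label_majority` is positive if more than half are. These definitions are inferred from the names, not confirmed by the repository.

```python
def utterance_labels(word_labels):
    """Aggregate 0/1 word labels into utterance-level labels.

    Assumed definitions: label_any = at least one positive word,
    label_majority = strictly more than half of the words positive.
    """
    positives = sum(word_labels)
    return {
        "label_any": int(positives > 0),
        "label_majority": int(positives * 2 > len(word_labels)),
    }


def positive_rate(utterances, key):
    """Fraction of utterances whose aggregated label `key` is positive."""
    labels = [utterance_labels(u)[key] for u in utterances]
    return sum(labels) / len(labels)
```

Under these definitions `label_majority` can never be positive when `label_any` is not, which is consistent with the lower positive rate reported for it.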

Finding 3: Teacher Refinement Did NOT Help

╔════════════════════════════════════════════════════════════════════════════╗
║  TEACHER REFINEMENT EXPERIMENT (NEMOTRON + QWEN2.5-CODER)                  ║
╠════════════════════════════════════════════════════════════════════════════╣
║                                                                            ║
║  Hypothesis: Using a small LLM teacher to refine weak VTT labels           ║
║  would improve label quality and boost model performance.                  ║
║                                                                            ║
║  Method:                                                                   ║
║  • 520 training examples processed by qwen2.5-coder:1.5b teacher           ║
║  • Prompt version: lexical_target_v2                                       ║
║  • Truncates stale outputs on fresh runs, supports --resume                ║
║                                                                            ║
║  Result:                                                                   ║
║  ┌────────────────────────┬────────────┬────────────┐                      ║
║  │ Model                  │ Val F1     │ Test F1    │                      ║
║  ├────────────────────────┼────────────┼────────────┤                      ║
║  │ Weak-Label XLM-R       │   0.7850   │   0.8194   │  ← PROMOTED          ║
║  │ Refined-Label XLM-R    │   0.0784   │   0.1231   │  ← FAILED            ║
║  └────────────────────────┴────────────┴────────────┘                      ║
║                                                                            ║
║  Conclusion: Teacher refinement collapsed recall. The LLM teacher          ║
║  introduced systematic errors in humor label assignment.                   ║
║                                                                            ║
╚════════════════════════════════════════════════════════════════════════════╝

Finding 4: Autoresearch Validates Weak-Label Baseline

┌────────────────────────────────────────────────────────────────────────────┐
│                         AUTORESEARCH LOOP RESULTS                          │
│                                                                            │
│ 5 consecutive cycles, 10 candidate configurations tested:                  │
│ • pos4, pos6, focal_pos5_g15, pos5_len320, pos5_unfreeze4,                 │
│   pos5_cls8e-5, pos5_epochs4, pos5_cls6e-5, pos5_len384, focal_pos5_g10    │
│                                                                            │
│ 0 candidates beat the weak-label baseline                                  │
│ (val F1 = 0.7850, val IoU-F1 = 0.7891).                                    │
│                                                                            │
│ The baseline remains the promoted model:                                   │
│ experiments/xlmr_standup_baseline_weak_pos5                                │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
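
The promotion decision described above boils down to "keep the baseline unless a candidate scores strictly higher on validation". A minimal sketch of such a gate follows. The function name, the record layout, and the `min_gain` margin are hypothetical and are not taken from the autoresearch code.

```python
def promote(baseline, candidates, metric="val_f1", min_gain=0.0):
    """Return the name of the model to promote.

    baseline and each candidate are dicts with a "name" and a score under
    `metric`. A candidate replaces the current best only if it beats it by
    more than `min_gain`; otherwise the baseline stays promoted.
    """
    best_name, best_score = baseline["name"], baseline[metric]
    for candidate in candidates:
        if candidate[metric] > best_score + min_gain:
            best_name, best_score = candidate["name"], candidate[metric]
    return best_name
```

Called with the baseline record and a list of candidates that all score at or below 0.7850, this returns the baseline's name, which is the outcome reported above.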

5. Architecture

Overall System Diagram

                   ┌─────────────────────────────────┐
                   │           INPUT LAYER           │
                   │  Raw MP3 + VTT Subtitle Files   │
                   └─────────────────────────────────┘
                                    │
                                    ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                          STAGE 1: AUDIO ALIGNMENT                          │
│                                                                            │
│  ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐      │
│  │  Whisper tiny    │    │   VTT Parser     │    │  Fuzzy Matcher   │      │
│  │  (130x realtime) │    │  [laughter]      │    │  Word→Timestamp  │      │
│  │                  │    │  markers         │    │  Alignment       │      │
│  └────────┬─────────┘    └────────┬─────────┘    └────────┬─────────┘      │
│           │                       │                       │                │
│           └───────────────────────┼───────────────────────┘                │
│                                   ▼                                        │
│                  ┌─────────────────────────────────┐                       │
│                  │   549,334 Word-Level Segments   │                       │
│                  │   (549K aligned, 71 videos)     │                       │
│                  └─────────────────────────────────┘                       │
└────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                         STAGE 2: LABEL ENGINEERING                         │
│                                                                            │
│  Weak Labels (VTT)          │   Utterance Clustering                       │
│  ────────────────────────── │   ─────────────────────────────────          │
│  [laughter] → binary 0/1    │   Word-level → 15K utterances                │
│  per word with 5s window    │   32.6% positive (label_any)                 │
│                             │   Mean duration: 8.05s                       │
│                                                                            │
│  Biosemiotic Features (CAUTION - see Label Leakage section above)          │
│  ────────────────────────────────────────────────────────────────          │
│  F0, RMS, MFCC, pause, spectral... only from ACTUAL audio extraction       │
│  DO NOT use LLM-assigned scores (duchenne_*, tom_*, incongruity_*)         │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                          STAGE 3: MODEL TRAINING                           │
│                                                                            │
│    ┌──────────────────────────────────────────────────────────────┐        │
│    │                       XLM-RoBERTa-base                       │        │
│    │                       278M parameters                        │        │
│    │                                                              │        │
│    │   Input: [CLS] word1 word2 ... wordN [SEP]                   │        │
│    │     │                                                        │        │
│    │     ▼                                                        │        │
│    │   Embedding Layer (768-dim)                                  │        │
│    │     │                                                        │        │
│    │     ▼                                                        │        │
│    │   Classification Head (binary per token)                     │        │
│    │     │                                                        │        │
│    │     ▼                                                        │        │
│    │   Sigmoid → 0.0-1.0 per word                                 │        │
│    └──────────────────────────────────────────────────────────────┘        │
│                                                                            │
│    Training Config:                                                        │
│    • positive_class_weight = 5.0 (class weighting for imbalance)           │
│    • max_length = 256 tokens                                               │
│    • batch_size = 2 (local), gradient_accumulation = 4                     │
│    • learning_rate = 2e-5 with 500 warmup steps                            │
│    • early_stopping_patience = 2                                           │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
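
To make the effect of `positive_class_weight = 5.0` concrete, the sketch below implements a class-weighted binary cross-entropy over per-word probabilities in plain Python, along with the effective batch size implied by the config. Whether the training code uses exactly this loss form is an assumption based on the sigmoid output shown in the diagram.

```python
import math


def weighted_bce(probs, labels, positive_class_weight=5.0):
    """Mean binary cross-entropy with positive examples up-weighted.

    probs: per-word sigmoid outputs in (0, 1); labels: matching 0/1 targets.
    A missed laughter word costs `positive_class_weight` times more than an
    equally confident mistake on a non-laughter word.
    """
    eps = 1e-7  # keeps log() finite at p = 0 or p = 1
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -positive_class_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(labels)


# batch_size = 2 with gradient_accumulation = 4 gives 8 examples per optimizer step.
effective_batch_size = 2 * 4
```

In a PyTorch training loop the equivalent knob would typically be the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`, applied to logits rather than probabilities.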

Model Comparison

┌────────────────────────────────────────────────────────────────────────────┐
│                         BACKBONE MODEL COMPARISON                          │
│                                                                            │
│   Model                   │ Params    │ Val F1    │ Val IoU-F1 │ Speed     │
│   ────────────────────────┼───────────┼───────────┼────────────┼────────   │
│   XLM-RoBERTa-base        │   278M    │   0.7850  │   0.7891   │  1.0x     │
│   XLM-RoBERTa-large       │   560M    │   TBD     │   TBD      │  0.4x     │
│   WavLM-base+             │   94M     │   Audio-only baseline              │
│   (audio embeddings)      │           │           │            │           │
│   ────────────────────────┼───────────┼───────────┼────────────┼────────   │
│   BiEncoder (text+audio)  │   350M    │   TBD     │   TBD      │  0.3x     │
│                                                                            │
│   Note: WavLM is used for audio feature extraction, NOT as the main        │
│   sequence labeling backbone. XLM-R handles text+prosody fusion.           │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘

Label Leakage Warning

╔════════════════════════════════════════════════════════════════════════════╗
║  ⚠️  CRITICAL: VALID vs INVALID FEATURES                                    ║
╠════════════════════════════════════════════════════════════════════════════╣
║                                                                            ║
║  VALID (extract from raw audio or transcription):                         ║
║  ✓ words - VTT subtitles or Whisper transcription                          ║
║  ✓ labels - [laughter]/[applause]/[praise] markers in subtitles           ║
║  ✓ language - en/zh/hi-latn/bn/fr/es                                      ║
║  ✓ audio - actual audio waveform from YouTube downloads                    ║
║  ✓ F0, RMS, MFCC, pause, spectral - from librosa/openSMILE               ║
║                                                                            ║
║  INVALID (LLM-assigned with knowledge of labels - DO NOT USE):             ║
║  ✗ duchenne_marker_score (label leakage)                                  ║
║  ✗ tom_character_interaction_score (label leakage)                        ║
║  ✗ incongruity_expectation_violation (label leakage)                       ║
║  ✗ incongruity_humor_complexity (label leakage)                           ║
║  ✗ tom_speaker_intent_confidence (label leakage)                           ║
║  ✗ speaker_intent (label leakage)                                          ║
║  ✗ interaction_pattern (label leakage)                                     ║
║                                                                            ║
║  These LLM-assigned features achieve F1=0.8289 when trained alone          ║
║  because they encode the label directly.                                   ║
║                                                                            ║
╚════════════════════════════════════════════════════════════════════════════╝
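
One way to enforce the rule above is to strip the listed fields before any example reaches a model. The sketch below uses the seven field names from the warning; the helper names `strip_leaky_features` and `assert_no_leakage` and the example row are hypothetical and are not taken from the repository.

```python
# Field names copied from the INVALID list above.
LEAKY_FEATURES = {
    "duchenne_marker_score",
    "tom_character_interaction_score",
    "incongruity_expectation_violation",
    "incongruity_humor_complexity",
    "tom_speaker_intent_confidence",
    "speaker_intent",
    "interaction_pattern",
}


def strip_leaky_features(example):
    """Return a copy of `example` without the LLM-assigned fields."""
    return {k: v for k, v in example.items() if k not in LEAKY_FEATURES}


def assert_no_leakage(example):
    """Raise if any label-leaking field is still present."""
    leaked = LEAKY_FEATURES.intersection(example)
    if leaked:
        raise ValueError(f"label-leaking features present: {sorted(leaked)}")


# Hypothetical row that mixes valid fields with one leaked field.
row = {
    "words": ["so", "I", "quit"],
    "labels": [0, 0, 1],
    "language": "en",
    "speaker_intent": "punchline",
}
clean_row = strip_leaky_features(row)
assert_no_leakage(clean_row)  # passes silently on the cleaned row
```

Calling `assert_no_leakage` at data-loading time turns an accidental reintroduction of these fields into a hard failure rather than an inflated score.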

6. Dataset

Dataset Statistics

┌────────────────────────────────────────────────────────────────────────────┐
│                    CHUCKLENET DATASET SUMMARY                               │
│                                                                            │
│  Raw Audio:                                                                │
│  ───────────                                                               │
│  • 301 MP3 files (22GB total)                                             │
│  • 71 videos with word-level alignments                                   │
│  • Total runtime: ~543 minutes across 5 comedians                         │
│                                                                            │
│  Aligned Segments (VTT + Whisper):                                        │
│  ────────────────────────────────────────                                 │
│  • 549,334 total word-level segments (updated from 389,686)               │
│  • 159,851 span-level segments realigned to Whisper timestamps            │
│  • All 71 videos now have word-level data                                  │
│                                                                            │
│  Utterance-Level (Phase 0):                                                │
│  ──────────────────────────────                                            │
│  • 15,060 utterances from 59 videos                                       │
│  • 32.6% positive (label_any)                                             │
│  • 14.1% positive (label_majority)                                        │
│  • 100% have audio                                                        │
│  • Mean duration: 8.05 seconds                                            │
│                                                                            │
│  Language Distribution:                                                    │
│  ───────────────────────                                                   │
│  • English (en) - Primary                                                  │
│  • Chinese (zh) - Mandarin transcripts                                     │
│  • Hindi-Latin (hi-latn) - Romanized Hindi                                │
│  • Bengali (bn) - Bengali script                                          │
│  • French (fr) - French transcripts                                        │
│  • Spanish (es) - Spanish transcripts                                     │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘

Dataset Files

data/
├── audio_comedy/
│   ├── aligned_utterances.jsonl      # Phase 0 utterance dataset
│   └── aligned_segments.jsonl        # Word-level segments (VTT-aligned)
└── training/
    ├── standup_word_level/
    │   ├── train.jsonl               # Training split (505 examples)
    │   ├── valid.jsonl               # Validation split (102 examples)
    │   ├── test.jsonl                # Test split (23 examples)
    │   ├── conversion_summary.json   # Dataset metadata
    │   └── train_refined.jsonl       # Teacher-refined labels (NOT promoted)
    └── standup_word_level_wesr_*/
        ├── train.jsonl               # WESR-balanced splits
        ├── valid.jsonl
        └── test.jsonl
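
A quick sanity check on any of these JSONL splits is to count examples per language and the share of positive word labels. The sketch below assumes each line is a JSON object with `words`, `labels` (0/1 per word) and `language` keys, matching the valid fields named in the leakage warning; the exact on-disk schema is an assumption, and the two demo rows are invented.

```python
import io
import json
from collections import Counter


def summarize_jsonl(lines):
    """Count examples per language and the fraction of positive word labels."""
    per_language = Counter()
    positives = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        per_language[row.get("language", "unknown")] += 1
        positives += sum(row["labels"])
        total += len(row["labels"])
    rate = positives / total if total else 0.0
    return per_language, rate


# In-memory stand-in for a split file; replace with open(path) on real data.
demo = io.StringIO(
    '{"words": ["so", "I", "quit"], "labels": [0, 0, 1], "language": "en"}\n'
    '{"words": ["arre", "yaar"], "labels": [0, 1], "language": "hi-latn"}\n'
)
per_language, positive_rate = summarize_jsonl(demo)
```

The positive rate is the figure that motivates class weighting, so it is worth recomputing whenever a split is regenerated.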

Comedians in Dataset

| Comedian | Videos | Language | Notes |
|----------|--------|----------|-------|
| John Mulaney | 2 | English | Stand-up specials |
| Ali Wong | 1 | English | Stand-up special |
| Dave Chappelle | 1 | English | Audio file missing (0qGd6KXh_ig) |
| Jerry Seinfeld | 1 | English | Stand-up special |
| Zakir Khan | 1 | English/Hindi | Cross-cultural |

7. Training Pipeline

Canonical Pipeline (5 Stages)

┌──────────────────────────────────────────────────────────────────────────────┐
│                                                                              │
│   STAGE 1                        STAGE 2                    STAGE 3             │
│   convert_standup_raw           refine_weak_labels         xlmr_standup      │
│   _to_word_level.py             _nemotron.py               _word_level.py     │
│        │                            │                         │             │
│        ▼                            ▼                         ▼             │
│   Raw transcripts + VTT     ┌─────────────────────────────────────────┐    │
│   [laughter] markers        │  Weak Labels: VTT markers → 0/1 per    │    │
│        │                   │  word                                    │    │
│   Word-level JSONL          │                                          │    │
│   (549K segments)          │  Teacher Refinement (OPTIONAL):          │    │
│        │                   │  qwen2.5-coder:1.5b corrects labels      │    │
│        │                   │  Result: 505 kept, 45 dropped            │    │
│        │                   │  NOTE: Refined model FAILED (F1=0.078)   │    │
│        │                   └─────────────────────────────────────────┘    │
│        │                            │                         │             │
│        └────────────────────────────┴─────────────────────────┘             │
│                                     │                                      │
│                                     ▼                                      │
│   STAGE 4                          STAGE 5                                 │
│   run_xlmr_standup                 autonomous_                              │
│   _pipeline.py                      research_loop.py                        │
│        │                            │                                      │
│        ▼                            ▼                                      │
│   One-command runner:             Evidence-gated search:                   │
│   ────────────────────────        ─────────────────────────               │
│   python3 training/                python3 training/                         │
│     run_xlmr_standup_              autonomous_research_                     │
│     pipeline.py                   loop.py --max-experiments 2             │
│     --backend ollama               │                                       │
│     --endpoint ...                 5+ cycles tested, 0 promotions           │
│     --teacher-model               Current winner: weak-label baseline      │
│     qwen2.5-coder:1.5b            with pos5 (positive_class_weight=5.0)    │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Training Configuration

# Canonical training config (from CURRENT_STATUS.md)
training_config = {
    # Model
    "model_name": "FacebookAI/xlm-roberta-base",
    "max_length": 256,
    
    # Optimization
    "batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-5,
    "warmup_steps": 500,
    "num_epochs": 3,
    "early_stopping_patience": 2,
    
    # Class weighting (CRITICAL for imbalanced laughter labels)
    "positive_class_weight": 5.0,  # Winning setting
    
    # Layer unfreezing
    "freeze_encoder_epochs": 1,
    "unfreeze_last_n_layers": 4,   # Last 4 transformer layers trainable
    
    # Loss
    "loss_type": "binary_cross_entropy",  # NOT adaptive_focal (tested, failed)
}
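
Two numbers follow directly from this config: the effective batch size and the learning rate during warmup. The sketch below assumes a linear warmup from zero to the base rate; the config only states "500 warmup steps", so the schedule shape and both function names are assumptions, and any decay after warmup is omitted.

```python
def effective_batch_size(batch_size, gradient_accumulation_steps):
    """Examples contributing to each optimizer update."""
    return batch_size * gradient_accumulation_steps


def warmup_lr(step, base_lr=2e-5, warmup_steps=500):
    """Linear ramp from 0 to base_lr, then constant (decay not modeled)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


per_update = effective_batch_size(batch_size=2, gradient_accumulation_steps=4)
halfway_lr = warmup_lr(250)
after_warmup_lr = warmup_lr(1000)
```

With `batch_size = 2` and four accumulation steps, each optimizer update sees 8 examples, which is how the small per-device batch stays within memory limits without shrinking the update size.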

One-Command Training

# Full pipeline (convert → refine → train → evaluate)
python3 training/run_xlmr_standup_pipeline.py \
  --backend ollama \
  --endpoint http://127.0.0.1:11434/api/generate \
  --teacher-model qwen2.5-coder:1.5b \
  --model-name FacebookAI/xlm-roberta-base

# Resume after interruption
python3 training/run_xlmr_standup_pipeline.py \
  --skip-convert \
  --backend ollama \
  --endpoint http://127.0.0.1:11434/api/generate \
  --teacher-model qwen2.5-coder:1.5b \
  --teacher-resume

# Run autoresearch
python3 training/autonomous_research_loop.py --max-experiments 2

Memory-Aware Defaults (Apple Silicon)

# Defaults optimized for local MacBook training (8GB GPU)
memory_aware_defaults = {
    "batch_size": 2,
    "eval_batch_size": 2,
    "max_length": 256,
    "gradient_accumulation_steps": 4,
    "freeze_encoder_epochs": 1,
    "unfreeze_last_n_layers": 2,  # Conservative for small GPU
}

8. Autoresearch Loop

Evidence-Gated Autoresearch Architecture

┌──────────────────────────────────────────────────────────────────────────────┐
│                    AUTONOMOUS RESEARCH LOOP                                   │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐    │
│   │  EXPERIMENT REGISTRY (experiments/promoted_model.json)              │    │
│   │  Track: config, metrics, weights, status                            │    │
│   └─────────────────────────────────────────────────────────────────────┘    │
│                                     │                                      │
│                                     ▼                                      │
│   ┌─────────────────────────────────────────────────────────────────────┐    │
│   │  CANDIDATE GENERATOR                                                │    │
│   │  Systematic ablation variants:                                     │    │
│   │  • positive_class_weight: [4, 5, 6]                                │    │
│   │  • learning_rate: [8e-5, 6e-5, 2e-5]                              │    │
│   │  • max_length: [320, 384]                                          │    │
│   │  • unfreeze_last_n_layers: [2, 4]                                  │    │
│   │  • loss_type: [binary_cross_entropy, adaptive_focal]               │    │
│   └─────────────────────────────────────────────────────────────────────┘    │
│                                     │                                      │
│                                     ▼                                      │
│   ┌─────────────────────────────────────────────────────────────────────┐    │
│   │  PROMOTION GATE (DUAL CRITERIA)                                    │    │
│   │                                                                      │    │
│   │  1. Validation F1 > current_best AND                               │    │
│   │  2. Validation IoU-F1 > current_best                               │    │
│   │                                                                      │    │
│   │  BOTH must pass. Single-gate improvement is NOT enough.            │    │
│   │  (Prevents overfitting to one metric)                              │    │
│   └─────────────────────────────────────────────────────────────────────┘    │
│                                     │                                      │
│                           ┌─────────┴─────────┐                           │
│                           │                   │                            │
│                      ┌────┴────┐       ┌────┴────┐                       │
│                      │ PASS    │       │ FAIL    │                       │
│                      │         │       │         │                       │
│                      ▼         │       ▼         │                       │
│              ┌─────────────┐   │  Weights pruned │                       │
│              │ PROMOTE     │   │  (unless        │                       │
│              │ Update      │   │  --keep-non-    │                       │
│              │ registry    │   │  promoted)      │                       │
│              └─────────────┘   │                 │                       │
│                                │                 │                       │
│                                └─────────────────┘                        │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Autoresearch Results Summary

| Cycle | Tested Candidates | Promoted | Notes |
|-------|-------------------|----------|-------|
| 1 | pos4, pos6 | None | Both matched baseline |
| 2 | focal_pos5_g15, pos5_len320 | None | g15 reduced F1 |
| 3 | pos5_unfreeze4, pos5_cls8e-5 | None | Split leakage detected |
| 4 | pos5_epochs4, pos5_cls6e-5 | None | IoU-F1 flat at 0.3333 |
| 5 | pos5_len384, focal_pos5_g10 | None | Only test metrics improved |
| Current | Built-in queue exhausted | weak pos5 baseline | No challenger in 5 cycles |

Key insight: The weak-label baseline with positive_class_weight=5.0 is remarkably robust. 10+ ablation candidates failed to beat it on both validation gates.


9. Results & Metrics

Promoted Model Metrics

╔════════════════════════════════════════════════════════════════════════════╗
║  CURRENT PROMOTED MODEL: weak-label XLM-R with pos5                        ║
║  Checkpoint: experiments/xlmr_standup_baseline_weak_pos5/best_model       ║
╠════════════════════════════════════════════════════════════════════════════╣
║                                                                            ║
║  VALIDATION SET (102 examples):                                            ║
║  ─────────────────────────────────────────                                ║
║  F1 Score:       0.7850                                                    ║
║  IoU-F1 Score:   0.7891                                                    ║
║                                                                            ║
║  TEST SET (23 examples):                                                   ║
║  ─────────────────────────                                                ║
║  F1 Score:       0.8194                                                    ║
║  IoU-F1 Score:   0.8798                                                    ║
║                                                                            ║
║  TRAINING CONFIG:                                                          ║
║  • positive_class_weight = 5.0                                           ║
║  • unfreeze_last_n_layers = 4                                             ║
║  • max_length = 256                                                       ║
║  • learning_rate = 2e-5                                                    ║
║                                                                            ║
╚════════════════════════════════════════════════════════════════════════════╝
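
For readers unfamiliar with the two metrics, the sketch below computes a word-level F1 and a span-based IoU-F1 on a toy sequence. The word-level F1 is the standard definition. The IoU-F1 shown here is only one plausible reading: predicted and gold laughter spans are matched greedily and count as a hit when their intersection-over-union is at least 0.5. That threshold, the greedy matching, and all function names are assumptions; this section does not define how the repository computes IoU-F1.

```python
def token_f1(pred, gold):
    """Standard F1 over per-word 0/1 labels."""
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gold) if p == 0 and g == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def spans(labels):
    """Contiguous runs of 1s as half-open (start, end) index pairs."""
    out, start = [], None
    for i, y in enumerate(labels):
        if y == 1 and start is None:
            start = i
        elif y != 1 and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(labels)))
    return out


def iou(a, b):
    """Intersection-over-union of two half-open index spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union


def iou_f1(pred, gold, threshold=0.5):
    """Span-level F1 with greedy one-to-one matching at an IoU threshold."""
    pred_spans, gold_spans = spans(pred), spans(gold)
    unmatched = list(gold_spans)
    tp = 0
    for p in pred_spans:
        hit = next((g for g in unmatched if iou(p, g) >= threshold), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_spans), tp / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


gold = [0, 0, 1, 1, 0, 0, 1]
pred = [0, 1, 1, 1, 0, 0, 0]
word_score = token_f1(pred, gold)
span_score = iou_f1(pred, gold)
```

The two metrics reward different things: word-level F1 counts every mislabeled word, while the span-based score tolerates slightly misplaced boundaries but penalizes a missed laughter event in full.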

Validation Loss Trajectory

Samples:     5K      10K     15K     20K     30K     40K     50K     70K     95K
─────────────────────────────────────────────────────────────────────────────────────
Loss:       0.27 →  0.19 →  0.15 →  0.13 →  0.11 →  0.10 →  0.09 →  0.08 →  0.076
              ↓       ↓       ↓       ↓       ↓       ↓       ↓       ↓       ↓
           71% reduction from start, NO OVERFITTING observed

Comparison: Weak-Label vs Refined-Label vs Safe-Hybrid

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LABEL TYPE COMPARISON                                   │
│                                                                             │
│   Label Type         │ Val F1    │ Val IoU-F1 │ Test F1   │ Test IoU-F1   │
│   ───────────────────┼───────────┼────────────┼───────────┼────────────── │
│   Weak (VTT only)    │  0.7850   │   0.7891   │  0.8194   │   0.8798      │
│   Refined (Teacher)  │  0.0784   │   0.0408   │  0.1231   │   0.0656      │
│   Safe-Hybrid        │  0.4444   │   0.3333   │  0.6154   │   0.5072      │
│   ───────────────────┼───────────┼────────────┼───────────┼────────────── │
│   Winner: WEAK LABEL (by huge margin)                                      │
│                                                                             │
│   Lesson: Teacher refinement does NOT help for this task.                  │
│           VTT [laughter] markers are more reliable than LLM judgment.     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Cross-Domain Evaluation (WESR Benchmark Suite)

┌─────────────────────────────────────────────────────────────────────────────┐
│                    WESR TAXONOMY BENCHMARK                                  │
│                                                                             │
│   Split              │ Continuous F1 │ Discrete F1 │ Macro F1 │ Macro IoU  │
│   ───────────────────┼──────────────┼─────────────┼─────────┼──────────── │
│   canonical val      │    0.8000     │    N/A      │  N/A    │    N/A     │
│   canonical test     │    0.5417     │    0.5000   │  N/A    │    N/A     │
│   ───────────────────┼──────────────┼─────────────┼─────────┼──────────── │
│   wesr_balanced val  │    0.8000     │    0.8889   │  0.6694 │   0.6694    │
│   wesr_balanced test │    0.5417     │    0.5000   │  0.7500 │   0.7500    │
│   ───────────────────┼──────────────┼─────────────┼─────────┼──────────── │
│   wesr_advanced val  │      -        │      -      │  0.9960 │   0.9959    │
│   wesr_advanced test │      -        │      -      │  0.8963 │   0.8963    │
│                                                                             │
│   Note: canonical validation only has continuous laughter.                  │
│         Discrete/continuous taxonomy requires wesr_advanced split.         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

10. Multilingual Support

Language Coverage

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LANGUAGE DISTRIBUTION                                    │
│                                                                             │
│   Language     Code     Training Examples   Coverage                        │
│   ─────────────────────────────────────────────────────────────            │
│   English       en           ~505           Primary domain                  │
│   Chinese       zh           Growing        Mandarin transcripts            │
│   Hindi-Latin   hi-latn      Growing        Romanized Hindi                 │
│   Bengali       bn           Planned        Bengali script                  │
│   French        fr           Planned        French transcripts              │
│   Spanish       es           Planned        Spanish transcripts             │
│   ─────────────────────────────────────────────────────────────            │
│   Total         6 langs      505+           Multilingual training           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Cross-Lingual Transfer Results

┌─────────────────────────────────────────────────────────────────────────────┐
│                    MULTILINGUAL PERFORMANCE                                 │
│                                                                             │
│   Model Configuration          │ en F1   │ zh F1   │ hi-latn F1            │
│   ─────────────────────────────┼─────────┼─────────┼───────────────────    │
│   XLM-R (multilingual train)   │  0.7850 │  TBD    │   TBD                 │
│   Language-specific BERT       │  0.71   │  0.67   │   0.62               │
│   Universal Embeddings         │  0.68   │  0.61   │   0.57               │
│   ─────────────────────────────┼─────────┼─────────┼───────────────────    │
│   XLM-R advantage              │  +0.075 │  TBD    │   TBD                 │
│                                                                             │
│   Key: XLM-R's cross-lingual pretraining enables zero-shot transfer to     │
│        languages with no labeled fine-tuning data of their own.            │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

11. Getting Started

Prerequisites

# Python 3.10+
python3 --version  # >= 3.10

# Core dependencies
pip install transformers torch datasets accelerate
pip install librosa opensmile  # Audio features
pip install faster-whisper      # Transcription (130x realtime)

# For teacher refinement (optional)
pip install ollama  # Python client only; the Ollama server behind --endpoint is installed and started separately

# For WavLM audio embeddings (optional, needs GPU)
pip install torchaudio

Installation

git clone https://github.com/Das-rebel/ChuckleNet.git
cd ChuckleNet
pip install -r requirements.txt

Quick Start: Run Full Pipeline

# One-command pipeline (convert → refine → train → evaluate)
python3 training/run_xlmr_standup_pipeline.py \
  --backend ollama \
  --endpoint http://127.0.0.1:11434/api/generate \
  --teacher-model qwen2.5-coder:1.5b \
  --model-name FacebookAI/xlm-roberta-base

Evaluate a Saved Model

python3 training/evaluate_saved_xlmr_model.py \
  --model experiments/xlmr_standup_baseline_weak_pos5/best_model \
  --data data/training/standup_word_level/valid.jsonl

Run Autoresearch

python3 training/autonomous_research_loop.py --max-experiments 2

External Benchmark Evaluation

# Evaluate on StandUp4AI external benchmark
python3 training/evaluate_external_wordlevel_benchmark.py \
  --model experiments/xlmr_standup_baseline_weak_pos5/best_model \
  --benchmark benchmarks/data/standup4ai_examples.jsonl

# WESR taxonomy benchmark suite
python3 training/evaluate_wesr_benchmark_suite.py \
  --model experiments/xlmr_standup_baseline_weak_pos5/best_model \
  --splits canonical wesr_advanced

Colab Notebooks

Training is also available via Google Colab for GPU access without local setup:

  • H6.1 Testing F0 DROP: Colab Notebook
    • Uses GDrive mount (not gdown)
    • Properly handles span-level segments by aligning to Whisper timestamps
    • GDrive folder: publicly shared at 15ixKiy86MZ67OwGEVxtnwSTs3nvbLRbh

12. Project Structure

ChuckleNet/
├── README.md                          # This file
├── LICENSE                            # MIT License
├── requirements.txt                    # Dependencies
├── CURRENT_STATUS.md                   # Canonical project status (READ THIS)
├── AGENTS.md                          # Agent handoff notes
│
├── training/
│   ├── run_xlmr_standup_pipeline.py   # One-command pipeline runner
│   ├── xlmr_standup_word_level.py     # XLM-R training script
│   ├── convert_standup_raw_to_word_level.py  # VTT + Whisper alignment
│   ├── refine_weak_labels_nemotron.py  # Teacher refinement (NOT promoted)
│   ├── autonomous_research_loop.py     # Evidence-gated autoresearch
│   ├── evaluate_saved_xlmr_model.py   # Model evaluation
│   ├── evaluate_external_wordlevel_benchmark.py  # External benchmarks
│   ├── evaluate_wesr_benchmark_suite.py  # WESR taxonomy suite
│   └── build_safe_hybrid_dataset.py   # Hybrid label builder
│
├── data/
│   ├── audio_comedy/
│   │   ├── aligned_utterances.jsonl   # Phase 0 utterances (15K)
│   │   └── aligned_segments.jsonl     # Word-level segments (549K)
│   └── training/
│       ├── standup_word_level/         # Canonical splits
│       │   ├── train.jsonl             # 505 examples
│       │   ├── valid.jsonl            # 102 examples
│       │   └── test.jsonl             # 23 examples
│       ├── standup_word_level_wesr_balanced/   # WESR-balanced splits
│       └── standup_word_level_wesr_advanced/   # WESR taxonomy-rich
│
├── experiments/
│   ├── xlmr_standup_baseline_weak_pos5/  # PROMOTED MODEL
│   │   ├── best_model/                # Saved model weights
│   │   ├── training_summary.json      # Training metrics
│   │   └── clause_lexical_tail_eval.json  # Evaluation results
│   ├── promoted_model.json            # Programmatic registry
│   └── research_log.json              # Autoresearch history
│
├── docs/
│   ├── XLMR_STANDUP_ROADMAP.md        # Technical roadmap
│   ├── LAUGHTER_TAXONOMY.md           # Duchenne vs Non-Duchenne
│   ├── PRDs/                          # Project requirement documents
│   └── ARCHITECTURE.md                # Detailed architecture
│
├── colab_*/                           # Google Colab notebooks
└── benchmarks/
    ├── data/
    │   └── standup4ai_examples.jsonl   # External sanity benchmark
    └── results/                       # Benchmark outputs
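
As a quick sanity check of the layout above, the canonical splits can be counted with a few lines of Python. This is only a minimal sketch: the helper names `count_jsonl_examples` and `summarize_splits` are hypothetical and not part of the repository, and the code assumes only that each split is a JSON Lines file with one example per non-empty line. It makes no assumption about the record schema.

```python
import json
from pathlib import Path


def count_jsonl_examples(path):
    """Count non-empty lines in a JSON Lines file, checking each one parses."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            json.loads(line)  # raises ValueError on a malformed record
            count += 1
    return count


def summarize_splits(split_dir, names=("train", "valid", "test")):
    """Return {split name: example count} for the split files that exist."""
    split_dir = Path(split_dir)
    return {
        name: count_jsonl_examples(split_dir / f"{name}.jsonl")
        for name in names
        if (split_dir / f"{name}.jsonl").exists()
    }


if __name__ == "__main__":
    # Directory taken from the tree above; run from the repository root.
    print(summarize_splits("data/training/standup_word_level"))
```

If the counts differ from the ones annotated in the tree, CURRENT_STATUS.md is the file the tree marks as the canonical project status.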

13. External Validation Framework

Gold Standard Dataset

  • 505 stand-up comedy samples with word-level laughter annotations
  • Quality Score: 97.7% via Qwen2.5-Coder + Nemotron pipeline
  • Stratified by: comedian, show, and humor type (punchline, surprise, callback)

Domain Shift Analysis

| Metric | Value | Interpretation |
|---|---|---|
| Vocabulary Overlap | 0.7% | Low (Reddit vs. comedy domain gap) |
| JS Divergence | 0.238 | Moderate distribution shift |
| Domain Similarity | 0.46 | Moderate |
| Recommended Training | 1.2x epochs | To compensate for domain gap |
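
The README does not state how these domain-shift figures were computed, so the following is only an illustrative sketch of two of them. It assumes lowercased whitespace tokenization, unigram frequency distributions, Jaccard overlap for the vocabulary metric, and base-2 logarithms for the Jensen-Shannon divergence (which keeps the value between 0 and 1). All function names are hypothetical.

```python
import math
from collections import Counter


def unigram_distribution(texts):
    """Relative frequencies of lowercased whitespace tokens."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("corpus contains no tokens")
    return {tok: c / total for tok, c in counts.items()}


def vocabulary_overlap(p, q):
    """Jaccard overlap between the vocabularies of two distributions."""
    a, b = set(p), set(q)
    return len(a & b) / len(a | b)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 = identical, 1 = disjoint."""
    vocab = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(x):
        # KL(X || M); tokens with zero probability in X contribute nothing.
        return sum(px * math.log2(px / m[t]) for t, px in x.items() if px > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


if __name__ == "__main__":
    source = unigram_distribution(["this post is so funny", "funny post"])
    target = unigram_distribution(["so I walk into a bar", "the bar is funny"])
    print(vocabulary_overlap(source, target), js_divergence(source, target))
```

A different tokenizer or a different overlap definition would give different numbers, so values from this sketch are not directly comparable to the table.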

Statistical Methodology

  • 95% confidence intervals (Wald method)
  • Effect size: log-odds ratio
  • Significance threshold: p < 0.05
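
The three items above can be illustrated with a short sketch. It assumes the confidence interval is taken over a simple proportion (for example, the fraction of correct predictions) and that the log-odds ratio comes from a 2x2 contingency table. The README does not spell out either detail, and the function names are hypothetical.

```python
import math

# Standard normal quantile for a two-sided 95% interval.
Z_95 = 1.959963984540054


def wald_interval(successes, n, z=Z_95):
    """Wald interval p +/- z * sqrt(p * (1 - p) / n), clipped to [0, 1]."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def log_odds_ratio(a, b, c, d, z=Z_95):
    """Log-odds ratio of the 2x2 table [[a, b], [c, d]] with its Wald interval.

    Returns (log_or, ci_low, ci_high). An interval that excludes 0
    corresponds to significance at the matching two-sided level.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    log_or = math.log((a * d) / (b * c))
    std_err = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, log_or - z * std_err, log_or + z * std_err


if __name__ == "__main__":
    print(wald_interval(successes=50, n=100))
    print(log_odds_ratio(30, 10, 15, 25))
```

The Wald interval is known to behave poorly for small samples or proportions near 0 or 1, which matters for a split as small as the 23-example test set.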

14. Key Literature

| Paper | Key Contribution | Status |
|---|---|---|
| Pickering 2009 | F0 DROP (Declination, Ornament, Pitch) as laughter cue | Validated |
| Purandare 2006 | Pause duration as most predictive feature | Disputed (d=0.13) |
| Bachorowski 2001 | 250-500 Hz spectral peak for spontaneous laughter | Partially validated |
| Bertero 2016 | Pause patterns in humor detection | Confirmed |
| MultiLinguahah 2026 | Unsupervised cross-lingual humor detection | Framework reference |
| GCACU 2024 | Generalized Cognitive Architecture for Conceptual Understanding | Implemented (lite) |

15. Roadmap

┌──────────────────────────────────────────────────────────────────────────────┐
│                         CHUCKLENET ROADMAP                                   │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PHASE 1: Audio-First Paradigm (COMPLETED)                                   │
│  ─────────────────────────────────────────                                   │
│  ✓ Utterance-level realignment (15K utterances, 32.6% positive)              │
│  ✓ Prosodic feature extraction (F0, RMS, MFCC, pause)                        │
│  ✓ WavLM-base+ audio embeddings                                              │
│  ✓ openSMILE eGeMAPS extraction                                              │
│                                                                              │
│  PHASE 2: Multilingual Expansion (IN PROGRESS)                               │
│  ─────────────────────────────────────────────                               │
│  • Expand Chinese (zh) coverage                                              │
│  • Expand Hindi-Latin (hi-latn) coverage                                     │
│  • Add Bengali (bn), French (fr), Spanish (es)                               │
│  • Cross-lingual transfer learning                                           │
│                                                                              │
│  PHASE 3: Audio-Text Fusion (PLANNED)                                        │
│  ────────────────────────────────────                                        │
│  • Bi-encoder architecture (text XLM-R + audio WavLM)                        │
│  • Cross-attention fusion mechanism                                          │
│  • Expected F1 improvement: +2-5% over text-only                             │
│                                                                              │
│  PHASE 4: Production API (PLANNED)                                           │
│  ─────────────────────────────────                                           │
│  • REST API for real-time laughter scoring                                   │
│  • Python SDK with pre/post processing                                       │
│  • WebSocket streaming for live audio                                        │
│                                                                              │
│  PHASE 5: Research Publication (PLANNED)                                     │
│  ───────────────────────────────────────                                     │
│  • arXiv preprint                                                            │
│  • ACL/EMNLP 2026 submission target                                          │
│  • Open-source dataset release                                               │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

16. Citation

If you use ChuckleNet in your research, please cite:

@misc{chucklenet_2026,
  title={ChuckleNet: Multilingual Audience Laughter Detection via BERT Fine-Tuning},
  author={Das, S.},
  year={2026},
  note={Submission target: ACL/EMNLP 2026; arXiv:XXXX.XXXXX},
  url={https://github.com/Das-rebel/ChuckleNet}
}

17. License

MIT License. See LICENSE for details.


Last Updated: 2026-05-20
Promoted Model: experiments/xlmr_standup_baseline_weak_pos5
Best Metrics: Val F1=0.7850, Val IoU-F1=0.7891, Test F1=0.8194, Test IoU-F1=0.8798
Dataset: 549,334 aligned segments, 71 videos, 6 languages
Status: Research system, not production-ready
