VTuberSoulExtractor — Virtual Streamer Personality Extractor

Automatically extract a virtual streamer's personality from Bilibili recordings. Outputs an OpenHanako-compatible ishiki.md personality file.

Chinese documentation

How It Works

Bilibili Replay → yt-dlp Download → inaSpeechSegmenter Speech/Music Split
  → MLX-Whisper Transcribe → T2S Cleanup → Quality Filter
    → LLM Correction + Dialogue Reconstruction → Tone/Quirk Analysis
      → Generate ishiki.md Personality File

Who Is This For

Designed for VTuber fans. Feed it months of Bilibili replays and it runs these steps:

Step	Tool	Output
Filter singing	inaSpeechSegmenter	Speech-only segments
Transcribe	MLX-Whisper large-v3	187K dialogue lines
Clean up	Rules + LLM	ASR-corrected text
Analyze	Statistical + LLM	Tone, quirks, catchphrases
Generate	LLM	OpenHanako ishiki.md

Hardware

	Minimum	Recommended
CPU	Apple M1	M1 Max+
RAM	16GB	64GB
Disk	20GB	50GB+

Quick Start

# Install dependencies
/usr/bin/pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple \
  numpy soundfile tqdm jieba zhconv librosa mlx-whisper \
  inaSpeechSegmenter yt-dlp openai imageio-ffmpeg modelscope

# Download MLX model (ModelScope, China-accessible)
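python3 -c "
import os
from modelscope import snapshot_download
# Expand ~ explicitly: Python does not expand it inside a plain string
snapshot_download('mlx-community/whisper-large-v3-mlx',
                   cache_dir=os.path.expanduser('~/.cache/modelscope/hub'))
"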

# Fix inaSpeechSegmenter (see docs/PITFALLS.md)

# Download recordings
yt-dlp -x --audio-format m4a \
  -o "data/videos/BVid_p%(playlist_index)02d.%(ext)s" \
  "https://www.bilibili.com/video/BVid"

# Run pipeline
python3 src/pipeline.py separate      # Stage 2: Split speech/music
python3 src/pipeline.py transcribe    # Stage 3: Transcribe
python3 src/convert_t2s.py            # Traditional → Simplified Chinese
python3 src/clean_transcripts.py      # Quality filter
python3 src/pipeline.py dialogue      # Stage 4: Analyze
python3 src/pipeline.py correct       # Stage 4.5: LLM correction
python3 src/auto_correct.py --api-key KEY --model deepseek-v4-flash  # LLM batch correction; replace KEY with your API key
python3 src/pipeline.py personality   # Stage 5: Generate
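
The stage commands above can also be chained from a small driver script. The following is a minimal sketch, not part of the repository: the names `STAGES`, `build_commands`, and `run_all` are hypothetical, and the `auto_correct.py` step is left out because it needs an API key. It defaults to a dry run that only prints each command.

```python
import subprocess
import sys

# Stage commands copied from the Quick Start list, in the same order.
# auto_correct.py is omitted here because it requires --api-key.
STAGES = [
    ["src/pipeline.py", "separate"],
    ["src/pipeline.py", "transcribe"],
    ["src/convert_t2s.py"],
    ["src/clean_transcripts.py"],
    ["src/pipeline.py", "dialogue"],
    ["src/pipeline.py", "correct"],
    ["src/pipeline.py", "personality"],
]


def build_commands(python=sys.executable):
    """Return the full argv list for every stage."""
    return [[python, *stage] for stage in STAGES]


def run_all(dry_run=True):
    """Print each stage command; execute it too when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the run at the first failing stage
            subprocess.run(cmd, check=True)
```

Calling `run_all()` prints the commands for review; `run_all(dry_run=False)` executes them from the repository root.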

Pipeline

Stage	Command	What It Does
1	`yt-dlp`	Download Bilibili replay collections
2	`separate`	inaSpeechSegmenter splits speech from music
3	`transcribe`	MLX-Whisper large-v3 speech-to-text
—	`convert_t2s`	Traditional → Simplified Chinese
—	`clean_transcripts`	Remove repetition, BGM artifacts
4	`dialogue`	Tone word, punctuation, vocabulary stats
4.5	`correct`	LLM batch correction + dialogue inference
5	`personality`	Generate OpenHanako personality prompt
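
To illustrate the kind of rule the cleanup step might apply, here is a minimal sketch of a repetition filter. It is an assumption about the approach, not the code in `clean_transcripts.py`: the function names and the thresholds (`max_ratio`, `min_len`) are hypothetical. It drops lines dominated by a single character, a pattern often seen when ASR runs over music, and collapses consecutive duplicate lines.

```python
def is_repetitive(text, max_ratio=0.6, min_len=8):
    """Return True when one character makes up most of a long line.

    max_ratio and min_len are illustrative defaults, not tuned values.
    """
    stripped = "".join(text.split())  # ignore whitespace
    if len(stripped) < min_len:
        return False
    most_common = max(stripped.count(ch) for ch in set(stripped))
    return most_common / len(stripped) > max_ratio


def drop_repeats(lines):
    """Remove repetitive lines and consecutive duplicates, keeping order."""
    cleaned = []
    for line in lines:
        if is_repetitive(line):
            continue
        if cleaned and cleaned[-1] == line:
            continue  # same line transcribed twice in a row
        cleaned.append(line)
    return cleaned
```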

Real-World Results

Metric	Value
Source	8 Bilibili collections (331h)
Speech	143.6h (43.4%)
Segments	187,632 transcribed lines
Corrections	452 (22.1% of sampled)
Quirks found	宁→您(10), 捏→呢(3), 不了一点(358)

Output Example

Generated ishiki.md includes:

Speaking tone, catchphrases, typical sentence patterns
Tone word frequency (吧/啊/呢/呀)
Interaction patterns (greeting, gift thanks, teasing)
Quirk mapping: {宁→您: 10, 捏→呢: 3}
Forbidden phrases and stylistic guidelines
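
The frequency figures listed above could be gathered with plain substring counting. This is a hedged sketch rather than the project's actual analysis code: `count_markers` is a hypothetical helper, and raw substring counts will over-count quirk forms such as 宁 when the character appears in ordinary words, so a real pipeline would likely need context checks or the LLM pass.

```python
from collections import Counter


def count_markers(lines, markers):
    """Count non-overlapping occurrences of each marker across all lines."""
    counts = Counter({m: 0 for m in markers})
    for line in lines:
        for m in markers:
            counts[m] += line.count(m)
    return counts


# Marker sets taken from the README; the sample lines are made up.
TONE_WORDS = ("吧", "啊", "呢", "呀")
QUIRK_FORMS = ("宁", "捏", "不了一点")

sample = ["不了一点不了一点", "好的呢", "宁好呀"]
tone_counts = count_markers(sample, TONE_WORDS)
quirk_counts = count_markers(sample, QUIRK_FORMS)
```

The resulting counters can then be written into the tone word and quirk mapping sections of the generated file.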

Project Structure

VTuberSoulExtractor/
├── src/
│   ├── pipeline.py           # Main pipeline
│   ├── clean_transcripts.py  # Quality filter
│   ├── convert_t2s.py        # T→S conversion
│   └── auto_correct.py       # LLM batch correction
├── data/                     # Runtime (gitignored)
├── docs/
│   ├── PITFALLS.md
│   └── PITFALLS_CN.md
├── README.md
├── README_CN.md
├── requirements.txt
└── .gitignore

Credits

inaSpeechSegmenter — speech/music classification
mlx-whisper — MLX-accelerated transcription
OpenHanako — personality target
DeepSeek V4 Flash — LLM correction

License

MIT
