Automatically extract a virtual streamer's personality from Bilibili recordings. Outputs an OpenHanako-compatible ishiki.md personality file.
Bilibili Replay → yt-dlp Download → inaSpeechSegmenter Speech/Music Split
→ MLX-Whisper Transcribe → T2S Cleanup → Quality Filter
→ LLM Correction + Dialogue Reconstruction → Tone/Quirk Analysis
→ Generate ishiki.md Personality File
Designed for VTuber fans. Feed it months of Bilibili replays and it runs the following steps:
| Step | Tool | Output |
|---|---|---|
| Filter singing | inaSpeechSegmenter | Speech-only segments |
| Transcribe | MLX-Whisper large-v3 | 187K dialogue lines |
| Clean up | Rules + LLM | ASR-corrected text |
| Analyze | Statistical + LLM | Tone, quirks, catchphrases |
| Generate | LLM | OpenHanako ishiki.md |
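The clean-up step drops repeated text and BGM artifacts, but this README does not spell out the rules used. The sketch below shows one possible heuristic, assuming that a line dominated by a single character bigram is a transcription loop. The names `is_repetitive` and `filter_lines` and both thresholds are hypothetical and are not taken from `clean_transcripts.py`.

```python
import re
from collections import Counter


def is_repetitive(line: str, max_ratio: float = 0.5, min_len: int = 6) -> bool:
    """Flag a line when a single character bigram makes up most of it."""
    text = re.sub(r"\s+", "", line)  # ignore whitespace
    if len(text) < min_len:
        return False  # too short to judge
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    top_count = Counter(bigrams).most_common(1)[0][1]
    return top_count / len(bigrams) >= max_ratio


def filter_lines(lines):
    """Drop blank lines and lines flagged as repetitive."""
    return [ln for ln in lines if ln.strip() and not is_repetitive(ln)]


if __name__ == "__main__":
    sample = ["今天我们来玩一个新游戏", "谢谢谢谢谢谢谢谢", "  "]
    print(filter_lines(sample))
```

A bigram ratio is cheap and language-agnostic, but it also flags genuine laughter such as 哈哈哈哈哈哈, so a real filter would likely need an allowlist or a looser threshold.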
| | Minimum | Recommended |
|---|---|---|
| CPU | Apple M1 | M1 Max+ |
| RAM | 16GB | 64GB |
| Disk | 20GB | 50GB+ |
# Install dependencies
/usr/bin/pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple \
numpy soundfile tqdm jieba zhconv librosa mlx-whisper \
inaSpeechSegmenter yt-dlp openai imageio-ffmpeg modelscope
# Download MLX model (ModelScope, China-accessible)
python3 -c "
import os
from modelscope import snapshot_download
# Expand ~ explicitly rather than relying on the library to do it
snapshot_download('mlx-community/whisper-large-v3-mlx',
                  cache_dir=os.path.expanduser('~/.cache/modelscope/hub'))
"
# Fix inaSpeechSegmenter (see docs/PITFALLS.md)
# Download recordings
yt-dlp -x --audio-format m4a \
-o "data/videos/BVid_p%(playlist_index)02d.%(ext)s" \
"https://www.bilibili.com/video/BVid"
# Run pipeline
python3 src/pipeline.py separate # Stage 2: Split speech/music
python3 src/pipeline.py transcribe # Stage 3: Transcribe
python3 src/convert_t2s.py # T→S conversion
python3 src/clean_transcripts.py # Quality filter
python3 src/pipeline.py dialogue # Stage 4: Analyze
python3 src/pipeline.py correct # Stage 4.5: LLM correction
python3 src/auto_correct.py --api-key KEY --model deepseek-v4-flash
python3 src/pipeline.py personality # Stage 5: Generate

| Stage | Command | What It Does |
|---|---|---|
| 1 | yt-dlp | Download Bilibili replay collections |
| 2 | separate | inaSpeechSegmenter splits speech from music |
| 3 | transcribe | MLX-Whisper large-v3 speech-to-text |
| - | convert_t2s | Traditional → Simplified Chinese |
| - | clean | Remove repetition, BGM artifacts |
| 4 | dialogue | Tone word, punctuation, vocabulary stats |
| 4.5 | correct | LLM batch correction + dialogue inference |
| 5 | personality | Generate OpenHanako personality prompt |

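The stages are run one after another, so a small driver can save retyping them. The sketch below is a hypothetical wrapper, not part of the repository: it only assembles the script names and subcommands listed in this README and runs them in order. It leaves out `auto_correct.py`, which needs an API key, and it assumes the recordings are already downloaded.

```python
import subprocess
import sys

# Stage order copied from the commands in this README.
# The runner itself is a hypothetical convenience wrapper.
STAGES = [
    ["src/pipeline.py", "separate"],
    ["src/pipeline.py", "transcribe"],
    ["src/convert_t2s.py"],
    ["src/clean_transcripts.py"],
    ["src/pipeline.py", "dialogue"],
    ["src/pipeline.py", "correct"],
    ["src/pipeline.py", "personality"],
]


def build_commands(python=sys.executable):
    """Return one argv list per stage, in execution order."""
    return [[python, *stage] for stage in STAGES]


def run_all(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the run at the first failing stage
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Starting with `dry_run=True` prints the plan without touching any data; switching to `run_all(dry_run=False)` executes it.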
| Metric | Value |
|---|---|
| Source | 8 Bilibili collections (331h) |
| Speech | 143.6h (43.4%) |
| Segments | 187,632 transcribed lines |
| Corrections | 452 (22.1% of sampled) |
| Quirks found | 宁→您(10), 捏→呢(3), 不了一点(358) |
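Counts such as 不了一点(358) can be thought of as phrase tallies over the transcribed lines. The sketch below shows a naive version using plain substring counting. The function name `count_phrases` is hypothetical, and the project's actual counting method is not described here. Substring counting would overcount a character like 宁 when it appears in ordinary words such as 宁静, so quirk detection probably needs context that this sketch ignores.

```python
from collections import Counter


def count_phrases(lines, phrases):
    """Count non-overlapping occurrences of each phrase across all lines."""
    counts = Counter({p: 0 for p in phrases})
    for line in lines:
        for p in phrases:
            counts[p] += line.count(p)
    return counts


if __name__ == "__main__":
    lines = ["不了一点不了一点", "这个真的不了一点", "好捏"]
    for phrase, n in count_phrases(lines, ["不了一点", "捏"]).items():
        print(phrase, n)
```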
Generated ishiki.md includes:
- Speaking tone, catchphrases, typical sentence patterns
- Tone word frequency (吧/啊/呢/呀)
- Interaction patterns (greeting, gift thanks, teasing)
- Quirk mapping: {宁→您: 10, 捏→呢: 3}
- Forbidden phrases and stylistic guidelines
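The tone word frequencies listed above could be computed roughly as follows. This is a minimal sketch, assuming a tone word only counts when it is followed by punctuation, whitespace, or the end of the line, so that 吧 in 吧台 is skipped. The function name and the punctuation set are hypothetical; the statistics in `pipeline.py` may be defined differently.

```python
import re
from collections import Counter

TONE_WORDS = "吧啊呢呀"
# Match a tone word only when punctuation, whitespace, or end of line follows.
_FINAL = re.compile(rf"([{TONE_WORDS}])(?=[，。！？!?,.~～…\s]|$)")


def tone_word_frequency(lines):
    """Return each tone word's share of all sentence-final tone words."""
    counts = Counter()
    for line in lines:
        counts.update(_FINAL.findall(line))
    total = sum(counts.values())
    return {w: (counts[w] / total if total else 0.0) for w in TONE_WORDS}


if __name__ == "__main__":
    sample = ["好吧，那我们开始呢", "是啊！", "吧台在哪里"]
    print(tone_word_frequency(sample))
```

Reporting shares rather than raw counts makes results comparable between streamers with different amounts of transcribed speech.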
VTuberSoulExtractor/
├── src/
│ ├── pipeline.py # Main pipeline
│ ├── clean_transcripts.py # Quality filter
│ ├── convert_t2s.py # T→S conversion
│ └── auto_correct.py # LLM batch correction
├── data/ # Runtime (gitignored)
├── docs/
│ ├── PITFALLS.md
│ └── PITFALLS_CN.md
├── README.md
├── README_CN.md
├── requirements.txt
└── .gitignore
- inaSpeechSegmenter — speech/music classification
- mlx-whisper — MLX-accelerated transcription
- OpenHanako — personality target
- DeepSeek V4 Flash — LLM correction
MIT