Commit 04ff18d

Merge pull request #131 from vcon-dev/feature/wtf-transcribe-link
Add wtf_transcribe link for WTF transcription
2 parents ec0311d + 950bc3c commit 04ff18d

2 files changed

Lines changed: 480 additions & 0 deletions

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
# WTF Transcription Link (vfun Integration)

A link that sends vCon audio dialogs to a vfun transcription server and adds the results as WTF (World Transcription Format) analysis entries.

## Overview

This link integrates with the vfun transcription server to provide:

- Multi-language speech recognition (English + auto-detect)
- Speaker diarization (who spoke when)
- GPU-accelerated processing with CUDA
- WTF-compliant output format per IETF draft-howe-vcon-wtf-extension-01

## Configuration

```yaml
wtf_transcribe:
  module: links.wtf_transcribe
  options:
    # Required: URL of the vfun transcription server
    vfun-server-url: http://localhost:8443/transcribe

    # Optional: Enable speaker diarization (default: true)
    diarize: true

    # Optional: Request timeout in seconds (default: 300)
    timeout: 300

    # Optional: Minimum dialog duration to transcribe in seconds (default: 5)
    min-duration: 5

    # Optional: API key for vfun server authentication
    api-key: your-api-key-here
```
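
The defaults listed in the comments above could be applied to a link's `options` roughly as follows. This is a minimal sketch, not the link's actual code: `DEFAULT_OPTIONS` and `resolve_options` are hypothetical names, and only the option keys and default values come from the configuration shown.

```python
from typing import Any, Dict, Mapping, Optional

# Defaults taken from the YAML comments; the names below are hypothetical.
DEFAULT_OPTIONS: Dict[str, Any] = {
    "vfun-server-url": None,  # required, so there is no usable default
    "diarize": True,
    "timeout": 300,           # seconds
    "min-duration": 5,        # seconds
    "api-key": None,          # optional
}


def resolve_options(opts: Optional[Mapping[str, Any]]) -> Dict[str, Any]:
    """Merge user-supplied options over the documented defaults."""
    merged = {**DEFAULT_OPTIONS, **(opts or {})}
    if not merged["vfun-server-url"]:
        raise ValueError("vfun-server-url is required")
    return merged
```

With this sketch, a configuration that sets only `vfun-server-url` ends up with `diarize: true`, `timeout: 300`, and `min-duration: 5`.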

## How It Works

1. **Extract Audio**: Reads audio from the vCon dialog (supports `body` with base64/base64url encoding, or `url` with `file://` or `http://` references)
2. **Send to vfun**: POSTs the audio file to vfun's `/transcribe` endpoint
3. **Create WTF Analysis**: Formats the transcription result as a WTF analysis entry
4. **Update vCon**: Adds the WTF analysis to the vCon and stores it back to Redis
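
The audio extraction in step 1 could look roughly like the sketch below. It assumes the dialog is a plain dict with `body`, `encoding`, and `url` keys; `extract_audio` is a hypothetical helper, not the link's actual function, and accepting `https://` alongside `http://` is an assumption.

```python
import base64
from urllib.parse import urlparse
from urllib.request import url2pathname, urlopen


def extract_audio(dialog: dict, timeout: float = 300) -> bytes:
    """Return raw audio bytes from an inline body or a referenced URL."""
    body = dialog.get("body")
    if body:
        # Assumption: a missing "encoding" field means base64url.
        encoding = dialog.get("encoding", "base64url")
        # base64url bodies often omit padding, so restore it before decoding.
        padded = body + "=" * (-len(body) % 4)
        if encoding == "base64url":
            return base64.urlsafe_b64decode(padded)
        if encoding == "base64":
            return base64.b64decode(padded)
        raise ValueError(f"unsupported body encoding: {encoding}")

    url = dialog.get("url")
    if not url:
        raise ValueError("dialog has neither body nor url")

    parsed = urlparse(url)
    if parsed.scheme == "file":
        # Read local files directly instead of going through HTTP.
        with open(url2pathname(parsed.path), "rb") as fh:
            return fh.read()
    if parsed.scheme in ("http", "https"):  # https is an assumption
        with urlopen(url, timeout=timeout) as resp:
            return resp.read()
    raise ValueError(f"unsupported URL scheme: {parsed.scheme}")
```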

## Output Format

The link adds analysis entries with the WTF format:

```json
{
  "type": "wtf_transcription",
  "dialog": 0,
  "mediatype": "application/json",
  "vendor": "vfun",
  "product": "parakeet-tdt-110m",
  "schema": "wtf-1.0",
  "encoding": "json",
  "body": {
    "transcript": {
      "text": "Hello, how can I help you today?",
      "language": "en-US",
      "duration": 30.0,
      "confidence": 0.95
    },
    "segments": [
      {
        "id": 0,
        "start": 0.0,
        "end": 3.5,
        "text": "Hello, how can I help you today?",
        "confidence": 0.95,
        "speaker": 0
      }
    ],
    "metadata": {
      "created_at": "2024-01-15T10:30:00Z",
      "processed_at": "2024-01-15T10:30:05Z",
      "provider": "vfun",
      "model": "parakeet-tdt-110m"
    },
    "speakers": {
      "0": {
        "id": 0,
        "label": "Speaker 0",
        "segments": [0],
        "total_time": 15.2
      }
    },
    "quality": {
      "average_confidence": 0.95,
      "multiple_speakers": true,
      "low_confidence_words": 0
    }
  }
}
```
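
One way to assemble an entry of this shape is sketched below. Both helper names are hypothetical, and the rule that `total_time` is the sum of a speaker's segment durations is an assumption; the field names and constant values are copied from the example above.

```python
from typing import Any, Dict, List


def summarize_speakers(segments: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Group segment ids and speaking time by speaker (assumed semantics)."""
    speakers: Dict[str, Dict[str, Any]] = {}
    for seg in segments:
        if "speaker" not in seg:
            continue  # segment without a diarization label
        spk = seg["speaker"]
        entry = speakers.setdefault(
            str(spk),
            {"id": spk, "label": f"Speaker {spk}", "segments": [], "total_time": 0.0},
        )
        entry["segments"].append(seg["id"])
        entry["total_time"] += seg["end"] - seg["start"]
    return speakers


def make_wtf_analysis(dialog_index: int, body: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a WTF body in the analysis envelope shown in the example."""
    return {
        "type": "wtf_transcription",
        "dialog": dialog_index,
        "mediatype": "application/json",
        "vendor": "vfun",
        # The example uses the model name as the product.
        "product": body["metadata"]["model"],
        "schema": "wtf-1.0",
        "encoding": "json",
        "body": body,
    }
```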

## Behavior

- **Skips non-recording dialogs**: Only processes dialogs with `type: "recording"`
- **Skips already-transcribed dialogs**: Dialogs with existing WTF transcriptions are skipped
- **Duration filtering**: Dialogs shorter than `min-duration` are skipped
- **File URL support**: Can read audio from local `file://` URLs directly
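
The skip rules above can be sketched as a single predicate. This assumes the vCon is a dict with `dialog` and `analysis` arrays and that each dialog carries a `duration` in seconds; `should_transcribe` is a hypothetical name, and transcribing dialogs with no `duration` field is an assumption rather than documented behavior.

```python
from typing import Any, Dict


def should_transcribe(vcon: Dict[str, Any], index: int, min_duration: float = 5) -> bool:
    """Return True if the dialog at `index` passes all skip rules."""
    dialog = vcon["dialog"][index]

    # Only recordings are transcribed.
    if dialog.get("type") != "recording":
        return False

    # Skip dialogs that already have a WTF transcription attached.
    for analysis in vcon.get("analysis", []):
        if analysis.get("type") == "wtf_transcription" and analysis.get("dialog") == index:
            return False

    # Skip short dialogs; a missing duration is treated as "unknown, transcribe".
    duration = dialog.get("duration")
    if duration is not None and duration < min_duration:
        return False

    return True
```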

## Example Chain Configuration

```yaml
chains:
  transcription_chain:
    links:
      - tag
      - wtf_transcribe
      - supabase_webhook
    ingress_lists:
      - transcribe
    egress_lists:
      - transcribed
    enabled: 1
```

## vfun Server

The vfun server provides GPU-accelerated transcription:

```bash
# Start vfun server
cd /path/to/vfun
./vfun server

# Test health
curl http://localhost:8443/ping

# Manual transcription test
curl -X POST http://localhost:8443/transcribe \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@audio.wav" \
  -F "diarize=true"
```
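
For reference, the manual `curl` test can be approximated in Python using only the standard library. This is a sketch under assumptions: `build_transcribe_request` is a hypothetical helper, the form field names `file` and `diarize` are taken from the `curl` command, and the part content type `application/octet-stream` is a guess at what the server accepts.

```python
import uuid
from typing import Optional
from urllib.request import Request


def build_transcribe_request(
    url: str,
    audio: bytes,
    *,
    filename: str = "audio.wav",
    diarize: bool = True,
    api_key: Optional[str] = None,
) -> Request:
    """Build a multipart/form-data POST equivalent to the curl command."""
    boundary = uuid.uuid4().hex
    flag = "true" if diarize else "false"

    # First part: the audio file itself.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")

    # Second part: the diarize flag, followed by the closing boundary.
    trailer = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="diarize"\r\n\r\n'
        f"{flag}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")

    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return Request(url, data=file_header + audio + trailer, headers=headers, method="POST")
```

The returned request can be sent with `urllib.request.urlopen(req, timeout=300)`; the shape of the response is not specified here and should be checked against the vfun server.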

## Related

- [vfun](https://github.com/strolid/vfun) - GPU-accelerated transcription server
- [draft-howe-vcon-wtf-extension](https://datatracker.ietf.org/doc/html/draft-howe-vcon-wtf-extension) - IETF WTF specification
