| 1 | +# WTF Transcription Link (vfun Integration) |
| 2 | + |
| 3 | +A link that sends vCon audio dialogs to a vfun transcription server and adds the results as WTF (World Transcription Format) analysis entries. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +This link integrates with the vfun transcription server to provide: |
| 8 | +- Multi-language speech recognition (English + auto-detect) |
| 9 | +- Speaker diarization (who spoke when) |
| 10 | +- GPU-accelerated processing with CUDA |
| 11 | +- WTF-compliant output format per IETF draft-howe-vcon-wtf-extension-01 |
| 12 | + |
| 13 | +## Configuration |
| 14 | + |
| 15 | +```yaml |
| 16 | +wtf_transcribe: |
| 17 | +  module: links.wtf_transcribe |
| 18 | +  options: |
| 19 | +    # Required: URL of the vfun transcription server |
| 20 | +    vfun-server-url: http://localhost:8443/transcribe |
| 21 | + |
| 22 | +    # Optional: Enable speaker diarization (default: true) |
| 23 | +    diarize: true |
| 24 | + |
| 25 | +    # Optional: Request timeout in seconds (default: 300) |
| 26 | +    timeout: 300 |
| 27 | + |
| 28 | +    # Optional: Minimum dialog duration to transcribe in seconds (default: 5) |
| 29 | +    min-duration: 5 |
| 30 | + |
| 31 | +    # Optional: API key for vfun server authentication |
| 32 | +    api-key: your-api-key-here |
| 33 | +``` |
| 34 | + |
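
The defaults listed in the configuration comments can be applied with a small merge step. The following is a minimal sketch, not the link's actual code: `DEFAULT_OPTIONS` and `resolve_options` are hypothetical names, and only the option keys and default values come from the configuration above.

```python
# Hypothetical defaults mirroring the documented option defaults.
DEFAULT_OPTIONS = {
    "vfun-server-url": None,  # required, so there is no usable default
    "diarize": True,
    "timeout": 300,           # seconds
    "min-duration": 5,        # seconds
    "api-key": None,          # optional
}


def resolve_options(opts=None):
    """Merge user-supplied options over the defaults and check the required key."""
    merged = {**DEFAULT_OPTIONS, **(opts or {})}
    if not merged["vfun-server-url"]:
        raise ValueError("vfun-server-url is required")
    return merged
```

With this sketch, passing only `vfun-server-url` yields a dictionary in which `diarize`, `timeout`, and `min-duration` take their documented defaults.
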
| 35 | +## How It Works |
| 36 | + |
| 37 | +1. **Extract Audio**: Reads audio from vCon dialog (supports `body` with base64/base64url encoding, or `url` with file:// or http:// references) |
| 38 | +2. **Send to vfun**: POSTs audio file to vfun's `/transcribe` endpoint |
| 39 | +3. **Create WTF Analysis**: Formats the transcription result as a WTF analysis entry |
| 40 | +4. **Update vCon**: Adds the WTF analysis to the vCon and stores it back to Redis |
| 41 | + |
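
The audio extraction step can be sketched as follows. This is an illustration under stated assumptions, not the link's implementation: `decode_body` and `extract_audio` are hypothetical helpers, the dialog is assumed to carry `body`/`encoding` or `url` fields, and `http://` URLs are left out because fetching them would need a network call.

```python
import base64
from urllib.parse import urlparse
from urllib.request import url2pathname


def decode_body(body: str, encoding: str) -> bytes:
    """Decode an inline dialog body that is base64 or base64url encoded."""
    if encoding == "base64url":
        # base64url payloads often omit padding; restore it before decoding.
        padded = body + "=" * (-len(body) % 4)
        return base64.urlsafe_b64decode(padded)
    if encoding == "base64":
        return base64.b64decode(body)
    raise ValueError(f"unsupported encoding: {encoding!r}")


def extract_audio(dialog: dict) -> bytes:
    """Return raw audio bytes from an inline body or a local file:// URL."""
    if dialog.get("body"):
        return decode_body(dialog["body"], dialog.get("encoding"))
    parsed = urlparse(dialog.get("url", ""))
    if parsed.scheme == "file":
        with open(url2pathname(parsed.path), "rb") as fh:
            return fh.read()
    # http:// references would be downloaded here; omitted in this sketch.
    raise ValueError("dialog has no inline body or file:// URL")
```
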
| 42 | +## Output Format |
| 43 | + |
| 44 | +The link adds each transcription as an analysis entry in WTF format: |
| 45 | + |
| 46 | +```json |
| 47 | +{ |
| 48 | +  "type": "wtf_transcription", |
| 49 | +  "dialog": 0, |
| 50 | +  "mediatype": "application/json", |
| 51 | +  "vendor": "vfun", |
| 52 | +  "product": "parakeet-tdt-110m", |
| 53 | +  "schema": "wtf-1.0", |
| 54 | +  "encoding": "json", |
| 55 | +  "body": { |
| 56 | +    "transcript": { |
| 57 | +      "text": "Hello, how can I help you today?", |
| 58 | +      "language": "en-US", |
| 59 | +      "duration": 30.0, |
| 60 | +      "confidence": 0.95 |
| 61 | +    }, |
| 62 | +    "segments": [ |
| 63 | +      { |
| 64 | +        "id": 0, |
| 65 | +        "start": 0.0, |
| 66 | +        "end": 3.5, |
| 67 | +        "text": "Hello, how can I help you today?", |
| 68 | +        "confidence": 0.95, |
| 69 | +        "speaker": 0 |
| 70 | +      } |
| 71 | +    ], |
| 72 | +    "metadata": { |
| 73 | +      "created_at": "2024-01-15T10:30:00Z", |
| 74 | +      "processed_at": "2024-01-15T10:30:05Z", |
| 75 | +      "provider": "vfun", |
| 76 | +      "model": "parakeet-tdt-110m" |
| 77 | +    }, |
| 78 | +    "speakers": { |
| 79 | +      "0": { |
| 80 | +        "id": 0, |
| 81 | +        "label": "Speaker 0", |
| 82 | +        "segments": [0], |
| 83 | +        "total_time": 15.2 |
| 84 | +      } |
| 85 | +    }, |
| 86 | +    "quality": { |
| 87 | +      "average_confidence": 0.95, |
| 88 | +      "multiple_speakers": true, |
| 89 | +      "low_confidence_words": 0 |
| 90 | +    } |
| 91 | +  } |
| 92 | +} |
| 93 | +``` |
| 94 | + |
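
To show how the `speakers` and `quality` blocks relate to `segments`, here is a rough sketch that derives both from a segment list. The helper name `summarize_segments` is hypothetical, and the aggregation rules (summing `end - start` per speaker, averaging segment confidences) are assumptions about how such fields could be computed, not a description of what vfun or the link does.

```python
def summarize_segments(segments):
    """Derive per-speaker totals and a quality summary from WTF-style segments."""
    speakers = {}
    for seg in segments:
        spk = seg.get("speaker")
        if spk is None:
            continue  # segment without diarization info
        entry = speakers.setdefault(
            str(spk),
            {"id": spk, "label": f"Speaker {spk}", "segments": [], "total_time": 0.0},
        )
        entry["segments"].append(seg["id"])
        entry["total_time"] += seg["end"] - seg["start"]

    confidences = [seg["confidence"] for seg in segments if "confidence" in seg]
    quality = {
        # Assumed rule: plain mean of the segment confidences.
        "average_confidence": sum(confidences) / len(confidences) if confidences else None,
        "multiple_speakers": len(speakers) > 1,
    }
    return speakers, quality
```
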
| 95 | +## Behavior |
| 96 | + |
| 97 | +- **Skips non-recording dialogs**: Only processes dialogs with `type: "recording"` |
| 98 | +- **Skips already transcribed**: Dialogs with existing WTF transcriptions are skipped |
| 99 | +- **Duration filtering**: Dialogs shorter than `min-duration` are skipped |
| 100 | +- **File URL support**: Can read audio from local `file://` URLs directly |
| 101 | + |
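
The skip rules above can be expressed as a single predicate. This is a sketch under assumptions: `should_transcribe` is a hypothetical name, an existing transcription is detected by an analysis entry with `type: "wtf_transcription"` pointing at the dialog index, and a dialog with no `duration` field is not filtered out.

```python
def should_transcribe(index, dialog, analyses, min_duration=5):
    """Return True if the dialog at `index` should be sent for transcription."""
    # Only recordings carry audio to transcribe.
    if dialog.get("type") != "recording":
        return False
    # Skip dialogs that already have a WTF transcription attached.
    for analysis in analyses:
        if analysis.get("type") == "wtf_transcription" and analysis.get("dialog") == index:
            return False
    # Assumption: a missing duration means the length is unknown, so do not skip.
    duration = dialog.get("duration")
    if duration is not None and float(duration) < min_duration:
        return False
    return True
```
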
| 102 | +## Example Chain Configuration |
| 103 | + |
| 104 | +```yaml |
| 105 | +chains: |
| 106 | +  transcription_chain: |
| 107 | +    links: |
| 108 | +      - tag |
| 109 | +      - wtf_transcribe |
| 110 | +      - supabase_webhook |
| 111 | +    ingress_lists: |
| 112 | +      - transcribe |
| 113 | +    egress_lists: |
| 114 | +      - transcribed |
| 115 | +    enabled: 1 |
| 116 | +``` |
| 117 | + |
| 118 | +## vfun Server |
| 119 | + |
| 120 | +The vfun server provides GPU-accelerated transcription: |
| 121 | + |
| 122 | +```bash |
| 123 | +# Start vfun server |
| 124 | +cd /path/to/vfun |
| 125 | +./vfun server |
| 126 | + |
| 127 | +# Test health |
| 128 | +curl http://localhost:8443/ping |
| 129 | + |
| 130 | +# Manual transcription test |
| 131 | +curl -X POST http://localhost:8443/transcribe \ |
| 132 | +  -H "Authorization: Bearer YOUR_API_KEY" \ |
| 133 | +  -F "file=@audio.wav" \ |
| 134 | +  -F "diarize=true" |
| 135 | +``` |
| 136 | + |
| 137 | +## Related |
| 138 | + |
| 139 | +- [vfun](https://github.com/strolid/vfun) - GPU-accelerated transcription server |
| 140 | +- [draft-howe-vcon-wtf-extension](https://datatracker.ietf.org/doc/html/draft-howe-vcon-wtf-extension) - IETF WTF specification |