- Model: Qwen3-8B-Q4_K_M (4.7 GB GGUF, Q4_K quantization)
- Server: llama-server (custom Vulkan build, ~/bin/llama-server)
- Backend: Vulkan (AMD Radeon 890M, gfx1150)
- Context: 4096 tokens
- System prompt:
/no_thinkto suppress verbose reasoning - Temperature: 0.1
| Category | Score | Target | Status |
|---|---|---|---|
| Simple | 92.0% | 80% | PASS |
| Multiple | 94.0% | 60% | PASS |
- 3x
x^2vsx**2notation mismatch (correct answer, formatting difference) - 1x
operating_hours: 11vs23(12h vs 24h format)
- 1x wrong function selected (database.create_backup instead of database.modify_columns)
- 1x
3x+2vs3x + 2whitespace mismatch - 1x
3x^2vs3x**2notation mismatch
| Category | Score | Target | Status | Margin |
|---|---|---|---|---|
| Simple | 95.8% | 80% | PASS | +15.8 |
| Multiple | 93.5% | 60% | PASS | +33.5 |
- 3x
x^2vsx**2notation (correct answer, Python vs math notation) - 3x semantic near-misses (e.g.,
human cellvshuman,bananavsbananas) - 3x value interpretation (e.g., 12h vs 24h time,
renewablevssolar) - 2x complex nested structures (array-of-objects conditions)
- 6x other minor mismatches
- 2x wrong function chosen (database.create_backup, random_forest_regression)
- 2x
x^2vsx**2notation - 4x parameter value mismatches (e.g.,
quarterlyvsannually, extra instruments) - 1x missing required param (n_rolls)
- 1x no tool call generated
- 3x other minor mismatches
- Avg time per case: 3.1s simple, 3.8s multiple (with /no_think)
- Total eval time: 1241s simple + 755s multiple = ~33 minutes
- Token generation: ~18 t/s on Vulkan
- Prompt processing: ~238 t/s on Vulkan
- Zero server errors in full run
/no_thinksystem prompt eliminates reasoning overhead (155 -> 45 tokens per call)- BFCL type conversion: dict->object, float->number, tuple->array, any->string
- Nested ground-truth dict matching for complex parameter values