## Problem
When benchmarking reasoning/thinking models (e.g., QwQ, DeepSeek-R1), llmnop reports:
- 0 output tokens
- 0 throughput
- Misleading TTFT (appears instant or missing)
Example output with a reasoning model:
```text
number_output_tokens
    mean = 0
request_output_throughput_token_per_s
    mean = 0
ttft_s
    mean = 0.001   # Suspiciously fast
```
## Root Cause

In `src/benchmark.rs`, we only check `delta.content`:
```rust
if let Some(content) = choice.delta.content {
    if !content.is_empty() {
        chunk_arrivals.push((Instant::now(), content.clone()));
        generated_text.push_str(&content);
    }
}
```
Reasoning models stream their thinking process via different fields depending on the inference server:
- `choices[].delta.reasoning_content` (vLLM)
- `choices[].delta.reasoning` (Ollama, LM Studio)
Since we ignore these fields, all reasoning tokens are missed. If the model does extended thinking before producing content, we miss that latency window entirely.
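For context, here is a sketch of what such a chunk body looks like on the wire during the thinking phase. The field names match the list above; the values and surrounding structure are illustrative, not captured server output:

```rust
use serde_json::json;

fn main() {
    // Illustrative vLLM-style delta: while the model is thinking, tokens
    // arrive in reasoning_content and content stays null.
    let chunk = json!({
        "choices": [{
            "index": 0,
            "delta": {
                "content": null,
                "reasoning_content": "First, consider..."
            }
        }]
    });

    // The current parser reads only delta.content, so a chunk like this
    // counts as zero tokens even though text arrived.
    assert!(chunk["choices"][0]["delta"]["content"].is_null());
}
```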
## Caveat

The `async-openai` crate does not expose `reasoning_content` or `reasoning` fields on `ChatCompletionStreamResponseDelta`. The maintainers have declined PRs to add these fields, as the crate targets the official OpenAI API.
See: PR #418 (labeled "out of scope")
## Recommended Approach: BYOT

The async-openai maintainer recommended:

> "You can achieve this with combination of byot feature and Serde struct flattening to re-use/expand existing types."

The `byot` (Bring Your Own Types) feature enables `*_byot()` methods that accept custom request/response types. See: BYOT documentation
### Implementation

- Enable the `byot` feature in `Cargo.toml`:

```toml
async-openai = { version = "0.30", features = ["byot"] }
```
- Define custom streaming types with reasoning fields:

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct StreamDelta {
    content: Option<String>,
    reasoning_content: Option<String>, // vLLM
    reasoning: Option<String>,         // Ollama, LM Studio
}

#[derive(Debug, Deserialize)]
struct StreamChoice {
    delta: StreamDelta,
}

#[derive(Debug, Deserialize)]
struct StreamChunk {
    choices: Vec<StreamChoice>,
}
```
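A quick sanity check that these types pick up both variants (a sketch; the JSON bodies are hypothetical but use the field names listed under Root Cause):

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn parses_both_reasoning_variants() {
        // vLLM-style field
        let vllm = r#"{"choices":[{"delta":{"reasoning_content":"hmm"}}]}"#;
        let chunk: StreamChunk = serde_json::from_str(vllm).unwrap();
        assert_eq!(chunk.choices[0].delta.reasoning_content.as_deref(), Some("hmm"));

        // Ollama / LM Studio-style field
        let ollama = r#"{"choices":[{"delta":{"reasoning":"hmm"}}]}"#;
        let chunk: StreamChunk = serde_json::from_str(ollama).unwrap();
        assert_eq!(chunk.choices[0].delta.reasoning.as_deref(), Some("hmm"));
    }
}
```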
- Use `create_stream_byot()` instead of `create_stream()`:

```rust
let stream: Pin<Box<dyn Stream<Item = Result<StreamChunk, OpenAIError>> + Send>> =
    client.chat().create_stream_byot(request).await?;
```
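The stream is then consumed the same way as the regular one; a sketch of the loop shape (the per-delta handling is spelled out in step 2 of the Solution below):

```rust
use futures::StreamExt;

while let Some(result) = stream.next().await {
    let chunk: StreamChunk = result?;
    for choice in chunk.choices {
        let delta = choice.delta;
        // Handle delta.content / delta.reasoning_content / delta.reasoning here;
        // see "2. Parse both content and reasoning" below.
    }
}
```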
## Solution

### 1. Track arrivals separately

```rust
let mut content_arrivals: Vec<(Instant, String)> = Vec::new();
let mut reasoning_arrivals: Vec<(Instant, String)> = Vec::new();
let mut generated_text = String::new();
let mut reasoning_text = String::new();
```
### 2. Parse both content and reasoning

```rust
let content = delta.content.as_deref().unwrap_or("");
let reasoning = delta
    .reasoning_content
    .as_deref()
    .or(delta.reasoning.as_deref())
    .unwrap_or("");

let now = Instant::now();
if !reasoning.is_empty() {
    reasoning_arrivals.push((now, reasoning.to_string()));
    reasoning_text.push_str(reasoning);
}
if !content.is_empty() {
    content_arrivals.push((now, content.to_string()));
    generated_text.push_str(content);
}
```
### 3. Add new metrics

- **TTFT (Time to First Token)**: time to the first token of ANY kind, including reasoning. Uses `min(content_arrivals[0], reasoning_arrivals[0])` if both exist.
- **TTFO (Time to First Output Token)**: time to the first NON-reasoning token. Uses `content_arrivals[0]` only. For non-reasoning models, TTFO = TTFT. (Both derivations are sketched after this list.)
- **Reasoning token count**: number of reasoning tokens generated. Tokenize `reasoning_text` separately from `generated_text`.
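A minimal sketch of both derivations, assuming a `start: Instant` captured when the request was sent (not the exact llmnop code):

```rust
use std::time::{Duration, Instant};

let first_content = content_arrivals.first().map(|(t, _)| *t);
let first_reasoning = reasoning_arrivals.first().map(|(t, _)| *t);

// TTFT: earliest token of any kind.
let ttft: Option<Duration> = match (first_content, first_reasoning) {
    (Some(c), Some(r)) => Some(c.min(r)),
    (c, r) => c.or(r),
}
.map(|t| t.duration_since(start));

// TTFO: first content token only; None if the run produced no content.
let ttfo: Option<Duration> = first_content.map(|t| t.duration_since(start));
```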
### 4. Update `BenchmarkResult`

```rust
pub struct BenchmarkResult {
    pub ttft: Duration,         // First token (any kind)
    pub ttfo: Option<Duration>, // First content token (None if no content)
    pub total_latency: Duration,
    pub throughput: f64,
    pub input_tokens: u32,
    pub output_tokens: u32,     // Content tokens only
    pub reasoning_tokens: u32,  // Reasoning tokens (new)
    pub inter_token_latency_s: f64,
    pub total_tokens: u32,      // input + output + reasoning
}
```
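One consequence worth making explicit: throughput should count reasoning tokens too, since the server generated them (a sketch, using the fields above):

```rust
// Reasoning tokens are real generated tokens, so include them in throughput.
let generated_tokens = output_tokens + reasoning_tokens;
let throughput = f64::from(generated_tokens) / total_latency.as_secs_f64();
let total_tokens = input_tokens + generated_tokens;
```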
### 5. Update output

Add new metrics to console and JSON output:

- `time_to_first_output_token` (TTFO) with percentiles
- `reasoning_tokens` with percentiles
- Update `total_tokens` to include reasoning

For non-reasoning models:

- TTFO = TTFT (or omit TTFO entirely)
- `reasoning_tokens` = 0 or null
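One way to keep the JSON output backward-compatible is to make the new fields optional and skip them when absent. A sketch, assuming serde handles serialization; the `_s` suffix mirrors the existing `ttft_s` naming and is an assumption:

```rust
use serde::Serialize;

#[derive(Serialize)]
struct MetricsJson {
    ttft_s: f64,
    // TTFO; omitted from the JSON when the run produced no content.
    #[serde(skip_serializing_if = "Option::is_none")]
    time_to_first_output_token_s: Option<f64>,
    // None (serialized as absent) for non-reasoning models, per the note above.
    #[serde(skip_serializing_if = "Option::is_none")]
    reasoning_tokens: Option<u32>,
    // ...existing fields unchanged
}
```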
## Files to Modify

- `Cargo.toml` - add the `byot` feature to `async-openai`
- `src/client.rs` - use `create_stream_byot()` with custom types
- `src/benchmark.rs` - track content vs. reasoning arrivals separately; compute TTFT/TTFO
- `src/output.rs` - add TTFO and reasoning-token output fields
## Testing

- Test with a non-reasoning model (e.g., Llama 3.1, Gemma 3) - should work as before
- Test with a reasoning model (e.g., QwQ, DeepSeek-R1) - should now show:
  - Non-zero reasoning tokens
  - TTFT reflecting the first reasoning token
  - TTFO reflecting the first content token
  - Correct throughput based on total tokens generated