
fix(llm): stream_options.include_usage so cost tracking actually fires#119

Merged
rdwj merged 1 commit into
mainfrom
fix/stream-include-usage-118
Apr 28, 2026
Conversation


@rdwj rdwj commented Apr 28, 2026

Summary

LLMClient.call_model_stream_raw was never asking vLLM for the terminal usage chunk, so OpenAIChatServer._persist_cost_data (added in #117) returned early on every real call and the cost_data accumulators stayed empty in deployed agents. This PR sets stream_options={"include_usage": True} by default, using setdefault semantics so callers can still opt out.

Surfaced during the cluster smoke for #116. Closes #118.
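For reference, a minimal sketch of the change. The `_forward` helper and the surrounding signature are stand-ins, not the real fipsagents code; only the setdefault behavior mirrors this PR:

```python
def call_model_stream_raw(model, messages, **kwargs):
    # Default the terminal usage chunk on; setdefault leaves any
    # caller-supplied stream_options untouched, so opting out still works.
    kwargs.setdefault("stream_options", {"include_usage": True})
    return _forward(model=model, messages=messages, stream=True, **kwargs)


def _forward(**request):
    # Stand-in for the underlying OpenAI-compatible streaming call;
    # returns the request dict so the defaulting is easy to inspect.
    return request
```

With this in place, a plain call carries `stream_options={"include_usage": True}`, while a caller passing `stream_options={"include_usage": False}` keeps their value unchanged.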

Test plan

  • Two regression tests in test_llm.py — default-on and caller-override
  • Full unit suite: 761 passing
  • Re-run cluster smoke after 0.14.1 lands (this PR's purpose)
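The two regression tests might look roughly like this. This is a sketch only; the real test_llm.py fixtures are not shown here, so the `stream_call` stub below is an assumption standing in for the client's kwarg handling:

```python
def stream_call(**kwargs):
    # Stub mimicking call_model_stream_raw's stream_options defaulting.
    kwargs.setdefault("stream_options", {"include_usage": True})
    return kwargs


def test_include_usage_defaults_on():
    # Default-on case: no stream_options supplied by the caller.
    assert stream_call()["stream_options"] == {"include_usage": True}


def test_caller_can_opt_out():
    # Caller-override case: an explicit value must win over the default.
    out = stream_call(stream_options={"include_usage": False})
    assert out["stream_options"] == {"include_usage": False}
```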

Bumps fipsagents to 0.14.1.

Closes #118.

Assisted-by: Claude Code (Opus 4.7)

call_model_stream_raw now defaults stream_options to
{"include_usage": True}. Without it, vLLM (and any OpenAI-compatible
server) never emits a terminal usage chunk on streaming responses,
which means StreamMetrics.prompt_tokens / completion_tokens stay None
and OpenAIChatServer._persist_cost_data returns early -- so cost_data
accumulators stayed empty in production despite a fully green unit
suite.

Surfaced during the cluster smoke for #116. setdefault semantics mean
callers can opt out by passing stream_options={"include_usage": False}.

Bumps fipsagents to 0.14.1.

Closes #118.

Assisted-by: Claude Code (Opus 4.7)


Development

Successfully merging this pull request may close these issues.

bug: call_model_stream_raw never requests stream_options.include_usage; cost tracking silent in production
