Skip to content

Streaming response consumption validates full event union per event on the loop, even for get_final_message() #1649

Description

@charles-dyfis-net

await stream.get_final_message() runs the full per-event pipeline synchronously on the loop: for every wire event it validates the whole RawMessageStreamEvent union, folds the delta into a snapshot, and builds a higher-level event object to yield. ~30-55 µs/event (in part on whether asyncio debug is enabled); on a real batch run that added up to ~100s of event-loop time for stream consumption alone, and a large single response is a few-hundred-ms stall. Much of that is avoidable — validating the whole union per event costs ~60% more than validating just the matched variant, and the event objects are discarded when the caller only wants the final Message.

Compared to performance ticket #1195 (on the request side), this is significantly more severe: On the (py-spy-sampled) run from which this ticket's numbers are taken it was ~50x the larger cost (request-side _transform_recursive is 1.4s, vs ~70s for consumption).

Environment

anthropic 0.105.2, also reproduced on 0.80.0. Python 3.14, pydantic 2.13.x.

Where the time goes

Measuring current timing and proposed fixes, see repro.py — real messages.stream() path against an in-process httpx mock transport, no network or spend.

Self-time from a py-spy profile of a real service consuming messages.stream() across many concurrent calls, summed across the batch (arm64, asyncio debug enabled):

function self s role
accumulate_event 18.2 builds the snapshot per delta; needed for the final message
construct_type 8.0 validates the wire-event union per event
build_events 2.9 builds a per-delta event object every delta
is_union + is_type_alias_type + is_annotated_type ~6.2 per-event type introspection

Plus 31.2s of pydantic self-time called by the Anthropic client, 26.7s of it under construct_type below — ~100s of loop time total for stream consumption.

The union decode is the largest leg — ~45s of the ~100s — and the one our production profiler flagged (construct_typevalidate_python under get_final_message()). construct_type (_models.py:609) validates the whole RawMessageStreamEvent union per event; selecting the variant by its type and validating just that returns an identical object for ~60% less (benchmarked, repro.py). The SDK's own discriminator fallback (:629) doesn't capture this — it routes to the unvalidated .construct() path, which measures ~2x slower. The per-call typing introspection (is_union/get_args, ~1.2 µs/event; the pydantic TypeAdapter is already lru_cached) is a separate, independently memoizable slice. Every consumer pays this leg, iteration or final-message-only.

build_events (_messages.py:284) is smaller but plainer waste: it runs for every event inside __stream__, which get_final_message() drains through, building a TextEvent/InputJsonEvent/etc. per delta that a final-message-only caller never reads.

accumulate_event is the one genuinely required leg — a caller that wants the final message has to fold in each delta — though it carries the latent tail below.

Magnitude

The consumer logged content_block_delta count per response — the n cost scales in, server-chosen, not the output-token count. 1,880 responses:

statistic deltas per response
min / median / mean 1 / 476 / 990
p90 / p95 / p99 2699 / 3359 / 4183
max 5671

1.86M deltas over the ~100s above is ≈54 µs/event on that host; the max response (5,671 deltas) lands near 0.31s, matching the production stalls that first flagged this.

Super-linear tail (latent)

accumulate_event has two O(n²)-in-block-length shapes. The attribute-target +=content.text += delta (:468) and the identical content.thinking += delta (:491, often the longest block) — doesn't get CPython's in-place concat optimization (local targets only), so each delta reallocs the buffer; the input_json_delta branch (:477,:480) does the same with json_buf += bytes(...) and then re-parses the whole buffer with from_json(json_buf, partial_mode=True) every delta. Paired micro-benchmarks against linear controls confirm both (local concat slope 1.0, attribute-target 2.05). Crossover with the linear floor is ~50–80k events and the observed max is 5,671, so this isn't biting today — worth fixing because it pads the per-event constant, not for the asymptote.

Fixes

  1. Decode by validating the discriminated variant rather than the whole union — ~60% of the decode leg, benchmarked, same returned object. Validate the variant, don't .construct() it: the SDK's current discriminator fallback constructs, measuring ~2x slower. Memoize the per-call typing introspection as a separate, smaller win.
  2. Skip or only lazily run build_events (keeping accumulate_event) when draining for the final message only. The per-delta events it builds are entirely unused on that path.
  3. Accumulate text and tool-input into a buffer joined/parsed once instead of per delta — safe only on the final-message drain; on the iterate path it must be a lazy value computed on access, not deferred to content_block_stop, or per-delta snapshot/.input reads regress. Drops the latent tail and trims the constant.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions