await stream.get_final_message() runs the full per-event pipeline synchronously on the loop: for every wire event it validates the whole RawMessageStreamEvent union, folds the delta into a snapshot, and builds a higher-level event object to yield. ~30-55 µs/event (in part on whether asyncio debug is enabled); on a real batch run that added up to ~100s of event-loop time for stream consumption alone, and a large single response is a few-hundred-ms stall. Much of that is avoidable — validating the whole union per event costs ~60% more than validating just the matched variant, and the event objects are discarded when the caller only wants the final Message.
Compared to performance ticket #1195 (on the request side), this is significantly more severe: On the (py-spy-sampled) run from which this ticket's numbers are taken it was ~50x the larger cost (request-side _transform_recursive is 1.4s, vs ~70s for consumption).
Environment
anthropic 0.105.2, also reproduced on 0.80.0. Python 3.14, pydantic 2.13.x.
Where the time goes
Measuring current timing and proposed fixes, see repro.py — real messages.stream() path against an in-process httpx mock transport, no network or spend.
Self-time from a py-spy profile of a real service consuming messages.stream() across many concurrent calls, summed across the batch (arm64, asyncio debug enabled):
| function |
self s |
role |
accumulate_event |
18.2 |
builds the snapshot per delta; needed for the final message |
construct_type |
8.0 |
validates the wire-event union per event |
build_events |
2.9 |
builds a per-delta event object every delta |
is_union + is_type_alias_type + is_annotated_type |
~6.2 |
per-event type introspection |
Plus 31.2s of pydantic self-time called by the Anthropic client, 26.7s of it under construct_type below — ~100s of loop time total for stream consumption.
The union decode is the largest leg — ~45s of the ~100s — and the one our production profiler flagged (construct_type → validate_python under get_final_message()). construct_type (_models.py:609) validates the whole RawMessageStreamEvent union per event; selecting the variant by its type and validating just that returns an identical object for ~60% less (benchmarked, repro.py). The SDK's own discriminator fallback (:629) doesn't capture this — it routes to the unvalidated .construct() path, which measures ~2x slower. The per-call typing introspection (is_union/get_args, ~1.2 µs/event; the pydantic TypeAdapter is already lru_cached) is a separate, independently memoizable slice. Every consumer pays this leg, iteration or final-message-only.
build_events (_messages.py:284) is smaller but plainer waste: it runs for every event inside __stream__, which get_final_message() drains through, building a TextEvent/InputJsonEvent/etc. per delta that a final-message-only caller never reads.
accumulate_event is the one genuinely required leg — a caller that wants the final message has to fold in each delta — though it carries the latent tail below.
Magnitude
The consumer logged content_block_delta count per response — the n cost scales in, server-chosen, not the output-token count. 1,880 responses:
| statistic |
deltas per response |
| min / median / mean |
1 / 476 / 990 |
| p90 / p95 / p99 |
2699 / 3359 / 4183 |
| max |
5671 |
1.86M deltas over the ~100s above is ≈54 µs/event on that host; the max response (5,671 deltas) lands near 0.31s, matching the production stalls that first flagged this.
Super-linear tail (latent)
accumulate_event has two O(n²)-in-block-length shapes. The attribute-target += — content.text += delta (:468) and the identical content.thinking += delta (:491, often the longest block) — doesn't get CPython's in-place concat optimization (local targets only), so each delta reallocs the buffer; the input_json_delta branch (:477,:480) does the same with json_buf += bytes(...) and then re-parses the whole buffer with from_json(json_buf, partial_mode=True) every delta. Paired micro-benchmarks against linear controls confirm both (local concat slope 1.0, attribute-target 2.05). Crossover with the linear floor is ~50–80k events and the observed max is 5,671, so this isn't biting today — worth fixing because it pads the per-event constant, not for the asymptote.
Fixes
- Decode by validating the discriminated variant rather than the whole union — ~60% of the decode leg, benchmarked, same returned object. Validate the variant, don't
.construct() it: the SDK's current discriminator fallback constructs, measuring ~2x slower. Memoize the per-call typing introspection as a separate, smaller win.
- Skip or only lazily run
build_events (keeping accumulate_event) when draining for the final message only. The per-delta events it builds are entirely unused on that path.
- Accumulate text and tool-input into a buffer joined/parsed once instead of per delta — safe only on the final-message drain; on the iterate path it must be a lazy value computed on access, not deferred to
content_block_stop, or per-delta snapshot/.input reads regress. Drops the latent tail and trims the constant.
await stream.get_final_message()runs the full per-event pipeline synchronously on the loop: for every wire event it validates the wholeRawMessageStreamEventunion, folds the delta into a snapshot, and builds a higher-level event object to yield. ~30-55 µs/event (in part on whether asyncio debug is enabled); on a real batch run that added up to ~100s of event-loop time for stream consumption alone, and a large single response is a few-hundred-ms stall. Much of that is avoidable — validating the whole union per event costs ~60% more than validating just the matched variant, and the event objects are discarded when the caller only wants the finalMessage.Compared to performance ticket #1195 (on the request side), this is significantly more severe: On the (py-spy-sampled) run from which this ticket's numbers are taken it was ~50x the larger cost (request-side
_transform_recursiveis 1.4s, vs ~70s for consumption).Environment
anthropic0.105.2, also reproduced on 0.80.0. Python 3.14, pydantic 2.13.x.Where the time goes
Measuring current timing and proposed fixes, see repro.py — real
messages.stream()path against an in-process httpx mock transport, no network or spend.Self-time from a py-spy profile of a real service consuming
messages.stream()across many concurrent calls, summed across the batch (arm64, asyncio debug enabled):accumulate_eventconstruct_typebuild_eventsis_union+is_type_alias_type+is_annotated_typePlus 31.2s of pydantic self-time called by the Anthropic client, 26.7s of it under
construct_typebelow — ~100s of loop time total for stream consumption.The union decode is the largest leg — ~45s of the ~100s — and the one our production profiler flagged (
construct_type→validate_pythonunderget_final_message()).construct_type(_models.py:609) validates the wholeRawMessageStreamEventunion per event; selecting the variant by itstypeand validating just that returns an identical object for ~60% less (benchmarked,repro.py). The SDK's own discriminator fallback (:629) doesn't capture this — it routes to the unvalidated.construct()path, which measures ~2x slower. The per-call typing introspection (is_union/get_args, ~1.2 µs/event; the pydanticTypeAdapteris alreadylru_cached) is a separate, independently memoizable slice. Every consumer pays this leg, iteration or final-message-only.build_events(_messages.py:284) is smaller but plainer waste: it runs for every event inside__stream__, whichget_final_message()drains through, building aTextEvent/InputJsonEvent/etc. per delta that a final-message-only caller never reads.accumulate_eventis the one genuinely required leg — a caller that wants the final message has to fold in each delta — though it carries the latent tail below.Magnitude
The consumer logged
content_block_deltacount per response — thencost scales in, server-chosen, not the output-token count. 1,880 responses:1.86M deltas over the ~100s above is ≈54 µs/event on that host; the max response (5,671 deltas) lands near 0.31s, matching the production stalls that first flagged this.
Super-linear tail (latent)
accumulate_eventhas two O(n²)-in-block-length shapes. The attribute-target+=—content.text += delta(:468) and the identicalcontent.thinking += delta(:491, often the longest block) — doesn't get CPython's in-place concat optimization (local targets only), so each delta reallocs the buffer; theinput_json_deltabranch (:477,:480) does the same withjson_buf += bytes(...)and then re-parses the whole buffer withfrom_json(json_buf, partial_mode=True)every delta. Paired micro-benchmarks against linear controls confirm both (local concat slope 1.0, attribute-target 2.05). Crossover with the linear floor is ~50–80k events and the observed max is 5,671, so this isn't biting today — worth fixing because it pads the per-event constant, not for the asymptote.Fixes
.construct()it: the SDK's current discriminator fallback constructs, measuring ~2x slower. Memoize the per-call typing introspection as a separate, smaller win.build_events(keepingaccumulate_event) when draining for the final message only. The per-delta events it builds are entirely unused on that path.content_block_stop, or per-deltasnapshot/.inputreads regress. Drops the latent tail and trims the constant.