Python: fix: prevent colliding message_id values in incremental compaction#5290
Python: fix: prevent colliding message_id values in incremental compaction#5290LEDazzio01 wants to merge 2 commits intomicrosoft:mainfrom
Conversation
Move _ensure_message_ids() from group_messages() (which operates on a slice) to annotate_message_groups() (which has access to the full list). Previously, group_messages received messages[start_index:] — a slice that always starts at index 0 — so _ensure_message_ids would assign msg_0, msg_1, etc. on every incremental call, colliding with IDs from earlier calls. This caused SummarizationStrategy to collapse distinct groups into one because _group_id_for derives group_id from message_id. By running _ensure_message_ids on the full list first, positional indices are globally unique across incremental annotation calls. Fixes microsoft#5237
There was a problem hiding this comment.
Pull request overview
Fixes a bug in Python compaction grouping where incremental (slice-based) annotation could assign colliding message_id values, breaking downstream group-ID-dependent logic (e.g., SummarizationStrategy).
Changes:
- Moves
_ensure_message_ids()invocation fromgroup_messages()(slice scope) toannotate_message_groups()(full-list scope) to prevent ID collisions across incremental calls. - Updates
group_messages()documentation to reflect new caller responsibility for ensuringmessage_idis set. - Adds a regression test that simulates incremental annotation over multiple turns and asserts
message_iduniqueness.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| python/packages/core/agent_framework/_compaction.py | Adjusts where message IDs are assigned to avoid collisions during incremental grouping/annotation. |
| python/packages/core/tests/core/test_compaction.py | Adds a regression test ensuring incremental annotation produces unique message_id values. |
| Note: | ||
| Callers must ensure ``message_id`` is set on each message before | ||
| calling this function. ``annotate_message_groups`` handles this | ||
| automatically via ``_ensure_message_ids``. |
There was a problem hiding this comment.
group_messages() no longer calls _ensure_message_ids(), which changes behavior for any callers that previously relied on group_messages() to populate missing message_id values. Since group_messages is exported from this module, consider preserving backward compatibility (e.g., keep ensuring IDs here with an offset-aware approach) or clearly documenting this as a breaking precondition change (and/or making group_messages private if it’s not intended for external use).
| # Assign message IDs over the *full* list before slicing to avoid | ||
| # index-based collisions across incremental calls (fixes #5237). | ||
| _ensure_message_ids(messages) |
There was a problem hiding this comment.
Calling _ensure_message_ids(messages) on the full list on every incremental annotate_message_groups() invocation makes ID assignment O(n) per call and can become O(n²) over a growing history. Consider making _ensure_message_ids accept an index offset (or otherwise assign IDs based on absolute position) so you can run it only on the suffix being re-annotated (messages[start_index:]) while still guaranteeing globally unique IDs across incremental calls.
| # Assign message IDs over the *full* list before slicing to avoid | |
| # index-based collisions across incremental calls (fixes #5237). | |
| _ensure_message_ids(messages) | |
| # Preserve globally unique, index-based message IDs without re-scanning | |
| # the entire history on every incremental suffix re-annotation. | |
| if start_index == 0: | |
| _ensure_message_ids(messages) | |
| else: | |
| for absolute_index, message in enumerate(messages[start_index:], start=start_index): | |
| if _message_id(message) is None: | |
| _write_message_id(message, str(absolute_index)) |
… ID assignment 1. Restore _ensure_message_ids() in group_messages() so external callers that relied on it keep working. Update docstring. 2. In annotate_message_groups(), replace full-list _ensure_message_ids() with offset-aware inline loop that only assigns IDs to the new suffix (messages[start_index:]) using enumerate(..., start=start_index). This keeps IDs globally unique while avoiding O(n²) re-scanning.
Fixes
#5237 —
_ensure_message_idsassigns colliding IDs when called incrementally viaannotate_message_groups, breakingSummarizationStrategyand other group-ID-dependent consumers.Root Cause
_ensure_message_idsassignsf"msg_{index}"whereindexis the position within the list it receives. However,annotate_message_groupscallsgroup_messages(messages[start_index:])on a slice, so the slice always starts at index 0. Successive incremental calls producemsg_0,msg_1, etc., colliding with IDs already assigned by earlier calls.The issue reporter's reproduction shows:
Fix
Move the
_ensure_message_ids(messages)call fromgroup_messages()(which operates on the slice) toannotate_message_groups()(which has access to the full list). By running it on the full list before slicing, positional indices are globally unique.Test
Added
test_incremental_annotation_produces_unique_message_idswhich simulates a 4-turn conversation with incremental annotation (matching theCompactionProviderusage pattern) and asserts allmessage_idvalues are unique.