
Add original example idx in input_metadata #346

Merged
mayinghan merged 6 commits into main from fix-stable-row-id-with-duplicate-content
Nov 25, 2025

Conversation

@mayinghan (Collaborator) commented Nov 25, 2025



Description

Please include a summary of the change and which issue is fixed or feature is implemented. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes # (issue)
Implements # (issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Refactoring/Code cleanup
  • Build/CI/CD related changes
  • Other (please describe):

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

  • Test A
  • Test B

Test Configuration:

  • Firmware version:
  • Hardware:
  • Toolchain:
  • SDK:

Checklist:

  • My code follows the style guidelines of this project (ran black ., isort ., flake8 .)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Screenshots (if applicable)

If applicable, add screenshots to help showcase your changes.

Additional context

Add any other context about the PR here.


Note

Adds data_loader_row_idx to row metadata, copies dataset_info from remote traces, and adds a test ensuring stable row IDs for identical-content rows.

  • Data Loading:
    • Add data_loader_row_idx to row.input_metadata.dataset_info in eval_protocol/data_loader/models.py.
    • Iterate with enumerate in _apply_metadata to attach per-row index.
  • Tracing:
    • Propagate input_metadata.dataset_info from remote rows in eval_protocol/pytest/tracing_utils.py during rollout update.
  • Tests:
    • Add tests/data_loader/test_data_loader_stable_row_id.py to verify stable/unique row_id across repeated generation with identical content.

Written by Cursor Bugbot for commit 93ebb15. This will update automatically on new commits.

@mayinghan mayinghan requested a review from benjibc November 25, 2025 06:17
@mayinghan mayinghan marked this pull request as ready for review November 25, 2025 06:17
# Apply row counts
row.input_metadata.dataset_info["data_loader_num_rows"] = original_count
row.input_metadata.dataset_info["data_loader_num_rows_after_preprocessing"] = processed_count
row.input_metadata.dataset_info["data_loader_row_idx"] = idx
Bug: Row index added after preprocessing instead of before

The data_loader_row_idx is enumerated from rows after preprocessing, but the PR aims to add the original example index. When preprocess_fn filters rows, the indices get renumbered (e.g., original rows 0, 2, 4 become indices 0, 1, 2), losing track of the original positions. To capture original indices, enumeration needs to happen before preprocessing in _process_variant and the index preserved through the preprocessing step.

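The suggested fix can be sketched roughly as follows. Note that `load_rows` and the simplified models are illustrative assumptions, not the actual `_process_variant` code: the point is only that enumeration happens before `preprocess_fn` runs, so filtering cannot renumber the surviving rows.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical minimal models; the real ones live in
# eval_protocol/data_loader/models.py.
@dataclass
class InputMetadata:
    dataset_info: dict[str, Any] = field(default_factory=dict)

@dataclass
class Row:
    content: str
    input_metadata: InputMetadata = field(default_factory=InputMetadata)

def load_rows(raw_rows: list[Row],
              preprocess_fn: Callable[[list[Row]], list[Row]]) -> list[Row]:
    # Tag each row with its original position BEFORE preprocessing, so a
    # filtering preprocess_fn cannot renumber the rows that survive.
    for idx, row in enumerate(raw_rows):
        row.input_metadata.dataset_info["data_loader_row_idx"] = idx
    return preprocess_fn(raw_rows)

# A filter that keeps every other row still reports the original indices.
rows = load_rows([Row("a"), Row("b"), Row("c")], lambda rs: rs[::2])
print([r.input_metadata.dataset_info["data_loader_row_idx"] for r in rows])  # -> [0, 2]
```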

row.messages = remote_row.messages
row.tools = remote_row.tools
row.input_metadata.session_data = remote_row.input_metadata.session_data
row.input_metadata.dataset_info = remote_row.input_metadata.dataset_info
Bug: Original row index lost when copying remote dataset info

Overwriting row.input_metadata.dataset_info wholesale with remote_row.input_metadata.dataset_info discards the original row's data_loader_row_idx. Because filter_longest_conversation preprocessing returns a single-element list, the remote row typically carries data_loader_row_idx=0, which clobbers the index that tracked the row's position in the source dataset. Any other original dataset metadata fields are lost the same way.

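One way to address this, sketched below with a hypothetical `merge_dataset_info` helper rather than the actual tracing code, is to merge the remote dict into the local one and let locally computed keys win on conflict:

```python
from typing import Any

def merge_dataset_info(local: dict[str, Any], remote: dict[str, Any]) -> dict[str, Any]:
    # Start from the remote metadata, then overlay the local dict so that
    # locally computed keys such as data_loader_row_idx are preserved.
    merged = dict(remote)
    merged.update(local)
    return merged

local_info = {"data_loader_row_idx": 7}
remote_info = {"data_loader_row_idx": 0, "trace_source": "remote"}
print(merge_dataset_info(local_info, remote_info))
# -> {'data_loader_row_idx': 7, 'trace_source': 'remote'}
```

The same effect can be written inline as `{**remote_info, **local_info}`; either way, the row keeps its original index while still gaining the remote trace's metadata.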

@dphuang2 (Collaborator)

can you make sure to add tests for whatever behavior you expect

@dphuang2 (Collaborator) left a comment

awesome, thanks! I think there are existing tests for the non-duplicated case so if tests pass then LGTM

@mayinghan mayinghan merged commit 361369e into main Nov 25, 2025
9 checks passed
@mayinghan mayinghan deleted the fix-stable-row-id-with-duplicate-content branch November 25, 2025 09:45