Add original example idx in input_metadata #346
# Apply row counts
row.input_metadata.dataset_info["data_loader_num_rows"] = original_count
row.input_metadata.dataset_info["data_loader_num_rows_after_preprocessing"] = processed_count
row.input_metadata.dataset_info["data_loader_row_idx"] = idx
Bug: Row index added after preprocessing instead of before
The data_loader_row_idx is enumerated from rows after preprocessing, but the PR aims to add the original example index. When preprocess_fn filters rows, the indices get renumbered (e.g., original rows 0, 2, 4 become indices 0, 1, 2), losing track of the original positions. To capture original indices, enumeration needs to happen before preprocessing in _process_variant and the index preserved through the preprocessing step.
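The renumbering described above can be reproduced with a minimal sketch. It uses plain dicts in place of the real row type, and the helpers `keep_even`, `index_after`, and `index_before` are hypothetical names for illustration, not functions from the codebase; only the `data_loader_row_idx` key comes from the PR. It also assumes the preprocessing step passes through any keys already attached to a row.

```python
from typing import Callable

Row = dict  # stand-in for the real row type (assumption)
PreprocessFn = Callable[[list[Row]], list[Row]]


def keep_even(rows: list[Row]) -> list[Row]:
    """Hypothetical preprocess_fn: drops rows whose value is odd."""
    return [r for r in rows if r["value"] % 2 == 0]


def index_after(rows: list[Row], preprocess_fn: PreprocessFn) -> list[Row]:
    """Behavior flagged in the review: enumerate the already-filtered rows."""
    processed = preprocess_fn([dict(r) for r in rows])
    for idx, row in enumerate(processed):
        row["data_loader_row_idx"] = idx
    return processed


def index_before(rows: list[Row], preprocess_fn: PreprocessFn) -> list[Row]:
    """Possible fix: tag each row with its source position, then filter."""
    tagged = [{**r, "data_loader_row_idx": idx} for idx, r in enumerate(rows)]
    return preprocess_fn(tagged)


rows = [{"value": v} for v in range(5)]
after = [r["data_loader_row_idx"] for r in index_after(rows, keep_even)]
before = [r["data_loader_row_idx"] for r in index_before(rows, keep_even)]
print(after)   # [0, 1, 2]  original positions are lost
print(before)  # [0, 2, 4]  original positions survive filtering
```

If a real preprocess_fn builds new row objects instead of returning the ones it was given, the index would need to be copied across explicitly.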
row.messages = remote_row.messages
row.tools = remote_row.tools
row.input_metadata.session_data = remote_row.input_metadata.session_data
row.input_metadata.dataset_info = remote_row.input_metadata.dataset_info
Bug: Original row index lost when copying remote dataset info
The complete overwriting of row.input_metadata.dataset_info with remote_row.input_metadata.dataset_info causes the original row's data_loader_row_idx to be lost. The remote row typically has data_loader_row_idx=0 after filter_longest_conversation preprocessing returns a single-element list, which overwrites the original index that tracked the row's position in the source dataset. This also loses any other original dataset metadata fields.
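One way to avoid the overwrite, if the local index is meant to survive, is to merge the two dicts and keep selected local keys. The sketch below is an assumption about a possible fix, not the change made in the PR; `merge_dataset_info` and its `preserve` parameter are hypothetical names, and the example values are made up.

```python
def merge_dataset_info(
    local: dict,
    remote: dict,
    preserve: tuple[str, ...] = ("data_loader_row_idx",),
) -> dict:
    """Merge remote dataset_info into local, keeping local values for preserved keys."""
    # Remote values win by default, and local-only keys are retained.
    merged = {**local, **remote}
    # Restore the keys that must reflect the original source dataset.
    for key in preserve:
        if key in local:
            merged[key] = local[key]
    return merged


# Illustrative values only (assumption): the remote row was re-indexed to 0.
local_info = {"data_loader_row_idx": 7, "data_loader_num_rows": 10}
remote_info = {"data_loader_row_idx": 0, "source": "remote-trace"}

merged = merge_dataset_info(local_info, remote_info)
print(merged["data_loader_row_idx"])  # 7
```

Returning a new dict rather than mutating either input keeps the remote row's metadata untouched, which makes the behavior easier to cover with a test.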
Can you make sure to add tests for whatever behavior you expect?
dphuang2 left a comment
awesome, thanks! I think there are existing tests for the non-duplicated case so if tests pass then LGTM
Note
Adds data_loader_row_idx to row metadata, copies dataset_info from remote traces, and adds a test ensuring stable row IDs for identical-content rows.
- Adds data_loader_row_idx to row.input_metadata.dataset_info in eval_protocol/data_loader/models.py.
- Uses enumerate in _apply_metadata to attach a per-row index.
- Copies input_metadata.dataset_info from remote rows in eval_protocol/pytest/tracing_utils.py during rollout update.
- Adds tests/data_loader/test_data_loader_stable_row_id.py to verify stable/unique row_id across repeated generation with identical content.

Written by Cursor Bugbot for commit 93ebb15.