Success only filter for lerobot dataset conversion #22
Conversation
Pull request overview
Adds options and logic to support filtering to successful episodes during LeRobot v2.1 conversion, while maintaining LeRobot’s requirement for contiguous episode_index/task_index and enabling safe re-runs into an existing output directory.
Changes:
- Add `success_only` filtering during conversion and remap episode/task indices to contiguous ranges in the output.
- Add `overwrite` support to clear prior LeRobot outputs safely when re-running conversion.
- Adjust train split computation to avoid empty train splits on very small (filtered) datasets.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `tests/test_lerobot_v21.py` | Updates existing conversion fixture to use overwrite, and adds a new test covering success_only conversion output. |
| `src/openarm_dataset/lerobot_v21.py` | Implements success_only filtering, contiguous episode/task remapping, overwrite clearing logic, and updates split computation. |
| `src/openarm_dataset/convert.py` | Exposes `--success-only` and `--overwrite` CLI flags and passes them through to `Dataset.write(...)` for lerobot_v2.1. |
OK, will separate the PR.

Remove
```diff
 num_episodes = len(records)
 total_chunks = max((num_episodes - 1) // CHUNK_SIZE + 1, 0) if num_episodes else 0
-train_end = int(num_episodes * train_split)
+train_end = round(num_episodes * train_split)
```
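The switch from `int` to `round` matters for small filtered datasets: truncation can leave the train split empty. A minimal sketch (helper names are illustrative, not from the PR):

```python
def train_end_truncated(num_episodes: int, train_split: float) -> int:
    # Before the patch: truncation toward zero.
    return int(num_episodes * train_split)

def train_end_rounded(num_episodes: int, train_split: float) -> int:
    # After the patch: rounding keeps at least one episode in train
    # once num_episodes * train_split reaches 0.5.
    return round(num_episodes * train_split)

# With a single surviving episode and a 0.9 split:
print(train_end_truncated(1, 0.9))  # 0 -> empty train split
print(train_end_rounded(1, 0.9))    # 1 -> the episode lands in train
```

With `success_only` filtering, a dataset can easily shrink to one or two episodes, which is exactly where the truncated version produced an empty train split.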
```python
def _build_remaps(dataset: Dataset, records):
    """Build remapping dicts from original episode/task indices to contiguous indices.

    When records is a filtered subset of episodes (e.g., success_only=True),
    original indices may be sparse. LeRobot v2.1 expects episode/task indices
    to be contiguous starting from 0. When records contains all episodes the
    returned maps are the identity.
    """
    remap_episode_index = {original: new for new, (original, *_) in enumerate(records)}
    seen = set()
    used_task_indices = []
    for original_episode_index, *_ in records:
        original_task_index = int(
            dataset.meta.episodes[original_episode_index]["task_index"]
        )
        if original_task_index not in seen:
            seen.add(original_task_index)
            used_task_indices.append(original_task_index)
    used_task_indices.sort()
    remap_task_index = {original: new for new, original in enumerate(used_task_indices)}
    return remap_episode_index, remap_task_index
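To illustrate the remapping on concrete data, here is a self-contained sketch with hypothetical in-memory metadata standing in for `dataset.meta.episodes` (the logic mirrors the function above, but the episode dicts and record tuples are invented for the example):

```python
# Hypothetical metadata: four episodes, tasks 0 and 2.
episodes = {
    0: {"task_index": 0},
    1: {"task_index": 2},
    2: {"task_index": 0},
    3: {"task_index": 2},
}
# Suppose success_only filtering kept only episodes 1 and 3 (sparse indices).
records = [(1,), (3,)]

# Episode indices become contiguous in record order.
remap_episode_index = {orig: new for new, (orig, *_) in enumerate(records)}

# Task indices: collect the tasks still referenced, sort, renumber from 0.
seen, used_task_indices = set(), []
for orig, *_ in records:
    task = episodes[orig]["task_index"]
    if task not in seen:
        seen.add(task)
        used_task_indices.append(task)
used_task_indices.sort()
remap_task_index = {orig: new for new, orig in enumerate(used_task_indices)}

print(remap_episode_index)  # {1: 0, 3: 1}
print(remap_task_index)     # {2: 0}
```

When `records` contains every episode, both dicts reduce to identity maps, so the unfiltered path is unchanged.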
```python
parser.add_argument(
    "--success-only",
    help="Include only successful episodes in the output dataset (default: False) if the output format is lerobot_v2.1",
    action="store_true",
    default=False,
)
```
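For completeness, a runnable sketch of how this `store_true` flag behaves in isolation (a standalone parser, not the PR's actual `convert.py` parser, which also wires the value through to `Dataset.write(...)`):

```python
import argparse

# Minimal stand-in parser demonstrating the flag semantics.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--success-only",
    help="Include only successful episodes in the output dataset "
    "(default: False) if the output format is lerobot_v2.1",
    action="store_true",
    default=False,
)

args = parser.parse_args(["--success-only"])
print(args.success_only)  # True; omitted -> False
```

Note that argparse converts the dashed flag name to the attribute `success_only`.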
kou left a comment
It may be better to create a class instead of using module-private functions, to reduce the number of arguments passed to those functions.
Can we work on it as a separate task?
```python
):
    records = []
    for episode_index in range(dataset.meta.num_episodes):
        success = bool(dataset.meta.episodes[episode_index]["success"])
```
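A self-contained sketch of the filtering idea in this loop, using a hypothetical in-memory episode list in place of `dataset.meta.episodes` (the real loop lives in `_collect_downsampled_data` and also gathers the downsampled frames per episode):

```python
# Hypothetical episode metadata: middle episode failed.
meta_episodes = [
    {"success": True},
    {"success": False},
    {"success": True},
]

def collect_records(success_only: bool):
    records = []
    for episode_index in range(len(meta_episodes)):
        success = bool(meta_episodes[episode_index]["success"])
        if success_only and not success:
            # Skip failed episodes when filtering is enabled.
            continue
        records.append((episode_index,))
    return records

print(collect_records(success_only=False))  # [(0,), (1,), (2,)]
print(collect_records(success_only=True))   # [(0,), (2,)]
```

The surviving original indices (0 and 2 here) are sparse, which is why the conversion then remaps them to a contiguous range.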
```diff
 # collect downsampled data for each episode
-records = _collect_downsampled_data(dataset, fps, joint_keys)
+records = _collect_downsampled_data(dataset, fps, joint_keys, success_only)
```
Should we return an error when there are 0 records?
Is empty LeRobotDataset v2.1 invalid?
Summary
Because LeRobot v2.1 expects contiguous `episode_index`/`task_index` values starting from 0, the conversion now remaps original sparse indices to a contiguous range whenever episodes are filtered out (and falls back to the identity map when nothing is filtered).