[ISSUE GUIDE] dataset path mismatch, hidden files causing empty task names, and corrupted output HDF5s #27

@MrZoyo

Issue 1 — Dataset directory layout mismatch (preprocess finds no data / weird downstream errors)

Symptom

preprocess_libero.py fails early or behaves unexpectedly because it does not find expected LIBERO suite files in the directory it scans.

Root cause

The download/output directory conventions are inconsistent: the preprocessing script expects suites under a data/libero/<suite> layout, but the dataset may be placed directly under data/<suite> (e.g., data/libero_10, data/libero_90, …).

Fix

Move suite folders under data/libero/ so the expected structure exists:

mkdir -p data/libero
mv data/libero_10 data/libero/
mv data/libero_90 data/libero/
mv data/libero_goal data/libero/
mv data/libero_object data/libero/
mv data/libero_spatial data/libero/

After this, data/libero/ contains:

  • libero_10, libero_90, libero_goal, libero_object, libero_spatial
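The same move can be scripted in Python, which also makes it easy to check the result before starting a long preprocessing run. This is a minimal sketch, not part of the repository: relocate_suites and missing_suites are hypothetical helper names, and the data root is assumed to be data/ as in the commands above.

```python
import shutil
from pathlib import Path

# Suite names taken from the listing above.
SUITES = ["libero_10", "libero_90", "libero_goal", "libero_object", "libero_spatial"]


def relocate_suites(data_root="data"):
    """Move data/<suite> to data/libero/<suite>; return the suites that were moved."""
    root = Path(data_root)
    target = root / "libero"
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for suite in SUITES:
        src, dst = root / suite, target / suite
        # Only move when the source exists and nothing is already at the destination.
        if src.is_dir() and not dst.exists():
            shutil.move(str(src), str(dst))
            moved.append(suite)
    return moved


def missing_suites(data_root="data"):
    """Return the suites that are still absent from data/libero/."""
    target = Path(data_root) / "libero"
    return [s for s in SUITES if not (target / s).is_dir()]
```

Calling relocate_suites() and then missing_suites() from the repository root should leave an empty list if all five suites were downloaded.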

Issue 2 — IndexError: string index out of range in get_task_name_from_file_name

Symptom

Preprocessing crashes with:

IndexError: string index out of range
... in get_task_name_from_file_name
if name[0].isupper():

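In our case this sometimes happened after processing many demos (e.g., after completing 50/50 demos for a task), which is consistent with the offending directory entry being reached only after the valid files listed before it.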

Root cause

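The script iterates over directory entries using something like os.listdir(...) and derives task names via split('.'). If the directory contains hidden entries whose names start with a dot (e.g., .DS_Store or .ipynb_checkpoints), split('.')[0] is an empty string (""), and name[0] raises the IndexError. Other non-.hdf5 entries do not produce an empty name, but they are not valid demo files either and should not be in the directory.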
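The failure mode is easy to reproduce in isolation. The snippet below is a sketch of the split step only; first_segment is a hypothetical stand-in, not the actual body of get_task_name_from_file_name.

```python
def first_segment(file_name):
    # Assumed to mirror the split('.')[0] step described above;
    # not copied from the preprocessing script.
    return file_name.split(".")[0]


for entry in ["demo_task.hdf5", ".DS_Store", ".ipynb_checkpoints"]:
    name = first_segment(entry)
    # An empty name means name[0] would raise IndexError.
    print(repr(entry), "->", repr(name), "| safe to index name[0]:", bool(name))
```

Any entry whose name begins with a dot yields an empty first segment, so indexing it fails regardless of how many valid demos were processed before it.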

Fix (recommended)

Ensure the input suite directory contains only .hdf5 files (remove hidden files / junk entries), e.g.:

find data/libero/libero_spatial -maxdepth 1 -name ".DS_Store" -delete
find data/libero/libero_spatial -maxdepth 1 -name ".ipynb_checkpoints" -exec rm -rf {} +

Fix (more robust, code-level)

Change the outer traversal to only iterate over .hdf5 files (e.g., glob("*.hdf5")) instead of os.listdir. This avoids crashes even if hidden files exist.
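A minimal sketch of that traversal, assuming the demos sit directly inside the suite directory. list_hdf5_files is a hypothetical helper, not a function from the repository. Note that pathlib's glob("*.hdf5") also matches dot-prefixed names such as macOS "._name.hdf5" sidecar files, so those are filtered out explicitly.

```python
from pathlib import Path


def list_hdf5_files(suite_dir):
    """Return the non-hidden .hdf5 files directly inside suite_dir, sorted by name."""
    return sorted(
        p
        for p in Path(suite_dir).glob("*.hdf5")
        # Skip directories and dot-prefixed entries that happen to end in .hdf5.
        if p.is_file() and not p.name.startswith(".")
    )
```

Iterating over list_hdf5_files("data/libero/libero_spatial") instead of os.listdir(...) means .DS_Store and .ipynb_checkpoints are never passed to the task-name logic.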


Issue 3 — OSError: Unable to open file (bad object header version number) while preprocessing

Symptom

Preprocessing crashes with:

OSError: Unable to open file (bad object header version number)
... in inital_save_h5
with h5py.File(path, 'r') as f:

Notably:

  • It can happen very early (0%) or after completing some demos.
  • A scan of the input dataset .hdf5 files shows they are all readable (bad=0), yet the error persists.

Root cause

This is not caused by the input dataset .hdf5 files. It is caused by a corrupted output .hdf5 file under the preprocessing output directory:

  • Output root: data/atm_libero/<suite>/.../demo_k.hdf5

If a previous run was interrupted (killed job, disconnect, disk full, etc.), a partially written demo_*.hdf5 can remain. When rerunning with --skip_exist 1, the script tries to open existing output files in read mode to decide whether to skip; opening a corrupted output file triggers the HDF5 “bad object header” error.

Fix (per-file)

Delete the specific corrupted output file and rerun with --skip_exist 1.
Example:

rm -f data/atm_libero/libero_goal/<task_name>/demo_1.hdf5
python -m scripts.preprocess_libero --suite libero_goal --skip_exist 1

Fix (suite-level, fastest if nothing valuable was produced)

If the suite fails at 0% or output is not needed, remove the entire output suite folder and rerun:

rm -rf data/atm_libero/libero_90
python -m scripts.preprocess_libero --suite libero_90 --skip_exist 1

Debug tip

If the stack trace does not show which file is corrupted, add a debug print before opening the file in inital_save_h5():

print("[DEBUG] opening existing h5:", path, flush=True)

Then rerun once to get the exact path to delete.
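As an alternative to the debug print, the output tree can be scanned up front for files that fail to open. The sketch below assumes h5py is installed and that outputs live under data/atm_libero as described above. find_unreadable_h5 is a hypothetical helper, and the opener parameter exists only so the function can be exercised without real HDF5 files. A file that opens cleanly may still contain incomplete data, so this only catches the open-time error reported here.

```python
from pathlib import Path


def _open_readonly(path):
    import h5py  # imported lazily so the scan logic can be tested without h5py

    return h5py.File(path, "r")


def find_unreadable_h5(output_root, opener=_open_readonly):
    """Return (path, error message) pairs for .hdf5 files that cannot be opened."""
    bad = []
    for path in sorted(Path(output_root).rglob("*.hdf5")):
        try:
            # Open and immediately close; we only care whether the open succeeds.
            with opener(str(path)):
                pass
        except OSError as err:
            bad.append((path, str(err)))
    return bad
```

Running find_unreadable_h5("data/atm_libero") and printing the result lists every output file that would trip the preprocessing script on the next --skip_exist 1 run.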

Preventive improvement (code-level)

Wrap the h5py.File(path, 'r') open with try/except OSError and, on failure, delete and regenerate the corrupted output file (so long runs don’t get stuck on a single bad artifact).
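One way this could look is sketched below. The script's actual skip logic is not shown in this issue, so existing_output_is_usable is a hypothetical helper rather than a patch against inital_save_h5, and the opener parameter is there only to make the function testable without h5py.

```python
import os


def _open_readonly(path):
    import h5py  # imported lazily so the helper can be tested without h5py

    return h5py.File(path, "r")


def existing_output_is_usable(path, opener=_open_readonly):
    """Return True if path exists and opens; delete it and return False if it is corrupted."""
    if not os.path.exists(path):
        return False
    try:
        with opener(path):
            return True
    except OSError as err:
        # A partially written file from an interrupted run: remove it so it is regenerated.
        print(f"[WARN] removing unreadable output {path}: {err}", flush=True)
        os.remove(path)
        return False
```

The caller would skip a demo only when this returns True and regenerate it otherwise, so a single bad artifact no longer aborts the whole run.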
