[ISSUE GUIDE] dataset path mismatch, hidden files causing empty task names, and corrupted output HDF5s

## Issue 1 — Dataset directory layout mismatch (preprocess finds no data / weird downstream errors)

### Symptom

`preprocess_libero.py` fails early, or runs into unrelated-looking errors later, because it does not find the expected LIBERO suite files in the directory it scans.

### Root cause

The download/output directory conventions are inconsistent: the preprocessing script expects suites under a `data/libero/<suite>` layout, but the dataset may be placed directly under `data/<suite>` (e.g., `data/libero_10`, `data/libero_90`, …).

### Fix

Move suite folders under `data/libero/` so the expected structure exists:

```bash
mkdir -p data/libero
mv data/libero_10 data/libero/
mv data/libero_90 data/libero/
mv data/libero_goal data/libero/
mv data/libero_object data/libero/
mv data/libero_spatial data/libero/
```

After this, `data/libero/` contains `libero_10`, `libero_90`, `libero_goal`, `libero_object`, and `libero_spatial`.
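To confirm the layout before starting a long preprocessing run, a small check along these lines can help. This is a minimal sketch, not part of the repository: `missing_suites` and `EXPECTED_SUITES` are hypothetical names, and the default root `data/libero` is taken from the layout described above.

```python
from pathlib import Path

# Suite folder names as listed in this guide.
EXPECTED_SUITES = (
    "libero_10",
    "libero_90",
    "libero_goal",
    "libero_object",
    "libero_spatial",
)


def missing_suites(root="data/libero"):
    """Return the expected suite folders that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_SUITES if not (root / name).is_dir()]
```

Calling `missing_suites()` from the repository root returns an empty list once every suite folder is in place; any names it returns still need to be moved.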

---

## Issue 2 — `IndexError: string index out of range` in `get_task_name_from_file_name`

### Symptom

Preprocessing crashes with:

```
IndexError: string index out of range
... in get_task_name_from_file_name
if name[0].isupper():
```

In our case this sometimes happened after processing many demos (e.g., after completing 50/50 demos for a task).

### Root cause

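The script iterates over directory entries using something like `os.listdir(...)` and derives task names via `split('.')`. If the directory contains hidden entries whose names begin with a dot (e.g., `.DS_Store`, `.ipynb_checkpoints`), `split('.')[0]` is an empty string (`""`), and `name[0]` raises the `IndexError`. Because `os.listdir` returns entries in arbitrary order, the hidden entry may be reached only after valid files have been processed, which matches the late crash described above. Other non-`.hdf5` entries may not raise this exact error, but they are not valid inputs either.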
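The string behavior behind this crash can be reproduced in isolation. The snippet below is only an illustration of `split('.')` on dot-prefixed names; `first_component` is a hypothetical helper, `some_task_demo.hdf5` is a made-up file name, and the real body of `get_task_name_from_file_name` is not shown here.

```python
def first_component(entry_name):
    """Return the text before the first dot, as `split('.')[0]` does."""
    return entry_name.split(".")[0]


for entry in [".DS_Store", ".ipynb_checkpoints", "some_task_demo.hdf5"]:
    name = first_component(entry)
    # An empty `name` is exactly the case where `name[0]` raises IndexError.
    note = "empty: name[0] would raise IndexError" if not name else "ok"
    print(repr(entry), "->", repr(name), f"({note})")
```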

### Fix (recommended)

Ensure the input suite directory contains only `.hdf5` files (remove hidden files / junk entries), e.g.:

```bash
find data/libero/libero_spatial -maxdepth 1 -name ".DS_Store" -delete
find data/libero/libero_spatial -maxdepth 1 -name ".ipynb_checkpoints" -exec rm -rf {} +
```

### Fix (more robust, code-level)

Change the outer traversal to only iterate over `.hdf5` files (e.g., `glob("*.hdf5")`) instead of `os.listdir`. This avoids crashes even if hidden files exist.
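One way this could look is sketched below. It assumes the traversal only needs the `.hdf5` files directly inside a suite directory; `list_hdf5_files` is a hypothetical helper name, not a function from the script. Note that `pathlib`'s `glob("*.hdf5")` also matches dot-prefixed names (for example macOS `._*.hdf5` sidecar files), so the sketch filters those out explicitly.

```python
from pathlib import Path


def list_hdf5_files(suite_dir):
    """Return sorted .hdf5 files directly inside `suite_dir`, skipping hidden entries."""
    return sorted(
        p
        for p in Path(suite_dir).glob("*.hdf5")
        if p.is_file() and not p.name.startswith(".")
    )
```

The loop that currently walks `os.listdir(...)` would then iterate over `list_hdf5_files("data/libero/libero_spatial")` and use `p.name` or `p.stem` wherever it previously used the raw entry name.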

---

## Issue 3 — `OSError: Unable to open file (bad object header version number)` while preprocessing

### Symptom

Preprocessing crashes with:

```
OSError: Unable to open file (bad object header version number)
... in inital_save_h5
with h5py.File(path, 'r') as f:
```

Notably:

* It can happen very early (0%) or after completing some demos.
* A scan of the *input* dataset `.hdf5` files shows they are all readable (bad=0), yet the error persists.

### Root cause

This is **not** caused by the input dataset `.hdf5` files. It is caused by a **corrupted output** `.hdf5` file under the preprocessing output directory:

* Output path pattern: `data/atm_libero/<suite>/.../demo_k.hdf5`

If a previous run was interrupted (killed job, disconnect, disk full, etc.), a partially written `demo_*.hdf5` can remain. When rerunning with `--skip_exist 1`, the script tries to open existing output files in read mode to decide whether to skip; opening a corrupted output file triggers the HDF5 “bad object header” error.

### Fix (per-file)

Delete the specific corrupted output file and rerun with `--skip_exist 1`.
Example:

```bash
rm -f data/atm_libero/libero_goal/<task_name>/demo_1.hdf5
python -m scripts.preprocess_libero --suite libero_goal --skip_exist 1
```

### Fix (suite-level, fastest if nothing valuable was produced)

If the suite fails at 0%, or the existing output for that suite is not needed, remove the entire output suite folder and rerun:

```bash
rm -rf data/atm_libero/libero_90
python -m scripts.preprocess_libero --suite libero_90 --skip_exist 1
```

### Debug tip

If the stack trace does not show which file is corrupted, add a debug print before opening the file in `inital_save_h5()`:

```python
print("[DEBUG] opening existing h5:", path, flush=True)
```

Then rerun once to get the exact `path` to delete.
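If several output files are damaged, rerunning once per file gets slow. As an alternative, the output tree can be scanned in one pass. The sketch below is an assumption-laden helper, not part of the repository: `find_unreadable` is a hypothetical name, and the `open_fn` parameter exists only so the scan can be tried with a stand-in opener; by default it opens each file with `h5py.File(path, "r")`.

```python
from pathlib import Path


def find_unreadable(root, open_fn=None):
    """Return paths of .hdf5 files under `root` that raise OSError when opened."""
    if open_fn is None:
        import h5py  # imported lazily; only needed for the default opener

        def open_fn(path):
            return h5py.File(path, "r")

    bad = []
    for path in sorted(Path(root).rglob("*.hdf5")):
        try:
            with open_fn(str(path)):
                pass  # opening successfully is the whole check
        except OSError:
            bad.append(path)
    return bad
```

For example, `find_unreadable("data/atm_libero/libero_goal")` lists the candidates to delete before rerunning with `--skip_exist 1`.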

### Preventive improvement (code-level)

Wrap the `h5py.File(path, 'r')` open with `try/except OSError` and, on failure, delete and regenerate the corrupted output file (so long runs don’t get stuck on a single bad artifact).
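A minimal sketch of that guard follows. It assumes the skip decision only needs to know whether the existing output opens cleanly; how `inital_save_h5()` actually uses `--skip_exist` is not shown in this guide, so `reuse_or_discard` and its `open_fn` parameter are hypothetical. By default the file is opened with `h5py.File(path, "r")`.

```python
import os


def reuse_or_discard(path, open_fn=None):
    """Return True if `path` exists and opens cleanly.

    If opening raises OSError, delete the file and return False so the
    caller regenerates it. A missing file also returns False.
    """
    if open_fn is None:
        import h5py  # imported lazily; only needed for the default opener

        def open_fn(p):
            return h5py.File(p, "r")

    if not os.path.exists(path):
        return False
    try:
        with open_fn(path):
            return True
    except OSError as err:
        print(f"[WARN] removing unreadable output {path}: {err}", flush=True)
        os.remove(path)
        return False
```

With this in place, the skip branch would only be taken when `reuse_or_discard(path)` returns `True`; otherwise the demo is written again. Deleting on any `OSError` is a deliberate simplification, so it is worth confirming that no other process is still writing to the output directory.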
