Rethink SEGFAULT protection #1194

@woutdenolf

Regarding the multiprocessing problem, we have arrived at a point where making minimal changes is more difficult than re-thinking the entire thing, imo.

So I'm going back 4 years to when this was introduced: #848.

class H5NodeProxy(object):

    def raw_keys(self):
        # !!!!!! THIS CAN SEGFAULT when the node is h5py.File !!!!!!
        for node in self.children:
            yield node.name


class H5FileProxy(H5NodeProxy):
    def raw_keys(self):
        if isinstance(self.file, h5py.File):
            # The root node is a native h5py.File object
            file_path = self.file.filename
            data_path = self.name
            return HDF5Utils.safe_hdf5_group_keys(file_path, data_path=data_path)
        else:
            return super().raw_keys()

Note that the SEGFAULTs happen when inspecting a file that is still being written to (i.e. an "online" file).

So in summary: when we are getting the keys of an h5py.File instance we can get a SEGFAULT, and we want to protect ourselves from it.

Protecting against SEGFAULTs can only be done by using a subprocess. We cannot pass the actual h5py.File object, so we need to pass the file_path and data_path (which is always going to be "/" in this case).
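To illustrate why the object itself cannot cross the process boundary, here is a quick stdlib sketch using a plain Python file object as a stand-in for h5py.File (which wraps an open OS-level file handle): open handles cannot be pickled, while plain strings like file_path and data_path can.

```python
import pickle

# An open file handle cannot be pickled, so it cannot be sent to a
# subprocess. Plain strings round-trip through pickle without issue.
with open(__file__) as f:
    try:
        pickle.dumps(f)
    except TypeError as exc:
        print("cannot send file object:", exc)

print(pickle.loads(pickle.dumps(("/tmp/data.h5", "/"))))  # -> ('/tmp/data.h5', '/')
```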

The reason we don't expect safe_hdf5_group_keys to raise an error is that we already have self.file, which is an opened file object. So any failure would most likely be due to a SEGFAULT. I think this is where the bad design starts to creep in, because this assumption makes the implementation of safe_hdf5_group_keys unclear.

So if we forget about what calls this and just look at a function that protects against crashing the main process but otherwise behaves like normal h5py (you get exceptions when HDF5 raises exceptions):

def safe_hdf5_group_keys(file_path, data_path="/"):
     """Return a list of keys for the HDF5 group or [] if the process crashes (segfault)."""
    return _run_with_segfault_protection(
        _get_hdf5_group_keys,
        file_path,
        data_path,
        default=[],
    )


def _get_hdf5_group_keys(file_path, data_path):
    import h5py
    with h5py.File(file_path, "r") as f:
        return list(f[data_path].keys())


def _run_with_segfault_protection(func, *args, default=None, timeout=5, **kwargs):
    """
    Run `func(*args, **kwargs)` in a subprocess to isolate crashes.

    - Returns the result if the subprocess succeeds
    - Returns `default` if the subprocess crashes or hangs
    - Re-raises the exception if `func` raised one in the subprocess
    - Falls back to direct execution if multiprocessing is unavailable
    """
    try:
        import multiprocessing

        ctx = multiprocessing.get_context("spawn")
        queue = ctx.Queue(maxsize=1)

        p = ctx.Process(
            target=_subprocess_worker,
            args=(queue, func, args, kwargs),
        )
        p.start()
    except Exception:
        # Fallback: multiprocessing unavailable or not working
        return func(*args, **kwargs)

    try:
        p.join(timeout)

        # Subprocess still alive? Force kill.
        if p.is_alive():
            try:
                p.terminate()  # graceful attempt
            except Exception:
                pass
            try:
                p.kill()  # forceful attempt
            except Exception:
                pass
            p.join()
            return default

        # The subprocess crashed (e.g. a SEGFAULT gives a negative exit code)
        if p.exitcode != 0:
            return default

        try:
            status, payload = queue.get(timeout=1)
        except Exception:
            return default

        if status == "ok":
            return payload
        # `func` raised a normal Python exception in the subprocess:
        # re-raise it in the parent. Note that this must happen outside
        # any broad `except Exception` block, otherwise the fallback
        # would swallow it and re-run `func` in the main process.
        raise payload

    finally:
        # Safe cleanup
        try:
            queue.close()
        except Exception:
            pass
        try:
            p.close()
        except Exception:
            pass


def _subprocess_worker(queue, func, args, kwargs):
    try:
        result = func(*args, **kwargs)
        queue.put(("ok", result))
    except Exception as exc:
        # We treat Python exceptions as normal results
        queue.put(("error", exc))

The big difference with the current implementation is that we make a distinction between a normal Python exception and a SEGFAULT.
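That distinction can be demonstrated with a minimal self-contained sketch (stdlib only; the "fork" start method is used here for brevity instead of the "spawn" method above, and the segfault is simulated with os.kill, so this runs on Unix only):

```python
import multiprocessing
import os
import signal


def _worker(queue, func, args):
    # A normal Python exception is sent back as a payload,
    # a hard crash kills the subprocess before anything is sent.
    try:
        queue.put(("ok", func(*args)))
    except Exception as exc:
        queue.put(("error", exc))


def run_protected(func, *args, default=None, timeout=5):
    ctx = multiprocessing.get_context("fork")  # "spawn" in the real proposal
    queue = ctx.Queue(maxsize=1)
    p = ctx.Process(target=_worker, args=(queue, func, args))
    p.start()
    p.join(timeout)
    if p.is_alive():  # hung: kill it and give up
        p.kill()
        p.join()
        return default
    if p.exitcode != 0:  # crashed: SIGSEGV gives exitcode -11 on Unix
        return default
    status, payload = queue.get(timeout=1)
    if status == "ok":
        return payload
    raise payload  # re-raise the worker's exception in the parent


def segfaults():
    os.kill(os.getpid(), signal.SIGSEGV)  # simulate a hard crash


def misbehaves():
    raise ValueError("normal error")


print(run_protected(segfaults, default=[]))  # -> []
try:
    run_protected(misbehaves)
except ValueError as exc:
    print(exc)  # -> normal error
```

The crash only produces the default value, while the ValueError propagates to the caller exactly as it would with plain h5py.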
