Skip to content

Split getitem into several dedicated named functions #283

@jl-wynen

Description

@jl-wynen

Overloaded indexing

We heavily overload __getitem__ to perform a variety of tasks based on the type of its argument. While this can make for a convenient interface in simple cases, it makes the implementation very difficult to understand and check. Further, it also makes it difficult to understand what is happening at the call site. Consider for example this code:

data[x, a:b]

Is this label-based or positional indexing? That is, is x a dimension or a coordinate name? The only way to know is by figuring out the types of a and b which can be tricky.
Similarly:

data[x]

Does this look up a child by name, multiple children by nexus class or does it select a single element by index? Again, we would have to know the type of x.
This is also a problem in Scipp which I think we should address. But I think it is exacerbated in ScippNexus due to the many more overloads (although DataGroup shares some of the extra complexity).

I think this is a comprehensive list: (arguments indicate types, not values)

  • select child: d[str]
  • select all children with matching class: d[type]
  • load all: d[()]
  • load all: d[...]
  • load a single element (1d): d[int]
  • load a range (1d): d[slice[int, int, Any]]
  • load a single element (nd): d[str, int]
  • load a single element (nd): d[str, Variable]
  • load a range (nd): d[str, slice[int, int, Any]]
  • load a range by label (nd): d[str, slice[Variable, Variable, Any]]
  • combination of above for multiple dims: d[dict[str, int | slice[int, int, Any] | slice[Variable, Variable, Any]]]

Granted, not all overloads are available for both groups and datasets. But each has a nontrivial subset of them.

Proposed solution: split into named functions

Ask yourself this question: How often do we call __getitem__ (outside of ScippNexus) and do not know which of the above cases we want? I would expect this number to be almost if not exactly zero. So why do we have code that has to determine, at run time, which of the options the caller selected?

So here is my proposal: Split up __getitem__ into separate named functions which each handle a single user case. This way, each function becomes much simpler and the ambiguity at the call site disappears. This approach is similar to the sel, isel, etc. functions in Pandas.

  • Select a single child
  • Select any children by nexus class
  • Load dataset (all or sub range by position)
    • I think it is fine to overloade d[()], d[...], d[int], d[slice[int, int, Any]] as that is common enough in Python.
    • While I have come to dislike explicit dim labels in __getitem__ (d[str, int] and d[str, slice[int, int, Any]]), I don't think we can split this off any more, there are too many usages now. So we should stick to it in ScippNexus as well.
  • Load all data in a group with a slice
    • I would split this off so that we don't overload d[str] and d[int] etc.
  • Load data by label
    • This is a major point of complication in Scipp itself; we should update that as well if we change ScippNexus.
    • In ScippNexus, performance and memory usage of this operation is different from Scipp and it is much more costly than slicing by position. So this should be a clear opt-in from the user.
  • Slicing in multiple dimensions
    • Is this feature even used?
    • This does not exist in Scipp, so we don't need to use a syntax here that matches what it would look like in Scipp if we had this feature.

I would not be afraid to break with the analogy with h5py and split d[str] and d[()] into separate functions. Just because they do it does not mean that we have to as well.

I am not proposing function names yet or which operations are covered by named functions and which by __getitem__ because I want to discuss the idea itself first.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions