feat: Add codec accessors from (Async)Array; remove codec subclasses #44

Merged: kylebarron merged 7 commits into main from kyle/codecs-overhaul on Jun 23, 2026

@kylebarron (Member) commented on Jun 23, 2026

Change list

  • Instead of a single codecs accessor returning an array of codecs, Array/AsyncArray now expose compressors (BytesToBytes codecs), filters (ArrayToArray codecs) and serializer (ArrayToBytes codec) as three separate properties. This matches upstream Zarr-Python.
  • Remove codec subclasses. Instead of Blosc being a class, it's now just a function that creates a BytesToBytesCodec instance. This is simpler code, and it also makes returning codecs simpler, because we don't have to downcast to a specific Python subclass. We can return BytesToBytesCodec objects generically.
  • Add name, config, and from_config methods to each codec type.
  • Remove CodecChain, as it's now superseded by the specific codec types returned from Array/AsyncArray.
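
To make the "function instead of subclass" idea concrete, here is a minimal sketch. It does not use the library's real classes: `BytesToBytesCodec` below is a stand-in dataclass, `make_blosc` is a hypothetical factory name, and the two Blosc settings shown are a simplified subset. The only things taken from the change list are that each codec type has `name`, `config`, and `from_config`, and that codecs are built by functions that return the generic category type. The `{"name": ..., "configuration": ...}` metadata shape follows the Zarr v3 codec metadata convention.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class BytesToBytesCodec:
    """Stand-in for the generic category class (not the library's implementation)."""

    name: Optional[str]
    config: Optional[Mapping[str, Any]]

    @classmethod
    def from_config(cls, metadata: Mapping[str, Any]) -> "BytesToBytesCodec":
        # Zarr v3 codec metadata carries a "name" and an optional "configuration".
        return cls(metadata["name"], metadata.get("configuration"))


def make_blosc(cname: str = "zstd", clevel: int = 5) -> BytesToBytesCodec:
    """Hypothetical factory: returns the generic class, no Blosc subclass involved."""
    metadata = {"name": "blosc", "configuration": {"cname": cname, "clevel": clevel}}
    return BytesToBytesCodec.from_config(metadata)


codec = make_blosc(clevel=9)
print(codec.name, codec.config)
```

Because every codec in a category is the same Python type, code that reads codecs back from an array can return them without knowing which concrete codec it is holding.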

@kylebarron (Member, Author) commented:

A consequence of this simplification (of not exposing subclasses) is that we have worse type hinting, because we can't type both the strict codec metadata from zarr_metadata and an open-ended plugin approach.

We could improve type hinting if we brought back subclasses, but I'm not sure if type hinting is worth bringing back the more complex subclass approach.

cc @d-v-b

Claude:

Codec Python API: how to type known vs. unknown (plugin) codecs

Context

Array.filters / .serializer / .compressors return codec objects in three
categories: ArrayToArrayCodec, ArrayToBytesCodec, BytesToBytesCodec.
zarrs ships ~24 codecs today and supports runtime plugin codecs, so the set
of possible codecs is open-ended.

We currently have a lean design: one class per category, with name -> str | None,
config -> JSONValue | None, and a generic from_config(metadata) (routes through
Codec::from_metadata, so it covers any codec including plugins). The open
question is how far to go on typing the known codecs.

Goal

When a user inspects a codec read back from an array, ideally:

  1. Clean narrowing — identifying a known codec narrows its config to the
    exact TypedDict (e.g. GzipCodecConfiguration), no cast.
  2. Readable plugin config — an unknown/plugin codec's config is still usable.
  3. Open world — plugins never break; no closed enumeration that silently
    degrades.

Finding: a single generic class can't satisfy all three (verified with pyright)

We explored making the category class generic on the discriminant, e.g.
BytesToBytesCodec[Name, Config], with per-codec aliases
(GzipCodec = BytesToBytesCodec[Literal["gzip"], GzipCodecConfiguration]) and a
union return type, so if c.name == "gzip" would narrow the whole instance.

It half-works. The blocker is that the open-world fallback member has
name: str, and str ⊇ Literal["gzip"], so a type checker can never exclude
the fallback on a literal check — and Python has no "str minus these literals"
(negation) type. Whatever config the fallback carries either leaks into the
known branches or vanishes everywhere:

Fallback config              Known-codec config after name ==                      Plugin config
---------------------------  ----------------------------------------------------  ------------------------------------
JSONValue                    GzipConfig | JSONValue (cast needed even for known)   JSONValue
Never                        GzipConfig ✅ clean                                    Never (unreadable)
(closed union, no fallback)  GzipConfig ✅ clean                                    plugin mistyped as a known codec ❌

So with a single discriminated-union class it's a trilemma — pick two.
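
The first row of the table can be reproduced with a small sketch. Everything here is an illustrative stand-in rather than the library's code: the generic dataclass, the simplified `JSONValue` alias, the local `GzipCodecConfiguration` TypedDict (the real one comes from zarr_metadata), and the helper `gzip_level`. The point is that after the `name == "gzip"` check, the fallback member is still possible as far as the type checker is concerned, so the config has to be inspected or cast at runtime.

```python
from dataclasses import dataclass
from typing import Generic, Literal, Optional, TypedDict, TypeVar, Union

# Simplified stand-in for a recursive JSON type.
JSONValue = Union[None, bool, int, float, str, list, dict]


class GzipCodecConfiguration(TypedDict):
    # Local stand-in; the real TypedDict lives in zarr_metadata.
    level: int


NameT = TypeVar("NameT", bound=str)
ConfigT = TypeVar("ConfigT")


@dataclass(frozen=True)
class BytesToBytesCodec(Generic[NameT, ConfigT]):
    name: NameT
    config: ConfigT


GzipCodec = BytesToBytesCodec[Literal["gzip"], GzipCodecConfiguration]
FallbackCodec = BytesToBytesCodec[str, JSONValue]  # open-world member: name is plain str
AnyCodec = Union[GzipCodec, FallbackCodec]


def gzip_level(codec: AnyCodec) -> Optional[int]:
    """Hypothetical helper: return the gzip level, or None for other codecs."""
    if codec.name == "gzip":
        # str includes "gzip", so FallbackCodec cannot be ruled out here and
        # codec.config is still GzipCodecConfiguration | JSONValue statically.
        config = codec.config
        if isinstance(config, dict) and isinstance(config.get("level"), int):
            return config["level"]
    return None


print(gzip_level(BytesToBytesCodec("gzip", {"level": 5})))
print(gzip_level(BytesToBytesCodec("my_plugin", {"threads": 2})))
```

Swapping the fallback's config to `Never` or dropping the fallback member altogether removes the runtime check, at the cost described in the other two rows.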

The subclass approach escapes the trilemma

Concrete subclasses + nominal (isinstance) narrowing avoids the str-overlap
problem entirely:

class BytesToBytesCodec:                       # base / open-world fallback
    @property
    def config(self) -> Mapping[str, object] | None: ...

class Gzip(BytesToBytesCodec):
    @property
    def config(self) -> GzipCodecConfiguration: ...
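
A runnable sketch of how `isinstance` narrowing would look with this shape follows. The constructors, the local `GzipCodecConfiguration` TypedDict, and the `describe` helper are assumptions added for illustration; only the base class and `Gzip` subclass layout with their `config` return types comes from the snippet above.

```python
from collections.abc import Mapping
from typing import Optional, TypedDict


class GzipCodecConfiguration(TypedDict):
    # Local stand-in; the real TypedDict lives in zarr_metadata.
    level: int


class BytesToBytesCodec:
    """Base class, also used directly for unknown/plugin codecs."""

    def __init__(self, name: Optional[str], config: Optional[Mapping[str, object]]) -> None:
        self._name = name
        self._config = config

    @property
    def name(self) -> Optional[str]:
        return self._name

    @property
    def config(self) -> Optional[Mapping[str, object]]:
        return self._config


class Gzip(BytesToBytesCodec):
    def __init__(self, level: int) -> None:
        super().__init__("gzip", {"level": level})
        self._level = level

    @property
    def config(self) -> GzipCodecConfiguration:
        # Narrower return type than the base class: an exact TypedDict.
        return {"level": self._level}


def describe(codec: BytesToBytesCodec) -> str:
    """Hypothetical helper showing both branches."""
    if isinstance(codec, Gzip):
        # Nominal narrowing: config is GzipCodecConfiguration here, no cast.
        return f"gzip level {codec.config['level']}"
    # Plugin or otherwise unknown codec: config is still a readable mapping.
    return f"{codec.name}: {dict(codec.config or {})}"


print(describe(Gzip(5)))
print(describe(BytesToBytesCodec("my_plugin", {"threads": 2})))
```

The check is on the class rather than on a string, so the base class can stay as the open-world fallback without overlapping the known codecs.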

@kylebarron kylebarron marked this pull request as ready for review June 23, 2026 16:05
@kylebarron kylebarron changed the title feat: Add codec accessors from (Async)Array feat: Add codec accessors from (Async)Array; remove codec subclasses Jun 23, 2026
@kylebarron kylebarron merged commit 30dfa29 into main Jun 23, 2026
6 checks passed
@kylebarron kylebarron deleted the kyle/codecs-overhaul branch June 23, 2026 16:05