SentinelArrays.jl uses undef constructors to initialize the array to "missing" values. I propose to do the following instead:
- Use missing constructors to initialize arrays to "missing" (sentinel) values.
- Use undef constructors to skip initialization (or, more precisely, to do whatever the undef constructor does for the underlying array).
Background:
undef constructors of Base arrays have been (ab)used to initialize arrays with missing values. This relies on undocumented behavior: the undef constructor zero-initializes the memory of union arrays to avoid invalid element types. Due to implementation details, arrays of Union{Missing,...} then usually end up with all elements initialized to missing.
Relying on undocumented behavior is not great, so an attempt was made to document it: JuliaLang/julia#31091. It turned out that this use of undef works almost always, but can fail with unions of Missing and other singleton types.
The issue was solved by adding missing constructors for Base arrays: JuliaLang/julia#25054.
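For illustration, here is a minimal sketch of the two Base constructors discussed above. It uses only Base, and the variable names are arbitrary. The undef results are deliberately not inspected, since their contents are not documented.

```julia
# Explicit initialization: every element is guaranteed to be `missing`.
a = Vector{Union{Missing,Float64}}(missing, 3)
@assert all(ismissing, a)

# `undef` only promises an array of the right type and size.
# It usually happens to contain `missing` here, but that is not documented.
b = Vector{Union{Missing,Float64}}(undef, 3)
@assert length(b) == 3

# With a union of two singleton types, the elements produced by `undef`
# may not be `missing` at all, so the contents must not be relied on.
c = Vector{Union{Missing,Nothing}}(undef, 3)
@assert length(c) == 3
```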
Arguments:
- Using undef to initialize to a particular value makes no sense from a semantics point of view. It doesn't feel like a good API when "define to xxx" is done by calling "leave undefined"!
  Please note that the problem is not in the behavior of the Base constructors: a "leave undefined" constructor can return anything, so why not all-missing values? But users shouldn't rely on it (at least not when there is a better way).
- Using undef for this purpose in SentinelArrays will encourage people to do the same with Base arrays, where it can introduce subtle bugs (since it works almost always, but not always).
- It leaves performance on the table. In Base, zero-initialization is necessary to guarantee valid union types. This constraint doesn't apply to SentinelArrays! And the whole point of SentinelArrays is to give better performance for particular use cases, so I think SentinelArrays should define an undef constructor that does what it says on the tin: leave values uninitialized.
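To make the proposed split concrete, here is a rough sketch using a toy type. ToySentinelVector, its fields, and its three-argument constructors are hypothetical and written only for this illustration; they are not SentinelArrays' actual implementation or API.

```julia
# Hypothetical toy wrapper: stores plain `T` values and reports the
# chosen sentinel value as `missing` when read.
struct ToySentinelVector{T} <: AbstractVector{Union{T,Missing}}
    data::Vector{T}
    sentinel::T
end

# `undef`: skip initialization and defer to the underlying array.
ToySentinelVector{T}(::UndefInitializer, n::Integer, sentinel::T) where {T} =
    ToySentinelVector{T}(Vector{T}(undef, n), sentinel)

# `missing`: explicitly fill the storage with the sentinel.
ToySentinelVector{T}(::Missing, n::Integer, sentinel::T) where {T} =
    ToySentinelVector{T}(fill(sentinel, n), sentinel)

Base.size(v::ToySentinelVector) = size(v.data)

function Base.getindex(v::ToySentinelVector, i::Int)
    x = v.data[i]
    return x === v.sentinel ? missing : x
end

# Guaranteed all-missing, at the cost of one fill pass.
filled = ToySentinelVector{Int}(missing, 3, typemin(Int))
@assert all(ismissing, filled)

# No fill pass; the contents are whatever the underlying array holds.
raw = ToySentinelVector{Int}(undef, 3, typemin(Int))
@assert length(raw) == 3
```

In this sketch the missing constructor pays for an explicit fill, while the undef constructor does no more work than Vector{T}(undef, n) itself.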
Relation to Base:
A downside of this proposal is that SentinelVector{Float64}(undef, 3) would no longer behave the same as Array{Union{Missing,Float64}}(undef, 3). But the latter is undefined behavior. Is agreeing on this undefined behavior more important than having an API that makes sense and offers the best performance?
Another way to look at it:
- SentinelArrays should promote patterns that also work with Base arrays, so it should not promote using undef to get missing.
- Base does the semantically correct thing: undef for uninitialized (where possible), and missing for initialized-to-missing. In particular, Base does not document the values returned by undef constructors. I think SentinelArrays should do the same.