julia> summary(posts)
"42710197×3 DataFrame"
julia> typeof.(eachcol(posts))
3-element Vector{DataType}:
SentinelArrays.ChainedVector{Union{Missing, Int64}, SentinelArrays.SentinelVector{Int64, Int64, Missing, Vector{Int64}}}
SentinelArrays.ChainedVector{Union{Missing, Int64}, SentinelArrays.SentinelVector{Int64, Int64, Missing, Vector{Int64}}}
SentinelArrays.ChainedVector{Union{Missing, Int64}, SentinelArrays.SentinelVector{Int64, Int64, Missing, Vector{Int64}}}
julia> @time dropmissing(posts);
0.819397 seconds (137 allocations: 1.822 GiB)
julia> @time dropmissing(copy(posts));
0.560146 seconds (130 allocations: 2.657 GiB)
As you can see, it is faster to copy the data frame (which changes the sentinel vectors to plain Vector columns) and then call dropmissing than to call dropmissing directly.
In some practical cases SentinelVector is much slower than Vector. The timings above are for the data tested in https://bkamins.github.io/julialang/2022/12/23/duckdb.html.
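A minimal sketch of the workaround described above, assuming DataFrames.jl is installed. The helper name plain_dropmissing is hypothetical, and it uses mapcols(collect, df) instead of copy to make the conversion to plain Vector columns explicit. Whether this is faster depends on the column types, so it is worth checking with @time on your own data.

```julia
using DataFrames

# Hypothetical helper: materialize every column as a plain Vector first,
# then drop the rows that contain missing values.
function plain_dropmissing(df::AbstractDataFrame)
    plain = mapcols(collect, df)  # collect returns a plain Vector for each column
    return dropmissing(plain)
end

# Small illustrative data frame (not the posts data from the timings above).
df = DataFrame(a = [1, missing, 3], b = [4, 5, missing])
result = plain_dropmissing(df)
```

The input data frame is left untouched, because mapcols returns a new data frame.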