Change unique() to return values in the same ordering as levels for PDAs by nalimilan · Pull Request #237 · JuliaStats/DataArrays.jl

nalimilan · 2017-02-19T21:32:24Z

While the generic unique() method says it preserves the order of appearance,
the ordering of levels is more likely to be useful. In particular, it will
allow StatsModels to use unique() to get levels present in the data in the
user-defined order, with the first level as reference by default.

The new code (inspired by CategoricalArrays) is also more efficient in the
common case where all values are encountered well before the end of the array,
by doing a periodic short-circuiting check.
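
To illustrate the short-circuiting idea, here is a minimal sketch using plain Base vectors rather than the actual DataArrays internals. The function name `unique_in_level_order`, the `refs`/`pool` representation (with 0 standing for a missing value, which is simply skipped here), and the batch size of 1000 are all assumptions made for the example, not what the PR necessarily does.

```julia
# Hypothetical stand-in for a pooled array: `refs[i]` indexes into `pool`,
# and 0 marks a missing value. This is not the DataArrays implementation.
function unique_in_level_order(refs::Vector{Int}, pool::Vector{T}) where T
    nlevels = length(pool)
    seen = falses(nlevels)   # seen[k] becomes true once level k has appeared
    nseen = 0
    batch = 0
    for r in refs
        if r != 0 && !seen[r]
            seen[r] = true
            nseen += 1
        end
        # Test for early exit only once per 1000 elements so the check stays cheap.
        batch += 1
        if batch == 1000
            nseen == nlevels && break
            batch = 0
        end
    end
    return pool[seen]        # logical indexing keeps the order of `pool`
end
```

For instance, `unique_in_level_order([3, 1, 3, 0, 1], ["low", "mid", "high"])` gives `["low", "high"]`, whereas order of appearance would give `["high", "low"]`.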

See JuliaStats/StatsModels.jl#13 (comment). The current behavior was chosen after discussion at #92, but it looks like the issues were mainly about DataArray, not PooledDataArray. It's slightly annoying to deviate from the standard behavior of unique, but I can't think of cases where the order of appearance would be more useful than the ordering of levels, which should be carefully chosen (otherwise using a PDA doesn't make much sense).

The other solution is to provide a separate function for this, but that sounds like overkill.

nalimilan · 2017-02-22T10:17:03Z

Another approach, which I find possibly better, would be to keep the current behavior but define unique(x::AbstractArray, sort::Bool=false). With sort=true, the order of levels would be preserved for PooledDataArray; for any other AbstractArray, we would simply call sort!(unique(x)). That would exactly suit our needs (see JuliaStats/StatsModels.jl#14 (comment)), and could be a useful behavior in general. The default method could even be defined in Base.
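
A rough sketch of what that could look like, under stated assumptions: the function is named `unique_sorted` here (hypothetical) so that the example does not add methods to `Base.unique`, and `Pooled` is a toy stand-in for PooledDataArray with hypothetical `refs` and `pool` fields and no missing values.

```julia
# Toy pooled vector: `refs[i]` indexes into `pool`. Not the DataArrays type.
struct Pooled{T} <: AbstractVector{T}
    refs::Vector{Int}
    pool::Vector{T}
end
Base.size(p::Pooled) = size(p.refs)
Base.getindex(p::Pooled, i::Int) = p.pool[p.refs[i]]

# Generic fallback: order of appearance by default, sorted values on request.
unique_sorted(x::AbstractArray, sort::Bool=false) = sort ? sort!(unique(x)) : unique(x)

# Pooled specialization: sort=true returns the levels present, in pool order.
function unique_sorted(x::Pooled, sort::Bool=false)
    sort || return unique(x)
    return x.pool[sort!(unique(x.refs))]
end
```

With `p = Pooled([3, 1, 3], ["low", "mid", "high"])`, `unique_sorted(p)` gives `["high", "low"]` and `unique_sorted(p, true)` gives `["low", "high"]`, while a plain `sort!` of the strings would give `["high", "low"]` and lose the level order.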

ararslan · 2017-02-22T18:06:21Z

> The default method could even be defined in Base.

If we go that route I think it needs to be in Base, otherwise we're committing the grave sin of type piracy (i.e. overloading Base methods with Base types in an external package).

nalimilan · 2017-02-24T16:59:15Z

Yet another approach is not to change anything, and to obtain the ordered set of unique values via intersect(levels(x), unique(x)). The advantage is that it's simple and does not require a new API. The drawback is that unique does not return the most useful result by default: functions which call unique generically and return results based on it will use a meaningless order even though a useful order is available.
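
As a small illustration with plain Base vectors (the `levels_in_order` vector is a hypothetical stand-in for what levels(x) would return), `intersect` on arrays keeps the order of its first argument, so the result follows the level order rather than the order of appearance:

```julia
levels_in_order = ["low", "mid", "high"]   # hypothetical user-defined level order
data = ["high", "low", "high", "low"]

by_appearance = unique(data)                          # ["high", "low"]
present = intersect(levels_in_order, by_appearance)   # ["low", "high"]
```

The unused level "mid" is dropped, and the remaining values come out in level order.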

nalimilan · 2017-03-10T10:42:06Z

I think we should merge this and discuss the larger issue of what unique means for AbstractArray later, as we need it to get StatsModels to work with DataArrays.

ararslan · 2017-03-10T19:36:09Z

I'm hesitant to deviate from the behavior in Base when overloading a Base function, even if a more useful behavior is possible. intersect(levels(x), unique(x)) seems okay to me, if a bit verbose.

nalimilan · 2017-03-10T22:22:02Z

Yeah, but that requires defining levels somewhere. Not so urgent I guess if we go with JuliaData/DataFrames.jl#1170 for now.

This was referenced Feb 19, 2017

Get tests pass again, fix handling of string columns JuliaStats/StatsModels.jl#13

Merged

Avoid hard-coding for concrete implementations JuliaStats/StatsModels.jl#14

Closed

nalimilan mentioned this pull request Feb 25, 2017

Improve docstrings for unique() JuliaLang/julia#20800

Merged

nalimilan wants to merge 1 commit into master from nl/unique
