While the generic unique() method says it preserves the order of appearance, the ordering of levels is more likely to be useful. In particular, it will allow StatsModels to use unique() to get levels present in the data in the user-defined order, with the first level as reference by default. The new code is based on CategoricalArrays.
Another approach which I find possibly better would be to keep the current behavior, but define
If we go that route I think it needs to be in Base, otherwise we're committing the grave sin of type piracy (i.e. overloading Base methods with Base types in an external package).

Yet another approach is not to change anything, and obtain the ordered set of unique values via

I think we should merge this and discuss the larger issue of what

I'm hesitant to deviate from the behavior in Base when overloading a Base function, even if a more useful behavior is possible.

Yeah, but that requires defining
While the generic unique() method says it preserves the order of appearance,
the ordering of levels is more likely to be useful. In particular, it will
allow StatsModels to use unique() to get levels present in the data in the
user-defined order, with the first level as reference by default.
The new code (inspired by CategoricalArrays) is also more efficient in the
common case where all values are encountered well before the end of the array,
by doing a periodic short-circuiting check.
See JuliaStats/StatsModels.jl#13 (comment). The current behavior was chosen after discussion at #92, but it looks like the issues were mainly about
DataArray, not PooledDataArray. It's slightly annoying to deviate from the standard behavior of unique, but I can't think of cases where the order of appearance would be more useful than the ordering of levels, which should be carefully chosen (else using a PDA doesn't make much sense). The other solution is to provide a separate function for this, but that sounds overkill.
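
For readers who want to see the idea concretely, here is a rough sketch of "unique in level order with a periodic short-circuit check". It is written in Python purely for illustration and is not the code from this PR or from CategoricalArrays. The function name `unique_in_level_order`, the `check_every` parameter, and the representation (0-based integer codes indexing into a list of levels, with no missing values) are all assumptions made for the example.

```python
def unique_in_level_order(codes, levels, check_every=1024):
    """Return the levels that occur in `codes`, in the order of `levels`.

    Assumptions (hypothetical, for illustration only):
    - `codes` holds 0-based integer indices into `levels`
    - there are no missing values
    """
    seen = [False] * len(levels)
    n_seen = 0
    for i, code in enumerate(codes, start=1):
        if not seen[code]:
            seen[code] = True
            n_seen += 1
        # Periodic check: once every level has been seen, the rest of
        # the array cannot change the result, so stop scanning.
        if i % check_every == 0 and n_seen == len(levels):
            break
    # Emit in level order, not in order of appearance.
    return [level for level, was_seen in zip(levels, seen) if was_seen]


levels = ["low", "medium", "high"]
codes = [2, 0, 2, 2, 0]  # "high", "low", "high", "high", "low"
print(unique_in_level_order(codes, levels))  # ['low', 'high']
```

An order-of-appearance result for the same data would be `['high', 'low']`. Checking only every `check_every` elements keeps the per-element cost low while still allowing an early exit when all levels show up near the start of the array.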