Fix handling of string columns in model matrix#1147
Fix handling of string columns in model matrix#1147
Conversation
Add compatibility with pre-contrasts ModelFrame constructor
…ise for speed improvement (#1070)
Completely remove support for DataArrays.
This depends on PRs moving these into NullableArrays.jl. Also use isequal() instead of ==, as the latter is in Base and unlikely to change its semantics.
groupby() did not follow the order of levels, and wasn't robust to reordering levels. Add tests for corner cases.
Use the fallbacks for now, should be added back after JuliaData/CategoricalArrays.jl#12 is fixed.
Not sure what I meant by this. If it was really serious, we'll discover it sooner or later.
This is a much more general issue (JuliaStats/NullableArrays.jl#85) which can be tackled later.
For now, preserve the current semantics: conversion to NullableArray does not happen via insert!().
Again a broader issue which doesn't particularly affect DataFrames. Cf. JuliaStats/NullableArrays.jl#143
Better handle that separately.
Shorter written that way for now. Filed as JuliaStats/NullableArrays.jl#144.
This depends on a CategoricalArrays change by which levels are sorted when creating the array.
There's no inconsistency here: when the input is a Matrix, there's no point in returning a NullableArray. Anyway, these are test methods.
We don't have to handle this right now.
Keep this in DataFrames for now, renaming it to the more explicit sharepools(). Also relax signatures to accept non-Nullable categorical arrays.
These were not exercized by the tests, and the use case for them isn't obvious. (They were formerly methods of DataArrays.PooledDataArray().)
For NullableArrays, even current git master is not enough at this time.
Tests pass, but the Nullable{Any} results could be annoying for users.
New type merging NominalArray and OrdinalArray in 0.0.5.
These shouldn't live in DataFrames.
Misc minor enhancements
Also return a NullableCategoricalArray from sharepools() since the code currently doesn't check that no null values are present. anyway this function is internal and the change imposes no overhead.
* Better display of Nullables * Don't write trailing space in Latex output Also fix missing newline in show test
* limit attribute of IOContext is used for html generation * fixup
I apparently missed these occurrences when removing these functions.
The instruction to rename columns if they don't match was part of the docs previously (http://dataframesjl.readthedocs.io/en/latest/joins_and_indexing.html). I adapted the syntax to avoid using {}.
All non-real columns are now considered as categorical, since conversion
to a float column will likely fail. Types which should be converted to float
will have to define is_categorical(::T) = false. Before this, Vector{String}
and NullableVector{String} columns triggered an error.
Replace ContrastsMatrix(c::ContrastsMatrix, x::CategoricalArray) with
ContrastsMatrix(c::ContrastsMatrix, levels::AbstractVector), offering
equivalent functionality: it's more consistent with the ContrastsMatrix()
constructor, and more general since only the levels are actually needed.
|
|
||
| function _levels{T<:Nullable}(x::AbstractArray{T}) | ||
| levs = [get(l) for l in unique(x) if !isnull(l)] | ||
| try; sort!(levs); end |
There was a problem hiding this comment.
Why is the try necessary here?
There was a problem hiding this comment.
Because not all types support isless. And method_exists is actually not enough since it doesn't work with Array{Any}, and isless might even fail when it's defined (e.g. with unordered CategoricalValue).
There was a problem hiding this comment.
Ah, okay, that makes sense. Thanks for the explanation.
kleinschmidt
left a comment
There was a problem hiding this comment.
This looks fine to me (I'm assuming the test failures on nightly are unrelated).
|
|
||
| const DEFAULT_CONTRASTS = DummyCoding | ||
|
|
||
| _levels(x::Union{CategoricalArray, NullableCategoricalArray}) = levels(x) |
There was a problem hiding this comment.
I vaguely remember that at some point we'd talked about using unique instead of levels, because the levels of a sub-vector might include values that aren't there (#1095 is the only reference I can find to this but I'm pretty sure it was discussed elsewhere). Then again that might only be a problem using a view of a categorical vector, and we hadn't actually incorporated that optimization?
There was a problem hiding this comment.
Yes. One issue is that unique(::NullableCategoricalArray) has to return Nullable elements to be consistent with the generic AbstractArray method. But of course we can unwrap these in our internal method.
|
Closing in favor of JuliaStats/StatsModels.jl#13. |
All non-real columns are now considered as categorical, since conversion
to a float column will likely fail. Types which should be converted to float
will have to define is_categorical(::T) = false. Before this, Vector{String}
and NullableVector{String} columns triggered an error.
Replace ContrastsMatrix(c::ContrastsMatrix, x::CategoricalArray) with
ContrastsMatrix(c::ContrastsMatrix, levels::AbstractVector), offering
equivalent functionality: it's more consistent with the ContrastsMatrix()
constructor, and more general since only the levels are actually needed.
Fixes #1085.
The
_levelscustom function should probably be included in CategoricalArrays. I'd like to hear your opinions about it, though, especially since it treatsNullablein a special fashion (which is consistent with how it works withNullableCategoricalArray). On the one hand, that's really useful in many circumstances; on the other, it breaks theAbstractArrayinterface by not always following its element type. The element type ofCategoricalArrayisCategoricalValue{T}anyway, so maybe that's not a concern. That's one of the fundamental design questions with categorical arrays.Of course this will also have to be applied to StatsModels.