Skip to content

Fix handling of string columns in model matrix#1147

Closed
nalimilan wants to merge 81 commits intomasterfrom
nl/modelmat
Closed

Fix handling of string columns in model matrix#1147
nalimilan wants to merge 81 commits intomasterfrom
nl/modelmat

Conversation

@nalimilan
Copy link
Member

All non-real columns are now considered as categorical, since conversion
to a float column will likely fail. Types which should be converted to float
will have to define is_categorical(::T) = false. Before this, Vector{String}
and NullableVector{String} columns triggered an error.

Replace ContrastsMatrix(c::ContrastsMatrix, x::CategoricalArray) with
ContrastsMatrix(c::ContrastsMatrix, levels::AbstractVector), offering
equivalent functionality: it's more consistent with the ContrastsMatrix()
constructor, and more general since only the levels are actually needed.

Fixes #1085.


The _levels custom function should probably be included in CategoricalArrays. I'd like to hear your opinions about it, though, especially since it treats Nullable in a special fashion (which is consistent with how it works with NullableCategoricalArray). On the one hand, that's really useful in many circumstances; on the other, it breaks the AbstractArray interface by not always following its element type. The element type of CategoricalArray is CategoricalValue{T} anyway, so maybe that's not a concern. That's one of the fundamental design questions with categorical arrays.

Of course this will also have to be applied to StatsModels.

Gord Stephen and others added 30 commits September 14, 2016 10:13
Add compatibility with pre-contrasts ModelFrame constructor
Completely remove support for DataArrays.
This depends on PRs moving these into NullableArrays.jl.
Also use isequal() instead of ==, as the latter is in Base and
unlikely to change its semantics.
groupby() did not follow the order of levels, and wasn't robust to reordering
levels. Add tests for corner cases.
Use the fallbacks for now, should be added back after
JuliaData/CategoricalArrays.jl#12 is fixed.
Not sure what I meant by this. If it was really serious, we'll discover
it sooner or later.
This is a much more general issue (JuliaStats/NullableArrays.jl#85) which
can be tackled later.
For now, preserve the current semantics: conversion to NullableArray
does not happen via insert!().
Again a broader issue which doesn't particularly affect DataFrames. Cf.
JuliaStats/NullableArrays.jl#143
Better handle that separately.
Shorter written that way for now. Filed as JuliaStats/NullableArrays.jl#144.
This depends on a CategoricalArrays change by which levels are sorted when
creating the array.
There's no inconsistency here: when the input is a Matrix, there's no
point in returning a NullableArray. Anyway, these are test methods.
We don't have to handle this right now.
Keep this in DataFrames for now, renaming it to the more explicit
sharepools(). Also relax signatures to accept non-Nullable categorical arrays.
These were not exercized by the tests, and the use case for them isn't obvious.
(They were formerly methods of DataArrays.PooledDataArray().)
For NullableArrays, even current git master is not enough at this time.
Tests pass, but the Nullable{Any} results could be annoying for users.
New type merging NominalArray and OrdinalArray in 0.0.5.
These shouldn't live in DataFrames.
ararslan and others added 20 commits October 1, 2016 10:52
* handle -1 and add tests

* replace `import Base.==` with `Base.:(==)`

* typo and error test
Also return a NullableCategoricalArray from sharepools() since
the code currently doesn't check that no null values are present.
anyway this function is internal and the change imposes no overhead.
* Better display of Nullables

* Don't write trailing space in Latex output

Also fix missing newline in show test
* limit attribute of IOContext is used for html generation

* fixup
I apparently missed these occurrences when removing these functions.
The instruction to rename columns if they don't match was part of the docs previously (http://dataframesjl.readthedocs.io/en/latest/joins_and_indexing.html). I adapted the syntax to avoid using {}.
All non-real columns are now considered as categorical, since conversion
to a float column will likely fail. Types which should be converted to float
will have to define is_categorical(::T) = false. Before this, Vector{String}
and NullableVector{String} columns triggered an error.

Replace ContrastsMatrix(c::ContrastsMatrix, x::CategoricalArray) with
ContrastsMatrix(c::ContrastsMatrix, levels::AbstractVector), offering
equivalent functionality: it's more consistent with the ContrastsMatrix()
constructor, and more general since only the levels are actually needed.

function _levels{T<:Nullable}(x::AbstractArray{T})
levs = [get(l) for l in unique(x) if !isnull(l)]
try; sort!(levs); end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the try necessary here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because not all types support isless. And method_exists is actually not enough since it doesn't work with Array{Any}, and isless might even fail when it's defined (e.g. with unordered CategoricalValue).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay, that makes sense. Thanks for the explanation.

Copy link
Contributor

@kleinschmidt kleinschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine to me (I'm assuming the test failures on nightly are unrelated).


const DEFAULT_CONTRASTS = DummyCoding

_levels(x::Union{CategoricalArray, NullableCategoricalArray}) = levels(x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vaguely remember that at some point we'd talked about using unique instead of levels, because the levels of a sub-vector might include values that aren't there (#1095 is the only reference I can find to this but I'm pretty sure it was discussed elsewhere). Then again that might only be a problem using a view of a categorical vector, and we hadn't actually incorporated that optimization?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. One issue is that unique(::NullableCategoricalArray) has to return Nullable elements to be consistent with the generic AbstractArray method. But of course we can unwrap these in our internal method.

@nalimilan
Copy link
Member Author

Closing in favor of JuliaStats/StatsModels.jl#13.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.