Fix handling of string columns in model matrix by nalimilan · Pull Request #1147 · JuliaData/DataFrames.jl

nalimilan · 2017-01-15T17:27:20Z

All non-real columns are now considered as categorical, since conversion
to a float column will likely fail. Types which should be converted to float
will have to define is_categorical(::T) = false. Before this, Vector{String}
and NullableVector{String} columns triggered an error.
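
As a rough illustration of the rule described above, the sketch below uses modern Julia syntax and a stand-in `is_categorical` defined on column vectors. The `Money` type and the exact method signatures are hypothetical and are not taken from this PR; they only show how a non-real type could opt out of categorical treatment.

```julia
# Hypothetical stand-in for the trait discussed above; signatures are assumed.
is_categorical(::AbstractVector{<:Real}) = false   # real-valued columns stay numeric
is_categorical(::AbstractVector) = true            # everything else is treated as categorical

# A made-up numeric-like type that is not a subtype of Real.
struct Money
    cents::Int
end

# Opting out: columns of Money should be converted to float, not dummy-coded.
is_categorical(::AbstractVector{Money}) = false
```

With these definitions a `Vector{String}` column falls through to the generic method and is treated as categorical instead of raising an error.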

Replace ContrastsMatrix(c::ContrastsMatrix, x::CategoricalArray) with
ContrastsMatrix(c::ContrastsMatrix, levels::AbstractVector), offering
equivalent functionality: it's more consistent with the ContrastsMatrix()
constructor, and more general since only the levels are actually needed.
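
To see why the levels alone are enough, here is a minimal sketch of treatment (dummy) coding built from a plain vector of levels. The function `dummy_contrasts` is hypothetical and is not the `ContrastsMatrix` constructor; it only illustrates that no reference to the underlying array is needed.

```julia
# Hypothetical helper, not part of DataFrames or StatsModels.
# Builds a dummy-coding matrix with one row per level and one column
# per non-base level; the base level's row is all zeros.
function dummy_contrasts(levels::AbstractVector; base = first(levels))
    kept = filter(l -> l != base, levels)          # levels that get a column
    mat = [Float64(l == k) for l in levels, k in kept]
    return mat, kept
end

mat, kept = dummy_contrasts(["a", "b", "c"])
```

Here `mat` has three rows (one per level) and two columns (for `"b"` and `"c"`), which is all a model matrix needs in order to expand a categorical column.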

Fixes #1085.

The _levels custom function should probably be included in CategoricalArrays. I'd like to hear your opinions about it, though, especially since it treats Nullable in a special fashion (which is consistent with how it works with NullableCategoricalArray). On the one hand, that's really useful in many circumstances; on the other, it breaks the AbstractArray interface by not always following its element type. The element type of CategoricalArray is CategoricalValue{T} anyway, so maybe that's not a concern. That's one of the fundamental design questions with categorical arrays.

Of course this will also have to be applied to StatsModels.

ararslan · 2017-01-15T21:26:11Z

src/statsmodels/formula.jl

+
+function _levels{T<:Nullable}(x::AbstractArray{T})
+    levs = [get(l) for l in unique(x) if !isnull(l)]
+    try; sort!(levs); end


Why is the try necessary here?

Because not all types support isless. And method_exists is actually not enough since it doesn't work with Array{Any}, and isless might even fail when it's defined (e.g. with unordered CategoricalValue).

Ah, okay, that makes sense. Thanks for the explanation.
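
The fallback discussed in this thread can be sketched in current Julia as follows. This is an assumption-laden adaptation, not the code from the diff: `Nullable` is no longer part of Base in Julia 1.x, so `missing` and `skipmissing` stand in for it, the name `sorted_levels` is hypothetical, and a non-mutating `sort` replaces `try; sort!(levs); end` so that a failed sort leaves the levels in order of appearance.

```julia
# Hypothetical adaptation of the _levels idea using `missing` instead of Nullable.
function sorted_levels(x::AbstractVector)
    levs = unique(skipmissing(x))   # distinct non-missing values, in order of appearance
    return try
        sort(levs)                  # works when isless is defined for all pairs
    catch
        levs                        # isless missing or failing: keep original order
    end
end
```

For example, `sorted_levels(Any[1, "a", 2])` cannot compare an `Int` with a `String`, so the sort throws and the unsorted levels are returned.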

kleinschmidt

This looks fine to me (I'm assuming the test failures on nightly are unrelated).

kleinschmidt · 2017-01-16T16:36:30Z

src/statsmodels/formula.jl


 const DEFAULT_CONTRASTS = DummyCoding

+_levels(x::Union{CategoricalArray, NullableCategoricalArray}) = levels(x)


I vaguely remember that at some point we'd talked about using unique instead of levels, because the levels of a sub-vector might include values that aren't there (#1095 is the only reference I can find to this but I'm pretty sure it was discussed elsewhere). Then again that might only be a problem using a view of a categorical vector, and we hadn't actually incorporated that optimization?

Yes. One issue is that unique(::NullableCategoricalArray) has to return Nullable elements to be consistent with the generic AbstractArray method. But of course we can unwrap these in our internal method.

nalimilan · 2017-02-18T18:11:05Z

Closing in favor of JuliaStats/StatsModels.jl#13.

Gord Stephen and others added 30 commits September 14, 2016 10:13

RFC: Add compatibility with pre-contrasts ModelFrame constructor (#1042)

968e980

Add compatibility with pre-contrasts ModelFrame constructor

Reindex transposed sparse contrast matrix into modelmat_cols column-wise for speed improvement (#1070)

d4ad15b

Fill existing arrays with scalars (#1057)

2931693

Port to NullableArrays and CategoricalArrays

e4662fd

Completely remove support for DataArrays.

Get rid of custom Nullable operators and functions

9de5c08

This depends on PRs moving these into NullableArrays.jl. Also use isequal() instead of ==, as the latter is in Base and unlikely to change its semantics.

Fix grouping

6ac7549

groupby() did not follow the order of levels, and wasn't robust to reordering levels. Add tests for corner cases.

Remove custom isnull() definition

653fc1d

Remove optimized sorting methods

a17f264

Use the fallbacks for now, should be added back after JuliaData/CategoricalArrays.jl#12 is fixed.

Remove inscrutable FIXME

9a71705

Not sure what I meant by this. If it was really serious, we'll discover it sooner or later.

More Julia 0.4 compatibility

9f1e5e6

Remove another FIXME

a75a4a4

This is a much more general issue (JuliaStats/NullableArrays.jl#85) which can be tackled later.

Remove FIXME about insert!()

1b44ffe

For now, preserve the current semantics: conversion to NullableArray does not happen via insert!().

Remove FIXME about +(::NullableArray{Int}, ::Int)

0ff4dc8

Again a broader issue which doesn't particularly affect DataFrames. Cf. JuliaStats/NullableArrays.jl#143

Remove FIXME about test/indexing.jl

110deac

Better handle that separately.

Remove FIXME about map()

0ff6373

Shorter written that way for now. Filed as JuliaStats/NullableArrays.jl#144.

Fix sortperm() tests

431d135

This depends on a CategoricalArrays change by which levels are sorted when creating the array.

Remove FIXME about predict()

cc87f46

There's no inconsistency here: when the input is a Matrix, there's no point in returning a NullableArray. Anyway, these are test methods.

Remove FIXME about head() and tail()

ec9b706

We don't have to handle this right now.

Remove FIXME about PooledDataVecs

e9a1c8c

Keep this in DataFrames for now, renaming it to the more explicit sharepools(). Also relax signatures to accept non-Nullable categorical arrays.

Remove unused NominalArray methods

bf16c5f

These were not exercised by the tests, and the use case for them isn't obvious. (They were formerly methods of DataArrays.PooledDataArray().)

Mention Julia bug in FIXME

3bb1323

Bump dependencies on NullableArrays and CategoricalArrays

95789bf

For NullableArrays, even current git master is not enough at this time.

Require NullableArrays 0.0.8

5c33249

Bump CategoricalArrays requirement

e1df391

Fix tests on Julia 0.4

63c1d96

Tests pass, but the Nullable{Any} results could be annoying for users.

Use CategoricalArray instead of NominalArray

2ec131e

New type merging NominalArray and OrdinalArray in 0.0.5.

Remove DataArrays benchmarks

ad75f67

These shouldn't live in DataFrames.

Update docs

492351c

Fix failures introduced when rebasing

d48d7f8

Update docs to remove references to DataArrays and fully qualify a few references

f8dc8c6

ararslan and others added 20 commits October 1, 2016 10:52

Merge pull request #1076 from alyst/misc_fixes

23ec690

Misc minor enhancements

Add output to LaTeX (useful for IJulia notebook export to PDF) (#845)

1658c35

handle A ~ B - 1 and add tests (#1086)

400da84

* handle -1 and add tests
* replace `import Base.==` with `Base.:(==)`
* typo and error test

Fix join when mixing NullableArray and Array{Nullable} (#1089)

e1c5014

Also return a NullableCategoricalArray from sharepools() since the code currently doesn't check that no null values are present. Anyway, this function is internal and the change imposes no overhead.

Better display of Nullables (#1084)

725a226

* Better display of Nullables
* Don't write trailing space in Latex output

Also fix missing newline in show test

Update StatsBase.df to dof (#1097)

e4ab277

limit attribute of IOContext is used for html generation (#1099)

203b50f

* limit attribute of IOContext is used for html generation
* fixup

Fix docstring example (#1107)

b6de65a

Closes #1103

Loosen constructor for a DataFrame (#1108)

10c4423

Use the tagged version of Documenter (#1109)

9cae226

fix typo in Nullable holding 1 example (#1112)

e7ea227

Small docs fixes (#1077)

3706704

I apparently missed these occurrences when removing these functions.

Enable doctests (#1110)

e418174

Add documentation for Query.jl (#1105)

4c47ed3

Juno display (#1125)

cd6d749

Add querying chapter to table of content (#1129)

c81d57e

Update joins doc to include rename! (#1131)

dd2772b

The instruction to rename columns if they don't match was part of the docs previously (http://dataframesjl.readthedocs.io/en/latest/joins_and_indexing.html). I adapted the syntax to avoid using {}.

Avoid closing IO unless responsible for opening (#1138)

e205b70

Fix no-op transpose warning (#1141)

bf0bda8

nalimilan requested review from ararslan and kleinschmidt January 15, 2017 17:27

ararslan reviewed Jan 15, 2017

ararslan requested a review from johnmyleswhite January 15, 2017 21:52

kleinschmidt reviewed Jan 25, 2017

ararslan force-pushed the master branch from 1013694 to e5347cf February 11, 2017 18:48

nalimilan closed this Feb 18, 2017

nalimilan deleted the nl/modelmat branch February 18, 2017 18:11

nalimilan mentioned this pull request Feb 18, 2017

Get tests pass again, fix handling of string columns JuliaStats/StatsModels.jl#13

Merged

Fix handling of string columns in model matrix #1147
nalimilan wants to merge 81 commits into master from nl/modelmat

nalimilan commented Jan 15, 2017

ararslan Jan 15, 2017

nalimilan Jan 15, 2017

ararslan Jan 15, 2017

kleinschmidt left a comment

kleinschmidt Jan 16, 2017

nalimilan Jan 25, 2017

nalimilan commented Feb 18, 2017

20 participants

