On the latest master, string columns trigger an error when used in a formula to build a ModelMatrix. We should probably treat them as categorical variables, either by converting them to a CategoricalArray or (even better) by building contrasts for them on the fly, without making a copy.
I am less sure what to do with other kinds of non-numeric columns, though. Should we raise an error, or treat them as categorical by default?
@kleinschmidt Comments?
Reproducer:
julia> using DataFrames
julia> df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
4×2 DataFrames.DataFrame
│ Row │ A │ B   │
├─────┼───┼─────┤
│ 1 │ 1 │ "M" │
│ 2 │ 2 │ "F" │
│ 3 │ 3 │ "F" │
│ 4 │ 4 │ "M" │
julia> ModelMatrix(ModelFrame(A~B, df))
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Float64
This may have arisen from a call to the constructor Float64(...),
since type constructors fall back to convert methods.
in copy!(::Base.LinearFast, ::Array{Float64,2}, ::Base.LinearFast, ::Array{String,2}) at ./abstractarray.jl:575
in modelmat_cols(::Type{Array{Float64,2}}, ::NullableArrays.NullableArray{String,1}) at /home/milan/.julia/DataFrames/src/statsmodels/formula.jl:349
in #modelmat_cols#122(::Bool, ::Function, ::Type{Array{Float64,2}}, ::Symbol, ::DataFrames.ModelFrame) at /home/milan/.julia/DataFrames/src/statsmodels/formula.jl:342
in (::DataFrames.#kw##modelmat_cols)(::Array{Any,1}, ::DataFrames.#modelmat_cols, ::Type{Array{Float64,2}}, ::Symbol, ::DataFrames.ModelFrame) at ./<missing>:0
in DataFrames.ModelMatrix{Array{Float64,2}}(::DataFrames.ModelFrame) at /home/milan/.julia/DataFrames/src/statsmodels/formula.jl:478
in DataFrames.ModelMatrix{T<:AbstractArray{T<:AbstractFloat,2}}(::DataFrames.ModelFrame) at /home/milan/.julia/DataFrames/src/statsmodels/formula.jl:501
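To make the "contrasts on the fly" idea concrete, here is a minimal sketch of treatment (dummy) coding applied directly to a string vector, using only Base Julia. The helper name `dummy_columns` is hypothetical and not part of DataFrames. Picking the first sorted level as the reference is also an assumption. A real implementation would hook into the existing contrasts machinery and handle null values.

```julia
# Hypothetical helper (not a DataFrames function): build treatment-coded
# indicator columns for a vector of strings without converting the column.
function dummy_columns(v::AbstractVector)
    levels = sort(unique(v))          # sorted distinct values
    # One column per non-reference level; the first level is the reference.
    X = zeros(Float64, length(v), length(levels) - 1)
    for (j, lev) in enumerate(levels[2:end])
        for i in 1:length(v)
            X[i, j] = v[i] == lev ? 1.0 : 0.0
        end
    end
    return levels, X
end

B = ["M", "F", "F", "M"]
levels, X = dummy_columns(B)

# Prepend an intercept column, as a model matrix for A ~ B would have.
mm = hcat(ones(length(B)), X)
```

For the column `B` from the reproducer, `levels` is `["F", "M"]` and `X` has a single indicator column for `"M"`, so `"F"` acts as the reference level.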