According to Wright & König (2019), the optimal way for regression and binary classification random forests to deal with categorical predictors with many levels is to order the categories by their mean outcomes and treat them as ordinal variables. This is done at each split in each tree.
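To make the trick concrete, here is a minimal base-R sketch of what it looks like at a single node in the regression case. The helper `best_ordered_split` and the toy data are my own illustration (hypothetical names, arbitrary seed), not code from either package.

```r
# Sketch of the ordering trick at one node (regression case).
# best_ordered_split() is a hypothetical helper, not part of any package.
best_ordered_split <- function(x, y) {
  x <- droplevels(x)
  # Order the categories by their mean outcome within this node
  lev_means <- tapply(y, x, mean)
  lev_order <- names(lev_means)[order(lev_means)]
  k <- length(lev_order)
  if (k < 2) return(NULL)
  best <- list(sse = Inf, left = NULL)
  # Only the k - 1 cut points along this ordering are scanned,
  # instead of all 2^(k - 1) - 1 two-way partitions of the categories
  for (i in seq_len(k - 1)) {
    in_left <- x %in% lev_order[seq_len(i)]
    sse <- sum((y[in_left] - mean(y[in_left]))^2) +
      sum((y[!in_left] - mean(y[!in_left]))^2)
    if (sse < best$sse) best <- list(sse = sse, left = lev_order[seq_len(i)])
  }
  best
}

set.seed(1)  # arbitrary seed, for reproducibility only
x <- factor(sample(letters[1:6], 200, replace = TRUE))
# Levels a, c, e have mean 0; levels b, d, f have mean 3
y <- ifelse(x %in% c("b", "d", "f"), 3, 0) + rnorm(200, sd = 0.5)
res <- best_ordered_split(x, y)
sort(res$left)  # categories sent to the left child
```

In a forest, this ordering would be recomputed from the observations that reach each node, which is what distinguishes it from ordering the variable once up front.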
The randomForest package implements this trick, but only allows categorical variables with at most 53 categories. This restriction exists for the sake of multi-way classification, which cannot rely on the ordering trick, but is not necessary for regression or binary classification.
I think the ranger package also does the ordering trick, and it does not have the 53-level cap.
Note that Wright, who co-authored the paper cited at the beginning, is also a co-author of the ranger paper. However (strangely, IMO), Wright & König (2019) never actually say that ranger implements the ordering trick. Instead, they write:
"We also implemented the Order (once) method in the runtime-optimized ranger package (Wright & Ziegler, 2017)"
Here, "Order (once)" means ordering the variable once, before running the random forest, rather than re-ordering it at each split. This would be a problem for us, since it would allow $Y_i$ to influence the fit of the models for $\hat{t}_i$ and $\hat{c}_i$.
However, I think that they implemented "Order (once)" for the sake of their paper, whereas ranger automatically implements "Order (split)" (which is what we would like it to do). I have simulation evidence:
> library(randomForest)
> library(ranger)
> c1 <- sample(1:50,1000,replace=TRUE)
> c2 <- sample(c("a","b"),1000,replace=TRUE)
> y <- ifelse(c2=="a",c1-24.5,24.5-c1)+rnorm(1000)
Unless I'm mistaken, ordering the variable c1 will only be helpful if you do it after already splitting on c2.
(Actually, I tried ordering c1 by hand before running ranger, and it wasn't completely useless, but it definitely performed worse than if I had left it categorical.)
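A quick way to see why the order of c1 only shows up after conditioning on c2, as a base-R sketch: re-simulate the same design (arbitrary seed, my own variable names) and compare how well the level index tracks the mean outcome over all rows versus within each c2 group.

```r
set.seed(42)  # arbitrary seed, for reproducibility only
n  <- 1000
c1 <- sample(1:50, n, replace = TRUE)
c2 <- sample(c("a", "b"), n, replace = TRUE)
y  <- ifelse(c2 == "a", c1 - 24.5, 24.5 - c1) + rnorm(n)
f1 <- factor(c1, levels = 1:50)
lev <- 1:50

# Mean outcome per level of c1: over all rows, then within each c2 group
m_all <- as.vector(tapply(y, f1, mean))
m_a   <- as.vector(tapply(y[c2 == "a"], f1[c2 == "a"], mean))
m_b   <- as.vector(tapply(y[c2 == "b"], f1[c2 == "b"], mean))

# Rank correlation between the level index and the mean outcome
# (use = "complete.obs" guards against a level missing from a group)
r_all <- cor(lev, m_all, method = "spearman", use = "complete.obs")
r_a   <- cor(lev, m_a,   method = "spearman", use = "complete.obs")
r_b   <- cor(lev, m_b,   method = "spearman", use = "complete.obs")
round(c(all = r_all, a = r_a, b = r_b), 2)
```

Within each group the means are close to monotone in c1 (increasing for "a", decreasing for "b"), while the ordering by overall means is driven mostly by how the two groups happen to be balanced within each level.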
Anyway, check it:
> dat <- data.frame(y,c1=factor(c1),c2=factor(c2))
> randomForest(y~c1+c2,data=dat)
Call:
randomForest(formula = y ~ c1 + c2, data = dat)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 1
Mean of squared residuals: 67.23644
% Var explained: 68
> ranger(y~c1+c2,data=dat)
Ranger result
Call:
ranger(y ~ c1 + c2, data = dat)
Type: Regression
Number of trees: 500
Sample size: 1000
Number of independent variables: 2
Mtry: 1
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 50.6178
R squared (OOB): 0.7593061
>
> ### with pre-ordered c1:
> ord <- rank(tapply(dat$y,dat$c1,mean))
> dat$c1O <- ordered(ord[dat$c1])
> randomForest(y~c1O+c2,data=dat)
Call:
randomForest(formula = y ~ c1O + c2, data = dat)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 1
Mean of squared residuals: 145.0066
% Var explained: 30.98
> ranger(y~c1O+c2,data=dat)
Ranger result
Call:
ranger(y ~ c1O + c2, data = dat)
Type: Regression
Number of trees: 500
Sample size: 1000
Number of independent variables: 2
Mtry: 1
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 144.3639
R squared (OOB): 0.313532
So ranger with c1 categorical beats out ranger with pre-ordered c1 and both versions of randomForest. To me, that implies that ranger is ordering c1 at each split.
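One further check that might be worth running, as a sketch rather than a definitive test: as far as I can tell from the ranger documentation, `ranger()` takes a `respect.unordered.factors` argument (with values including "ignore" and "order") that controls how unordered factors are handled. In the simulation above the levels of c1 happen to be stored in numeric order, so comparing against a copy of c1 with shuffled level order (`c1s`, a name I made up) would show whether the good fit depends on that stored order. The seed is arbitrary, and the ranger calls are skipped if the package is not installed.

```r
set.seed(7)  # arbitrary seed, for reproducibility only
n  <- 1000
c1 <- sample(1:50, n, replace = TRUE)
c2 <- sample(c("a", "b"), n, replace = TRUE)
y  <- ifelse(c2 == "a", c1 - 24.5, 24.5 - c1) + rnorm(n)
dat <- data.frame(y, c1 = factor(c1), c2 = factor(c2))

# Same categories for every row, but with the stored level order shuffled
dat$c1s <- factor(as.character(dat$c1), levels = sample(levels(dat$c1)))

if (requireNamespace("ranger", quietly = TRUE)) {
  fits <- list(
    default   = ranger::ranger(y ~ c1 + c2,  data = dat),
    shuffled  = ranger::ranger(y ~ c1s + c2, data = dat),
    order_arg = ranger::ranger(y ~ c1 + c2,  data = dat,
                               respect.unordered.factors = "order")
  )
  # OOB R squared for each fit
  print(sapply(fits, function(f) f$r.squared))
}
```

If the shuffled fit does about as well as the default one, that supports the per-split reordering reading. If it does much worse, the default fit may just be exploiting the numeric order of the levels.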
Oh, and ranger doesn't have the 53-category limit:
> c1 <- sample(1:100,2000,replace=TRUE)
> c2 <- sample(c("a","b"),2000,replace=TRUE)
> y <- ifelse(c2=="a",c1-50,50-c1)+rnorm(2000)
>
>
> dat <- data.frame(y,c1=factor(c1),c2=factor(c2))
> randomForest(y~c1+c2,data=dat)
Error in randomForest.default(m, y, ...) :
Can not handle categorical predictors with more than 53 categories.
>
> ranger(y~c1+c2,data=dat)
Ranger result
Call:
ranger(y ~ c1 + c2, data = dat)
Type: Regression
Number of trees: 500
Sample size: 2000
Number of independent variables: 2
Mtry: 1
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 204.2992
R squared (OOB): 0.7530359