Switch from randomForest to ranger? Reason: categorical covariates with >53 categories #12

@adamSales

According to Wright & König (2019), the optimal way for regression and binary-classification random forests to handle categorical predictors with many levels is to order the categories by their mean outcome and then treat them as ordinal variables. The ordering is redone at each split in each tree.
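To make the ordering trick concrete, here is a minimal base-R sketch of the split search at a single node in the regression case. The function name `best_ordered_split` and the toy data are made up for illustration; this is not code from randomForest or ranger. Categories are sorted by their mean outcome within the node, and only the splits between neighbouring categories in that order are scored by their within-group sum of squares.

```r
# Sketch: best binary split of a categorical predictor at one node,
# found by ordering the categories by mean outcome (regression case).
best_ordered_split <- function(x, y) {
  x <- droplevels(factor(x))
  lev <- names(sort(tapply(y, x, mean)))   # categories ordered by mean outcome
  best <- list(sse = Inf, left = NULL)
  for (k in seq_len(length(lev) - 1)) {
    in_left <- x %in% lev[seq_len(k)]
    # Within-group sum of squares for this candidate split
    sse <- sum((y[in_left] - mean(y[in_left]))^2) +
      sum((y[!in_left] - mean(y[!in_left]))^2)
    if (sse < best$sse) best <- list(sse = sse, left = lev[seq_len(k)])
  }
  best
}

# Toy data (illustrative only): six categories with different mean outcomes
set.seed(1)
x <- sample(letters[1:6], 300, replace = TRUE)
y <- c(a = 3, b = -1, c = 0, d = 5, e = -2, f = 1)[x] + rnorm(300, sd = 0.1)
split <- best_ordered_split(x, y)
split$left   # categories sent to the left child
```

With k categories this scores k - 1 candidate splits instead of all 2^(k-1) - 1 two-way partitions, which is why the number of levels need not be capped for regression.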

The randomForest package implements this trick, but only allows categorical variables with at most 53 categories. This restriction exists for the sake of multi-way classification, which cannot rely on the ordering trick, but it is not necessary for regression or binary classification.

However, I think the ranger package implements the ordering trick without the 53-level cap.

Note that Wright, who co-authored the paper cited at the beginning, is also a co-author of the ranger paper. However (strangely, IMO), Wright & König (2019) never actually say that ranger implements the ordering trick. Instead, they write:

"We also implemented the Order (once) method in the runtime-optimized ranger package (Wright & Ziegler, 2017)"

Here, "Order (once)" means ordering the variable once, before running the random forest, rather than re-ordering it at each split. This would be a problem for us, since it would allow $Y_i$ to influence the fit of the models for $\hat{t}_i$ and $\hat{c}_i$.

My guess, though, is that they implemented "Order (once)" for the sake of their paper, whereas ranger automatically implements "Order (split)" (which is what we want it to do). I have simulation evidence:

> c1 <- sample(1:50,1000,replace=TRUE)
> c2 <- sample(c("a","b"),1000,replace=TRUE)
> y <- ifelse(c2=="a",c1-24.5,24.5-c1)+rnorm(1000)

Unless I'm mistaken, ordering the variable c1 will only be helpful if you do it after already splitting on c2, because the relationship between c1 and y has opposite signs in the two c2 groups.
(Actually, I tried ordering c1 by hand before running ranger, and it wasn't completely useless, but it definitely performed worse than leaving c1 categorical.)
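To illustrate why a single up-front ordering of c1 carries little signal here, this base-R sketch regenerates data with the same design (the seed is my own choice, so the draws differ from the ones in this issue) and compares the per-category means of y marginally and within one c2 group.

```r
# Same data-generating design as above; the seed is an arbitrary choice.
set.seed(42)
c1 <- sample(1:50, 1000, replace = TRUE)
c2 <- sample(c("a", "b"), 1000, replace = TRUE)
y  <- ifelse(c2 == "a", c1 - 24.5, 24.5 - c1) + rnorm(1000)

# Marginal means of y by c1: the two c2 groups largely cancel each other,
# so what remains mostly reflects chance imbalance between the groups.
marg <- tapply(y, c1, mean)

# Means of y by c1 within c2 == "a": close to the line c1 - 24.5.
cond_a <- tapply(y[c2 == "a"], c1[c2 == "a"], mean)

# Correlation of each set of category means with the category value
cor_marg <- cor(as.numeric(names(marg)), marg)
cor_cond <- cor(as.numeric(names(cond_a)), cond_a)
c(marginal = cor_marg, within_a = cor_cond)
```

Ordering by the marginal means therefore sorts the categories mostly by noise, while ordering within a c2 group recovers the true order almost exactly.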

Anyway, check it:

> dat <- data.frame(y,c1=factor(c1),c2=factor(c2))

> randomForest(y~c1+c2,data=dat)

Call:
 randomForest(formula = y ~ c1 + c2, data = dat) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 1

          Mean of squared residuals: 67.23644
                    % Var explained: 68
> ranger(y~c1+c2,data=dat)
Ranger result

Call:
 ranger(y ~ c1 + c2, data = dat) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      1000 
Number of independent variables:  2 
Mtry:                             1 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       50.6178 
R squared (OOB):                  0.7593061 
> ### with pre-ordered c1:
> ord <- rank(tapply(dat$y,dat$c1,mean))
> dat$c1O <- ordered(ord[dat$c1])
> randomForest(y~c1O+c2,data=dat)

Call:
 randomForest(formula = y ~ c1O + c2, data = dat) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 1

          Mean of squared residuals: 145.0066
                    % Var explained: 30.98
> ranger(y~c1O+c2,data=dat)
Ranger result

Call:
 ranger(y ~ c1O + c2, data = dat) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      1000 
Number of independent variables:  2 
Mtry:                             1 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       144.3639 
R squared (OOB):                  0.313532 

So ranger with c1 categorical (OOB MSE 50.6) beats ranger with pre-ordered c1 (144.4) and both versions of randomForest (67.2 with c1 categorical, 145.0 with c1 pre-ordered). To me, that suggests that ranger is ordering c1 at each split.
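One way to probe this inference is to set ranger's `respect.unordered.factors` argument explicitly rather than rely on the default. As far as I can tell from the ranger documentation, its values are "ignore" (factor levels are used in their stored order), "order", and "partition", and "ignore" is the default for the variance split rule. Since `factor(c1)` stores its levels in numeric order, which is the ideal order within each c2 group here, the comparison above may not distinguish per-split ordering from simply using the stored order. The sketch below, with a seed and column names of my own choosing, refits the forest with the levels of c1 stored in a random order.

```r
library(ranger)

set.seed(1)
n  <- 1000
c1 <- sample(1:50, n, replace = TRUE)
c2 <- sample(c("a", "b"), n, replace = TRUE)
y  <- ifelse(c2 == "a", c1 - 24.5, 24.5 - c1) + rnorm(n)

# Same categories twice: once with levels in numeric order, once with the
# levels stored in a random order that carries no information about y.
dat <- data.frame(
  y,
  c1_natural  = factor(c1),
  c1_shuffled = factor(c1, levels = sample(1:50)),
  c2 = factor(c2)
)

# OOB mean squared error for a given formula and factor-handling mode
oob_mse <- function(formula, mode) {
  ranger(formula, data = dat, respect.unordered.factors = mode,
         seed = 1)$prediction.error
}

results <- c(
  natural_ignore  = oob_mse(y ~ c1_natural + c2, "ignore"),
  shuffled_ignore = oob_mse(y ~ c1_shuffled + c2, "ignore"),
  shuffled_order  = oob_mse(y ~ c1_shuffled + c2, "order")
)
results
```

If ranger re-ordered c1 at each split by default, shuffling the stored level order should make little difference to the "ignore" fits; a large gap between the first two numbers would instead point to the stored order being used.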

Oh, and ranger doesn't have the 53-category limit:

> c1 <- sample(1:100,2000,replace=TRUE)
> c2 <- sample(c("a","b"),2000,replace=TRUE)
> y <- ifelse(c2=="a",c1-50,50-c1)+rnorm(2000)
> dat <- data.frame(y,c1=factor(c1),c2=factor(c2))
> randomForest(y~c1+c2,data=dat)
Error in randomForest.default(m, y, ...) : 
  Can not handle categorical predictors with more than 53 categories.
> ranger(y~c1+c2,data=dat)
Ranger result

Call:
 ranger(y ~ c1 + c2, data = dat) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      2000 
Number of independent variables:  2 
Mtry:                             1 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       204.2992 
R squared (OOB):                  0.7530359 
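For checking which covariates in a real data set would trip the randomForest limit, a small pre-check along these lines may help. `too_many_levels` is a hypothetical helper written for this issue, and the default of 53 is simply the number quoted in the error message above.

```r
# Hypothetical helper: names of the factor columns in `data` that have more
# than `max_levels` levels (53 is the limit from the randomForest error).
too_many_levels <- function(data, max_levels = 53) {
  is_fac <- vapply(data, is.factor, logical(1))
  n_lev  <- vapply(data[is_fac], nlevels, integer(1))
  names(data)[is_fac][n_lev > max_levels]
}

# Toy data frame: c1 has 100 levels, c2 has 2
dat <- data.frame(
  y  = rnorm(200),
  c1 = factor(rep(1:100, 2)),
  c2 = factor(rep(c("a", "b"), 100))
)
too_many_levels(dat)
```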
