Switch from randomForest to ranger? Reason: categorical covariates with >53 categories

According to [Wright & König (2019)](https://pmc.ncbi.nlm.nih.gov/articles/PMC6368971/#sec1), the optimal way for regression and binary-classification random forests to handle categorical predictors with many levels is to order the categories by their mean outcome and then treat the predictor as ordinal. The ordering is recomputed at each split in each tree.
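
For concreteness, here is a minimal base-R sketch of the ordering trick at a single node. It is an illustration only, not the code either package runs; the helper name `best_ordered_split` and the toy data are made up for this example.

```r
# Hypothetical helper: best variance-reducing binary split on a factor at one
# node, found by ordering the levels by mean outcome and scanning the cut points.
best_ordered_split <- function(x, y) {
  x <- droplevels(x)
  means <- tapply(y, x, mean)              # mean outcome per level
  lev <- names(means)[order(means)]        # levels ordered by mean outcome
  best <- list(sse = Inf, left = character(0))
  for (k in seq_len(length(lev) - 1)) {
    in_left <- x %in% lev[seq_len(k)]      # first k levels go to the left child
    sse <- sum((y[in_left] - mean(y[in_left]))^2) +
      sum((y[!in_left] - mean(y[!in_left]))^2)
    if (sse < best$sse) best <- list(sse = sse, left = lev[seq_len(k)])
  }
  best
}

set.seed(1)  # arbitrary seed for the toy data
x <- factor(sample(letters[1:6], 200, replace = TRUE))
y <- ifelse(x %in% c("b", "e"), 5, 0) + rnorm(200)
best_ordered_split(x, y)$left              # levels sent to the left child
```

With k levels, this scans k - 1 candidate cuts instead of all 2^(k-1) - 1 possible two-group partitions.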

The `randomForest` package implements this trick, but only allows categorical variables with at most 53 categories. This restriction exists for the sake of multi-way classification, which cannot rely on the ordering trick, but is not necessary for regression or binary classification.

However, I _think_ the `ranger` package also does the ordering trick, without the 53-level cap.

Note that Wright, who co-authored the paper cited at the beginning, is also a co-author of the `ranger` paper. Strangely (IMO), [Wright & König (2019)](https://pmc.ncbi.nlm.nih.gov/articles/PMC6368971/#sec1) never actually say that `ranger` implements the ordering trick at each split. Instead, they write:

> We also implemented the Order (once) method in the runtime-optimized ranger package ([Wright & Ziegler, 2017](https://pmc.ncbi.nlm.nih.gov/articles/PMC6368971/#ref-38))

Here, "Order (once)" means ordering the variable once, before growing the forest, rather than re-ordering it at each split. That would be a problem for us, since it would allow $Y_i$ to influence the fit of the models for $\hat{t}_i$ and $\hat{c}_i$.

However, I think that they implemented "Order (once)" for the sake of their paper, whereas `ranger` automatically implements "Order (split)" (which is what we would like it to do). I have simulation evidence:

```
> c1 <- sample(1:50,1000,replace=TRUE)
> c2 <- sample(c("a","b"),1000,replace=TRUE)
> y <- ifelse(c2=="a",c1-24.5,24.5-c1)+rnorm(1000)
```
Unless I'm mistaken, ordering `c1` by its mean outcome will only be helpful if you do it _after_ already splitting on `c2`, because the relationship between `c1` and `y` flips sign between the two levels of `c2`.
(Actually, I tried ordering `c1` by hand before running `ranger`; it wasn't completely useless, but it definitely performed worse than leaving `c1` categorical.)
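
That intuition can be checked in base R. If it is right, the per-level means of `y` computed over the whole sample should carry little of the true ordering of `c1`, while the means within one level of `c2` should recover it almost exactly. This is a sketch on a fresh draw of the same simulation; the seed and the names `marg`, `cond`, `cor_marginal` and `cor_conditional` are my own, not from the runs in this issue.

```r
set.seed(42)  # arbitrary seed, not the one behind the results in this issue
c1 <- sample(1:50, 1000, replace = TRUE)
c2 <- sample(c("a", "b"), 1000, replace = TRUE)
y <- ifelse(c2 == "a", c1 - 24.5, 24.5 - c1) + rnorm(1000)

# Marginal ordering: per-level means of y over the whole sample.
marg <- tapply(y, c1, mean)
cor_marginal <- cor(as.numeric(names(marg)), as.numeric(marg),
                    method = "spearman")

# Conditional ordering: per-level means of y within c2 == "a" only.
cond <- tapply(y[c2 == "a"], c1[c2 == "a"], mean)
cor_conditional <- cor(as.numeric(names(cond)), as.numeric(cond),
                       method = "spearman")

# Rank correlation between each ordering and the true numeric order of c1.
c(marginal = cor_marginal, conditional = cor_conditional)
```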

Anyway, check it:

```
> dat <- data.frame(y,c1=factor(c1),c2=factor(c2))

> randomForest(y~c1+c2,data=dat)

Call:
 randomForest(formula = y ~ c1 + c2, data = dat) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 1

          Mean of squared residuals: 67.23644
                    % Var explained: 68
> ranger(y~c1+c2,data=dat)
Ranger result

Call:
 ranger(y ~ c1 + c2, data = dat) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      1000 
Number of independent variables:  2 
Mtry:                             1 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       50.6178 
R squared (OOB):                  0.7593061 
> 
> ### with pre-ordered c1:
> ord <- rank(tapply(dat$y,dat$c1,mean))
> dat$c1O <- ordered(ord[dat$c1])
> randomForest(y~c1O+c2,data=dat)

Call:
 randomForest(formula = y ~ c1O + c2, data = dat) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 1

          Mean of squared residuals: 145.0066
                    % Var explained: 30.98
> ranger(y~c1O+c2,data=dat)
Ranger result

Call:
 ranger(y ~ c1O + c2, data = dat) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      1000 
Number of independent variables:  2 
Mtry:                             1 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       144.3639 
R squared (OOB):                  0.313532 

```

So `ranger` with `c1` categorical beats out `ranger` with pre-ordered `c1` and both versions of `randomForest`. To me, that implies that `ranger` is ordering `c1` at each split.
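
One caveat may be worth checking before relying on this. As far as I can tell from the `ranger` documentation, factor handling is controlled by the `respect.unordered.factors` argument (`"ignore"`, `"order"` or `"partition"`), and for the variance splitrule the default is `"ignore"`, which uses the factor levels in their existing order. Because `factor(c1)` keeps the levels in numeric order here, the result above could partly reflect that. The sketch below sets the argument explicitly after scrambling the level order. The seed and the name `c1s` are my own, and `"partition"` is left out because, to my understanding, it enumerates level partitions and would not be practical with 50 levels.

```r
library(ranger)

set.seed(42)  # arbitrary seed, not the one behind the results in this issue
c1 <- sample(1:50, 1000, replace = TRUE)
c2 <- sample(c("a", "b"), 1000, replace = TRUE)
y <- ifelse(c2 == "a", c1 - 24.5, 24.5 - c1) + rnorm(1000)

# Scramble the level order so it no longer matches the numeric order of c1.
c1s <- factor(c1, levels = sample(1:50))
dat <- data.frame(y, c1s, c2 = factor(c2))

# Fit once per setting and collect the OOB prediction error (MSE).
modes <- c("ignore", "order")
oob_mse <- sapply(modes, function(m) {
  ranger(y ~ c1s + c2, data = dat,
         respect.unordered.factors = m, seed = 1)$prediction.error
})
oob_mse
```

If `"ignore"` with scrambled levels does much worse than the unscrambled fit above, that would suggest the default relies on the given level order rather than re-ordering at each split.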

Oh, and `ranger` doesn't have the 53-category limit:

```
>  c1 <- sample(1:100,2000,replace=TRUE)
> c2 <- sample(c("a","b"),2000,replace=TRUE)
> y <- ifelse(c2=="a",c1-50,50-c1)+rnorm(2000)
> 
> 
>  dat <- data.frame(y,c1=factor(c1),c2=factor(c2))
> randomForest(y~c1+c2,data=dat)
Error in randomForest.default(m, y, ...) : 
  Can not handle categorical predictors with more than 53 categories.
> 
> ranger(y~c1+c2,data=dat)
Ranger result

Call:
 ranger(y ~ c1 + c2, data = dat) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      2000 
Number of independent variables:  2 
Mtry:                             1 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       204.2992 
R squared (OOB):                  0.7530359 
```

