
Conversation

@s-marton

Add GRANDE method for TabArena benchmarking

Description

This PR introduces GRANDE, a novel method for learning axis-aligned decision tree ensembles with gradient descent, into the TabArena benchmark.

Changes included

  • Added GRANDE implementation in:
    • tabarena/tabarena/benchmark/models/ag/grande/
    • tabarena/tabarena/models/grande/
  • Updated the method registration in model_registry.py and models/utils.py to include GRANDE
  • Added test tst/benchmark/models/test_grande.py to verify correct integration and basic functionality

Evaluation

  • GRANDE was benchmarked on TabArena Lite, as well as on an extended set of datasets using folds 0, 1, and 2 (the reported results are based on this extended evaluation).
  • Results show that GRANDE achieves strong performance, particularly on binary classification and regression tasks. Performance on multiclass tasks is currently lower, which drags down the overall benchmark results. Ongoing work is focused on improving this.


[Figure 1: GRANDE results on TabArena folds 0, 1, and 2.]
[Figure 2a (tuning-impact-elo-horizontal_bin): GRANDE binary classification results on TabArena folds 0, 1, and 2.]
[Figure 2b (tuning-impact-elo-horizontal_reg): GRANDE regression results on TabArena folds 0, 1, and 2.]
[Figure 2c (tuning-impact-elo-horizontal_multi): GRANDE multi-class classification results on TabArena folds 0, 1, and 2.]

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@LennartPurucker
Collaborator

Heyho @s-marton, great work on the PR! This looks great, and the results are awesome!

I will try to get back to you ASAP and start a run on my end so that I can run on all folds/splits afterwards. I kindly ask for some patience: the winter break begins on Wednesday, so it might take me a bit longer to take a closer look.

@LennartPurucker
Collaborator

One initial thought: I see a fixed n_estimators in the search space. I assume this is the upper limit(?). Is this limit large enough? For other boosting methods, we often see upper limits on the scale of 10k.


Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@s-marton
Author

> Heyho @s-marton, great work on the PR! This looks great, and the results are awesome!
>
> I will try to get back to you ASAP and start a run on my end so that I can run on all folds/splits afterwards. I kindly ask for some patience: the winter break begins on Wednesday, so it might take me a bit longer to take a closer look.

Hey, no problem at all! Take your time, and enjoy the winter break. Looking forward to hearing from you whenever you get a chance.

> One initial thought: I see a fixed n_estimators in the search space. I assume this is the upper limit(?). Is this limit large enough? For other boosting methods, we often see upper limits on the scale of 10k.

Regarding your question on n_estimators: this is fixed intentionally. Unlike in GBDTs, all estimators in GRANDE are trained in parallel rather than sequentially, so this number always corresponds to the exact number of estimators in the ensemble. Increasing it further usually does not yield a notable performance gain and mainly adds computational and memory overhead.
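To make the contrast concrete, here is a minimal, purely illustrative sketch of the two kinds of search-space entries; the parameter names, ranges, and values below are hypothetical, not the actual TabArena definitions:

```python
# Purely illustrative -- names and values are hypothetical, not the
# actual TabArena search-space definitions.

# Sequential boosting (e.g., a GBDT): n_estimators is a budget that the
# tuner explores, so the upper limit is typically very large (~10k).
gbdt_search_space = {
    "n_estimators": (100, 10_000),  # tuned within this range
    "learning_rate": (1e-3, 0.3),
}

# GRANDE: all trees are optimized jointly with gradient descent, so
# n_estimators is the exact ensemble size and is fixed up front.
grande_search_space = {
    "n_estimators": 1024,  # fixed; the exact value here is illustrative
    "learning_rate_weights": (1e-4, 0.25),  # hypothetical tuned parameter
}
```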

@LennartPurucker
Collaborator

> Unlike in GBDTs, all estimators in GRANDE are trained in parallel

Ah, gotcha! I should read the paper as well! :)

@dholzmueller
Contributor

It seems that the multi-class plot above is in fact the regression plot and vice versa. So GRANDE's weakness is regression tasks, not multi-class tasks.

@s-marton
Author


> It seems that the multi-class plot above is in fact the regression plot and vice versa. So GRANDE's weakness is regression tasks, not multi-class tasks.

I just double-checked the subsets, and you're absolutely right: the issue is with regression. That also makes more sense conceptually and gives a clearer direction for improving the results. I'll look into it and hopefully push an update soon.

@LennartPurucker
Collaborator

Heyho, here are some results from my initial evaluation on TabArena-Lite with 50 configs (for now).

The results seem a bit worse, but that is reasonable given the reduced HPO and the fact that we now force early stopping after 1 hour. Yet there seems to be a problem with the default performance for regression. I am not quite sure what triggered this change and will have to investigate.

In general, from my work on the refactor, it seems like it would be great to restructure the PR so that we pip install GRANDE from the official repository and it can function as a standalone package. @s-marton, is this something you are working towards?

[Figure: results on all tasks]
[Figure (tuning-impact-elo-horizontal_bin): binary classification]
[Figure (tuning-impact-elo-horizontal_reg): regression]
[Figure (tuning-impact-elo-horizontal_multi): multi-class classification]
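On the standalone-package point, a minimal sketch of what pip-installed usage could look like. The package name, import path, constructor arguments, and method signatures below are assumptions for illustration, not the confirmed API of the GRANDE package:

```python
# Hypothetical usage sketch, assuming GRANDE is published as a standalone
# package with a scikit-learn-style interface; the import path and the
# constructor/method signatures are NOT confirmed.
#
#   pip install GRANDE   # assumed package name
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from GRANDE import GRANDE  # assumed import path

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hyperparameter names and values are illustrative placeholders.
model = GRANDE(params={"n_estimators": 1024}, args={"epochs": 100})
model.fit(X_train, y_train)
preds = model.predict(X_test)
```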

@s-marton
Author

Hi @LennartPurucker,
thanks for the update! In general, I think it is reasonable that the performance with 50 trials is a bit worse, considering that GRANDE benefits more from tuning and ensembling than some baselines, and that the 1-hour early stopping could also trigger for some larger datasets depending on the model configuration.

The default performance for regression, however, is surprising. I will take a look at this as well. The GRANDE repo is currently a bit behind, but I am planning to update it soon, so it should be possible to refactor the PR to use a pip install. I will take care of this shortly.

@LennartPurucker
Collaborator

Great to hear @s-marton! Let me know once I should take another look. After ICML, I should also have enough compute to run more configs!
