initial thoughts on h2o integration by topepo · Pull Request #20 · tidymodels/planning

topepo · 2021-09-19T19:54:13Z

@ledell, @juliasilge, @hfrick, and @DavisVaughan for a discussion this week.

DavisVaughan · 2021-09-21T15:35:07Z

+* I don't think that there is a way to convert an `H2OGrid` to a data frame. Using `as.data.frame(gbm_grid1@summary_table)` is close but everything is character.
+
+* I don't know how to get the holdout predictions either. 
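
On the first point, one possible workaround for the all-character columns is base R's `type.convert()`. The sketch below uses a small hand-made data frame as a stand-in for `as.data.frame(gbm_grid1@summary_table)`; its column names and values are invented for illustration and are not taken from a real grid.

```r
# Hypothetical stand-in for as.data.frame(gbm_grid1@summary_table):
# every column is character, as described above.
grid_chr <- data.frame(
  learn_rate = c("0.1", "0.01"),
  max_depth  = c("5", "3"),
  model_ids  = c("gbm_grid1_model_1", "gbm_grid1_model_2"),
  logloss    = c("0.5512", "0.6034"),
  stringsAsFactors = FALSE
)

# type.convert() re-parses each character column; as.is = TRUE keeps
# non-numeric columns (such as the model IDs) as character, not factor.
grid_typed <- type.convert(grid_chr, as.is = TRUE)

str(grid_typed)
```

Whether this is robust enough for every column that a real summary table contains is an open question; it only addresses the character-typing issue.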


It seems like once you extract the "model" objects from the grid, you can do a lot more. The grid is just a lightweight summary object?

```r
library(h2o)
h2o.init()

# Import a sample binary outcome dataset into H2O
data <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
y <- "response"
x <- setdiff(names(data), y)

# For binary classification, response should be a factor
data[, y] <- as.factor(data[, y])
test[, y] <- as.factor(test[, y])

# Split data into train & validation
ss <- h2o.splitFrame(data, seed = 1)
train <- ss[[1]]
valid <- ss[[2]]

# GBM hyperparameters
gbm_params1 <- list(
  learn_rate = c(0.01, 0.1),
  max_depth = c(3, 5, 9),
  sample_rate = c(0.8, 1.0),
  col_sample_rate = c(0.2, 0.5, 1.0)
)

# Train and validate a cartesian grid of GBMs
gbm_grid1 <- h2o.grid(
  "gbm",
  x = x, y = y,
  grid_id = "gbm_grid1",
  training_frame = train,
  validation_frame = valid,
  ntrees = 100,
  seed = 1,
  hyper_params = gbm_params1
)

model_ids <- gbm_grid1@model_ids

# get the model objects
models <- lapply(model_ids, h2o.getModel)

# make holdout predictions on the assessment data (very noisy)
predictions <- lapply(models, h2o.predict, newdata = valid)
#> <SUPER NOISY HERE>

predictions[[1]]
#>   predict         p0        p1
#> 1       1 0.13336856 0.8666314
#> 2       1 0.11695254 0.8830475
#> 3       1 0.06611336 0.9338866
#> 4       1 0.07556947 0.9244305
#> 5       1 0.09630038 0.9036996
#> 6       0 0.79153668 0.2084633
#>
#> [2489 rows x 3 columns]

# compute all the performance metrics on test data
performance <- h2o.performance(models[[1]], newdata = test)

h2o.auc(performance)
#> [1] 0.7813055

performance@metrics$AUC
#> [1] 0.7813055
```

Created on 2021-09-21 by the reprex package (v2.0.0.9000)

@DavisVaughan Yeah the grid is a summary table, and you can grab the models like you're doing above.

ledell · 2022-01-14T04:47:43Z

+
+There are some differences between tidymodels grid tuning and h2o:
+
+* `h2o.grid()` seems to work on one resample at a time.


Not sure if this is helpful to note, but we have a `parallelism` arg for `h2o.grid()` which allows for multiple models to be trained in parallel. You specify an integer for how many models to train at once, e.g. `parallelism = 5`. So in theory it can work with multiple resamples at a time.

Ok, great. We can link that to the tidymodels control options for parallelism.
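
A minimal sketch of how the `parallelism` argument might be passed through, assuming the h2o package is installed, `h2o.init()` has been called, and the frames and hyperparameter list come from the reprex earlier in the thread. The wrapper name `fit_grid_parallel`, its `n_parallel` argument, and the grid ID are hypothetical; only `h2o.grid()` and its `parallelism` argument come from h2o itself.

```r
# Hypothetical wrapper: forwards a user-chosen level of parallelism to
# h2o.grid(). Defining it does not require a running H2O cluster; calling it does.
fit_grid_parallel <- function(train, valid, x, y, hyper_params, n_parallel = 5) {
  h2o::h2o.grid(
    "gbm",
    x = x, y = y,
    grid_id = "gbm_grid_parallel",  # hypothetical grid ID
    training_frame = train,
    validation_frame = valid,
    ntrees = 100,
    seed = 1,
    hyper_params = hyper_params,
    parallelism = n_parallel        # number of models trained at once
  )
}

# Example call (not run here; needs an initialised H2O cluster):
# grid <- fit_grid_parallel(train, valid, x, y, gbm_params1, n_parallel = 5)
```

A tidymodels control option could, in principle, supply the value for `n_parallel`; how that mapping should look is not settled in this thread.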

ledell · 2022-01-14T05:04:59Z

+* The indices for the modeling and holdout data
+* The grid values
+* Seeds
+* Data details (the data frame and x/y names)


Is the goal here to apply the same (whole) grid to multiple resample train/valid pairs? Or do you need the more fine-grained control of assigning a specific resample train/valid set to a specific grid combo (e.g. learn_rate = 0.1, max_depth = 5, sample_rate = 0.8, col_sample_rate = 1.0)?

If the former, then we can already support that with the current API, I think. You'd just do something like:

```r
gbm_grid1 <- h2o.grid(
  "gbm",
  x = x, y = y,
  grid_id = "gbm_grid1",
  training_frame = data[train_idx_resample_1, ],
  validation_frame = data[valid_idx_resample_1, ],
  ntrees = 100,
  seed = 1,
  hyper_params = gbm_params1
)
```

Yes, applying it to everything would be the plan.
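
Applying the same grid to every resample could be sketched as a loop over train/validation index pairs. In the sketch below, `make_folds()` is a plain base R stand-in for the resamples that tidymodels would supply, and `fit_grid_per_resample()` and the per-resample grid IDs are hypothetical names; the `h2o.grid()` call mirrors the one above and assumes `h2o.init()` has been called and `data` is an H2OFrame.

```r
# Base R: build v-fold train/validation row indices.
make_folds <- function(n, v = 5, seed = 1) {
  set.seed(seed)
  fold_id <- sample(rep(seq_len(v), length.out = n))
  lapply(seq_len(v), function(k) {
    list(train = which(fold_id != k), valid = which(fold_id == k))
  })
}

# Fit the same (whole) grid once per resample. Defining this function does
# not need H2O; calling it needs an initialised cluster.
fit_grid_per_resample <- function(data, folds, x, y, hyper_params) {
  lapply(seq_along(folds), function(i) {
    h2o::h2o.grid(
      "gbm",
      x = x, y = y,
      grid_id = paste0("gbm_grid_resample_", i),  # hypothetical: one ID per resample
      training_frame = data[folds[[i]]$train, ],
      validation_frame = data[folds[[i]]$valid, ],
      ntrees = 100,
      seed = 1,
      hyper_params = hyper_params
    )
  })
}

# The index construction can be checked without H2O:
folds <- make_folds(n = 100, v = 5)
lengths(lapply(folds, `[[`, "valid"))

# Example call (not run here):
# grids <- fit_grid_per_resample(data, make_folds(nrow(data)), x, y, gbm_params1)
```

Using a distinct `grid_id` per resample is a design choice in this sketch so that results from different resamples stay in separate grids.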

ledell

@topepo @DavisVaughan Added a few more comments to the README (with some questions).

initial thoughts

d5997e9

DavisVaughan reviewed Sep 21, 2021

ledell reviewed Sep 24, 2021

Comment thread h2o/README.Rmd

ledell reviewed Jan 14, 2022

topepo wants to merge 1 commit into main from h2o

