Conversation
* I don't think that there is a way to convert an `H2OGrid` to a data frame. Using `as.data.frame(gbm_grid1@summary_table)` is close, but every column is character.
* I don't know how to get the holdout predictions either.
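One possible workaround for the all-character columns, sketched in base R only: `type.convert()` re-parses each column and picks a numeric or integer type where the strings allow it. The data frame below is a made-up stand-in for `as.data.frame(gbm_grid1@summary_table)`; its values and exact column layout are assumptions for illustration, not real grid output.

```r
# Mock stand-in for as.data.frame(gbm_grid1@summary_table).
# Values are illustrative only; every column is character, as described above.
grid_chr <- data.frame(
  learn_rate = c("0.1", "0.01"),
  max_depth  = c("5", "3"),
  model_ids  = c("gbm_grid1_model_7", "gbm_grid1_model_2"),
  logloss    = c("0.5512", "0.6034"),
  stringsAsFactors = FALSE
)

# Re-parse each column; as.is = TRUE keeps non-numeric columns as character
# instead of turning them into factors.
grid_typed <- type.convert(grid_chr, as.is = TRUE)

str(grid_typed)
```

Columns that hold non-numeric strings, such as the model IDs, stay character, so this would need checking against the real summary table before relying on it.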
It seems like once you extract the "model" objects from the grid, you can do a lot more. The grid is just a lightweight summary object?
```r
library(h2o)
h2o.init()

# Import a sample binary outcome dataset into H2O
data <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
y <- "response"
x <- setdiff(names(data), y)

# For binary classification, response should be a factor
data[, y] <- as.factor(data[, y])
test[, y] <- as.factor(test[, y])

# Split data into train & validation
ss <- h2o.splitFrame(data, seed = 1)
train <- ss[[1]]
valid <- ss[[2]]

# GBM hyperparameters
gbm_params1 <- list(learn_rate = c(0.01, 0.1),
                    max_depth = c(3, 5, 9),
                    sample_rate = c(0.8, 1.0),
                    col_sample_rate = c(0.2, 0.5, 1.0))

# Train and validate a cartesian grid of GBMs
gbm_grid1 <- h2o.grid("gbm", x = x, y = y,
                      grid_id = "gbm_grid1",
                      training_frame = train,
                      validation_frame = valid,
                      ntrees = 100,
                      seed = 1,
                      hyper_params = gbm_params1)

model_ids <- gbm_grid1@model_ids

# get the model objects
models <- lapply(model_ids, h2o.getModel)

# make holdout predictions on the assessment data (very noisy)
predictions <- lapply(models, h2o.predict, newdata = valid)
#> <SUPER NOISY HERE>

predictions[[1]]
#>   predict         p0        p1
#> 1       1 0.13336856 0.8666314
#> 2       1 0.11695254 0.8830475
#> 3       1 0.06611336 0.9338866
#> 4       1 0.07556947 0.9244305
#> 5       1 0.09630038 0.9036996
#> 6       0 0.79153668 0.2084633
#>
#> [2489 rows x 3 columns]

# compute all the performance metrics on test data
performance <- h2o.performance(models[[1]], newdata = test)
h2o.auc(performance)
#> [1] 0.7813055

performance@metrics$AUC
#> [1] 0.7813055
```

Created on 2021-09-21 by the reprex package (v2.0.0.9000)
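If the per-model predictions need to end up in one long data frame (roughly the shape tidymodels uses for collected predictions), something like the sketch below might work. It is base R only and assumes each element has already been pulled into a regular data frame, e.g. with `as.data.frame()` on each H2OFrame. `stack_predictions()` is a hypothetical helper, not part of h2o, and the second mock model's values are invented for illustration.

```r
# Hypothetical helper: tag each prediction data frame with its model ID and
# row position, then bind everything into one long data frame.
stack_predictions <- function(pred_list, model_ids) {
  stopifnot(length(pred_list) == length(model_ids))
  tagged <- Map(function(p, id) {
    data.frame(model_id = id, row = seq_len(nrow(p)), p,
               stringsAsFactors = FALSE)
  }, pred_list, model_ids)
  out <- do.call(rbind, tagged)
  rownames(out) <- NULL
  out
}

# Mock inputs standing in for lapply(predictions, as.data.frame).
# The first frame reuses two rows printed above; the second is made up.
pred_list <- list(
  data.frame(predict = c(1, 0), p0 = c(0.13336856, 0.79153668),
             p1 = c(0.8666314, 0.2084633)),
  data.frame(predict = c(1, 0), p0 = c(0.20, 0.65), p1 = c(0.80, 0.35))
)
ids <- c("gbm_grid1_model_1", "gbm_grid1_model_2")

stacked <- stack_predictions(pred_list, ids)
stacked
```

With the real objects, `ids` would presumably be `unlist(gbm_grid1@model_ids)`, and the `row` column could be mapped back to the original holdout indices.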
@DavisVaughan Yeah the grid is a summary table, and you can grab the models like you're doing above.
There are some differences between tidymodels grid tuning and h2o:

* `h2o.grid()` seems to work on one resample at a time.
Not sure if this is helpful to note, but we have a `parallelism` argument for `h2o.grid()` that allows multiple models to be trained in parallel. You specify an integer for how many models to train at once, e.g. `parallelism = 5`. So in theory it can work with multiple resamples at a time.
Ok, great. We can link that to the tidymodels control options for parallelism.
* The indices for the modeling and holdout data
* The grid values
* Seeds
* Data details (the data frame and x/y names)
Is the goal here to apply the same (whole) grid to multiple resample train/valid pairs? Or do you need the more fine-grained control of assigning a specific resample train/valid set to a specific grid combo (e.g. `learn_rate = 0.1, max_depth = 5, sample_rate = 0.8, col_sample_rate = 1.0`)?
If the former, then we can already support that with the current API, I think. You'd just do something like:
```r
gbm_grid1 <- h2o.grid("gbm", x = x, y = y,
                      grid_id = "gbm_grid1",
                      training_frame = data[train_idx_resample_1, ],
                      validation_frame = data[valid_idx_resample_1, ],
                      ntrees = 100, seed = 1,
                      hyper_params = gbm_params1)
```
Yes, applying it to everything would be the plan.
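A rough sketch of what "apply the whole grid to every resample" could look like as a loop, in base R only. Everything named here (`make_folds()`, `run_grid_over_resamples()`, the `fit_grid` callback, and `demo_fit()`) is hypothetical; the stand-in callback only records sizes so the example runs without an H2O cluster.

```r
# Hypothetical helper: build V-fold train/valid index pairs for n rows.
make_folds <- function(n, v = 5, seed = 1) {
  set.seed(seed)
  fold_id <- sample(rep(seq_len(v), length.out = n))
  lapply(seq_len(v), function(k) {
    list(train_idx = which(fold_id != k),
         valid_idx = which(fold_id == k))
  })
}

# Hypothetical driver: call the same grid-fitting callback once per resample,
# giving each resample its own grid ID.
run_grid_over_resamples <- function(folds, fit_grid) {
  lapply(seq_along(folds), function(i) {
    fit_grid(train_idx = folds[[i]]$train_idx,
             valid_idx = folds[[i]]$valid_idx,
             grid_id   = paste0("gbm_grid_resample_", i))
  })
}

# Stand-in callback so the sketch runs without H2O. A real version would wrap
# the h2o.grid() call and use data[train_idx, ] and data[valid_idx, ].
demo_fit <- function(train_idx, valid_idx, grid_id) {
  list(grid_id = grid_id,
       n_train = length(train_idx),
       n_valid = length(valid_idx))
}

folds   <- make_folds(n = 100, v = 5)
results <- run_grid_over_resamples(folds, demo_fit)
results[[1]]
```

Passing the fit as a callback keeps the resampling logic separate from H2O, and the per-resample `grid_id` is there on the assumption that each resample's models should be kept apart.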
ledell left a comment
@topepo @DavisVaughan Added a few more comments to the README (with some questions).
@ledell, @juliasilge, @hfrick, and @DavisVaughan for a discussion this week.