Optimizing MIDAS on very large/complex datasets

On very large datasets (~30,000 samples x 1,000,000 features) with complex relationships (e.g. cancer omics data), MIDAS can take a very long time to run (possibly days), even on a single GPU. I would still like to use the 'overimpute' feature for hyperparameter tuning, but this _very_ useful feature runs the algorithm multiple times to evaluate various settings, which makes it prohibitively slow at this scale.

Would hyperparameters tuned on a random subset of samples (columns) and/or features (rows) generalize to the full dataset? For instance, I could tune on a random subset of 500-1,000 samples and 5,000-10,000 features. The goal would be specifically to determine the optimal number of nodes and layers, the learning rate, and the number of training epochs. I would think batch size (which can speed up training) is a function of the dataset size, so it would not generalize.
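To make the proposal concrete, here is a minimal sketch of the kind of random downsampling I have in mind. It only draws the row (feature) and column (sample) indices and does not call MIDAS at all. The helper names `subsample_indices` and `take` are made up for illustration, and the toy matrix stands in for the real data.

```python
import random


def subsample_indices(n_rows, n_cols, k_rows, k_cols, seed=0):
    """Draw random row (feature) and column (sample) indices without replacement.

    Hypothetical helper for illustration; not part of MIDAS.
    """
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    rows = sorted(rng.sample(range(n_rows), min(k_rows, n_rows)))
    cols = sorted(rng.sample(range(n_cols), min(k_cols, n_cols)))
    return rows, cols


def take(matrix, rows, cols):
    """Return the sub-matrix of a list-of-lists at the given rows and columns."""
    return [[matrix[r][c] for c in cols] for r in rows]


# Toy stand-in for the real matrix: 20 features (rows) x 8 samples (columns).
full = [[r * 10 + c for c in range(8)] for r in range(20)]

rows, cols = subsample_indices(len(full), len(full[0]), k_rows=5, k_cols=3, seed=42)
subset = take(full, rows, cols)
```

With a pandas DataFrame the same index lists can be applied as `df.iloc[rows, cols]`. Drawing several subsets with different seeds and comparing the settings each one selects would be one way to check how stable the tuned hyperparameters are before committing to a full run.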

Any help would be great.

Optimizing MIDAS on very large/complex datasets #20
