
Commit 47c3366

GijsVermarien, Gert-Jan, and remykusters authored
New dataset api (#61)
* Reworking dataset; now supports loading 1D 1 out numerical and analytical. * Moved loading into separate function. * Extra checks in notebook * Simplified API to use a function which returns data and grid. * Updated with black. * Proposed new structure for the dataset, with examples * Did a black lint * Refactored for use with dataloader * Created a Loader for datasets that fit in GPU * Small refactor * Updated to a functional version of Dataset * Created one loss to replace 3 seperate losses * Renamed the data function to reflect they are fun * Updated subsamplers * Tried adding the new features to the examples * Added the notebook * Added black newline * Updated the dataloader * Updated the notebooks to the new api style * Update the 2D AD example * Added better support for dimensionful data * Updated notebooks * Changed the training loop * Updated to correctly take MSE_test per feature * subsampling along axis * Added dimension to the data functions * Updated docstrings and added better checks * removed print * Updated the notebooks to the newest version * Removed the legacy datasets * Updated the documentation * Updated the notebooks * Fixed some black whitespace shenanigans * Remove test notebook * Fixed the figure in the documentation. * Delete Dataset.ipynb This file is not needed. * Update examples.py Updated the function name. Co-authored-by: Gert-Jan <gertjanboth@gmail.com> Co-authored-by: Remy Kusters <kusters.remy@gmail.com>
1 parent 07934e0 commit 47c3366

22 files changed: 1036 additions & 908 deletions

.gitignore

Lines changed: 3 additions & 1 deletion
```diff
@@ -52,4 +52,6 @@ MANIFEST
 __pycache__
 src/DeePyMoD.egg/
 site/
-.eggs/
+.eggs/
+*events.out.tfevents.*
+*.pt
```

docs/datasets/data.md

Lines changed: 58 additions & 0 deletions
# Datasets

## The general workflow

The custom DeePyMoD dataset and dataloaders are designed for data that typically fits in RAM/VRAM during training; if that is not your use case, they are interchangeable with the general PyTorch [Datasets](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) and [DataLoaders](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).

For model discovery we typically want to add some noise to our dataset, normalize certain features, and ensure the data is in the right place for optimal PyTorch performance. This can easily be done using the custom `deepymod.data.Dataset` and `deepymod.data.get_train_test_loader`. An illustration of the workflow is shown below:

![Workflow](../figures/data_workflow_for_deepymod.png)

## The dataset

The dataset needs a function that loads all the samples and returns them as a `(coordinates, data)` pair:

```python
def load_data():
    # create or load your data here
    return coordinates, data
```

Here it is important that the last axis of the data holds the features, even if there is only one. The dataset accepts data that is still dimensionful, `(t, x, y, z, number_of_features)`, as well as data that is already flattened to `(number_of_samples, number_of_features)`. After loading, the dataset applies the following functions to the samples, in order: `preprocessing`, `subsampling`, and lastly `shuffling`.
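As a concrete illustration, here is a minimal, self-contained `load_data` sketch using NumPy; the grid sizes and the synthetic solution are hypothetical, not part of the library:

```python
import numpy as np

def load_data():
    # Hypothetical grid: 10 time snapshots x 50 spatial points.
    t = np.linspace(0, 1, 10)
    x = np.linspace(-1, 1, 50)
    t_grid, x_grid = np.meshgrid(t, x, indexing="ij")
    u = np.exp(-t_grid) * np.sin(np.pi * x_grid)       # synthetic 1D solution
    coordinates = np.stack((t_grid, x_grid), axis=-1)  # shape (10, 50, 2)
    data = u[..., None]                                # shape (10, 50, 1): last axis = features
    return coordinates, data

coordinates, data = load_data()
# The flattened form is accepted equally well:
data_flat = data.reshape(-1, data.shape[-1])           # shape (500, 1)
```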
### Preprocessing

The preprocessing step performs operations commonly used in the framework: normalizing the coordinates, normalizing the data, and adding noise to the data. These choices are provided via a dictionary of arguments:

```python
preprocess_kwargs: dict = {
    "random_state": 42,
    "noise_level": 0.0,
    "normalize_coords": False,
    "normalize_data": False,
}
```

The preprocessing behaviour can be overridden by defining the functions `apply_normalize` and `apply_noise`, or even the way we shuffle via `apply_shuffle`.
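To make these options concrete, here is a sketch of what noise injection and coordinate normalization could look like. These are standalone NumPy functions written for illustration; the exact behaviour of the library's preprocessing may differ:

```python
import numpy as np

rng = np.random.default_rng(42)  # mirrors "random_state": 42

def apply_noise(data, noise_level):
    # Additive white noise scaled by the standard deviation of the data,
    # a common convention (an assumption, not taken from the library source).
    return data + noise_level * np.std(data) * rng.standard_normal(data.shape)

def apply_normalize(coords):
    # Min-max scale each coordinate axis to [-1, 1].
    cmin, cmax = coords.min(axis=0), coords.max(axis=0)
    return 2 * (coords - cmin) / (cmax - cmin) - 1
```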
### Subsampling

Sometimes we do not wish to use the whole dataset, so we can subsample it. If a subset of the available time snapshots is enough, we can use `deepymod.data.samples.Subsample_time`; to subsample randomly, we can use `deepymod.data.samples.Subsample_random`. The arguments for these functions are provided to the Dataset via `subsampler_kwargs`.
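The two strategies can be sketched as follows. These are standalone NumPy versions written for illustration, not the library's `Subsample_time`/`Subsample_random` themselves:

```python
import numpy as np

def subsample_time(data, keep_every):
    # Keep every n-th time snapshot; assumes axis 0 is time,
    # i.e. the data is still in its dimensionful (t, x, ..., features) form.
    return data[::keep_every]

def subsample_random(coords, data, number_of_samples, seed=0):
    # Draw a random subset of rows from flattened (samples, features) arrays.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=number_of_samples, replace=False)
    return coords[idx], data[idx]
```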
### The resulting shape of the data

For random subsampling the `(number_of_samples, number_of_features)` format is better, while for spatial subsampling the `(t, x, y, z, number_of_features)` format is best, so we accept both formats. However, since the trainer can only work with `(number_of_samples, number_of_features)`, the data is reshaped to this format once it has been preprocessed and subsampled. After this the data can be shuffled.
### Shuffling

If the data needs to be shuffled, `shuffle=True` can be used.
## Dataloaders

In the PyTorch framework, dataloaders ensure that loading the data goes smoothly, for example with multiple workers. However, we can typically fit the whole dataset into memory at once, so the overhead of the PyTorch DataLoader is not needed. We therefore provide the Loader, a thin wrapper around the dataset that returns the entire dataset as a single batch.
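The idea behind such a full-batch loader can be sketched in a few lines; this illustrates the concept only and is not the actual `deepymod.data.Loader` implementation:

```python
class FullBatchLoader:
    """Minimal sketch: iterating yields the whole dataset as a single batch."""

    def __init__(self, coordinates, data):
        self.coordinates = coordinates
        self.data = data

    def __iter__(self):
        # One batch containing everything; no worker processes, no collation.
        yield self.coordinates, self.data

    def __len__(self):
        return 1  # a single batch per epoch
```

Because the data already sits in memory, iterating costs essentially nothing compared to spinning up a multi-worker PyTorch DataLoader.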
## Obtaining the dataloaders

In order to create a train and test split, we can use the function `get_train_test_loader`, which divides the dataset into two pieces and then directly passes each into a loader.
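The split itself amounts to shuffling sample indices and cutting them in two, along these lines (a standalone sketch; the actual signature and internals of `get_train_test_loader` may differ):

```python
import numpy as np

def train_test_split(coordinates, data, train_fraction=0.8, seed=42):
    # Shuffle the sample indices, then cut once; each half would then be
    # wrapped in a loader.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_train = int(train_fraction * len(data))
    train_idx, test_idx = idx[:n_train], idx[n_train:]
    return ((coordinates[train_idx], data[train_idx]),
            (coordinates[test_idx], data[test_idx]))
```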

examples/ODE_Example_coupled_nonlin.ipynb

Lines changed: 291 additions & 117 deletions

examples/PDE_2D_Advection-Diffusion.ipynb

Lines changed: 152 additions & 124 deletions

examples/PDE_Burgers.ipynb

Lines changed: 102 additions & 150 deletions

examples/PDE_KdV.ipynb

Lines changed: 66 additions & 129 deletions

setup.cfg

Lines changed: 1 addition & 1 deletion
```diff
@@ -32,7 +32,7 @@ setup_requires = pyscaffold>=3.2a0,<3.3a0
 # Add here dependencies of your project (semicolon/line-separated), e.g.
 install_requires = numpy
     torch
-    sklearn
+    scikit-learn
     pysindy
     natsort
     tensorboard
```

src/deepymod/data/__init__.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -1 +1 @@
-from .base import Dataset, Dataset_2D
+from deepymod.data.base import Dataset, Loader, get_train_test_loader
```
