Commit dbe9320

Rebase latest changes from main
1 parent 64ec47f commit dbe9320

1 file changed

+28
-31
lines changed

tutorialposts/2021-01-21-data-loader.md

Lines changed: 28 additions & 31 deletions
@@ -3,80 +3,77 @@ title = "Using Flux DataLoader"
33
published = "21 January 2021"
44
author = "Liliana Badillo, Dhairya Gandhi"
55
+++
6-
6+
77
In this tutorial, we show how to load image data with Flux's DataLoader and process it in mini-batches. We use the [DataLoader](https://fluxml.ai/Flux.jl/stable/data/dataloader/#Flux.Data.DataLoader) type to handle iteration over mini-batches of data. For this example, we load the [MNIST dataset](https://juliaml.github.io/MLDatasets.jl/stable/datasets/MNIST/) using the [MLDatasets](https://juliaml.github.io/MLDatasets.jl/stable/) package.
8-
8+
99
Before we start, make sure you have installed the following packages:
10-
10+
1111
* [Flux](https://github.com/FluxML/Flux.jl)
1212
* [MLDatasets](https://juliaml.github.io/MLDatasets.jl/stable/)
13-
13+
1414
To install these packages, run the following in the REPL:
15-
15+
1616
```julia
1717
using Pkg
Pkg.add("Flux")
1818
Pkg.add("MLDatasets")
1919
```
20-
21-
20+
2221
Load the packages we'll need:
23-
22+
2423
```julia
2524
using MLDatasets: MNIST
2625
using Flux.Data: DataLoader
2726
using Flux: onehotbatch
2827
```
29-
28+
3029
## Step 1: Loading the MNIST data set
31-
30+
3231
We load the MNIST train and test data from MLDatasets:
33-
32+
3433
```julia
35-
train_x, train_y = MNIST.traindata(Float32)
36-
test_x, test_y = MNIST.testdata(Float32)
34+
train_x, train_y = MNIST(:train)[:]
35+
test_x, test_y = MNIST(:test)[:]
3736
```
38-
37+
3938
This code loads the MNIST train and test images as Float32 arrays, along with their labels. The array `train_x` has size 28×28×60000: it holds 60000 images, each stored as a 28×28 array that represents a grayscale picture of a handwritten digit. Each entry of a 28×28 array is one pixel, and its value is that pixel's intensity. The vector `train_y` has 60000 elements, and each element is the label (0 to 9) of the corresponding handwritten digit. The test arrays `test_x` and `test_y` have the same layout, with 10000 observations.
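
To make the array layout concrete, here is a minimal sketch that uses a small random array as a stand-in for `train_x`, so it runs without downloading MNIST. The names `images` and `first_image` are illustrative and do not appear in the tutorial.

```julia
# Stand-in for train_x: 5 random "images" instead of the 60000 real ones.
images = rand(Float32, 28, 28, 5)

# The last dimension indexes observations, so this selects the first image.
first_image = images[:, :, 1]

@assert size(images, 3) == 5            # number of images
@assert size(first_image) == (28, 28)   # one grayscale image
@assert eltype(first_image) == Float32  # pixel values are Float32
```

With the real data, the same indexing (`train_x[:, :, 1]`) would return the first training image.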
40-
39+
4140
## Step 2: Loading the dataset onto DataLoader
42-
41+
4342
Before we load the data onto a DataLoader, we need to reshape it so that it has the shape Flux expects. For this example, the MNIST images must match the shape expected by our model's input layer, and the labels must match the shape produced by its output layer.
44-
43+
4544
For example, if our model's input layer expects 28×28×1 arrays (one grayscale channel per image), we need to add a singleton channel dimension to the train and test data, keeping the observations in the last dimension:
46-
45+
4746
```julia
4847
train_x = reshape(train_x, 28, 28, 1, :)
4948
test_x = reshape(test_x, 28, 28, 1, :)
5049
```
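
As a quick check of what this reshape does, the following sketch applies the same call to a small random array (a hypothetical stand-in for `train_x`). The `:` asks `reshape` to infer the size of the last dimension, and the pixel values themselves are unchanged.

```julia
# Stand-in for train_x with 5 observations.
images = rand(Float32, 28, 28, 5)

# Add a singleton channel dimension; `:` infers the number of observations.
batched = reshape(images, 28, 28, 1, :)

@assert size(batched) == (28, 28, 1, 5)
# Observation 2 holds the same pixels before and after the reshape.
@assert batched[:, :, 1, 2] == images[:, :, 2]
```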
51-
50+
5251
Also, each MNIST label must be encoded as a vector whose length equals the number of categories (the ten unique handwritten digits) in the data set. To encode the labels, we use Flux's [onehotbatch](https://fluxml.ai/Flux.jl/stable/data/onehot/#Batches-1) function:
53-
52+
5453
```julia
5554
train_y, test_y = onehotbatch(train_y, 0:9), onehotbatch(test_y, 0:9)
5655
```
57-
56+
5857
>**Note:** For more information on other encoding methods, see [Handling Data in Flux](https://fluxml.ai/Flux.jl/stable/data/onehot/).
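
The encoding can be inspected on a few hand-picked labels. This sketch assumes Flux is installed; the vector `labels` is made up for illustration. Each label becomes one column with a single `true` entry, and `onecold` reverses the encoding.

```julia
using Flux: onehotbatch, onecold

# Three illustrative labels (not taken from MNIST).
labels = [5, 0, 9]

# One column per label, one row per class in 0:9.
encoded = onehotbatch(labels, 0:9)

@assert size(encoded) == (10, 3)
# Label 5 sits in row 6 because the classes start at 0.
@assert encoded[6, 1] == true
# onecold maps each column back to its class label.
@assert onecold(encoded, 0:9) == labels
```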
59-
58+
6059
Now, we load the train images and their labels onto a DataLoader object:
61-
60+
6261
```julia
6362
data_loader = DataLoader((train_x, train_y); batchsize=128, shuffle=true)
6463
```
65-
64+
6665
Notice that we set the DataLoader `batchsize` to 128, so we iterate over the data in batches of 128 observations. Also, setting `shuffle=true` makes the DataLoader reshuffle the observations each time iteration is restarted.
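
Because 60000 is not a multiple of 128, the final batch is smaller than the others. The arithmetic can be sketched in plain Julia, assuming the loader keeps the partial last batch (as the assertions in Step 3 imply); the variable names are illustrative.

```julia
n_obs, batchsize = 60_000, 128

# Number of batches when the partial last batch is kept (ceiling division).
n_batches = cld(n_obs, batchsize)

# Observations left over for the final batch.
last_batch = n_obs - (n_batches - 1) * batchsize

@assert n_batches == 469
@assert last_batch == 96
```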
67-
66+
6867
## Step 3: Iterating over the data
69-
70-
Finally, we can iterate over the 60000 MNIST train data in mini-batches (most of them of size 128) using the Dataloader that we created in the previous step. Each element of the DataLoader is a tuple `(x, y)` in which `x` represents a 28x28x1 array and `y` a vector that encodes the corresponding label of the image.
71-
68+
69+
Finally, we can iterate over the 60000 MNIST training observations in mini-batches (all of size 128 except the last, which holds the remaining 96) using the DataLoader that we created in the previous step. Each element of the DataLoader is a tuple `(x, y)` in which `x` is a 28×28×1×N array holding the N images of the batch and `y` is a 10×N matrix whose columns encode the corresponding labels.
70+
7271
```julia
7372
for (x, y) in data_loader
7473
@assert size(x) == (28, 28, 1, 128) || size(x) == (28, 28, 1, 96)
7574
@assert size(y) == (10, 128) || size(y) == (10, 96)
7675
...
7776
end
7877
```
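
The same loop can be tried without downloading MNIST. The sketch below assumes a Flux version that exposes `DataLoader` at the top level (the tutorial imports it from `Flux.Data` instead) and uses 300 random observations as a hypothetical stand-in for the training set.

```julia
using Flux: DataLoader, onehotbatch

# Random stand-ins: 300 "images" and 300 labels in 0:9.
x = rand(Float32, 28, 28, 1, 300)
y = onehotbatch(rand(0:9, 300), 0:9)

loader = DataLoader((x, y); batchsize=128, shuffle=true)

# Record how many observations each mini-batch contains.
batch_sizes = [size(xb, 4) for (xb, yb) in loader]

@assert length(loader) == 3
@assert batch_sizes == [128, 128, 44]  # the last batch holds the remainder
```

Shuffling changes which observations land in each batch, but not the batch sizes.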
79-
80-
81-
78+
8279
Now, we can create a model and train it using the `data_loader` we just created. For more information on building models in Flux, see [Model-Building Basics](https://fluxml.ai/Flux.jl/stable/models/basics/#Model-Building-Basics-1).
