local dl = require 'dataload'
A collection of Torch dataset loaders. The library provides the following generic data loader classes :
- DataLoader : an abstract class inherited by the following classes;
- TensorLoader : for tensor or nested (i.e. tables of) tensor datasets;
- ImageClass : for image classification datasets stored in a flat folder structure;
- AsyncIterator : decorates a DataLoader for asynchronous multi-threaded iteration;
- SequenceLoader : for sequence datasets like language or time-series;
- MultiSequence : for shuffled sets of sequence datasets like shuffled sentences.
The library also provides functions for downloading specific datasets and preparing them using the above loaders :
- loadMNIST : load the MNIST handwritten digit dataset for image classification;
- loadImageNet : load the ILSVRC2014 dataset for image classification;
- loadPTB : load the Penn Tree Bank corpus for language modeling;
- loadGBW : load the Google Billion Words corpus for language modeling.
Also, we try to provide some useful preprocessing functions :
- fitImageNormalize : normalize images by channel.
dataloader = dl.DataLoader()
An abstract class inherited by all DataLoader instances.
It wraps a data set to provide methods for accessing
inputs and targets. The data itself may be loaded from disk or memory.
Returns the number of samples in the dataloader.
Returns the size of inputs. When excludedim is 1 (the default),
the batch dimension is excluded from size.
When inputs is a tensor, the returned size is
a table of numbers. When it is a table of tensors, the returned size
is a table of tables of numbers.
Returns the size of targets. When excludedim is 1 (the default),
the batch dimension is excluded from size.
When targets is a tensor, the returned size is
a table of numbers. When it is a table of tensors, the returned size
is a table of tables of numbers.
Returns inputs and targets containing samples indexed by indices.
So for example :
indices = torch.LongTensor{1,2,3,4,5}
inputs, targets = dataloader:index(indices)
would return a batch of inputs and targets containing samples 1 through 5.
When inputs and targets are provided as arguments, they are used as
memory buffers for the returned inputs and targets,
i.e. their allocated memory is reused.
Returns inputs and targets containing batchsize random samples.
This method is equivalent to :
indices = torch.LongTensor(batchsize):random(1,dataloader:size())
inputs, targets = dataloader:index(indices)
Returns inputs and targets containing stop-start+1 samples between start and stop.
This method is equivalent to :
indices = torch.LongTensor():range(start, stop)
inputs, targets = dataloader:index(indices)
Internally shuffles the inputs and targets. Note that not all
subclasses support this method.
Splits the dataloader into two new DataLoader instances
where ds1 contains the first math.floor(ratio x dataloader:size()) samples,
and ds2 contains the remainder.
Useful for splitting a training set into a new training set and validation set.
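The arithmetic behind such a split can be sketched in plain Lua. This is an illustration only: split_ranges is a hypothetical helper, not part of the library, and it assumes samples are indexed from 1.

```lua
-- Hypothetical helper (not part of dataload): index ranges of a ratio split.
-- The first range holds math.floor(ratio * nsample) samples, the second the remainder.
local function split_ranges(nsample, ratio)
   local n1 = math.floor(ratio * nsample)
   return {first = 1, last = n1}, {first = n1 + 1, last = nsample}
end

-- 1001 samples with ratio 0.75 : 750.75 is floored to 750
local r1, r2 = split_ranges(1001, 0.75)
print(r1.first, r1.last, r2.first, r2.last)
```

Because the count is floored, any fractional sample goes to the second set.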
Returns an iterator over a validation or test set. Each iteration returns 3 values :
- k : the number of samples processed so far. Each iteration returns a maximum of batchsize samples;
- inputs : a tensor (or nested table thereof) containing a maximum of batchsize inputs;
- targets : a tensor (or nested table thereof) containing targets for the commensurate inputs.
The iterator will return batches of inputs and targets of size at most batchsize until
epochsize samples have been returned.
Note that the default implementation of this iterator is to call sub for each batch. Sub-classes may override this behavior.
Example :
local dl = require 'dataload'
inputs, targets = torch.range(1,5), torch.range(1,5)
dataloader = dl.TensorLoader(inputs, targets)
local i = 0
for k, inputs, targets in dataloader:subiter(2,6) do
   i = i + 1
   print(string.format("batch %d, nsampled = %d", i, k))
   print(string.format("inputs:\n%stargets:\n%s", inputs, targets))
end
Output :
batch 1, nsampled = 2
inputs:
1
2
[torch.DoubleTensor of size 2]
targets:
1
2
[torch.DoubleTensor of size 2]
batch 2, nsampled = 4
inputs:
3
4
[torch.DoubleTensor of size 2]
targets:
3
4
[torch.DoubleTensor of size 2]
batch 3, nsampled = 5
inputs:
5
[torch.DoubleTensor of size 1]
targets:
5
[torch.DoubleTensor of size 1]
batch 4, nsampled = 6
inputs:
1
[torch.DoubleTensor of size 1]
targets:
1
[torch.DoubleTensor of size 1]
Note how the last two batches are of size 1 while those before are of size batchsize = 2.
The reason for this is that the dataloader only has 5 samples.
So the last batch is split between the last sample and the first.
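The batch boundaries in this output can be reproduced with a short plain-Lua sketch. It is not the library's implementation: subiter_ranges is a hypothetical helper that only computes which start and stop indices a sequential iterator with wrap-around would request, under the assumption that a batch never crosses the end of the data.

```lua
-- Hypothetical helper (not part of dataload): (start, stop) ranges visited by a
-- sequential iterator that wraps around when it reaches the last sample.
local function subiter_ranges(nsample, batchsize, epochsize)
   local ranges, nsampled, start = {}, 0, 1
   while nsampled < epochsize do
      -- never return more samples than remain in the epoch
      local bs = math.min(nsampled + batchsize, epochsize) - nsampled
      -- never read past the last sample of the dataset
      local stop = math.min(start + bs - 1, nsample)
      nsampled = nsampled + (stop - start + 1)
      ranges[#ranges + 1] = {start = start, stop = stop, k = nsampled}
      start = stop + 1
      if start > nsample then start = 1 end -- wrap around to the first sample
   end
   return ranges
end

local ranges = subiter_ranges(5, 2, 6)
for i, r in ipairs(ranges) do
   print(string.format("batch %d, nsampled = %d, samples %d to %d", i, r.k, r.start, r.stop))
end
```

With 5 samples, batchsize = 2 and epochsize = 6, the sketch yields the ranges 1-2, 3-4, 5-5 and 1-1, matching the four batches printed above.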
Returns an iterator over a training set. Each iteration returns 3 values :
- k : the number of samples processed so far. Each iteration returns a maximum of batchsize samples;
- inputs : a tensor (or nested table thereof) containing a maximum of batchsize inputs;
- targets : a tensor (or nested table thereof) containing targets for the commensurate inputs.
The iterator will return batches of inputs and targets of size at most batchsize until
epochsize samples have been returned.
Note that the default implementation of this iterator is to call sample for each batch. Sub-classes may override this behavior.
Example :
local dl = require 'dataload'
inputs, targets = torch.range(1,5), torch.range(1,5)
dataloader = dl.TensorLoader(inputs, targets)
local i = 0
for k, inputs, targets in dataloader:sampleiter(2,6) do
   i = i + 1
   print(string.format("batch %d, nsampled = %d", i, k))
   print(string.format("inputs:\n%stargets:\n%s", inputs, targets))
end
Output :
batch 1, nsampled = 2
inputs:
1
2
[torch.DoubleTensor of size 2]
targets:
1
2
[torch.DoubleTensor of size 2]
batch 2, nsampled = 4
inputs:
4
2
[torch.DoubleTensor of size 2]
targets:
4
2
[torch.DoubleTensor of size 2]
batch 3, nsampled = 6
inputs:
4
1
[torch.DoubleTensor of size 2]
targets:
4
1
[torch.DoubleTensor of size 2]
Resets all internal counters such as those used for iterators.
Called by AsyncIterator before serializing the DataLoader to threads.
Collects garbage once every self.gccdelay calls to this method.
Returns a deep copy of self.
dataloader = dl.TensorLoader(inputs, targets)
The TensorLoader can be used to encapsulate tensors of inputs and targets.
As an example, consider a dummy 3 x 8 x 8 image classification dataset consisting of 1000 samples and 10 classes:
inputs = torch.randn(1000, 3, 8, 8)
targets = torch.LongTensor(1000):random(1,10)
dataloader = dl.TensorLoader(inputs, targets)
The TensorLoader can also be used to encapsulate nested tensors of inputs and targets.
It uses recursive functions to handle nestings of arbitrary depth. As an example, let us
modify the above example to include x,y GPS coordinates in the inputs and
a parallel set of classification targets (7 classes):
inputs = {torch.randn(1000, 3, 8, 8), torch.randn(1000, 2)}
targets = {torch.LongTensor(1000):random(1,10), torch.LongTensor(1000):random(1,7)}
dataloader = dl.TensorLoader(inputs, targets)
dataloader = dl.ImageClass(datapath, loadsize, [samplesize, samplefunc, sortfunc, verbose])
For loading an image classification data set stored in a flat folder structure :
(datapath)/(classdir)/(imagefile).(jpg|png|etc)
So directory classdir is expected to contain all the images belonging to that class.
All image files are indexed into an efficient CharTensor during initialization.
Images are only loaded into inputs and targets tensors upon calling
batch sampling methods like index, sample and sub.
Note that for asynchronous loading of images (i.e. loading batches of images in different threads),
the ImageClass loader can be decorated with an AsyncIterator.
Images on disk can have different height, width and number of channels.
Constructor arguments are as follows :
- datapath : one or many paths to directories of images;
- loadsize : initial size to load the images to. Example : {3, 256, 256};
- samplesize : consistent sample size to resize the images to. Defaults to loadsize;
- samplefunc : function f(self, dst, path) used to create one or more samples from an image path and store them in CharTensor dst. Strings "sampleDefault" (the default), "sampleTrain" or "sampleTest" can also be provided as they refer to existing functions;
- verbose : display verbose message (default is true);
- sortfunc : comparison operator used for sorting classdir to get class indices. Defaults to the < operator.
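The role of sortfunc can be illustrated in plain Lua. The sketch below is not the library's code: class_indices and by_number are hypothetical helpers that show how sorting directory names with a comparison operator assigns class indices, and why a custom operator may be wanted when names end in a number.

```lua
-- Hypothetical helper (not part of dataload): assign class indices by sorting
-- class directory names with a comparison function (defaults to <).
local function class_indices(classdirs, sortfunc)
   local sorted = {}
   for i, name in ipairs(classdirs) do sorted[i] = name end
   table.sort(sorted, sortfunc or function(a, b) return a < b end)
   local index = {}
   for i, name in ipairs(sorted) do index[name] = i end
   return sorted, index
end

local dirs = {"class10", "class2", "class1"}

-- default < operator : plain string comparison puts "class10" before "class2"
local lexical = class_indices(dirs)
print(table.concat(lexical, " "))

-- custom operator : compare the numeric suffix instead
local function by_number(a, b)
   return tonumber(a:match("%d+")) < tonumber(b:match("%d+"))
end
local numeric, index = class_indices(dirs, by_number)
print(table.concat(numeric, " "))
```

With the numeric comparison, class10 receives index 3 rather than index 2.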
dataloader = dl.AsyncIterator(dataloader, [nthread, verbose])
This DataLoader subclass overrides the subiter and sampleiter
iterator methods. The implementation uses the threads package to
build a pool of nthread worker threads. The main thread delegates the tasks of building inputs and targets tensors
to the workers. The workers each have a deep copy of the decorated dataloader.
When a task is received from the main thread through the Queue, they call sample
or sub to build the batch and return the inputs and targets to the
main thread. The iteration is asynchronous as the first iteration will fill the Queue with nthread tasks.
Note that when nthread > 1 the order of tensors is not deterministic.
This loader is well suited for decorating a dl.ImageClass instance and other
such I/O and CPU bound loaders.
dataloader = dl.SequenceLoader(sequence, batchsize, [bidirectional])
This DataLoader subclass can be used to encapsulate a sequence
for training time-series or language models.
The sequence is a tensor where the first dimension indexes time.
Internally, the loader will split the sequence into batchsize subsequences.
Calling the sub(start, stop, inputs, targets) method will return
inputs and targets of size seqlen x batchsize [x inputsize]
where stop - start + 1 <= seqlen.
See RNNLM training script for an example.
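To make the layout concrete, here is a plain-Lua sketch of one way a sequence could be cut into batchsize subsequences, with targets shifted one step ahead of the inputs (the non-bidirectional default). split_sequence and sub_batch are hypothetical helpers that use nested tables instead of tensors; the library's internal implementation may differ.

```lua
-- Hypothetical helpers (not part of dataload), using plain tables instead of tensors.

-- Cut a sequence into batchsize contiguous subsequences of equal length.
local function split_sequence(sequence, batchsize)
   local sublen = math.floor(#sequence / batchsize)
   local columns = {}
   for b = 1, batchsize do
      local column = {}
      for t = 1, sublen do
         column[t] = sequence[(b - 1) * sublen + t]
      end
      columns[b] = column
   end
   return columns, sublen
end

-- Build inputs and targets of size (stop - start + 1) x batchsize.
-- Targets are one step in the future, so stop must be smaller than sublen.
local function sub_batch(columns, start, stop)
   local inputs, targets = {}, {}
   for t = start, stop do
      local irow, trow = {}, {}
      for b = 1, #columns do
         irow[b] = columns[b][t]
         trow[b] = columns[b][t + 1]
      end
      inputs[#inputs + 1], targets[#targets + 1] = irow, trow
   end
   return inputs, targets
end

local sequence = {}
for i = 1, 12 do sequence[i] = i end
local columns, sublen = split_sequence(sequence, 3)
local inputs, targets = sub_batch(columns, 1, 3)
for t = 1, #inputs do
   print(table.concat(inputs[t], " "), table.concat(targets[t], " "))
end
```

Each row is a time step and each column a subsequence, so the first input row holds the first element of every subsequence.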
The bidirectional argument should be set
to true for bidirectional models like BRNNs/BLSTMs, in which case
the returned inputs and targets will be aligned.
For example, using batchsize = 3 and seqlen = 5 :
print(inputs:t(), targets:t())
36 1516 853 94 1376
3193 433 553 805 521
512 434 57 1029 1962
[torch.IntTensor of size 3x5]
36 1516 853 94 1376
3193 433 553 805 521
512 434 57 1029 1962
[torch.IntTensor of size 3x5]
When bidirectional is false (the default), the targets will
be one step in the future with respect to the inputs.
For example, using batchsize = 3 and seqlen = 5 :
print(inputs:t(), targets:t())
36 1516 853 94 1376
3193 433 553 805 521
512 434 57 1029 1962
[torch.IntTensor of size 3x5]
1516 853 94 1376 719
433 553 805 521 27
434 57 1029 1962 49
[torch.IntTensor of size 3x5]
dataloader = dl.MultiSequence(sequences, batchsize)
This DataLoader subclass is used by the Billion Words dataset to encapsulate unordered sentences.
The sequences argument is a table or tds.Vec of tensors.
Each such tensor is a single sequence independent of the others.
When calling sub(start, stop) or subiter(seqlen) methods,
a column of the returned inputs and targets tensors (of size seqlen x batchsize) could
contain multiple sequences. For example, a character-level language model could look like:
target : [ ] E L L O [ ] C R E E N ...
input : [ ] H E L L [ ] S C R E E ...
where HELLO and SCREEN would be two independent sequences.
Note that [ ] is a zero mask used to separate independent sequences.
In most cases, the [ ] token is a 0,
except for 1D targets, where it is a 1 (so that it works with ClassNLLCriterion).
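The masking scheme can be sketched in plain Lua for a single column of word indices. pack_sequences is a hypothetical helper, not the library's implementation; it assumes each sequence contributes one mask step followed by its inputs and next-step targets, as in the HELLO and SCREEN illustration.

```lua
-- Hypothetical helper (not part of dataload): concatenate independent sequences
-- into one input stream and one target stream, separated by mask tokens.
local function pack_sequences(sequences, inputmask, targetmask)
   local inputs, targets = {}, {}
   for _, seq in ipairs(sequences) do
      -- mask step marking the start of a new, independent sequence
      inputs[#inputs + 1] = inputmask
      targets[#targets + 1] = targetmask
      -- the target at each step is the next element of the same sequence
      for t = 1, #seq - 1 do
         inputs[#inputs + 1] = seq[t]
         targets[#targets + 1] = seq[t + 1]
      end
   end
   return inputs, targets
end

-- two independent sequences of word indices; inputs are masked with 0 and
-- 1D targets with 1, as described above
local inputs, targets = pack_sequences({{5, 6, 7}, {8, 9}}, 0, 1)
print("input  : " .. table.concat(inputs, " "))
print("target : " .. table.concat(targets, " "))
```

The 1 used for masked targets is a valid class index, which is what keeps it compatible with ClassNLLCriterion.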
train, valid, test = dl.loadMNIST([datapath, validratio, scale, srcurl])
Returns the training, validation and testing sets as 3 TensorLoader instances.
Each such loader encapsulates a part of the MNIST dataset which is
located in datapath (defaults to dl.DATA_PATH/mnist).
The validratio argument, a number between 0 and 1,
specifies the ratio of the 60000 training samples
that will be allocated to the validation set.
The scale argument specifies the range within which pixel values will be scaled (defaults to {0,1}).
The srcurl argument specifies the URL from which the raw data is downloaded
if it is not found on disk.
train, valid, test = dl.loadPTB(batchsize, [datapath, srcurl])
Returns the training, validation and testing sets as 3 SequenceLoader instances.
Each such loader encapsulates a part of the Penn Tree Bank dataset which is
located in datapath (defaults to dl.DATA_PATH/PennTreeBank).
If the files aren't found in the datapath, they will be automatically downloaded
from the srcurl URL.
The batchsize specifies the number of samples that will be returned when
iterating through the dataset. If specified as a table, its elements
specify the batchsize of the corresponding train, valid and test sets.
We recommend a batchsize of 1 for evaluation sets (e.g. {50,1,1}).
See RNNLM training script for an example.
Ref.: A. http://image-net.org/challenges/LSVRC/2014/download-images-5jj5.php
train, valid = dl.loadImageNet(datapath, [nthread, loadsize, samplesize, verbose])
Returns the training and validation sets of the Large Scale Visual Recognition Challenge 2014 (ILSVRC2014) image classification dataset (commonly known as ImageNet). The dataset did not change between 2012 and 2014.
The returned train and valid loaders do not read all images into memory when first loaded.
Each dataset is implemented using an ImageClass loader decorated by an AsyncIterator.
The datapath should point to a directory containing the outputs of the downloadimagenet.lua and
harmonizeimagenet.lua scripts (see below).
Due to its size, the data first needs to be prepared offline. Use downloadimagenet.lua to download and extract the data :
th downloadimagenet.lua --savePath '/path/to/diskspace/ImageNet'
The download and extraction process requires about 360 GB of disk space.
This can be reduced to about 150 GB if the training set is downloaded and extracted first,
and all the .tar files are manually deleted. Repeat for the validation set, devkit and metadata.
If you still don't have enough space in one partition, you can divide the data among different partitions.
We recommend a good internet connection (>60Mbs download) and a solid-state drive (SSD).
Use harmonizeimagenet.lua to harmonize the train and validation sets:
th harmonizeimagenet.lua --dataPath /path/to/diskspace/ImageNet --progress --forReal
Each set will then contain a directory of images for each class with name class[id]
where [id] is a class index, between 1 and 1000, used for the ILSVRC2014 competition.
Then we need to install graphicsmagick :
luarocks install graphicsmagick
As in the famous (Krizhevsky et al. 2012) paper, the ImageNet training set samples random 224x224 crops from images resized so that the smallest dimension has size 256. For the validation set, ten 224x224 patches are cropped per image, i.e. the center, the four corners and their horizontal flips, and their predictions are averaged.
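The geometry of this cropping scheme can be sketched in plain Lua. resized_dims and ten_crops are hypothetical helpers, not the library's sampling functions; they only compute sizes and crop positions, and they assume zero-based pixel offsets and rounding to the nearest pixel.

```lua
-- Hypothetical helpers (not part of dataload).

-- Size of an image after resizing so that its smallest dimension equals smallest.
local function resized_dims(width, height, smallest)
   local scale = smallest / math.min(width, height)
   return math.floor(width * scale + 0.5), math.floor(height * scale + 0.5)
end

-- Top-left offsets of the ten evaluation crops : center and four corners,
-- each with and without a horizontal flip.
local function ten_crops(width, height, crop)
   local offsets = {
      {x = math.floor((width - crop) / 2), y = math.floor((height - crop) / 2)}, -- center
      {x = 0, y = 0},                            -- top-left
      {x = width - crop, y = 0},                 -- top-right
      {x = 0, y = height - crop},                -- bottom-left
      {x = width - crop, y = height - crop},     -- bottom-right
   }
   local crops = {}
   for _, o in ipairs(offsets) do
      crops[#crops + 1] = {x = o.x, y = o.y, flip = false}
      crops[#crops + 1] = {x = o.x, y = o.y, flip = true}
   end
   return crops
end

local w, h = resized_dims(500, 375, 256)
local crops = ten_crops(w, h, 224)
print(w, h, #crops)
```

For training, a single crop at a random offset within the same bounds would take the place of the ten fixed positions.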
train, valid, test = dl.loadGBW(batchsize, [trainfile, datapath, srcurl, verbose])
Loads the Google Billion Words corpus as MultiSequence loaders.
The preprocessing specified in
Google Billion Words language modeling benchmark
was applied to training-monolingual.tokenized/news.20??.en.shuffled.tokenized to generate the different subsets.
These subsets are automatically downloaded when not found on disk.
The task consists in predicting the next word given the previous ones.
The corpus contains approximately 30 million sentences of an average length of about 25 words.
In total, there are about 800 thousand (unique) words in the vocabulary, which makes it a very memory intensive problem.
ppf = dl.fitImageNormalize(trainset, [nsample, cachepath, verbose])
Returns a ppf preprocessing function that can be used to normalize a batch of images (inputs) in place,
channel-wise :
ppf(inputs)
The trainset argument is a DataLoader instance
containing image inputs. The mean and standard deviation will be measured
on nsample images (default 10000). When cachepath is provided, the
mean and standard deviation are saved for the next function call.
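The idea of fitting statistics once and returning a closure that normalizes in place can be sketched in plain Lua. fit_normalize is a hypothetical stand-in that works on nested tables (images[n][channel][pixel]) rather than tensors; it uses the population standard deviation, which may differ from what the library computes.

```lua
-- Hypothetical stand-in (not part of dataload) operating on nested tables.
local function fit_normalize(images)
   local nchannel = #images[1]
   local mean, std = {}, {}
   for c = 1, nchannel do
      local sum, sumsq, n = 0, 0, 0
      for _, img in ipairs(images) do
         for _, v in ipairs(img[c]) do
            sum, sumsq, n = sum + v, sumsq + v * v, n + 1
         end
      end
      mean[c] = sum / n
      std[c] = math.sqrt(math.max(sumsq / n - mean[c] * mean[c], 0))
      if std[c] == 0 then std[c] = 1 end -- avoid dividing by zero on constant channels
   end
   -- the returned function normalizes a batch in place, channel by channel
   return function(batch)
      for _, img in ipairs(batch) do
         for c = 1, nchannel do
            for i, v in ipairs(img[c]) do
               img[c][i] = (v - mean[c]) / std[c]
            end
         end
      end
      return batch
   end
end

-- one training image with two channels of two pixels each
local ppf = fit_normalize({ { {1, 3}, {10, 30} } })
local batch = { { {1, 3}, {10, 30} } }
ppf(batch)
print(batch[1][1][1], batch[1][1][2], batch[1][2][1], batch[1][2][2])
```

Fitting on a sample of the training set and reusing the same statistics for validation and test batches keeps the preprocessing consistent across sets.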