This is a script that takes data from various CSV files, downloads and stores the images.
It assumes the CSV has the following columns: name, thumbnail link, date, img link, subcategory id, subcategory.

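As a sketch, rows with these columns can be read with `csv.DictReader`. The header spellings and sample values below are assumptions; the text only lists the column meanings:

```python
import csv
import io

# Hypothetical header row and sample record; the real CSV's
# header spellings may differ.
sample = io.StringIO(
    "name,thumbnail link,date,img link,subcategory id,subcategory\n"
    "cat1.jpg,http://example.com/t.jpg,2020-01-01,"
    "http://example.com/i.jpg,4,siamese\n"
)
rows = list(csv.DictReader(sample))
```

Each row is then a dictionary keyed by column name, so the download step can look up `row["img link"]` directly.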
The final dataset can be stored according to two different conventions:
one that is more PyTorch friendly: the images are saved as individual files inside a folder and the labels in a CSV;
the other is more NumPy friendly: the images are stored as a numpy array inside a file "pickled" from a dictionary that also contains the labels.

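The download step itself is not detailed in this description; a minimal sketch of fetching one image by its img link, using only the standard library (the function name is hypothetical):

```python
import os
import urllib.request

def download_image(url, dest_dir, name):
    # Fetch one image from its 'img link' URL and store it in
    # dest_dir under its original name (assumed to be unique).
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, name)
    urllib.request.urlretrieve(url, path)
    return path
```

A real run would likely add error handling and a polite delay between requests.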
Details of the numpy version:
The final dataset is stored in the final_filename{i} files according to CIFAR semantics:
The archive contains the files dataset1, dataset2, ... Each of these files is a Python "pickled" object produced with pickle.
Here is a python3 routine which will open such a file and return a dictionary:

def unpickle(file):
    import pickle
    # encoding='bytes' is needed for files pickled under Python 2
    with open(file, 'rb') as fo:
        data_dict = pickle.load(fo, encoding='bytes')
    return data_dict

Loaded in this way, each of the batch files contains a dictionary with the following elements:
data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image.
The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue.
The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
labels -- a list of 10000 numbers in the range 1-21. The number at index i indicates the label of the ith image in the array data.
sublabels -- a list of 10000 numbers in the range 1-?. The number at index i indicates the sublabel (detailed category) of the ith image in the array data.

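Under the layout just described, one row of data can be turned back into a viewable image by reshaping and moving the channel axis last. A short self-contained sketch (the helper name is mine, and the row here is synthetic):

```python
import numpy as np

def row_to_image(row):
    # One row holds 3072 uint8 values: 1024 red, then 1024 green,
    # then 1024 blue, each plane stored in row-major order.
    # Reshape to (channel, height, width), then move channels last
    # to get a standard (32, 32, 3) image.
    return row.reshape(3, 32, 32).transpose(1, 2, 0)

# Synthetic row standing in for one row of the 'data' array:
row = (np.arange(3072) % 256).astype(np.uint8)
img = row_to_image(row)  # shape (32, 32, 3)
```

The transpose is what most plotting and image libraries expect, since they take height x width x channel arrays.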
Details of the pytorch version:
The images will all be stored inside the folder 'images' (which will be created if it does not exist) with their original names (assumed to be unique).
The labels will be stored inside a csv file dataset_labels.csv where the first column has the image name, the second its label, and the third its sublabel.
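A minimal sketch of writing one entry under this layout; only the folder name, CSV file name, and column order come from the text above, while the helper name is hypothetical:

```python
import csv
import os

def save_example(image_bytes, name, label, sublabel, root="."):
    # Store the raw image bytes as images/<name>, creating the
    # folder if it does not exist.
    img_dir = os.path.join(root, "images")
    os.makedirs(img_dir, exist_ok=True)
    with open(os.path.join(img_dir, name), "wb") as f:
        f.write(image_bytes)
    # Append one row to dataset_labels.csv: image name, label, sublabel.
    with open(os.path.join(root, "dataset_labels.csv"), "a", newline="") as f:
        csv.writer(f).writerow([name, label, sublabel])
```

This folder-plus-CSV layout is easy to wrap in a PyTorch Dataset, since each CSV row names an image file and its labels.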