ML_project/Build_GMS_Round2.Rmd at master · KNMI-DataLab/ML_project · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
---
title: "Build test/train for round 2 model building"
author: "Eva Kleingeld"
date: "October 5, 2016"
output: html_document
---

First we read in a large dataset containing six days worth of GMS data.
On four of these days temperatures were below freezing for (parts of) the day.
On two days temperatures were always above freezing.
The selected days each experienced different meteorological conditions.
For more details on the weather during the selected days we refer to the scripts:
- Met_conditions.Rmd
- Analyze_Days.R

```{r}
# Empty environment
rm(list=ls())

# Install packages if needed

# Read in the 6 Days data set
#load("F:/KNMI/Six_Days.Rda")
#load("/run/media/kleingel/Lexar/KNMI/Six_Days.Rda")
#load("/usr/people/kleingel/Projects/MLProject/Six_Days.Rda")
load("/usr/people/kleingel/Projects/MLProject/Three_Days.Rda")


# Rename dataframe
#The_Days <- Six_Days_3
#rm(Six_Days_3)
The_Days <- Three_Days_3
rm(Three_Days_3)

# Examine data frame
str(The_Days)
unique(The_Days$DOY_Days)

# Get day 68 from The_Days
Day_67 <- subset(The_Days,The_Days$DOY_Days == 67)

any(!(is.na(Day_67$TL)))
any(!(is.na(Day_67$TD)))


```

Remove NA values in TL and TD

```{r}
#How many NA values in all data frame columns?
#The_NAs <- apply(The_Days,2, is.na)
#apply(The_NAs, 2, sum)

# As you can see, only the TL and TD columns contain NA's
# Next, we remove all rows which contain TL/TD NA values
The_Days <- na.omit(The_Days)
```


Next, we need to select a six hour *train* set and a 1 hour and 30 minute *test* set.
The trains set will be used to build the model, the test set will be used to test it.

We use the createDataPartition function from the caret package to make these sets.

6 days times 24 hours is 144 hours. A subset of six hours from a total of 144 hours is 4.2% of all data. A subset of 3 hours is 2.1% of all data.

Unfortunately, the createDataPartition function can only split the data into pieces. We first split the data into 94% and 6%. Of the 6% we take 2% test and 4% train. We thus have a train that is twice as large as the test.

The createDataPartition function uses stratified random sampling to make the train/test split. The outcome of the createDataPartition function is an object which contains all the row numbers that should be in the train set.

We first make a 94%/6% split. The 6% Sub set is split into test and train.

**Check how many times you have to set the seed**
[link](http://stackoverflow.com/questions/20624698/fixing-set-seed-for-an-entire-session)
```{r}
library(caret)

# In order to generate the same sequence of random numbers everytime we run this script (so that we every time get the same test/train split) we set a seed
set.seed(442)

# Create the 94%/6% split
InSub <- createDataPartition(The_Days$TEMP, p = 1, list = FALSE)

# 6% of all data
Sub <- The_Days[InSub, ]

# Test if the split went well
(length(Sub$LOCATION)/ length(The_Days$LOCATION)) * 100


set.seed(4300)

# Split the subset into train and test (66% train, 33% test)
InTrain <- createDataPartition(Sub$TEMP, p = 0.6, list = FALSE)

# Build the train set
Train_data <- Sub[InTrain, ]

# Build the test set
Test_data <- Sub[-(InTrain), ]

# Is the test set indeed 33% of the data?
nrow(Test_data)/(nrow(Test_data) + nrow(Train_data)) * 100

```

The next step is to test the distribution of data in the test and train sets.
We also make a correlation plot to test wether the correlations between the predictors in the test and train sets are approximately similar.

The histograms show that although the distribution of the predictors in the test/train/original set is not identical, it is very similar. The correlation plot confirms this: The correlations between predictors in the test/train set is not the same, but very similar.


```{r}
library(corrplot)

# Make histograms of the distribution of the data in test, train and the original data and plot side by side

# Columns you wish to plot
PlotCol <- c(8)

for (i in (PlotCol)){
  par(mfrow=c(1,3))
  print(i)
  hist(The_Days[, i], freq = FALSE, xlab = colnames(The_Days)[i], main = paste("Original data", colnames(The_Days)[i]))
  hist(Train_data[, i], freq = FALSE, xlab = colnames(Train_data)[i], main = paste("Train data", colnames(Train_data)[i]))
  hist(Test_data[, i], freq = FALSE, xlab = colnames(Test_data)[i], main = paste("Test data", colnames(Test_data)[i]))
}

# Set device to one plot again
par(mfrow=c(1,1))


# Make a correlation plot for the numerical values for both test and train
Keep <- c("TL", "TD", "TEMP", "Unix_Time")
corTrain <- cor(Train_data[ ,(names(Train_data) %in% Keep)], use = "complete.obs")
corrplot(corTrain, method = "number")

corTest <- cor(Test_data[ ,(names(Test_data) %in% Keep)], use = "complete.obs")
corrplot(corTest, method = "number")

```

```{r}
# All stations in train are also in test
Stations_Train <- Train_data$LOCATION
Stations_Test <- Test_data$LOCATION

all(Stations_Test %in% Stations_Train)

# Station number + sensor number HOW TO TEST!!!

Stat_Sens_train <- cbind(Train_data$LOCATION, Train_data$SENSOR)
Stat_Sens_test <- cbind(Test_data$LOCATION, Test_data$SENSOR)

all(Stat_Sens_test %in% Stat_Sens_train)

# Test number of days in test/train
unique(Train_data$DOY_Days)
unique(Test_data$DOY_Days)
unique(The_Days$DOY_Days)
```


Finally, we need to save the test/train sets as .Rda files, so that we can read them in in the next script to be pre-processed. Here we save to a USB stick, but you can also set to save to another folder.

```{r}
# Save train
#save(Train_data, file = "F:/KNMI/Train_data_R2_BIG.Rda")
#save(Train_data, file = "/run/media/kleingel/Lexar/KNMI/Train_data_R2.Rda")
save(Train_data, file = "/usr/people/kleingel/Projects/MLProject/Train_data_3D.Rda")

# Save test
#save(Test_data, file = "F:/KNMI/Test_data_R2_BIG.Rda")
#save(Test_data, file = "/run/media/kleingel/Lexar/KNMI/Test_data_R2.Rda")
save(Test_data, file = "/usr/people/kleingel/Projects/MLProject/Test_data_3D.Rda")

```