Question4.py.txt
Logistic Regression with Gradient Descent
We will use logistic regression to build a classifier to differentiate benign and malignant breast cancer cells based on cell features such as clump thickness, cell size, cell shape, and more. Source of data: UCI Machine Learning Repository.
Hopefully, you will be able to learn 3 things from this machine problem:
How to find the optimal weight vector for logistic regression
How to use 5-fold cross-validation to evaluate performance
How to tune the hyper-parameters (the stepsize in this case) using 5-fold cross-validation
The binary classification problem of this dataset ({benign, malignant}) lends itself nicely to logistic regression. We will build the classifier from the ground up using the rules you derived in the written portion of HW1 for logistic regression. As a refresher, we have to optimize the log likelihood to learn the weight vector w. We will use gradient descent to solve this optimization problem.
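Concretely, the gradient of the log likelihood with respect to w is sum_i (y_i - sigmoid(w^T x_i)) * x_i, so one full pass over the training data with a given stepsize updates the weights as
w <- w + stepsize * sum_i (y_i - sigmoid(w^T x_i)) * x_i
which is exactly the update applied once per epoch by gd_single_epoch in the code below.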
To evaluate performance, you will implement 5-fold cross-validation on the accuracy of your predictions against the labels. The idea is to split the data into 5 equal segments, train on 4 of the segments, and test on the remaining 1. You will repeat this for all 5 possible choices of the test segment and average the resulting accuracies.
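For instance, with the 300 training points each fold holds 60 points. A minimal sketch of the loop (reusing the gd and accuracy functions from Steps 2 and 3 and the default_stepsize global; the graded version is the five_fold_cross_validation_avg_accuracy function below):
x_folds, y_folds = np.split(x_train, 5), np.split(y_train, 5)
fold_accuracies = []
for i in range(5):
    x_tr = np.concatenate([x_folds[j] for j in range(5) if j != i])
    y_tr = np.concatenate([y_folds[j] for j in range(5) if j != i])
    w = gd(x_tr, y_tr, default_stepsize)  # train on the other 4 folds
    fold_accuracies.append(accuracy(w, x_folds[i], y_folds[i]))  # test on the held-out fold
average_accuracy = np.mean(fold_accuracies)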
Once we can compute the 5-fold cross-validation accuracy, we can tune the stepsize of gradient descent by generating a range of stepsizes and measuring the accuracy each one achieves. The best stepsize is the one that yields the highest average accuracy across the folds. (NOTE: For best practice we should be using nested cross-validation, but for this homework we will simply use non-nested cross-validation for parameter tuning.)
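As a sketch, the stepsize sweep can be as simple as the following (the graded version is the tune function below); np.argmax returns the first maximal index, so ties are broken in favor of the lower stepsize:
candidate_stepsizes = np.linspace(0.001, 0.01, num=10)
sweep_accuracies = [five_fold_cross_validation_avg_accuracy(x_train, y_train, s) for s in candidate_stepsizes]
best_stepsize = candidate_stepsizes[np.argmax(sweep_accuracies)]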
The classifier will be broken up into many pieces to aid your understanding of the problem. You will follow these steps to finish the problem:
Step 1: Implement a function to compute the sigmoid and gradient of the log likelihood function
Step 2: Implement a function to update the weight vector using the entire training data once (a single epoch) and a function that initializes a weight vector of zeros and runs 100 epochs (num_epoch_for_train). (These parameters are important for our grading purposes.)
Step 3: Implement a function to predict labels and test accuracy on some input dataset
Step 4: Implement 5-fold cross-validation and parameter tuning.
Step 5: Submit your answers according to the required OUTPUT below.
Note: Look at the INPUT and OUTPUT descriptions below. They spell out exactly what is provided and what needs to be returned. Also look at the function descriptions for specific details that are important for our grading purposes.
INPUT (These will be available as global variables to you):
x_train: 2D numpy array of size 300×10 with each data entry as [1, a1, a2, ..., a9] where the ai are features and the leading 1 is added for the bias term.
y_train: 1D numpy array of size 300 containing the labels of the data entries. A label of 0 represents benign and a label of 1 represents malignant.
x_test: 2D numpy array of size 100×10
y_test: 1D numpy array of size 100
num_epoch_for_train: number of epochs (passes through the entire training dataset) to be done for GD
default_stepsize: default stepsize
OUTPUT (You need to assign these variables):
w_single_epoch: 1D numpy array of size 10 representing a weight vector from running gd_single_epoch using a weight vector of zeros, x_train, y_train, and default_stepsize
w_optimized: 1D numpy array of size 10 representing a weight vector from running gd using x_train, y_train, and default_stepsize
y_predictions: 1D numpy array of size 100 representing predictions of the test data using the predict function you wrote with the w_optimized weight vector to predict x_test
five_fold_average_accuracy: float representing average accuracy from running five_fold_cross_validation_avg_accuracy on x_train, y_train, and default_stepsize
tuned_stepsize: float representing the best stepsize, chosen from the 10 stepsizes 0.001, 0.002, ..., 0.01 (intervals of 0.001) by calling the tune function that you wrote with x_train and y_train as parameters
Hint: This is how the reference output is prepared. You can do the same or prepare it using another method.
w_single_epoch = gd_single_epoch(np.zeros(len(x_train[0])), x_train, y_train, default_stepsize)
w_optimized = gd(x_train, y_train, default_stepsize)
y_predictions = np.fromiter((predict(w_optimized, xi) for xi in x_test), x_test.dtype)
five_fold_average_accuracy = five_fold_cross_validation_avg_accuracy(x_train, y_train, default_stepsize)
tuned_stepsize = tune(x_train, y_train)
*******************************************************
import numpy as np
def sigm(z):
    """
    Computes the sigmoid function
    :type z: float
    :rtype: float
    """
    return 1./(1. + np.exp(-z))

def compute_grad(w, x, y):
    """
    Computes the gradient of the log likelihood for logistic regression
    :type w: 1D numpy array of weights
    :type x: 2D numpy array of features where len(w) == len(x[0])
    :type y: 1D numpy array of labels where len(x) == len(y)
    :rtype: 1D numpy array
    """
    wx = np.dot(x, w)
    sigmwx = np.fromiter((sigm(wxi) for wxi in wx), wx.dtype)
    # Gradient of the log likelihood: sum_i (y_i - sigmoid(w . x_i)) * x_i
    return np.sum(np.multiply(x, (y - sigmwx).reshape((len(y), 1))), 0)

def gd_single_epoch(w, x, y, step):
    """
    Updates the weight vector by processing the entire training data once
    :type w: 1D numpy array of weights
    :type x: 2D numpy array of features where len(w) == len(x[0])
    :type y: 1D numpy array of labels where len(x) == len(y)
    :rtype: 1D numpy array of weights
    """
    return w + step * compute_grad(w, x, y)

def gd(x, y, stepsize):
    """
    Iteratively optimizes the objective function by first
    initializing the weight vector with zeros and then
    updating the weight vector by going through the
    training data num_epoch_for_train (global var) times
    :type x: 2D numpy array of features where len(w) == len(x[0])
    :type y: 1D numpy array of labels where len(x) == len(y)
    :type stepsize: float
    :rtype: 1D numpy array of weights
    """
    w = np.zeros(len(x[0]))
    for i in range(num_epoch_for_train):
        w = gd_single_epoch(w, x, y, stepsize)
    return w

def predict(w, x):
    """
    Makes a binary decision {0,1} based on the weight vector
    and the input features
    :type w: 1D numpy array of weights
    :type x: 1D numpy array of features of a single data point
    :rtype: integer {0,1}
    """
    if sigm(np.dot(x, w)) > 0.5:
        return 1
    return 0

def accuracy(w, x, y):
    """
    Calculates the proportion of correctly predicted results to the total
    :type w: 1D numpy array of weights
    :type x: 2D numpy array of features where len(w) == len(x[0])
    :type y: 1D numpy array of labels where len(x) == len(y)
    :rtype: float as a proportion of correct labels to the total
    """
    prediction = np.fromiter((predict(w, xi) for xi in x), x.dtype)
    return float(np.sum(np.equal(prediction, y))) / len(y)

def five_fold_cross_validation_avg_accuracy(x, y, stepsize):
    """
    Measures the 5 fold cross validation average accuracy
    Partition the data into five equal size sets like
    |-----|-----|-----|-----|-----|
    For each of the 5 possible choices of test fold, train on 4, test on 1.
    Compute the average accuracy using the accuracy function
    you wrote.
    :type x: 2D numpy array of features where len(w) == len(x[0])
    :type y: 1D numpy array of labels where len(x) == len(y)
    :type stepsize: float
    :rtype: float as average accuracy across the 5 folds
    """
    x_folds = np.split(x, 5)
    y_folds = np.split(y, 5)
    percent_acc_ith_test_fold = np.zeros(5)
    for i in range(5):
        # Hold out fold (i-1) % 5 for testing; the remaining four folds
        # (i, i+1, i+2, i+3 mod 5) form the training set.
        test_index = (i - 1) % 5
        x_test_fold = x_folds[test_index]
        y_test_fold = y_folds[test_index]
        x_train_folds = np.array([]).reshape((0, len(x[0])))
        y_train_folds = np.array([])
        for j in range(5 - 1):
            train_index = (i + j) % 5
            x_train_folds = np.concatenate((x_train_folds, x_folds[train_index]), axis=0)
            y_train_folds = np.concatenate((y_train_folds, y_folds[train_index]), axis=0)
        w = gd(x_train_folds, y_train_folds, stepsize)
        percent_acc_ith_test_fold[i] = accuracy(w, x_test_fold, y_test_fold)
    return np.average(percent_acc_ith_test_fold)

def tune(x, y):
    """
    Optimizes the stepsize by calculating five_fold_cross_validation_avg_accuracy
    with 10 different stepsizes from 0.001, 0.002, ..., 0.01 in intervals of 0.001 and
    outputs the stepsize with the highest accuracy
    For comparison:
    If two accuracies are equal, pick the lower stepsize.
    NOTE: For best practices, we should be using Nested Cross-Validation for
    hyper-parameter search. Without Nested Cross-Validation, we bias the model to the
    data. We will not implement nested cross-validation for now. You can experiment with
    it yourself.
    See: http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
    :type x: 2D numpy array of features where len(w) == len(x[0])
    :type y: 1D numpy array of labels where len(x) == len(y)
    :rtype: float as best stepsize
    """
    stepsizes = np.linspace(0.001, 0.01, num=10)
    acc = np.zeros(10)
    for (i, step) in enumerate(stepsizes):
        # Use the x and y passed in rather than the global training set.
        acc[i] = five_fold_cross_validation_avg_accuracy(x, y, step)
    # np.argmax returns the first maximal index, so ties pick the lower stepsize.
    return stepsizes[np.argmax(acc)]

# Top three tuned stepsize values
def tune_top_three(x, y):
    """
    Returns the three stepsizes (out of 0.001, 0.002, ..., 0.01) with the
    highest 5 fold cross validation average accuracy, in no particular order.
    """
    stepsizes = np.linspace(0.001, 0.01, num=10)
    acc = np.zeros(10)
    for (i, step) in enumerate(stepsizes):
        acc[i] = five_fold_cross_validation_avg_accuracy(x, y, step)
    # argpartition selects the indices of the three largest accuracies (unsorted).
    return stepsizes[np.argpartition(acc, -3)[-3:]]

# Correct Results
w_single_epoch = gd_single_epoch(np.zeros(len(x_train[0])), x_train, y_train, default_stepsize)
w_optimized = gd(x_train, y_train, default_stepsize)
y_predictions = np.fromiter((predict(w_optimized, xi) for xi in x_test), x_test.dtype)
five_fold_average_accuracy = five_fold_cross_validation_avg_accuracy(x_train, y_train, default_stepsize)
# tuned_stepsize = tune(x_train, y_train)
top_three_tuned_stepsizes = tune_top_three(x_train, y_train)