-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathHW4_Q2.nb
More file actions
93 lines (85 loc) · 3.91 KB
/
HW4_Q2.nb
File metadata and controls
93 lines (85 loc) · 3.91 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
(* Content-type: application/vnd.wolfram.mathematica *)
(*** Wolfram Notebook File ***)
(* http://www.wolfram.com/nb *)
(* CreatedBy='Mathematica 9.0' *)
(*CacheID: 234*)
(* Internal cache information:
NotebookFileLineBreakTest
NotebookFileLineBreakTest
NotebookDataPosition[ 157, 7]
NotebookDataLength[ 3809, 84]
NotebookOptionsPosition[ 3489, 68]
NotebookOutlinePosition[ 3834, 83]
CellTagsIndexPosition[ 3791, 80]
WindowFrame->Normal*)
(* Beginning of Notebook Content *)
Notebook[{
Cell[CellGroupData[{
Cell["Question 3:", "Subsubsection",
CellChangeTimes->{{3.576957310914856*^9, 3.576957313575008*^9}}],
Cell[TextData[{
"\tAll Q-learning runs have been conducted using the \[Epsilon]-greedy \
approach. For all experiments the following parameters are constant:\n\t\t\t",
StyleBox["Learning Rate, \[Alpha] = 0.1\n\t\t\tDiscount Rate, \[Gamma] = 0.9\
\n\t\t\tSampled Episodes = 1,000",
FontSlant->"Italic"],
"\n\nInitial experiment has been conducted with \[Epsilon] =0 i.e. the next \
step in each episode is chosen greedily based on expected return the move \
yields. \[Epsilon] = 0 implies random exploration is done only in case \
multiple moves yield the same expected return.\n\nThe second experiment has \
been conducted with \[Epsilon] = 0.1. This implies that 10% of the time the \
greedy decision based on expected return of a move is not followed and a move \
other than the greedy decision is randomly selected. This noise introduced in \
the greedy approach induces random exploration of the neighboring cells in \
the grid.\n\nThe learning curve for both experiments suggests that although \
both approaches converge the range of convergence is broader in case of the \
experiment with induced noise in the decision making. The optimal solution \
for a 15x15 grid is 28 steps.\n\nIn case of the pure greedy approach the \
learning algorithm converges in the range of 28-50`ish never settling on 28 \
since random stepping kicks in every time the Q-values for multiple moves \
converges on the optimal and mutually equal values. Steadily the new Q-values \
backup the grid and converge to the optimal relative intervals yielding the \
optimal value yet again. This beahvior is nearly periodic but does not \
exhibit symmetry due to random stepping.\n\nIn case of the noise-induced \
greedy approach the algorithm does converge but within a slightly broader \
range of 28-70`ish. This is due to noise which induces random movement on the \
grid irrespective of the optimal move indicated by the corresponding \
Q-values. Also we notice that this randomness causes the converging learning \
curve to spend lesser time around the optimal \[OpenCurlyDoubleQuote]28 steps\
\[CloseCurlyDoubleQuote] value and explore more than the vanilla greedy \
approach. This increased randomness causes a larger time-span to pass before \
the optimal relative intervals amongst Q-values are established."
}], "Text",
CellChangeTimes->{
3.576957295911998*^9, {3.5769573499170866`*^9, 3.5769577752714157`*^9}, {
3.5769581832477503`*^9, 3.576958190745179*^9}, {3.5769583131161785`*^9,
3.5769586088020906`*^9}, {3.5769586401528835`*^9, 3.576958757488595*^9}, {
3.5769588150168858`*^9, 3.57695881685299*^9}, {3.5769588612845316`*^9,
3.5769590033916597`*^9}, {3.5769590989231243`*^9,
3.5769591001551943`*^9}, {3.57695913504419*^9, 3.5769593259501095`*^9}},
TextJustification->1]
}, Open ]]
},
WindowSize->{707, 505},
WindowMargins->{{72, Automatic}, {38, Automatic}},
FrontEndVersion->"9.0 for Microsoft Windows (64-bit) (November 20, 2012)",
StyleDefinitions->"Default.nb"
]
(* End of Notebook Content *)
(* Internal cache information *)
(*CellTagsOutline
CellTagsIndex->{}
*)
(*CellTagsIndex
CellTagsIndex->{}
*)
(*NotebookFileOutline
Notebook[{
Cell[CellGroupData[{
Cell[579, 22, 102, 1, 34, "Subsubsection"],
Cell[684, 25, 2789, 40, 638, "Text"]
}, Open ]]
}
]
*)
(* End of internal cache information *)