Let's discuss functions for a bit!
It's common knowledge that a function is a relationship
between a set of possible inputs and a set of possible outputs.

Theory allows this mapping to be totally random and useless,
but all practical and useful functions embody a pattern.

For instance, when we look at the plot of $\sin(\theta)$ we clearly see a wavy sinusoidal pattern.

This pattern lets us predict the value of some variable $y$ given a variable $x$, as long as certain conditions hold.

And that's essentially what linear regression is.

$$\large Everything\ has\ a\ Pattern.$$

$$\large Functions\ capture\ Patterns.$$

$$\large Complex\ Functions\ can\ capture\ Complex\ Patterns.$$

No surprise, then, when we think about conversational AIs like ChatGPT.

It's just a very complex (175-billion-parameter) function that has pretty accurately captured the pattern of human language. And not just human language: it has gone one step meta, capturing patterns that are based on top of
*given the conversation with the new word added and it generates
the next word, and so on until it predicts an "end of text" token,
which is weirdly similar to how we decide when to shut up!*

# Understanding Backpropagation from First Principles

It's easy to understand how neural networks produce useful output
once they have been assigned correct parameters.
the only way this thing will ever make sense. Remember thinking "when will I ever use this?"

Here goes another attempt at slaying the BackProp dragon!

# Part 1.1: Formalizing Forward Propagation

Before we can start understanding BackProp we must express forward propagation mathematically: the process of converting a neural network's input into its output.

The readers are expected to have a general idea about weights, biases, and artificial neurons.

<!-- do a section for getting beginners up to speed and link it here -->



## Single Neuron

For $k$ input features, the weighted input of a single neuron is given by the following equation:

$$
\large
z = \left( \sum_{i=1}^{k} w_{i} \cdot x_{i} \right) + b
$$

The output of this neuron is given by

$$
\large
a = \sigma(z)
$$

where $\sigma$ is the sigmoid function; it could be some other activation function too.

Another thing to note here is that the job of an activation function like the sigmoid is to squish the value of $z$ into a value between $0$ and $1$. This value is called the activation.
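
To make this concrete, here is a minimal NumPy sketch of a single neuron (the feature values, weights, and bias below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # squishes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# a made-up neuron with k = 3 input features
x = np.array([0.5, -1.2, 3.0])  # inputs x_1 .. x_k
w = np.array([0.4, 0.7, -0.2])  # weights w_1 .. w_k
b = 0.1                         # bias

z = np.dot(w, x) + b  # weighted input: z = (sum of w_i * x_i) + b
a = sigmoid(z)        # activation: a = sigma(z), squished into (0, 1)
print(z, a)           # roughly -1.14 and 0.24
```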


## Structure of the neural network

- Imagined from left to right, with a number of layers.
- Each layer is a vertical column of neurons.
- The leftmost layer is the input layer. Each neuron in this layer connects to all the neurons in the next layer through weights.
- The rightmost layer is the output layer. It is fully connected with the layer on its left. Its activations are the output of the network.
- All the layers in the middle are called hidden layers. These layers are fully connected on both sides: they receive input from the left and produce activations that get forwarded to the layer on the right.

<!-- insert a picture of a neural network -->


## Layout of the Weight matrix $w^L$

Each layer stores what is called a weight matrix.
This matrix stores all the weights that connect this layer to the previous layer. A weight is a number that decides the strength of a connection between two neurons.
We formally define a weight as:


### $$w_{jk}^L$$

Represents a weight that connects the $j_{th}$ neuron in layer $L$ (the current layer) to the $k_{th}$ neuron in layer $L-1$ (the previous layer).

Example: $w_{27}^4$ represents a weight that connects the $2_{nd}$ neuron in layer $4$ to the $7_{th}$ neuron in layer $3$.

<!-- insert picture of a weight between layer L and L-1 -->

The weight matrix for layer $L$ is denoted by
### $$w^L$$
The number of rows in $w^L$ is equal to the number of neurons in layer $L$, and the number of columns in $w^L$ is equal to the number of neurons in layer $L-1$.
$$
\large
w^L =
\begin{bmatrix}
w_{00}^L & \cdots & w_{0k}^L \\
\vdots & \ddots & \vdots \\
w_{j0}^L & \cdots & w_{jk}^L
\end{bmatrix}
$$
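
As a quick sanity check on this layout, here is a small NumPy sketch (the layer sizes are made up): for a layer $L$ with 4 neurons whose previous layer has 3 neurons, $w^L$ is a $4 \times 3$ matrix, and `w_L[j, k]` holds $w_{jk}^L$.

```python
import numpy as np

neurons_in_L = 4     # made-up size of the current layer
neurons_in_prev = 3  # made-up size of the previous layer

# one row per neuron in layer L, one column per neuron in layer L-1
w_L = np.random.randn(neurons_in_L, neurons_in_prev)

assert w_L.shape == (neurons_in_L, neurons_in_prev)
# w_L[j, k] is the weight connecting neuron j in layer L
# to neuron k in layer L-1
```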


## Definitions


$\huge b_{j}^L$
#### Represents a bias for the $j_{th}$ neuron in layer $L$
Example: $b_{4}^5$ represents the bias for the $4_{th}$ neuron in layer $5$

$\huge z_{j}^L$
#### Represents weighted input for neuron $j$ in layer $L$
Example: $z_{4}^5$ represents the weighted input for the $4_{th}$ neuron in layer $5$


$\huge a_{j}^L$
#### Represents activation for the $j_{th}$ neuron in layer $L$
Example: $a_{4}^5$ represents the activation for the $4_{th}$ neuron in layer $5$
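
These per-layer quantities map naturally onto arrays. Here is a sketch of one convenient (assumed) layout: keep one NumPy vector per layer, so that `b[L][j]` plays the role of $b_{j}^L$, and likewise for $z$ and $a$:

```python
import numpy as np

# made-up network with layer sizes 3, 4, 2 (input, hidden, output)
sizes = [3, 4, 2]

# one vector per layer; layer 0 is the input layer, which has no biases
# or weighted inputs, so a None placeholder keeps indices aligned with the math
b = [None] + [np.random.randn(n) for n in sizes[1:]]
z = [None] + [np.zeros(n) for n in sizes[1:]]
a = [np.zeros(n) for n in sizes]

print(b[1][2])  # b[L][j] is b_j^L: the bias of neuron 2 in layer 1
```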




## Calculations
The weighted input for a neuron $j$ in layer $L$ is calculated like this:

### $$z_{j}^L = \left( \sum_{k=0}^{k} w_{jk}^L \cdot a_{k}^{L-1} \right) + b_{j}^L$$

The activation for a neuron $j$ in layer $L$ is calculated like this:

### $$a_{j}^L = \sigma(z_{j}^L)$$
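
A direct, loop-based translation of these two per-neuron equations might look like this (a sketch; `w_L`, `b_L`, and `a_prev` are assumed to hold the weight matrix and bias vector of layer $L$ and the activation vector of layer $L-1$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward_loop(w_L, b_L, a_prev):
    """Compute z and a for every neuron j in layer L, one neuron at a time."""
    num_neurons = w_L.shape[0]
    z_L = np.zeros(num_neurons)
    for j in range(num_neurons):
        # z_j = (sum over k of w_jk * a_k) + b_j
        z_L[j] = np.dot(w_L[j, :], a_prev) + b_L[j]
    a_L = sigmoid(z_L)  # a_j = sigma(z_j), applied element-wise
    return z_L, a_L
```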


## Vectorized Form

When we're implementing neural networks we rely on vector/matrix
operations to take advantage of SIMD operations available on modern
CPUs and GPUs. Therefore we should express equations for forward/backward
propagation in vectorized forms.

### Weight Matrix

We've already defined the vectorized form of the weights as the weight matrix.
For a layer $L$ the weight matrix is given as:
$$
\large
w^L =
\begin{bmatrix}
w_{00}^L & \cdots & w_{0k}^L \\
\vdots & \ddots & \vdots \\
w_{j0}^L & \cdots & w_{jk}^L
\end{bmatrix}
$$

### Bias Vector

For a layer $L$ the bias vector is expressed as:

$$
\large
b^L =
\begin{bmatrix}
b_{0}^L \\
\vdots \\
b_{j}^L
\end{bmatrix}
$$

where $j+1$ is the number of neurons in layer $L$

### Weighted Input Vector

For a layer $L$ the weighted input of neurons in layer $L$ is expressed as:

$$ \large z^L = w^L \cdot a^{L-1} + b^L $$

where:
- $w^L$ is the weight matrix of layer $L$
- $a^{L-1}$ is the activation vector of layer $L-1$
- $b^L$ is the bias vector of layer $L$

### Activation Vector

For a layer $L$ the activation vector is expressed as:

$$ \large a^L = \sigma(z^L) $$

where:
- $z^L$ is the weighted input vector for layer $L$
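
In code, the vectorized forms collapse a whole layer into two lines; this sketch is the vectorized counterpart of the loop version above (same assumed names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(w_L, b_L, a_prev):
    z_L = w_L @ a_prev + b_L  # z^L = w^L . a^{L-1} + b^L
    return sigmoid(z_L)       # a^L = sigma(z^L)
```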


# Part 1.1: Summary

For any layer $L$, if we have:

- $w^L$ the weight matrix for layer $L$

- $a^{L-1}$ the activation vector for layer $L-1$

- $b^L$ the bias vector for layer $L$

- $\sigma$ an activation function

we can calculate $a^L$ using these two equations:

$$ z^L = w^L . a^{L-1} + b^L $$

and then we calculate $a^L$

$$ a^L = \sigma(z^L) $$

This in turn allows us to compute the activations of the next layer, $a^{L+1}$.

Also, when calculating $a^1$ we need $a^0$, but there's no such thing as layer $0$. Well, $a^0$ is always the input vector, so the activation of the first layer gets computed using the input. Makes sense!
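
Putting all of Part 1.1 together, a minimal forward pass over the whole network might look like this (a sketch: `weights` and `biases` are assumed to be Python lists holding $w^1 \dots w^{Fin}$ and $b^1 \dots b^{Fin}$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, weights, biases):
    """Return a^Fin for input vector x."""
    a = x  # a^0 is always the input vector
    for w_L, b_L in zip(weights, biases):
        z_L = w_L @ a + b_L  # z^L = w^L . a^{L-1} + b^L
        a = sigmoid(z_L)     # a^L = sigma(z^L)
    return a                 # a^Fin, the output of the network
```

For example, in a made-up 3-4-2 network (weight matrices shaped $4 \times 3$ and $2 \times 4$), a length-3 input `x` comes out as a length-2 output vector.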

If we think very carefully about the activation of the final layer, say $a^{Fin}$, we see that it is a function of three things:
- $x$ : the input vector
- $w$ : the set of all the weights in the network
- $b$ : the set of all the biases in the network

So the last thing to remember is that a neural network is a function of these three, and its value is given by the activation of the final layer.

$$ a^{Fin}(x,w,b) = \sigma(z^{Fin}(x,w,b)) $$

# Part 1.2: The Cost Function

Training a neural network is a cost minimization problem. There's a cost function that calculates how well the neural network is doing on the training data. The return value of the cost function is the error; the greater this error is in magnitude, the farther the neural network is from the desired state. Ideally the cost function should have values very close to zero.

But a value close to zero does not necessarily mean that the network has learned the pattern; it can also mean that the network has memorized the training set and overfitted itself to the noise in the data.


Anyway, let us develop our version of the cost function:

$a^{Fin}(x,w,b)$ is the output of the neural network given input vector $x$, weights $w$, and biases
$b$.

$y(x)$ is the desired output of the neural network for input vector $x$.

The difference between these two will be the error $Err(x,w,b)$:

$$ Err(x,w,b) = y(x) - a^{Fin}(x,w,b)$$

This will be a vector, since it's a difference of two vectors. But because we only care about the total loss and are not interested in the loss of the individual components, let us take the length/magnitude of this vector:

$$ Err(x,w,b) = || y(x) - a^{Fin}(x,w,b) ||$$

This expression results in a scalar value which tells us the total error in the network for a
training example $x$.
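
For a single training example this is a one-liner in NumPy (a sketch; the desired and actual output vectors below are made up):

```python
import numpy as np

# made-up desired output y(x) and network output a^Fin for one example
y = np.array([1.0, 0.0])
a_fin = np.array([0.8, 0.3])

# ||y(x) - a^Fin(x, w, b)||: the length of the error vector, a scalar
error = np.linalg.norm(y - a_fin)
print(error)  # roughly 0.36
```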

To convert this into a cost function we will have to express this error in terms of the entire training set, not just one $x$. Let's define it as a sum over the training set, so that we lose $x$ as a parameter:

$$ C(w,b) = \sum_{x} || y(x) - a^{Fin}(x,w,b) ||$$

The good news is that the error is now expressed in terms of just $w$ and $b$.

The bad news is that it's a sum of the errors over the entire training set, and so it does not do a good job of representing the error for a general input $x$.

But we can fix it by averaging this value over the entire training set:

$$\large C_{\tiny MAE}(w,b) = \frac{1}{n} \sum_{x} || y(x) - a^{Fin}(x,w,b) ||$$

where $n$ is the number of training examples.

This is a well-known error function called $Mean\ Absolute\ Error\ (MAE)$.

Let's upgrade this function to the $Mean\ Squared\ Error\ (MSE)$, also known as the quadratic cost function:

$$\large C_{\tiny MSE}(w,b) = \frac{1}{n} \sum_{x} || y(x) - a^{Fin}(x,w,b) ||^2$$


So it has been decided: we are going to use $MSE$ as our cost function on this journey of understanding backpropagation.
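
As a closing sketch, here is how the MSE cost could be computed over a training set, reusing the hypothetical `forward_propagate` from the earlier sketch (`training_set` is an assumed list of `(x, y)` pairs of NumPy vectors):

```python
import numpy as np

def mse_cost(training_set, weights, biases):
    """C(w, b) = (1/n) * sum over x of ||y(x) - a^Fin(x, w, b)||^2"""
    total = 0.0
    for x, y in training_set:
        a_fin = forward_propagate(x, weights, biases)
        err = y - a_fin            # error vector for this example
        total += np.dot(err, err)  # squared length: ||err||^2
    return total / len(training_set)  # average over the n examples
```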


# Summary of Parts 1.1 and 1.2

If you have reached this point, make sure the following things are true:

- We understand what a function is
- We have a sense of how input to a single neuron is converted to its output, and the relationship of $\large z$ and $\large a$
- We understand weight matrices, bias vectors, weighted input vectors, activation vectors.
- We understand that the neural network is a function of its weights, biases, and input.
- We understand forward propagation, i.e. how to start with an input vector $\large x$ and convert it to the
output vector $\large a^{Fin}$
- We understand how to calculate the error of the network using the MSE cost function.