From ade62653b4fc5052d9685633023ac6d623388b66 Mon Sep 17 00:00:00 2001 From: Ajay Dhangar Date: Tue, 16 Dec 2025 19:06:32 +0530 Subject: [PATCH] added content for calculus --- .../calculus/chain-rule.mdx | 110 +++++++++++++++++ .../calculus/derivatives.mdx | 100 +++++++++++++++ .../mathematics-for-ml/calculus/gradients.mdx | 93 ++++++++++++++ .../mathematics-for-ml/calculus/hessian.mdx | 84 +++++++++++++ .../mathematics-for-ml/calculus/jacobian.mdx | 85 +++++++++++++ .../calculus/partial-derivatives.mdx | 114 ++++++++++++++++++ 6 files changed, 586 insertions(+) diff --git a/docs/machine-learning/mathematics-for-ml/calculus/chain-rule.mdx b/docs/machine-learning/mathematics-for-ml/calculus/chain-rule.mdx index e69de29..3e6017a 100644 --- a/docs/machine-learning/mathematics-for-ml/calculus/chain-rule.mdx +++ b/docs/machine-learning/mathematics-for-ml/calculus/chain-rule.mdx @@ -0,0 +1,110 @@ +--- +title: "Chain Rule - The Engine of Backpropagation" +sidebar_label: Chain Rule +description: "Mastering the Chain Rule, the fundamental calculus tool for differentiating composite functions, and its direct application in the Backpropagation algorithm for training neural networks." +tags: + [ + chain-rule, + calculus, + mathematics-for-ml, + backpropagation, + composite-functions, + neural-networks, + gradient, + ] +--- + +The **Chain Rule** is a formula used to compute the derivative of a **composite function**, a function that is composed of one function inside another. If a function is built like a chain, the Chain Rule shows us how to differentiate it link by link. + +This is arguably the most important calculus concept for Deep Learning, as the entire structure of a neural network is one massive composite function. + +## 1. What is a Composite Function? + +A composite function is one where the output of an inner function becomes the input of an outer function. + +If $y$ is a function of $u$, and $u$ is a function of $x$, then $y$ is ultimately a function of $x$. 
+ +$$ +y = f(u) \quad \text{where} \quad u = g(x) +$$ + +The overall composite function is $y = f(g(x))$. + +## 2. The Chain Rule Formula (Single Variable) + +The Chain Rule states that the rate of change of $y$ with respect to $x$ is the product of the rate of change of $y$ with respect to $u$, and the rate of change of $u$ with respect to $x$. + +$$ +\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} +$$ + +### Example + +Let $y = (x^2 + 1)^3$. This can be written as $y = u^3$ where $u = x^2 + 1$. + +1. **Find $\frac{dy}{du}$ (Outer derivative):** + $$ + \frac{dy}{du} = \frac{d}{du}(u^3) = 3u^2 + $$ +2. **Find $\frac{du}{dx}$ (Inner derivative):** + $$ + \frac{du}{dx} = \frac{d}{dx}(x^2 + 1) = 2x + $$ +3. **Apply the Chain Rule:** + $$ + \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = (3u^2) \cdot (2x) + $$ +4. **Substitute $u$ back:** + $$ + \frac{dy}{dx} = 3(x^2 + 1)^2 \cdot 2x = 6x(x^2 + 1)^2 + $$ + +## 3. The Chain Rule with Multiple Variables (Partial Derivatives) + +In neural networks, one variable can affect the final output through multiple different paths. This requires a slightly more complex version of the Chain Rule involving partial derivatives and summation. + +If $z$ is a function of $x$ and $y$, and both $x$ and $y$ are functions of $t$: $z = f(x, y)$, where $x=g(t)$ and $y=h(t)$. + +The total derivative of $z$ with respect to $t$ is: + +$$ +\frac{dz}{dt} = \frac{\partial z}{\partial x} \frac{dx}{dt} + \frac{\partial z}{\partial y} \frac{dy}{dt} +$$ + +## 4. The Chain Rule and Backpropagation + +Backpropagation (short for "backward propagation of errors") is the algorithm used to train neural networks. It is nothing more than the repeated application of the multivariate Chain Rule. 
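As a quick sanity check on the claim that backpropagation is repeated chain-rule multiplication, the following minimal sketch revisits the worked example $y = (x^2 + 1)^3$ from Section 2. It multiplies the local derivatives link by link and compares the result with a central finite-difference estimate. The function names (`forward`, `backward`, `numeric_derivative`), the step size `h`, and the test point `x = 2.0` are illustrative assumptions, not part of any library.

```python
def forward(x):
    """Forward pass: compute the inner value u and the output y."""
    u = x**2 + 1  # inner function u = g(x)
    y = u**3      # outer function y = f(u)
    return u, y


def backward(x, u):
    """Backward pass: multiply local derivatives from the output back to x."""
    dy_du = 3 * u**2  # outer derivative dy/du
    du_dx = 2 * x     # inner derivative du/dx
    return dy_du * du_dx


def numeric_derivative(x, h=1e-6):
    """Central finite-difference estimate of dy/dx (a check, not a training method)."""
    return (forward(x + h)[1] - forward(x - h)[1]) / (2 * h)


x = 2.0
u, y = forward(x)
analytic = backward(x, u)        # closed form: 6x(x^2 + 1)^2
numeric = numeric_derivative(x)  # should agree closely with the analytic value
print(analytic, numeric)
```

The forward pass stores the intermediate value `u` so the backward pass can reuse it. Reverse-mode automatic differentiation relies on the same idea, though real frameworks organize it quite differently.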
+ +### The Neural Network Chain + +A neural network is a sequence of composite functions: + +$$ +\text{Loss} \leftarrow \text{Output Layer} \leftarrow \text{Hidden Layer 2} \leftarrow \text{Hidden Layer 1} \leftarrow \text{Input} +$$ + +The goal is to calculate how a small change in a parameter (weight $w$) in an **early layer** affects the final **Loss** $J$. + +$$ +\frac{\partial J}{\partial w_{\text{early}}} = \left(\frac{\partial J}{\partial \text{Output}}\right) \cdot \left(\frac{\partial \text{Output}}{\partial \text{Layer } 2}\right) \cdot \left(\frac{\partial \text{Layer } 2}{\partial \text{Layer } 1}\right) \cdot \left(\frac{\partial \text{Layer } 1}{\partial w_{\text{early}}}\right) +$$ + +:::important Backpropagation Flow +1. **Forward Pass:** Calculate the prediction and the Loss $J$. +2. **Backward Pass (Backprop):** Start at the end of the chain (the Loss $J$) and calculate the partial derivatives (gradients) layer by layer, multiplying them backward toward the input. +3. **Update:** Use the final calculated gradient $\frac{\partial J}{\partial w}$ to update the weight $w$ via Gradient Descent. +::: + +## 5. Summary of Calculus for ML + +You have now covered the three foundational concepts of Calculus required for Machine Learning: + +| Concept | Mathematical Tool | ML Application | +| :--- | :--- | :--- | +| **Derivatives** | $\frac{df}{dx}$ | Measures the slope of the loss function. | +| **Partial Derivatives** | $\nabla J$ (The Gradient) | Identifies the direction of steepest ascent in the high-dimensional loss surface. | +| **Chain Rule** | $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$ | Propagates the gradient backward through all layers of a neural network to calculate parameter updates. | + +--- + +With the mathematical foundations of Linear Algebra and Calculus established, we are now ready to tackle the core optimization algorithm that brings these concepts together: Gradient Descent. 
\ No newline at end of file diff --git a/docs/machine-learning/mathematics-for-ml/calculus/derivatives.mdx b/docs/machine-learning/mathematics-for-ml/calculus/derivatives.mdx index e69de29..d7e858f 100644 --- a/docs/machine-learning/mathematics-for-ml/calculus/derivatives.mdx +++ b/docs/machine-learning/mathematics-for-ml/calculus/derivatives.mdx @@ -0,0 +1,100 @@ +--- +title: "Derivatives - The Rate of Change" +sidebar_label: Derivatives +description: "An introduction to derivatives, their definition, rules, and their crucial role in calculating the slope of the loss function, essential for optimization algorithms like Gradient Descent." +tags: + [ + derivatives, + calculus, + mathematics-for-ml, + rate-of-change, + slope, + gradient-descent, + optimization, + ] +--- + +Calculus is the mathematical foundation for optimization in Machine Learning. Specifically, **Derivatives** are the primary tool used to train almost every ML model, from Linear Regression to complex Neural Networks, via algorithms like Gradient Descent. + +## 1. What is a Derivative? + +The derivative of a function measures the **instantaneous rate of change** of that function. Geometrically, the derivative at any point on a curve is the **slope of the tangent line** to the curve at that point. + +### Formal Definition + +The derivative of a function $f(x)$ with respect to $x$ is defined using limits: + +$$ +f'(x) = \frac{dy}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} +$$ + +* $\frac{dy}{dx}$ is the common notation, read as "the derivative of $y$ with respect to $x$." +* The expression $\frac{f(x+h) - f(x)}{h}$ is the slope of the secant line between $x$ and $x+h$. +* Taking the limit as $h$ approaches zero gives the exact slope of the tangent line at $x$. + +## 2. 
Derivatives in Machine Learning: Optimization + +In Machine Learning, we define a **Loss Function** (or Cost Function) $J(\theta)$ which measures the error of our model, where $\theta$ represents the model's parameters (weights and biases). + +The goal of training is to find the parameter values $\theta$ that **minimize** the loss function. + +### A. Finding the Minimum + +1. A function's minimum (or maximum) occurs where the slope is zero. +2. The derivative tells us the slope. +3. Therefore, by setting the derivative $\frac{dJ}{d\theta}$ to zero, we can find the optimal parameters $\theta$. + +### B. Gradient Descent + +For most complex ML models, the loss function is too complex to solve by setting the derivative to zero directly. Instead, we use an iterative process called **Gradient Descent**. + +The derivative $\frac{dJ}{d\theta}$ tells us two things: + +* **Magnitude:** How steep the slope is (how quickly the loss is changing). +* **Direction (Sign):** Whether moving parameter $\theta$ in a positive direction will increase or decrease the loss. + +In Gradient Descent, we update the parameter $\theta$ in the **opposite direction** of the derivative (down the slope) to find the minimum: + +$$ +\theta_{\text{new}} = \theta_{\text{old}} - \alpha \frac{dJ}{d\theta} +$$ + +* $\alpha$ (alpha) is the **learning rate** (a small scalar). +* $\frac{dJ}{d\theta}$ is the derivative (the slope/gradient). + +## 3. Basic Differentiation Rules + +You must be familiar with the following rules to understand how derivatives are calculated for model training. 
+ +| Rule Name | Function $f(x)$ | Derivative $\frac{d}{dx}f(x)$ | Example | +| :--- | :--- | :--- | :--- | +| **Constant Rule** | $c$ | $0$ | $\frac{d}{dx}(5) = 0$ | +| **Power Rule** | $x^n$ | $nx^{n-1}$ | $\frac{d}{dx}(x^3) = 3x^2$ | +| **Constant Multiple** | $c \cdot f(x)$ | $c \cdot f'(x)$ | $\frac{d}{dx}(4x^2) = 8x$ | +| **Sum/Difference** | $f(x) \pm g(x)$ | $f'(x) \pm g'(x)$ | $\frac{d}{dx}(x^2 - 3x) = 2x - 3$ | +| **Exponential** | $e^x$ | $e^x$ | | + +### Example: Quadratic Loss + +Linear Regression often uses Mean Squared Error (MSE), which is a quadratic function of the weights $w$. + +Let the simplified loss function be $J(w) = w^2 + 4w + 1$. +We apply the Sum and Power Rules: + +$$ +\frac{dJ}{dw} = \frac{d}{dw}(w^2) + \frac{d}{dw}(4w) + \frac{d}{dw}(1) = 2w + 4 + 0 = 2w + 4 +$$ + +If the current weight is $w=1$, the slope is $2(1) + 4 = 6$ (steep, positive). + +## References and Resources + +To solidify your understanding of differentiation, here are some excellent resources: + +* **[Khan Academy - Differential Calculus](https://www.khanacademy.org/math/differential-calculus):** Comprehensive video tutorials covering limits, derivatives, and rules. Excellent for visual learners. +* **Calculus: Early Transcendentals** by James Stewart (or any similar major textbook): Provides rigorous definitions and practice problems. +* **The Calculus of Computation** by Lars Kristensen: A good resource that connects calculus directly to computational methods. + +--- + +Most functions in ML depend on more than one parameter (e.g., $w_1, w_2, \text{bias}$). To find the slope in these multi-variable spaces, we must use Partial Derivatives. 
\ No newline at end of file diff --git a/docs/machine-learning/mathematics-for-ml/calculus/gradients.mdx b/docs/machine-learning/mathematics-for-ml/calculus/gradients.mdx index e69de29..b30a1ce 100644 --- a/docs/machine-learning/mathematics-for-ml/calculus/gradients.mdx +++ b/docs/machine-learning/mathematics-for-ml/calculus/gradients.mdx @@ -0,0 +1,93 @@ +--- +title: "Gradients - The Direction of Steepest Ascent" +sidebar_label: Gradients +description: "Defining the Gradient vector, its mathematical composition from partial derivatives, its geometric meaning as the direction of maximum increase, and its role as the central mechanism for learning in Machine Learning." +tags: + [ + gradients, + calculus, + mathematics-for-ml, + gradient-descent, + vector-calculus, + optimization, + loss-function, + ] +--- + +The **Gradient** is the ultimate expression of calculus in Machine Learning. It is the vector that consolidates all the partial derivatives of a multi-variable function (like our Loss Function) and points in the direction the function is increasing most rapidly. + +Understanding the gradient is essential because the primary optimization algorithm, **Gradient Descent**, simply involves moving in the direction *opposite* to the gradient. + +## 1. Defining the Gradient Vector + +The gradient of a scalar-valued function $f$ of several variables ($\theta_1, \theta_2, \dots, \theta_n$) is a **vector** that contains all the function's partial derivatives. 
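To make the definition concrete, here is a minimal sketch that assembles a gradient from one partial derivative per parameter, each estimated with a central finite difference. The example function `f`, the helper `numerical_gradient`, and the step size `h` are assumptions chosen for illustration. In practice, frameworks compute gradients analytically rather than by finite differences.

```python
def f(theta):
    """Illustrative scalar-valued function of two parameters."""
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1]


def numerical_gradient(func, theta, h=1e-6):
    """Estimate each partial derivative by nudging one parameter at a time."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += h   # move only parameter i forward
        minus[i] -= h  # move only parameter i backward
        grad.append((func(plus) - func(minus)) / (2 * h))
    return grad


theta = [1.0, 2.0]
grad = numerical_gradient(f, theta)
# Analytic partials for comparison:
#   df/dtheta_1 = 2*theta_1 + 3*theta_2,  df/dtheta_2 = 3*theta_1
expected = [2 * theta[0] + 3 * theta[1], 3 * theta[0]]
print(grad, expected)
```

Each entry of `grad` holds all other parameters fixed while one varies, which is exactly what a partial derivative measures.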
+ +### Notation + +The gradient of a function $J(\mathbf{\theta})$ (our Loss Function, $J$) with respect to the parameter vector $\mathbf{\theta}$ is denoted by the $\nabla$ symbol (nabla or del): + +$$ +\nabla J(\mathbf{\theta}) \quad \text{or} \quad \nabla_{\mathbf{\theta}} J +$$ + +### Composition + +If the loss function $J$ depends on $n$ parameters, $\mathbf{\theta} = (\theta_1, \theta_2, \dots, \theta_n)$, the gradient is the $n$-dimensional vector: + +$$ +\nabla J(\mathbf{\theta}) = \begin{bmatrix} +\frac{\partial J}{\partial \theta_1} \\ +\frac{\partial J}{\partial \theta_2} \\ +\vdots \\ +\frac{\partial J}{\partial \theta_n} +\end{bmatrix} +$$ + +## 2. Geometric Meaning + +The Gradient $\nabla J$ is the vector that has two crucial geometric properties: + +1. **Direction:** It points in the direction of the **steepest increase** (the fastest way uphill) on the function's surface. +2. **Magnitude (Length):** The length of the gradient vector, $||\nabla J||$, indicates the **steepness** of the slope in that direction. + +## 3. The Central Role in Gradient Descent + +Since the goal of training an ML model is to **minimize** the Loss Function $J(\mathbf{\theta})$, we must adjust the parameters $\mathbf{\theta}$ to move *downhill*. + +The most effective path downhill is to move in the exact opposite direction of the gradient. + +### A. The Update Rule + +The Gradient Descent update rule formalizes this movement: + +$$ +\mathbf{\theta}_{\text{new}} = \mathbf{\theta}_{\text{old}} - \alpha \nabla J(\mathbf{\theta}_{\text{old}}) +$$ + +| Term | Role in Optimization | Calculation | +| :--- | :--- | :--- | +| $\mathbf{\theta}_{\text{old}}$ | Current position (weights/biases). | Vector of current model parameters. | +| $\alpha$ (Alpha) | **Learning Rate** (a small scalar). | Hyperparameter defining the step size. | +| $\nabla J(\mathbf{\theta})$ | **Gradient** of the Loss. | Vector of all partial derivatives. | +| $-\nabla J(\mathbf{\theta})$ | **Negative Gradient**. 
| The direction of steepest descent (downhill). | + +### B. Convergence + +As the parameters approach the minimum (the "valley floor"), the slope of the Loss Function flattens. + +* At the minimum point, the Loss is flat, so all partial derivatives are zero. +* Therefore, the gradient $\nabla J$ is the zero vector ($\mathbf{0}$). +* The update step becomes $\mathbf{\theta}_{\text{new}} = \mathbf{\theta}_{\text{old}} - \alpha \cdot \mathbf{0}$. The parameters stop changing, and the model has converged. + +## 4. Analogy: Descending a Mountain + +Imagine being blindfolded on a vast mountain range (the Loss Surface). Your goal is to reach the valley floor (the minimum loss). + +* **You can't see the whole mountain:** You only know your local height and slope (your current loss $J(\mathbf{\theta})$). +* **The Gradient ($\nabla J$):** A guide who tells you, "The fastest way to go **up** from here is to take 3 steps North and 1 step East." +* **Gradient Descent:** You ignore the guide's direction and decide, "I will move the **opposite** of what you say," taking 3 steps South and 1 step West. +* **Learning Rate ($\alpha$):** Determines if your step size is a cautious hop or a giant leap. + +--- + +The Gradient is the core concept uniting all the calculus concepts we've covered. It moves the model from an initial, poor starting position to an optimal, converged solution. \ No newline at end of file diff --git a/docs/machine-learning/mathematics-for-ml/calculus/hessian.mdx b/docs/machine-learning/mathematics-for-ml/calculus/hessian.mdx index e69de29..afc1525 100644 --- a/docs/machine-learning/mathematics-for-ml/calculus/hessian.mdx +++ b/docs/machine-learning/mathematics-for-ml/calculus/hessian.mdx @@ -0,0 +1,84 @@ +--- +title: "The Hessian Matrix" +sidebar_label: Hessian +description: "Understanding the Hessian matrix, second-order derivatives, and how the curvature of the loss surface impacts optimization and model stability." 
+tags: + [ + hessian, + calculus, + mathematics-for-ml, + optimization, + second-order-derivatives, + curvature, + ] +--- + +While the **Gradient** tells us the direction of the steepest slope, it doesn't tell us about the "shape" of the ground. Is the slope getting steeper or flatter? Are we in a narrow canyon or a wide, shallow bowl? To answer these questions, we need second-order derivatives, organized into the **Hessian Matrix**. + +## 1. What is the Hessian? + +The Hessian is a square matrix of **second-order partial derivatives** of a scalar-valued function. It describes the local **curvature** of the function. + +If we have a function $f(x_1, x_2, \dots, x_n)$, the Hessian $\mathbf{H}$ is an $n \times n$ matrix: + +$$ +\mathbf{H} = \begin{bmatrix} +\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \dots \\ +\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \dots \\ +\vdots & \vdots & \ddots +\end{bmatrix} +$$ + +:::info Symmetry +If the second derivatives are continuous, the Hessian is a **symmetric matrix** (i.e., $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$). This makes it easier to work with using Linear Algebra tools like Eigen-decomposition. +::: + +## 2. Why does the Hessian matter in ML? + +The Hessian helps us understand the "topography" of the Loss Function $J(\theta)$. + +### A. Determining Maxima and Minima +The gradient only tells us if the slope is zero ($\nabla J = 0$), but that could be a peak, a valley, or a saddle point. The Hessian tells us which one: +* **Positive Definite Hessian:** The surface curves upward in all directions (a **Local Minimum**). +* **Negative Definite Hessian:** The surface curves downward in all directions (a **Local Maximum**). +* **Indefinite Hessian:** The surface curves up in some directions and down in others (a **Saddle Point**). + +### B. 
Curvature and Learning Rates +The Hessian determines the "width" of the valley: +* **High Curvature:** A narrow, steep valley. If the learning rate is too high, Gradient Descent will bounce back and forth across the valley walls. +* **Low Curvature:** A wide, flat valley. Gradient Descent will move very slowly toward the bottom. + +## 3. Second-Order Optimization + +Standard Gradient Descent is a **first-order** method; it only uses the gradient. There are **second-order** methods, like **Newton's Method**, that use the Hessian to take much more efficient steps. + +Instead of just moving in the negative gradient direction, Newton's method scales the step by the inverse of the Hessian: + +$$ +\theta_{new} = \theta_{old} - \mathbf{H}^{-1} \nabla J(\theta) +$$ + +:::caution The Computational Catch +In modern Deep Learning, the Hessian is rarely used directly. If a model has 10 million parameters, the Hessian matrix would have $10^{14}$ elements (100 trillion!), which is impossible to store in memory or invert. We use "quasi-Newton" methods or adaptive optimizers (like Adam) that approximate this curvature information. +::: + +## 4. Example Calculation + +Let $f(x, y) = x^2 + 4xy + y^2$. + +1. **First Partial Derivatives (Gradient):** + * $f_x = 2x + 4y$ + * $f_y = 4x + 2y$ +2. **Second Partial Derivatives (Hessian):** + * $f_{xx} = \frac{\partial}{\partial x}(2x + 4y) = 2$ + * $f_{yy} = \frac{\partial}{\partial y}(4x + 2y) = 2$ + * $f_{xy} = \frac{\partial}{\partial y}(2x + 4y) = 4$ + +The Hessian matrix is: +$$ +\mathbf{H} = \begin{bmatrix} 2 & 4 \\ 4 & 2 \end{bmatrix} +$$ + +--- + +Now that we have covered the mathematics of change (Calculus), we need to look at the mathematics of uncertainty. This allows us to handle noisy data and make predictions with confidence. 
\ No newline at end of file diff --git a/docs/machine-learning/mathematics-for-ml/calculus/jacobian.mdx b/docs/machine-learning/mathematics-for-ml/calculus/jacobian.mdx index e69de29..0764804 100644 --- a/docs/machine-learning/mathematics-for-ml/calculus/jacobian.mdx +++ b/docs/machine-learning/mathematics-for-ml/calculus/jacobian.mdx @@ -0,0 +1,85 @@ +--- +title: "The Jacobian Matrix" +sidebar_label: Jacobian +description: "Understanding the Jacobian matrix, its role in vector-valued functions, and its vital importance in backpropagation and modern deep learning frameworks." +tags: + [ + jacobian, + calculus, + mathematics-for-ml, + vector-calculus, + backpropagation, + neural-networks, + ] +--- + +While the **Gradient** is used for functions that take a vector and return a single scalar (like a Loss Function), the **Jacobian** is the generalization of the derivative for functions that take a vector and return *another vector*. + +In Deep Learning, almost every layer in a neural network is a vector-valued function. To pass gradients backward through these layers, we use the Jacobian. + +## 1. What is the Jacobian? + +The Jacobian is a matrix of all first-order partial derivatives of a vector-valued function. + +Suppose we have a function $\mathbf{f}$ that maps an input vector $\mathbf{x}$ of size $n$ to an output vector $\mathbf{y}$ of size $m$: + +$$ +\mathbf{y} = \mathbf{f}(\mathbf{x}) \quad \text{where} \quad \mathbf{x} \in \mathbb{R}^n, \mathbf{y} \in \mathbb{R}^m +$$ + +The Jacobian matrix $\mathbf{J}$ is an $m \times n$ matrix where each entry $(i, j)$ represents how much the $i$-th output changes with respect to the $j$-th input. + +$$ +\mathbf{J} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} +\frac{\partial y_1}{\partial x_1} & \dots & \frac{\partial y_1}{\partial x_n} \\ +\vdots & \ddots & \vdots \\ +\frac{\partial y_m}{\partial x_1} & \dots & \frac{\partial y_m}{\partial x_n} +\end{bmatrix} +$$ + + +## 2. 
Why does the Jacobian matter in ML? + +The Jacobian is the mathematical bridge that allows the **Chain Rule** to work across entire layers of a neural network. + +### A. Backpropagation across Layers +Imagine a layer in a network that takes an input vector $\mathbf{h}_{in}$ and produces an output vector $\mathbf{h}_{out}$. During backpropagation, we receive the gradient of the loss $L$ with respect to the output: $\frac{\partial L}{\partial \mathbf{h}_{out}}$. + +To continue the "chain" and find the gradient with respect to the input, we must multiply by the Jacobian of that layer: + +$$ +\frac{\partial L}{\partial \mathbf{h}_{in}} = \frac{\partial L}{\partial \mathbf{h}_{out}} \cdot \mathbf{J} +$$ + +### B. Activation Functions +When you apply an activation function like **Sigmoid** or **ReLU** to a vector, you are essentially creating a vector-to-vector mapping. The derivative of this mapping is a Jacobian matrix. For element-wise activations, this Jacobian is a **diagonal matrix**, which makes computation very efficient. + +## 3. Example Calculation + +Let's say we have a function $\mathbf{f}(x_1, x_2)$ that outputs a 2D vector: +1. $y_1 = x_1^2 + x_2$ +2. $y_2 = 5x_1 + 2x_2^3$ + +To find the Jacobian $\mathbf{J}$: + +* **Row 1 (Derivatives of $y_1$):** + * $\frac{\partial y_1}{\partial x_1} = 2x_1$ + * $\frac{\partial y_1}{\partial x_2} = 1$ +* **Row 2 (Derivatives of $y_2$):** + * $\frac{\partial y_2}{\partial x_1} = 5$ + * $\frac{\partial y_2}{\partial x_2} = 6x_2^2$ + +The resulting Jacobian matrix is: +$$ +\mathbf{J} = \begin{bmatrix} 2x_1 & 1 \\ 5 & 6x_2^2 \end{bmatrix} +$$ + +## 4. Scaling the Chain Rule + +In modern frameworks like PyTorch or TensorFlow, we rarely compute the full Jacobian matrix explicitly because it can be massive (e.g., $1 \text{ million } \times 1 \text{ million}$ for a large layer). + +Instead, these frameworks perform **Vector-Jacobian Products (VJP)**. 
They directly calculate the result of $\mathbf{v}^T \mathbf{J}$ (where $\mathbf{v}$ is the incoming gradient), which is much faster and uses less memory. + +--- + +The Jacobian handles first-order changes. But to understand the "curvature" of our loss surface (whether we are in a narrow valley or a wide bowl), we need to look at second-order derivatives: The Hessian. \ No newline at end of file diff --git a/docs/machine-learning/mathematics-for-ml/calculus/partial-derivatives.mdx b/docs/machine-learning/mathematics-for-ml/calculus/partial-derivatives.mdx index e69de29..002017b 100644 --- a/docs/machine-learning/mathematics-for-ml/calculus/partial-derivatives.mdx +++ b/docs/machine-learning/mathematics-for-ml/calculus/partial-derivatives.mdx @@ -0,0 +1,114 @@ +--- +title: Partial Derivatives +sidebar_label: Partial Derivatives +description: "Defining partial derivatives, how they are calculated in multi-variable functions (like the Loss Function), and their role in creating the Gradient vector for optimization." +tags: + [ + partial-derivatives, + calculus, + mathematics-for-ml, + multi-variable, + gradient, + optimization, + ] +--- + +In the real world, and especially in Machine Learning, the functions we deal with rarely depend on just one variable. Our **Loss Function** depends on hundreds, even millions, of parameters (weights and biases). To navigate this high-dimensional space, we need the concept of a **Partial Derivative**. + +## 1. Multi-Variable Functions in ML + +Consider the simplest linear model. The predicted output $\hat{y}$ is a function of the input feature $x$, the weight $w$, and the bias $b$: + +$$ +\hat{y}(w, b) = w x + b +$$ + +The **Loss Function** $J$ (e.g., Mean Squared Error) depends on the parameters $w$ and $b$: + +$$ +J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2 +$$ + +The goal is to adjust *both* $w$ and $b$ simultaneously to minimize $J$. 
This requires us to know the rate of change of $J$ with respect to each parameter individually. + +## 2. What is a Partial Derivative? + +A partial derivative of a multi-variable function is simply the derivative of that function with respect to **one** variable, while treating all other variables as **constants**. + +### Notation + +The partial derivative of a function $f(x, y)$ with respect to $x$ is denoted by the curly $\partial$ symbol: + +$$ +\frac{\partial f}{\partial x} \quad \text{or} \quad f_x +$$ + +## 3. Calculating the Partial Derivative + +The calculation uses all the standard derivative rules, but with the assumption that everything not being differentiated is a constant. + +### Example + +Let the function be $f(x, y) = 3x^2 + 5xy^3 + 2y$. + +#### A. Partial Derivative with respect to $x$: $\frac{\partial f}{\partial x}$ + +We treat $y$ as a constant. + +$$ +\frac{\partial f}{\partial x} = \frac{\partial}{\partial x} (3x^2) + \frac{\partial}{\partial x} (5y^3 \cdot x) + \frac{\partial}{\partial x} (2y) +$$ + +1. $\frac{\partial}{\partial x} (3x^2) = 6x$ (Power Rule) +2. $\frac{\partial}{\partial x} (5y^3 \cdot x) = 5y^3$ (Treat $5y^3$ as the constant coefficient of $x$) +3. $\frac{\partial}{\partial x} (2y) = 0$ (Treat $2y$ as a constant) + +$$ +\frac{\partial f}{\partial x} = 6x + 5y^3 +$$ + +#### B. Partial Derivative with respect to $y$: $\frac{\partial f}{\partial y}$ + +We treat $x$ as a constant. + +$$ +\frac{\partial f}{\partial y} = \frac{\partial}{\partial y} (3x^2) + \frac{\partial}{\partial y} (5x \cdot y^3) + \frac{\partial}{\partial y} (2y) +$$ + +1. $\frac{\partial}{\partial y} (3x^2) = 0$ (Treat $3x^2$ as a constant) +2. $\frac{\partial}{\partial y} (5x \cdot y^3) = 5x \cdot (3y^2) = 15xy^2$ (Treat $5x$ as the constant coefficient of $y^3$) +3. $\frac{\partial}{\partial y} (2y) = 2$ (Constant Multiple Rule) + +$$ +\frac{\partial f}{\partial y} = 15xy^2 + 2 +$$ + +## 4. 
The Gradient Vector + +When we collect all the partial derivatives of a multi-variable function $J(\theta_1, \theta_2, \ldots, \theta_n)$ into a single vector, we get the **Gradient** of $J$. + +The gradient, denoted $\nabla J$ (read as "nabla J" or "del J"), is the vector of all partial derivatives: + +$$ +\nabla J(\theta) = \begin{bmatrix} +\frac{\partial J}{\partial \theta_1} \\ +\frac{\partial J}{\partial \theta_2} \\ +\vdots \\ +\frac{\partial J}{\partial \theta_n} +\end{bmatrix} +$$ + +### Significance in ML + +1. **Direction of Steepest Ascent:** The gradient vector $\nabla J$ points in the direction of the **steepest increase** (the "uphill" direction) of the Loss Function $J$. +2. **Gradient Descent:** Since we want to **minimize** the loss, the **Gradient Descent** update rule is to move the parameter vector $\mathbf{\theta}$ in the direction **opposite** to the gradient: + +$$ +\mathbf{\theta}_{\text{new}} = \mathbf{\theta}_{\text{old}} - \alpha \nabla J(\mathbf{\theta}) +$$ + +This single vector operation updates *all* weights and biases simultaneously, making the gradient the most fundamental mathematical quantity in training neural networks. + +--- + +Now that we can find the direction of steepest ascent (the Gradient), we must ensure that the update rule accurately propagates this signal through the entire multi-layered network, which is the role of the Chain Rule. \ No newline at end of file