localhost433
diff --git a/.gitignore b/.gitignore
Lines changed: 0 additions & 3 deletions
diff --git a/README.md b/README.md
Lines changed: 4 additions & 0 deletions
diff --git a/TERMS.md b/TERMS.md
Lines changed: 31 additions & 0 deletions
diff --git a/components/footer.html b/components/footer.html
Lines changed: 1 addition & 0 deletions
diff --git a/components/header.html b/components/header.html
Lines changed: 28 additions & 11 deletions
diff --git a/notes/courses/LING-UA-2/images/01.png b/notes/courses/LING-UA-2/images/01.png
76.7 KB
diff --git a/notes/metadata/courses.json b/notes/metadata/courses.json
Lines changed: 7 additions & 0 deletions
diff --git a/posts/entries/010-nyc-predict.md b/posts/entries/010-nyc-predict.md
Lines changed: 1 addition & 1 deletion
diff --git a/posts/entries/011-MoE1.md b/posts/entries/011-MoE1.md
Lines changed: 151 additions & 0 deletions
diff --git a/posts/entries/012-MoE2.md b/posts/entries/012-MoE2.md
Lines changed: 62 additions & 0 deletions
@@ -30,12 +30,9 @@ build/
 # Forbidden post entries
 /posts/entries/004-*.md
 /posts/entries/010-*.md
-/posts/entries/011-*.md
-/posts/entries/012-*.md
 /posts/entries/013-*.md
 /posts/entries/015-*.md
 /posts/entries/016-*.md
-/posts/entries/017-*.md
 /posts/entries/018-*.md
 /posts/entries/019-*.md
 
 
@@ -34,6 +34,10 @@ This repository contains the source code for my personal website. The site serve
 
 This site is deployed on [Vercel](https://vercel.com/) with automatic GitHub-triggered builds.  
 
+## Terms / AI Usage
+
+Use of this content for training machine learning or AI models is expressly prohibited without prior written consent. See `TERMS.md` for full details.
+
 ## Contact
 
 Feel free to reach out via:
 
@@ -0,0 +1,31 @@
+# Terms of Use
+
+Copyright © 2025 Robin Chen. All rights reserved.
+
+Unless otherwise explicitly licensed in a specific file, all textual, media, code, and structured content in this repository ("Content") is protected by copyright.
+
+## Permitted Use
+- Personal viewing and reading.
+- Linking to public pages.
+- Quoting brief excerpts with attribution and a link back.
+
+## Prohibited Without Prior Written Consent
+- Using, scraping, copying, aggregating, or transforming the Content (in whole or part) for the purpose of training, fine‑tuning, or evaluating machine learning or AI models.
+- Bulk or systematic downloading / crawling beyond what a normal browser would perform.
+- Republishing or redistributing substantial portions of the Content.
+
+## Automated Access / AI Crawlers
+Robots and AI data collection systems must respect `robots.txt`, meta `noai`, `noimageai`, and the `X-Robots-Tag` headers served by this site. Access beyond those signals constitutes unauthorized use.
+
+## Attribution Requirement
+Where limited quotation is permitted, provide: (1) author name, (2) original page URL, (3) date accessed.
+
+## No Warranty
+Content is provided “as is” without any warranty of any kind.
+
+## Contact
+For licensing or usage inquiries (including AI/data use requests): robinchen@nyu.edu
+
+Use of this site constitutes acceptance of these terms.
+
+> Use of this content for training machine learning or AI models is expressly prohibited without prior written consent.
@@ -2,4 +2,5 @@
     <small style="text-align:center; font-size:0.9rem; color:#777; display:block; margin-top:1rem;">
         &copy; <span id="current-year"></span> Robin C. All rights reserved.
     </small>
+    <!-- Content © 2025 Robin C. Not licensed for AI training. Hash: 9f7d4c2e -->
 </footer>
@@ -1,11 +1,28 @@
-<div class="header-title">Robin's Site ~/</div>
-<nav> 
-  <a href="index.html">Home</a> |
-  <a href="projects.html">Projects</a> |
-  <a href="notes.html">Notes</a> |
-  <a href="blog.html">Blog</a>
-</nav>
-<div class="header-row">
-  <p>Welcome! Sharing ideas, projects, school notes, and more.</p>
-  <button id="theme-toggle" aria-label="Toggle Dark Mode">Dark Mode</button>
-</div>
+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>Robin's Site</title>
+    <meta name="robots" content="index,follow">
+    <meta name="generator" content="noai">
+    <meta name="ai" content="noai">
+    <meta name="robots" content="noai,noimageai">
+    <meta http-equiv="X-Robots-Tag" content="noai,noimageai">
+    <link rel="stylesheet" href="styles.css">
+</head>
+<body>
+    <div class="header-title">Robin's Site ~/</div>
+    <nav> 
+      <a href="index.html">Home</a> |
+      <a href="projects.html">Projects</a> |
+      <a href="notes.html">Notes</a> |
+      <a href="blog.html">Blog</a>
+    </nav>
+    <div class="header-row">
+      <p>Welcome! Sharing ideas, projects, school notes, and more.</p>
+      <button id="theme-toggle" aria-label="Toggle Dark Mode">Dark Mode</button>
+    </div>
+    <script src="script.js"></script>
+</body>
+</html>
@@ -19,4 +19,11 @@
     "semester": "Fall 2025",
     "instructor": "Prof. Shatah"
   }
+  ,
+  {
+    "slug": "LING-UA-2",
+    "title": "Language of Names (LING-UA 2)",
+    "semester": "Fall 2025",
+    "instructor": "Prof. McKenzie, Prof. Davidson"
+  }
 ]
@@ -15,4 +15,4 @@ In April, I read a news article discussing [this paper](https://arxiv.org/pdf/25
 The good point about this idea is that, unlike the papal election, there's plenty of polls and data out there online. And all the candidates (fewer than 20) are known. However, there's a primary coming in a few weeks, and another election in November.
 
 ## What we did
-It's really a lot of work to do in ~2 weeks with classes, but we finished the main functions of the model. Although the predictions aren't very promising at the moment, they're reasonable to some point. Personally, I'd like Mamdani to win this primary, and indeed he did win—aligning with the output of what I predicted.
+It's really a lot of work to do in ~2 weeks with classes, but we finished the main functions of the model. Although the predictions aren't very promising at the moment, they're reasonable to some extent. At first the model predicted that Cuomo would win this primary, but the polls around that day suggested Mamdani was ahead. That's not a bad thing, but it kept me revising the model. In the end he did win, which matched what I expected.
@@ -0,0 +1,151 @@
+---
+title: Auxiliary-loss Load Balancing in MoEs (1)
+date: 2025-07-06
+tags: [cs, ai, notes]
+author: R
+location: Above Ilulissat, Greenland while on a plane from New York to Hong Kong
+---
+
+I'm currently reading [this conference paper for ICLR 2025](https://arxiv.org/pdf/2408.15664) (Wang et al., 2025) as I'm preparing for my internship, but after going through the intro I'd like to take down some notes, as there are a lot of ideas and lessons that I've learned while reading it. Italics in this note are directly quoted from the paper.
+
+## MoEs
+After opening the paper I encountered the concept of MoEs. To get myself more familiar, I read [this blog on Hugging Face](https://huggingface.co/blog/moe) (Sanseviero et al., 2023), which was really helpful—highly recommended. MoE stands for **M**ixture **o**f **E**xperts; a famous example of its type is DeepSeek. It has many advantages, as the authors wrote: easy to scale to a large number of parameters, manageable costs, etc.
+
+*Let $u_t$ denote the input of the $t$-th token to an $N$-expert MoE layer, the output $h_t$ is computed as follows:*
+Let  
+- $N$ be the number of experts,  
+- $K$ the number of experts selected per token,  
+- $T$ the total number of tokens in the batch,  
+- $\mathbf{u}_t\in\R^d$ the input for token $t$,  
+- $\text{FFN}_i: \R^d\to\R^d$ the $i$-th expert network,  
+- $e_i\in\R^d$ the centroid (parameter) of expert $i$,  
+- $G\colon\R\to\R_{>0}$ a positive gating function (e.g. $\exp$, $\text{sigmoid}$, or $\text{softmax}$),  
+- $s_{i,t}$ the *raw gating score* for expert $i$ on token $t$, obtained by applying $G$ to the dot product of input $\mathbf{u}_t$ and expert centroid $e_i$, and  
+- $g_{i,t}$ the *pruned gating weight*: it equals $s_{i,t}$ if $s_{i,t}$ ranks among the top-$K$ scores for token $t$, and zero otherwise.
+
+Compute for each token $t$ and expert $i$:
+$$
+\begin{align*}
+    g_{i,t} &= 
+        \begin{cases}
+        s_{i,t}, & s_{i,t} \in \text{Topk}(\{s_{j,t} | 1 \leq j \leq N\} , K)\\\\
+                 &          (\text{if $s_{i,t}$ is among the top-$K$ scores})\\\\
+        0, & \text{otherwise}
+        \end{cases} \\\\
+    s_{i,t} &= G(\textbf{u}_t^\top e_i)
+\end{align*}
+$$
+
+and form the layer output
+
+$$
+\textbf{h}_t = \textbf{u}_t + \sum^N_{i=1} g_{i,t} \text{FFN}_i (\textbf{u}_t)
+$$
+
+So here $G$ could be any function $\R \to \R_{>0}$. Some conventional ones could be $\exp$, softmax, or sigmoid (to be honest I had to look these two up to see what they are exactly). In this paper they use the latter two.
+
+The experts that survive the top-$K$ pruning are the ones actually consulted for that token, each weighted by its gating score.
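A minimal sketch of the gating step above for a single token, in plain Python. The sigmoid gate is one of the choices the post mentions; the toy "experts" (functions that just scale their input), the centroid values, and the helper name `moe_layer` are assumptions made for illustration and do not come from the paper.

```python
import math


def sigmoid(x):
    """One possible positive gate G."""
    return 1.0 / (1.0 + math.exp(-x))


def moe_layer(u, centroids, experts, k, gate=sigmoid):
    """Compute h = u + sum_i g_i * FFN_i(u) for a single token u."""
    # Raw gating scores s_i = G(u . e_i), one per expert.
    scores = [gate(sum(a * b for a, b in zip(u, e))) for e in centroids]
    # Indices of the k largest scores; every other expert gets g_i = 0.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    h = list(u)  # residual connection
    for i in top:
        out = experts[i](u)
        h = [hj + scores[i] * oj for hj, oj in zip(h, out)]
    return h, scores, top


# Toy setup (hypothetical): three "experts" that only scale the input.
experts = [lambda v, c=c: [c * x for x in v] for c in (1.0, 2.0, 3.0)]
centroids = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
h, scores, top = moe_layer([1.0, 0.0], centroids, experts, k=2)
print(top, h)
```

With `k=2`, only the two highest-scoring experts contribute to `h`; the third is skipped entirely, which is where the compute savings of an MoE layer come from.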
+
+## Problem: Imbalanced routing
+But one problem MoEs often experience is imbalanced routing (a small number of experts receive most tokens), thus creating *a risk of routing collapse (Shazeer et al., 2017), where the model consistently selects only a few experts, hindering sufficient training of the other experts*, or a *computational bottleneck due to load imbalance*.
+
+I was wondering how it could cause a computational bottleneck, but then I realized the way I thought about it—that it could easily scale through parallelism or other ways—is not easily achievable. Since there are different machines hosting each expert, it depends more on the load given to a certain expert.
+
+Plus, the training loop would need a substantial redesign to use the idle computational power to catch up. Even if I create replicas for the "hot" experts on more hosts, they need to be in sync, which creates a lot of cost by itself. Merging gradients across replicas requires collective operations every step; at that point it will just recreate the original problem we’re trying to overcome if one of these slows down...
+
+### Solution: Auxiliary-loss
+To address this issue, there is an auxiliary loss that encourages balanced load and thus avoids imbalanced routing in training MoEs. To do this, it penalizes the use of only a few experts. It acts mostly on the gating function, since the penalty is computed from the gate's scores and routing decisions. 
+
+**Key variables:**
+- $N$: number of experts in the MoE layer
+- $K$: number of experts selected per token (top-K)  
+- $T$: total number of tokens in the batch
+- $\mathbb{1}$: indicator function (equals 1 if condition is true, 0 otherwise)
+- $\alpha$: balancing‐loss weight (manually set hyperparameter)
+
+Defined as such:
+
+- **Normalized load**   
+  $f_i$ := the fraction of token-to-expert assignments that go to expert $i$, scaled by $N$ so that perfect balance gives $f_i = 1$:
+  $$
+    f_i = \frac{N}{KT} \sum_{t=1}^T \mathbb{1} (\text{token } t \text{ selects expert } i)
+  $$
+
+- **Average gating weight**  
+  $P_i$:= the mean score assigned by the gate to expert $i$:
+  $$
+    P_i = \frac{1}{T} \sum_{t=1}^T s_{i,t}
+  $$
+
+Combine these into a single penalty term:
+
+$$\mathcal{L}_{\mathrm{balance}} = \alpha \sum_{i=1}^N f_i P_i$$
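A small sketch of how $f_i$, $P_i$, and $\mathcal{L}_{\mathrm{balance}}$ could be computed from a table of gating scores, assuming `scores[t][i]` holds $s_{i,t}$. The function name `balance_loss`, the two toy score tables, and the value of `alpha` are made up for illustration.

```python
def balance_loss(scores, k, alpha):
    """scores[t][i] = s_{i,t}. Returns (f, P, loss) for one batch."""
    T = len(scores)
    N = len(scores[0])
    counts = [0] * N
    for row in scores:
        # Top-k routing decision for this token.
        top = sorted(range(N), key=lambda i: row[i], reverse=True)[:k]
        for i in top:
            counts[i] += 1
    # Normalized load: equals 1.0 for every expert under perfect balance.
    f = [N * c / (k * T) for c in counts]
    # Average gating score per expert.
    P = [sum(row[i] for row in scores) / T for i in range(N)]
    loss = alpha * sum(fi * pi for fi, pi in zip(f, P))
    return f, P, loss


# Hypothetical batches with N=2 experts, K=1, T=4 tokens.
collapsed = [[0.9, 0.1] for _ in range(4)]   # every token picks expert 0
balanced = [[0.9, 0.1], [0.1, 0.9]] * 2      # tokens alternate experts
f_c, P_c, loss_c = balance_loss(collapsed, k=1, alpha=0.01)
f_b, P_b, loss_b = balance_loss(balanced, k=1, alpha=0.01)
print(f_c, loss_c)
print(f_b, loss_b)
```

In this toy case the collapsed batch gets the larger penalty, because the expert with the highest load is also the one with the highest average score.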
+
+
+**Regularization terms:**  
+Introduce two small-weight penalties on the imbalance of $\{P_i\}$ and $\{f_i\}$:
+
+\begin{align*}
+  \mathcal{L}_P &= \lambda_P \operatorname{CV}^2(\{P_i\}) \\\\
+  \mathcal{L}_f &= \lambda_f \operatorname{CV}^2(\{f_i\})
+\end{align*}
+
+where typically $\lambda_{P} \approx \lambda_{f} \sim 10^{-2}$.
+
+> This is actually optional; for a simpler setup, just use $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \mathcal{L}_{\text{balance}}$.  
+> I write it this way just to follow the [original MoE auxiliary-loss formulation paper (Shazeer et al., 2017)](https://arxiv.org/pdf/1701.06538).
+
+
+**Imbalance metric: coefficient of variation squared**  
+For any set of scalars $\{z_i\}_{i=1}^N$, define
+
+$$
+  \text{CV}^2(\{z_i\}) =
+  \frac{\frac{1}{N} \sum_{i=1}^N z_i^2 - (\frac{1}{N} \sum_{i=1}^N z_i )^2}
+       {(\frac{1}{N} \sum_{i=1}^N z_i )^2},
+$$
+
+which satisfies $\text{CV}^2=0$ exactly when all $z_i$ are equal.
+
+> By the way, this looks very much like the variance.  
+> Write $\mu = \tfrac1N \sum_i z_i$ and $\nu = \tfrac1N \sum_i z_i^2$. Then
+> $$
+> \text{CV}^2 = \frac{\nu - \mu^2}{\mu^2} = \frac{\text{Var}}{(\text{Mean})^2}
+> $$
+> Its partial derivative with respect to one coordinate $z_k$ is
+> $$
+> \frac{\partial \text{CV}^2}{\partial z_k}
+> = \frac{2}{N}\Bigl(\frac{z_k}{\mu^2} - \frac{\nu}{\mu^3}\Bigr).
+> $$
+> > Details: since $\tfrac{\nu - \mu^2}{\mu^2} = \tfrac{\nu}{\mu^2} - 1$, with $\tfrac{\partial\nu}{\partial z_k} = \tfrac{2z_k}{N}$ and $\tfrac{\partial\mu}{\partial z_k} = \tfrac{1}{N}$,
+\begin{align*}
+  \frac{\partial}{\partial z_k} \Bigl(\frac{\nu}{\mu^2} - 1\Bigr)
+  &= \frac{1}{\mu^2} \frac{\partial\nu}{\partial z_k} - \frac{2\nu}{\mu^3} \frac{\partial\mu}{\partial z_k}\\\\
+  &= \frac{1}{\mu^2} \frac{2z_k}{N} - \frac{2\nu}{\mu^3} \frac{1}{N}\\\\
+  &= \frac{2}{N}\Bigl(\frac{z_k}{\mu^2} - \frac{\nu}{\mu^3}\Bigr).
+\end{align*}
+> 
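+> Because $\nu/\mu^3$ is the same constant for all $k$, the gradient is positive exactly when $z_k > \nu/\mu$ and negative when $z_k < \nu/\mu$. Since $\nu/\mu = \mu(1 + \text{CV}^2)$ sits close to $\mu$ when the load is nearly balanced, gradient descent pushes down the overloaded experts and pushes up the underloaded ones. In other words, this is the derivative of a variance term normalized by $\mu^2$.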
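As a sanity check on the closed-form derivative, here is a short sketch that compares it with a central finite difference. The sample vector `z` and the step size `eps` are arbitrary choices for illustration.

```python
def cv2(z):
    """Squared coefficient of variation: Var / Mean^2."""
    n = len(z)
    mu = sum(z) / n
    nu = sum(x * x for x in z) / n
    return (nu - mu * mu) / (mu * mu)


def cv2_grad(z):
    """Closed form: (2/N) * (z_k / mu^2 - nu / mu^3) for each k."""
    n = len(z)
    mu = sum(z) / n
    nu = sum(x * x for x in z) / n
    return [2.0 / n * (x / mu ** 2 - nu / mu ** 3) for x in z]


def numeric_grad(z, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = []
    for k in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[k] += eps
        z_minus[k] -= eps
        grad.append((cv2(z_plus) - cv2(z_minus)) / (2 * eps))
    return grad


z = [1.0, 2.0, 3.0, 6.0]  # arbitrary, deliberately unbalanced loads
analytic = cv2_grad(z)
numeric = numeric_grad(z)
for a, n in zip(analytic, numeric):
    print(f"{a:+.6f}  {n:+.6f}")
```

For this `z`, the largest entry gets a positive gradient and the smallest a negative one, so a gradient-descent step shrinks the former and grows the latter.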
+
+
+**Total training objective**  
+Combine with the primary task loss $\mathcal{L}_{\text{task}}$:
+$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \mathcal{L}_{P} + \mathcal{L}_{f}.$$ 
+
+#### Intuition
+- The penalty grows as either $f_i$ or $P_i$ grows (since it's a product), so the penalties drive the routing distribution toward uniformity. They act on the router through backpropagation: the gradient flows through $P_i$, because the counts in $f_i$ are not differentiable.
+- Minimizing $\text{CV}^2$ drives the variance of $\{P_i\}$ or $\{f_i\}$ toward zero relative to their mean (see derivation of $\partial \text{CV}^2/\partial z_k$ above).
+- Any expert $i$ with above-average usage raises its own $P_i$ or $f_i$, increasing the penalty.
+
+
+#### Drawbacks
+The ICLR 2025 paper mentioned that auxiliary loss might introduce unwanted gradients, as the MoE models perform worse on some metrics.
+
+However, I wasn't really convinced by this reasoning. The performance was not improved that significantly (I was expecting a larger gap) for the validation perplexity. There's a bunch of other models they could choose from, but instead they picked this small one. The load balance one sounds okay, and that's the main point of the paper, so it's good.
+
+The true drawback, in my opinion, comes with the act of rebalancing through auxiliary loss itself.
+- The idea of MoE is having many highly specialized experts; auxiliary loss fights any concentration of weight, even if that concentration was beneficial for modeling those tokens.
+- The balancing gradient for an expert involves all experts' totals. So updating the logit for one expert now depends on every other expert’s load. It's obvious it can drown out more specialized signals.
+- Naturally, experts that are good at certain tokens are expected to get those; trying to make the router equalize loads regardless of quality can route a token to a weaker expert, simply because the "best" expert is already slightly busier.
+
+(TBC)
@@ -0,0 +1,62 @@
+---
+title: Loss-free Balancing in MoEs (2)
+date: 2025-07-06
+tags: [cs, finance, ai, notes]
+author: R
+location: Above around Ust-Ilimsk, Russia while on a plane from New York to Hong Kong
+---
+
+(On second thought) I should've just annotated the paper instead of writing this. Reading something this bright on an airplane with the lights off felt like some type of torture. I couldn't fall asleep in my seat right away, and when I woke up the screen was still on; that made me feel that I had to finish it before doing anything else.
+
+## Problem: Imbalanced routing (Continued)
+The problem was illustrated in [the last post](https://robinc.vercel.app/post.html?id=011-MoE-1). Italics are directly quoted from the paper.
+
+### Solution: EC (Expert Choice)
+The authors of the [2025 ICLR paper](https://arxiv.org/pdf/2408.15664) indicated that this approach *breaks the causal constraint* (causing a leakage of future information which *destroys the generalization of a model and prevents reliable evaluation*). They even proved their hypothesis (*that the loss drop originates from the model’s accessing and exploiting future token information*) through experiments.
+
+### Solution: Loss-Free Balancing
+We adjust each expert’s bias $b_i$ online to enforce perfectly balanced routing. Algorithm 1 (Wang et al., 2025) shows how:
+
+Adjusting the per-expert bias $b_i$ during training
+Input: MoE model $\theta$, batch iterator $\mathcal{B}$, bias-update rate $u$
+
+1. Initialize $b_i \leftarrow 0, \forall i \in \{1 \dots N\}$.
+2. For each batch $\{(x_k,y_k)\}_k \in \mathcal{B}$:
+> - Compute raw gating scores
+> $$
+> s_{i,t} = G(\textbf{u}_t^\top \textbf{e}_i)
+> $$
+> - Prune with bias
+
+$$
+\begin{align*}
+    g_{i,t} &= 
+        \begin{cases}
+            s_{i,t}, & s_{i,t} + b_i \in \text{Topk}(\{s_{j,t} + b_j | 1 \leq j \leq N\} , K) \\\\
+            0, & \text{otherwise}
+        \end{cases}
+\end{align*}
+$$
+
+> > Note that $b_i$ only takes part in the Top-k selection; it doesn't contribute to the gating score.
+> - Train $\theta$ on this batch using weights $g_{i,t}$.
+> - Count assignments and compute the load error $e_i$ (a scalar, not the centroid $\mathbf{e}_i$):
+> $$
+> \begin{align*}
+>     c_i &= \sum_{t=1}^T \mathbb{1}(g_{i,t} > 0), \quad \bar{c} = \frac{KT}{N}\\
+>     e_i &= \bar{c} - c_i
+> \end{align*}
+> $$
+> - Update bias:
+> $$
+> b_i \leftarrow b_i + u \cdot \operatorname{sgn}(e_i)
+> $$
+3. Return $\theta$ and $b_i$.
+
+No extra loss term is needed—by pushing $b_i$ up or down by the sign of the load violation, the expected fraction of tokens per expert is driven to $1/N$.
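The bias update can be sketched as a toy simulation in plain Python. This is not the paper's training loop: there is no model being trained, the gating scores are random numbers with a fixed per-expert `skew` (an assumption that makes expert 0 "hot"), and the names `update_bias` and `simulate` are hypothetical. Only the counting and the sign update follow the steps listed above.

```python
import random


def topk(values, k):
    """Indices of the k largest values."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


def update_bias(bias, counts, k, T, u):
    """b_i <- b_i + u * sgn(c_bar - c_i), with c_bar = K*T/N."""
    c_bar = k * T / len(bias)
    new_bias = []
    for b, c in zip(bias, counts):
        err = c_bar - c
        sign = (err > 0) - (err < 0)  # -1, 0, or +1
        new_bias.append(b + u * sign)
    return new_bias


def simulate(k=1, T=256, steps=200, u=0.01, seed=0):
    rng = random.Random(seed)
    skew = [0.4, 0.2, 0.1, 0.0]  # hypothetical: expert 0 scores highest
    n = len(skew)
    bias = [0.0] * n
    history = []
    for _ in range(steps):
        counts = [0] * n
        for _ in range(T):
            # Toy positive scores standing in for s_{i,t}.
            s = [skew[i] + 0.5 * rng.random() for i in range(n)]
            # The bias only affects which experts are selected.
            for i in topk([s[i] + bias[i] for i in range(n)], k):
                counts[i] += 1
        history.append(counts)
        bias = update_bias(bias, counts, k, T, u)
    return bias, history


final_bias, history = simulate()
print("first batch:", history[0])
print("last batch: ", history[-1])
print("final bias: ", [round(b, 2) for b in final_bias])
```

In this toy setting the first batch is dominated by expert 0, and the per-expert counts then drift toward $KT/N = 64$ as the biases adjust. How quickly that happens depends on the made-up `skew` and on `u`.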
+
+## MoE in Application
+And there's [the 2023 paper](https://personal.ntu.edu.sg/boan/papers/KDD23_Stock.pdf) (Sun et al., 2023) which builds a MoE in order to account for multiple metrics, which differs from mainstream deep learning models being used in the industry.
+
+## For me
+I guess my main task is to apply Loss-Free Balancing to an MoE model. Because just looking at the time of the publication and practices around, I'd say the majority of the industry is still using deep learning models that are dependent on other architectures. While attempts could be made (and I'm sure there have been such attempts all around) to take in more indicators, it seems that building a MoE model which overcomes drawbacks in auxiliary-loss implementations could be advantageous.