Commit ea3d5fa

New posts, added terms
1 parent d0a25c3 commit ea3d5fa

17 files changed

Lines changed: 402 additions & 21 deletions

File tree

.gitignore

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -30,12 +30,9 @@ build/
3030
# Forbidden post entries
3131
/posts/entries/004-*.md
3232
/posts/entries/010-*.md
33-
/posts/entries/011-*.md
34-
/posts/entries/012-*.md
3533
/posts/entries/013-*.md
3634
/posts/entries/015-*.md
3735
/posts/entries/016-*.md
38-
/posts/entries/017-*.md
3936
/posts/entries/018-*.md
4037
/posts/entries/019-*.md
4138

README.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,10 @@ This repository contains the source code for my personal website. The site serve
3434

3535
This site is deployed on [Vercel](https://vercel.com/) with automatic GitHub-triggered builds.
3636

37+
## Terms / AI Usage
38+
39+
Use of this content for training machine learning or AI models is expressly prohibited without prior written consent. See `TERMS.md` for full details.
40+
3741
## Contact
3842

3943
Feel free to reach out via:

TERMS.md

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
# Terms of Use
2+
3+
Copyright © 2025 Robin Chen. All rights reserved.
4+
5+
Unless otherwise explicitly licensed in a specific file, all textual, media, code, and structured content in this repository ("Content") is protected by copyright.
6+
7+
## Permitted Use
8+
- Personal viewing and reading.
9+
- Linking to public pages.
10+
- Quoting brief excerpts with attribution and a link back.
11+
12+
## Prohibited Without Prior Written Consent
13+
- Using, scraping, copying, aggregating, or transforming the Content (in whole or part) for the purpose of training, fine‑tuning, or evaluating machine learning or AI models.
14+
- Bulk or systematic downloading / crawling beyond what a normal browser would perform.
15+
- Republishing or redistributing substantial portions of the Content.
16+
17+
## Automated Access / AI Crawlers
18+
Robots and AI data collection systems must respect `robots.txt`, meta `noai`, `noimageai`, and the `X-Robots-Tag` headers served by this site. Access beyond those signals constitutes unauthorized use.
19+
20+
## Attribution Requirement
21+
Where limited quotation is permitted, provide: (1) author name, (2) original page URL, (3) date accessed.
22+
23+
## No Warranty
24+
Content is provided “as is” without any warranty of any kind.
25+
26+
## Contact
27+
For licensing or usage inquiries (including AI/data use requests): robinchen@nyu.edu
28+
29+
Use of this site constitutes acceptance of these terms.
30+
31+
> Use of this content for training machine learning or AI models is expressly prohibited without prior written consent.

components/footer.html

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,4 +2,5 @@
22
<small style="text-align:center; font-size:0.9rem; color:#777; display:block; margin-top:1rem;">
33
&copy; <span id="current-year"></span> Robin C. All rights reserved.
44
</small>
5+
<!-- Content © 2025 Robin C. Not licensed for AI training. Hash: 9f7d4c2e -->
56
</footer>

components/header.html

Lines changed: 28 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -1,11 +1,28 @@
1-
<div class="header-title">Robin's Site ~/</div>
2-
<nav>
3-
<a href="index.html">Home</a> |
4-
<a href="projects.html">Projects</a> |
5-
<a href="notes.html">Notes</a> |
6-
<a href="blog.html">Blog</a>
7-
</nav>
8-
<div class="header-row">
9-
<p>Welcome! Sharing ideas, projects, school notes, and more.</p>
10-
<button id="theme-toggle" aria-label="Toggle Dark Mode">Dark Mode</button>
11-
</div>
1+
<!DOCTYPE html>
2+
<html lang="en">
3+
<head>
4+
<meta charset="UTF-8">
5+
<meta name="viewport" content="width=device-width, initial-scale=1.0">
6+
<title>Robin's Site</title>
7+
<meta name="robots" content="index,follow">
8+
<meta name="generator" content="noai">
9+
<meta name="ai" content="noai">
10+
<meta name="robots" content="noai,noimageai">
11+
<meta http-equiv="X-Robots-Tag" content="noai,noimageai">
12+
<link rel="stylesheet" href="styles.css">
13+
</head>
14+
<body>
15+
<div class="header-title">Robin's Site ~/</div>
16+
<nav>
17+
<a href="index.html">Home</a> |
18+
<a href="projects.html">Projects</a> |
19+
<a href="notes.html">Notes</a> |
20+
<a href="blog.html">Blog</a>
21+
</nav>
22+
<div class="header-row">
23+
<p>Welcome! Sharing ideas, projects, school notes, and more.</p>
24+
<button id="theme-toggle" aria-label="Toggle Dark Mode">Dark Mode</button>
25+
</div>
26+
<script src="script.js"></script>
27+
</body>
28+
</html>

notes/metadata/courses.json

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,4 +19,11 @@
1919
"semester": "Fall 2025",
2020
"instructor": "Prof. Shatah"
2121
}
22+
,
23+
{
24+
"slug": "LING-UA-2",
25+
"title": "Language of Names (LING-UA 2)",
26+
"semester": "Fall 2025",
27+
"instructor": "Prof. McKenzie, Prof. Davidson"
28+
}
2229
]

posts/entries/010-nyc-predict.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,4 +15,4 @@ In April, I read a news article discussing [this paper](https://arxiv.org/pdf/25
1515
The good point about this idea is that, unlike the papal election, there's plenty of polls and data out there online. And all the candidates (fewer than 20) are known. However, there's a primary coming in a few weeks, and another election in November.
1616

1717
## What we did
18-
It's really a lot of work to do in ~2 weeks with classes, but we finished the main functions of the model. Although the predictions aren't very promising at the moment, they're reasonable to some point. Personally, I'd like Mamdani to win this primary, and indeed he did win—aligning with the output of what I predicted.
18+
It's really a lot of work to do in ~2 weeks with classes, but we finished the main functions of the model. Although the predictions aren't very promising at the moment, they're reasonable to some extent. At first the model predicted that Cuomo would win this primary, but the polls around election day made me think Mamdani was winning. That's not a bad thing, but it kept me revising the model. In the end he did win, in line with what I expected.

posts/entries/011-MoE1.md

Lines changed: 151 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,151 @@
1+
---
2+
title: Auxiliary-loss Load Balancing in MoEs (1)
3+
date: 2025-07-06
4+
tags: [cs, ai, notes]
5+
author: R
6+
location: Above Ilulissat, Greenland while on a plane from New York to Hong Kong
7+
---
8+
9+
I'm currently reading [this conference paper for ICLR 2025](https://arxiv.org/pdf/2408.15664) (Wang et al., 2025) as I'm preparing for my internship, but after going through the intro I'd like to take down some notes, as there are a lot of ideas and lessons that I've learned while reading it. Italics in this note are directly quoted from the paper.
10+
11+
## MoEs
12+
After opening the paper I encountered the concept of MoEs. To get myself more familiar, I read [this blog on Hugging Face](https://huggingface.co/blog/moe) (Sanseviero et al., 2023), which was really helpful—highly recommended. MoE stands for **M**ixture **o**f **E**xperts; a famous example of its type is DeepSeek. It has many advantages, as the authors wrote: easy to scale to a large number of parameters, manageable costs, etc.
13+
14+
*Let $u_t$ denote the input of the $t$-th token to an $N$-expert MoE layer, the output $h_t$ is computed as follows:*
15+
Let
16+
- $N$ be the number of experts,
17+
- $K$ the number of experts selected per token,
18+
- $T$ the total number of tokens in the batch,
19+
- $\mathbf{u}_t\in\R^d$ the input for token $t$,
20+
- $\text{FFN}_i: \R^d\to\R^d$ the $i$-th expert network,
21+
- $e_i\in\R^d$ the centroid (parameter) of expert $i$, and
22+
- $G\colon\R\to\R_{>0}$ a positive gating function (e.g. $\exp$, $\text{sigmoid}$, or $\text{softmax}$).
23+
- $s_{i,t}$ is the *raw gating score* for expert $i$ on token $t$, obtained by applying $G$ to the dot‐product of input $\mathbf{u}_t$ and expert centroid $e_i$.
24+
- $g_{i,t}$ is the *pruned gating weight*: it equals $s_{i,t}$ if $s_{i,t}$ ranks among the top-$K$ scores for token $t$, and zero otherwise.
25+
26+
Compute for each token $t$ and expert $i$:
27+
$$
28+
\begin{align*}
29+
g_{i,t} &=
30+
\begin{cases}
31+
s_{i,t}, & s_{i,t} \in \text{Topk}(\{s_{j,t} | 1 \leq j \leq N\} , K)\\\\
32+
& (\text{if $s_{i,t}$ is among the top-$K$ scores})\\\\
33+
0, & \text{otherwise}
34+
\end{cases} \\\\
35+
s_{i,t} &= G(\textbf{u}_t^\top e_i)
36+
\end{align*}
37+
$$
38+
39+
and form the layer output
40+
41+
$$
42+
\textbf{h}_t = \textbf{u}_t + \sum^N_{i=1} g_{i,t} \text{FFN}_i (\textbf{u}_t)
43+
$$
44+
45+
So here $G$ could be any function $\R \to \R_{>0}$. Some conventional ones could be $\exp$, softmax, or sigmoid (to be honest I had to look these two up to see what they are exactly). In this paper they use the latter two.
46+
47+
The experts consulted for each token are then the ones that the gating function ranks in the top $K$.
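To make the routing above concrete, here is a minimal NumPy sketch of a single top-$K$ MoE layer. It is only an illustration under my own assumptions: the name `moe_layer`, the toy sizes, and the linear maps standing in for $\text{FFN}_i$ are all made up, and sigmoid is used as $G$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def moe_layer(u, centroids, experts, k, gate=sigmoid):
    """u: (T, d) token inputs, centroids: (N, d), experts: N callables on (n, d) arrays."""
    s = gate(u @ centroids.T)                  # raw gating scores s[t, i], shape (T, N)
    topk = np.argsort(-s, axis=1)[:, :k]       # indices of the K largest scores per token
    rows = np.arange(u.shape[0])[:, None]
    g = np.zeros_like(s)
    g[rows, topk] = s[rows, topk]              # pruned gating weights g[t, i]
    h = u.copy()                               # residual term u_t
    for i, ffn in enumerate(experts):
        picked = g[:, i] > 0                   # tokens routed to expert i
        if picked.any():
            h[picked] += g[picked, i:i + 1] * ffn(u[picked])
    return h, s, g

# Toy setup (sizes and the linear "experts" are hypothetical).
rng = np.random.default_rng(0)
T, d, N, K = 6, 4, 5, 2
u = rng.normal(size=(T, d))
centroids = rng.normal(size=(N, d))
weights = rng.normal(size=(N, d, d))
experts = [lambda x, W=W: x @ W for W in weights]
h, s, g = moe_layer(u, centroids, experts, K)
```

Each row of `g` ends up with exactly `K` nonzero entries, so only `K` experts are evaluated per token even though there are `N` of them.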
48+
49+
## Problem: Imbalanced routing
50+
But one problem MoEs often experience is imbalanced routing (a small number of experts receive most tokens), thus creating *a risk of routing collapse (Shazeer et al., 2017), where the model consistently selects only a few experts, hindering sufficient training of the other experts*, or a *computational bottleneck due to load imbalance*.
51+
52+
I was wondering how it could cause a computational bottleneck, but then I realized the way I thought about it—that it could easily scale through parallelism or other ways—is not easily achievable. Since there are different machines hosting each expert, it depends more on the load given to a certain expert.
53+
54+
Plus, the training loop would need a substantial redesign to use the idle computational power to catch up. Even if I create replicas for the "hot" experts on more hosts, they need to be in sync, which creates a lot of cost by itself. Merging gradients across replicas requires collective operations every step; at that point it will just recreate the original problem we’re trying to overcome if one of these slows down...
55+
56+
### Solution: Auxiliary-loss
57+
To address this issue, there is an auxiliary loss that encourages balanced load and thus avoids imbalanced routing in training MoEs. To do this, it penalizes the use of only a few experts. It’s mostly within the process of the gating function.
58+
59+
**Key variables:**
60+
- $N$: number of experts in the MoE layer
61+
- $K$: number of experts selected per token (top-K)
62+
- $T$: total number of tokens in the batch
63+
- $\mathbb{1}$: indicator function (equals 1 if condition is true, 0 otherwise)
64+
- $\alpha$: balancing‐loss weight (manually set hyperparameter)
65+
66+
Defined as such:
67+
68+
- **Normalized load**
69+
$f_i$:= the fraction of tokens routed to expert $i$:
70+
$$
71+
f_i = \frac{N}{KT} \sum_{t=1}^T \mathbb{1} (s_{i,t} \in \text{Topk}(\{s_{j,t} | 1 \leq j \leq N\}, K))
72+
$$
73+
74+
- **Average gating weight**
75+
$P_i$:= the mean score assigned by the gate to expert $i$:
76+
$$
77+
P_i = \frac{1}{T} \sum_{t=1}^T s_{i,t}
78+
$$
79+
80+
Combine these into a single penalty term:
81+
82+
$$\mathcal{L}_{\mathrm{balance}} = \alpha \sum_{i=1}^N f_i P_i$$
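
As a rough sketch of how $f_i$, $P_i$, and the penalty could be computed from a matrix of gating scores, assuming NumPy and a made-up value for $\alpha$ (the function name `balance_loss` and the two toy score matrices are hypothetical):

```python
import numpy as np

def balance_loss(s, k, alpha=0.01):
    """s: (T, N) raw gating scores. Returns (loss, f, P)."""
    T, N = s.shape
    topk = np.argsort(-s, axis=1)[:, :k]
    selected = np.zeros((T, N))
    selected[np.arange(T)[:, None], topk] = 1.0   # indicator: token t selects expert i
    f = N / (k * T) * selected.sum(axis=0)        # normalized load f_i
    P = s.mean(axis=0)                            # average gating score P_i
    return alpha * np.sum(f * P), f, P

# Balanced routing: each of the 4 tokens prefers a different expert.
balanced = np.full((4, 4), 0.1) + 0.6 * np.eye(4)
# Collapsed routing: every token prefers expert 0.
collapsed = np.tile([0.7, 0.1, 0.1, 0.1], (4, 1))

loss_balanced, f_balanced, _ = balance_loss(balanced, k=1)
loss_collapsed, f_collapsed, _ = balance_loss(collapsed, k=1)
```

With the same row-wise scores, the collapsed case gets the larger penalty because the overloaded expert has both a large $f_i$ and a large $P_i$.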
83+
84+
85+
**Regularization terms:**
86+
Introduce two small-weight penalties on the imbalance of $\{P_i\}$ and $\{f_i\}$:
87+
88+
\begin{align*}
89+
\mathcal{L}_P &= \lambda_P \operatorname{CV}^2(\{P_i\}) \\\\
90+
\mathcal{L}_f &= \lambda_f \operatorname{CV}^2(\{f_i\})
91+
\end{align*}
92+
93+
where typically $\lambda_{P} \approx \lambda_{f} \sim 10^{-2}$.
94+
95+
> This is actually optional; for a simpler setup just use $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \mathcal{L}_{\text{balance}}$.
96+
> I write it this way just to follow the [original MoE auxiliary-loss formulation paper (Shazeer et al., 2017)](https://arxiv.org/pdf/1701.06538).
97+
98+
99+
**Imbalance metric: coefficient of variation squared**
100+
For any set of scalars $\{z_i\}_{i=1}^N$, define
101+
102+
$$
103+
\text{CV}^2(\{z_i\}) =
104+
\frac{\frac{1}{N} \sum_{i=1}^N z_i^2 - (\frac{1}{N} \sum_{i=1}^N z_i )^2}
105+
{(\frac{1}{N} \sum_{i=1}^N z_i )^2},
106+
$$
107+
108+
which satisfies $\text{CV}^2=0$ exactly when all $z_i$ are equal.
109+
110+
> By the way, this looks very much like the variance.
111+
> Write $\mu = \tfrac1N \sum_i z_i$ and $\nu = \tfrac1N \sum_i z_i^2$. Then
112+
> $$
113+
> \text{CV}^2 = \frac{\nu - \mu^2}{\mu^2} = \frac{\text{Var}}{(\text{Mean})^2}
114+
> $$
115+
> Its partial derivative with respect to one coordinate $z_k$ is
116+
> $$
117+
> \frac{\partial \text{CV}^2}{\partial z_k}
118+
> = \frac{2}{N}\Bigl(\frac{z_k}{\mu^2} - \frac{\nu}{\mu^3}\Bigr).
119+
> $$
120+
> > Details:
121+
\begin{align*}
122+
\frac{\partial}{\partial z_k} (\tfrac{\nu - \mu^2}{\mu^2}) = \frac{\partial}{\partial z_k} (\tfrac{\nu}{\mu^2} - 1)
123+
&= \frac{1}{\mu^2} \frac{\partial\nu}{\partial z_k} - \frac{\nu}{\mu^4} 2\mu \frac{\partial\mu}{\partial z_k}\\\\
124+
&= \frac{1}{\mu^2} \frac{2z_k}{N} - \frac{\nu}{\mu^4} \frac{2\mu}{N}\\\\
125+
&= \frac{2}{N}\Bigl(\frac{z_k}{\mu^2} - \frac{\nu}{\mu^3}\Bigr).
126+
\end{align*}
127+
>
128+
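> Because $\nu/\mu^3$ is the same constant for all $k$, the sign of this gradient depends only on whether $z_k$ lies above or below the threshold $\nu/\mu$ (which is at least $\mu$). Gradient descent therefore pushes down any $z_k > \nu/\mu$ (overloaded expert) and pushes up any $z_k < \nu/\mu$ (underloaded expert). In other words, this is the derivative of a variance term normalized by $\mu^2$.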
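
A quick way to sanity-check the closed-form derivative is to compare it against central finite differences. The sketch below assumes NumPy; the test vector `z`, the step `eps`, and the function names are arbitrary choices of mine.

```python
import numpy as np

def cv_squared(z):
    # Population variance divided by the squared mean.
    return z.var() / z.mean() ** 2

def cv_squared_grad(z):
    # Closed form: (2/N) * (z_k / mu^2 - nu / mu^3)
    mu, nu = z.mean(), np.mean(z ** 2)
    return 2.0 / z.size * (z / mu ** 2 - nu / mu ** 3)

z = np.array([0.5, 1.0, 1.5, 3.0])
eps = 1e-6
basis = np.eye(z.size)
numeric_grad = np.array([
    (cv_squared(z + eps * basis[k]) - cv_squared(z - eps * basis[k])) / (2 * eps)
    for k in range(z.size)
])
analytic_grad = cv_squared_grad(z)
```

Since $\text{CV}^2$ does not change when every $z_i$ is scaled by the same factor, the gradient should also be orthogonal to $z$, which gives a second cheap check.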
129+
130+
131+
**Total training objective**
132+
Combine with the primary task loss $\mathcal{L}_{\text{task}}$:
133+
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \mathcal{L}_{\text{balance}} + \mathcal{L}_{P} + \mathcal{L}_{f}.$$
134+
135+
#### Intuition
136+
- The penalty grows as either $f_i$ or $P_i$ grows (since it's a product), so it drives the routing distribution toward uniformity. The gradient reaches the router parameters through $P_i$ via backpropagation, since the token counts in $f_i$ are not differentiable.
137+
- Minimizing $\text{CV}^2$ drives the variance of $\{P_i\}$ or $\{f_i\}$ toward zero relative to their mean (see derivation of $\partial \text{CV}^2/\partial z_k$ above).
138+
- Any expert $i$ with above-average usage raises its own $P_i$ or $f_i$, increasing the penalty.
139+
140+
141+
#### Drawbacks
142+
The ICLR 2025 paper mentioned that auxiliary loss might introduce unwanted gradients, as the MoE models perform worse on some metrics.
143+
144+
However, I wasn't really convinced by this reasoning. The performance was not improved that significantly (I was expecting a larger gap) for the validation perplexity. There's a bunch of other models they could choose from, but instead they picked this small one. The load balance one sounds okay, and that's the main point of the paper, so it's good.
145+
146+
The true drawback, in my opinion, comes with the act of rebalancing through auxiliary loss itself.
147+
- The idea of MoE is having many highly specialized experts; auxiliary loss fights any concentration of weight, even if that concentration was beneficial for modeling those tokens.
148+
- The balancing gradient for an expert involves all experts' totals. So updating the logit for one expert now depends on every other expert’s load. It's obvious it can drown out more specialized signals.
149+
- Naturally, experts that are good at certain tokens are expected to get those; trying to make the router equalize loads regardless of quality can route a token to a weaker expert, simply because the "best" expert is already slightly busier.
150+
151+
(TBC)

posts/entries/012-MoE2.md

Lines changed: 62 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,62 @@
1+
---
2+
title: Loss-free Balancing in MoEs (2)
3+
date: 2025-07-06
4+
tags: [cs, finance, ai, notes]
5+
author: R
6+
location: Above around Ust-Ilimsk, Russia while on a plane from New York to Hong Kong
7+
---
8+
9+
(On second thought) I should've just annotated the paper instead of writing this. Reading something this bright on an airplane with the lights off felt like some type of torture. I couldn't fall asleep in my seat right away, and when I woke up the screen was still on; that made me feel that I had to finish it before doing anything else.
10+
11+
## Problem: Imbalanced routing (Continued)
12+
The problem was illustrated in [the last post](https://robinc.vercel.app/post.html?id=011-MoE-1). Italics are directly quoted from the paper.
13+
14+
### Solution: EC (Expert Choice)
15+
In Expert Choice routing, each expert picks its top tokens instead of each token picking its experts. The authors of the [2025 ICLR paper](https://arxiv.org/pdf/2408.15664) pointed out that this approach *breaks the causal constraint* (causing a leakage of future information which *destroys the generalization of a model and prevents reliable evaluation*). They also supported their hypothesis (*that the loss drop originates from the model’s accessing and exploiting future token information*) with experiments.
16+
17+
### Solution: Loss-Free Balancing
18+
We adjust each expert’s bias $b_i$ online to enforce perfectly balanced routing. Algorithm 1 (Wang et al., 2025) shows how:
19+
20+
Adjusting the per-expert bias $b_i$ during training
21+
Input: MoE model $\theta$, batch iterator $\mathcal{B}$, bias-update rate $u$
22+
23+
1. Initialize $b_i \leftarrow 0, \forall i \in \{1 \dots N\}$.
24+
2. For each batch $\{(x_k,y_k)\}_k \in \mathcal{B}$:
25+
> - Compute raw gating scores
26+
> $$
27+
> s_{i,t} = G(\textbf{u}_t^\top \textbf{e}_i)
28+
> $$
29+
> - Prune with bias
30+
31+
$$
32+
\begin{align*}
33+
g_{i,t} &=
34+
\begin{cases}
35+
s_{i,t}, & s_{i,t} + b_i \in \text{Topk}(\{s_{j,t} + b_j | 1 \leq j \leq N\} , K) \\\\
36+
0, & \text{otherwise}
37+
\end{cases}
38+
\end{align*}
39+
$$
40+
41+
> > Note that $b_i$ only takes part in the Top-k selection and doesn't contribute to the gating weight.
42+
> - Train $\theta$ on this batch using weights $g_{i,t}$.
43+
> - Count assignments:
44+
> $$
45+
> \begin{align*}
46+
> c_i &= \sum_{t=1}^T \mathbb{1}(g_{i,t} > 0), \quad \bar{c} = \frac{KT}{N}\\
47+
> e_i &= \bar{c} - c_i
48+
> \end{align*}
49+
> $$
50+
> - Update bias:
51+
> $$
52+
> b_i \leftarrow b_i + u \cdot \operatorname{sgn}(e_i)
53+
> $$
54+
3. Return $\theta$ and $b_i$.
55+
56+
No extra loss term is needed: by pushing $b_i$ up or down according to the sign of the load violation $e_i$, each expert's share of the $KT$ token assignments is driven toward $1/N$.
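
The bias update can be tried out in isolation, without training any model. The sketch below is a toy simulation under my own assumptions, not the paper's setup: the scores are random sigmoid outputs with an artificial skew toward the first experts, and the sizes, the step count, and the update rate `u = 0.01` are made-up values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_bias(b, s, k, u):
    """One bias update. b: (N,) biases, s: (T, N) raw scores, u: bias-update rate."""
    T, N = s.shape
    topk = np.argsort(-(s + b), axis=1)[:, :k]    # the bias only affects selection
    c = np.bincount(topk.ravel(), minlength=N)    # c_i: tokens assigned to expert i
    e = k * T / N - c                             # e_i = c_bar - c_i
    return b + u * np.sign(e), c

rng = np.random.default_rng(0)
T, N, K = 1024, 8, 2
skew = np.linspace(1.0, 0.0, N)                   # hypothetical: early experts score higher
b = np.zeros(N)
deviations = []
for step in range(300):
    s = sigmoid(rng.normal(size=(T, N)) + skew)   # fresh batch of raw gating scores
    b, c = update_bias(b, s, K, u=0.01)
    deviations.append(np.abs(c - K * T / N).max())

first_dev, last_dev = deviations[0], deviations[-1]
```

In this toy setting the largest load violation shrinks as the biases settle, while the gating scores `s` themselves are never modified.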
57+
58+
## MoE in Application
59+
There's also [the 2023 paper](https://personal.ntu.edu.sg/boan/papers/KDD23_Stock.pdf) (Sun et al., 2023), which builds an MoE in order to account for multiple metrics, unlike the mainstream deep learning models used in the industry.
60+
61+
## For me
62+
I guess my main task is to apply Loss-Free Balancing to an MoE model. Because just looking at the time of the publication and practices around, I'd say the majority of the industry is still using deep learning models that are dependent on other architectures. While attempts could be made (and I'm sure there have been such attempts all around) to take in more indicators, it seems that building a MoE model which overcomes drawbacks in auxiliary-loss implementations could be advantageous.
