raminaeye.github.io/nas.html at main · raminaeye/raminaeye.github.io · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Neural Architecture Search (NAS)</title>
    <style>
        body {
            font-family: sans-serif;
            line-height: 1.6;
            margin: 20px;
            max-width: 900px;
            margin-left: auto;
            margin-right: auto;
            color: #333;
        }
        h1, h2, h3 {
            color: #2c3e50;
        }
        h1 {
            border-bottom: 2px solid #3498db;
            padding-bottom: 10px;
        }
        h2 {
            border-bottom: 1px solid #ddd;
            padding-bottom: 5px;
        }
        code {
            background-color: #f4f4f4;
            padding: 2px 6px;
            border-radius: 4px;
            font-family: monospace;
        }
        pre code {
            display: block;
            padding: 15px;
            overflow-x: auto;
            white-space: pre;
        }
        table {
            width: 100%;
            border-collapse: collapse;
            margin-bottom: 20px;
        }
        th, td {
            border: 1px solid #ddd;
            padding: 8px;
            text-align: left;
        }
        th {
            background-color: #f2f2f2;
        }
        ul {
            margin-bottom: 15px;
        }
        li {
            margin-bottom: 5px;
        }
        nav { margin-bottom: 30px; padding: 10px; background: #ecf0f1; border: 1px solid #bdc3c7; border-radius: 4px;}
        nav ul { list-style: none; padding: 0; }
        nav li { display: inline-block; margin-right: 15px; }
        nav a { text-decoration: none; color: #3498db; font-weight: bold;}
        nav a:hover { text-decoration: underline; color: #2980b9;}

    </style>
    <script type="text/javascript" async
        src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML">
    </script>
    <script type="text/x-mathjax-config">
        MathJax.Hub.Config({
            tex2jax: {
                inlineMath: [['$','$'], ['\\(','\\)']],
                displayMath: [['$$','$$'], ['\\[','\\]']],
                processEscapes: true
            }
        });
    </script>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/styles/monokai.min.css">
    <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/highlight.min.js"></script>
    <script>hljs.highlightAll();</script>
</head>
<body>
    <nav>
     <ul>
      <li><a href="index.html">Home</a></li>
     </ul>
   </nav>
    <h1>Neural Architecture Search (NAS)</h1>

    <p><strong>Neural Architecture Search (NAS)</strong> automates the design of neural networks. The process involves defining a <strong>search space</strong> of possible architectural components, a <strong>search strategy</strong> to explore this space, and an <strong>evaluation strategy</strong> to score candidate models. The primary goal is to discover the optimal architecture considering metrics like accuracy, latency/FLOPs, model size, and hardware constraints.</p>

    <p>A NAS algorithm can manipulate various architectural elements, including:</p>
    <ul>
        <li>Convolutional filter sizes</li>
        <li>Number of channels</li>
        <li>Block types (e.g., MBConv, regular conv)</li>
        <li>Network depth</li>
        <li>Skip connections</li>
        <li>Attention heads</li>
    </ul>

    <hr>

    <h2>Core Components of NAS</h2>

    <h3>Search Strategies</h3>
    <p>Several strategies are employed to navigate the vast search space:</p>
    <ul>
        <li><strong>Reinforcement Learning (RL)</strong>: An RNN controller samples different architecture configurations. A reward signal (e.g., accuracy) is fed back to the controller to guide its search. Google's NASNet utilized this approach.</li>
        <li><strong>Evolutionary Algorithms</strong>: These methods maintain a population of network architectures. Offspring are generated through random mutations, and the fittest individuals are selected for the next generation. AmoebaNet is an example.</li>
        <li><strong>Bayesian Optimization</strong>: This strategy treats architecture search as a black-box optimization problem. It leverages past evaluations to predict which architectures are likely to perform well.</li>
        <li><strong>Gradient-based Search</strong>: A "supernet" is defined, encompassing all possible architectural choices (soft choices of layers). This supernet is trained using gradient descent to identify the best operations.</li>
    </ul>

    <h3>Evaluation Strategies</h3>
    <p>Efficiently evaluating candidate architectures is crucial:</p>
    <ul>
        <li><strong>Proxy Task</strong>: Training candidate models for a reduced number of epochs or on a smaller dataset.</li>
        <li><strong>Weight Sharing</strong>: Candidate architectures share weights from a pre-trained supernet, significantly speeding up evaluation.</li>
        <li><strong>Low-fidelity Scoring</strong>: Estimating performance based on metrics like latency or FLOPs without full training.</li>
    </ul>

    <hr>

    <h2>Notable Applications of NAS</h2>
    <p>NAS has been instrumental in designing state-of-the-art models, including:</p>
    <ul>
        <li><strong>MobileNets</strong>: Searched over block types and network widths.</li>
        <li><strong>EfficientNet</strong>: Employed compound scaling for depth, width, and resolution.</li>
        <li><strong>EdgeTPU/NPU Models</strong>: Utilized latency-aware NAS to optimize for edge devices.</li>
    </ul>

    <hr>

    <h2>NAS for EfficientNet</h2>
    <p>EfficientNet's development leveraged a <strong>reinforcement learning-based NAS</strong>. The search focused on:</p>
    <ul>
        <li>Operator types (MBConv vs. regular convolution)</li>
        <li>Kernel sizes</li>
        <li>Expansion ratios for MobileNet-style inverted residual blocks</li>
        <li>Number of layers per stage</li>
        <li>Inclusion of Squeeze-and-Excitation (SE) blocks</li>
        <li>Stride and width per block</li>
    </ul>
    <p>The search target was a model optimized for both <strong>accuracy and latency on mobile hardware</strong>. Candidate models were trained on ImageNet, and latency was measured on actual devices rather than just relying on FLOP counts. An RL controller guided this search, leading to the <strong>EfficientNet-B0</strong> architecture—the foundational model of the EfficientNet family.</p>
    <p>After identifying B0, NAS was no longer used. Instead, a <strong>compound scaling method</strong> was introduced to systematically scale B0 to generate larger models (EfficientNet B1-B7). The compact search space and on-device latency measurement were key to EfficientNet's success, even compared to contemporary Vision Transformer models.</p>

    <hr>

    <h2>Reinforcement Learning (RL) for NAS</h2>
    <p>In supervised learning, models learn from input/output pairs using a loss function. In <strong>Reinforcement Learning (RL)</strong>, an <strong>agent</strong> interacts with an <strong>environment</strong> by taking <strong>actions</strong>. After each action, it receives a <strong>reward</strong>. The agent's objective is to learn a <strong>policy</strong>—a strategy for choosing actions—that maximizes the cumulative long-term reward.</p>

    <h3>REINFORCE Algorithm</h3>
    <p>A fundamental RL algorithm is <strong>REINFORCE</strong>. The core idea is to increase the probability of actions that lead to high rewards. The REINFORCE loss function is:</p>
    <p>$$ \text{Loss} = -\log(\text{probability of the action}) \times \text{reward} $$</p>
    <p>If the reward is high, minimizing this loss increases the log-probability of the action. This is a type of <strong>policy gradient</strong> method, where the controller's policy is directly adjusted to favor high-reward actions.</p>

    <h3>Mapping RL Concepts to NAS</h3>
    <table>
        <thead>
            <tr>
                <th>RL Concept</th>
                <th>NAS Version</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Agent</td>
                <td>Controller (e.g., an RNN)</td>
            </tr>
            <tr>
                <td>Action</td>
                <td>Selecting an architectural choice (e.g., kernel size, layer type)</td>
            </tr>
            <tr>
                <td>Environment</td>
                <td>Training and evaluating the candidate neural network</td>
            </tr>
            <tr>
                <td>Reward</td>
                <td>Performance metric (e.g., accuracy, latency-adjusted accuracy)</td>
            </tr>
            <tr>
                <td>Policy</td>
                <td>How the controller samples actions (architectural choices)</td>
            </tr>
        </tbody>
    </table>
    <p>To enable the controller to learn over time, it's often implemented as an RNN. The RNN outputs logits for different architectural choices (actions). The REINFORCE algorithm (policy gradient) then updates the RNN's parameters based on the received reward.</p>

    <h3>Example: Using RL to Discover the Best Kernel Size</h3>
    <p>Let's consider finding the optimal kernel size for a convolutional block using RL:</p>
    <ol>
        <li><strong>Search Space / Action Space</strong>: Define possible kernel sizes, e.g., <code>[3, 5, 7]</code>.</li>
        <li><strong>Controller</strong>: Implement a controller (e.g., an RNN or even a random sampler initially) that selects a kernel size.</li>
        <li><strong>Evaluation Loop</strong>: For each sampled kernel size:
            <ul>
                <li>Build a small model using that kernel.</li>
                <li>Train it briefly or evaluate it using weight-sharing.</li>
                <li>Measure its performance (e.g., validation accuracy).</li>
                <li>Provide this performance measure as a reward to the controller.</li>
            </ul>
        </li>
    </ol>
    <p>If the controller outputs a softmax distribution over kernel sizes, and it samples <code>kernel = 3</code> with a probability of, say, <code>0.62</code>, and the resulting reward is <code>0.85</code>, the REINFORCE loss would be:</p>
    <p>$$ \text{loss} = -\log(0.62) \times 0.85 \approx -(-0.478) \times 0.85 \approx 0.406 $$</p>
    <p>Backpropagation through this loss adjusts the controller's logits to increase the probability of selecting <code>kernel = 3</code> if the reward is high. Conversely, if the reward is low, the probability of choosing that kernel decreases. Often, a <strong>baseline reward</strong> (e.g., the moving average of past rewards) is subtracted from the current reward. This encourages actions that perform better than average.</p>
    <p>The following Python code demonstrates a simplified version of this process. A controller samples kernel sizes, and it's trained using REINFORCE to maximize a simulated reward. We run 50 iterations of this sampling and learning process. The code will also generate a plot showing the controller's probability of choosing each kernel size over the training steps.</p>

    <pre><code class="language-python">

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

"""
    Toy architecture choices: kernel sizes
    This is the policy that chooses which architecture to build. It's a probability distribution over actions.
    In RL the controller is the agent.
    The controller trying to learn a probability distribution over architectural choices.
    This is action taken by the agent, sampling kernel sizes. You sample from the controller's distribution.
    It starts with choosing each kernel equally at first. These logits will get transformed to probabilities. Then you sample from this categorical distribution
    Then you measure the reward from taking this action. This is the environment returning a reward based on the action the agent took.
    The REINFORCE algorithm updates these logits to prefer higher-reward options, it's updating the contoller.
    And you repeat with the newly updated controller. Sampling, training, evaluating and improving the controller.
"""
kernel_sizes = [3, 5, 7]
logits = torch.nn.Parameter(torch.zeros(len(kernel_sizes))) # internal parameters of the controller
optimizer = torch.optim.Adam([logits], lr=0.1)

# Simulated reward function (pretend model performance)
# Let's say kernel=5 is the best
"""
# In practice, this is how you migth get the reward from accuracy.
def get_reward(kernel):
    model = SimpleCNN(kernel_size=kernel)
    train_model(model, train_loader)
    accuracy = evaluate_model(model, val_loader)
    return accuracy
# Or maybe a multi-objvective:
        reward = accuracy - 0.01 * latency - 0.005 * model_size
"""
def get_reward(kernel):
    true_perf = {3: 0.7, 5: 0.9, 7: 0.6} # we're assuming these are the accuracy from training the model
    noise = torch.randn(1).item() * 0.02 # small noise
    return true_perf[kernel] + noise

# Tracking
history = {
    "step": [],
    "chosen_kernel": [],
    "reward": [],
    "probs": []
}

for step in range(50):
    probs = F.softmax(logits, dim=0)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()
    kernel = kernel_sizes[action.item()]
    reward = get_reward(kernel)
    loss = -dist.log_prob(action) * reward # REINFORCE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Log
    history["step"].append(step)
    history["chosen_kernel"].append(kernel)
    history["reward"].append(reward)
    history["probs"].append(probs.detach().numpy())

# Convert logged data for visualization
probs_over_time = list(zip(*history["probs"]))

# Plot probabilities over time
# The following code would generate a plot if run in a Python environment
with matplotlib.pyplot imported as plt.
plt.figure(figsize=(10, 6))
for i, k in enumerate(kernel_sizes):
     plt.plot(history["step"], probs_over_time[i], label=f"kernel={k}")
plt.title("Controller's Probability of Choosing Each Kernel Size")
plt.xlabel("Training Step")
plt.ylabel("Probability")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
    </code></pre>

    <hr>

    <h2>Weight Sharing in NAS: The Supernet Approach</h2>

    <p>As you can imagine, Neural Architecture Search (NAS) can be computationally expensive, primarily because training an entire model for each sampled hyperparameter or architectural configuration takes a significant amount of time. That’s where <strong>weight sharing</strong> comes into play, offering a more efficient paradigm.</p>
    <p>In this approach, you typically build a <strong>supernet</strong> (also known as a one-shot model) that encompasses many or all of the architectural choices from the search space. For instance, if you're searching for the optimal kernel size, the supernet would be designed to support multiple kernel sizes within a single, larger model structure. The weights for common parts of the network, or even for the different choices themselves, can be shared.</p>

    <p>An example of a convolutional block within such a supernet might look like this in PyTorch:</p>

    <pre><code class="language-python">
class SuperConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Using nn.ModuleDict to hold different convolution operations
        self.ops = nn.ModuleDict({
            'k3': nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False), # bias=False is common in blocks followed by BN
            'k5': nn.Conv2d(in_channels, out_channels, kernel_size=5, padding=2, bias=False),
            'k7': nn.Conv2d(in_channels, out_channels, kernel_size=7, padding=3, bias=False),
        })

    def forward(self, x, choice):
        # The 'choice' argument determines which path (kernel size) to activate
        return self.ops[choice](x)

# Example of a simplified SuperNet structure
class SuperNet(nn.Module):
    def __init__(self, num_classes=10): # Example: 10 classes for classification
        super().__init__()
        # Assume input channels is 3 (e.g., for RGB images)
        # Output channels for the super conv block could be, e.g., 16
        self.conv_block = SuperConvBlock(in_channels=3, out_channels=16)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d(1) # Global average pooling
        self.fc = nn.Linear(16, num_classes) # Classifier

    def forward(self, x, arch_choice):
        x = self.conv_block(x, arch_choice)
        x = self.relu(x)
        x = self.pool(x)
        x = x.view(x.size(0), -1) # Flatten
        x = self.fc(x)
        return x
    </code></pre>

    <p>The supernet is designed to support all three kernel sizes (k3, k5, k7 in this example), and the key idea is that the "weights" for these operations (or parts of them) can be "shared" or jointly trained. During the search or training phase, in each forward pass, only one path (e.g., one kernel size) is typically activated and trained, often based on a decision from a controller or a sampling strategy.</p>

    <p>Each architectural choice (e.g., kernel size) gets trained and selected in turn, but critically, the updates are applied to the shared weights within the supernet. This means you're not training many individual models from scratch but rather training different sub-networks within the larger supernet.</p>
    <p>A simplified training loop for the supernet might look like this:</p>
    <pre><code class="language-python">
for epoch in range(E):
    for batch_idx, (x, y) in enumerate(train_loader):
        # Randomly select an architectural choice for this batch/step
        # In a more advanced setup, a controller or search strategy makes this choice
        choice = random.choice(['k3', 'k5', 'k7'])

        pred = supernet(x, arch_choice=choice)
        loss = F.cross_entropy(pred, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    </code></pre>

    <p>During this training process, only a part of the supernet (the selected path) is active and updated in each step. A controller might sample the desired architecture, or paths might be sampled uniformly or according to a schedule. The performance of these sampled sub-architectures is evaluated, often using these shared weights.</p>
    <p>Once the supernet training is complete, a separate search phase might occur where different sub-architectures are evaluated using the trained supernet weights. The final, top-performing subnet (the "discovered" architecture) is then typically retrained from scratch to achieve its optimal performance, as the shared training process can sometimes lead to slight miscalibrations for any single architecture.</p>

    <p>This weight-sharing approach might seem counterintuitive because individual sub-architectures might not fully converge optimally during the supernet training. However, this "non-convergence" (or rather, coupled training) is by design and is precisely what makes this approach significantly faster and cheaper than training every candidate model independently. The goal of supernet training is often to quickly estimate the relative potential of different architectural choices. You're not aiming to train one model to perfect convergence immediately, but rather to train a population of potential sub-models simultaneously, allowing each to receive partial updates and guiding the search towards promising regions of the architecture space. The weights of each operator within the supernet are gradually improved over many such updates.</p>

    <hr>
</body>
</html>