<li>Middle weeks shift into concrete GPU and Trainium experimentation, with profiling and tool use.</li>
+<li>Later weeks increasingly revolve around paper discussion, project reviews, and system-building.</li>
+<li>Short student presentations are threaded throughout to connect reading with active experimentation.</li>
+</ul>
+</div>
+</div>
 <p>
 The detailed course document lays out a semester that moves from basic
 numerical and performance foundations toward concrete GPU experiments on
@@ -130,6 +156,44 @@ <h2>Resources and Logistics</h2>
 archived course materials and the semester documents in the repository.
 This homepage is meant to provide the compact public-facing summary.
 </p>
+<div class="catalog-grid">
+<article class="catalog-card">
+<h3>Software Tools</h3>
+<p>
+The shared course document points students toward a hands-on stack
+of systems for writing, checking, and profiling GPU primitives.
+</p>
+<ul>
+<li><strong>AWS Trainium + Neuron/NKI</strong>: the main accelerator experimentation path in the syllabus, including NKI kernels, Neuron Explorer, profiling traces, and attention and matrix-multiplication tutorials.</li>
+<li><strong>CHPC GPU workflow</strong>: CUDA-capable campus systems, <code>nvcc</code>, <code>nvidia-smi</code>, <code>nsys</code>, and batch allocation workflows for NVIDIA profiling.</li>
+<li><strong>Faial</strong>: a data-race and cost-analysis tool used in the course to reason about warp-level behavior and the interaction between correctness and performance.</li>
+<li><strong>GKLEE</strong>: symbolic and concolic GPU bug-finding, used as a reference point for race exposure and schedule-sensitive failures.</li>
+<li><strong>Tilus</strong>: a tile-level GPGPU language for low-precision computation, treated as a language-design case study for structured primitive construction.</li>
+<li><strong>Mojo</strong>: discussed as an emerging systems language for high-performance kernel and HPC-oriented experimentation.</li>
+<li><strong>MLIR and MLIR-AIR</strong>: compiler infrastructure and accelerator-lowering frameworks used to connect loop nests, transformations, and hardware realization.</li>
+<li><strong>AIR2CUDA and related tooling</strong>: software artifacts used to inspect lowering pathways from MLIR-AIR-style flows toward GPU code generation.</li>
+<li><strong>NVBit and custom instrumentation</strong>: dynamic GPU instrumentation ideas, including barrier-focused tooling and low-level runtime inspection.</li>
+<li><strong>Vercors, CIVL, and FP analysis tools</strong>: formal and numeric-analysis tools for proving race freedom, checking semantics, and studying floating-point error.</li>
+</ul>
+</article>
+<article class="catalog-card">
+<h3>Papers by Topic</h3>
+<p>
+The readings in the shared syllabus cluster naturally into a few
+recurring themes.
+</p>
+<ul>
+<li><strong>Performance and throughput modeling</strong>: papers such as <em>uiCA</em>, <em>Facile</em>, the shared-memory atomic bottleneck work, and modular static cost analysis build the vocabulary for predicting and explaining kernel throughput.</li>
+<li><strong>GPU execution cost and productivity</strong>: works such as NPBench, data-centric Python, and CUDA cost-model papers connect user productivity, performance portability, and evaluation-cost reasoning.</li>
+<li><strong>Race detection and GPU verification</strong>: the syllabus groups FastTrack, FSE 2010 SMT-based GPU verification, GKLEE, GPUVerify, HiRace, Memory Access Protocols, and Vercors as complementary approaches to proving or detecting correctness properties.</li>
+<li><strong>Formal semantics and Hoare-style reasoning</strong>: materials such as Hoare logic for GPU programs, memory-model readings, and CIVL point students toward specification-first reasoning instead of purely empirical debugging.</li>
+<li><strong>Floating-point rigor</strong>: the background includes Goldberg’s classic essay, floating-point error-analysis work, Herbie-style rewriting, and scalable rigorous FP analysis, tying numerical semantics directly to kernel trustworthiness.</li>
+<li><strong>Scheduling, mapping, and specialization</strong>: software pipelining, warp specialization, distributed tensor mapping, and distributed Fourier mapping papers capture the scheduling side of making kernels and tensor systems fast.</li>
+<li><strong>Compiler and accelerator design</strong>: MLIR, MLIR-AIR, Tilus, and recent accelerator-lowering work show how modern compiler structures can encode performance intent and hardware structure more systematically.</li>
+<li><strong>Project-facing frontier systems</strong>: RenderMan XPU, tritonBLAS, ParallelKittens, ProofWright, GEAK, TileGym, and Tensor Core survey material serve as examples of current systems that students can study, reimplement, or benchmark against.</li>