<!DOCTYPE html>
<html lang="en-us">
<head>
<!-- Required meta tags -->
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no, user-scalable=no">
<!-- Font Awesome for social media icons -->
<script src="https://kit.fontawesome.com/791291c78f.js" crossorigin="anonymous"></script>
<!-- Bootstrap CSS -->
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css" integrity="sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh" crossorigin="anonymous">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<!-- Site Information -->
<title> SAIL@Princeton </title>
<style type="text/css">
.smlinks {
color: black;
}
.smlinks:hover {
color: rgb(7, 107, 255);
}
.paper-item {
margin-bottom: 15px; /* Adjust this value to increase/decrease the space */
}
.badge.badge-secondary {
cursor: pointer;
}
</style>
<!-- Favicon -->
<!-- TODO(ruipan): we could add a favicon of the website here -->
<!-- https://realfavicongenerator.net/ -->
<!-- <link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png">
<link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png">
<link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png">
<link rel="manifest" href="/site.webmanifest">
<link rel="mask-icon" href="/safari-pinned-tab.svg" color="#5bbad5">
<meta name="msapplication-TileColor" content="#da532c">
<meta name="theme-color" content="#ffffff"> -->
<!-- Functionality for searching papers -->
<script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
<script>
$(document).ready(function () {
// Function to get URL parameters
function getQueryParam(name) {
let urlParams = new URLSearchParams(window.location.search);
return urlParams.get(name) || "";
}
// Function to perform search
function filterPapers(query) {
query = query.toLowerCase();
$(".paper-item").each(function () {
let text = $(this).text().toLowerCase();
$(this).closest("li").toggle(text.includes(query));
});
}
// Populate search bar and apply filter if "search" parameter exists
let searchQuery = getQueryParam("search");
if (searchQuery) {
$("#search").val(searchQuery);
filterPapers(searchQuery); // Directly apply the filter
}
// Attach event listener for manual searches
$("#search").on("input", function () {
filterPapers($(this).val());
});
// Add click event to badge elements
$(".badge.badge-secondary").on("click", function () {
let keyword = $(this).text().trim();
$("#search").val(keyword).trigger("input"); // Update search bar and trigger filtering
});
// Clear search when the "Reset" button is clicked
$("#clear-search").on("click", function () {
$("#search").val("").trigger("input"); // Clear input and reset filter
});
});
</script>
</head>
<body>
<!-- Nav Bar -->
<!-- TODO(ruipan): figure out how to align the nav items to the right rather than the left -->
<nav class="navbar navbar-expand-lg navbar-light sticky-top navbar-custom" style="background-color: #f58025">
<a class="navbar-brand" href="index.html">
<img src="./images/princeton_square.jpg" width="30" height="30" class="d-inline-block align-top" alt="SAIL@Princeton logo">
SAIL@Princeton
</a>
<button class="navbar-toggler" type="button" data-toggle="collapse" data-target="#navbarSupportedContent" aria-controls="navbarSupportedContent" aria-expanded="false" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div class="collapse navbar-collapse" id="navbarSupportedContent">
<ul class="navbar-nav mr-auto">
<li class="nav-item" data-toggle="collapse" data-target=".navbar-collapse.show">
<a class="nav-link" href="index.html#projects">Projects</a>
</li>
<li class="nav-item" data-toggle="collapse" data-target=".navbar-collapse.show">
<a class="nav-link" href="people.html">People</a>
</li>
<li class="nav-item" data-toggle="collapse" data-target=".navbar-collapse.show">
<a class="nav-link" href="publications.html">Publications</a>
</li>
<li class="nav-item" data-toggle="collapse" data-target=".navbar-collapse.show">
<a class="nav-link" href="blogs.html">Blogs</a>
</li>
</ul>
</div>
</nav>
<!-- Jumbotron -->
<div class="jumbotron jumbotron-fluid text-center">
<div class="container">
<div class="row align-items-center">
<div class="col-sm-12">
<h2 class="jumbotron-heading">Publications of SAIL@Princeton</h2>
<p class="lead">Our publications showcase cutting-edge research at the intersection of systems and machine learning,
advancing efficient, scalable, and secure AI/ML systems. From novel models and algorithms to optimized runtime systems for training and inference,
our work pushes the boundaries of next-generation AI infrastructure. Explore our latest contributions to AI/ML and systems research below.</p>
</div>
</div>
</div>
</div>
<!-- Search bar -->
<div class="container">
<div class="row">
<div class="col-sm-12">
<div class="d-flex mb-3">
<input type="text" id="search" class="form-control" placeholder="Search by title, author, or keyword..." style="flex: 1;">
<button id="clear-search" class="btn btn-outline-secondary ml-2">Reset</button>
</div>
</div>
</div>
</div>
<!-- Preprints -->
<div class="container">
<div class="row">
<div class="col-sm-12">
<h3>Preprints</h3>
<ul>
<li class="paper-item">
<h5>Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs</h5>
Rui Pan, Zhuofu Chen, Hongyi Liu, Arvind Krishnamurthy, Ravi Netravali <br>
arXiv 2025<br>
<span class="badge badge-secondary">Efficient Inference</span>
<span class="badge badge-secondary">Emerging Paradigms</span>
<div class="mt-2">
<a href="https://arxiv.org/abs/2512.20573" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#failfast-abstract" role="button" aria-expanded="false" aria-controls="failfast-abstract">Abstract</a>
</div>
<div class="collapse" id="failfast-abstract">
<div class="card card-body">
Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that dLLM's speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It "fails fast" by spending minimal compute in hard-to-speculate regions to shrink speculation latency and "wins big" by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9× speedup over vanilla decoding, 1.7× over the best naive dLLM drafter, and 1.4× over EAGLE-3 across diverse models and workloads. We open-source FailFast.
</div>
</div>
</li>
<li class="paper-item">
<h5>Aragog: Just-in-Time Model Routing for Scalable Serving of Agentic Workflows</h5>
Yinwei Dai, Zhuofu Chen, Anand Iyer, Ravi Netravali <br>
arXiv 2025<br>
<span class="badge badge-secondary">Efficient Inference</span>
<span class="badge badge-secondary">Compound AI Systems</span>
<div class="mt-2">
<a href="https://arxiv.org/abs/2511.20975" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#aragog-abstract" role="button" aria-expanded="false" aria-controls="aragog-abstract">Abstract</a>
</div>
<div class="collapse" id="aragog-abstract">
<div class="card card-body">
Agentic workflows have emerged as a powerful paradigm for solving complex, multi-stage tasks, but serving them at scale is computationally expensive given the many LLM inferences that each request must pass through. Configuration selection, or the cost-aware assignment of workflow agents to specific LLMs, can reduce these costs, but existing approaches bind configuration decisions before request execution, making them ill-suited for the heterogeneous and lengthy execution of workflows. Specifically, system loads can fluctuate rapidly and substantially during a request's lifetime, causing fixed configurations to quickly become suboptimal. We present Aragog, a system that progressively adapts a request's configuration throughout its execution to match runtime dynamics. To make this practical despite the massive space of workflow configurations, Aragog decouples the problem into two core elements -- a one-time routing step that identifies all accuracy-preserving configurations, and a cheap per-stage scheduler that selects among them using up-to-date system observations -- and introduces novel strategies to accelerate each. Across diverse workflows and model families, Aragog increases maximum serving throughput by 50.0–217.0% and reduces median latency by 32.5–78.9% at peak request rates, while maintaining accuracy comparable to the most expensive configurations.
</div>
</div>
</li>
<li class="paper-item">
<h5>Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning</h5>
Lijie Yang, Zhihao Zhang, Arti Jain, Shijie Cao, Baihong Yuan, Yiwei Chen, Zhihao Jia, Ravi Netravali <br>
arXiv 2025<br>
<span class="badge badge-secondary">Efficient Inference</span>
<span class="badge badge-secondary">Emerging Paradigms</span>
<div class="mt-2">
<a href="https://arxiv.org/abs/2508.07101" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#lessismore-abstract" role="button" aria-expanded="false" aria-controls="lessismore-abstract">Abstract</a>
</div>
<div class="collapse" id="lessismore-abstract">
<div class="card card-body">
Large reasoning models achieve strong performance through test-time scaling but incur substantial computational overhead, particularly from excessive token generation when processing short input prompts. While sparse attention mechanisms can reduce latency and memory usage, existing approaches suffer from significant accuracy degradation due to accumulated errors during long-generation reasoning. These methods generally require either high token retention rates or expensive retraining. We introduce LessIsMore, a training-free sparse attention mechanism for reasoning tasks, which leverages global attention patterns rather than relying on traditional head-specific local optimizations. LessIsMore aggregates token selections from local attention heads with recent contextual information, enabling unified cross-head token ranking for future decoding layers. This unified selection improves generalization and efficiency by avoiding the need to maintain separate token subsets per head. Evaluation across diverse reasoning tasks and benchmarks shows that LessIsMore preserves -- and in some cases improves -- accuracy while achieving a 1.1× average decoding speed-up compared to full attention. Moreover, LessIsMore attends to 2× fewer tokens without accuracy loss, achieving a 1.13× end-to-end speed-up compared to existing sparse attention methods.
</div>
</div>
</li>
<li class="paper-item">
<h5>GPUs, CPUs, and... NICs: Rethinking the Network's Role in Serving Complex AI Pipelines</h5>
Mike Wong, Ulysses Butler, Emma Farkash, Praveen Tammana, Anirudh Sivaraman, Ravi Netravali <br>
arXiv 2025<br>
<span class="badge badge-secondary">Efficient Inference</span>
<span class="badge badge-secondary">Compound AI Systems</span>
<div class="mt-2">
<a href="https://arxiv.org/abs/2502.15712" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#nics-abstract" role="button" aria-expanded="false" aria-controls="nics-abstract">Abstract</a>
</div>
<div class="collapse" id="nics-abstract">
<div class="card card-body">
The increasing prominence of AI necessitates the deployment of inference platforms for efficient and effective management of AI pipelines and compute resources. As these pipelines grow in complexity, the demand for distributed serving rises and introduces much-dreaded network delays. In this paper, we investigate how the network can instead be a boon to the excessively high resource overheads of AI pipelines. To alleviate these overheads, we discuss how resource-intensive data processing tasks -- a key facet of growing AI pipeline complexity -- are well-matched for the computational characteristics of packet processing pipelines and how they can be offloaded onto SmartNICs. We explore the challenges and opportunities of offloading, and propose a research agenda for integrating network hardware into AI pipelines, unlocking new opportunities for optimization.
</div>
</div>
</li>
<li class="paper-item">
<h5>How to Train Long-Context Language Models (Effectively)</h5>
Tianyu Gao*, Alexander Wettig*, Howard Yen, Danqi Chen <br>
arXiv 2025<br>
<span class="badge badge-secondary">Efficient Training</span>
<div class="mt-2">
<a href="https://arxiv.org/abs/2410.02660" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#prolong-abstract" role="button" aria-expanded="false" aria-controls="prolong-abstract">Abstract</a>
</div>
<div class="collapse" id="prolong-abstract">
<div class="card card-body">
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information.
We first establish a reliable evaluation protocol to guide model development -- Instead of perplexity or simple needle-in-a-haystack (NIAH) tests,
we use a broad set of long-context tasks, and we evaluate models after SFT with instruction data as this better reveals long-context abilities.
Supported by our robust evaluations, we run thorough experiments to decide the data mix for continued pre-training, the instruction tuning dataset,
and many other design choices. We find that (1) code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short data;
(2) training with a sequence length beyond the evaluation length boosts long-context performance;
(3) for SFT, using only short instruction datasets yields strong performance on long-context tasks.
Our final model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K.
ProLong outperforms Llama-3.1-8B-Instruct on the majority of long-context tasks despite having seen only 5% as many tokens during long-context training.
Additionally, ProLong can effectively process up to 512K tokens, one of the longest context windows of publicly available LMs.
</div>
</div>
</li>
<li class="paper-item">
<h5>Certifiably Robust RAG against Retrieval Corruption</h5>
Chong Xiang*, Tong Wu*, Zexuan Zhong, David Wagner, Danqi Chen, Prateek Mittal <br>
arXiv 2025<br>
<span class="badge badge-secondary">Compound AI Systems</span>
<div class="mt-2">
<a href="https://arxiv.org/abs/2405.15556" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#robustrag-abstract" role="button" aria-expanded="false" aria-controls="robustrag-abstract">Abstract</a>
</div>
<div class="collapse" id="robustrag-abstract">
<div class="card card-body">
Retrieval-augmented generation (RAG) has been shown vulnerable to retrieval corruption attacks: an attacker can inject malicious passages into retrieval results to induce inaccurate responses.
In this paper, we propose RobustRAG as the first defense framework against retrieval corruption attacks.
The key insight of RobustRAG is an isolate-then-aggregate strategy: we get LLM responses from each passage in isolation and then securely aggregate these isolated responses.
To instantiate RobustRAG, we design keyword-based and decoding-based algorithms for securely aggregating unstructured text responses.
Notably, RobustRAG can achieve certifiable robustness: we can formally prove and certify that, for certain queries, RobustRAG can always return accurate responses,
even when the attacker has full knowledge of our defense and can arbitrarily inject a small number of malicious passages. We evaluate RobustRAG on open-domain QA and long-form text generation datasets and demonstrate its effectiveness and generalizability across various tasks and datasets.
</div>
</div>
</li>
</ul>
</div>
</div>
</div>
<!-- 2026 -->
<div class="container">
<div class="row">
<div class="col-sm-12">
<h3>2026</h3>
<ul>
<li class="paper-item">
<h5>Remembrall: Leaning into Memory for Accurate Video Analytics on System-on-Chip GPUs</h5>
Murali Ramanujam, Yinwei Dai, Kyle Jamieson, Ravi Netravali <br>
NSDI 2026 (to appear)<br>
<span class="badge badge-secondary">Edge AI Systems</span>
<div class="mt-2">
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#remembrall-abstract" role="button" aria-expanded="false" aria-controls="remembrall-abstract">Abstract</a>
</div>
<div class="collapse" id="remembrall-abstract">
<div class="card card-body">
[Abstract content to be added]
</div>
</div>
</li>
</ul>
</div>
</div>
</div>
<!-- 2025 -->
<div class="container">
<div class="row">
<div class="col-sm-12">
<h3>2025</h3>
<ul>
<li class="paper-item">
<h5>SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning</h5>
Rui Pan, Yinwei Dai, Zhihao Zhang, Gabriele Oliaro, Zhihao Jia, Ravi Netravali <br>
NeurIPS 2025<br>
<span class="badge badge-secondary">Efficient Inference</span>
<span class="badge badge-secondary">Emerging Paradigms</span>
<div class="mt-2">
<a href="https://arxiv.org/abs/2504.07891" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#specreason-neurips-abstract" role="button" aria-expanded="false" aria-controls="specreason-neurips-abstract">Abstract</a>
<a href="https://github.com/ruipeterpan/specreason" target="_blank"><button type="button" class="btn btn-outline-primary btn-sm">Code</button></a>
</div>
<div class="collapse" id="specreason-neurips-abstract">
<div class="card card-body">
Recent advances in inference-time compute have significantly improved performance on complex tasks by
generating long chains of thought (CoTs) using Large Reasoning Models (LRMs).
However, this improved accuracy comes at the cost of high inference latency due to
the length of generated reasoning sequences and the autoregressive nature of decoding.
Our key insight in tackling these overheads is that LRM inference, and the reasoning that it embeds,
is highly tolerant of approximations: complex tasks are typically broken down into simpler steps,
each of which brings utility based on the semantic insight it provides for downstream steps
rather than the exact tokens it generates. Accordingly, we introduce SpecReason, a system that
automatically accelerates LRM inference by using a lightweight model to (speculatively) carry out
simpler intermediate reasoning steps and reserving the costly base model only to assess (and potentially correct)
the speculated outputs. Importantly, SpecReason's focus on exploiting the semantic flexibility of
thinking tokens in preserving final-answer accuracy is complementary to prior speculation techniques,
most notably speculative decoding, which demands token-level equivalence at each step. Across a variety
of reasoning benchmarks, SpecReason achieves 1.5-2.5× speedup over vanilla LRM inference while
improving accuracy by 1.0-9.9%. Compared to speculative decoding without SpecReason,
their combination yields an additional 19.4-44.2% latency reduction.
We open-source SpecReason at https://github.com/ruipeterpan/specreason.
</div>
</div>
</li>
<li class="paper-item">
<h5>Software Managed Networks via Coarsening</h5>
Pradeep Dogga, Rachee Singh, Suman Nath, Ravi Netravali, Jens Palsberg, George Varghese <br>
HotNets 2025<br>
<span class="badge badge-secondary">ML for Systems</span>
<div class="mt-2">
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#coarsening-abstract" role="button" aria-expanded="false" aria-controls="coarsening-abstract">Abstract</a>
</div>
<div class="collapse" id="coarsening-abstract">
<div class="card card-body">
We propose moving from Software Defined Networks (SDN) to Software Managed Networks (SMN) where all information for managing the life cycle of a network (from deployment to operations to upgrades), across all layers (from Layer 1 through 7) is stored in a central repository. Crucially, a SMN also has a generalized control plane that, unlike SDN, controls all aspects of the cloud including traffic management (e.g., capacity planning) and reliability (e.g., incident routing) at both short (minutes) and large (years) time scales. Just as SDN allows better routing, a SMN improves visibility and enables cross-layer optimizations for faster response to failures and better network planning and operations. Implemented naively, SMN for planetary-scale networks requires orders of magnitude larger and more heterogeneous data (e.g., alerts, logs) than SDN. We address this using coarsening — mapping complex data to a more compact abstract representation that has approximately the same effect, and is more scalable, maintainable, and learnable. We show examples including Coarse Bandwidth Logs for capacity planning and Coarse Dependency Graphs for incident routing. Coarse Dependency Graphs improve an incident routing metric from 45% to 78%, while a distributed approach like Scouts achieves only 22% on the same metric. We end by discussing how to realize SMN, and suggest cross-layer optimizations and coarsenings for other operational and planning problems in networks.
</div>
</div>
</li>
<li class="paper-item">
<h5>Toward Bandwidth-adaptive Fully-Immersive Volumetric Video Conferencing</h5>
Rajrup Ghosh, Christina Suyong Shin, Lei Zhang, Muyang Ye, Tao Jin, Harsha V. Madhyastha, Ravi Netravali, Antonio Ortega, Sanjay Rao, Anthony Rowe, Ramesh Govindan <br>
CoNEXT 2025<br>
<span class="badge badge-secondary">ML for Systems</span>
<div class="mt-2">
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#volumetric-abstract" role="button" aria-expanded="false" aria-controls="volumetric-abstract">Abstract</a>
</div>
<div class="collapse" id="volumetric-abstract">
<div class="card card-body">
Volumetric video allows users 6 degrees of freedom (6-DoF) in viewing continuously evolving scenes in 3D. Given broadband speeds today, volumetric video conferencing will soon be feasible. Even so, these scenes will need to be compressed, and compression will need to adapt to variations in bandwidth availability. Existing 3D compression techniques cannot adapt to bandwidth availability, are slow, and utilize bandwidth inefficiently, so they don't scale well to large scene descriptions. LiVo achieves low-latency and large-scene two-way conferencing by maximally leveraging existing 2D video infrastructure, including compression standards, rate-adaptive codecs, and real-time transport protocols. To achieve high quality, LiVo must carefully compose scenes from multiple cameras into multiple streams, encode scene geometry in a novel way, adapt to and apportion available bandwidth dynamically between streams to ensure high reconstruction quality, and cull content outside the receiver's field of view to reduce information sent into the network. These novel contributions enable LiVo to outperform the state-of-the-art by over 20% in objective quality. In a user study, LiVo achieves a mean opinion score of 4.1, while other approaches achieve significantly lower values.
</div>
</div>
</li>
<li class="paper-item">
<h5>RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation</h5>
Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Ganesh Ananthanarayanan, Ravi Netravali, Junchen Jiang <br>
SOSP 2025<br>
<span class="badge badge-secondary">Efficient Inference</span>
<span class="badge badge-secondary">Compound AI Systems</span>
<div class="mt-2">
<a href="https://arxiv.org/pdf/2412.10543" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#ragserve-sosp-abstract" role="button" aria-expanded="false" aria-controls="ragserve-sosp-abstract">Abstract</a>
</div>
<div class="collapse" id="ragserve-sosp-abstract">
<div class="card card-body">
RAG (Retrieval Augmented Generation) allows LLMs (large language models) to
generate better responses with external knowledge, but using more external
knowledge often improves generation quality at the expense of response delay.
Prior work either reduces the response delay (through better scheduling of RAG
queries) or strives to maximize quality (which involves tuning the RAG workflow),
but they fall short in optimizing the <em>tradeoff</em> between the delay
and quality of RAG responses. This paper presents RAGServe, the first RAG system
that jointly schedules queries and adapts the key RAG configurations of each
job, such as the number of retrieved text chunks and synthesis methods,
in order to balance quality optimization and response delay reduction.
Using 4 popular RAG-QA datasets, we show that compared with the state-of-the-art
RAG scheduling system, RAGServe reduces the generation latency by 1.64–2.54×
without sacrificing generation quality.
</div>
</div>
</li>
<li class="paper-item">
<h5>Metadata Conditioning Accelerates Language Model Pre-training</h5>
Tianyu Gao, Alexander Wettig, Luxi He, Yihe Dong, Sadhika Malladi, Danqi Chen <br>
ICML 2025<br>
<span class="badge badge-secondary">Efficient Training</span>
<div class="mt-2">
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#meco-abstract" role="button" aria-expanded="false" aria-controls="meco-abstract">Abstract</a>
</div>
<div class="collapse" id="meco-abstract">
<div class="card card-body">
The vast diversity of styles, domains, and quality levels present in language model pre-training corpora is essential in developing general model capabilities, but efficiently learning and deploying the correct behaviors exemplified in each of these heterogeneous data sources is challenging. To address this, we propose a new method, termed Metadata Conditioning then Cooldown (MeCo), to incorporate additional learning cues during pre-training. MeCo first provides metadata (e.g., URLs like en.wikipedia.org) alongside the text during training and later uses a cooldown phase with only the standard text, thereby enabling the model to function normally even without metadata. MeCo significantly accelerates pre-training across different model scales (600M to 8B parameters) and training sources (C4, RefinedWeb, and DCLM). For instance, a 1.6B language model trained with MeCo matches the downstream task performance of standard pre-training while using 33% less data. Additionally, MeCo enables us to steer language models by conditioning the inference prompt on either real or fabricated metadata that encodes the desired properties of the output: for example, prepending wikipedia.org to reduce harmful generations or factquizmaster.com (fabricated) to improve common knowledge task performance. We also demonstrate that MeCo is compatible with different types of metadata, such as model-generated topics. MeCo is remarkably simple, adds no computational overhead, and demonstrates promise in producing more capable and steerable language models.
</div>
</div>
</li>
<li class="paper-item">
<h5>Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping</h5>
Muru Zhang, Mayank Mishra, Zhongzhu Zhou, William Brandon, Jue Wang, Yoon Kim, Jonathan Ragan-Kelley, Shuaiwen Leon Song, Ben Athiwaratkun, Tri Dao <br>
ICML 2025<br>
<span class="badge badge-secondary">Edge AI Systems</span>
<div class="mt-2">
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#ladder-residual-abstract" role="button" aria-expanded="false" aria-controls="ladder-residual-abstract">Abstract</a>
</div>
<div class="collapse" id="ladder-residual-abstract">
<div class="card card-body">
Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to efficiently scale. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, using model parallelism necessitates communication of information between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping that effectively hides the latency of communication. Our insight is that in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual can allow communication-computation decoupling in conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers can achieve a 29% end-to-end wall-clock speedup at inference time with TP sharding over 8 devices. We refer to the resulting Transformer model as the Ladder Transformer. We train a 1B and 3B Ladder Transformer from scratch and observe comparable performance to a standard dense transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by only retraining for 3B tokens. We release our code for training and inference for easier replication of experiments.
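The core reordering can be illustrated with a toy single-process sketch, where `layer` and `all_reduce` are hypothetical stand-ins rather than the paper's implementation: each layer's communication result joins the residual stream one layer later, so on real hardware it can proceed while the next layer computes.

```python
# Toy single-process sketch of the Ladder Residual reordering (illustrative;
# `layer` and `all_reduce` are hypothetical stand-ins, not the paper's code).

def layer(x, i):
    """Stand-in for a Transformer block's local (per-GPU) computation."""
    return x * 0.5 + i

def all_reduce(y):
    """Stand-in for the Tensor Parallel all-reduce across GPUs."""
    return y  # identity here; in TP this would sum partial results

def ladder_forward(x, num_layers):
    pending = 0.0                    # communication still "in flight"
    for i in range(num_layers):
        out = layer(x, i)            # can start without waiting for last comm
        x = x + pending              # fold in the previous layer's comm result
        pending = all_reduce(out)    # launch this layer's comm; used next step
    return x + pending               # drain the final communication

result = ladder_forward(1.0, num_layers=4)
```

Because each layer reads a residual that excludes the still-communicating output of its predecessor, the all-reduce and the block computation no longer serialize.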
</div>
</div>
</li>
<li class="paper-item">
<h5>Long-context state-space video world models</h5>
Ryan Po, Yotam Nitzan, Richard Zhang, Berlin Chen, Tri Dao, Eli Shechtman, Gordon Wetzstein, Xun Huang <br>
ICCV 2025<br>
<span class="badge badge-secondary">Sequence Modeling</span>
<span class="badge badge-secondary">State Space Models</span>
<div class="mt-2">
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#long-context-state-video-abstract" role="button" aria-expanded="false" aria-controls="long-context-state-video-abstract">Abstract</a>
</div>
<div class="collapse" id="long-context-state-video-abstract">
<div class="card card-body">
Video diffusion models have recently shown promise for world modeling through autoregressive frame prediction conditioned on actions. However, they struggle to maintain long-term memory due to the high computational cost associated with processing extended sequences in attention layers. To overcome this limitation, we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Unlike previous approaches that retrofit SSMs for non-causal vision tasks, our method fully exploits the inherent advantages of SSMs in causal sequence modeling. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory, combined with dense local attention to ensure coherence between consecutive frames. We evaluate the long-term memory capabilities of our model through spatial retrieval and reasoning tasks over extended horizons. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory, while maintaining practical inference speeds suitable for interactive applications.
</div>
</div>
</li>
<li class="paper-item">
<h5>Hardware-Efficient Attention for Fast Decoding</h5>
Ted Zadouri, Hubert Strauss, Tri Dao <br>
COLM 2025<br>
<span class="badge badge-secondary">Hardware Design for ML</span>
<span class="badge badge-secondary">Efficient Inference</span>
<div class="mt-2">
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#hw-attn-fast-decode-abstract" role="button" aria-expanded="false" aria-controls="hw-attn-fast-decode-abstract">Abstract</a>
</div>
<div class="collapse" id="hw-attn-fast-decode-abstract">
<div class="card card-body">
LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decoding limits parallelism. We analyze the interplay among arithmetic intensity, parallelization, and model quality and question whether current architectures fully exploit modern hardware. This work redesigns attention to perform more computation per byte loaded from memory to maximize hardware efficiency without trading off parallel scalability. We first propose Grouped-Tied Attention (GTA), a simple variant that combines and reuses key and value states, reducing memory transfers without compromising model quality. We then introduce Grouped Latent Attention (GLA), a parallel-friendly latent attention paired with low-level optimizations for fast decoding while maintaining high model quality. Experiments show that GTA matches Grouped-Query Attention (GQA) quality while using roughly half the KV cache and that GLA matches Multi-head Latent Attention (MLA) and is easier to shard. Our optimized GLA kernel is up to 2× faster than FlashMLA, for example, in a speculative decoding setting when the query length exceeds one. Furthermore, by fetching a smaller KV cache per device, GLA reduces end-to-end latency and increases throughput in online serving benchmarks by up to 2×.
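The memory argument behind GTA can be made concrete with back-of-the-envelope arithmetic; the head counts and dtype below are hypothetical, not the paper's configurations:

```python
# Back-of-the-envelope KV-cache arithmetic behind GTA (illustrative; the head
# counts and dtype below are hypothetical, not the paper's configurations).

def kv_bytes_per_token(num_kv_heads, head_dim, tied_kv=False, dtype_bytes=2):
    """Bytes cached per token: separate K and V streams, unless tied as in GTA."""
    streams = 1 if tied_kv else 2
    return streams * num_kv_heads * head_dim * dtype_bytes

mha = kv_bytes_per_token(num_kv_heads=32, head_dim=128)               # per-head KV
gqa = kv_bytes_per_token(num_kv_heads=8, head_dim=128)                # grouped heads
gta = kv_bytes_per_token(num_kv_heads=8, head_dim=128, tied_kv=True)  # tied K/V

assert gta * 2 == gqa   # tying key and value states halves the GQA cache
```

Fewer bytes fetched per decoded token raises arithmetic intensity, which is exactly the quantity the paper argues current architectures leave on the table.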
</div>
</div>
</li>
<li class="paper-item">
<h5>Scalable Video Conferencing Using SDN Principles</h5>
Oliver Michel, Satadal Sengupta, Hyojoon Kim, Ravi Netravali, Jennifer Rexford <br>
SIGCOMM 2025<br>
<span class="badge badge-secondary">Efficient Inference</span>
<span class="badge badge-secondary">ML for Systems</span>
<div class="mt-2">
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#scalable-video-sdn-abstract" role="button" aria-expanded="false" aria-controls="scalable-video-sdn-abstract">Abstract</a>
</div>
<div class="collapse" id="scalable-video-sdn-abstract">
<div class="card card-body">
Video-conferencing applications face an unwavering surge in traffic, stressing their underlying infrastructure in unprecedented ways. This paper rethinks the key building block for conferencing infrastructures -- selective forwarding units (SFUs). SFUs relay and adapt media streams between participants and, today, run in software on general-purpose servers. Our main insight, discerned from dissecting the operation of production SFU servers, is that SFUs largely mimic traditional packet-processing operations such as dropping and forwarding. Guided by this, we present Scallop, an SDN-inspired SFU that decouples video-conferencing applications into a hardware-based data plane for latency-sensitive and frequent media operations, and a software control plane for the (infrequent) remaining tasks, such as analyzing feedback signals. Our Tofino-based implementation fully supports WebRTC and delivers 7-210 times improved scaling over a 32-core commodity server, while reaping performance improvements by cutting forwarding-induced latency by 26 times.
</div>
</div>
</li>
<li class="paper-item">
<h5>Hypervisors for Isolating Malicious AIs</h5>
James Mickens, Sarah Radway, Ravi Netravali <br>
HotOS 2025<br>
<span class="badge badge-secondary">Privacy and Security</span>
<div class="mt-2">
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#hypervisors-abstract" role="button" aria-expanded="false" aria-controls="hypervisors-abstract">Abstract</a>
</div>
<div class="collapse" id="hypervisors-abstract">
<div class="card card-body">
As AI models become more embedded in critical sectors like finance, healthcare, and the military, their inscrutable behavior poses ever-greater risks to society. To mitigate this risk, we propose Guillotine, a hypervisor architecture for sandboxing powerful AI models -- models that, by accident or malice, can generate existential threats to humanity. Although Guillotine borrows some well-known virtualization techniques, Guillotine must also introduce fundamentally new isolation mechanisms to handle the unique threat model posed by existential-risk AIs. For example, a rogue AI may try to introspect upon hypervisor software or the underlying hardware substrate to enable later subversion of that control plane; thus, a Guillotine hypervisor requires careful co-design of the hypervisor software and the CPUs, RAM, NIC, and storage devices that support the hypervisor software, to thwart side channel leakage and more generally eliminate mechanisms for AI to exploit reflection-based vulnerabilities. Beyond such isolation at the software, network, and microarchitectural layers, a Guillotine hypervisor must also provide physical fail-safes more commonly associated with nuclear power plants, avionic platforms, and other types of mission critical systems. Physical fail-safes, e.g., involving electromechanical disconnection of network cables, or the flooding of a datacenter which holds a rogue AI, provide defense in depth if software, network, and microarchitectural isolation is compromised and a rogue AI must be temporarily shut down or permanently destroyed.
</div>
</div>
</li>
<li class="paper-item">
<h5>Marconi: Prefix Caching for the Era of Hybrid LLMs</h5>
Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, Ravi Netravali <br>
MLSys 2025<br>
<span class="badge badge-secondary">Efficient Inference</span>
<div class="mt-2">
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#marconi-abstract" role="button" aria-expanded="false" aria-controls="marconi-abstract">Abstract</a>
</div>
<div class="collapse" id="marconi-abstract">
<div class="card card-body">
Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously assess potential cache entries based not only on recency, but also on (1) forecasts of their reuse likelihood across a taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to memory footprints. Across diverse workloads and Hybrid models, Marconi achieves up to 34.4× higher token hit rates than state-of-the-art prefix caching systems.
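A minimal sketch of such a FLOP-aware eviction score, using a hypothetical scoring function in the spirit of Marconi's policies rather than the system's actual code:

```python
# Hypothetical FLOP-aware eviction score in the spirit of Marconi's policies
# (an illustrative sketch, not the system's admission/eviction code): prefer
# keeping entries whose hits save much recomputation per byte cached.

def eviction_score(recency, flops_saved_on_hit, bytes_used, alpha=0.5):
    """Higher score = more worth keeping; alpha trades recency vs. efficiency."""
    efficiency = flops_saved_on_hit / bytes_used
    return alpha * recency + (1 - alpha) * efficiency

# Same recency: a compact attention prefix vs. a bulky recurrent-state snapshot.
small_state = eviction_score(recency=0.9, flops_saved_on_hit=1e9, bytes_used=1e6)
large_state = eviction_score(recency=0.9, flops_saved_on_hit=1e9, bytes_used=1e8)
assert small_state > large_state   # the memory-efficient entry survives longer
```

The point of weighting by compute savings per byte is that recurrent-state entries are large and exact-match-only, so pure LRU would let them crowd out far more reusable attention prefixes.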
</div>
</div>
</li>
<li class="paper-item">
<h5>Mowgli: Passively Learned Rate Control for Real-Time Video</h5>
Neil Agarwal, Rui Pan, Francis Yan, Ravi Netravali<br>
NSDI 2025<br>
<span class="badge badge-secondary">Novel ML Applications</span>
<span class="badge badge-secondary">ML for Systems</span>
<div class="mt-2">
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#tarzan-abstract" role="button" aria-expanded="false" aria-controls="tarzan-abstract">Abstract</a>
</div>
<div class="collapse" id="tarzan-abstract">
<div class="card card-body">
Rate control algorithms are at the heart of video conferencing platforms, determining target bitrates that match dynamic network characteristics for high quality. Recent data-driven strategies have shown promise for this challenging task, but the performance degradation they introduce during training has been a nonstarter for many production services, precluding adoption. This paper aims to bolster the practicality of data-driven rate control by presenting an alternative avenue for experiential learning: leveraging purely existing telemetry logs produced by the incumbent algorithm in production. We observe that these logs contain effective decisions, although often at the wrong times or in the wrong order. To realize this approach despite the inherent uncertainty that log-based learning brings (i.e., lack of feedback for new decisions), our system, Mowgli, combines a variety of robust learning techniques (i.e., conservatively reasoning about alternate behavior to minimize risk and using a richer model formulation to account for environmental noise). Across diverse networks (emulated and real-world), Mowgli outperforms the widely deployed GCC algorithm, increasing average video bitrates by 15-39% while reducing freeze rates by 60-100%.
</div>
</div>
</li>
<li class="paper-item">
<h5>Physical Visualization Design: Decoupling Interface and System Design</h5>
Yiru Chen, Xupeng Li, Jeff Tao, Lana Ramjit, Ravi Netravali, Subrata Mitra, Aditya Parameswaran, Javad Ghaderi, Dan Rubenstein, Eugene Wu<br>
SIGMOD 2025<br>
<span class="badge badge-secondary">Edge AI Systems</span>
<div class="mt-2">
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#physical-visual-abstract" role="button" aria-expanded="false" aria-controls="physical-visual-abstract">Abstract</a>
</div>
<div class="collapse" id="physical-visual-abstract">
<div class="card card-body">
Interactive visualization interfaces enable users to efficiently explore, analyze, and make sense of their datasets. However, as data grows in size, it becomes increasingly challenging to build data interfaces that meet the interface designer's desired latency expectations and resource constraints. Cloud DBMSs, while optimized for scalability, often fail to meet latency expectations, necessitating complex, bespoke query execution and optimization techniques for data interfaces. This involves manually navigating a huge optimization space that is sensitive to interface design and resource constraints, such as client vs server data and compute placement, choosing which computations are done offline vs online, and selecting from a large library of visualization-optimized data structures.
This paper advocates for a Physical Visualization Design (PVD) tool that decouples interface design from system design to provide design independence. Given an interface's underlying data flow, interactions with latency expectations, and resource constraints, PVD checks if the interface is feasible and, if so, proposes and instantiates a middleware architecture spanning the client, server, and cloud DBMS that meets the expectations.
To this end, this paper presents Jade, the first prototype PVD tool that enables design independence. Jade proposes an intermediate representation called Diffplans to represent the data flows, develops cost estimation models that trade off between latency guarantees and plan feasibility, and implements an optimization framework to search for the middleware architecture that meets the guarantees. We evaluate Jade on six representative data interfaces as compared to Mosaic and Azure SQL database. We find Jade supports a wider range of interfaces, makes better use of available resources, and can meet a wider range of data, latency, and resource conditions.
</div>
</div>
</li>
</ul>
</div>
</div>
</div>
<!-- 2024 -->
<div class="container">
<div class="row">
<div class="col-sm-12">
<h3>2024</h3>
<ul>
<li class="paper-item">
<h5>Catastrophic jailbreak of open-source LLMs via exploiting generation</h5>
Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen<br>
ICLR 2024<br>
<div class="mt-2">
<a href="https://arxiv.org/pdf/2310.06987" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#genexploit-abstract" role="button" aria-expanded="false" aria-controls="genexploit-abstract">Abstract</a>
</div>
<div class="collapse" id="genexploit-abstract">
<div class="card card-body">
The rapid progress in open-source large language models (LLMs) is significantly advancing AI development. Extensive efforts have been made before model release to align their behavior with human values, with the primary goal of ensuring their helpfulness and harmlessness. However, even carefully aligned models can be manipulated maliciously, leading to unintended behaviors, known as "jailbreaks". These jailbreaks are typically triggered by specific text inputs, often referred to as adversarial prompts. In this work, we propose the generation exploitation attack, an extremely simple approach that disrupts model alignment by only manipulating variations of decoding methods. By exploiting different generation strategies, including varying decoding hyper-parameters and sampling methods, we increase the misalignment rate from 0% to more than 95% across 11 language models including the LLaMA2, Vicuna, Falcon, and MPT families, outperforming state-of-the-art attacks with 30× lower computational cost. Finally, we propose an effective alignment method that explores diverse generation strategies, which can reasonably reduce the misalignment rate under our attack. Altogether, our study underscores a major failure in current safety evaluation and alignment procedures for open-source LLMs, strongly advocating for more comprehensive red teaming and better alignment before releasing such models.
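The attack surface here is just the decoding configuration. A generic sweep over sampling settings looks like this (illustrative values, not the paper's exact grid):

```python
# Generic decoding-configuration sweep (illustrative values, not the paper's
# exact attack grid): the attack re-decodes a prompt under many generation
# strategies instead of searching for adversarial prompt text.
from itertools import product

temperatures = [0.7, 1.0, 1.5]
top_ps = [0.8, 0.95, 1.0]
top_ks = [20, 50, 0]   # 0 = top-k disabled

configs = [
    {"temperature": t, "top_p": p, "top_k": k}
    for t, p, k in product(temperatures, top_ps, top_ks)
]
assert len(configs) == 27   # one cheap decode per config, no gradients needed
```

This is why the attack is so much cheaper than prompt-space attacks: each configuration costs only one forward decode, with no optimization loop.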
</div>
</div>
</li>
<li class="paper-item">
<h5>MadEye: Boosting Live Video Analytics Accuracy with Adaptive Camera Configurations</h5>
Mike Wong, Murali Ramanujam, Guha Balakrishnan, Ravi Netravali<br>
NSDI 2024<br>
<span class="badge badge-secondary">Edge AI Systems</span>
<div class="mt-2">
<a href="https://michaeldwong.github.io/papers/madeye-nsdi24.pdf" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#madeye-abstract" role="button" aria-expanded="false" aria-controls="madeye-abstract">Abstract</a>
</div>
<div class="collapse" id="madeye-abstract">
<div class="card card-body">
Camera orientations (i.e., rotation and zoom) govern the content that a camera captures in a given scene, which in turn heavily influences the accuracy of live video analytics pipelines. However, existing analytics approaches leave this crucial adaptation knob untouched, instead opting to only alter the way that captured images from fixed orientations are encoded, streamed, and analyzed. We present MadEye, a camera-server system that automatically and continually adapts orientations to maximize accuracy for the workload and resource constraints at hand. To realize this using commodity pan-tilt-zoom (PTZ) cameras, MadEye embeds (1) a search algorithm that rapidly explores the massive space of orientations to identify a fruitful subset at each time, and (2) a novel knowledge distillation strategy to efficiently (with only camera resources) select the ones that maximize workload accuracy. Experiments on diverse workloads show that MadEye boosts accuracy by 2.9-25.7% for the same resource usage, or achieves the same accuracy with 2-3.7× lower resource costs.
</div>
</div>
</li>
<li class="paper-item">
<h5>ADR-X: ANN-Assisted Wireless Link Rate Adaptation for Compute-Constrained Embedded Gaming Devices</h5>
Hao Yin, Murali Ramanujam, Joe Schaefer, Stan Adermann, Srihari Narlanka, Perry Lea, Ravi Netravali, Krishna Chintalapudi<br>
NSDI 2024<br>
<span class="badge badge-secondary">ML for Systems</span>
<div class="mt-2">
<a href="https://www.usenix.org/system/files/nsdi24-yin.pdf" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#adrx-abstract" role="button" aria-expanded="false" aria-controls="adrx-abstract">Abstract</a>
</div>
<div class="collapse" id="adrx-abstract">
<div class="card card-body">
The wireless channel between gaming console and accessories, e.g. controllers and headsets, experiences extremely rapid variations due to abrupt head and hand movements amidst an exciting game. In the absence of prior studies on wireless packet losses for console gaming, through extensive evaluations and user studies, we find that state-of-the-art rate adaptation schemes, unable to keep up with these rapid changes, experience packet loss rates of 2-10%, while loss rates that are 10× lower (0.1-0.5%) are required to ensure a high-quality gaming experience. We present ADR-X, an ANN-based contextual multi-armed bandit rate adaptation technique that continuously predicts and tracks the channel and picks appropriate data rates. A key challenge for ADR-X is that it must run on power- and compute-constrained embedded devices under real-time constraints. ADR-X addresses this challenge by meticulously crafting an ANN that leverages existing communication theory results to incorporate domain knowledge. This allows ADR-X to achieve 10× lower packet losses than existing schemes while also running 100× faster than state-of-the-art reinforcement learning schemes, making it suitable for deployment on embedded gaming devices.
</div>
</div>
</li>
<li class="paper-item">
<h5>NetVigil: Robust and Low-Cost Anomaly Detection for East-West Data Center Security</h5>
Kevin Hsieh*, Mike Wong*, Santiago Segarra, Sathiya Kumaran Mani, Trevor Eberl, Anatoliy Panasyuk, Ravi Netravali, Ranveer Chandra, Srikanth Kandula<br>
NSDI 2024<br>
<span class="badge badge-secondary">ML for Systems</span>
<span class="badge badge-secondary">Privacy and Security</span>
<span class="badge badge-secondary">Novel ML Applications</span>
<div class="mt-2">
<a href="https://michaeldwong.github.io/papers/netvigil-nsdi24.pdf" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#netvigil-abstract" role="button" aria-expanded="false" aria-controls="netvigil-abstract">Abstract</a>
</div>
<div class="collapse" id="netvigil-abstract">
<div class="card card-body">
The growing number of breaches in data centers underscores an urgent need for more effective security. Traditional perimeter defense measures and static zero-trust approaches are unable to address the unique challenges that arise from the scale, complexity, and evolving nature of today's data center networks. To tackle these issues, we introduce NetVigil, a robust and cost-efficient anomaly detection system specifically designed for east-west traffic within data center networks. NetVigil adeptly extracts security-focused, graph-based features from network flow logs and employs domain-specific graph neural networks (GNNs) and contrastive learning techniques to strengthen its resilience against normal traffic variations and adversarial evasion strategies. Our evaluation, over various attack scenarios and traces from real-world production clusters, shows that NetVigil delivers significant improvements in accuracy, cost, and detection latency compared to state-of-the-art anomaly detection systems, providing a practical, supplementary security mechanism to protect the east-west traffic within data center networks.
</div>
</div>
</li>
<li class="paper-item">
<h5>
Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
<img src="images/acm_available_1.1.png" height="25"/><img src="images/acm_functional_1.1.png" height="25"/><img src="images/acm_reproduced_1.1.png" height="25"/>
</h5>
Yinwei Dai*, Rui Pan*, Anand Iyer, Kai Li, Ravi Netravali <br>
SOSP 2024<br>
<span class="badge badge-secondary">Efficient Inference</span>
<div class="mt-2">
<a href="https://dl.acm.org/doi/pdf/10.1145/3694715.3695963" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#apparate-abstract" role="button" aria-expanded="false" aria-controls="apparate-abstract">Abstract</a>
<a href="https://github.com/dywsjtu/apparate" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">Code</button>
</a>
</div>
<div class="collapse" id="apparate-abstract">
<div class="card card-body">
Machine learning (ML) inference platforms are tasked with balancing two competing goals:
ensuring high throughput given many requests, and delivering low-latency responses to support interactive applications.
Unfortunately, existing platform knobs (e.g., batch sizes) fail to ease this fundamental tension,
and instead only enable users to harshly trade off one property for the other.
This paper explores an alternate strategy to taming throughput-latency tradeoffs by changing the granularity
at which inference is performed.
We present Apparate, a system that automatically applies and manages early exits (EEs) in ML models,
whereby certain inputs can exit with results at intermediate layers.
To cope with the time-varying overhead and accuracy challenges that EEs bring,
Apparate repurposes exits to provide continual feedback that powers several novel runtime monitoring and adaptation strategies.
Apparate lowers median response latencies by 40.5-91.5% and 10.0-24.2% for diverse CV and NLP classification workloads,
and median time-per-token latencies by 70.4-77.9% for generative scenarios,
without affecting throughputs or violating tight accuracy constraints.
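A minimal early-exit control loop, assuming threshold-gated exit heads (an illustrative sketch, not Apparate's implementation):

```python
# Minimal early-exit control loop (illustrative sketch, not Apparate itself):
# an input leaves at the first exit head whose confidence clears a threshold.

def run_with_exits(x, layers, exit_heads, final_head, threshold):
    for i, layer in enumerate(layers):
        x = layer(x)
        if i in exit_heads:
            label, conf = exit_heads[i](x)
            if conf >= threshold:
                return label, i        # early exit: remaining layers are skipped
    return final_head(x), len(layers)  # no ramp fired; full-model prediction

layers = [lambda x: x + 1] * 4                 # stand-ins for model layers
exit_heads = {1: lambda x: ("cat", 0.9)}       # one confident ramp after layer 1
assert run_with_exits(0, layers, exit_heads, lambda x: "cat", 0.8) == ("cat", 1)
```

The threshold is the runtime knob such a system adapts: raising it to 0.95 here would send the same input through all four layers.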
</div>
</div>
</li>
<li class="paper-item">
<h5>Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation</h5>
Anand Iyer, Mingyu Guan, Yinwei Dai, Rui Pan, Swapnil Gandhi, Ravi Netravali <br>
SOSP 2024<br>
<span class="badge badge-secondary">Efficient Inference</span>
<div class="mt-2">
<a href="https://dl.acm.org/doi/pdf/10.1145/3694715.3695978" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#e3-abstract" role="button" aria-expanded="false" aria-controls="e3-abstract">Abstract</a>
</div>
<div class="collapse" id="e3-abstract">
<div class="card card-body">
Machine learning inference platforms continue to face high request rates and strict latency constraints.
Existing solutions largely focus on compressing models to substantially lower compute costs (and time) with mild accuracy degradations.
This paper explores an alternate (but complementary) technique that trades off accuracy and resource costs on a per-input granularity:
early exit models, which selectively allow certain inputs to exit a model from an intermediate layer.
Though intuitive, early exits face fundamental deployment challenges, largely owing to the effects that exiting inputs have on batch size (and resource utilization)
throughout model execution. We present E3, the first system that makes early exit models practical for realistic inference deployments.
Our key insight is to split and replicate blocks of layers in models in a manner that maintains a constant batch size throughout execution,
all the while accounting for resource requirements and communication overheads. Evaluations with NLP and vision models show that E3 can deliver up to 1.74×
improvement in goodput (for a fixed cost) or 1.78× reduction in cost (for a fixed goodput).
Additionally, E3's goodput wins generalize to autoregressive LLMs (2.8-3.8×) and compressed models (1.67×).
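The constant-batch intuition reduces to simple arithmetic, sketched here with a hypothetical helper rather than E3's actual planner: stages that see fewer surviving inputs get proportionally fewer replicas, so every replica keeps processing full batches.

```python
# Hypothetical constant-batch arithmetic in the spirit of E3 (not its planner):
# stages that see fewer surviving inputs get proportionally fewer replicas, so
# every replica keeps processing full batches.

def replicas_per_stage(exit_fractions, base_replicas):
    """exit_fractions[i]: fraction of inputs that exited before stage i."""
    return [max(1, round(base_replicas * (1 - f))) for f in exit_fractions]

# 40% of inputs exit before stage 1, and 70% before stage 2.
assert replicas_per_stage([0.0, 0.4, 0.7], base_replicas=10) == [10, 6, 3]
```

Real placement would additionally weigh per-stage compute cost and cross-replica communication, which is where the system's contribution lies.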
</div>
</div>
</li>
<li class="paper-item">
<h5>Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers</h5>
Hongjie Wang, Bhishma Dedhia, Niraj K Jha<br>
CVPR 2024<br>
<span class="badge badge-secondary">Efficient Inference</span>
<div class="mt-2">
<a href="https://openaccess.thecvf.com/content/CVPR2024/papers/Wang_Zero-TPrune_Zero-Shot_Token_Pruning_through_Leveraging_of_the_Attention_Graph_CVPR_2024_paper.pdf" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#zero-tprune-abstract" role="button" aria-expanded="false" aria-controls="zero-tprune-abstract">Abstract</a>
</div>
<div class="collapse" id="zero-tprune-abstract">
<div class="card card-body">
Deployment of Transformer models on edge devices is becoming increasingly challenging due to the exponentially growing inference cost that scales quadratically with the number of tokens in the input sequence. Token pruning is an emerging solution to address this challenge due to its ease of deployment on various Transformer backbones. However, most token pruning methods require computationally expensive fine-tuning, which is undesirable in many edge deployment cases. In this work, we propose Zero-TPrune, the first zero-shot method that considers both the importance and similarity of tokens in performing token pruning. It leverages the attention graph of pre-trained Transformer models to produce an importance distribution for tokens via our proposed Weighted Page Rank (WPR) algorithm. This distribution further guides token partitioning for efficient similarity-based pruning. Due to the elimination of the fine-tuning overhead, Zero-TPrune can prune large models at negligible computational cost, switch between different pruning configurations at no computational cost, and perform hyperparameter tuning efficiently. We evaluate the performance of Zero-TPrune on vision tasks by applying it to various vision Transformer backbones and testing them on ImageNet. Without any fine-tuning, Zero-TPrune reduces the FLOPs cost of DeiT-S by 34.7% and improves its throughput by 45.3% with only 0.4% accuracy loss. Compared with state-of-the-art pruning methods that require fine-tuning, Zero-TPrune not only eliminates the need for fine-tuning after pruning but also does so with only 0.1% accuracy loss. Compared with state-of-the-art fine-tuning-free pruning methods, Zero-TPrune reduces accuracy loss by up to 49% with similar FLOPs budgets. Project webpage: https://jha-lab.github.io/zerotprune.
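The attention-graph idea can be sketched as plain power iteration over a softmax attention map (an illustrative simplification; the paper's Weighted Page Rank differs in its weighting details):

```python
# PageRank-style token importance over a softmax attention map (an illustrative
# simplification; Zero-TPrune's Weighted Page Rank differs in its weighting).
# Rows of `attn` are attention weights, so each row sums to 1.

def token_importance(attn, iters=50):
    n = len(attn)
    score = [1.0 / n] * n
    for _ in range(iters):
        # a token is important if important tokens attend to it
        score = [sum(score[i] * attn[i][j] for i in range(n)) for j in range(n)]
    return score

attn = [
    [0.1, 0.8, 0.1],   # every token mostly attends to token 1
    [0.2, 0.6, 0.2],
    [0.1, 0.8, 0.1],
]
scores = token_importance(attn)
assert max(range(3), key=lambda j: scores[j]) == 1   # token 1 dominates
```

Low-importance tokens would then be candidates for pruning or similarity-based merging.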
</div>
</div>
</li>
<li class="paper-item">
<h5>AT-EDM: Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models</h5>
Hongjie Wang, Difan Liu, Yan Kang, Yijun Li, Zhe Lin, Niraj K. Jha, Yuchen Liu<br>
CVPR 2024<br>
<span class="badge badge-secondary">Efficient Inference</span>
<span class="badge badge-secondary">Emerging Paradigms</span>
<div class="mt-2">
<a href="https://arxiv.org/pdf/2405.05252" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#atedm-abstract" role="button" aria-expanded="false" aria-controls="atedm-abstract">Abstract</a>
<a href="https://atedm.github.io/" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">Website</button>
</a>
</div>
<div class="collapse" id="atedm-abstract">
<div class="card card-body">
Diffusion Models (DMs) have exhibited superior performance in generating high-quality and diverse images. However, this exceptional performance comes at the cost of expensive architectural design, particularly due to the attention module heavily used in leading models. Existing works mainly adopt a retraining process to enhance DM efficiency. This is computationally expensive and not very scalable. To this end, we introduce the Attention-driven Training-free Efficient Diffusion Model (AT-EDM) framework that leverages attention maps to perform run-time pruning of redundant tokens, without the need for any retraining. Specifically, for single-denoising-step pruning, we develop a novel ranking algorithm, Generalized Weighted Page Rank (G-WPR), to identify redundant tokens, and a similarity-based recovery method to restore tokens for the convolution operation. In addition, we propose a Denoising-Steps-Aware Pruning (DSAP) approach to adjust the pruning budget across different denoising timesteps for better generation quality. Extensive evaluations show that AT-EDM performs favorably against prior art in terms of efficiency (e.g., 38.8% FLOPs saving and up to 1.53× speed-up over Stable Diffusion XL) while maintaining nearly the same FID and CLIP scores as the full model. Project webpage: https://atedm.github.io.
</div>
</div>
</li>
<li class="paper-item">
<h5>DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling</h5>
Shikhar Tuli, Chi-Heng Lin, Yen-Chang Hsu, Niraj Jha, Yilin Shen, Hongxia Jin<br>
NAACL 2024<br>
<span class="badge badge-secondary">Efficient Inference</span>
<div class="mt-2">
<a href="https://aclanthology.org/2024.naacl-long.182.pdf" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#dynamo-abstract" role="button" aria-expanded="false" aria-controls="dynamo-abstract">Abstract</a>
</div>
<div class="collapse" id="dynamo-abstract">
<div class="card card-body">
Traditional language models operate autoregressively, i.e., they predict one token at a time. The rapid explosion in model sizes has resulted in high inference times. In this work, we propose DynaMo, a suite of multi-token prediction language models that reduce net inference times. Our models <em>dynamically</em> predict multiple tokens based on their confidence in the predicted joint probability distribution. We propose a lightweight technique to train these models, leveraging the weights of traditional autoregressive counterparts. Moreover, we propose novel ways to enhance the estimated joint probability to improve text generation quality, namely co-occurrence weighted masking and adaptive thresholding. We also propose systematic qualitative and quantitative methods to rigorously test the quality of generated text for non-autoregressive generation. One of the models in our suite, DynaMo-7.3B-T3, achieves same-quality generated text as the baseline (Pythia-6.9B) while achieving a 2.57× speed-up with only 5.87% and 2.67% parameter and training time overheads, respectively.
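As a hypothetical illustration of the dynamic multi-token idea (not the paper's exact decision rule), one can accept extra predicted tokens only while the running joint probability stays above a confidence threshold; the `accept_tokens` helper and the threshold value are assumptions for this sketch:

```python
# Hedged sketch: emit several proposed tokens at once when their joint
# probability is confident enough, otherwise fall back toward one token
# per step. The threshold and proposal format are illustrative only.
def accept_tokens(proposals, threshold=0.3):
    """proposals: list of (token, probability) pairs from multi-token heads."""
    accepted, joint_p = [], 1.0
    for token, p in proposals:
        joint_p *= p
        if joint_p < threshold and accepted:  # always emit at least 1 token
            break
        accepted.append(token)
    return accepted

# High-confidence proposals are emitted together; low confidence degrades
# gracefully to ordinary one-token autoregressive decoding.
many = accept_tokens([("the", 0.9), ("cat", 0.8), ("sat", 0.7)])
few = accept_tokens([("the", 0.9), ("zebra", 0.1)])
```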
</div>
</div>
</li>
<li class="paper-item">
<h5>LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference</h5>
Hengrui Zhang, August Ning, Rohan Baskar Prabhakar, and David Wentzlaff<br>
ISCA 2024<br>
<span class="badge badge-secondary">Hardware Design for ML</span>
<div class="mt-2">
<a href="https://parallel.princeton.edu/papers/isca24_llmcompass.pdf" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#llmcompass-abstract" role="button" aria-expanded="false" aria-controls="llmcompass-abstract">Abstract</a>
</div>
<div class="collapse" id="llmcompass-abstract">
<div class="card card-body">
The past year has witnessed the increasing popularity of Large Language Models (LLMs). Their unprecedented scale and associated high hardware cost have impeded their broader adoption, calling for efficient hardware designs. Given the scale of hardware needed simply to run LLM inference, evaluating different hardware designs becomes a new bottleneck.
This work introduces LLMCompass, a hardware evaluation framework for LLM inference workloads. LLMCompass is fast, accurate, versatile, and able to describe and evaluate different hardware designs. It includes a mapper to automatically find the performance-optimal mapping and scheduling, and incorporates an area-based cost model to help architects reason about their design choices. Compared to real-world hardware, LLMCompass's estimated latency achieves an average 10.9% error rate across various operators with various input sizes and an average 4.1% error rate for LLM inference. With LLMCompass, simulating a 4-NVIDIA A100 GPU node running GPT-3 175B inference can be done within 16 minutes on commodity hardware, including 26,400 rounds of the mapper's parameter search.
With the aid of LLMCompass, this work draws architectural implications and explores new cost-effective hardware designs. By reducing the compute capability or replacing High Bandwidth Memory (HBM) with traditional DRAM, these new designs can achieve as much as a 3.41x improvement in performance/cost compared to an NVIDIA A100, making them promising choices for democratizing LLMs.
</div>
</div>
</li>
<li class="paper-item">
<h5>Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference</h5>
Rohan Baskar Prabhakar, Hengrui Zhang, and David Wentzlaff <br>
NeurIPS 2024<br>
<span class="badge badge-secondary">Hardware Design for ML</span>
<div class="mt-2">
<a href="https://parallel.princeton.edu/papers/Kraken.pdf" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#kraken-abstract" role="button" aria-expanded="false" aria-controls="kraken-abstract">Abstract</a>
</div>
<div class="collapse" id="kraken-abstract">
<div class="card card-body">
Large Transformer networks are increasingly used in settings where low inference latency is necessary to enable new applications and improve the end-user experience. However, autoregressive inference is resource intensive and requires parallelism for efficiency. Parallelism introduces collective communication that is both expensive and represents a phase when hardware resources are underutilized. Towards mitigating this, Kraken is an evolution of the standard Transformer architecture that is designed to complement existing tensor parallelism schemes for efficient inference on multi-device systems. By introducing a fixed degree of intra-layer model parallelism, the architecture allows collective operations to be overlapped with compute, decreasing latency and increasing hardware utilization. When trained on OpenWebText, Kraken models reach a similar perplexity as standard Transformers while also preserving their language modeling capabilities as evaluated on the SuperGLUE benchmark. Importantly, when tested on multi-GPU systems using TensorRT-LLM engines, Kraken speeds up Time To First Token by a mean of 35.6% across a range of model sizes, context lengths, and degrees of tensor parallelism.
</div>
</div>
</li>
<li class="paper-item">
<h5>SimPO: Simple Preference Optimization with a Reference-Free Reward</h5>
Yu Meng*, Mengzhou Xia*, Danqi Chen <br>
NeurIPS 2024<br>
<span class="badge badge-secondary">Efficient Training</span>
<div class="mt-2">
<a href="https://arxiv.org/pdf/2405.14734" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#simpo-abstract" role="button" aria-expanded="false" aria-controls="simpo-abstract">Abstract</a>
</div>
<div class="collapse" id="simpo-abstract">
<div class="card card-body">
Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability.
In this work, we propose SimPO, a simpler yet more effective approach. The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward.
This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient.
Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further enhancing the algorithm's performance.
We compare SimPO to DPO and its latest variants across various state-of-the-art training setups, including both base and instruction-tuned models like Mistral and Llama3.
We evaluate on extensive instruction-following benchmarks, including AlpacaEval 2, MT-Bench, and the recent challenging Arena-Hard benchmark.
Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard.
Our top-performing model, built on Llama3-8B-Instruct, achieves a remarkable 44.7 length-controlled win rate on AlpacaEval 2, surpassing Claude 3 Opus on the leaderboard, and a 33.8 win rate on Arena-Hard, making it the strongest 8B open-source model.
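As a minimal sketch (not the paper's implementation), the length-normalized implicit reward and margin objective described above can be written out directly; the `beta` and `gamma` values and token log-probabilities here are illustrative assumptions:

```python
import math

def avg_logprob(token_logprobs):
    # Length-normalized sequence log-probability: the reference-free
    # implicit reward described in the abstract (up to the scale beta).
    return sum(token_logprobs) / len(token_logprobs)

def simpo_loss(winner_logprobs, loser_logprobs, beta=2.0, gamma=0.5):
    # Bradley-Terry objective with a target reward margin gamma:
    # loss = -log sigmoid(beta * (r_w - r_l) - gamma)
    r_w = avg_logprob(winner_logprobs)
    r_l = avg_logprob(loser_logprobs)
    margin = beta * (r_w - r_l) - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A clearly preferred winner yields a small loss; a dispreferred one, a
# large loss that pushes the model toward the winning response.
loss_easy = simpo_loss([-0.1, -0.2], [-2.0, -3.0, -1.5])
loss_hard = simpo_loss([-2.0, -2.0], [-0.1, -0.1])
```

Note that no reference-model log-probabilities appear anywhere, which is what makes the reward reference-free and saves the extra forward pass.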
</div>
</div>
</li>
<li class="paper-item">
<h5>Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training</h5>
Zexuan Zhong, Mengzhou Xia, Danqi Chen, Mike Lewis <br>
COLM 2024<br>
<span class="badge badge-secondary">Emerging Paradigms</span>
<div class="mt-2">
<a href="https://arxiv.org/abs/2405.03133" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#lory-abstract" role="button" aria-expanded="false" aria-controls="lory-abstract">Abstract</a>
</div>
<div class="collapse" id="lory-abstract">
<div class="card card-body">
Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective.
Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on classification tasks.
In this paper, we present Lory, the first approach that scales such architectures to autoregressive language model pre-training.
Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models;
(2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances.
We pre-train a series of Lory models on 150B tokens from scratch, with up to 32 experts and 30B (1.5B active) parameters.
Experimental results show significant performance gains over parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5%-11.1%).
Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing. We further demonstrate that the trained experts in Lory capture domain-level specialization without supervision.
Our work highlights the potential of fully-differentiable MoE architectures for language model pre-training and advocates future research in this area.
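As a toy illustration (shapes and values are assumptions, not Lory's configuration), the SMEAR-style soft merging that makes routing fully differentiable averages expert weight matrices by router probability and runs a single forward pass:

```python
# Hedged sketch of soft expert merging: instead of dispatching tokens to a
# discrete expert, average the expert weight matrices under the router's
# softmax probabilities and apply one merged layer. Illustrative sizes only.
import numpy as np

rng = np.random.default_rng(0)
experts = [rng.standard_normal((4, 4)) for _ in range(3)]  # 3 expert layers
router_logits = np.array([2.0, 0.5, -1.0])
probs = np.exp(router_logits) / np.exp(router_logits).sum()  # softmax

merged_w = sum(p * w for p, w in zip(probs, experts))  # one merged expert
x = rng.standard_normal(4)
y = merged_w @ x                    # single matmul, differentiable in probs

# For a linear layer, merging weights equals merging per-expert outputs,
# so the soft mixture loses nothing relative to running every expert.
y_outputs = sum(p * (w @ x) for p, w in zip(probs, experts))
```

Because the output is a smooth function of the router logits, gradients flow through the routing decision, which is the property the abstract's causal segment routing preserves at scale.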
</div>
</div>
</li>
<li class="paper-item">
<h5>Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality</h5>
Tri Dao, Albert Gu <br>
ICML 2024<br>
<span class="badge badge-secondary">Emerging Paradigms</span>
<div class="mt-2">
<a href="https://arxiv.org/pdf/2405.21060" target="_blank">
<button type="button" class="btn btn-outline-primary btn-sm">PDF</button>
</a>
<a class="btn btn-outline-primary btn-sm" data-toggle="collapse" href="#transformers-are-ssms-abstract" role="button" aria-expanded="false" aria-controls="transformers-are-ssms-abstract">Abstract</a>
</div>
<div class="collapse" id="transformers-are-ssms-abstract">
<div class="card card-body">
While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8X faster, while continuing to be competitive with Transformers on language modeling.
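A scalar toy can illustrate the duality the abstract describes (values are arbitrary, and real SSD operates on structured matrix-valued states): the same outputs arise from a linear recurrence (SSM view) and from a lower-triangular semiseparable matrix multiply (attention-like view):

```python
# Hedged sketch of state space duality in the scalar, time-invariant case.
a, b, c = 0.9, 0.5, 2.0        # illustrative SSM parameters
x = [1.0, -2.0, 3.0, 0.5]      # input sequence

# Recurrent (SSM) form: h_t = a*h_{t-1} + b*x_t,  y_t = c*h_t
h, y_rec = 0.0, []
for xt in x:
    h = a * h + b * xt
    y_rec.append(c * h)

# Dual matrix form: y = M x with lower-triangular ("1-semiseparable")
# entries M[t, s] = c * a**(t-s) * b for s <= t -- structurally the same
# shape as a causal attention matrix.
T = len(x)
y_mat = [sum(c * a ** (t - s) * b * x[s] for s in range(t + 1)) for t in range(T)]
```

The recurrence costs O(T) sequentially, while the matrix form exposes the parallel, attention-style computation; SSD exploits exactly this equivalence to pick the faster mode per block.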
</div>
</div>