\documentclass[12pt, letterpaper]{article}
\usepackage[margin=2cm]{geometry}
\usepackage{fourier}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{placeins}
\usepackage{bookmark}
\usepackage{hyperref}
\usepackage{booktabs}
\usepackage{textcomp}
\usepackage{graphicx}
\usepackage{wrapfig}
\usepackage{parskip}
\usepackage{epstopdf}
\usepackage{pgf}
\usepackage{pgfplots}
\usepackage[sorting=none,style=ieee]{biblatex}
\DeclareUnicodeCharacter{2212}{-}
\DeclareUnicodeCharacter{2009}{\,}
\setcounter{secnumdepth}{-\maxdimen}
\title{cuTimeWarp: Accelerating Soft Dynamic Time Warping on GPU}
\author{Alex Kyllo \and Afrooz Rahmati}
\addbibresource{cuTimeWarp.bib}
\begin{document}
\maketitle
\begin{abstract}
In this report we explore techniques for optimizing the parallel computation of
Soft Dynamic Time Warping, a differentiable sequence dissimilarity measure, on
graphics processing units (GPU), for the purpose of enabling high-performance
machine learning on time series datasets.
\end{abstract}
\section{Introduction}
Time series machine learning is a research area with countless useful
applications such as recognizing sounds and gestures. Clustering or classifying
large time series datasets is challenging partly because of the need to define a
measure of dissimilarity between any two given time series, which is not as
straightforward as computing distance between independent points in dimensional
space. This is because practical applications require finding common structure
in time series despite different speeds, phases or lengths; for example, a word
means the same whether spoken quickly or slowly, loudly or quietly, or in a high
vs. low voice, but such variations tend to produce large discrepancies if
traditional distance metrics are used. Another requirement for machine learning
tasks is that the dissimilarity measure must be differentiable, so that its
gradient can be used to minimize it as a loss function and find the best-fit
model. Finally, the measure must be efficient to calculate, because it will be
computed many times during the model fitting process. To this end
we will explore the GPU acceleration of Soft Dynamic Time Warping (Soft-DTW)
\cite{cuturi_soft-dtw_2018}, a differentiable sequence dissimilarity measure, to
enable high performance time series machine learning.
\subsection{Background}
\subsubsection{Time series data}
Time series data generally refers to data containing quantitative measurements
collected from the same subject or process at multiple points in time; it
exhibits several peculiarities in comparison to time-independent data. Time
series data can exhibit autocorrelation, meaning that the value at any point in
time is highly correlated to the value at a previous point in time, with some
delay or \emph{lag} in between. Time series data can also exhibit
\emph{seasonality} or cyclical patterns as well as \emph{trend} which is a
tendency for the average value (over some rolling time window) to increase or
decrease as a function of time. Due to these characteristics, time series data
does not conform to the I.I.D. (independent and identically distributed)
assumption that typically applies in the study of random variables, and
therefore it requires special techniques for analysis and modeling.
Time series data can be either univariate or multivariate. For example, a set of
electrocardiogram (ECG) measurements has a single variable (heart voltage)
collected over time, but if the dataset also included the patient's blood
pressure and oxygen levels measured at each time point, that would constitute a
multivariate time series where correlations among the different variables could
be studied in addition to the autocorrelation within each of the variables.
\subsubsection{Time series distance and dissimilarity measures}
A fundamental capability that enables learning from data is the ability to
quantify a metric of distance or dissimilarity between observations, because
this allows comparison among data observations as well as quantification of
error between observed data and the predictions of an estimator model. For
time-independent data, multiple valid notions of distance exist; the most
commonly used is Euclidean distance, but others such as Manhattan distance,
cosine distance, or Mahalanobis distance are often used, depending on which is
best suited to the problem domain.
Likewise, there are multiple valid ways to compute a measure of dissimilarity
between two sequences or time series; for real-valued time series measurements
the simplest is also Euclidean distance, which is the square root of the sum of
squared pairwise differences between two time series $x$ and $y$, each of length
$n$ (equation \ref{euclid}).
\begin{equation} \label{euclid}
d(x, y) = \sqrt{\sum_{t=1}^{n}(x_t-y_t)^2}
\end{equation}
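For concreteness, equation \ref{euclid} corresponds to the following short C++
sketch (the function name is illustrative, not part of our library):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean distance between two equal-length time series x and y:
// the square root of the sum of squared pairwise differences.
double euclidean_distance(const std::vector<double> &x,
                          const std::vector<double> &y)
{
    double sum = 0.0;
    for (std::size_t t = 0; t < x.size(); ++t)
    {
        double diff = x[t] - y[t];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}
```

Note that this definition is only valid for series of equal length $n$.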
\begin{wrapfigure}[12]{R}{0.3\textwidth}
\centering
\includegraphics[height=0.25\textwidth]{img/peaks.png}
\caption{Euclidean vs. DTW on out of phase time series \cite{rakthanmanon_addressing_2013}}
\label{peaks}
\end{wrapfigure}
However, a significant drawback of Euclidean distance in time series
applications is that two structurally similar time series will produce a large
distance if they have different lengths, speeds or phases. In time series
applications it is often desirable to produce a low dissimilarity for
structurally similar time series despite these variations, so an alternative
method of quantifying dissimilarity is needed. This is where Dynamic Time
Warping (DTW) comes in, providing an optimal nonlinear pairwise matching path
between the elements of two time series (Figure \ref{peaks}).
\subsubsection{Dynamic Time Warping}
Dynamic Time Warping (DTW) was devised in the 1960s as an alternative time
series dissimilarity measure to address this shortcoming, and was applied to
spoken word recognition tasks by Sakoe and Chiba in the 1970s
\cite{sakoe_dynamic_1978}. Whereas Euclidean distance is a linear one-to-one
mapping, DTW is a nonlinear mapping from each point in one time series to the
nearest point in a second time series; by allowing this nonlinearity, the
algorithm is able to minimize the computed distance between time series that are
similar in shape but misaligned in the time domain. While DTW is technically
not considered a ``distance'' because it does not conform to the triangle
inequality \cite{jain_semi-metrification_2018}, and therefore we refer to it as
a ``dissimilarity'' instead, it can still be used in place of Euclidean distance
or other distance measures for many practical applications of time series data.
The mapping problem can be visualized as a matrix of pairwise distances between
every element of time series $x$ and every element of time series $y$; each row
represents an element of one sequence while each column represents an element of
the other sequence. Under this representation, the Euclidean distance measure
translates to the shortest diagonal path across the matrix from one corner to
the other, which is always the leading anti-diagonal in the case of two equal
length time series, and is ill-defined for unequal length time series. In
contrast, DTW allows the path to ``wander'' from the diagonal in search of the
lowest cost path (Figure \ref{euclidean_matrix}), subject to the constraints
that the path must begin at one corner and end at the opposite corner, and make
monotonic progress in between.
\begin{figure}[htbp]
\includegraphics[height=2in]{img/euclidean_dtw_matrix.png}
\centering
\caption{Euclidean Distance vs. DTW represented as a shortest
vs. lowest-cost diagonal path across a pairwise distance matrix}
\label{euclidean_matrix}
\end{figure}
The basic algorithm for DTW is to use Bellman's recursion, a dynamic programming
technique, to find the lowest-cost path diagonally across a pairwise distance
matrix; at each step, the cost is updated to the current cell's cost plus the
minimum of the three previous adjacent cells' costs. The computation cost for
this approach is quadratic ($O(mn)$) for time series of lengths $m$ and $n$
\cite{cuturi_soft-dtw_2018}. The formula for the DTW between time series $x$ and $y$
is given by equation \ref{dtw}, which expresses a minimization of the root sum
of squared differences between each pairing of $x_{i}$ and $y_{j}$.
\begin{equation} \label{dtw}
DTW(x,y) = \min_{\pi}\sqrt{\sum_{(i,j)\in\pi}d(x_{i},y_{j})^2}
\end{equation}
The loss function for DTW is not differentiable due to the presence of the $\min$
operation within the formula; a small change in the input time series may result
in zero change in the path cost, resulting in flat regions in the gradient
topology. However, Cuturi and Blondel found that we can remedy this problem and
create a differentiable version called Soft-DTW, by replacing the min function
with a soft-min function (equation \ref{softdtw}) \cite{cuturi_soft-dtw_2018}.
\begin{equation} \label{softdtw}
\text{soft-min}_\gamma(a_1,\dots,a_n) = -\gamma \log\sum_{i}e^{-a_i/\gamma}
\end{equation}
Hence, Soft-DTW is parameterized by the smoothing constant gamma ($\gamma$), which
must be greater than or equal to zero, and produces regular DTW when $\gamma = 0$ and
greater smoothing as $\gamma$ increases. This parameter $\gamma$ becomes a tunable
hyperparameter in machine learning model training applications.
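Numerical care is needed when implementing equation \ref{softdtw}: factoring
the hard minimum out of the log-sum-exp prevents underflow for small
$\gamma$. A C++ sketch for three arguments (the name \verb|softmin3| is
illustrative, not our library's API):

```cpp
#include <algorithm>
#include <cmath>

// Soft minimum of three values with smoothing parameter gamma.
// Subtracting the hard minimum m before exponentiating keeps the
// exponentials in [0, 1], avoiding underflow when gamma is small.
double softmin3(double a, double b, double c, double gamma)
{
    double m = std::min(a, std::min(b, c));
    double sum = std::exp(-(a - m) / gamma) +
                 std::exp(-(b - m) / gamma) +
                 std::exp(-(c - m) / gamma);
    return m - gamma * std::log(sum);
}
```

As $\gamma$ approaches zero the result approaches the hard minimum; larger
$\gamma$ produces greater smoothing.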
\subsubsection{Applications}
DTW is a widely used technique employed in many areas of study, including signal
processing, biology, economics, and finance. DTW can be used to discover
patterns in time series or search within signal databases to find the closest
matches for a given pattern \cite{keogh_derivative_2001}; for example, searching
an audio track for all instances of a given spoken word. It is also utilized in
machine learning applications like clustering, regression, and classification on
time series data. Practical applications of DTW include tasks such as:
\begin{itemize}
\item Voice command recognition
\item Sound detection and classification
\item Handwriting and gesture recognition
\item Human activity recognition
\item Heart rhythm anomaly detection
\item Sales or asset price pattern detection
\end{itemize}
As a differentiable time series dissimilarity measure, DTW can be applied as a
loss function or minimization objective in data modeling techniques such as:
\begin{itemize}
\item \emph{k}-means Clustering
\item Hierarchical agglomerative clustering
\item Motif (similar subsequence) search
\item k-nearest neighbor or nearest centroid classification
\item Recurrent neural networks
\end{itemize}
A common technique in machine learning with Soft-DTW is the computation of
barycenters, which are centroids within the space of a set of time series. The
differentiability of Soft-DTW allows for barycenter finding via gradient descent
or semi-Newtonian function minimization methods. First, a candidate time series
is initialized with random points, and then, iteratively, its Soft-DTW
dissimilarity to a set of multiple time series representing a target class is
calculated, then the gradient of the Soft-DTW dissimilarity is taken with
respect to the input values, and the inputs are updated by a small step size in
the direction of the gradient, until convergence. After the best-fit barycenters
are found, new observations can be classified by finding the nearest barycenter.
Sequence prediction and generation is also possible using recurrent neural
networks with Soft-DTW as a loss function \cite{cuturi_soft-dtw_2018}.
Prior to computing the Soft-DTW dissimilarity between any two time series, each
time series should be \emph{z-normalized}, that is, scaled so that its mean is
equal to 0 and its standard deviation is equal to 1, to remove the problem of
``wandering baselines'' or ``drift'' in the measurements, as illustrated in
\cite{rakthanmanon_addressing_2013} with an ECG classifier that yields incorrect
results on un-normalized data due to drift that ``may be caused by patient
movement, dirty lead wires/electrodes, loose electrodes, and so on,'' and which
has no medical significance. In the case of speech, Z-normalization would
prevent loud (high amplitude) voices from producing a large discrepancy with
quiet (low amplitude) voices. Z-normalization also tends to make the iterative
process of minimizing the cost function through gradient descent more efficient
because it prevents parameters with larger magnitude from dominating the
direction of the weight update step \cite{jordan_normalizing_2018}. For this
reason, all input data used in our performance experiments is z-normalized in
advance.
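As a minimal sketch, z-normalization can be implemented as follows
(illustrative code, not our library's API; it assumes a nonconstant series so
the standard deviation is nonzero):

```cpp
#include <cmath>
#include <vector>

// Z-normalize a time series in place: subtract the mean and divide by the
// (population) standard deviation, so the result has mean 0 and deviation 1.
void znorm(std::vector<double> &x)
{
    double mean = 0.0;
    for (double v : x)
        mean += v;
    mean /= x.size();
    double var = 0.0;
    for (double v : x)
        var += (v - mean) * (v - mean);
    double sd = std::sqrt(var / x.size());
    for (double &v : x)
        v = (v - mean) / sd;
}
```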
\subsubsection{Parallelizing DTW}
A naive, sequential implementation of DTW would involve a nested loop to iterate
over each row and column of the cost matrix to update each cell's cost based on
the three neighboring cells' costs, hence the $O(n^2)$ time complexity. Because
each cell has a data dependency on the three cells to the top, left, and
top-left, there is no dependency between cells that are on the same antidiagonal
of the matrix, therefore computation of these cells can be handled by parallel
threads. One thread computes the upper-leftmost cell, then two threads compute
the next antidiagonal, then three threads compute the next, and so on. Because
the length of the longest antidiagonal of the matrix is $\min(m,n)$, that is
also the maximum number of threads that can be utilized in computing parallel
DTW, and a subset of these threads will be inactive when computing
antidiagonals that are shorter than the longest one.
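The naive sequential algorithm described above can be sketched in C++ as
follows; the cost matrix is padded with an extra row and column of infinities
so the border cells need no special handling (function names are illustrative,
not our library's API):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Numerically stable soft minimum of three values.
static double softmin3(double a, double b, double c, double gamma)
{
    double m = std::min(a, std::min(b, c));
    return m - gamma * std::log(std::exp(-(a - m) / gamma) +
                                std::exp(-(b - m) / gamma) +
                                std::exp(-(c - m) / gamma));
}

// Sequential soft-DTW between univariate series x and y via Bellman's
// recursion: R[i][j] = d(x[i], y[j]) + soft-min of the three neighbors.
double softdtw_cpu(const std::vector<double> &x, const std::vector<double> &y,
                   double gamma)
{
    std::size_t m = x.size(), n = y.size();
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<std::vector<double>> R(m + 1, std::vector<double>(n + 1, inf));
    R[0][0] = 0.0;
    for (std::size_t i = 1; i <= m; ++i)
        for (std::size_t j = 1; j <= n; ++j)
        {
            double d = (x[i - 1] - y[j - 1]) * (x[i - 1] - y[j - 1]);
            R[i][j] = d + softmin3(R[i - 1][j], R[i][j - 1], R[i - 1][j - 1],
                                   gamma);
        }
    return R[m][n];
}
```

With $\gamma$ near zero this reproduces hard DTW; the quadratic nested loop is
what the GPU kernels parallelize along antidiagonals.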
\subsection{Related Work}
Because DTW has so many useful data mining applications but suffers from
quadratic complexity, scholars have devised a wide variety of exact and
approximate approaches to improve its performance.
Dr. Eamonn Keogh and several others have utilized various indexing schemes to
construct lower bounds on warping distance as an optimization technique for
speeding up nearest neighbor search via early removal of poor candidates
\cite{keogh_exact_2002}.
Shen and Chi (2021) propose an optimization of nearest neighbor search of
multivariate time series, leveraging the triangle inequality and
quantization-based point clustering to restrict the search
\cite{shen_tc-dtw_2021}.
Xiao et al (2013) introduced a prefix computation technique for transforming the
diagonal data dependence to improve parallel computation of the cost matrix on
GPU \cite{xiao_parallelizing_2013}.
Zhu et al (2018) demonstrate a method of optimizing memory access by taking
advantage of the diagonal data dependency to rearrange the matrix so that
elements on the same diagonal are stored contiguously
\cite{zhu_developing_2018}.
Salvador and Chan (2004) \cite{salvador_fastdtw_2004} proposed a method of
accelerating and approximation of DTW, dubbed FastDTW, by first downsampling the
time series to a fraction of its length to find an approximate best path and
then recursively upsampling and reapplying the algorithm, using the previous
lower-resolution best path to construct upper and lower bounds for next
pass. However, Wu and Keogh (2020) demonstrate that this method is generally
slower than exact DTW on most realistic datasets.
A prior implementation of Soft-DTW on CUDA using PyTorch and Numba claims to
achieve 100x performance improvement over the original Soft-DTW reference
implementation in Cython code, but is limited to sequence lengths of 1024 (CUDA
maximum thread block size) and leaves many opportunities for further CUDA
optimizations such as the use of shared memory
\cite{maghoumi_pytorch-softdtw-cuda_2021}.
A 2015 paper describes a tiling approach to problems with diagonal data
dependencies such as DTW, called \emph{PeerWave}, which utilizes direct
synchronization between neighboring streaming multiprocessors (SMs) to handle
the inter-tile data dependency without atomic operations, locks, or other global
barriers, leading to improved load balance and scaling properties
\cite{belviranli_peerwave_2015}. This tiling strategy also enables the PeerWave
algorithm to handle longer time series than 1024 by dividing the work for a pair
of time series among multiple CUDA thread blocks.
In our work, we primarily focus on this general area of opportunity, optimizing
DTW cost matrix structure and memory access patterns to increase parallelism and
minimize memory latency in the computation of the warping path matrix.
\section{Methods}
To evaluate various performance optimizations on the Soft-DTW computation, we
implemented a C++ and CUDA library called cuTimeWarp, which includes functions
for computing the SoftDTW on CPU and GPU. Our library is public on GitHub at:
\href{https://github.com/alexkyllo/cutimewarp}{github.com/alexkyllo/cutimewarp}.
We defined the task as computing all Soft-DTW discrepancies among a batch of
time series. Given a set of many univariate or multivariate time series of the
same length and number of variables, we can compute the Soft-DTW distance
between every time series and every other time series in the set by computing,
in parallel for each pair, a pairwise squared Euclidean distance matrix, then applying
the Soft-DTW calculation on the resulting array of distance matrices. This
pairwise squared Euclidean distance computation, however, also has an $O(n^2)$
complexity and can potentially even take longer than the DTW computation
itself. For univariate time series, we could save this cost by computing the DTW
cost matrix on the two input time series directly, evaluating the distance
between each pair of values inside the nested loop of the DTW procedure, and
pre-computing the distance matrix only in the multivariate case; this would be
our recommended approach for a production-grade implementation.
To verify program correctness, we wrote a suite of unit tests using short
example time series with results verified against those of Cuturi and Blondel's
original sequential CPU implementation of Soft-DTW \cite{cuturi_soft-dtw_2018},
which is available on GitHub at
\href{https://github.com/mblondel/soft-dtw}{github.com/mblondel/soft-dtw}.
To test performance, we developed small C++ command-line programs to run
Soft-DTW on either a space-delimited input data file or random unit normal data
of a specified length and count (by specifying the word ``random'' instead of a
filename).
The programs utilize separate compilation; a Makefile is included and
\verb|make build| will compile the library and all of the programs.
Their output contains four columns of data, separated by spaces, one line per
kernel:
\begin{enumerate}
\item Kernel function name
\item The input time series length (number of columns per row)
\item The input time series count (number of rows)
\item The execution time in microseconds
\end{enumerate}
Example usage of the CPU Soft-DTW program \verb|soft_dtw_perf_cpu|:
\begin{verbatim}
$ ./bin/soft_dtw_perf_cpu
./bin/soft_dtw_perf_cpu
Usage: ./bin/soft_dtw_perf_cpu [INPUT_FILENAME] | random [length] [count]
$ ./bin/soft_dtw_perf_cpu ./data/ECG200/ECG200_ALL.txt
Data file ./data/ECG200/ECG200_ALL.txt contains 200 time series of length 96
sq_euclidean_distance 96 200 1375707
softdtw 96 200 18064611
$ ./bin/soft_dtw_perf_cpu random 100 100
sq_euclidean_distance 100 100 373003
softdtw 100 100 4960461
\end{verbatim}
Example usage of the GPU Soft-DTW program \verb|soft_dtw_perf_multi|:
\begin{verbatim}
$ ./bin/soft_dtw_perf_multi
Usage: ./bin/soft_dtw_perf_multi [INPUT_FILENAME] | random [length] [count]
$ ./bin/soft_dtw_perf_multi ./data/ECG200/ECG200_ALL.txt
Data file ./data/ECG200/ECG200_ALL.txt contains 200 time series of length 96
sq_euclid_dist_multi 96 200 515037
softdtw_cuda_naive_multi 96 200 264987
softdtw_cuda_naive_multi_bw_80 96 200 235089
softdtw_cuda_naive_multi_bw_60 96 200 168621
softdtw_cuda_naive_multi_bw_40 96 200 83501
softdtw_cuda_naive_multi_bw_20 96 200 51338
softdtw_cuda_stencil_multi 96 200 100990
softdtw_cuda_stencil_multi_80 96 200 100408
softdtw_cuda_stencil_multi_60 96 200 100844
softdtw_cuda_stencil_multi_40 96 200 101215
softdtw_cuda_stencil_multi_40 96 200 100436
softdtw_cuda_stencil_multi_20 96 200 100647
convert_diagonal_multi 96 200 332664
softdtw_cuda_diagonal_multi 96 200 149158
$ ./bin/soft_dtw_perf_multi random 100 100
sq_euclid_dist_multi 100 100 335883
softdtw_cuda_naive_multi 100 100 61576
softdtw_cuda_naive_multi_bw_80 100 100 52272
softdtw_cuda_naive_multi_bw_60 100 100 32211
softdtw_cuda_naive_multi_bw_40 100 100 18919
softdtw_cuda_naive_multi_bw_20 100 100 18725
softdtw_cuda_stencil_multi 100 100 26558
softdtw_cuda_stencil_multi_80 100 100 25803
softdtw_cuda_stencil_multi_60 100 100 31000
softdtw_cuda_stencil_multi_40 100 100 26120
softdtw_cuda_stencil_multi_40 100 100 25804
softdtw_cuda_stencil_multi_20 100 100 30992
convert_diagonal_multi 100 100 87427
softdtw_cuda_diagonal_multi 100 100 43893
\end{verbatim}
\newpage
\subsection{Optimization Techniques}
We opted to experiment with the following performance optimization techniques:
\begin{enumerate}
\item Wavefront tiling (based on PeerWave)
\item Diagonal-major data layout
\item Shared memory stencil
\item Sakoe-Chiba bands
\end{enumerate}
Before delving into the performance results, we will provide a brief explanation
of how each of these techniques is implemented and why it provides a performance
improvement.
\subsubsection{Wavefront Tiling}
Wavefront parallelism is one of the most useful methods for overcoming the
dependencies of nested loops across multiple processing units. The idea is to
re-order the loop iterations so that they form a diagonal wave, and each wave
can be computed in parallel. Barriers are used to enforce the data
dependencies among consecutive waves. Elements inside a wave are grouped
together using a tiling technique to enhance data locality and performance.
This methodology provides a second degree of parallelism, where tiles can be
computed in parallel and coordinated through a signal variable.
GPUs, and specifically CUDA, can accelerate wavefront parallelism. Each tile
is assigned to a block to be processed in parallel by an SM, relying on
block-SM affinity. Meanwhile, the iterations along the diagonals within a
tile are also parallelized. To enforce the dependencies, barrier
synchronization is used among the tiles and among the threads within each
tile \cite{belviranli_peerwave_2015}.
In our Soft-DTW implementation, we utilized the wavefront technique to manage
the dependencies of neighboring cells for computing the minimum warp path. In
the algorithm, each path cost value depends on the minimum cost of the upper,
left and upper-left diagonal cell's cost. Figure \ref{DTW_dependency}
illustrates this dependency structure. Each $D_{i,j}$ depends on the upper,
left, and upper-left neighbor's cost.
\begin{figure}[htbp]
\includegraphics[height=2in]{img/tiling_dependencies.png}
\centering
\caption{Computation dependencies for DTW \cite{belviranli_peerwave_2015}.}
\label{DTW_dependency}
\end{figure}
The wavefront process is split into three major steps:
\begin{itemize}
\item Populating the dependencies in a loop. In this step, we keep the global
    variable WaveId to separate the processing of waves. The wave length
increases one by one on each loop iteration until reaching the total number
of tiles. The primary kernel, \verb|softdtw_global_tiled| is called with the
wave length, number of blocks, and tile width in number of threads per block.
\item In this kernel, the row and column index for each tile is calculated and
    passed as a global variable to the corresponding kernel for the tiles'
computation.
\item The third step is the main kernel, which processes the tiles in
    parallel. Shared memory is employed to keep each tile in cache and
improve the overall performance. As we mentioned earlier, we need to
calculate the soft-min for the dependent cells. Therefore, the soft-min is
    calculated for the upper, left, and diagonal upper-left indices at this
    stage (equation \ref{dependencies}).
\end{itemize}
\begin{equation} \label{dependencies}
R(i,j) = D(i,j) + \text{soft-min}_\gamma
\begin{cases}
R(i-1,j) \\
R(i,j-1) \\
R(i-1,j-1)
\end{cases}
\end{equation}
where $R$ is the cost matrix and $D$ is the pairwise squared distance matrix.
A drawback is that the use of barrier synchronization lowers GPU
utilization. Few threads are active at the beginning and end of processing
each tile, because the antidiagonals there are shorter. Also, fewer tiles are
available toward the beginning and end of the wave iterations. Therefore, a
large portion of the device's SMs remains idle at the start and end of the
process. The overall benefit of the wavefront tiling strategy is that it
addresses the problem of long time series by subdividing the problem among
multiple thread blocks, and also improves cache efficiency with shared memory.
\subsubsection{Diagonal-major layout}
Another important optimization that we explored was using a diagonal-major array
layout for the cost matrix. Because the data dependency structure of the DTW
algorithm results in elements of the distance and cost matrices on the same
antidiagonal being processed in parallel, storing these matrices in row-major or
column-major order will cause a performance impact from cache misses and
non-coalesced accesses to global memory.
If the data is first rearranged into an antidiagonal-major layout, then at each
iteration of the wavefront loop, processor threads will make coalesced accesses
to data elements that are laid out contiguously in memory. As illustrated in
Figure \ref{diagonal_layout}, this transforms an $m \times n$ matrix that must be
iterated over diagonally, into an $(m+n-1) \times \min(m,n)$ matrix that can be
iterated from top to bottom, with one thread assigned to each column. At each
iteration step (i.e. each row of the diagonal-major matrix), contiguous array
elements are read from memory into cache and written from cache back to memory,
resulting in coalesced accesses and fewer cache misses.
\begin{figure}[htbp]
\includegraphics[height=2.5in]{img/diagonal_layout.png}
\centering
\caption{Cost matrix transformation to antidiagonal-major layout (arrows show
iteration direction)}
\label{diagonal_layout}
\end{figure}
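For a square $n \times n$ matrix, the mapping from a row-major coordinate to
its position in the diagonal-major layout can be sketched as follows (an
illustrative helper, not our library's API):

```cpp
#include <cstddef>

// Map matrix coordinate (i, j) of an n x n matrix to its offset in a
// diagonal-major array of shape (2n - 1) x n, where storage row d = i + j
// holds antidiagonal d contiguously.
std::size_t diagonal_major_index(std::size_t i, std::size_t j, std::size_t n)
{
    std::size_t d = i + j; // which antidiagonal (i, j) lies on
    // Position within the antidiagonal: rows start at i = 0 in the upper
    // triangle, and at i = d - n + 1 once the diagonal leaves the top edge.
    std::size_t offset = (d < n) ? i : i - (d - n + 1);
    return d * n + offset;
}
```

Neighbors on the same antidiagonal, such as $(1,2)$ and $(2,1)$, then occupy
adjacent array slots, which is what makes the wavefront accesses coalesced.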
\subsubsection{Shared memory stencil computation}
We also explored the use of shared memory as a ``stencil'' for memoizing
previously computed costs to be reused in subsequent computations, before
writing them back to the cost matrix array. As the program iterates diagonally
across the distance matrix to find the optimal warping path, each cell in the
path utilizes the previously computed results of three previous neighboring
cells; if the iteration is visualized as proceeding from the upper left to the
lower right corner of the matrix, the cost value in each cell depends on the
(soft) minimum of the costs in the cell above, the cell to the left, and the
cell to the diagonal upper-left, which were computed in the previous two
iterations (Figure \ref{DTW_dependency}). If the cost matrix \verb|R| resides in
global memory, then non-contiguous accesses to \verb|R[i-1][j]|,
\verb|R[i][j-1]| and \verb|R[i-1][j-1]| will result in cache misses, incurring a
significant performance cost. Since each element of \verb|R| will be referenced
up to three times in the computation of dependent cells, these cache misses can
be avoided via a stencil computation using shared memory in CUDA. The stencil
serves as a cache for the current and previous two diagonals; once a diagonal is
no longer in use, its elements are written back to the cost matrix \verb|R| in
device global memory.
\FloatBarrier The algorithm can be modified to use shared memory according to
the following pseudocode:
\begin{verbatim}
D is a squared euclidean distance matrix of two time series
R is a cost matrix initialized to positive infinity except for R[0, 0] = 0
for each anti-diagonal of R starting from R[0, 0]
if the current thread index < length of the current anti-diagonal
copy R[i][j] from global memory into the stencil array
read R[i-1][j], R[i][j-1] and R[i-1][j-1] from the stencil array
compute cost as softmin(R[i-1][j], R[i][j-1], R[i-1][j-1]) + D[i-1][j-1]
write the cost back to the stencil
copy the cost in R[i-1][j-1] from the stencil back to global memory
\end{verbatim}
A modulo operation keeps track of which section of the shared memory array
holds the current diagonal and which hold the previous two, so that each data
element is written to shared memory only once.
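As a sketch of this rotation, the following CPU code computes a DTW cost antidiagonal by antidiagonal using only a three-slot rolling buffer indexed by modulo (it uses a hard min rather than softmin for brevity, and a plain array rather than CUDA shared memory; names are illustrative):

```cpp
#include <algorithm>
#include <cfloat>
#include <vector>

// DTW cost (hard min for brevity) computed one antidiagonal at a time,
// keeping only three diagonals live in a rolling buffer; the modulo on
// the diagonal index d rotates which slot is current/previous/previous-1.
float dtw_rolling(const std::vector<float>& D, int m, int n)
{
    int w = n + 1; // each slot holds one diagonal of the (m+1) x (n+1) R
    std::vector<float> buf(3 * w, FLT_MAX);
    buf[0] = 0.0f; // R[0][0] lives on diagonal 0, slot 0
    for (int d = 1; d <= m + n; ++d) {
        int cur = d % 3;      // slot for the current diagonal
        int p1 = (d + 2) % 3; // slot holding diagonal d - 1
        int p2 = (d + 1) % 3; // slot holding diagonal d - 2
        for (int j = std::max(0, d - m); j <= std::min(d, n); ++j) {
            int i = d - j;
            if (i == 0 || j == 0) { // boundary row/column stays infinite
                buf[cur * w + j] = FLT_MAX;
                continue;
            }
            float up   = buf[p1 * w + j];       // R[i-1][j]
            float left = buf[p1 * w + (j - 1)]; // R[i][j-1]
            float diag = buf[p2 * w + (j - 1)]; // R[i-1][j-1]
            buf[cur * w + j] = D[(i - 1) * n + (j - 1)]
                             + std::min(std::min(up, left), diag);
        }
    }
    return buf[((m + n) % 3) * w + n]; // R[m][n]
}
```

Because diagonal $d$ reads only diagonals $d-1$ and $d-2$, three slots suffice, and each write lands in the slot vacated by diagonal $d-3$.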
\FloatBarrier
\subsubsection{Sakoe-Chiba bands}
Sakoe-Chiba bands, proposed by Sakoe and Chiba in their 1978 paper ``Dynamic
programming algorithm optimization for spoken word recognition,'' introduce a
``slope constraint'' on the warping path that limits the maximum allowed warping
distance, beyond which a pair will not be considered in the optimal path
calculation \cite{sakoe_dynamic_1978}. Pruning the search space by removing some
of the extreme diagonals from consideration yields an approximation of the
optimal warping path that can be calculated in sub-quadratic time.
This optimization is simple to implement for square matrices (i.e. DTW on time
series of equal length) by checking that the absolute difference between the
loop counter variables \verb|i| and \verb|j| does not exceed a fixed bandwidth
threshold value (Figure \ref{sakoe_chiba}). For rectangular matrices, since the
leading diagonal does not end at the lower right corner, the implementation must
be slightly modified to ensure that the counter variable along the longer of the
two dimensions stays within a defined radius. Either way, the result is a
diagonal band matrix.
In a parallel programming environment such as CUDA, this optimization also
allows the warping path to be computed with fewer threads, as threads assigned
to cost matrix cells outside the band go unused. Space savings can also be
obtained if the bandwidth is known in advance, by storing the distance matrix
and cost matrix in band storage format, omitting the unused cells in the
corners.
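A minimal sketch of the band membership test follows (our kernels implement a comparable check on the loop counters; the rectangular-case scaling shown here is one possible choice):

```cpp
#include <algorithm>
#include <cstdlib>

// Cell (i, j) of an m x n cost matrix lies inside the Sakoe-Chiba band
// when j stays within `radius` of the leading diagonal; for rectangular
// matrices the diagonal is followed by projecting i onto the column axis.
bool in_band(int i, int j, int m, int n, int radius)
{
    int center = (int)((long long)i * n / m); // column under the diagonal
    return std::abs(j - center) <= radius;
}
```

For square matrices ($m = n$) this reduces to the simple check $|i - j| \le \mathrm{radius}$ described above.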
While this technique produces only an approximation of the optimal path, in
practice it has been shown to improve task performance by preventing
pathological alignments where a very small portion of one time series maps onto
a large portion of the other \cite{keogh_exact_2002}. The width of the band
can be a tunable hyperparameter for time series classification tasks.
\begin{figure}[htbp]
\includegraphics[height=2in]{img/sakoe_chiba.png}
\centering
\caption{Illustration of one possible optimal DTW path for a length 5 series and
a length 8 series with Sakoe-Chiba band radius = 1.}
\label{sakoe_chiba}
\end{figure}
\subsection{Complexity Analysis}
In order to quantify the performance of our implementations as a floating point
operation execution rate, we needed to count the floating-point operations used
to compute the cost matrix. Calculating a single cell of the cost matrix in
Soft-DTW involves a softmin function of three real-valued arguments. The softmin
function for three arguments is implemented in C++ as:
\begin{verbatim}
float softmin(float a, float b, float c, const float gamma)
{
    // Scale by -gamma, then apply a numerically stable log-sum-exp
    float ng = -gamma;
    a /= ng;
    b /= ng;
    c /= ng;
    float max_of = max(max(a, b), c);
    float sum = exp(a - max_of) + exp(b - max_of) + exp(c - max_of);
    return ng * (log(sum) + max_of);
}
\end{verbatim}
This function contains 17 floating point operations:
\begin{itemize}
\item 1 x negation
\item 3 x division
\item 2 x max
\item 3 x exponentiation
\item 3 x subtraction
\item 3 x addition
\item 1 x natural logarithm
\item 1 x multiplication
\end{itemize}
Additionally, the softmin of the three previous cell costs is added to the
current cell's distance, for a total of 18 floating point operations (FLOPs)
per cell. Therefore Soft-DTW for a pair of time series of lengths $m$ and $n$
requires $18mn$ FLOPs.
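The execution rates reported in the Results section follow directly from this count; a sketch of the accounting (function names are ours):

```cpp
// 18 FLOPs per cost matrix cell (17 in softmin plus one addition),
// times the m * n cells of each pairwise comparison.
long long softdtw_flops(int m, int n, long long pairs)
{
    return 18LL * m * n * pairs;
}

// Execution rate in GFLOPS given a wall-clock measurement in seconds.
double gflops_rate(long long flops, double seconds)
{
    return flops / seconds / 1e9;
}
```

For example, comparing every pair in a set of 200 length-96 series ($200^2$ pairs) amounts to $18 \times 96^2 \times 200^2 \approx 6.6 \times 10^9$ FLOPs.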
\subsection{Test Hardware Specifications}
\FloatBarrier
\begin{table}[htbp]
\centering
\caption{Device Hardware Specifications}
\label{specs}
\begin{tabular}{lr}
\toprule
Property & Value \\
\midrule
Model & GeForce RTX 2060 Super \\
Generation & Turing \\
Compute Capability & 7.5 \\
SM Count & 34 \\
CUDA Cores & 2176 \\
Tensor Cores & 272 \\
CUDA Version & 11.2 \\
Driver Version & 460.39 \\
Clock Speed (MHz) & 1470 \\
Boost Speed (MHz) & 1710 \\
Memory Speed (Gbps) & 14 \\
Memory Size (GiB) & 8 \\
Memory Bandwidth (GB/s) & 448 \\
L1 Cache (KiB) (per SM) & 64 \\
L2 Cache (KiB) & 4096 \\
\bottomrule
\end{tabular}
\end{table}
\begin{table}[htbp]
\centering
\caption{Host Hardware Specifications}
\label{hostspecs}
\begin{tabular}{lr}
\toprule
Property & Desktop \\
\midrule
RAM Size: & 64GB \\
Configuration: & 16GB x 4 \\
Type: & DDR4 \\
Speed: & 2667 MT/s \\
Bandwidth: & 16415 MiB/s \\
Architecture: & x86\_64 \\
CPU(s): & 16 \\
Thread(s) per core: & 2 \\
Core(s) per socket: & 8 \\
Model name: & AMD Ryzen 7 3700X 8-Core Processor \\
CPU MHz: & 1862.571 \\
CPU max MHz: & 3600.0000 \\
CPU min MHz: & 2200.0000 \\
BogoMIPS: & 7187.14 \\
L1d cache: & 256 KiB \\
L1i cache: & 256 KiB \\
L2 cache: & 4 MiB \\
L3 cache: & 32 MiB \\
\bottomrule
\end{tabular}
\end{table}
We tested the program on a single GeForce RTX 2060 Super GPU with the
specifications detailed in Table \ref{specs} \cite{techpowerup}. Per NVIDIA's
documentation, ``The peak single precision floating point performance of a CUDA
device is defined as the number of CUDA cores times the graphics clock frequency
multiplied by two'' \cite{nvidia_flops}. For this device, the equation works out
to $2176 \times 1.71 \times 2 = 7441.92$ GFLOPS.
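This peak figure can be reproduced mechanically (a trivial sketch of NVIDIA's formula; the helper name is ours):

```cpp
// Peak single-precision GFLOPS: CUDA cores x clock (GHz) x 2,
// since a fused multiply-add counts as two floating point operations.
double peak_gflops(int cuda_cores, double clock_ghz)
{
    return cuda_cores * clock_ghz * 2.0;
}
```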
The GPU device is installed on a desktop computer running Ubuntu Linux 20.04
with CPU and memory specifications detailed in Table \ref{hostspecs}.
\begin{table}[htbp]
\centering
\caption{UWB LAB Device Hardware Specifications}
\label{specslab}
\begin{tabular}{lr}
\toprule
Property & Value \\
\midrule
Model & GeForce GTX 745 \\
Compute Capability & 3.0 \\
SM Count & 3 \\
CUDA Cores & 384 \\
CUDA Version & 10.1 \\
Driver Version & 10.1 \\
Clock Speed (MHz) & 1032 \\
Memory Clock Rate (MHz) & 14 \\
Memory Bandwidth (GB/s) & 2047 \\
L1 Cache (KiB) (per SM) & 64 \\
L2 Cache (KiB) & 2048 \\
Maximum number of resident blocks per multiprocessor & 16 \\
Maximum amount of shared memory per multiprocessor (KiB) & 48 \\
Maximum number of threads per multiprocessor & 2048 \\
Total amount of shared memory per block (B) & 49152 \\
\bottomrule
\end{tabular}
\end{table}
In our experiments we also used a laboratory machine whose GPU has compute
capability 3.0 and only 3 SMs, running the CentOS Linux operating system. Table
\ref{specslab} details the UWB lab device. Due to the lower performance of this
machine, we used it only for evaluating the wavefront tiled kernel
implementation, which is in any case not directly comparable to the other
kernels, as it is optimized for very long time series rather than large batches
of short time series. Therefore, whenever we discuss the performance of the
tiled kernel in this document, we refer to this table for hardware
specifications.
\subsection{Test Data}
For performance testing we selected the ECG 200 dataset from the UCR Time Series
Archive \cite{dau_ucr_2019}. This dataset consists of 200 electrocardiogram
time series of length 96. We also wrote a performance testing program that
generates sets of random unit normal data to test the performance impact of
varying the time series length and count.
\section{Results}
\FloatBarrier
As a baseline for performance comparison, we timed the naive, sequential CPU
implementation of Soft-DTW and found that its execution rate varied between
0.35 and 0.4 GFLOPS (Figure \ref{plot_cpu}) on the hardware described in Table
\ref{hostspecs}, for random unit normal time series of length 100. Figure
\ref{plot_cpu_1024} shows the same for time series of length 1024, with almost
identical performance results. While a multithreaded parallel CPU implementation
could likely achieve several times this execution rate, depending on the
number of available cores, we opted not to experiment with CPU parallelism and
instead moved on to the focus of the project: parallelizing the algorithm in
CUDA with GPU multithreading.
\begin{figure}[htbp]
\begin{center}
\scalebox{0.85}{\input{fig/plot_cpu.pgf}}
\end{center}
\caption{CPU performance on random unit normal time series length = 100}
\label{plot_cpu}
\end{figure}
\begin{figure}[htbp]
\begin{center}
\scalebox{0.85}{\input{fig/plot_cpu_1024.pgf}}
\end{center}
\caption{CPU performance on random unit normal time series length = 1024}
\label{plot_cpu_1024}
\end{figure}
Figure \ref{plot_cpu_gpu} shows the execution rate of the naive CUDA kernel
vs. the CPU implementation for comparison. The CUDA kernel achieves performance
of around 34 GFLOPS compared to the CPU's 0.36 GFLOPS, a 92x performance
increase over naive sequential processing.
\begin{figure}[htbp]
\begin{center}
\scalebox{0.85}{\input{fig/plot_cpu_gpu.pgf}}
\end{center}
\caption{CPU vs. CUDA (Naive) performance on random unit normal time series
length = 100}
\label{plot_cpu_gpu}
\end{figure}
\begin{figure}[htbp]
\begin{center}
\scalebox{0.85}{\input{fig/plot_multi.pgf}}
\end{center}
\caption{Kernel performance on random unit normal time series length =
100}
\label{plot_multi}
\end{figure}
\begin{figure}[htbp]
\begin{center}
\scalebox{0.85}{\input{fig/plot_multi_1024.pgf}}
\end{center}
\caption{Kernel performance on random unit normal time series length =
1024}
\label{plot_multi_1024}
\end{figure}
After testing the naive kernel we experimented with CUDA performance
optimizations. Both the diagonal data layout kernel and the shared memory
stencil kernel brought significant performance improvements over the naive
kernel, as seen in Figure \ref{plot_multi}. While the naive kernel achieved
throughput in the low 30s of GFLOPS, the diagonal data layout transformation
increased performance into the low 40s, and the shared memory implementation
reached almost 70 GFLOPS, essentially double the execution rate of the naive
kernel, for time series of length 100. When the time series length is increased
to 1024, the stencil and diagonal kernels continue to perform at a similar
rate, while the naive kernel's performance falls to 20 GFLOPS or less, so the
value of these optimizations grows with the problem size. As we show in the
Profiling section, this drop in performance for the naive kernel is due to the
cost matrix no longer fitting in L1 cache.
Our performance measurements stop at a problem size of 40000 pairwise DTW
comparisons because the entire problem is processed in a single 3D tensor of
pairwise distance matrices that is transferred to the device once, to avoid
mixing data transfer performance with computation performance. The
diagonal-major layout consumes more space than the row-major layout (an
$(m+n-1) \times \min(m,n)$ matrix vs. an $m \times n$ matrix), so the
diagonal-major kernel is the first to exceed the 8 GiB device memory capacity
of the test GPU hardware. A
potential improvement for future work would be a program to batch the input data
and launch the kernels on multiple streams to avoid exceeding device memory
limits and hide the latency associated with multiple transfers.
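The footprint difference can be made concrete with a quick sketch (float32 storage assumed; the helper names are ours):

```cpp
#include <algorithm>

// Bytes required for a batch of float32 distance matrices, row-major.
long long bytes_row_major(int m, int n, long long pairs)
{
    return 4LL * m * n * pairs;
}

// Bytes for the same batch in diagonal-major layout, where every
// antidiagonal is padded to width min(m, n).
long long bytes_diagonal_major(int m, int n, long long pairs)
{
    return 4LL * (m + n - 1) * std::min(m, n) * pairs;
}
```

For $m = n = 100$, one pair costs 40,000 bytes row-major but 79,600 bytes diagonal-major, so the diagonal kernel reaches the 8 GiB ceiling at roughly half the batch size.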
Sakoe-Chiba bands also improved performance \emph{even after} adjusting for the
reduced number of total floating point operations; for input time series of
length 100, a bandwidth of 40\% of the matrix size resulted in the highest
execution rate, exceeding 60 GFLOPS, compared to the low 30s when the bandwidth
is set to 100\% (Figure \ref{plot_bw}).
\begin{figure}[htbp]
\begin{center}
\scalebox{0.85}{\input{fig/plot_naive_multi_bw.pgf}}
\end{center}
\caption{Naive kernel performance on random unit normal data by
Sakoe-Chiba bandwidth at time series length = 100}
\label{plot_bw}
\end{figure}
However, when the input time series length was increased to 1024, the picture
changed: the kernel with Sakoe-Chiba bandwidth equal to 20\% of the matrix size
performed best, exceeding 50 GFLOPS, while the 40\% bandwidth kernel fell
below 40 GFLOPS and the others fell to just 20 GFLOPS or less (Figure
\ref{plot_bw_1024}). This suggests, intuitively, that a narrower bandwidth has a
larger positive performance impact as the input time series length increases.
This is also probably related to cache size limitations: as the matrix grows,
the maximum bandwidth at which the needed data still fits in L1 cache shrinks.
\begin{figure}[htbp]
\begin{center}
\scalebox{0.85}{\input{fig/plot_naive_multi_bw_1024.pgf}}
\end{center}
\caption{Naive kernel performance on random unit normal data by
Sakoe-Chiba bandwidth at time series length = 1024}
\label{plot_bw_1024}
\end{figure}
The final optimization we implemented and tested was the PeerWave-style tiled
kernel as described in \cite{belviranli_peerwave_2015}. As our other kernels
are limited to time series of maximum length 1024, because they process one
distance matrix per CUDA thread block, this kernel gave us the ability to test
the computation of very long time series.
The tiled kernel uses global barriers to handle dependencies across rows or
``waves'' of tiles, with each row signaling to its ``peer'' in the next row when
its work is completed. There is one loop inside the kernel to manage the wave
and another iteration within each tile. In order to test on large sets of time
series, we had to add two more nested loops to launch the kernel across multiple
CUDA streams. This generated a high kernel launch overhead, decreasing the
performance significantly and preventing us from comparing it with the other
kernels, which are written to process multiple smaller distance matrices at
once. Thus the main utility of the tiled implementation is computing the
similarity of much longer sequences than the other kernels are able to
process. Figure \ref{plot_tiled} shows that performance is low for shorter
sequences, but improves as the length increases, exceeding 120 GFLOPS, a far
higher execution rate than the other kernels achieved on large sets of smaller
time series pairs. We therefore suggest that a general-purpose DTW library
select this tiling approach only when the input time series are very long.
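The wavefront dependency structure can be sketched as follows: tiles on the same antidiagonal of the tile grid are mutually independent, so a wave may start once its predecessor wave signals completion (illustrative host code, not our kernel):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Tiles (ti, tj) belonging to wave w of a tiles_m x tiles_n tile grid:
// every tile with ti + tj == w depends only on tiles in waves w-1 and w-2,
// so all members of a wave can be computed concurrently.
std::vector<std::pair<int, int>> wave_tiles(int w, int tiles_m, int tiles_n)
{
    std::vector<std::pair<int, int>> tiles;
    for (int ti = std::max(0, w - tiles_n + 1);
         ti <= std::min(w, tiles_m - 1); ++ti)
        tiles.push_back({ti, w - ti});
    return tiles;
}
```

A grid of $T_m \times T_n$ tiles requires $T_m + T_n - 1$ waves in total.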
\begin{figure}[htbp]
\includegraphics[height=4in]{fig/plot_tiled.png}
\centering
\caption{Soft DTW performance for tiled \& shared memory kernel on random unit
normal time series of varying lengths}
\label{plot_tiled}
\end{figure}
We additionally tested our kernel optimizations on the ECG200 dataset, a
curated set of 200 real heart rhythm time series of length 96, so pairwise
comparison of every time series to every other results in
$96^2 \times 200^2$ Soft-DTW cost cell computations. Table \ref{ecg_table}
shows performance on this dataset by kernel. The kernels named ``squared
Euclidean distance'' report the time and GFLOPS needed to produce the distance
matrix from the original input time series, before the Soft-DTW algorithm
iterates over it to find the lowest-cost path. While we wrote all of our kernels
to handle preprocessed squared Euclidean distance matrices, which imparts the
ability to handle multivariate time series in addition to univariate, the
overhead of computing this distance matrix suggests that it would be more time
and memory efficient, in the case of univariate time series, to compute the DTW
cost matrix on the input time series directly.
The kernel named ``convert to diagonal'' reports the computation required to
convert the squared Euclidean distance matrix from row-major to diagonal-major
layout; this conversion takes more than twice as long as the ``diagonal''
kernel itself, suggesting that constructing the cost matrix in diagonal-major
format in the first place could improve end-to-end performance. The shared
memory stencil kernel produced the best performance at 69.24 GFLOPS, with the
naive CUDA kernel with the Sakoe-Chiba band set to 40\% of the matrix size
performing next best on the ECG200 dataset.
\input{fig/ecg_kernels.tex}
\FloatBarrier
\section{Profiling}
We profiled our kernels with NVIDIA Nsight Compute to understand how cache hit
rates, registers per thread, and occupancy may be related to performance. We
profiled the naive, diagonal, and stencil kernels without adjusting the
Sakoe-Chiba bandwidth, to make a fair comparison.
Table \ref{prof_table} shows profiler metrics by kernel when run on two pairs of
time series of length 100, while Table \ref{prof_table_1000} shows results for
two pairs of time series of length 1000 each. When the matrix is small enough to
fit in L1 cache, the naive kernel has the highest L1 cache hit rate. As the time
series length increases, the L1 cache hit rate falls for the naive and stencil
kernels, but not for the diagonal kernel, which also has the highest SM busy
rate, supporting our hypothesis that diagonal data layout would improve cache
efficiency and memory throughput via coalesced memory access.
The shared memory stencil kernel exhibits low L1 cache hit rate at matrix size
1000 because it utilizes shared memory instead, as a programmer-controlled
cache, allowing it to achieve even higher performance than the diagonal
kernel. In contrast, the naive kernel's performance suffers dramatically when
the cost matrix is too large to fit in L1 cache, as evidenced by the drop in L1
hit rate and corresponding drop in performance.
All three kernels used 43 registers per thread, so there is no difference to be
discerned from that metric. At a matrix size of $1000 \times 1000$ (and
therefore 1000 threads) the warp occupancy for our kernels is 100\% on the test
device, so occupancy limits were not an issue, and neither was register
pressure.
\input{fig/prof_table.tex}
\input{fig/prof_table_1000.tex}
\FloatBarrier
\section{Discussion}
In order to improve performance and make the best use of the GPU device
resources, we converted our kernels to compute an entire dataset of multiple
time series in a single kernel launch. This removed the need for a nested loop
comparing time series one by one, replacing it with the ability to process a
large batch of time series in bulk. Implementing this approach was technically
challenging and required considerable effort.
One of the major issues was the tiled kernel. The global barriers discussed in
the optimization section lead to nested loops for managing intra-tile and
inter-tile dependencies. The naive idea was to take advantage of CUDA streams
and run the loops concurrently. Although we succeeded in implementing this
method, it was not very efficient, and its performance compared to the other
kernels was disappointing. This could be improved in the
future by changes inside the kernel to cover multiple time series across
different sets of blocks even as multiple blocks within a set compute tiles of a
single time series. The primary strength of the tiled technique is its ability
to calculate the similarity of two extremely long time series with high