<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head>
<meta charset="utf-8">
<meta name="generator" content="quarto-1.8.25">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<title>Scrapsheet</title>
<style>
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
div.columns{display: flex; gap: min(4vw, 1.5em);}
div.column{flex: auto; overflow-x: auto;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
ul.task-list li input[type="checkbox"] {
width: 0.8em;
margin: 0 0.8em 0.2em -1em; /* quarto-specific, see https://github.com/quarto-dev/quarto-cli/issues/4556 */
vertical-align: middle;
}
/* CSS for syntax highlighting */
html { -webkit-text-size-adjust: 100%; }
pre > code.sourceCode { white-space: pre; position: relative; }
pre > code.sourceCode > span { display: inline-block; line-height: 1.25; }
pre > code.sourceCode > span:empty { height: 1.2em; }
.sourceCode { overflow: visible; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
pre > code.sourceCode { white-space: pre-wrap; }
pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
}
pre.numberSource code
{ counter-reset: source-line 0; }
pre.numberSource code > span
{ position: relative; left: -4em; counter-increment: source-line; }
pre.numberSource code > span > a:first-child::before
{ content: counter(source-line);
position: relative; left: -1em; text-align: right; vertical-align: baseline;
border: none; display: inline-block;
-webkit-touch-callout: none; -webkit-user-select: none;
-khtml-user-select: none; -moz-user-select: none;
-ms-user-select: none; user-select: none;
padding: 0 4px; width: 4em;
}
pre.numberSource { margin-left: 3em; padding-left: 4px; }
div.sourceCode
{ }
@media screen {
pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
}
</style>
<script src="scrapsheet_files/libs/clipboard/clipboard.min.js"></script>
<script src="scrapsheet_files/libs/quarto-html/quarto.js" type="module"></script>
<script src="scrapsheet_files/libs/quarto-html/tabsets/tabsets.js" type="module"></script>
<script src="scrapsheet_files/libs/quarto-html/axe/axe-check.js" type="module"></script>
<script src="scrapsheet_files/libs/quarto-html/popper.min.js"></script>
<script src="scrapsheet_files/libs/quarto-html/tippy.umd.min.js"></script>
<script src="scrapsheet_files/libs/quarto-html/anchor.min.js"></script>
<link href="scrapsheet_files/libs/quarto-html/tippy.css" rel="stylesheet">
<link href="scrapsheet_files/libs/quarto-html/quarto-syntax-highlighting-7b89279ff1a6dce999919e0e67d4d9ec.css" rel="stylesheet" id="quarto-text-highlighting-styles">
<script src="scrapsheet_files/libs/bootstrap/bootstrap.min.js"></script>
<link href="scrapsheet_files/libs/bootstrap/bootstrap-icons.css" rel="stylesheet">
<link href="scrapsheet_files/libs/bootstrap/bootstrap-9e3ffae467580fdb927a41352e75a2e0.min.css" rel="stylesheet" append-hash="true" id="quarto-bootstrap" data-mode="light">
<script src="https://cdnjs.cloudflare.com/polyfill/v3/polyfill.min.js?features=es6"></script>
<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml-full.js" type="text/javascript"></script>
<script type="text/javascript">
const typesetMath = (el) => {
if (window.MathJax) {
// MathJax Typeset
window.MathJax.typeset([el]);
} else if (window.katex) {
// KaTeX Render
var mathElements = el.getElementsByClassName("math");
var macros = [];
for (var i = 0; i < mathElements.length; i++) {
var texText = mathElements[i].firstChild;
if (mathElements[i].tagName == "SPAN" && texText && texText.data) {
window.katex.render(texText.data, mathElements[i], {
displayMode: mathElements[i].classList.contains('display'),
throwOnError: false,
macros: macros,
fleqn: false
});
}
}
}
}
window.Quarto = {
typesetMath
};
</script>
</head>
<body class="fullcontent quarto-light">
<div id="quarto-content" class="page-columns page-rows-contents page-layout-article">
<main class="content" id="quarto-document-content">
<header id="title-block-header" class="quarto-title-block default">
<div class="quarto-title">
<h1 class="title">Scrapsheet</h1>
</div>
<div class="quarto-title-meta">
</div>
</header>
<section id="other-places" class="level2">
<h2 class="anchored" data-anchor-id="other-places">Other Places</h2>
<ul>
<li>Tea Garden - 2 dinners</li>
</ul>
</section>
<section id="grocery-list" class="level2">
<h2 class="anchored" data-anchor-id="grocery-list">Grocery list</h2>
<ul>
<li>Lettuce</li>
<li>tomato</li>
<li>lunch meat</li>
<li>KIND protein bars (peanut butter, banana, dark chocolate)</li>
<li>dried mango</li>
<li>cereal</li>
<li>sugarless chocolate</li>
<li>frozen
<ul>
<li>chicken breast</li>
<li>frozen lunch?</li>
<li>fries</li>
<li>Detroit pepperoni pizza</li>
<li>mini-pizza</li>
<li>pot pie</li>
</ul></li>
<li>strawberry juice</li>
</ul>
</section>
<section id="misc" class="level2">
<h2 class="anchored" data-anchor-id="misc">Misc</h2>
<ul>
<li>Notes created but not added to _quarto.yml
<ul>
<li>ide-positron</li>
<li>llms-preprocessing</li>
<li>llms-production</li>
<li>job-consulting</li>
<li>job-interview</li>
<li>job-resume</li>
<li>logos</li>
<li>visualization-base</li>
</ul></li>
<li>Hierarchical Bootstrap
<ul>
<li>bootstrap types
<ul>
<li><a href="https://www.r-bloggers.com/2019/09/understanding-bootstrap-confidence-interval-output-from-the-r-boot-package/" class="uri">https://www.r-bloggers.com/2019/09/understanding-bootstrap-confidence-interval-output-from-the-r-boot-package/</a></li>
</ul></li>
<li><a href="https://radlfabs.github.io/posts/thesis/">Uncertainty quantification for cross-validation</a>
<ul>
<li>Procedure as described (<a href="https://radlfabs.github.io/posts/thesis/#ref-davison_bootstrap_1997">Davison and Hinkley 1997</a>; <a href="https://radlfabs.github.io/posts/thesis/#ref-goldstein_bootstrapping_2010">Goldstein 2010</a>)
<ul>
<li><p>Sample with replacement from the fold indices</p></li>
<li><p>Sample with replacement from the validation preds of the new set of fold indices</p></li>
<li><p>Calculate CV estimate on the new set of validation preds</p></li>
<li><p>CI Variants: basic, normal, studentized, percentile</p></li>
</ul></li>
</ul></li>
<li>My understanding
<ol type="1">
<li>Sample fold indices w/replacement</li>
<li>For each sampled fold’s validation set, sample the predictions w/replacement</li>
<li>For each sampled fold’s predictions on the validation set, calculate the score</li>
<li>Average the scores across the folds.</li>
<li>Repeat 1-10K times</li>
<li>Calculate CI variant on the distribution of averaged scores</li>
</ol></li>
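<li><p>A minimal Python sketch of the steps above (the <code>fold_preds</code> and <code>score</code> names are hypothetical, not from the cited posts):</p>

```python
import numpy as np

rng = np.random.default_rng(42)

def cv_bootstrap(fold_preds, score, n_boot=2000):
    """Hierarchical bootstrap of a CV estimate: resample folds, then
    resample predictions within each sampled fold, then average."""
    folds = list(fold_preds)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # 1. sample fold indices with replacement
        sampled = rng.choice(len(folds), size=len(folds), replace=True)
        fold_scores = []
        for f in sampled:
            y, yhat = fold_preds[folds[f]]
            # 2. sample that fold's validation predictions with replacement
            idx = rng.integers(0, len(y), size=len(y))
            # 3. score the resampled predictions
            fold_scores.append(score(y[idx], yhat[idx]))
        # 4. average the scores across the folds
        stats[b] = np.mean(fold_scores)
    return stats

def percentile_ci(stats, alpha=0.05):
    # 6. percentile CI on the distribution of averaged scores
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```
</li>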
<li>Questions
<ul>
<li>How many bootstrap iterations?</li>
</ul></li>
</ul></li>
<li>Empirical Orthogonal Functions (EOFs)
<ul>
<li><p>A form of PCA applied to spatiotemporal data that is useful for understanding patterns in fields like climate science, oceanography, and meteorology.</p></li>
<li><p>Components</p>
<ul>
<li>Spatial Patterns (EOFs) - Shows where <em>variability</em> is concentrated</li>
<li>Time series (principal components) - Shows when each pattern is <em>active</em> and with what <em>amplitude</em></li>
</ul></li>
<li><p>Process</p>
<ul>
<li>Locations are the variables and time points are the observations</li>
<li>Preprocess (scale) and perform PCA</li>
<li>Each EOF is an eigenvector (<span class="math inline">\(V\)</span>) representing a spatial pattern, and its eigenvalue tells you how much variance that pattern explains. The PC is a time series (<span class="math inline">\(\text{PC1}_i = V_{(,1)} \cdot A_{(i,)}\)</span>)</li>
</ul></li>
<li><p>Example: In sea surface temperature data, EOF1 might reveal the El Niño/La Niña pattern — showing warming/cooling in the tropical Pacific. The associated PC would be a time series showing El Niño events as positive peaks and La Niña as negative troughs.</p></li>
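<li><p>A short numpy sketch of the process (assumed layout: rows are time steps, columns are locations; <code>eof_analysis</code> is a hypothetical helper, not from a named library):</p>

```python
import numpy as np

def eof_analysis(field, n_modes=2):
    """SVD-based EOF analysis of a (time x location) field.
    Rows of `eofs` are spatial patterns; columns of `pcs` are the
    amplitude time series; `var_frac` is variance explained per mode."""
    X = field - field.mean(axis=0)        # anomalies: remove each location's time mean
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    eofs = Vt[:n_modes]                   # eigenvectors = spatial patterns
    pcs = U[:, :n_modes] * S[:n_modes]    # PC time series (pattern amplitudes)
    var_frac = (S**2 / np.sum(S**2))[:n_modes]
    return eofs, pcs, var_frac
```
</li>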
</ul></li>
</ul>
</section>
<section id="multivariable-geostatistics" class="level2">
<h2 class="anchored" data-anchor-id="multivariable-geostatistics">Multivariable Geostatistics</h2>
<p><span class="math display">\[
\begin{align}
Z_1(s) &= X_1\beta_1 + e_1(s) \\
\vdots \\
Z_n(s) &= X_n \beta_n + e_n(s)
\end{align}
\]</span></p>
<ul>
<li><p><strong>Cross-Variogram</strong> - For each pair of residual variables, it describes the covariance of <span class="math inline">\(e_i(s)\)</span> and <span class="math inline">\(e_j(s+h)\)</span>.</p>
<ul>
<li>A non-zero cross-variance indicates that <span class="math inline">\(e_j(s+h)\)</span> may help predict (or simulate) <span class="math inline">\(e_i(s)\)</span>; this is especially true if <span class="math inline">\(Z_j(s)\)</span> is more densely sampled than <span class="math inline">\(Z_i(s)\)</span>.</li>
</ul></li>
<li><p><strong>Cokriging</strong> and <strong>Cosimulation</strong> are the multivariable versions of kriging and simulation.</p></li>
<li><p>Examples:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(gstat)</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="fu">demo</span>(cokriging)</span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="fu">demo</span>(cosimulation)</span></code></pre></div><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></div></li>
</ul>
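<p>The cross-variogram can also be estimated empirically; a hedged numpy sketch on 1-D coordinates (a hypothetical helper, not the gstat implementation):</p>

```python
import numpy as np

def cross_variogram(coords, ei, ej, bins):
    """Empirical cross-variogram: for each distance bin, average
    0.5 * (e_i(s) - e_i(s')) * (e_j(s) - e_j(s')) over point pairs."""
    n = len(coords)
    d, g = [], []
    for a in range(n):
        for b in range(a + 1, n):
            d.append(abs(coords[a] - coords[b]))
            g.append(0.5 * (ei[a] - ei[b]) * (ej[a] - ej[b]))
    d, g = np.asarray(d), np.asarray(g)
    which = np.digitize(d, bins)
    # mean cross-increment per lag bin (NaN for empty bins)
    return np.array([g[which == k].mean() if np.any(which == k) else np.nan
                     for k in range(1, len(bins))])
```

<p>With <code>ei = ej</code> this reduces to the ordinary semivariogram, which is nonnegative by construction.</p>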
</section>
<section id="structural-equation-modeling-sem" class="level2">
<h2 class="anchored" data-anchor-id="structural-equation-modeling-sem">Structural Equation Modeling (SEM)</h2>
<ul>
<li>Packages
<ul>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/influence.SEM/index.html" style="color: #990000">influence.SEM</a><span style="color: #990000">}</span> - A set of tools for evaluating several measures of case influence for structural equation models</li>
</ul></li>
<li>“Certain datasets lead to inadmissible solutions in structural equation modeling (Paxton, Curran, Bollen, Kirby, & Chen, 2001)” (<a href="https://arxiv.org/abs/2509.11741">source</a>)</li>
</ul>
</section>
<section id="drawdown-implied-correlation-dic" class="level2">
<h2 class="anchored" data-anchor-id="drawdown-implied-correlation-dic">Drawdown Implied Correlation (DIC)</h2>
<ul>
<li><p>Notes from</p>
<ul>
<li><a href="https://cssanalytics.wordpress.com/2024/12/23/drawdown-implied-correlations-part-1/">Drawdown Implied Correlations (Part 1)</a></li>
<li><a href="https://cssanalytics.wordpress.com/2025/01/09/drawdown-implied-correlations-part-2-generalized-downside-implied-correlations/">Drawdown Implied Correlations Part 2: Generalized Downside Implied Correlations</a></li>
<li><a href="https://cssanalytics.wordpress.com/2025/01/21/iterative-psd-shrinkage-ips/">Iterative PSD Shrinkage (IPS)</a></li>
</ul></li>
<li><p>Formula</p>
<p><span class="math display">\[
\text{DIC} = \frac{4\text{MDD}_{AB}^2-\text{MDD}_{A}^2-\text{MDD}_{B}^2}{2\cdot \text{MDD}_{A} \cdot \text{MDD}_{B}}
\]</span></p>
<ul>
<li><span class="math inline">\(\text{MDD}\)</span> is the Maximum Drawdown</li>
<li><span class="math inline">\(A\)</span> is an asset, <span class="math inline">\(B\)</span> is an asset and <span class="math inline">\(AB\)</span> is a portfolio with both <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span></li>
</ul></li>
<li><p>Asset returns are not Normal — they have “fat tails”</p></li>
<li><p>Two assets can lose money simultaneously, even while maintaining a negative (Pearson) correlation, leaving the portfolio exposed to significant losses.</p>
<ul>
<li>We typically rely on a dynamic or rolling correlation to measure diversification</li>
</ul></li>
<li><p>In contrast to correlations and volatility, drawdowns are nonlinear and path-dependent, making them complementary for risk analysis.</p></li>
<li><p>For the DIC measure, you can certainly use drawdowns entirely within a lookback window to keep the measure mathematically consistent, but it is recommended that you use a much bigger window for the calculation to reduce noise.</p>
<ul>
<li>Regardless, using drawdowns from all-time highs will slightly change the final values in such a way that they can be more negative than -1, which is why you need to bound the DIC between -1 and 1 to provide a practical correlation measure.</li>
</ul></li>
<li><p>When constructing a portfolio with multiple assets, the portfolio’s drawdown series (the peak-to-trough losses of the combined portfolio) behaves differently than the individual drawdown series of the constituent assets. This difference arises from how the assets interact in a portfolio.</p>
<ul>
<li>The drawdown of the portfolio is not simply the sum or average of individual asset drawdowns; instead, it reflects the combined behavior of the assets as they interact over time. Two or more assets in the portfolio may experience drawdowns at different times or to different extents, and their drawdown implied correlations will directly influence how the portfolio’s total drawdown evolves.</li>
<li>If two assets experience drawdowns simultaneously, their joint drawdown will be greater than what you would expect from either asset alone, which leads to a measurement of high correlation.</li>
<li>If the portfolio drawdown is moderate to low compared to the individual assets’ drawdowns, this leads to a measurement of low correlation.</li>
<li>Therefore, the drawdown of the portfolio can reflect behavior and interactions between assets that individual asset drawdowns and returns cannot capture. This is the key reason correlating individual asset drawdowns will not fully explain the portfolio drawdowns.</li>
</ul></li>
<li><p>Single Reference Process</p>
<ol type="1">
<li>Drawdown Calculation for Each Asset:
<ul>
<li><p>For Asset A and Asset B, calculate the drawdowns from their respective all-time highs over a rolling window (e.g., 60 days).</p></li>
<li><p>For the joint time series (AB), calculate the combined drawdown from all-time highs over the same 60-day window.</p></li>
</ul></li>
<li>Find Maximum Drawdown for AB:
<ul>
<li><p>Identify the maximum drawdown for the joint time series (AB) over the 60-day rolling window.</p></li>
<li><p>Retrieve the corresponding drawdown values for Asset A and Asset B on the same day that the maximum drawdown for AB occurs.</p></li>
</ul></li>
<li>Compute the DIC:
<ul>
<li><p>Calculate the implied correlation between the drawdowns of A, B, and AB on the specific day.</p></li>
<li><p>This gives the DIC for the pair of assets based on the maximum drawdown for the joint time series (AB).</p></li>
</ul></li>
</ol></li>
<li><p>The DIC can be calculated using only one drawdown point, whereas you need a minimum of 3 data points to compute a correlation between drawdowns.</p></li>
<li><p>The “standard” version of the DIC uses the max drawdown over some window.</p></li>
<li><p>Note you can certainly use the top percentage of drawdowns, or drawdowns above a threshold, as well. But because we are only looking at maximum drawdowns with this variation, in order to create a rolling daily measurement I suggest a slight modification to the original calculation by using a “triple point” reference.</p></li>
<li><p>Triple Reference</p>
<ul>
<li><p>This means we are going to look at three reference points which represent the maximum drawdown for each asset and the portfolio. The purpose of a triple reference is to get as much information as possible from a shorter window and increase accuracy while reducing indicator volatility.</p></li>
<li><p>Calculating DIC at the point of maximum drawdown (individually) for A, B, and the joint series AB and averaging the three results.<br>
(need pic)</p></li>
<li><p>The maximum drawdown for AB could be influenced by an unusually strong movement in one asset, which might not reflect the risk dynamics between A and B themselves. By averaging the DIC from the three scenarios (max drawdown of AB, A, and B), you smooth out this potential bias and get a more robust measure of correlation.</p></li>
</ul></li>
<li><p>Triple Reference Process</p>
<ol type="1">
<li><p>Find the point of Max Drawdown for Asset A:</p>
<ul>
<li><p>Now, repeat the process but for Asset A as the reference. Find the maximum drawdown for A over the same 60-day rolling window.</p></li>
<li><p>Retrieve the corresponding drawdown values for Asset B and AB on the same day that the maximum drawdown for A occurs. Calculate the DIC using the exact same formula.</p></li>
</ul></li>
<li><p>Find the point of Max Drawdown for Asset B:</p>
<ul>
<li><p>Similarly, find the maximum drawdown for Asset B over the same 60-day window.</p></li>
<li><p>Retrieve the corresponding drawdown values for A and AB on the same day as the maximum drawdown for B. Calculate the DIC using the same formula.</p></li>
</ul></li>
<li><p>Calculate and Average DICs:</p>
<ul>
<li><p>You now have three DICs: one from the maximum drawdown for AB, one from the maximum drawdown for A, and one from the maximum drawdown for B.</p></li>
<li><p>The final DIC is the average of these three DICs, providing a comprehensive view of the correlation during drawdown periods for both individual assets and their joint performance.</p></li>
</ul></li>
</ol></li>
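<li><p>The three steps can be sketched in Python as follows (inputs are aligned drawdown series for A, B, and AB over the window; <code>dic_point</code> and <code>triple_reference_dic</code> are hypothetical names applying the DIC formula at single reference days):</p>

```python
import numpy as np

def dic_point(a, b, ab):
    """DIC formula at a single reference day, clipped to [-1, 1]."""
    raw = (4 * ab**2 - a**2 - b**2) / (2 * a * b)
    return float(np.clip(raw, -1.0, 1.0))

def triple_reference_dic(dd_a, dd_b, dd_ab):
    """Average the DIC evaluated at the max-drawdown day of AB, A, and B."""
    refs = [int(np.argmax(dd_ab)), int(np.argmax(dd_a)), int(np.argmax(dd_b))]
    return float(np.mean([dic_point(dd_a[t], dd_b[t], dd_ab[t]) for t in refs]))
```
</li>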
</ul>
</section>
<section id="dbt" class="level2">
<h2 class="anchored" data-anchor-id="dbt">DBT</h2>
<section id="dbt-expectations" class="level3">
<h3 class="anchored" data-anchor-id="dbt-expectations">dbt-expectations</h3>
<ul>
<li>Features
<ul>
<li>Free package</li>
<li>Integrates into already existing dbt project</li>
<li>Assertive testing</li>
</ul></li>
<li>Set-Up
<ul>
<li><p>Specify in packages.yml</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="fu">packages</span><span class="kw">:</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> </span><span class="fu">package</span><span class="kw">:</span><span class="at"> calogica/dbt_expectations</span></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">version</span><span class="kw">:</span><span class="at"> </span><span class="fl">0.10.4</span></span></code></pre></div><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></div></li>
<li><p>Run <code>dbt deps</code> to install</p></li>
</ul></li>
</ul>
<section id="tests" class="level4">
<h4 class="anchored" data-anchor-id="tests">Tests</h4>
<ul>
<li>Tests can be specified on a model, source, seed, or column in one of your YAML files</li>
<li>Source Data
<ul>
<li><p>Always apply your tests to the source data if possible</p></li>
<li><p>Because sources are defined in your YAML files, this is where you will want to write your tests</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="fu">sources</span><span class="kw">:</span></span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> company_customers</span></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">database</span><span class="kw">:</span><span class="at"> company</span></span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">schema</span><span class="kw">:</span><span class="at"> customers</span></span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">description</span><span class="kw">:</span><span class="at"> </span><span class="st">"Contains personal customer information for company"</span></span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">tables</span><span class="kw">:</span></span>
<span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> addresses</span></span>
<span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">description</span><span class="kw">:</span><span class="at"> </span><span class="st">"Customer addresses for the company"</span></span>
<span id="cb3-9"><a href="#cb3-9" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">tests</span><span class="kw">:</span></span>
<span id="cb3-10"><a href="#cb3-10" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> </span><span class="fu">dbt_expectations.expect_table_column_count_to_be_between</span><span class="kw">:</span></span>
<span id="cb3-11"><a href="#cb3-11" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">min_value</span><span class="kw">:</span><span class="at"> </span><span class="dv">1</span><span class="at"> </span></span>
<span id="cb3-12"><a href="#cb3-12" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">max_value</span><span class="kw">:</span><span class="at"> </span><span class="dv">10</span></span></code></pre></div><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></div>
<ul>
<li>Specify the test under one of the source’s tables and not the source itself</li>
<li>Checks that the <span class="var-text">customers.addresses</span> table has between 1 and 10 columns.</li>
</ul></li>
</ul></li>
<li>Models
<ul>
<li><p>Adding tests to your complex data models is great for ensuring your data is as expected <em>after</em> the transformation process.</p></li>
<li><p>Similar syntax to specifying in Source Data</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="fu">models</span><span class="kw">:</span></span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> stg_addresses</span></span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">description</span><span class="kw">:</span><span class="at"> </span><span class="st">"Customer addresses for the company"</span></span>
<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">tests</span><span class="kw">:</span></span>
<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> </span><span class="fu">dbt_expectations.expect_table_column_count_to_be_between</span><span class="kw">:</span></span>
<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">min_value</span><span class="kw">:</span><span class="at"> </span><span class="dv">1</span><span class="at"> </span></span>
<span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">max_value</span><span class="kw">:</span><span class="at"> </span><span class="dv">10</span></span></code></pre></div><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></div>
<ul>
<li>Tests on the addresses staging model</li>
</ul></li>
</ul></li>
<li>Column
<ul>
<li><p>Can only apply the test to one column</p></li>
<li><p>Preferable on source data</p></li>
<li><p>Make sure the column names of your sources and models are fully documented before implementing column tests.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="fu">models</span><span class="kw">:</span></span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> stg_addresses</span></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">description</span><span class="kw">:</span><span class="at"> </span><span class="st">"Customer addresses for the company"</span></span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">columns</span><span class="kw">:</span></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> address_id </span></span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">description</span><span class="kw">:</span><span class="at"> </span><span class="st">"The primary key of this table"</span></span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">tests</span><span class="kw">:</span></span>
<span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> dbt_expectations.expect_column_to_exist</span></span></code></pre></div><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></div>
<ul>
<li>Tests whether <span class="var-text">address_id</span> exists as a column within the model</li>
</ul></li>
</ul></li>
<li>Check for Recent Data
<ul>
<li><p>Applies to one column only</p></li>
<li><p>Recommendation: Make this interval 3 days max, so you can catch the issue at the source fairly quickly</p></li>
<li><p>Use Case: Fivetran shows data connector syncs working, but you aren’t sure</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="fu">tests</span><span class="kw">:</span></span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> </span><span class="fu">dbt_expectations.expect_row_values_to_have_recent_data</span><span class="kw">:</span></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">datepart</span><span class="kw">:</span><span class="at"> day</span></span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">interval</span><span class="kw">:</span><span class="at"> </span><span class="dv">3</span></span></code></pre></div><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></div>
<ul>
<li>If there is no new data within the last 3 days, the test will fail.</li>
</ul></li>
</ul></li>
<li>Compare Column Values
<ul>
<li><p>Applies to models, seeds, and sources</p></li>
<li><p>Tests whether the value in column A is greater than the value in column B.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="fu">tests</span><span class="kw">:</span></span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> </span><span class="fu">dbt_expectations.expect_column_pair_values_A_to_be_greater_than_B</span><span class="kw">:</span></span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">column_A</span><span class="kw">:</span><span class="at"> total_amount</span></span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">column_B</span><span class="kw">:</span><span class="at"> sub_total</span></span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">or_equal</span><span class="kw">:</span><span class="at"> </span><span class="ch">False</span></span></code></pre></div><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></div>
<ul>
<li>If the source contains high-quality data, the total should always be greater than the subtotal</li>
</ul></li>
</ul></li>
<li>Check Column Type
<ul>
<li><p>Use Case: Spreadsheet sources are particularly prone to data entry errors that change a column’s type</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="fu">tests</span><span class="kw">:</span></span>
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> </span><span class="fu">dbt_expectations.expect_column_values_to_be_of_type</span><span class="kw">:</span></span>
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">column_type</span><span class="kw">:</span><span class="at"> timestamp_ntz</span></span></code></pre></div><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></div>
<ul>
<li>This checks that the timestamp column is stored in the expected type.</li>
<li>Inconsistent timestamp types across models and data sources can cause issues with joins, etc.</li>
</ul></li>
</ul></li>
<li>Add Row Conditions
<ul>
<li><p>These can be added to other tests</p></li>
<li><p>A common condition is “id is not null”</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9"><pre class="sourceCode yaml code-with-copy"><code class="sourceCode yaml"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="fu">tests</span><span class="kw">:</span></span>
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="kw">-</span><span class="at"> </span><span class="fu">dbt_expectations.expect_column_values_to_be_in_set</span><span class="kw">:</span></span>
<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">value_set</span><span class="kw">:</span><span class="at"> </span><span class="kw">[</span><span class="st">'cat'</span><span class="kw">,</span><span class="st">'dog'</span><span class="kw">,</span><span class="st">'pig'</span><span class="kw">]</span></span>
<span id="cb9-4"><a href="#cb9-4" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">quote_values</span><span class="kw">:</span><span class="at"> </span><span class="ch">true</span><span class="at"> </span></span>
<span id="cb9-5"><a href="#cb9-5" aria-hidden="true" tabindex="-1"></a><span class="at"> </span><span class="fu">row_condition</span><span class="kw">:</span><span class="at"> </span><span class="st">"animal_id is not null"</span></span></code></pre></div><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></div>
<ul>
<li>This test only looks at the column values whose row does <em>not</em> have a null <span class="var-text">animal_id</span>.</li>
</ul></li>
</ul></li>
</ul>
</section>
</section>
</section>
<section id="functional-data-analysis-and-forecasting" class="level2">
<h2 class="anchored" data-anchor-id="functional-data-analysis-and-forecasting">Functional Data Analysis and Forecasting</h2>
<ul>
<li>Each observation in a functional dataset consists of a collection of points representing a continuous curve or surface over a compact domain (e.g., a fixed length of time or region of space).</li>
<li>Functional data provide detailed information about a continuous process</li>
<li>Apply a dimension reduction technique, such as fPCA, and use a subset of the transformed features as predictor variables
<ul>
<li>Only accounts for vertical functional variability, where <em>vertical</em> variability (also known as y or amplitude variability) is the variability in the height of functions</li>
</ul></li>
<li>Horizontal variability (also known as x or phase variability) is the variability in the location of peaks and valleys of the functions.</li>
<li>Requires enough data so that smoothing functions can accurately interpolate the curve.</li>
<li>Me
<ul>
<li>Seems like uber-nonlinear modeling. The coefficients and the predictors are smoothing functions. No flexibility on which terms get smoothers — all do. Also, the RHS is an integral.</li>
<li>Seems like this can be used for multivariate/group time series forecasting or as a feature reduction technique</li>
</ul></li>
<li>Packages
<ul>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/fastFGEE/index.html" style="color: #990000">fastFGEE</a><span style="color: #990000">}</span> - Fits functional generalized estimating equations for longitudinal functional outcomes and covariates using a one-step estimator that is fast even for large cluster sizes or large numbers of clusters</li>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/fdars/index.html" style="color: #990000">fdars</a><span style="color: #990000">}</span> - Written in Rust. Provides methods for functional data manipulation, depth computation, distance metrics, regression, and statistical testing. Supports both 1D functional data (curves) and 2D functional data (surfaces).</li>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/fkcentroids/index.html" style="color: #990000">fkcentroids</a><span style="color: #990000">}</span> - Functional K-Centroids Clustering Using Phase and Amplitude Components</li>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/hdftsa/index.html" style="color: #990000">hdftsa</a><span style="color: #990000">}</span> - High-Dimensional Functional Time Series Analysis</li>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/MFSD/index.html" style="color: #990000">MFSD</a><span style="color: #990000">}</span> - Analysis of multivariate functional spatial data, including spectral multivariate functional principal component analysis and related statistical procedures</li>
<li><span style="color: #990000">{</span><a href="https://mlr3fda.mlr-org.com/" style="color: #990000">mlr3fda</a><span style="color: #990000">}</span> - fda in mlr3</li>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/refund/index.html" style="color: #990000">refund</a><span style="color: #990000">}</span> - Methods for regression for functional data, including function-on-scalar, scalar-on-function, and function-on-function regression.</li>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/refundBayes/index.html" style="color: #990000">refundBayes</a><span style="color: #990000">}</span> - Bayesian regression with functional data, including regression with scalar, survival, or functional outcomes</li>
<li><span style="color: #990000">{</span><a href="https://astamm.github.io/roahd/" style="color: #990000">roahd</a><span style="color: #990000">}</span> - The Robust Analysis of High-dimensional Data package provides a set of statistical tools for the exploration and robustification of univariate and multivariate functional datasets through depth-based statistical methods.
<ul>
<li>Functions for generating functional data</li>
<li>Band depths and modified band depths,</li>
<li>Modified band depths for multivariate functional data,</li>
<li>Epigraph and hypograph indexes,</li>
<li>Spearman and Kendall’s correlation indexes for functional data,</li>
<li>Confidence intervals and tests on Spearman’s correlation coefficients for univariate and multivariate functional data.</li>
</ul></li>
<li><span style="color: #990000">{</span><a href="https://fbertran.github.io/SelectBoost.FDA/" style="color: #990000">SelectBoost.FDA</a><span style="color: #990000">}</span> - SelectBoost-Style Variable Selection for Functional Data Analysis</li>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/tidyfun/index.html" style="color: #990000">tidyfun</a><span style="color: #990000">}</span> - Represent, visualize, describe and wrangle functional data in tidy data frames</li>
<li><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/veesa/index.html" style="color: #990000">veesa</a><span style="color: #990000">}</span> (<a href="https://arxiv.org/abs/2501.07602">Paper</a>) - Pipeline for Explainable Machine Learning with Functional Data
<ul>
<li>Accounts for the vertical and horizontal variability in the functional data</li>
<li>Provides an explanation in the original data space of how the model uses variability in the functional data for prediction</li>
</ul></li>
</ul></li>
<li>Papers
<ul>
<li><a href="https://arxiv.org/abs/2508.12000">Penalized Spline M-Estimators for Discretely Sampled Functional Data: Existence and Asymptotics</a></li>
</ul></li>
<li>Types of Functional Data
<ul>
<li>age-specific mortality rates (Shang et al., 2024)
<ul>
<li><a href="https://arxiv.org/abs/2411.12423">Nonstationary functional time series forecasting</a></li>
<li><a href="https://arxiv.org/abs/2305.19749">Forecasting high-dimensional functional time series: Application to sub-national age-specific mortality</a>
<ul>
<li>age- and sex-specific mortality rates in the United States, France, and Japan, in which there are 51 states, 95 departments, and 47 prefectures</li>
</ul></li>
</ul></li>
<li>heights of children measured over time (Ramsay and Silverman, 2005)</li>
<li>silhouettes of animals extracted from images (Srivastava and Klassen, 2016)</li>
<li>glucose monitoring (Danne et al., 2017)</li>
<li>fitness tracking (Henriksen et al., 2018)</li>
<li>environmental sensors (Butts-Wilmsmeyer, Rapp and Guthrie, 2020)</li>
</ul></li>
</ul>
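<p>As a rough sketch of the fPCA-style dimension reduction mentioned above (the function, names, and toy data here are illustrative, not drawn from any of the listed packages), functional PCA on curves discretized over a common grid reduces each curve to a handful of scores; note this captures only vertical (amplitude) variability:</p>

```python
import numpy as np

def fpca(curves, n_components=2):
    """Basic functional PCA on curves sampled over a common grid.

    curves: (n_curves, n_points) array. Returns the mean curve, the
    leading eigenfunctions (principal component curves), and per-curve
    scores -- the reduced features usable as predictor variables.
    """
    mean = curves.mean(axis=0)
    centered = curves - mean
    # SVD of the centered data matrix; rows of Vt are the eigenfunctions
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    eigenfunctions = Vt[:n_components]
    scores = centered @ eigenfunctions.T
    return mean, eigenfunctions, scores

# Toy example: curves differing only in amplitude are rank-1 after centering
t = np.linspace(0, 2 * np.pi, 50)
curves = np.array([a * np.sin(t) for a in range(5)], dtype=float)
mean, eigenfunctions, scores = fpca(curves, n_components=1)
```

<p>In practice the raw observations would first be smoothed (e.g. with a basis expansion) so the grid values accurately interpolate each curve, and the scores would then be fed to a downstream model as features.</p>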
</section>
<section id="gaussian-processes" class="level2">
<h2 class="anchored" data-anchor-id="gaussian-processes">Gaussian Processes</h2>
<ul>
<li>Notes from
<ul>
<li><a href="https://arxiv.org/abs/2411.05869">Compactly-supported nonstationary kernels for computing exact Gaussian processes on big data</a>
<ul>
<li>alternative kernel that can discover and encode both sparsity and nonstationarity</li>
</ul></li>
<li><a href="https://juanxie19.github.io/posts/brisc/">Reading Notes on BRISC: Bootstrap for Rapid Inference on Spatial Covariances</a></li>
</ul></li>
<li>Packages
<ul>
<li><span style="color: #990000">{</span>bigGP<span style="color: #990000">}</span> - Implements parallel linear algebra operations using threading and message-passing, which is useful for kriging and Gaussian process regression</li>
<li><span style="color: #990000">{</span>laGP<span style="color: #990000">}</span> - Implements local approximate Gaussian process regression for large-scale modeling and sparse computation with massive data sets.</li>
</ul></li>
<li>A preeminent framework for stochastic function approximation, statistical modeling of real-world measurements, and non-parametric and nonlinear regression within machine learning (ML) and surrogate modeling.</li>
<li>Unlike many other machine learning methods, GPs include an implicit characterization of uncertainty</li>
<li>Traditional implementations of GPs involve stationary kernels (also termed covariance functions) that limit their flexibility and exact methods for inference that prevent application to data sets with more than about ten thousand points. (paper and its packages fix this)
<ul>
<li>Other methods to fix this are generally difficult to implement for large data sets due to their large numbers of hyperparameters, which lead to overfitting and the need for specialized training algorithms</li>
</ul></li>
<li>In regression and density estimation, Gaussian processes have been widely used as nonparametric priors for unknown random functions.</li>
<li>Reasons for Popularity
<ul>
<li><strong>Analytical Tractability</strong>:
<ul>
<li><p>GPs provide a <strong>closed-form solution</strong> for many problems, making them analytically tractable.</p></li>
<li><p>For example, the posterior distribution of a GP can be derived explicitly, allowing for exact inference in many cases.</p></li>
</ul></li>
<li><strong>Marginal and Conditional Distributions</strong>:
<ul>
<li><p>Any <strong>marginal distribution</strong> of a GP is also Gaussian. This means that if you take a subset of the random variables in a GP, their joint distribution remains Gaussian.</p></li>
<li><p>Similarly, the <strong>conditional distribution</strong> of a GP is Gaussian. This property is particularly useful for making predictions at new locations, as the conditional distribution can be computed analytically.</p></li>
</ul></li>
<li><strong>Flexibility in Modeling</strong>:
<ul>
<li><p>GPs can model complex, <strong>non-linear relationships</strong> by choosing an appropriate covariance (kernel) function.</p></li>
<li><p>Common kernel functions include the <strong>Radial Basis Function (RBF)</strong>, <strong>Matérn</strong>, and <strong>Exponential kernels</strong>, each of which captures different types of relationships in the data.</p></li>
</ul></li>
<li><strong>Probabilistic Predictions</strong>:
<ul>
<li>GPs provide <strong>uncertainty estimates</strong> along with predictions. This is crucial for decision-making in applications like Bayesian optimization, where understanding the uncertainty is as important as the prediction itself.</li>
</ul></li>
<li><strong>Applications</strong>:
<ul>
<li><p>GPs are widely used in <strong>geostatistics</strong> (e.g., kriging), <strong>machine learning</strong> (e.g., regression, classification), and <strong>Bayesian optimization</strong> (e.g., hyperparameter tuning).</p></li>
<li><p>They are also used in <strong>time series analysis</strong>, <strong>robotics</strong>, and <strong>environmental modeling</strong>.</p></li>
</ul></li>
<li><strong>Kernel Design</strong>:
<ul>
<li><p>The choice of kernel function allows GPs to capture a wide range of behaviors, such as periodicity, smoothness, and trends.</p></li>
<li><p>Kernels can also be combined or adapted to create more complex models.</p></li>
</ul></li>
<li><strong>Interpretability</strong>:
<ul>
<li>The parameters of the kernel function often have intuitive interpretations, such as length scales or variance, making GPs more interpretable than some other machine learning models.</li>
</ul></li>
</ul></li>
<li>Despite their many advantages, GPs face a significant computational bottleneck: the need to invert the covariance matrix <span class="math inline">\(K(s, s')\)</span>, which has <span class="math inline">\(O(n^3)\)</span> complexity (where <span class="math inline">\(n\)</span> is the number of observations). For large datasets, this becomes infeasible, limiting the scalability of traditional GPs.</li>
<li>To address the computational challenges of traditional GPs, Nearest Neighbor Gaussian Process (NNGP) has been developed. NNGP approximates the full GP by limiting dependencies between data points to a small subset of nearest neighbors. This reduces the computational complexity while retaining the key properties of GPs, making it a scalable alternative for large datasets.</li>
</ul>
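<p>A minimal sketch of exact GP regression with an RBF kernel (hyperparameters and data are illustrative): the posterior mean and variance come in closed form from the Gaussian conditional distribution, and the Cholesky factorization of the training covariance is the O(n^3) step that motivates approximations such as NNGP:</p>

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """Squared-exponential (RBF) covariance between two sets of 1-D inputs."""
    sqdist = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / length_scale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    """Exact GP posterior mean and variance at X_test.

    The Cholesky factorization of the n x n training covariance is the
    O(n^3) step that limits exact GPs to roughly 10^4 points.
    """
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    L = np.linalg.cholesky(K)  # O(n^3) bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(rbf_kernel(X_test, X_test)) - np.sum(v**2, axis=0)
    return mean, var

X = np.linspace(0, 5, 20)
y = np.sin(X)
mean, var = gp_posterior(X, y, np.array([2.5]))
```

<p>NNGP sidesteps this factorization by conditioning each point only on a small set of nearest neighbors, which is what makes it scalable to large datasets.</p>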
</section>
<section id="wavelets" class="level2">
<h2 class="anchored" data-anchor-id="wavelets">Wavelets</h2>
<ul>
<li><p>Notes from</p>
<ul>
<li><a href="https://arxiv.org/abs/2406.05012">TrendLSW: Trend and Spectral Estimation of Nonstationary Time Series in R</a></li>
<li>ChatGPT
<ul>
<li><a href="https://chatgpt.com/c/672cd99f-7480-8002-80cd-39263f52950b">My link</a></li>
<li><a href="https://chatgpt.com/share/672cddd9-ff2c-8002-a105-be470cc41dc2">Public Link</a></li>
</ul></li>
</ul></li>
<li><p>Packages</p>
<ul>
<li><p><span style="color: #990000">{</span><a href="https://cran.r-project.org/web/packages/DWaveNARDL/index.html" style="color: #990000">DWaveNARDL</a><span style="color: #990000">}</span> - Dual Wavelet Based NARDL Model</p>
<ul>
<li><p>Nonlinear Autoregressive Distributed Lag model for noisy time series analysis</p></li>
<li><p>Designed to capture both short-run and long-run relationships</p></li>
<li><p>Useful for analyzing economic and financial time series data that exhibit both long-term trends and short-term fluctuations</p></li>
</ul></li>
</ul></li>
<li><p>Wavelets are particularly suited for analyzing nonstationary time series because they can capture both time and frequency information.</p></li>
<li><p>Spectral estimation</p>
<ul>
<li>Spectral estimation is a technique used to analyze the frequency content of time series data, particularly focusing on how the variance (or power) is distributed across different frequencies. This is especially useful in nonstationary time series, where statistical properties, such as the mean and variance, change over time.</li>
<li>In classical settings, spectral estimation often involves the Fourier transform, where stationary processes are assumed. For nonstationary processes, wavelet-based methods are popular.</li>
<li>Usage
<ul>
<li>Identifying underlying periodicities, understanding the evolution of variance across different time periods, and detecting anomalies or regime shifts in time series.</li>
<li>If the time series contains periodic behavior (such as seasonal patterns), the spectral plot will show high power at the corresponding frequency</li>
<li>In an EWS plot, you may see that a certain frequency band has high power only during certain time intervals, indicating a time-localized periodicity.</li>
<li>A flat or consistent spectral plot indicates stationarity, while time-varying plots (especially with wavelet-based approaches) show how different scales contribute at different times, revealing nonstationarity. An increasing or decreasing trend in a particular frequency band over time might indicate a nonstationary process.</li>
<li>Sudden spikes or drops in spectral power at specific times and frequencies could indicate anomalies, abrupt changes, or unusual behavior in the time series.
<ul>
<li>A sudden burst (i.e. transient, not consistent) of power at low scales (high frequency) may indicate an abrupt event, such as a machine breakdown in industrial monitoring data</li>
<li>If the burst occurs at high scales (low frequencies), it may indicate a sudden, large-scale trend change, such as a long-term shift or event.</li>
</ul></li>
<li>In economic time series, higher scales (low frequencies) might represent long-term economic cycles, while lower scales (high frequencies) might correspond to short-term market volatility.</li>
<li>Analysts can detect sub-seasonal or irregular cycles that might not be immediately obvious.</li>
<li>If the model’s spectral plot aligns with the observed data’s spectral plot, this indicates a good fit. Discrepancies in power or patterns suggest areas where the model might need improvement.
<ul>
<li>e.g. After fitting a time series model, an analyst might generate a simulated spectral plot and compare it with the real data to see if the model captures both the trends and the variabilities at different scales.</li>
</ul></li>
<li>Time-frequency or time-scale plots provide a localized view of how spectral properties change, enabling analysts to detect events or features that occur intermittently.
<ul>
<li>e.g. In seismic data, an EWS plot could reveal bursts of energy at different scales corresponding to earthquake tremors and aftershocks.</li>
</ul></li>
<li>High power at certain frequencies suggests strong periodicity or correlation at corresponding time lags.
<ul>
<li>e.g. In biological signals, changes in autocorrelation can be related to transitions between different physiological states (e.g., sleep stages).</li>
</ul></li>
</ul></li>
</ul></li>
<li><p>Scales</p>
<ul>
<li>Scales refer to the different levels of resolution at which the data is analyzed. These scales are analogous to frequencies in Fourier analysis but provide a time-localized view of how signal components of different frequencies evolve over time.</li>
<li>Each scale <span class="math inline">\(j\)</span> corresponds to a specific level of detail, and the evolutionary wavelet spectrum <span class="math inline">\(S_j(z)\)</span> estimates the power at scale <span class="math inline">\(j\)</span> over time <span class="math inline">\(z\)</span>. This allows for the identification of how variance at different frequencies changes across time, which is crucial for analyzing nonstationary time series.</li>
<li>The wavelet function at a particular scale acts like a band-pass filter, capturing specific ranges of frequencies. Lower scales capture higher-frequency details (short-term fluctuations), while higher scales capture lower-frequency trends (long-term patterns).</li>
<li>At higher scales, the wavelet stretches over a longer portion of the time series, capturing slower, low-frequency changes. At lower scales, the wavelet is more compressed, capturing faster, high-frequency changes. Unlike Fourier transforms, which assume a fixed frequency range, wavelets allow for more adaptive, localized analysis.</li>
<li><strong>Low Scales (High Frequencies)</strong>: Capture fine, fast-varying details, like noise or short-term oscillations.</li>
<li><p><strong>High Scales (Low Frequencies)</strong>: Capture broad, slow-varying features, such as long-term trends or cycles.</p></li>
<li><p>Scales to Periodicity</p>
<ul>
<li><p>Frequency Relationship<br>
<span class="math display">\[
f_j = \frac{k}{2^j \Delta t}
\]</span></p>
<ul>
<li><p><span class="math inline">\(f_j\)</span> : Frequency at scale <span class="math inline">\(j\)</span></p></li>
<li><p><span class="math inline">\(\Delta t\)</span> : Time step of your data (e.g. 1 day for daily, 1 sec for seconds, etc.)</p></li>
<li><p><span class="math inline">\(k\)</span> : Constant that depends on the choice of wavelet</p>
<ul>
<li>e.g. Morlet wavelet, <span class="math inline">\(k \approx 1.03\)</span></li>
</ul></li>
</ul></li>
<li><p>Scale to Periodicity<br>
<span class="math display">\[
\begin{align}
P_j &= \frac{1}{f_j} \\
&\approx \frac{2^j \Delta t}{k}
\end{align}
\]</span></p></li>
<li><p><span class="ribbon-highlight">Example</span>: 1 sec data with power detected at <span class="math inline">\(S_5\)</span> (scale 5) using a Morlet wavelet<br>
<span class="math display">\[
P_5 \approx \frac{2^5 \times 1}{1.03} \approx 31.07 \;\mbox{sec}
\]</span></p>
<ul>
<li>If the high power persists over time in your wavelet spectrum (such as in an Evolutionary Wavelet Spectrum or Scalogram plot), it suggests a consistent periodic signal at that time scale.</li>
<li>If the high power is transient, it indicates that the periodic behavior is localized in time, occurring only during certain time intervals.</li>
</ul></li>
</ul></li>
</ul></li>
<li><p>Trend Estimation</p>
<ul>
<li>Wavelets with more vanishing moments can handle smoother and more complex polynomial trends.</li>
<li><p>Wavelets with fewer vanishing moments are better at capturing sharp, localized features but may not handle smooth trends as well.</p></li>
<li><p>Example: A wavelet with 4 vanishing moments means it can remove cubic trends from the data.</p></li>
<li><p>Choosing the number of vanishing moments</p>
<ul>
<li>For smooth trends: If the data is expected to have a smooth, slowly varying trend, you should choose a wavelet with a higher number of vanishing moments (e.g., 4 or more). This allows the wavelet to annihilate polynomial trends up to a higher degree and isolate the trend from noise or high-frequency components.</li>
<li>For sharp changes or noise: If the data contains sharp changes or you’re more interested in detecting localized features, wavelets with fewer vanishing moments (e.g., 1 or 2) may be more appropriate.</li>
</ul></li>
</ul></li>
<li><p>Wavelet Types</p>
<ul>
<li><p><strong>Phase distortion</strong> occurs when the phase of different frequency components of a signal is altered unevenly, leading to a misalignment in the reconstructed signal (i.e. shifts in position). Important for trend estimation. The more symmetric a wavelet is, the less phase distortion it introduces</p>
<ul>
<li>The <strong>phase</strong> describes the position of the waveform relative to a reference point in time. For example, in a sine wave, the phase tells us where the peaks and troughs of the wave occur.</li>
<li>When a signal is passed through a filter or transformation (such as a wavelet or Fourier transform), each frequency component might experience a shift in its phase and by different amounts which results in the distortion.</li>
</ul></li>
<li><p>Wavelets with strong time localization allow them to detect sharp, abrupt changes</p></li>
<li><p>Guidelines</p>
<ul>
<li>Daubechies EP Wavelets: Choose for <em>sharp, localized features</em> (e.g. spike, discontinuities) or if phase shifts are not a major concern. They are ideal for compression or denoising while preserving features like discontinuities and detecting abrupt changes but may introduce phase distortion and boundary effects.
<ul>
<li>e.g. seismic or financial data</li>
</ul></li>
<li><p>Daubechies LA Wavelets: Choose for <em>smooth trends</em> and when minimizing phase distortion is important. They are particularly effective for trend estimation and nonstationary data with long-term, smooth features.</p>
<ul>
<li>e.g. temperature changes, economic growth</li>
</ul></li>
</ul></li>
</ul></li>
</ul>
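<p>The scale-to-periodicity conversion above is straightforward to script; a small helper, assuming the Morlet constant k ≈ 1.03:</p>

```python
def scale_to_period(j, dt=1.0, k=1.03):
    """Approximate period P_j = 2**j * dt / k at wavelet scale j.

    dt is the sampling step of the data; k depends on the chosen
    wavelet (about 1.03 for the Morlet wavelet, per the formula above).
    """
    return (2 ** j) * dt / k

# Scale 5 with 1-second sampling, Morlet wavelet
period = scale_to_period(5)  # approximately 31.07 seconds
```

<p>Each step up in scale doubles the period, matching the intuition that the wavelet stretches by a factor of two at each higher scale.</p>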
</section>
<section id="ranking-algo" class="level2">
<h2 class="anchored" data-anchor-id="ranking-algo">Ranking Algo</h2>
<ul>
<li>Notes from <a href="https://arxiv.org/abs/2408.10558">Multi-Attribute Preferences: A Transfer Learning Approach</a></li>
<li>Preference data are typically elicited from individuals, whether as pairwise comparisons, partial rankings, or click-through data, and are aggregated into a single coherent ranking that best reflects these preferences</li>
<li>Use Cases
<ul>
<li>data consisting of hotel rankings, where consumers rank various attributes of hotels such as breakfast, hygiene, price, quality of service, but also their overall satisfaction with the hotel</li>
<li>different types of food that are ranked on various properties, such as different aspects of taste, smell, visual aspects, but also their overall ranking</li>
</ul></li>
<li>primary attribute - the main attribute of interest
<ul>
<li>Typically the overall preference or satisfaction, but not necessarily</li>
</ul></li>
<li>secondary attributes - The other attributes on which the objects are evaluated</li>
<li>jointly learning tasks
<ul>
<li>multi-task learning - concerns the improvement of multiple related learning tasks by borrowing relevant information among these tasks and therefore coincides with existing methods that aim to model multi-attribute preference data</li>
</ul></li>
<li>learning a single task
<ul>
<li>transfer learning - aims to optimise the efficiency of learning a single task by utilising relevant information from other tasks</li>
<li>the single task of interest is called the target, whilst the other tasks are sources, forming a parallel to the primary and secondary attributes</li>
<li>only the Gaussian graphical model and the Gaussian mixture model have been enriched by the transfer learning framework.</li>
</ul></li>
<li>Paper goals
<ul>
<li>Using Bradley-Terry and its generalization, the Plackett-Luce model, to improve inference on parameters underlying a primary attribute by utilising information contained in the secondary attributes
<ul>
<li>Models frequently used in pairwise comparison data</li>
<li>The method is then incorporated into the transfer learning framework and extended, resulting in algorithms that generate estimates for the primary attribute with and without a known set of informative secondary attributes</li>
</ul></li>
<li>Since typically only a subset of the secondary attributes is useful when estimating the primary attribute parameters, the authors adapt the framework proposed by Tian and Feng, introducing an algorithm that can effectively infer the set of informative secondary attributes</li>
<li>Bradley-Terry<br>
<span class="math display">\[
\begin{aligned}
&P(o_j > o_l) = \frac{e^{\alpha_j}}{e^{\alpha_j} + e^{\alpha_l}} \\
&\text{where} \; 1 \le j \ne l \le M
\end{aligned}
\]</span>
<ul>
<li>Each individual <span class="math inline">\(i\)</span> assigns their preference for one object <span class="math inline">\(j\)</span> over another object <span class="math inline">\(l\)</span> from a total pool of <span class="math inline">\(M\)</span> objects.</li>
<li>Assumes that underlying each object there exists some worth <span class="math inline">\(\alpha\)</span> that relates to its probability of being preferred over other objects.</li>
<li>These pairwise comparisons can be presented by an undirected graph <span class="math inline">\(\mathcal{G} =(\mathcal{V}, \mathcal{E})\)</span>, with vertices <span class="math inline">\(\mathcal{V} = \{1, \ldots, M\}\)</span> and edge set <span class="math inline">\(\mathcal{E}\)</span> that has the property that <span class="math inline">\((j, l) \in \mathcal{E}\)</span> if and only if objects <span class="math inline">\(j\)</span> and <span class="math inline">\(l\)</span> are compared at least once in the data. The following conditions are postulated for the pairwise comparison graph.</li>
</ul></li>
<li>Assumptions</li>
<li>Data
<ul>
<li>partial ranking: <span class="math inline">\(\{o_1 \gt \cdots \gt o_m\}\)</span></li>
<li>pairwise comparisons: <span class="math inline">\(\bigcap_{1 \le j \ne l \le M} \: \{o_j \gt o_l\}\)</span></li>
</ul></li>
</ul></li>
</ul>
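<p>As a sketch, the Bradley-Terry probability above plus a standard MM-style fit of the worth parameters (an illustrative single-attribute fit with toy data, not the paper's transfer learning estimator; it assumes every object wins at least one comparison):</p>

```python
import numpy as np

def bt_prob(alpha_j, alpha_l):
    """Bradley-Terry: P(o_j > o_l) = e^{a_j} / (e^{a_j} + e^{a_l})."""
    return np.exp(alpha_j) / (np.exp(alpha_j) + np.exp(alpha_l))

def fit_bt(wins, n_iter=500):
    """Fit worth parameters by minorization-maximization (MM) updates.

    wins[j, l] = number of times object j was preferred over object l.
    Returns log-worths alpha (identified only up to an additive constant).
    Assumes every object has at least one win.
    """
    M = wins.shape[0]
    w = np.ones(M)         # worths on the exponential scale
    games = wins + wins.T  # comparisons per pair
    for _ in range(n_iter):
        for j in range(M):
            den = sum(games[j, l] / (w[j] + w[l]) for l in range(M) if l != j)
            w[j] = wins[j].sum() / den
        w /= w.sum()       # fix the scale
    return np.log(w)

# Toy data: object 0 preferred over object 1 in 8 of 10 comparisons
alpha = fit_bt(np.array([[0.0, 8.0], [2.0, 0.0]]))
```

<p>With two objects, the fitted preference probability converges to the empirical win proportion, which is the Bradley-Terry maximum likelihood estimate.</p>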
</section>
<section id="trend-followingmomentum" class="level2">
<h2 class="anchored" data-anchor-id="trend-followingmomentum">Trend Following/Momentum</h2>
<ul>
<li>Notes from <a href="https://arxiv.org/abs/2407.13685">Beyond Trend Following: Deep Learning for Market Trend Prediction</a></li>
<li>Read
<ul>
<li><a href="https://quantpedia.com/designing-robust-trend-following-system/">Designing Robust Trend-Following System</a></li>
<li><a href="https://alphaarchitect.com/2024/12/portfolio-efficiency/">Diversifying Trend Following Strategies Improves Portfolio Efficiency</a>
<ul>
<li>Diversify across CTAs</li>
<li>It’s a tail hedge, so there will be long periods of loss — i.e. not for faint of heart</li>
<li>Improve the efficiency of their portfolio by also adding allocations to the other uncorrelated strategies</li>
</ul></li>
<li><a href="https://quantpedia.com/exploration-of-cta-momentum-strategies-using-etfs/">Exploration of CTA Momentum Strategies Using ETFs</a></li>
</ul></li>
<li>Trend following
<ul>
<li>Trend following or trend trading is an investment strategy based on the expectation of price movements to continue in the same direction: buy an asset when its price goes up, sell it when its price goes down.
<ul>
<li>There is no single criterion for detecting when prices move in a particular direction over time; every investor uses their own.</li>
</ul></li>
<li>Traditional trend following is usually done on futures. Just follow trends on a large, diversified set of futures markets, covering major asset classes.
<ul>
<li>Diversification is key: with multiple assets with low or negative correlations, you can achieve higher returns at a lower risk.</li>
</ul></li>
<li>Trend following on stocks can easily yield negative returns on the short side (when prices go down). When trading only on the long side, it does not always add real value.
<ul>
<li>Standard trend following is not expected to work with stocks, since their correlation is too high.</li>
</ul></li>
<li>Compared with a passive index ETF, trend following requires additional work and creates potential risks, yet it does not always yield actual benefits.
<ul>
<li>Trend following on single stocks, or a few of them, however, is not attractive for the risk you have to assume.</li>
</ul></li>
<li>Bear regime strategy (Meb Faber on <a href="https://www.bloomberg.com/news/articles/2024-10-18/meb-faber-on-why-prudent-investors-keep-getting-punished?srnd=oddlots">Odd Lots</a>)</li>
</ul></li>
<li>momentum investing
<ul>
<li>When a stock price goes up for a while, the likelihood of rising higher is greater than the likelihood of falling. Likewise, a stock going up faster than other stocks is likely to keep going up faster than other stocks.</li>
<li>One explanation is that people who buy past winners and sell past losers temporarily move prices. An alternative explanation is that the market underreacts to information on short-term prospects but overreacts to information on long-term prospects.</li>
<li>Andreas Clenow employs the following trading rules on a weekly basis:
<ul>
<li>A. F. Clenow, <em>Stocks on the Move: Beating the Market with Hedge Fund Momentum Strategies</em>. Equilateral Publishing, 2015.</li>
<li>rank stocks on volatility-adjusted momentum (using an exponential 90-day regression, multiplied by its coefficient of determination),</li>
<li>calculate position sizes (targeting a daily move of 10 basis points),</li>
<li>check the index filter (S&P 500 above its 200-day moving average), and build your portfolio.</li>
<li>Individual stocks are disqualified when they are below their 100-day moving average or have experienced a gap over 15%.</li>
<li>When, in the weekly portfolio rebalancing, a stock is no longer in the top 20% of the S&P 500 ranking or fails to meet the qualification criteria (moving average and gap), it is sold. It is replaced by other stocks only if the index is in a positive trend. Twice per month, position sizes are also rebalanced to control risk.</li>
</ul></li>
</ul></li>
</ul>
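The volatility-adjusted momentum ranking described above can be sketched as follows. This is an illustrative Python approximation (the 90-day exponential regression is taken here as an OLS fit of log prices on time, whose slope is annualized and multiplied by the R²), not the book's exact code:

```python
import math

def momentum_score(prices, window=90):
    """Volatility-adjusted momentum in the spirit of Clenow's rule:
    OLS slope of log(price) on time over `window` days, annualized,
    multiplied by the regression R^2. Illustrative stand-in only."""
    y = [math.log(p) for p in prices[-window:]]
    n = len(y)
    xbar, ybar = (n - 1) / 2, sum(y) / n
    sxx = sum((x - xbar) ** 2 for x in range(n))
    sxy = sum((x - xbar) * (yi - ybar) for x, yi in enumerate(y))
    slope = sxy / sxx
    sst = sum((yi - ybar) ** 2 for yi in y)
    r2 = slope * sxy / sst if sst > 0 else 0.0   # explained / total variance
    annualized = math.exp(252 * slope) - 1       # daily log-slope -> annual rate
    return annualized * r2

# a clean exponential uptrend gets a positive score
trend = [100 * 1.001 ** t for t in range(90)]
print(momentum_score(trend) > 0)  # True
```

Stocks would then be ranked by this score, with the index filter and disqualification rules applied on top.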
</section>
<section id="clusteringhierarchical-ts" class="level2">
<h2 class="anchored" data-anchor-id="clusteringhierarchical-ts">Clustering/Hierarchical TS</h2>
<ul>
<li>Notes from <a href="https://arxiv.org/html/2404.06064v1">Constructing hierarchical time series through clustering: Is there an optimal way for forecasting?</a>
<ul>
<li>Code: <a href="https://github.com/AngelPone/project_hierarchy" class="uri">https://github.com/AngelPone/project_hierarchy</a></li>
</ul></li>
<li>The models used to obtain base forecasts and the reconciliation method are fixed throughout the experiments</li>
<li>Reconciled forecasts should be coherent, that is, they respect the aggregation constraints implied by the hierarchical structure. Coherent forecasts facilitate aligned decisions by agents acting upon different variables within the hierarchy. For example, consider a retail setting, where a warehouse manager supplies stock to individual store managers within their region. Forecasts could be incoherent when the warehouse manager forecasts low total demand while store managers forecast high demand, leading to supply shortages.</li>
<li>Clustered different representations (the original time series, forecast errors, features of both), different distance metrics (Euclidean, dynamic time warping), and different clustering paradigms (k-medoids, hierarchical).
<ul>
<li>For features, they used 56 features from {tsfeatures}</li>
<li>in-sample one-step-ahead forecast error as a representation of the time series, since a key step in MinT reconciliation is to estimate the <span class="math inline">\(\boldsymbol{W}_h\)</span> matrix.
<ul>
<li>It is important to note that raw time series and in-sample error representations are standardized to eliminate the impact of scale variations.</li>
</ul></li>
</ul></li>
<li>rq1
<ul>
<li>The natural hierarchy outperforms the two-level hierarchy, and a data-driven hierarchy built via clustering can further improve forecast performance compared to the natural hierarchy.
<ul>
<li>“grouping” is the idea that certain subsets of series are chosen to form new middle-level series</li>
<li>“structure” of the hierarchy, includes the number of middle-level series, the depth of the hierarchy, and the distribution of group sizes in the middle layer(s).</li>
</ul></li>
<li>optimal clustering method depends on the dataset characteristics</li>
</ul></li>
<li>rq2
<ul>
<li>the driver of forecast improvement is the enlarged number of series in the hierarchy and/or its <strong>structure</strong>, rather than similarities between the time series (i.e. grouping).</li>
</ul></li>
<li>rq3
<ul>
<li>an equally-weighted combination of reconciled forecasts derived from multiple hierarchies improves forecast reconciliation performance</li>
<li>our approach averages not only different coherent forecasts, but also across hierarchies with completely different middle level series. This is possible since only coherent bottom and top level forecasts are averaged and evaluated.</li>
</ul></li>
<li>Section 2 describes the trace minimization (MinT) reconciliation method (implemented in, e.g., {hts})</li>
</ul>
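MinT reconciliation amounts to a generalized-least-squares projection of the base forecasts onto the coherent subspace, <span class="math inline">\(\tilde{\boldsymbol{y}} = \boldsymbol{S}(\boldsymbol{S}'\boldsymbol{W}_h^{-1}\boldsymbol{S})^{-1}\boldsymbol{S}'\boldsymbol{W}_h^{-1}\hat{\boldsymbol{y}}\)</span>. A pure-Python sketch for a toy total = A + B hierarchy with <span class="math inline">\(\boldsymbol{W}_h = \boldsymbol{I}\)</span> (the OLS special case), illustrative only:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hierarchy: total = A + B.  Summing matrix S maps bottom series to all series.
S = [[1, 1],   # total
     [1, 0],   # A
     [0, 1]]   # B

def reconcile_ols(y_hat):
    """Project incoherent base forecasts y_hat = (total, A, B) onto the
    coherent subspace: y_tilde = S (S'S)^{-1} S' y_hat, i.e. MinT with
    W_h = I (OLS reconciliation)."""
    St = transpose(S)
    P = matmul(inv2(matmul(St, S)), St)    # (S'S)^{-1} S'
    b = matmul(P, [[v] for v in y_hat])    # reconciled bottom-level forecasts
    return [row[0] for row in matmul(S, b)]

# incoherent base forecasts: total says 90 but the bottoms sum to 100
y = reconcile_ols([90.0, 60.0, 40.0])
print(abs(y[0] - (y[1] + y[2])) < 1e-9)  # coherent after reconciliation: True
```

Estimating a non-identity <span class="math inline">\(\boldsymbol{W}_h\)</span> from in-sample one-step-ahead errors is exactly why the paper uses those errors as one of its clustering representations.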
</section>
<section id="lab-91" class="level2">
<h2 class="anchored" data-anchor-id="lab-91">lab 91</h2>
<ul>
<li>clvtools for prob type, h2o::automl for ML</li>
<li>agg, cohort, prob, ml, fcast</li>
<li>group by, summarize, pad_by_time, ungroup</li>
<li>lag - use horizon and use 2*horizon</li>
<li>rolling - 2,3,6 (uses lag parameters and 2; lags were 3 and 6 months with horizon = 3)</li>
<li>splits: timetk::time_series_split; cumulative = TRUE uses all prior data in the training set</li>
</ul>
</section>
<section id="rhino" class="level2">
<h2 class="anchored" data-anchor-id="rhino">Rhino</h2>
<ul>
<li>rhinoverse.dev</li>
<li>opinionated project structure and development toolbox that guides you towards best practices</li>
<li><code>rhino::init()</code> or RStudio New Project wizard</li>
<li>github discussions for questions</li>
<li>Can use other UI packages and not just those in rhinoverse</li>
<li>Project structure
<ul>
<li>config.yml for different environments (e.g. dev, prod)</li>
<li>main.R with server and ui</li>
<li>view - modules that rely on reactivity</li>
<li>static - imgs</li>
<li>styles - sass files (css stuff)</li>
<li>dependencies - explicit list of packages</li>
<li>cypress - end-to-end tests of the app (unit tests of R functions go in testthat)</li>
</ul></li>
<li><code>options(shiny.autoreload = TRUE)</code> - once you save, the app reloads automatically</li>
<li>addins
<ul>
<li>formatting, lintr</li>
<li>create rhino module</li>
<li>build sass - automatically shows changes in app when changing and saving sass file</li>
<li>build javascript - same as build sass but for react components</li>
</ul></li>
<li>Uses {box} for function imports from packages and has a box linter</li>
<li>dependency management
<ul>
<li><code>pkg_install</code>/<code>pkg_remove</code> - install packages from anywhere, not just CRAN. Updates dependencies.R and renv.lock</li>
</ul></li>
<li>Add react components with {shiny.react}</li>
</ul>
</section>
<section id="signature-transform" class="level2">
<h2 class="anchored" data-anchor-id="signature-transform">Signature Transform</h2>
<ul>
<li>Todo
<ul>
<li>Continue reading</li>
<li>Look at the separate papers from which the applied data examples are taken</li>
<li>Go back to original Amazon paper and see if signature parts and its appendix make more sense.</li>
<li>Look at Discussion section in Signatory github and ask questions</li>
</ul></li>
<li>Misc
<ul>
<li><span class="math inline">\(e\)</span> is a monomial (pg 13)</li>
<li><span class="math inline">\(\lambda\)</span> is a real number (pg 13)</li>
<li><span class="math inline">\(\otimes\)</span> is defined as the <em>joining</em> (i.e. concatenating) of multi-indexes of monomials: <span class="math inline">\(e_{i_1} \cdots e_{i_k} \otimes e_{j_1} \cdots e_{j_m} = e_{i_1} \cdots e_{i_k} e_{j_1} \cdots e_{j_m}\)</span> (pg 13)
<ul>
<li>Chen’s Identity: <span class="math inline">\(S(X*Y)_{a,c} = S(X)_{a,b} \otimes S(Y)_{b,c}\)</span> where <span class="math inline">\(X*Y\)</span> is the concatenation of two paths (pg 14)
<ul>
<li>So the signature of a concantenated path is equal to the circle-product of the signatures of the component paths.</li>
</ul></li>
<li><span class="math inline">\(\otimes n\)</span> is the n<sup>th</sup> power with respect to the circle product, <span class="math inline">\(\otimes\)</span> (pg 15)<br>
</li>
</ul></li>
</ul></li>
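Chen's identity can be sanity-checked numerically at truncation level 2 for a 1-D path, where <span class="math inline">\(S^{(1)}\)</span> is the total increment and <span class="math inline">\(S^{(1,1)}\)</span> is its square over <span class="math inline">\(2!\)</span>. A small sketch (my own example, not from the paper):

```python
def sig2(delta):
    """Truncated (level-2) signature of a 1-D path with total increment
    delta: (S^(1), S^(1,1)) = (delta, delta**2 / 2)."""
    return (delta, delta ** 2 / 2)

def chen(sx, sy):
    """Chen's identity at level 2: the tensor (concatenation) product of
    the two truncated signatures."""
    return (sx[0] + sy[0],                   # level 1: increments add
            sx[1] + sx[0] * sy[0] + sy[1])   # level 2: cross term appears

# concatenating paths with increments 2 and 3 = one path with increment 5
print(chen(sig2(2.0), sig2(3.0)) == sig2(5.0))  # True
```

At level 2 this is just the binomial identity <span class="math inline">\((\Delta_X + \Delta_Y)^2/2 = \Delta_X^2/2 + \Delta_X \Delta_Y + \Delta_Y^2/2\)</span>.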
<li>Workflow
<ul>
<li>Create a continuous path <span class="math inline">\(X_i\)</span> from each time-series <span class="math inline">\(\{Y_i\}\)</span> (row-wise)</li>
<li>If needed, make use of the lead-lag transform to account for the variability in data
<ul>
<li>Cumulative-sum transform is another option</li>
</ul></li>
<li>Compute the truncated signature <span class="math inline">\(S(X_i)|_L\)</span> of the path <span class="math inline">\(X_i\)</span> up to level <span class="math inline">\(L\)</span>
<ul>
<li>Either a Full or Log signature</li>
</ul></li>
<li>Standardize each signature column</li>
<li>Use the terms of signature <span class="math inline">\(\{S^I_i\}\)</span> as features</li>
</ul></li>
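The workflow above can be sketched in Python (an illustrative sketch, not the paper's code; standardization and the regression step are omitted): a lead-lag transform followed by the exact level-2 signature of the resulting piecewise-linear path, computed with iterated sums over segment increments.

```python
def lead_lag(x):
    """Lead-lag transform of a scalar series: a 2-D piecewise-linear path
    (lead = channel 0, lag = channel 1) whose level-2 signature picks up
    the quadratic variation of x."""
    path = []
    for i in range(len(x) - 1):
        path.append((x[i], x[i]))
        path.append((x[i + 1], x[i]))
    path.append((x[-1], x[-1]))
    return path

def sig_level2(path):
    """Exact level-1 and level-2 signature terms of a piecewise-linear
    path, via iterated sums: S^(i,j) accumulates S^(i)*dX^j + dX^i*dX^j/2."""
    d = len(path[0])
    S = {(i,): 0.0 for i in range(d)}
    S.update({(i, j): 0.0 for i in range(d) for j in range(d)})
    for p, q in zip(path, path[1:]):
        dx = [b - a for a, b in zip(p, q)]
        for i in range(d):
            for j in range(d):
                S[(i, j)] += S[(i,)] * dx[j] + 0.5 * dx[i] * dx[j]
        for i in range(d):       # update level 1 after level 2
            S[(i,)] += dx[i]
    return S

# increments of [0, 1, 3] are (1, 2): sum 3, quadratic variation 5
S = sig_level2(lead_lag([0.0, 1.0, 3.0]))
print(S[(0, 1)], S[(1, 0)])  # 7.0 2.0
```

With this lead-first convention, the cross terms satisfy <span class="math inline">\(S^{(0,1)} = \tfrac{1}{2}[(\sum \Delta X)^2 + \text{QV}]\)</span> and <span class="math inline">\(S^{(0,1)} - S^{(1,0)} = \text{QV}\)</span>, here <span class="math inline">\(\tfrac{1}{2}(9+5)=7\)</span> and <span class="math inline">\(5\)</span>.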
<li>Issue
<ul>
<li>Degeneracy in the terms of the signature means this representation is not unique, introducing collinearity among the signature terms.
<ul>
<li>Solution: LASSO, ridge or elastic net regularization</li>
<li>The paper uses a two-step LASSO: signature features are first selected by LASSO, then the selected features are used in a second regression with other predictors.</li>
</ul></li>
</ul></li>
<li>Signature
<ul>
<li><span class="math inline">\(S^{(1)}_{a,b} = X_b - X_a\)</span></li>
<li><span class="math inline">\(S^{(1,1)}_{a,b} = \frac{(X_b - X_a)^2}{2!}\)</span></li>
<li><span class="math inline">\(S^{(1,1,1)}_{a,b} = \frac{(X_b - X_a)^3}{3!}\)</span></li>
</ul></li>
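These closed forms, <span class="math inline">\(S^{(1,\ldots,1)}_{a,b} = (X_b - X_a)^k / k!\)</span>, can be recovered numerically as <span class="math inline">\(k\)</span>-fold iterated left Riemann sums over a linear path; a small sketch (the discretization is my own, not from the source):

```python
import math

def iterated_sum(delta, k, n=4000):
    """Approximate S^{(1,...,1)} (k ones) for a linear 1-D path with total
    increment delta by k nested left Riemann sums; the limit as n grows is
    delta**k / math.factorial(k)."""
    dx = delta / n
    f = [1.0] * (n + 1)                    # 0-fold iterated integral
    for _ in range(k):
        g = [0.0] * (n + 1)
        for i in range(n):
            g[i + 1] = g[i] + f[i] * dx    # integrate f against dX
        f = g
    return f[n]

print(abs(iterated_sum(2.0, 3) - 2.0 ** 3 / math.factorial(3)) < 1e-2)  # True
```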
<li>Cumulative + Lead Lag Signature Truncated to Level 2
<ul>
<li>Signature
<ul>
<li><span class="math inline">\(S(\tilde X)|_{L=2} = (1, S^{(1)}, S^{(2)}, S^{(1,1)}, S^{(1,2)}, S^{(2,1)}, S^{(2,2)})\)</span></li>
<li><span class="math inline">\(S^{(1)} = S^{(2)} = \sum_i^N X_i\)</span></li>
<li><span class="math inline">\(S^{(1,1)} = S^{(2,2)} = \frac{1}{2} \left(\sum_i^N X_i \right)^2\)</span></li>
<li><span class="math inline">\(S^{(1,2)} = \frac{1}{2} \left[\left(\sum_i^N X_i\right)^2 + \sum_i^N X_i^2 \right]\)</span></li>
<li><span class="math inline">\(S^{(2,1)} = \frac{1}{2} \left[\left(\sum_i^N X_i\right)^2 - \sum_i^N X_i^2 \right]\)</span></li>
</ul></li>
<li>Moments
<ul>
<li>Mean(X): <span class="math inline">\(\frac{1}{N}S^{(1)}\)</span></li>
<li>Var(X): <span class="math inline">\(-\frac{N+1}{N^2}S^{(2,1)} + \frac{N-1}{N^2}S^{(1,2)}\)</span></li>
</ul></li>
</ul></li>
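The moment formulas can be checked numerically. The sketch below hard-codes the level-2 closed forms with the sign convention <span class="math inline">\(S^{(1,2)} = \tfrac{1}{2}[(\sum X_i)^2 + \sum X_i^2]\)</span>, <span class="math inline">\(S^{(2,1)} = \tfrac{1}{2}[(\sum X_i)^2 - \sum X_i^2]\)</span>, under which <span class="math inline">\(\text{Var}(X) = -\tfrac{N+1}{N^2}S^{(2,1)} + \tfrac{N-1}{N^2}S^{(1,2)}\)</span> recovers the population variance (the lead-lag ordering is an assumption, worth verifying against the source):

```python
import statistics

def lead_lag_cumsum_sig2(x):
    """Level-2 signature terms of the cumulative-sum + lead-lag path of x
    under the closed forms A = (sum x)^2, B = sum x^2:
    S^(1,2) = (A + B)/2, S^(2,1) = (A - B)/2 (assumed sign convention)."""
    A = sum(x) ** 2
    B = sum(v * v for v in x)
    return {"S1": sum(x), "S12": (A + B) / 2, "S21": (A - B) / 2}

def moments_from_sig(sig, N):
    """Recover mean and population variance from the signature terms."""
    mean = sig["S1"] / N
    var = -(N + 1) / N**2 * sig["S21"] + (N - 1) / N**2 * sig["S12"]
    return mean, var

x = [1.0, 4.0, 2.0, 5.0]
mean, var = moments_from_sig(lead_lag_cumsum_sig2(x), len(x))
print(abs(mean - statistics.mean(x)) < 1e-9)      # True
print(abs(var - statistics.pvariance(x)) < 1e-9)  # True
```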
<li>Lead Lag Signature Truncated to Level 2
<ul>
<li><span class="math inline">\(S(\tilde X)|_{L=2} = (1, S^{(1)}, S^{(2)}, S^{(1,1)}, S^{(1,2)}, S^{(2,1)}, S^{(2,2)})\)</span></li>
<li><span class="math inline">\(S^{(1)} = S^{(2)} = \sum_i^{N-1} (X_{i+1} - X_i)\)</span></li>
<li><span class="math inline">\(S^{(1,1)} = S^{(2,2)} = \frac{1}{2} \left(\sum_i^{N-1} (X_{i+1} - X_i) \right)^2\)</span></li>
<li><span class="math inline">\(S^{(1,2)} = \frac{1}{2} \left[\left(\sum_i^{N-1} (X_{i+1} - X_i)\right)^2 + \sum_i^{N-1} (X_{i+1} - X_i)^2 \right]\)</span></li>
<li><span class="math inline">\(S^{(2,1)} = \frac{1}{2} \left[\left(\sum_i^{N-1} (X_{i+1} - X_i)\right)^2 - \sum_i^{N-1} (X_{i+1} - X_i)^2 \right]\)</span></li>
</ul></li>
<li>Log Signature Truncated to Level 2<br>
<span class="math display">\[
\begin{aligned}
&\log S(X) = (\Delta X, \Delta X, \frac{1}{2}\text{QV}(X))\\