\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{amsfonts,epsfig}
\usepackage[hyphens]{url}
\RequirePackage{color}
\usepackage{microtype}
\usepackage{mathtools}
\usepackage{extarrows}
\usepackage{enumerate}% http://ctan.org/pkg/enumerate
\usepackage{tabularx}
\usepackage{pifont}
\usepackage{tikz}
\usetikzlibrary{shapes,arrows,chains}
\usetikzlibrary[calc]
\usetikzlibrary{bayesnet}
\newcommand{\cmark}{\ding{51}}%
\newcommand{\xmark}{\ding{55}}%
\usepackage{hyperref}
\hypersetup{
colorlinks=true,
linkcolor=black,
citecolor=black,
filecolor=black,
urlcolor=black,
}
\usepackage{breakurl}
\usepackage{comment}
\graphicspath{{fig/}}
\linespread{1.02}
\usepackage{wrapfig}
\usepackage{amsthm}
\newtheorem{theorem}{Theorem}
\usepackage[linesnumbered,boxed]{algorithm2e}
\usepackage{relsize}
\usepackage{fancyvrb}
\let\oldv\verbatim
\let\oldendv\endverbatim
\def\verbatim{\par\setbox0\vbox\bgroup\oldv}
\def\endverbatim{\oldendv\egroup\fboxsep0pt \noindent\colorbox[gray]{0.96}{\usebox0}\par}
\usepackage{amssymb}
%%% Document layout, margins
\usepackage{geometry}
\geometry{letterpaper, textwidth=6.5in, textheight=9in, marginparsep=1em}
%%% Section headings
\usepackage{sectsty}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{amsmath}
\DeclareMathOperator{\E}{\mathrm{E}}
\DeclareMathOperator{\Exp}{\mathrm{Exp}}
\DeclareMathOperator{\Var}{\mathrm{Var}}
\DeclareMathOperator{\KL}{\mathrm{KL}}
\DeclareMathOperator{\R}{\mathbb{R}}
\DeclareMathOperator{\N}{\mbox{N}}
\usepackage[round]{natbib}
\makeatletter
\def\@maketitle{%
\begin{center}%
\let \footnote \thanks
{\large \@title \par}%
{\normalsize
\begin{tabular}[t]{c}%
\@author
\end{tabular}\par}%
{\small \@date}%
\end{center}%
}
\makeatother
\title{\textbf{Using Stacking to Average Models for Time Series and Panel Data Under the Continuous Ranked Probability Score}}
\author{Yuling Yao}
\date{\today \vspace{-.1in}}
\begin{document}\sloppy
\maketitle
\thispagestyle{empty}
\section{Method}
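The continuous ranked probability score (CRPS) of a forecast CDF $F$ with a finite first
moment, evaluated at an observation $y$, can be written as
$$crps(F,y)=\E_X|X-y|- \frac{1}{2}\E_{X,X^\prime}|X-X^\prime|,$$
where $X$ and $X^\prime$ are independent draws from $F$. Smaller values indicate better forecasts.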
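As a quick sanity check of this formula (an added illustration; the symbol $x_0$ is introduced here only for this purpose), take $F$ to be a point mass at $x_0$. Then $X=X^\prime=x_0$ almost surely, the second term vanishes, and
$$crps(F,y)=|x_0-y|,$$
so for a deterministic forecast the CRPS reduces to the absolute error. The same score can also be written in the equivalent integral form
$$crps(F,y)=\int_{-\infty}^{\infty}\bigl(F(x)-\mathbf{1}\{x\geq y\}\bigr)^2\,dx,$$
which makes explicit that it compares the whole forecast CDF with the step function at the observation.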
Given $K$ prediction models and observations $y_{tr}$ (the observed ICU demand) at time $t=1, \dots, T$ in region $r=1,\dots, R$, we generate a one-day-ahead prediction from each model and draw $S$ predictive samples $x_{1ktr}, \dots, x_{Sktr}$ from the $k$-th model for day $t$ and region $r$. Each prediction uses only information available before day $t$, so every model is scored out of sample, which guards against over-fitting.
Using these predictive draws, we can estimate the CRPS of the $k$-th model on the $tr$-th observation by
$$ \widehat {crps}_{ktr}= \widehat {crps}(x_{1ktr}, \dots, x_{Sktr},y_{tr})= \frac{1}{S} \sum_{s=1}^S |x_{sktr}-y_{tr}| -
\frac{1}{2S^2} \sum_{s, j=1}^S |x_{sktr}- x_{jktr}|.$$
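To illustrate the estimator with made-up numbers (the values are hypothetical and chosen only for easy arithmetic), suppose one model produces $S=3$ draws $(1,2,4)$ and the observation is $y_{tr}=3$. Then
\begin{align*}
\frac{1}{S}\sum_{s=1}^S |x_{sktr}-y_{tr}| &= \frac{2+1+1}{3}=\frac{4}{3},\\
\frac{1}{2S^2}\sum_{s,j=1}^S |x_{sktr}-x_{jktr}| &= \frac{2\,(1+3+2)}{18}=\frac{2}{3},
\end{align*}
so $\widehat{crps}_{ktr}=4/3-2/3=2/3$. The double sum runs over all ordered pairs $(s,j)$, so each distinct pair is counted twice and the diagonal terms $s=j$ contribute zero.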
Now the goal is to aggregate these $K$ models. When the prediction is a mixture of the $K$ predictive distributions with weights $w_1, \dots, w_K$, the first expectation in the CRPS is linear in the weights and the second is bilinear, so the CRPS of the mixture can be estimated by
$$ \widehat {crps}_{agg, tr} (w_1, \dots, w_K) = \frac{1}{S} \sum_{k=1}^K w_k \sum_{s=1}^S |x_{sktr}-y_{tr}| -
\frac{1}{2S^2} \sum_{k, k'=1 }^K w_k w_{k'} \sum_{s, j=1}^S |x_{sktr}- x_{jk'tr}|.$$
This expression is quadratic in $w$. We only need to compute $\sum_{s=1}^S |x_{sktr}-y_{tr}|$ for all $k$ and
$\sum_{s, j=1}^S |x_{sktr}- x_{jk'tr}|$ for all pairs $k, k'$ (the case $k=k'$ gives $\sum_{s, j=1}^S |x_{sktr}- x_{jktr}|$) once; the stored values can then be reused for every weight vector visited during the optimization.
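
As a small check of the mixture score, consider a hypothetical case (numbers chosen only for illustration) with $K=2$ models, $S=1$ draw each, $x_{11tr}=0$, $x_{12tr}=2$, observation $y_{tr}=1$, and weights $(w_1,w_2)=(w,1-w)$. The first term equals $w\cdot 1+(1-w)\cdot 1=1$. The within-model distances are zero, and the cross distance $|0-2|=2$ appears for both $(k,k')=(1,2)$ and $(2,1)$, so
$$\widehat{crps}_{agg,tr}(w,1-w)=1-\frac{1}{2}\cdot 2\,w(1-w)\cdot 2 = 1-2w(1-w).$$
This is quadratic in $w$, equals $1$ (the absolute error of either model alone) at $w=0$ or $w=1$, and attains its minimum value $1/2$ at $w=1/2$.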
An extra degree of freedom is how we weight the predictions over time and over regions. In the early days we have very little data, so all predictions are noisy, and we might want to down-weight the early phase in the final model evaluation. Mathematically, we can introduce pre-specified time-varying weights $\lambda_1, \dots, \lambda_T$, say $\lambda_t = 1.5-(1-t/T)^2$, which give less weight to early estimates (smaller $t$). Likewise, we account for the association between regions and for region-specific sample sizes through a region-specific weight $\tau_r$.
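
As a numerical illustration of the time weights $\lambda_t = 1.5-(1-t/T)^2$, take a hypothetical horizon of $T=10$ days (a value assumed here only for the example):
$$\lambda_1 = 1.5-0.9^2=0.69,\qquad \lambda_5 = 1.5-0.5^2=1.25,\qquad \lambda_{10}=1.5 .$$
In general $\lambda_t$ increases monotonically in $t$ from $1.5-(1-1/T)^2$, which approaches $0.5$ for large $T$, up to $\lambda_T=1.5$, so for long series the most recent day carries roughly three times the weight of the first day.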
Finally, we solve a quadratic optimization problem:
\begin{align*}
&\min_{w_1, \dots, w_K} \sum_{t=1}^T \sum_{r=1}^R\lambda_t\tau_r \widehat {crps}_{agg, tr} (w_1, \dots, w_K), \\
&\text{s.t.} ~ 0\leq w_k \leq 1 \text{ for } k=1,\dots,K, \quad \sum_{k=1}^K w_k=1.
\end{align*}
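This problem can be written compactly in matrix form. The notation $c$ and $Q$ below is introduced here for illustration and is an assumption of this sketch. Define, for $k,k'=1,\dots,K$,
\begin{align*}
c_k &= \sum_{t=1}^T\sum_{r=1}^R \lambda_t\tau_r\, \frac{1}{S}\sum_{s=1}^S |x_{sktr}-y_{tr}|, \\
Q_{kk'} &= \sum_{t=1}^T\sum_{r=1}^R \lambda_t\tau_r\, \frac{1}{S^2}\sum_{s,j=1}^S |x_{sktr}-x_{jk'tr}|.
\end{align*}
Then, with $w=(w_1,\dots,w_K)^\top$, the objective is
$$\sum_{t=1}^T\sum_{r=1}^R\lambda_t\tau_r\,\widehat{crps}_{agg,tr}(w) = c^\top w-\frac{1}{2}\,w^\top Q\,w,$$
to be minimized over the probability simplex. Since $c$ and $Q$ do not depend on $w$, they can be computed once and passed to any standard quadratic programming solver. Assuming the weights $\lambda_t\tau_r$ are nonnegative, the objective is convex on the simplex, because the CRPS is a convex function of the forecast CDF and the mixture CDF is linear in $w$; hence any local minimum is a global one.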
\end{document}