Skip to content

borva/lmDecomp1

Repository files navigation

<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">

<head>

<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />

<meta name="viewport" content="width=device-width, initial-scale=1">

<meta name="author" content="BV" />

<meta name="date" content="2017-09-30" />

<title>Linear models: Pushing down and speeding up</title>



<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
div.sourceCode { overflow-x: auto; }
table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode {
  margin: 0; padding: 0; vertical-align: baseline; border: none; }
table.sourceCode { width: 100%; line-height: 100%; background-color: #f8f8f8; }
td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; }
td.sourceCode { padding-left: 5px; }
pre, code { background-color: #f8f8f8; }
code > span.kw { color: #204a87; font-weight: bold; } /* Keyword */
code > span.dt { color: #204a87; } /* DataType */
code > span.dv { color: #0000cf; } /* DecVal */
code > span.bn { color: #0000cf; } /* BaseN */
code > span.fl { color: #0000cf; } /* Float */
code > span.ch { color: #4e9a06; } /* Char */
code > span.st { color: #4e9a06; } /* String */
code > span.co { color: #8f5902; font-style: italic; } /* Comment */
code > span.ot { color: #8f5902; } /* Other */
code > span.al { color: #ef2929; } /* Alert */
code > span.fu { color: #000000; } /* Function */
code > span.er { color: #a40000; font-weight: bold; } /* Error */
code > span.wa { color: #8f5902; font-weight: bold; font-style: italic; } /* Warning */
code > span.cn { color: #000000; } /* Constant */
code > span.sc { color: #000000; } /* SpecialChar */
code > span.vs { color: #4e9a06; } /* VerbatimString */
code > span.ss { color: #4e9a06; } /* SpecialString */
code > span.im { } /* Import */
code > span.va { color: #000000; } /* Variable */
code > span.cf { color: #204a87; font-weight: bold; } /* ControlFlow */
code > span.op { color: #ce5c00; font-weight: bold; } /* Operator */
code > span.pp { color: #8f5902; font-style: italic; } /* Preprocessor */
code > span.ex { } /* Extension */
code > span.at { color: #c4a000; } /* Attribute */
code > span.do { color: #8f5902; font-weight: bold; font-style: italic; } /* Documentation */
code > span.an { color: #8f5902; font-weight: bold; font-style: italic; } /* Annotation */
code > span.cv { color: #8f5902; font-weight: bold; font-style: italic; } /* CommentVar */
code > span.in { color: #8f5902; font-weight: bold; font-style: italic; } /* Information */
</style>




</head>

<body>




<h1 class="title toc-ignore">Linear models: Pushing down and speeding up</h1>
<h4 class="author"><em>BV</em></h4>
<h4 class="date"><em>30 September 2017</em></h4>



<div id="this-is-the-github-page-accompanying-the-blog-article-www.quantitative-consulting.eublog4.html" class="section level3">
<h3>This is the GitHub page accompanying the blog article <a href="www.quantitative-consulting.eu/blog4.html">www.quantitative-consulting.eu/blog4.html </a></h3>
<p>Let’s start with a few observations about linear models. The usual discussion posits a target variable <code>y</code> of length N and design matrix <code>X</code> with p columns (of full rank, say) and N rows. The most common case – unless you are a geneticist – is that <code>X</code> is ‘long’, i.e. we have N &gt;&gt; p. The linear model equation, could be written</p>
<div id="section" class="section level4">
<h4><img src="yXb.svg" /></h4>
<p>and the least squares solution is obtained by projecting <code>y</code> orthogonally onto the image of <code>X</code> and identifying the corresponding set of parameters</p>
</div>
<div id="section-1" class="section level4">
<h4><img src="betahat.svg" /></h4>
<p>We can test this in a little simulation</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">p =<span class="st"> </span><span class="dv">1000</span>
N =<span class="st"> </span><span class="dv">10</span>^<span class="dv">5</span>
beta =<span class="st"> </span><span class="kw">runif</span>(p, -<span class="dv">20</span>, <span class="dv">20</span>)

X &lt;-<span class="st"> </span><span class="kw">matrix</span>(<span class="kw">rnorm</span>(p*N), <span class="dt">ncol =</span> p)
y  &lt;-<span class="st"> </span>X %*%<span class="st"> </span>beta +<span class="st"> </span><span class="kw">rnorm</span>(N, <span class="dt">sd =</span> <span class="dv">5</span>)

<span class="kw">system.time</span>(XtX &lt;-<span class="st"> </span><span class="kw">crossprod</span>(X))[<span class="dv">3</span>]</code></pre></div>
<pre><code>## elapsed 
##  62.252</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">system.time</span>(Xty &lt;-<span class="st"> </span><span class="kw">drop</span>(<span class="kw">crossprod</span>(X,y)))[<span class="dv">3</span>]</code></pre></div>
<pre><code>## elapsed 
##   0.315</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">system.time</span>(beta_est &lt;-<span class="st"> </span><span class="kw">solve</span>(XtX, Xty))[<span class="dv">3</span>]</code></pre></div>
<pre><code>## elapsed 
##   0.352</code></pre>
<p>Classical treatments of the least squares solution focus on decomposing <code>X</code> in a suitable manner to address both, the crossproduct and the inversion. This obscures a little bit the fact that these are two distinct data steps:</p>
<ul>
<li>a ‘small data’ step: Even with our hypothetical <code>p = 1000</code> coefficients, the inversion only takes a few seconds</li>
<li>a ‘big data’ step: The <em>crossproduct</em> is what we should worry about much more for typical large <code>X</code>!</li>
</ul>
</div>
</div>
<div id="chunking-and-parallelising" class="section level3">
<h3>Chunking and parallelising</h3>
<p>Our example case (large <code>X</code> with N &gt;&gt; p) is ideal for chunking and parallelising. Chunking is made easy by packages like <code>bigglm</code> and <code>speedglm</code> that provide the infrastructure for handling chunked data. Also note that the crossproducts are ‘embarrassingly parallel’, i.e. can be calculated via an appropriate ‘lapply’-ification:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(parallel)

X_chunked &lt;-<span class="st"> </span><span class="kw">lapply</span>(<span class="kw">split</span>(X, (<span class="dv">1</span>:N)%%<span class="dv">9</span>) , matrix, <span class="dt">ncol =</span> p)

<span class="kw">system.time</span>(XtX_chunked &lt;-<span class="st"> </span><span class="kw">mclapply</span>(X_chunked, crossprod, <span class="dt">mc.cores =</span> <span class="dv">3</span>))[<span class="dv">3</span>]</code></pre></div>
<pre><code>## elapsed 
##  31.413</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">XtX2 &lt;-<span class="st"> </span><span class="dv">0</span>; for(xtx in XtX_chunked) XtX2 &lt;-<span class="st"> </span>XtX2 +<span class="st"> </span>xtx
<span class="kw">sum</span>(<span class="kw">abs</span>(XtX-<span class="st"> </span>XtX2)) ## ~ 0</code></pre></div>
<pre><code>## [1] 3.239706e-06</code></pre>
<p>So, chunking and parallelising helps.</p>
<p>To improve things further, one needs to look more closely at the structure of <code>X</code>. One well-known approach is in the <code>MatrixModels</code>-package, which allows to make use of sparsity in <code>X</code>, for instance, when <code>X</code> contains many dummy variables.</p>
</div>
<div id="special-structures-in-x-star-schema" class="section level3">
<h3>Special structures in <code>X</code> – Star schema</h3>
<p>The situation for which I have not immediately found an adequate package, and which I want to discuss in a bit more detail here, is when <code>X</code> is constructed from some kind of database ‘star schema’. As an example, we can take the fuel price data discussed [here].</p>
<p>We start with loading the required packages and by defining a small bechmarking function, which will not only measure execution times, but also give a rough idea on memory usage:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## (C) 2017 by Boris Vaillant, Quantitative Consulting
## License is granted to to use this code under the MIT License

## Clear all -- only if you have saved your other projects
<span class="kw">rm</span>(<span class="dt">list=</span><span class="kw">ls</span>(<span class="dt">all=</span><span class="ot">TRUE</span>))

<span class="kw">library</span>(tidyverse)
<span class="kw">library</span>(stringr) 
<span class="kw">library</span>(splines)
<span class="kw">library</span>(Matrix)
<span class="kw">library</span>(MatrixModels)

## a simple benchmarker
mstest &lt;-<span class="st"> </span>function(<span class="dt">expr=</span><span class="ot">NULL</span>) {
  mem &lt;-<span class="st"> </span><span class="kw">gc</span>(<span class="dt">reset =</span> <span class="ot">TRUE</span>)[<span class="dv">2</span>, <span class="kw">c</span>(<span class="dv">2</span>,<span class="dv">6</span>)]
  st &lt;-<span class="st"> </span><span class="kw">proc.time</span>()
  expr
  mem2 &lt;-<span class="st"> </span><span class="kw">gc</span>(<span class="dt">reset =</span> <span class="ot">FALSE</span>)[<span class="dv">2</span>, <span class="kw">c</span>(<span class="dv">2</span>,<span class="dv">6</span>)] -<span class="st"> </span>mem
  <span class="kw">names</span>(mem2) &lt;-<span class="st"> </span><span class="kw">c</span>(<span class="st">&quot;A_Mem final&quot;</span>, <span class="st">&quot;A_Mem max&quot;</span>)
  <span class="kw">c</span>(<span class="dt">A_Time =</span><span class="kw">structure</span>(<span class="kw">proc.time</span>()-<span class="st"> </span>st, <span class="dt">class =</span> <span class="st">&quot;proc_time&quot;</span>)[<span class="dv">1</span>:<span class="dv">3</span>],  mem2)
}</code></pre></div>
<p>Now load (and clean) the fuel price data (download it from GitHub [here]). It is a very simple star schema, with fuel prices per station and time as fact table:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">pricesAll &lt;-<span class="st"> </span><span class="kw">readRDS</span>( <span class="st">&quot;Prices_Demo.rds&quot;</span>)%&gt;%
<span class="st">  </span><span class="kw">filter</span>(E10 &gt;<span class="st"> </span><span class="dv">700</span>, E10 &lt;<span class="st"> </span><span class="dv">3500</span>, TimeAggID &gt;<span class="st"> </span><span class="dv">26425</span>,
         ## choose TimeAggID filter to fit to your machine
         !<span class="kw">is.na</span>(CompE10)) 
pricesAll</code></pre></div>
<pre><code>## # A tibble: 4,859,838 x 4
##    StationID TimeAggID   E10 CompE10
##        &lt;int&gt;     &lt;int&gt; &lt;int&gt;   &lt;int&gt;
##  1         1     26426  1269    1450
##  2         1     26427  1269    1450
##  3         1     26428  1269    1450
##  4         1     26429  1269    1450
##  5         1     26430  1269    1439
##  6         1     26431  1269    1400
##  7         1     26432  1269    1308
##  8         1     26433  1269    1295
##  9         1     26434  1269    1290
## 10         1     26435  1269    1281
## # ... with 4,859,828 more rows</code></pre>
<p>as well as dimension tables for stations and times:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">timesAll &lt;-<span class="st"> </span><span class="kw">readRDS</span>( <span class="st">&quot;E_prtimes.rds&quot;</span>)
timesAll</code></pre></div>
<pre><code>## # A tibble: 27,528 x 6
##    TimeAggID DateHour DateDay HourCl  DayCl Brent
##        &lt;int&gt;    &lt;int&gt;   &lt;int&gt; &lt;fctr&gt; &lt;fctr&gt; &lt;int&gt;
##  1         1 14060800  140608 Hour00    Sun   501
##  2         2 14060801  140608 Hour01    Sun   501
##  3         3 14060802  140608 Hour02    Sun   501
##  4         4 14060803  140608 Hour03    Sun   501
##  5         5 14060804  140608 Hour04    Sun   501
##  6         6 14060805  140608 Hour05    Sun   501
##  7         7 14060806  140608 Hour06    Sun   501
##  8         8 14060807  140608 Hour07    Sun   501
##  9         9 14060808  140608 Hour08    Sun   501
## 10        10 14060809  140608 Hour09    Sun   501
## # ... with 27,518 more rows</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">stationsAll &lt;-<span class="st"> </span><span class="kw">readRDS</span>( <span class="st">&quot;C_stationsAll_final.rds&quot;</span>)

## add natural splines to stations 
NS &lt;-<span class="st"> </span><span class="kw">as_data_frame</span>( <span class="kw">model.matrix</span>( ~<span class="st"> </span>-<span class="dv">1</span> +<span class="st"> </span><span class="kw">ns</span>(lat, <span class="dt">df =</span> <span class="dv">3</span>):<span class="kw">ns</span>(lng, <span class="dt">df =</span> <span class="dv">3</span>),   
                                   <span class="dt">data =</span> stationsAll ))

<span class="kw">names</span>(NS) &lt;-<span class="st"> </span><span class="kw">paste0</span>(<span class="st">&quot;NS&quot;</span>,<span class="kw">str_replace_all</span>(<span class="kw">names</span>(NS), 
                      <span class="st">&quot;</span><span class="ch">\\</span><span class="st">:|ns</span><span class="ch">\\</span><span class="st">(lat, df = </span><span class="ch">\\</span><span class="st">d+</span><span class="ch">\\</span><span class="st">)|ns</span><span class="ch">\\</span><span class="st">(lng, df = </span><span class="ch">\\</span><span class="st">d+</span><span class="ch">\\</span><span class="st">)&quot;</span>, <span class="st">&quot;&quot;</span>))
stationsAll &lt;-<span class="st"> </span><span class="kw">bind_cols</span>(stationsAll, NS)
stationsAll</code></pre></div>
<pre><code>## # A tibble: 14,902 x 15
##    StationID brandCl       lng      lat isBAB closeBAB       NS11
##        &lt;int&gt;  &lt;fctr&gt;     &lt;dbl&gt;    &lt;dbl&gt; &lt;int&gt;    &lt;int&gt;      &lt;dbl&gt;
##  1         1  Others 10.850848 48.55568     0        0 0.39713860
##  2         2     BfT 10.920224 49.66159     0        0 0.43709986
##  3         3  Others  6.948950 51.16919     0        0 0.01966565
##  4         4  Others 11.451392 48.12189     0        1 0.37167161
##  5         5  Others 12.520820 48.24147     0        0 0.28931168
##  6         6  Others  7.586279 48.02442     0        0 0.06587314
##  7         7  Others 10.261920 47.52460     0        0 0.34155893
##  8         8  Others  9.424034 47.86324     0        0 0.27201896
##  9         9    Avia  9.825452 48.00048     0        0 0.31995881
## 10        10  Others  9.651461 47.82916     0        0 0.29700352
## # ... with 14,892 more rows, and 8 more variables: NS21 &lt;dbl&gt;, NS31 &lt;dbl&gt;,
## #   NS12 &lt;dbl&gt;, NS22 &lt;dbl&gt;, NS32 &lt;dbl&gt;, NS13 &lt;dbl&gt;, NS23 &lt;dbl&gt;, NS33 &lt;dbl&gt;</code></pre>
</div>
<div id="traditional-approach-one-big-design-matrix-x" class="section level3">
<h3>Traditional approach: One big design matrix <code>X</code></h3>
<p>Now, let’s say we want to estimate the influence of</p>
<ul>
<li>the brand of the station</li>
<li>the closeness of the station to a highway (via isBAB, closeBAB)</li>
<li>the location of the station in Germany (via the natural splines on long / lat)</li>
<li>the time of day</li>
<li>the average price of Competitors (CompE10 — I’ll say more about this, later)</li>
</ul>
<p>etc. on the price of E10 fuel.</p>
<p>In the traditional approach, we join all the dimension information to the fact table, and run the model on the resulting large table:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">fulldat &lt;-<span class="st"> </span>pricesAll %&gt;%
<span class="st">  </span><span class="kw">left_join</span>(<span class="kw">select</span>(timesAll, -DateHour, -DateDay )) %&gt;%
<span class="st">  </span><span class="kw">left_join</span>(stationsAll)

<span class="kw">object.size</span>(fulldat)/<span class="dv">10</span>^<span class="dv">6</span> ## 622MB</code></pre></div>
<pre><code>## 622.1 bytes</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">RESULT &lt;-<span class="st"> </span><span class="ot">NULL</span>

modfor &lt;-<span class="st"> </span><span class="kw">as.formula</span>(
  <span class="kw">paste0</span>(<span class="st">&quot;E10 ~ 1 +  CompE10 + Brent + isBAB + closeBAB + brandCl +&quot;</span>,
         <span class="st">&quot;DayCl  +  HourCl +&quot;</span>, 
         <span class="kw">paste</span>(<span class="kw">names</span>(NS), <span class="dt">collapse =</span> <span class="st">&quot; + &quot;</span>)))

msLM &lt;-<span class="st"> </span><span class="kw">mstest</span>(modLM &lt;-<span class="st"> </span><span class="kw">lm</span>( modfor, <span class="dt">data =</span> fulldat))
    
RESULT[[<span class="st">&quot;trad. LM&quot;</span>]] &lt;-<span class="st"> </span><span class="kw">c</span>(msLM, modLM$coefficients)
msLM</code></pre></div>
<pre><code>## A_Time.user.self  A_Time.sys.self   A_Time.elapsed      A_Mem final 
##           41.616            0.551           42.236         2870.500 
##        A_Mem max 
##         5058.200</code></pre>
<pre><code>##            used  (Mb) gc trigger   (Mb) max used  (Mb)
## Ncells  1506557  80.5    7974897  426.0  1506557  80.5
## Vcells 13599970 103.8  699091880 5333.7 13599970 103.8</code></pre>
</div>
<div id="exploiting-the-star-schema-lifted-matrices" class="section level3">
<h3>Exploiting the star schema: Lifted matrices</h3>
<p>So, how to make this smaller and faster?</p>
<p>The main observation we can make is that the parts of the design matrix, that come from the joined dimension matrices, have a very simple form. For example, when looking at the Brent price(which depends on time only)</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"> matBrent =<span class="st"> </span><span class="kw">model.Matrix</span>(~<span class="st"> </span>-<span class="dv">1</span> +<span class="st"> </span>Brent, <span class="dt">data =</span> timesAll )</code></pre></div>
<p>is the design matrix on the table <code>timesAll</code> and</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">indBrent =<span class="st"> </span><span class="kw">as</span>(<span class="kw">list</span>(pricesAll$TimeAggID, <span class="kw">max</span>(timesAll$TimeAggID)), <span class="st">&quot;indMatrix&quot;</span>)</code></pre></div>
<p>is the (sparse version of) the ‘index matrix’, representing the join. This matrix has a 1 in position <code>(ro, co)</code> exactly when <code>TimeAggID[ro] == co</code> and zeros otherwise. It is now easy to check that the matrix product of <code>indBrent</code> and <code>matBrent</code> is the lifted / joined matrix on the fact table</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">all</span>(indBrent %*%<span class="st"> </span>matBrent ==<span class="st"> </span>timesAll$Brent[pricesAll$TimeAggID]) ## TRUE</code></pre></div>
<p>This is a very useful representation, as</p>
<pre><code>1. `matBrent` will be comparatively small,
2. crossproducts with the index matrix `indBrent` amount to a simple summing over groups.</code></pre>
<p>Furthermore, index matrix algebra is already implemented in the <code>Matrix</code>-package.</p>
<p>Let’s call these objects, ‘lifted matrices’. We can actually define them as a simple extension of the classes in <code>Matrix</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setClass</span>(<span class="st">&quot;liftMatrix&quot;</span>, 
         <span class="kw">representation</span>(<span class="dt">mat =</span> <span class="st">&quot;Matrix&quot;</span>, 
                        <span class="dt">indI  =</span> <span class="st">&quot;indMatrix&quot;</span>))</code></pre></div>
<p>The beauty of the ‘Matrix’-package (and the S4-system it is built on) is that we now only need to implement the algebra for this new class as a set of S4-methods, adding to (in one case for the moment: overriding) the existing method-definitions in <code>Matrix</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setMethod</span>(<span class="st">&quot;crossprod&quot;</span>, <span class="kw">signature</span>( <span class="dt">x =</span> <span class="st">&quot;indMatrix&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;missing&quot;</span>), 
          function(x)  <span class="kw">Diagonal</span>(<span class="dt">x =</span> <span class="kw">tabulate</span>(x@perm, x@Dim[<span class="dv">2</span>]))) </code></pre></div>
<pre><code>## [1] &quot;crossprod&quot;</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setMethod</span>(<span class="st">&quot;crossprod&quot;</span>, <span class="kw">signature</span>( <span class="dt">x =</span> <span class="st">&quot;indMatrix&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;indMatrix&quot;</span> ), 
          function(x, y) {
            l1 &lt;-<span class="st"> </span><span class="kw">as</span>(x, <span class="st">&quot;lMatrix&quot;</span>)
            l2 &lt;-<span class="st"> </span><span class="kw">as</span>(y, <span class="st">&quot;lMatrix&quot;</span>)
            <span class="kw">crossprod</span>(l1,l2)
            })</code></pre></div>
<pre><code>## [1] &quot;crossprod&quot;</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setMethod</span>(<span class="st">&quot;crossprod&quot;</span>, <span class="kw">signature</span>( <span class="dt">x =</span> <span class="st">&quot;liftMatrix&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;missing&quot;</span>), 
          function(x) <span class="kw">crossprod</span>(x@mat, <span class="kw">crossprod</span>(<span class="kw">crossprod</span>(x@indI) , x@mat)))</code></pre></div>
<pre><code>## [1] &quot;crossprod&quot;</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setMethod</span>(<span class="st">&quot;crossprod&quot;</span>, <span class="kw">signature</span>( <span class="dt">x =</span> <span class="st">&quot;liftMatrix&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;liftMatrix&quot;</span> ), 
          function(x, y)   <span class="kw">crossprod</span>(x@mat, <span class="kw">crossprod</span>(<span class="kw">crossprod</span>(y@indI, x@indI), y@mat)))</code></pre></div>
<pre><code>## [1] &quot;crossprod&quot;</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setMethod</span>(<span class="st">&quot;crossprod&quot;</span>, <span class="kw">signature</span>( <span class="dt">x =</span> <span class="st">&quot;liftMatrix&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;Matrix&quot;</span> ), 
          function(x, y)   {
            <span class="kw">crossprod</span>(x@mat, <span class="kw">crossprod</span>(x@indI, y)) })</code></pre></div>
<pre><code>## [1] &quot;crossprod&quot;</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setMethod</span>(<span class="st">&quot;crossprod&quot;</span>, <span class="kw">signature</span>( <span class="dt">x =</span> <span class="st">&quot;Matrix&quot;</span>, <span class="dt">y =</span> <span class="st">&quot;liftMatrix&quot;</span> ), 
          function(x, y)   {
            <span class="kw">crossprod</span>(x, y@indI) %*%<span class="st"> </span>y@mat
          })</code></pre></div>
<pre><code>## [1] &quot;crossprod&quot;</code></pre>
<p>With these definitions, it is now easy to define the ‘lifted’ (or ‘pushdown’, depending on your point of view) version of our fuel price model above, by defining separately the components of the design which stem from the two dimension tables and (for the effect of <code>CompE10</code>) from the fact table itself.</p>
<p>I go through the full calculation here for demonstration purposes. I hope to put this into a more user-friendly format, soon.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">msLift &lt;-<span class="st"> </span><span class="kw">mstest</span>({
  
  modforS &lt;-<span class="st"> </span><span class="kw">as.formula</span>(
    <span class="kw">paste0</span>(<span class="st">&quot;~ 1 +  isBAB + closeBAB + brandCl +&quot;</span>, 
           <span class="kw">paste</span>(<span class="kw">names</span>(NS), <span class="dt">collapse =</span> <span class="st">&quot; + &quot;</span>)))
 
  XS &lt;-<span class="st"> </span><span class="kw">new</span>(<span class="st">&quot;liftMatrix&quot;</span>,
            <span class="dt">mat =</span> <span class="kw">model.Matrix</span>(modforS, <span class="dt">data =</span> stationsAll )*<span class="fl">1.0</span>,
            <span class="dt">indI =</span> <span class="kw">as</span>(<span class="kw">list</span>(pricesAll$StationID, <span class="kw">max</span>(stationsAll$StationID)),
                      <span class="st">&quot;indMatrix&quot;</span>))
     
  modforT &lt;-<span class="st"> </span><span class="kw">as.formula</span>(<span class="kw">paste0</span>(<span class="st">&quot;~ 1 + Brent + DayCl + HourCl&quot;</span>))

  XT &lt;-<span class="st"> </span><span class="kw">new</span>(<span class="st">&quot;liftMatrix&quot;</span>,
            <span class="dt">mat =</span> <span class="kw">model.Matrix</span>(modforT, <span class="dt">data =</span> timesAll )[,-<span class="dv">1</span>]*<span class="fl">1.0</span>,
            <span class="dt">indI =</span> <span class="kw">as</span>(<span class="kw">list</span>(pricesAll$TimeAggID, <span class="kw">max</span>(timesAll$TimeAggID)), 
                    <span class="st">&quot;indMatrix&quot;</span>))
 
  XC  &lt;-<span class="st"> </span><span class="kw">Matrix</span>( pricesAll$CompE10, <span class="dt">ncol =</span> <span class="dv">1</span>, 
                 <span class="dt">dimnames =</span> <span class="kw">list</span>(<span class="ot">NULL</span>, <span class="st">&quot;CompE10&quot;</span>))
    
  Y &lt;-<span class="st"> </span><span class="kw">Matrix</span>(pricesAll$E10, <span class="dt">ncol =</span> <span class="dv">1</span>)

  XTXT &lt;-<span class="st"> </span><span class="kw">crossprod</span>(XT)
  XSXS &lt;-<span class="st"> </span><span class="kw">crossprod</span>(XS)
  XCXC &lt;-<span class="st"> </span><span class="kw">crossprod</span>(XC)
  XTXS &lt;-<span class="st"> </span><span class="kw">crossprod</span>(XT, XS)
  XTXC &lt;-<span class="st"> </span><span class="kw">crossprod</span>(XT, XC)
  XSXC &lt;-<span class="st"> </span><span class="kw">crossprod</span>(XS, XC)
  XSY &lt;-<span class="st"> </span><span class="kw">crossprod</span>(XS, Y)
  XTY &lt;-<span class="st"> </span><span class="kw">crossprod</span>(XT, Y)
  XCY &lt;-<span class="st"> </span><span class="kw">crossprod</span>(XC, Y)
  
  XX &lt;-<span class="st">  </span><span class="kw">rbind</span>(<span class="kw">cbind</span>(XTXT, XTXS, XTXC),
               <span class="kw">cbind</span>(<span class="kw">t</span>(XTXS), XSXS, XSXC),
               <span class="kw">cbind</span>(<span class="kw">t</span>(XTXC), <span class="kw">t</span>(XSXC), XCXC))
  XY &lt;-<span class="st"> </span><span class="kw">rbind</span>(XTY, XSY, XCY)
  
  modLift &lt;-<span class="st"> </span><span class="kw">drop</span>(<span class="kw">solve</span>(XX, XY))
})

RESULT[[<span class="st">&quot;lifted LM&quot;</span>]] &lt;-<span class="st"> </span><span class="kw">c</span>(modLift,  msLift)</code></pre></div>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">for(nam in <span class="kw">names</span>(RESULT)) RESULT[[nam]]&lt;-<span class="st"> </span>
<span class="st">  </span><span class="kw">data.frame</span>( <span class="dt">estimate =</span> RESULT[[nam]]) %&gt;%<span class="st"> </span>
<span class="st">  </span><span class="kw">rownames_to_column</span>(<span class="dt">var =</span> <span class="st">&quot;term&quot;</span>) %&gt;%
<span class="st">  </span><span class="kw">mutate</span>(<span class="dt">model =</span> nam)

resultWide &lt;-<span class="st"> </span><span class="kw">bind_rows</span>(RESULT) %&gt;%<span class="st"> </span>dplyr::<span class="kw">select</span>(model, term, estimate) %&gt;%
<span class="st">  </span><span class="kw">spread</span>(model, estimate)

<span class="kw">head</span>(resultWide, <span class="dv">10</span>)</code></pre></div>
<pre><code>##                term   lifted LM    trad. LM
## 1       A_Mem final   84.000000 2870.500000
## 2         A_Mem max 2103.000000 5058.200000
## 3    A_Time.elapsed    3.507000   42.236000
## 4   A_Time.sys.self    0.320000    0.551000
## 5  A_Time.user.self    3.174000   41.616000
## 6       brandClAral   23.329966   23.329966
## 7       brandClAvia   -5.360235   -5.360235
## 8        brandClBfT  -17.300784  -17.300784
## 9       brandClEsso   -3.675214   -3.675214
## 10       brandClHEM  -20.194030  -20.194030</code></pre>
<p>Pretty impressive, huh?</p>
<p>Well, of course, I have massaged the example a bit. The model we have been looking at is a pretty extreme case. Almost all the explanatory variables are ‘lifts’ from one of the dimension tables. So the model is almost a product of two completely separate models. I actually had to sneak in the variable ‘CompE10’ (with all the problems, endogeneity, inconsistency, this variable brings, were we to really try to interpret this model) in order to make the example a truly non-product one.</p>
<p>Still, this approach is worthwile in many cases, and I will soon publish a more elaborate example, including how to deal with chunked and partitioned data, as well as fixed effects.</p>
</div>



<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>

</body>
</html>

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors