borva/lmDecomp1
Folders and files
| Name | Name | Last commit date | ||
|---|---|---|---|---|
Repository files navigation
<!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta charset="utf-8" /> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <meta name="generator" content="pandoc" /> <meta name="viewport" content="width=device-width, initial-scale=1"> <meta name="author" content="BV" /> <meta name="date" content="2017-09-30" /> <title>Linear models: Pushing down and speeding up</title> <style type="text/css">code{white-space: pre;}</style> <style type="text/css"> div.sourceCode { overflow-x: auto; } table.sourceCode, tr.sourceCode, td.lineNumbers, td.sourceCode { margin: 0; padding: 0; vertical-align: baseline; border: none; } table.sourceCode { width: 100%; line-height: 100%; background-color: #f8f8f8; } td.lineNumbers { text-align: right; padding-right: 4px; padding-left: 4px; color: #aaaaaa; border-right: 1px solid #aaaaaa; } td.sourceCode { padding-left: 5px; } pre, code { background-color: #f8f8f8; } code > span.kw { color: #204a87; font-weight: bold; } /* Keyword */ code > span.dt { color: #204a87; } /* DataType */ code > span.dv { color: #0000cf; } /* DecVal */ code > span.bn { color: #0000cf; } /* BaseN */ code > span.fl { color: #0000cf; } /* Float */ code > span.ch { color: #4e9a06; } /* Char */ code > span.st { color: #4e9a06; } /* String */ code > span.co { color: #8f5902; font-style: italic; } /* Comment */ code > span.ot { color: #8f5902; } /* Other */ code > span.al { color: #ef2929; } /* Alert */ code > span.fu { color: #000000; } /* Function */ code > span.er { color: #a40000; font-weight: bold; } /* Error */ code > span.wa { color: #8f5902; font-weight: bold; font-style: italic; } /* Warning */ code > span.cn { color: #000000; } /* Constant */ code > span.sc { color: #000000; } /* SpecialChar */ code > span.vs { color: #4e9a06; } /* VerbatimString */ code > span.ss { color: #4e9a06; } /* SpecialString */ code > span.im { } /* Import */ code > span.va { color: #000000; } /* Variable */ code > span.cf { color: #204a87; font-weight: bold; } /* ControlFlow */ code > span.op { color: #ce5c00; font-weight: bold; } /* Operator */ code > span.pp { color: #8f5902; font-style: italic; } /* Preprocessor */ code > span.ex { } /* Extension */ code > span.at { color: #c4a000; } /* Attribute */ code > span.do { color: #8f5902; font-weight: bold; font-style: italic; } /* Documentation */ code > span.an { color: #8f5902; font-weight: bold; font-style: italic; } /* Annotation */ code > span.cv { color: #8f5902; font-weight: bold; font-style: italic; } /* CommentVar */ code > span.in { color: #8f5902; font-weight: bold; font-style: italic; } /* Information */ </style> </head> <body> <h1 class="title toc-ignore">Linear models: Pushing down and speeding up</h1> <h4 class="author"><em>BV</em></h4> <h4 class="date"><em>30 September 2017</em></h4> <div id="this-is-the-github-page-accompanying-the-blog-article-www.quantitative-consulting.eublog4.html" class="section level3"> <h3>This is the GitHub page accompanying the blog article <a href="www.quantitative-consulting.eu/blog4.html">www.quantitative-consulting.eu/blog4.html </a></h3> <p>Let’s start with a few observations about linear models. The usual discussion posits a target variable <code>y</code> of length N and design matrix <code>X</code> with p columns (of full rank, say) and N rows. The most common case – unless you are a geneticist – is that <code>X</code> is ‘long’, i.e. we have N >> p. The linear model equation, could be written</p> <div id="section" class="section level4"> <h4><img src="yXb.svg" /></h4> <p>and the least squares solution is obtained by projecting <code>y</code> orthogonally onto the image of <code>X</code> and identifying the corresponding set of parameters</p> </div> <div id="section-1" class="section level4"> <h4><img src="betahat.svg" /></h4> <p>We can test this in a little simulation</p> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">p =<span class="st"> </span><span class="dv">1000</span> N =<span class="st"> </span><span class="dv">10</span>^<span class="dv">5</span> beta =<span class="st"> </span><span class="kw">runif</span>(p, -<span class="dv">20</span>, <span class="dv">20</span>) X <-<span class="st"> </span><span class="kw">matrix</span>(<span class="kw">rnorm</span>(p*N), <span class="dt">ncol =</span> p) y <-<span class="st"> </span>X %*%<span class="st"> </span>beta +<span class="st"> </span><span class="kw">rnorm</span>(N, <span class="dt">sd =</span> <span class="dv">5</span>) <span class="kw">system.time</span>(XtX <-<span class="st"> </span><span class="kw">crossprod</span>(X))[<span class="dv">3</span>]</code></pre></div> <pre><code>## elapsed ## 62.252</code></pre> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">system.time</span>(Xty <-<span class="st"> </span><span class="kw">drop</span>(<span class="kw">crossprod</span>(X,y)))[<span class="dv">3</span>]</code></pre></div> <pre><code>## elapsed ## 0.315</code></pre> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">system.time</span>(beta_est <-<span class="st"> </span><span class="kw">solve</span>(XtX, Xty))[<span class="dv">3</span>]</code></pre></div> <pre><code>## elapsed ## 0.352</code></pre> <p>Classical treatments of the least squares solution focus on decomposing <code>X</code> in a suitable manner to address both, the crossproduct and the inversion. This obscures a little bit the fact that these are two distinct data steps:</p> <ul> <li>a ‘small data’ step: Even with our hypothetical <code>p = 1000</code> coefficients, the inversion only takes a few seconds</li> <li>a ‘big data’ step: The <em>crossproduct</em> is what we should worry about much more for typical large <code>X</code>!</li> </ul> </div> </div> <div id="chunking-and-parallelising" class="section level3"> <h3>Chunking and parallelising</h3> <p>Our example case (large <code>X</code> with N >> p) is ideal for chunking and parallelising. Chunking is made easy by packages like <code>bigglm</code> and <code>speedglm</code> that provide the infrastructure for handling chunked data. Also note that the crossproducts are ‘embarrassingly parallel’, i.e. can be calculated via an appropriate ‘lapply’-ification:</p> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(parallel) X_chunked <-<span class="st"> </span><span class="kw">lapply</span>(<span class="kw">split</span>(X, (<span class="dv">1</span>:N)%%<span class="dv">9</span>) , matrix, <span class="dt">ncol =</span> p) <span class="kw">system.time</span>(XtX_chunked <-<span class="st"> </span><span class="kw">mclapply</span>(X_chunked, crossprod, <span class="dt">mc.cores =</span> <span class="dv">3</span>))[<span class="dv">3</span>]</code></pre></div> <pre><code>## elapsed ## 31.413</code></pre> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">XtX2 <-<span class="st"> </span><span class="dv">0</span>; for(xtx in XtX_chunked) XtX2 <-<span class="st"> </span>XtX2 +<span class="st"> </span>xtx <span class="kw">sum</span>(<span class="kw">abs</span>(XtX-<span class="st"> </span>XtX2)) ## ~ 0</code></pre></div> <pre><code>## [1] 3.239706e-06</code></pre> <p>So, chunking and parallelising helps.</p> <p>To improve things further, one needs to look more closely at the structure of <code>X</code>. One well-known approach is in the <code>MatrixModels</code>-package, which allows to make use of sparsity in <code>X</code>, for instance, when <code>X</code> contains many dummy variables.</p> </div> <div id="special-structures-in-x-star-schema" class="section level3"> <h3>Special structures in <code>X</code> – Star schema</h3> <p>The situation for which I have not immediately found an adequate package, and which I want to discuss in a bit more detail here, is when <code>X</code> is constructed from some kind of database ‘star schema’. As an example, we can take the fuel price data discussed [here].</p> <p>We start with loading the required packages and by defining a small bechmarking function, which will not only measure execution times, but also give a rough idea on memory usage:</p> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## (C) 2017 by Boris Vaillant, Quantitative Consulting ## License is granted to to use this code under the MIT License ## Clear all -- only if you have saved your other projects <span class="kw">rm</span>(<span class="dt">list=</span><span class="kw">ls</span>(<span class="dt">all=</span><span class="ot">TRUE</span>)) <span class="kw">library</span>(tidyverse) <span class="kw">library</span>(stringr) <span class="kw">library</span>(splines) <span class="kw">library</span>(Matrix) <span class="kw">library</span>(MatrixModels) ## a simple benchmarker mstest <-<span class="st"> </span>function(<span class="dt">expr=</span><span class="ot">NULL</span>) { mem <-<span class="st"> </span><span class="kw">gc</span>(<span class="dt">reset =</span> <span class="ot">TRUE</span>)[<span class="dv">2</span>, <span class="kw">c</span>(<span class="dv">2</span>,<span class="dv">6</span>)] st <-<span class="st"> </span><span class="kw">proc.time</span>() expr mem2 <-<span class="st"> </span><span class="kw">gc</span>(<span class="dt">reset =</span> <span class="ot">FALSE</span>)[<span class="dv">2</span>, <span class="kw">c</span>(<span class="dv">2</span>,<span class="dv">6</span>)] -<span class="st"> </span>mem <span class="kw">names</span>(mem2) <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"A_Mem final"</span>, <span class="st">"A_Mem max"</span>) <span class="kw">c</span>(<span class="dt">A_Time =</span><span class="kw">structure</span>(<span class="kw">proc.time</span>()-<span class="st"> </span>st, <span class="dt">class =</span> <span class="st">"proc_time"</span>)[<span class="dv">1</span>:<span class="dv">3</span>], mem2) }</code></pre></div> <p>Now load (and clean) the fuel price data (download it from GitHub [here]). It is a very simple star schema, with fuel prices per station and time as fact table:</p> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">pricesAll <-<span class="st"> </span><span class="kw">readRDS</span>( <span class="st">"Prices_Demo.rds"</span>)%>% <span class="st"> </span><span class="kw">filter</span>(E10 ><span class="st"> </span><span class="dv">700</span>, E10 <<span class="st"> </span><span class="dv">3500</span>, TimeAggID ><span class="st"> </span><span class="dv">26425</span>, ## choose TimeAggID filter to fit to your machine !<span class="kw">is.na</span>(CompE10)) pricesAll</code></pre></div> <pre><code>## # A tibble: 4,859,838 x 4 ## StationID TimeAggID E10 CompE10 ## <int> <int> <int> <int> ## 1 1 26426 1269 1450 ## 2 1 26427 1269 1450 ## 3 1 26428 1269 1450 ## 4 1 26429 1269 1450 ## 5 1 26430 1269 1439 ## 6 1 26431 1269 1400 ## 7 1 26432 1269 1308 ## 8 1 26433 1269 1295 ## 9 1 26434 1269 1290 ## 10 1 26435 1269 1281 ## # ... with 4,859,828 more rows</code></pre> <p>as well as dimension tables for stations and times:</p> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">timesAll <-<span class="st"> </span><span class="kw">readRDS</span>( <span class="st">"E_prtimes.rds"</span>) timesAll</code></pre></div> <pre><code>## # A tibble: 27,528 x 6 ## TimeAggID DateHour DateDay HourCl DayCl Brent ## <int> <int> <int> <fctr> <fctr> <int> ## 1 1 14060800 140608 Hour00 Sun 501 ## 2 2 14060801 140608 Hour01 Sun 501 ## 3 3 14060802 140608 Hour02 Sun 501 ## 4 4 14060803 140608 Hour03 Sun 501 ## 5 5 14060804 140608 Hour04 Sun 501 ## 6 6 14060805 140608 Hour05 Sun 501 ## 7 7 14060806 140608 Hour06 Sun 501 ## 8 8 14060807 140608 Hour07 Sun 501 ## 9 9 14060808 140608 Hour08 Sun 501 ## 10 10 14060809 140608 Hour09 Sun 501 ## # ... with 27,518 more rows</code></pre> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">stationsAll <-<span class="st"> </span><span class="kw">readRDS</span>( <span class="st">"C_stationsAll_final.rds"</span>) ## add natural splines to stations NS <-<span class="st"> </span><span class="kw">as_data_frame</span>( <span class="kw">model.matrix</span>( ~<span class="st"> </span>-<span class="dv">1</span> +<span class="st"> </span><span class="kw">ns</span>(lat, <span class="dt">df =</span> <span class="dv">3</span>):<span class="kw">ns</span>(lng, <span class="dt">df =</span> <span class="dv">3</span>), <span class="dt">data =</span> stationsAll )) <span class="kw">names</span>(NS) <-<span class="st"> </span><span class="kw">paste0</span>(<span class="st">"NS"</span>,<span class="kw">str_replace_all</span>(<span class="kw">names</span>(NS), <span class="st">"</span><span class="ch">\\</span><span class="st">:|ns</span><span class="ch">\\</span><span class="st">(lat, df = </span><span class="ch">\\</span><span class="st">d+</span><span class="ch">\\</span><span class="st">)|ns</span><span class="ch">\\</span><span class="st">(lng, df = </span><span class="ch">\\</span><span class="st">d+</span><span class="ch">\\</span><span class="st">)"</span>, <span class="st">""</span>)) stationsAll <-<span class="st"> </span><span class="kw">bind_cols</span>(stationsAll, NS) stationsAll</code></pre></div> <pre><code>## # A tibble: 14,902 x 15 ## StationID brandCl lng lat isBAB closeBAB NS11 ## <int> <fctr> <dbl> <dbl> <int> <int> <dbl> ## 1 1 Others 10.850848 48.55568 0 0 0.39713860 ## 2 2 BfT 10.920224 49.66159 0 0 0.43709986 ## 3 3 Others 6.948950 51.16919 0 0 0.01966565 ## 4 4 Others 11.451392 48.12189 0 1 0.37167161 ## 5 5 Others 12.520820 48.24147 0 0 0.28931168 ## 6 6 Others 7.586279 48.02442 0 0 0.06587314 ## 7 7 Others 10.261920 47.52460 0 0 0.34155893 ## 8 8 Others 9.424034 47.86324 0 0 0.27201896 ## 9 9 Avia 9.825452 48.00048 0 0 0.31995881 ## 10 10 Others 9.651461 47.82916 0 0 0.29700352 ## # ... with 14,892 more rows, and 8 more variables: NS21 <dbl>, NS31 <dbl>, ## # NS12 <dbl>, NS22 <dbl>, NS32 <dbl>, NS13 <dbl>, NS23 <dbl>, NS33 <dbl></code></pre> </div> <div id="traditional-approach-one-big-design-matrix-x" class="section level3"> <h3>Traditional approach: One big design matrix <code>X</code></h3> <p>Now, let’s say we want to estimate the influence of</p> <ul> <li>the brand of the station</li> <li>the closeness of the station to a highway (via isBAB, closeBAB)</li> <li>the location of the station in Germany (via the natural splines on long / lat)</li> <li>the time of day</li> <li>the average price of Competitors (CompE10 — I’ll say more about this, later)</li> </ul> <p>etc. on the price of E10 fuel.</p> <p>In the traditional approach, we join all the dimension information to the fact table, and run the model on the resulting large table:</p> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">fulldat <-<span class="st"> </span>pricesAll %>% <span class="st"> </span><span class="kw">left_join</span>(<span class="kw">select</span>(timesAll, -DateHour, -DateDay )) %>% <span class="st"> </span><span class="kw">left_join</span>(stationsAll) <span class="kw">object.size</span>(fulldat)/<span class="dv">10</span>^<span class="dv">6</span> ## 622MB</code></pre></div> <pre><code>## 622.1 bytes</code></pre> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">RESULT <-<span class="st"> </span><span class="ot">NULL</span> modfor <-<span class="st"> </span><span class="kw">as.formula</span>( <span class="kw">paste0</span>(<span class="st">"E10 ~ 1 + CompE10 + Brent + isBAB + closeBAB + brandCl +"</span>, <span class="st">"DayCl + HourCl +"</span>, <span class="kw">paste</span>(<span class="kw">names</span>(NS), <span class="dt">collapse =</span> <span class="st">" + "</span>))) msLM <-<span class="st"> </span><span class="kw">mstest</span>(modLM <-<span class="st"> </span><span class="kw">lm</span>( modfor, <span class="dt">data =</span> fulldat)) RESULT[[<span class="st">"trad. LM"</span>]] <-<span class="st"> </span><span class="kw">c</span>(msLM, modLM$coefficients) msLM</code></pre></div> <pre><code>## A_Time.user.self A_Time.sys.self A_Time.elapsed A_Mem final ## 41.616 0.551 42.236 2870.500 ## A_Mem max ## 5058.200</code></pre> <pre><code>## used (Mb) gc trigger (Mb) max used (Mb) ## Ncells 1506557 80.5 7974897 426.0 1506557 80.5 ## Vcells 13599970 103.8 699091880 5333.7 13599970 103.8</code></pre> </div> <div id="exploiting-the-star-schema-lifted-matrices" class="section level3"> <h3>Exploiting the star schema: Lifted matrices</h3> <p>So, how to make this smaller and faster?</p> <p>The main observation we can make is that the parts of the design matrix, that come from the joined dimension matrices, have a very simple form. For example, when looking at the Brent price(which depends on time only)</p> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"> matBrent =<span class="st"> </span><span class="kw">model.Matrix</span>(~<span class="st"> </span>-<span class="dv">1</span> +<span class="st"> </span>Brent, <span class="dt">data =</span> timesAll )</code></pre></div> <p>is the design matrix on the table <code>timesAll</code> and</p> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">indBrent =<span class="st"> </span><span class="kw">as</span>(<span class="kw">list</span>(pricesAll$TimeAggID, <span class="kw">max</span>(timesAll$TimeAggID)), <span class="st">"indMatrix"</span>)</code></pre></div> <p>is the (sparse version of) the ‘index matrix’, representing the join. This matrix has a 1 in position <code>(ro, co)</code> exactly when <code>TimeAggID[ro] == co</code> and zeros otherwise. It is now easy to check that the matrix product of <code>indBrent</code> and <code>matBrent</code> is the lifted / joined matrix on the fact table</p> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">all</span>(indBrent %*%<span class="st"> </span>matBrent ==<span class="st"> </span>timesAll$Brent[pricesAll$TimeAggID]) ## TRUE</code></pre></div> <p>This is a very useful representation, as</p> <pre><code>1. `matBrent` will be comparatively small, 2. crossproducts with the index matrix `indBrent` amount to a simple summing over groups.</code></pre> <p>Furthermore, index matrix algebra is already implemented in the <code>Matrix</code>-package.</p> <p>Let’s call these objects, ‘lifted matrices’. We can actually define them as a simple extension of the classes in <code>Matrix</code>:</p> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setClass</span>(<span class="st">"liftMatrix"</span>, <span class="kw">representation</span>(<span class="dt">mat =</span> <span class="st">"Matrix"</span>, <span class="dt">indI =</span> <span class="st">"indMatrix"</span>))</code></pre></div> <p>The beauty of the ‘Matrix’-package (and the S4-system it is built on) is that we now only need to implement the algebra for this new class as a set of S4-methods, adding to (in one case for the moment: overriding) the existing method-definitions in <code>Matrix</code>:</p> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setMethod</span>(<span class="st">"crossprod"</span>, <span class="kw">signature</span>( <span class="dt">x =</span> <span class="st">"indMatrix"</span>, <span class="dt">y =</span> <span class="st">"missing"</span>), function(x) <span class="kw">Diagonal</span>(<span class="dt">x =</span> <span class="kw">tabulate</span>(x@perm, x@Dim[<span class="dv">2</span>]))) </code></pre></div> <pre><code>## [1] "crossprod"</code></pre> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setMethod</span>(<span class="st">"crossprod"</span>, <span class="kw">signature</span>( <span class="dt">x =</span> <span class="st">"indMatrix"</span>, <span class="dt">y =</span> <span class="st">"indMatrix"</span> ), function(x, y) { l1 <-<span class="st"> </span><span class="kw">as</span>(x, <span class="st">"lMatrix"</span>) l2 <-<span class="st"> </span><span class="kw">as</span>(y, <span class="st">"lMatrix"</span>) <span class="kw">crossprod</span>(l1,l2) })</code></pre></div> <pre><code>## [1] "crossprod"</code></pre> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setMethod</span>(<span class="st">"crossprod"</span>, <span class="kw">signature</span>( <span class="dt">x =</span> <span class="st">"liftMatrix"</span>, <span class="dt">y =</span> <span class="st">"missing"</span>), function(x) <span class="kw">crossprod</span>(x@mat, <span class="kw">crossprod</span>(<span class="kw">crossprod</span>(x@indI) , x@mat)))</code></pre></div> <pre><code>## [1] "crossprod"</code></pre> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setMethod</span>(<span class="st">"crossprod"</span>, <span class="kw">signature</span>( <span class="dt">x =</span> <span class="st">"liftMatrix"</span>, <span class="dt">y =</span> <span class="st">"liftMatrix"</span> ), function(x, y) <span class="kw">crossprod</span>(x@mat, <span class="kw">crossprod</span>(<span class="kw">crossprod</span>(y@indI, x@indI), y@mat)))</code></pre></div> <pre><code>## [1] "crossprod"</code></pre> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setMethod</span>(<span class="st">"crossprod"</span>, <span class="kw">signature</span>( <span class="dt">x =</span> <span class="st">"liftMatrix"</span>, <span class="dt">y =</span> <span class="st">"Matrix"</span> ), function(x, y) { <span class="kw">crossprod</span>(x@mat, <span class="kw">crossprod</span>(x@indI, y)) })</code></pre></div> <pre><code>## [1] "crossprod"</code></pre> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">setMethod</span>(<span class="st">"crossprod"</span>, <span class="kw">signature</span>( <span class="dt">x =</span> <span class="st">"Matrix"</span>, <span class="dt">y =</span> <span class="st">"liftMatrix"</span> ), function(x, y) { <span class="kw">crossprod</span>(x, y@indI) %*%<span class="st"> </span>y@mat })</code></pre></div> <pre><code>## [1] "crossprod"</code></pre> <p>With these definitions, it is now easy to define the ‘lifted’ (or ‘pushdown’, depending on your point of view) version of our fuel price model above, by defining separately the components of the design which stem from the two dimension tables and (for the effect of <code>CompE10</code>) from the fact table itself.</p> <p>I go through the full calculation here for demonstration purposes. I hope to put this into a more user-friendly format, soon.</p> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">msLift <-<span class="st"> </span><span class="kw">mstest</span>({ modforS <-<span class="st"> </span><span class="kw">as.formula</span>( <span class="kw">paste0</span>(<span class="st">"~ 1 + isBAB + closeBAB + brandCl +"</span>, <span class="kw">paste</span>(<span class="kw">names</span>(NS), <span class="dt">collapse =</span> <span class="st">" + "</span>))) XS <-<span class="st"> </span><span class="kw">new</span>(<span class="st">"liftMatrix"</span>, <span class="dt">mat =</span> <span class="kw">model.Matrix</span>(modforS, <span class="dt">data =</span> stationsAll )*<span class="fl">1.0</span>, <span class="dt">indI =</span> <span class="kw">as</span>(<span class="kw">list</span>(pricesAll$StationID, <span class="kw">max</span>(stationsAll$StationID)), <span class="st">"indMatrix"</span>)) modforT <-<span class="st"> </span><span class="kw">as.formula</span>(<span class="kw">paste0</span>(<span class="st">"~ 1 + Brent + DayCl + HourCl"</span>)) XT <-<span class="st"> </span><span class="kw">new</span>(<span class="st">"liftMatrix"</span>, <span class="dt">mat =</span> <span class="kw">model.Matrix</span>(modforT, <span class="dt">data =</span> timesAll )[,-<span class="dv">1</span>]*<span class="fl">1.0</span>, <span class="dt">indI =</span> <span class="kw">as</span>(<span class="kw">list</span>(pricesAll$TimeAggID, <span class="kw">max</span>(timesAll$TimeAggID)), <span class="st">"indMatrix"</span>)) XC <-<span class="st"> </span><span class="kw">Matrix</span>( pricesAll$CompE10, <span class="dt">ncol =</span> <span class="dv">1</span>, <span class="dt">dimnames =</span> <span class="kw">list</span>(<span class="ot">NULL</span>, <span class="st">"CompE10"</span>)) Y <-<span class="st"> </span><span class="kw">Matrix</span>(pricesAll$E10, <span class="dt">ncol =</span> <span class="dv">1</span>) XTXT <-<span class="st"> </span><span class="kw">crossprod</span>(XT) XSXS <-<span class="st"> </span><span class="kw">crossprod</span>(XS) XCXC <-<span class="st"> </span><span class="kw">crossprod</span>(XC) XTXS <-<span class="st"> </span><span class="kw">crossprod</span>(XT, XS) XTXC <-<span class="st"> </span><span class="kw">crossprod</span>(XT, XC) XSXC <-<span class="st"> </span><span class="kw">crossprod</span>(XS, XC) XSY <-<span class="st"> </span><span class="kw">crossprod</span>(XS, Y) XTY <-<span class="st"> </span><span class="kw">crossprod</span>(XT, Y) XCY <-<span class="st"> </span><span class="kw">crossprod</span>(XC, Y) XX <-<span class="st"> </span><span class="kw">rbind</span>(<span class="kw">cbind</span>(XTXT, XTXS, XTXC), <span class="kw">cbind</span>(<span class="kw">t</span>(XTXS), XSXS, XSXC), <span class="kw">cbind</span>(<span class="kw">t</span>(XTXC), <span class="kw">t</span>(XSXC), XCXC)) XY <-<span class="st"> </span><span class="kw">rbind</span>(XTY, XSY, XCY) modLift <-<span class="st"> </span><span class="kw">drop</span>(<span class="kw">solve</span>(XX, XY)) }) RESULT[[<span class="st">"lifted LM"</span>]] <-<span class="st"> </span><span class="kw">c</span>(modLift, msLift)</code></pre></div> <div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">for(nam in <span class="kw">names</span>(RESULT)) RESULT[[nam]]<-<span class="st"> </span> <span class="st"> </span><span class="kw">data.frame</span>( <span class="dt">estimate =</span> RESULT[[nam]]) %>%<span class="st"> </span> <span class="st"> </span><span class="kw">rownames_to_column</span>(<span class="dt">var =</span> <span class="st">"term"</span>) %>% <span class="st"> </span><span class="kw">mutate</span>(<span class="dt">model =</span> nam) resultWide <-<span class="st"> </span><span class="kw">bind_rows</span>(RESULT) %>%<span class="st"> </span>dplyr::<span class="kw">select</span>(model, term, estimate) %>% <span class="st"> </span><span class="kw">spread</span>(model, estimate) <span class="kw">head</span>(resultWide, <span class="dv">10</span>)</code></pre></div> <pre><code>## term lifted LM trad. LM ## 1 A_Mem final 84.000000 2870.500000 ## 2 A_Mem max 2103.000000 5058.200000 ## 3 A_Time.elapsed 3.507000 42.236000 ## 4 A_Time.sys.self 0.320000 0.551000 ## 5 A_Time.user.self 3.174000 41.616000 ## 6 brandClAral 23.329966 23.329966 ## 7 brandClAvia -5.360235 -5.360235 ## 8 brandClBfT -17.300784 -17.300784 ## 9 brandClEsso -3.675214 -3.675214 ## 10 brandClHEM -20.194030 -20.194030</code></pre> <p>Pretty impressive, huh?</p> <p>Well, of course, I have massaged the example a bit. The model we have been looking at is a pretty extreme case. Almost all the explanatory variables are ‘lifts’ from one of the dimension tables. So the model is almost a product of two completely separate models. I actually had to sneak in the variable ‘CompE10’ (with all the problems, endogeneity, inconsistency, this variable brings, were we to really try to interpret this model) in order to make the example a truly non-product one.</p> <p>Still, this approach is worthwile in many cases, and I will soon publish a more elaborate example, including how to deal with chunked and partitioned data, as well as fixed effects.</p> </div> <!-- dynamically load mathjax for compatibility with self-contained --> <script> (function () { var script = document.createElement("script"); script.type = "text/javascript"; script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"; document.getElementsByTagName("head")[0].appendChild(script); })(); </script> </body> </html>