Skip to content

sbisaacson/r-notes

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 

Repository files navigation

R notes

Introduction

This document is just a grab-bag of notes and gripes for R.

suppressPackageStartupMessages({
    library("Matrix")
    library("knitr")
    library("printr")
    library("data.table")
    library("dplyr")       # masks data.table::{between,last}, etc.
    library("magrittr")
    library("tidyr")       # masks magrittr::extract, Matrix::expand
    library("assertthat")
    library("testthat")    # masks magrittr::{equals,is_less_than,not}
})

set.seed(11235)

Pitfalls

R is huge

R has a bewildering number of functions already reserved at startup:

## NB. get is not vectorized.

data_frame(package = paste0("package:",
               c("base", "methods", "datasets", "utils",
                 "grDevices", "graphics", "stats", "Matrix")),
           name = lapply(package, ls)) %>%
    unnest(name) %>%
    filter(vapply(seq_len(nrow(.)),
               function (r) is.function(get(.$name[r], .$package[r])),
               logical(1))) %>%
    group_by(package) %>%
    tally %>%
    arrange(desc(n))
package n
package:base 1193
package:stats 445
package:methods 216
package:utils 203
package:Matrix 116
package:grDevices 104
package:graphics 87

If you want to do something, it is probably in base somewhere. "A month in the lab can save an hour in the library." Try help(package = base) before writing your own code.

Ambiguous border cases

It is a commonplace that you should use seq_len(n) instead of 1:n because of the behavior when n == 0:

1:0
## [1] 1 0
seq_len(0)
## integer(0)

Many R functions try to guess your intention based on the length or dimensions of the argument, which can lead to odd bugs. Compare the following invocations of diag:

list(diag(3), diag(3, 4), diag(c(3, 4)))
## [[1]]
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
## 
## [[2]]
##      [,1] [,2] [,3] [,4]
## [1,]    3    0    0    0
## [2,]    0    3    0    0
## [3,]    0    0    3    0
## [4,]    0    0    0    3
## 
## [[3]]
##      [,1] [,2]
## [1,]    3    0
## [2,]    0    4

Indexing

If x is a vector and you index x with a logical vector v, R will extract the entries of x for which the corresponding entry of v (recycled if necessary) is TRUE:

logical_indexing <- rnorm(12)
logical_indexing
##  [1] -0.297055910 -1.423666458 -0.216315264 -0.005881971 -0.859138737
##  [6] -2.095621078 -0.265722558 -0.103193839 -0.799971279  0.514216402
## [11] -0.071860783  0.700654880
logical_indexing[TRUE]
##  [1] -0.297055910 -1.423666458 -0.216315264 -0.005881971 -0.859138737
##  [6] -2.095621078 -0.265722558 -0.103193839 -0.799971279  0.514216402
## [11] -0.071860783  0.700654880
logical_indexing[c(TRUE, FALSE, FALSE)]
## [1] -0.297055910 -0.005881971 -0.265722558  0.514216402

Be careful: if you accidentally convert these logical values to integers, then R will use integer indexing:

logical_indexing[1]
## [1] -0.2970559
logical_indexing[c(1, 0, 0)]
## [1] -0.2970559
all(c(1, 0, 0) == c(TRUE, FALSE, FALSE))
## [1] TRUE

Beware of indexing by 0 in general. You might object that the last line is disingenuous, as we should be using identical or perhaps all.equal:

list(identical(c(1, 0, 0), c(TRUE, FALSE, FALSE)),
     all.equal(c(1, 0, 0), c(TRUE, FALSE, FALSE)))
## [[1]]
## [1] FALSE
## 
## [[2]]
## [1] "Modes: numeric, logical"              
## [2] "target is numeric, current is logical"

Fair enough. So TRUE == 1, but TRUE and 1 are not identical, nor are they "nearly equal."

Encoding

Whenever reading or writing data, make sure to specify the encoding (so pass encoding = "latin1" or encoding = "UTF-8" to read.csv or file). Your code may work fine on Windows but fail on UNIX, or vice versa.

unwind-protect

Use on.exit to avoid resource leaks.

Missing functions

Even though R has thousands of functions, it is missing hypot (which is not trivial to implement) and a few other numerical niceties. As it happens, though, hypot is available for matrix arguments with norm(..., type = "F"):

c(norm(matrix(c(1e300, 2e300)), "F"), sqrt(sum(c(1e300, 2e300) ^ 2)))
## [1] 2.236068e+300           Inf

It a bit clumsy, but it avoids overflow.

String stuff

Avoid using native R string functions in favor of stringr whenever possible. Does this behavior make sense?

data_frame(string = c("", "-", "/", "-/", "/-", "-/-"),
           `base pieces` =
               sapply(base::strsplit(string, "/"), length),
           `stringr pieces` =
               sapply(stringr::str_split(string, "/"), length)) %>%
    mutate(string = sprintf("\"%s\"", encodeString(string)))
string base pieces stringr pieces
"" 0 1
"-" 1 1
"/" 1 2
"-/" 1 2
"/-" 2 2
"-/-" 2 2

Vectors and arrays

R really has two kinds of 1D vectors. An expression such as 1:5 has no dim attribute, but you could set it if you want. I think having no dim attribute for vanilla vectors is a strange design feature and I've never really understood what the purpose of the distinction between length and dim is supposed to be in the one-dimensional case. R does not have scalars (which would be arrays with zero-length dimension attributes). It will try to coerce a dimensionless vector to what its designers envisioned as the most sensible interpretation, but it depends on the context. Usually it just makes a column vector (which, by the way, matrix with no nrow or ncol arguments will do). But sometimes it makes a row vector: for instance, we have

print(1:5 %*% 1:5)
##      [,1]
## [1,]   55

The first 1:5 becomes a row vector and the second a column vector. But if we coerce the first argument to a column vector, suddenly the second becomes a row:

print(matrix(1:5) %*% 1:5)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    2    4    6    8   10
## [3,]    3    6    9   12   15
## [4,]    4    8   12   16   20
## [5,]    5   10   15   20   25

The documentation says

Since R 3.2.0, promotion of a vector to a 1-row or 1-column matrix
happens in even more cases, when one of the two choices allows x
and y to get conformable dimensions.

This is almost guaranteed to blow up if you write code that works with matrices and you plug in something with only one row or one column, since drop = TRUE is the default when indexing.

Lists and NULL

I like to think of lists as arrays of boxed data. The [ function applied to a list acts like vector indexing and returns a list, while [[ only accepts a single index and unboxes the entry of the list. You can't delete entries from numeric vectors with NULL assignment, but assigning NULL to a list index deletes entries:

example_list <- structure(as.list(1:10), names = letters[1:10])
example_list[1:3] <- NULL
example_list
## $d
## [1] 4
## 
## $e
## [1] 5
## 
## $f
## [1] 6
## 
## $g
## [1] 7
## 
## $h
## [1] 8
## 
## $i
## [1] 9
## 
## $j
## [1] 10

To store a NULL entry, you need to create a list first:

example_list[c(1, 4)] <- list(NULL)
example_list
## $d
## NULL
## 
## $e
## [1] 5
## 
## $f
## [1] 6
## 
## $g
## NULL
## 
## $h
## [1] 8
## 
## $i
## [1] 9
## 
## $j
## [1] 10

You might expect, based on the analogy with vectors, that [[ lets you avoid the list invocation, but it also deletes entries:

example_list[[1]] <- NULL
example_list
## $e
## [1] 5
## 
## $f
## [1] 6
## 
## $g
## NULL
## 
## $h
## [1] 8
## 
## $i
## [1] 9
## 
## $j
## [1] 10

The [[ function accepts vector arguments for lists of lists:

example_list <- lapply(seq_len(10), function (x) as.list(x ^ seq_len(x)))
example_list[[c(4, 3)]]
## [1] 64

Lists can have dimension attributes (I've never seen anyone use this in real code) and then you can use array indexing:

example_list <- array(as.list(1:12), c(2, 2, 3))
example_list[2, 1, 3]
## [[1]]
## [1] 10
example_list[[2, 1, 3]]
## [1] 10

Closures and garbage collection

I learned about this issue on the Win-Vector Blog. Closures in R are not really closures because they retain the entire environment of their definition, including things you wouldn't expect to be accessible:

make_accumulator <- function () {
    garbage <- rnorm(65536)
    x <- 0
    function (y) {
        x <<- x + sum(y)
        x
    }
}

test_accum <- make_accumulator()
test_accum(5)
## [1] 5
test_accum(10)
## [1] 15
object.size(test_accum)
## 5736 bytes

So far, so good. But in the documentation for object.size, we see

 Associated space (e.g., the environment of a function and what the
 pointer in a 'EXTPTRSXP' points to) is not included in the
 calculation.

In fact, we can still get garbage:

str(get("garbage", environment(test_accum)))
##  num [1:65536] -0.377 -0.685 0.831 1.51 0.556 ...

The true size is exposed with serialize:

object.size(serialize(test_accum, NULL))
## 530784 bytes

Data frames

You can store matrices and arrays in data frames. However, you can't create data frames with data.frame or data_frame that way:

array_frame <- data.frame(x = 1:10)
array_frame$m <- array(rnorm(60), c(10, 3, 2))

names(array_frame)
## [1] "x" "m"
names(data.frame(x = 1:10, m = array(rnorm(60), c(10, 3, 2))))
## [1] "x"   "m.1" "m.2" "m.3" "m.4" "m.5" "m.6"

Also, the [ operator doesn't work with these the way you might expect:

list(array_frame[1, "m"], array_frame$m[1, , ])
## [[1]]
## [1] -1.202653
## 
## [[2]]
##            [,1]       [,2]
## [1,] -1.2026530 -0.7556250
## [2,] -0.3581968 -0.9711227
## [3,] -0.3830408 -0.6716219

The dplyr package will not let you work with these exotic data frames, and data.table won't let you create them (as far as I know). You could convert the column to a list:

array_frame$m <- array_frame %$%
    vapply(seq_len(nrow(m)), function (r) list(m[r, , ]), list(1))
array_frame[1, "m"]
## [[1]]
##            [,1]       [,2]
## [1,] -1.2026530 -0.7556250
## [2,] -0.3581968 -0.9711227
## [3,] -0.3830408 -0.6716219

In general, lists in data frames are a very useful idiom. However, in this case by using a boxed structure you lose all the advantages of array storage that you had in the first place.

Vectorized functions

Always use rowSums, colSums, rowMeans, and colMeans instead of the equivalent apply call. Oddly enough, this seems to be it. There is no vectorized version of cumsum.

Array permutations

I can never remember which way array permutations act.

test_array <- array(seq_len(48), dim = c(2, 2, 4, 3))
test_array_perm <- aperm(test_array, perm = c(2, 3, 4, 1))

assert_that(all(dim(test_array_perm) == c(2, 4, 3, 2)))

## Inline matrix version of expand.grid:

`%g%` <- function (m1, m2) {
    if (is.null(dim(m1))) dim(m1) <- c(length(m1), 1)
    if (is.null(dim(m2))) dim(m2) <- c(length(m2), 1)
    n1 <- nrow(m1)
    n2 <- nrow(m2)
    cbind(m1[rep(seq_len(n1), n2), ],
          m2[rep(seq_len(n2), each = n1), ])
}

test_indices <- 1:2 %g% 1:2 %g% 1:4 %g% 1:3

assert_that(all(test_array[test_indices] == seq_len(48)))
assert_that(test_array[1, 2, 4, 3] == test_array_perm[2, 4, 3, 1])

## More generally:

assert_that(all(test_array[test_indices] ==
                test_array_perm[test_indices[, c(2, 3, 4, 1)]]))

This is even confusing in the discussion of |: in J for C programmers: "I wish I could give you an intuitive definition but I can't."

Incidentally, if you have an integer permutation v, then order(v) is its inverse (i.e., if v is the integers 1 through n rearranged, then v[order(v)] == order(v)[v] == seq_len(n)).

Apply and margins

Be careful with apply(X, MARGIN, FUN, ...) when FUN returns a vector. The documentation explains:

If each call to 'FUN' returns a vector of length 'n', then 'apply'
returns an array of dimension 'c(n, dim(X)[MARGIN])' if 'n > 1'.
If 'n' equals '1', 'apply' returns a vector if 'MARGIN' has length
1 and an array of dimension 'dim(X)[MARGIN]' otherwise.  If 'n' is
'0', the result has length 0 but not necessarily the 'correct'
dimension.

In fact, if MARGIN is a permutation of the indices and FUN is the identity, apply acts the same as aperm:

test_array <- array(rnorm(48), c(2, 2, 4, 3))
assert_that(all(aperm(test_array, c(4, 2, 1, 3)) ==
                    apply(test_array, c(4, 2, 1, 3), function (x) x)))

What happens to the non-MARGIN axes? The key to remember is that apply doesn't act in place, so when you use the row as the margin, the result appears transposed:

test_matrix <- matrix(1:12, 3, 4)
list(original = test_matrix,
     row_margin = apply(test_matrix, 1, cumsum),
     col_margin = apply(test_matrix, 2, cumsum))
## $original
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## 
## $row_margin
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    5    7    9
## [3,]   12   15   18
## [4,]   22   26   30
## 
## $col_margin
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    3    9   15   21
## [3,]    6   15   24   33

This is in contrast to, say, J (which I don't know at all!). J has fewer surprises, but is difficult to read:

   |: 1 + i. 4 3
1 4 7 10
2 5 8 11
3 6 9 12
   +/\ |: 1 + i. 4 3
1  4  7 10
3  9 15 21
6 15 24 33
   +/\"1 |: 1 + i. 4 3
1 5 12 22
2 7 15 26
3 9 18 30
   2 # |: 1 + i. 4 3
1 4 7 10
1 4 7 10
2 5 8 11
2 5 8 11
3 6 9 12
3 6 9 12
   2 #"1 |: 1 + i. 4 3
1 1 4 4 7 7 10 10
2 2 5 5 8 8 11 11
3 3 6 6 9 9 12 12

Random number generation

Use this to evaluate an expression while temporarily reseeding the RNG:

with_seed <- function (seed, expr) {
    save_seed <-
        if (exists(".Random.seed", .GlobalEnv)) .Random.seed else NULL
    on.exit(
        if (!is.null(save_seed))
            assign(".Random.seed", save_seed, envir = .GlobalEnv),
        add = TRUE)
    set.seed(seed)
    eval(expr, envir = parent.frame())
}

assert_that(
    all(with_seed(1000, rnorm(9)[c(1:6, 1:3, 7:9)]) ==
            with_seed(1000, c(rnorm(6), with_seed(1000, rnorm(3)), rnorm(3)))))

Histograms in data.table

I almost always use dplyr instead of data.table. This is one task that's awkward with dplyr:

local({
    x <- data.table(z = with_seed(12345, rnorm(1000)), key = "z")
    x_breaks <- data.table(z = seq(min(x$z), max(x$z), length.out = 10), key = "z")
    x_breaks[x, .(count = .N), roll = TRUE, by = z]
})
z count
-2.7783255 21
-2.0995412 52
-1.4207569 132
-0.7419726 249
-0.0631883 260
0.6155961 186
1.2943804 68
1.9731647 28
2.6519490 3
3.3307333 1

After working with dplyr and data.table for a while, though, you forget what base R has to offer:

with_seed(12345, rnorm(1000)) %>%
    cut(breaks = c(seq(min(.), max(.), length.out = 10), Inf),
        include.lowest = TRUE, right = FALSE) %>%
    table
[-2.78,-2.1) [-2.1,-1.42) [-1.42,-0.742) [-0.742,-0.0632) [-0.0632,0.616) [0.616,1.29) [1.29,1.97) [1.97,2.65) [2.65,3.33) [3.33,Inf]
21 52 132 249 260 186 68 28 3 1
The semantics of this operation are slightly different than a rolling
join (which is more efficient when everything is sorted, but that's
another issue…).

Linear algebra

Cholesky decompositions

For some reason, R has chol2inv but not chol2solve.

chol2solve <- function (R, b) {
    assert_that(is.matrix(R))
    if (to_drop <- is.null(dim(b))) b <- matrix(b)
    assert_that(nrow(b) == nrow(R))
    pivot <- attr(R, "pivot")
    w <- if (is.null(pivot)) {
        backsolve(R, backsolve(R, b, transpose = TRUE))
    } else {
        assert_that(attr(R, "rank") == nrow(R))
        inv_pivot <- order(pivot)
        backsolve(R, backsolve(R, b[pivot, , drop = FALSE],
                               transpose = TRUE))[inv_pivot, ]
    }
    if (to_drop) w <- drop(w)
    w
}

test_that("chol2solve handles linear systems", {
    m <- matrix(c(14, -11, 5, -11, 17, -5, 5, -5, 13), 3, 3)
    b <- c(1, 2, 5)
    expect_that(chol2solve(chol(m), b), equals(solve(m, b)))
})

test_that("chol2solve handles pivots", {
    m <- matrix(c(14, -11, 5, -11, 17, -5, 5, -5, 13), 3, 3)
    b <- c(1, 2, 5)
    m_ch <- chol(m, pivot = TRUE)
    expect_that(attr(m_ch, "pivot"), equals(c(2, 3, 1)))
    expect_that(chol2solve(m_ch, b), equals(solve(m, b)))
})

It almost goes without saying that you should never invert matrices (never write solve(a), and always use solve(a, v) in favor of solve(a) %*% v). The chol2inv function makes an appearance in summary.lm where it is used to compute standard errors of regression coefficients.

crossprod

Use crossprod(x, y) in favor of t(x) %*% y. There is an analogous function tcrossprod(x, y) to replace x %*% t(y).

QR decompositions

The R documentation on QR decompositions is worth reading. But it never, as far as I know, explains the content of the qraux vector in the output of qr. At some point I dug into the source code for the QR decomposition routines (look at, say, src/appl/dqrdc2.f), which contain an ominous comment that dates back to 1997:

c     i am very nervous about modifying linpack code in this way.
c     if you are a computational linear algebra guru and you really
c     understand how to solve this problem please feel free to
c     suggest improvements to this code.

As it happens, qraux stores different data depending on whether you use LAPACK or LINPACK routines. The i-th Householder vector is stored in the last n - i entries of the i-th column of qr. The first i - 1 entries of the Householder vector are zero. Here's where the LINPACK and LAPACK decompositions in R differ:

  • In LINPACK (the default), the i-th entry of the Householder vector is stored in qraux[i] and β is 1 / qraux[i].
  • When you use LAPACK = TRUE, the i-th entry of the Householder vector is 1 and β is qraux[i].

In R notation, the Householder transformation associated to a pair (beta, v) is the function

function (w) w - beta * v %*% crossprod(v, w)

For details on the algorithm, see Golub and Van Loan.

Math

Logit and inverse logit

For the inverse logit, there is no need to write 1 / (1 + exp(-x)). Use plogis (in the stats package, which is loaded by default) or boot::inv.logit (which just calls plogis). For the logit, use stats::qlogis or boot::logit (which is just an alias). There is no real numerical reason to do this, but the internal R functions are guaranteed to handle NaN and infinity correctly, and the functions are more readable.

Computing upper tails

Never write 1 - pfoobar(...). All of the distribution functions accept an optional lower.tail so 1 - pfoobar(...) can be replaced by pfoobar(..., lower.tail = FALSE). It looks like more typing, but it is more precise due to the potential catastrophic cancellation:

c(1 - pnorm(10), pnorm(10, lower.tail = FALSE))
## [1] 0.000000e+00 7.619853e-24

If you find yourself writing log(pfoobar(...)), use pfoobar(..., log = TRUE) instead.

Useful miscellaneous utilities

When I miss MATLAB:

ones <- function (...) array(1, dim = c(...))

This is convenient when combined with %o%, but indexing a vector by an array drops dimensions, so it isn't as flexible.

Here is a useful utility for viewing a huge amount of data in the console:

pp <- function (x) page(x, method = "print", max = 99999L)

ESS and EMACS

This isn't really R stuff per se, but my notes editing R code. Here's my ESS config from (some version of) my .emacs:

(defun sbi/ess-setup ()
  "Set up ESS customizations."
  (ess-toggle-underscore nil)
  (ess-add-style 'sbi-custom
                 '((ess-indent-level . 4)
                   (ess-first-continued-statement-offset . 4)
                   (ess-continued-statement-offset . 0)
                   (ess-brace-offset . 0)
                   (ess-arg-function-offset . 4)
                   (ess-arg-function-offset-new-line . '(4))
                   (ess-expression-offset . 4)
                   (ess-else-offset . 0)
                   (ess-close-brace-offset . 0))))

(use-package ess-site
  :ensure ess
  :commands R
  :mode (("\\.[rR]\\'" . R-mode)
         ("\\.Rnw\\'" . Rnw-mode))
  :config (sbi/ess-setup))

Some people really don't like underscores because you have to press the shift key to type them, and ESS has helpfully bound that to <-, so you have to hit shift-hyphen twice. Or you could just disable this feature: that's the (ess-toggle-underscore nil). Google recommends using full stops or camelCase. But full stops in names interfere with the S3 class system. Hadley Wickham recommends underscores, but beware of this convention. Some of his functions only differ in name from core R by replacing . with _, but differ in behavior—data.frame has stringsAsFactors = TRUE by default, while dplyr::data_frame has stringsAsFactors = FALSE (as it should have been). I go for underscores on this one. There are no S3 surprises and We_the_people_of_the_United_States is a lot more readable than inOrderToFormAMorePerfectUnion.

ESS by default will indent magrittr/dplyr pipelines in a charming echelon fashion:

mtcars %>%
    group_by(cyl, gear) %>%
        summarise(mean(wt), mean(hp)) %>%
            ungroup %>%
                arrange(`mean(wt)`)

You have to mess with ess-continued-statement-offset and ess-first-continued-statement-offset to get something sane like

mtcars %>%
    group_by(cyl, gear) %>%
    summarise(mean(wt), mean(hp)) %>%
    ungroup %>%
    arrange(`mean(wt)`)

Useful links

  1. Hadleyverse:
  2. Advanced R
  3. R packages
  4. ggplot2
  5. github link
  6. R project
  7. R internals
  8. R language definition
  9. Writing R extensions
  10. github R mirror
  11. Rcpp
  12. Armadillo
  13. Rcpp documentation
  14. RcppArmadillo
  15. RcppEigen
  16. Netlib
  17. BLAS
  18. LAPACK
  19. data.table
  20. wiki
  21. cheat sheet
  22. Useful packages
  23. zoo
  24. Yihui Xie's github
  25. Stan (Bayesian modeling)
  26. Stan project
  27. Manual

About

Notes on the R programming language

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors