This document is a grab-bag of notes and gripes about R.
suppressPackageStartupMessages({
library("Matrix")
library("knitr")
library("printr")
library("data.table")
library("dplyr") # masks data.table::{between,last}, etc.
library("magrittr")
library("tidyr") # masks magrittr::extract, Matrix::expand
library("assertthat")
library("testthat") # masks magrittr::{equals,is_less_than,not}
})
set.seed(11235)

R has a bewildering number of functions already reserved at startup:
## NB. get is not vectorized.
data_frame(package = paste0("package:",
c("base", "methods", "datasets", "utils",
"grDevices", "graphics", "stats", "Matrix")),
name = lapply(package, ls)) %>%
unnest(name) %>%
filter(vapply(seq_len(nrow(.)),
function (r) is.function(get(.$name[r], .$package[r])),
logical(1))) %>%
group_by(package) %>%
tally %>%
arrange(desc(n))

| package | n |
|---|---|
| package:base | 1193 |
| package:stats | 445 |
| package:methods | 216 |
| package:utils | 203 |
| package:Matrix | 116 |
| package:grDevices | 104 |
| package:graphics | 87 |
If you want to do something, it is probably in base somewhere. "A
month in the lab can save an hour in the library." Try help(package = base) before writing your own code.
It is a commonplace that you should use seq_len(n) instead of 1:n
because of the behavior when n == 0:
1:0
## [1] 1 0
seq_len(0)
## integer(0)
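
The usual place this bites is a loop over input that can be empty:

n <- 0
for (i in 1:n) print(i)        ## runs twice, printing 1 and then 0
for (i in seq_len(n)) print(i) ## runs zero times
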
Many R functions try to guess your intention based on the length or
dimensions of the argument, which can lead to odd bugs. Compare the
following invocations of diag:
list(diag(3), diag(3, 4), diag(c(3, 4)))
## [[1]]
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
##
## [[2]]
## [,1] [,2] [,3] [,4]
## [1,] 3 0 0 0
## [2,] 0 3 0 0
## [3,] 0 0 3 0
## [4,] 0 0 0 3
##
## [[3]]
## [,1] [,2]
## [1,] 3 0
## [2,] 0 4
If x is a vector and you index x with a logical vector v, R will
extract the entries of x for which the corresponding entry of v
(recycled if necessary) is TRUE:
logical_indexing <- rnorm(12)
logical_indexing
## [1] -0.297055910 -1.423666458 -0.216315264 -0.005881971 -0.859138737
## [6] -2.095621078 -0.265722558 -0.103193839 -0.799971279 0.514216402
## [11] -0.071860783 0.700654880
logical_indexing[TRUE]
## [1] -0.297055910 -1.423666458 -0.216315264 -0.005881971 -0.859138737
## [6] -2.095621078 -0.265722558 -0.103193839 -0.799971279 0.514216402
## [11] -0.071860783 0.700654880
logical_indexing[c(TRUE, FALSE, FALSE)]
## [1] -0.297055910 -0.005881971 -0.265722558 0.514216402
Be careful: if you accidentally convert these logical values to integers, then R will use integer indexing:
logical_indexing[1]
## [1] -0.2970559
logical_indexing[c(1, 0, 0)]
## [1] -0.2970559
all(c(1, 0, 0) == c(TRUE, FALSE, FALSE))
## [1] TRUE
Beware of indexing by 0 in general. You might object that the last
line is disingenuous, as we should be using identical or perhaps
all.equal:
list(identical(c(1, 0, 0), c(TRUE, FALSE, FALSE)),
all.equal(c(1, 0, 0), c(TRUE, FALSE, FALSE)))
## [[1]]
## [1] FALSE
##
## [[2]]
## [1] "Modes: numeric, logical"
## [2] "target is numeric, current is logical"
Fair enough. So TRUE == 1, but TRUE and 1 are not identical, nor
are they "nearly equal."
Whenever reading or writing data, make sure to specify the encoding
(pass fileEncoding = "latin1" or fileEncoding = "UTF-8" to read.csv so
the input is re-encoded, or encoding = "UTF-8" to file). Your code may
work fine on Windows but fail on UNIX, or
vice versa.
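
For example (the file names are hypothetical; read.csv's fileEncoding
converts the bytes as they are read, while encoding only declares how the
resulting strings are marked):

legacy <- read.csv("exported-from-excel.csv", fileEncoding = "latin1")
notes <- readLines(file("notes.txt", encoding = "UTF-8"))
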
Use on.exit to avoid resource leaks.
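
A minimal sketch of the pattern (first_lines is just an illustrative
helper): the connection is closed even if the read fails part-way through.

first_lines <- function (path, n = 10L) {
  con <- file(path, open = "rt", encoding = "UTF-8")
  on.exit(close(con), add = TRUE)
  readLines(con, n = n)
}
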
Even though R has thousands of functions, it is missing hypot (which
is
not trivial to implement)
and a few other numerical niceties. As it happens, though, hypot is
available for matrix arguments with norm(..., type = "F"):
c(norm(matrix(c(1e300, 2e300)), "F"), sqrt(sum(c(1e300, 2e300) ^ 2)))
## [1] 2.236068e+300 Inf
It is a bit clumsy, but it avoids overflow.
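
If you want an elementwise version, here is a minimal overflow-avoiding
sketch (hypot is my name for it, and it is not careful about NaN or Inf):

hypot <- function (x, y) {
  big <- pmax(abs(x), abs(y))
  small <- pmin(abs(x), abs(y))
  ifelse(big == 0, 0, big * sqrt(1 + (small / big) ^ 2))
}
c(hypot(1e300, 2e300), hypot(3, 4))
## [1] 2.236068e+300 5.000000e+00
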
Avoid using native R string functions in favor of stringr whenever
possible. Does this behavior make sense?
data_frame(string = c("", "-", "/", "-/", "/-", "-/-"),
`base pieces` =
sapply(base::strsplit(string, "/"), length),
`stringr pieces` =
sapply(stringr::str_split(string, "/"), length)) %>%
mutate(string = sprintf("\"%s\"", encodeString(string)))

| string | base pieces | stringr pieces |
|---|---|---|
| "" | 0 | 1 |
| "-" | 1 | 1 |
| "/" | 1 | 2 |
| "-/" | 1 | 2 |
| "/-" | 2 | 2 |
| "-/-" | 2 | 2 |
R really has two kinds of 1D vectors. An expression such as 1:5 has
no dim attribute, but you could set it if you want. I think having
no dim attribute for vanilla vectors is a strange design feature and
I've never really understood what the purpose of the distinction
between length and dim is supposed to be in the one-dimensional case.
R does not have scalars (which would be arrays with zero-length
dimension attributes). It will try to coerce a dimensionless vector to
what its designers envisioned as the most sensible interpretation, but
it depends on the context. Usually it just makes a column vector
(which, by the way, matrix with no nrow or ncol arguments will
do). But sometimes it makes a row vector: for instance, we have
print(1:5 %*% 1:5)
## [,1]
## [1,] 55
The first 1:5 becomes a row vector and the second a column vector.
But if we coerce the first argument to a column vector, suddenly the
second becomes a row:
print(matrix(1:5) %*% 1:5)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 2 4 6 8 10
## [3,] 3 6 9 12 15
## [4,] 4 8 12 16 20
## [5,] 5 10 15 20 25
The documentation says
Since R 3.2.0, promotion of a vector to a 1-row or 1-column matrix
happens in even more cases, when one of the two choices allows x
and y to get conformable dimensions.
This is almost guaranteed to blow up if you write code that works with
matrices and you plug in something with only one row or one column,
since drop = TRUE is the default when indexing.
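
A minimal illustration of the trap:

m <- matrix(1:6, 3, 2)
dim(m[1, ])               ## NULL: the row has been demoted to a plain vector
dim(m[1, , drop = FALSE]) ## c(1, 2): still a matrix
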
I like to think of lists as arrays of boxed data. The [ function
applied to a list acts like vector indexing and returns a list, while
[[ only accepts a single index and unboxes the entry of the list.
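
A quick illustration of the boxing analogy:

boxed <- list(a = 1, b = "two", c = 3:5)
boxed["a"]          ## a list of length one (still boxed)
boxed[["a"]]        ## the number 1 itself (unboxed)
boxed[c("a", "c")]  ## vector-style indexing: a list of length two
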
You can't delete entries from numeric vectors with NULL assignment,
but assigning NULL to a list index deletes entries:
example_list <- structure(as.list(1:10), names = letters[1:10])
example_list[1:3] <- NULL
example_list
## $d
## [1] 4
##
## $e
## [1] 5
##
## $f
## [1] 6
##
## $g
## [1] 7
##
## $h
## [1] 8
##
## $i
## [1] 9
##
## $j
## [1] 10
To store a NULL entry, you need to wrap it in a list:
example_list[c(1, 4)] <- list(NULL)
example_list
## $d
## NULL
##
## $e
## [1] 5
##
## $f
## [1] 6
##
## $g
## NULL
##
## $h
## [1] 8
##
## $i
## [1] 9
##
## $j
## [1] 10
You might expect, based on the analogy with vectors, that
[[ lets you avoid the list invocation, but it also deletes entries:
example_list[[1]] <- NULL
example_list
## $e
## [1] 5
##
## $f
## [1] 6
##
## $g
## NULL
##
## $h
## [1] 8
##
## $i
## [1] 9
##
## $j
## [1] 10
The [[ function accepts vector arguments for lists of lists:
example_list <- lapply(seq_len(10), function (x) as.list(x ^ seq_len(x)))
example_list[[c(4, 3)]]
## [1] 64
Lists can have dimension attributes (I've never seen anyone use this in real code) and then you can use array indexing:
example_list <- array(as.list(1:12), c(2, 2, 3))
example_list[2, 1, 3]
## [[1]]
## [1] 10
example_list[[2, 1, 3]]
## [1] 10
I learned about this issue on the Win-Vector Blog. Closures in R capture the entire environment of their definition, not just the variables they use, so they retain things you wouldn't expect to be accessible:
make_accumulator <- function () {
garbage <- rnorm(65536)
x <- 0
function (y) {
x <<- x + sum(y)
x
}
}
test_accum <- make_accumulator()
test_accum(5)
## [1] 5
test_accum(10)
## [1] 15
object.size(test_accum)
## 5736 bytes
So far, so good. But in the documentation for object.size, we see
Associated space (e.g., the environment of a function and what the
pointer in a 'EXTPTRSXP' points to) is not included in the
calculation.
In fact, we can still get garbage:
str(get("garbage", environment(test_accum)))
## num [1:65536] -0.377 -0.685 0.831 1.51 0.556 ...

The true size is exposed with serialize:

object.size(serialize(test_accum, NULL))
## 530784 bytes
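
One workaround, sketched here, is to drop the scratch objects before
returning the closure, so the captured environment holds only what the
function actually needs (make_accumulator_lean is not part of the
original example):

make_accumulator_lean <- function () {
  garbage <- rnorm(65536)
  ## ... any setup that needs `garbage` goes here ...
  rm(garbage)   ## drop the scratch object so the closure does not retain it
  x <- 0
  function (y) {
    x <<- x + sum(y)
    x
  }
}
object.size(serialize(make_accumulator_lean(), NULL))  ## a few kilobytes, not half a megabyte
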
You can store matrices and arrays in data frames. However, you can't
create data frames with data.frame or data_frame that way:
array_frame <- data.frame(x = 1:10)
array_frame$m <- array(rnorm(60), c(10, 3, 2))
names(array_frame)
## [1] "x" "m"
names(data.frame(x = 1:10, m = array(rnorm(60), c(10, 3, 2))))
## [1] "x" "m.1" "m.2" "m.3" "m.4" "m.5" "m.6"
Also, the [ operator doesn't work with these the way you might
expect:
list(array_frame[1, "m"], array_frame$m[1, , ])
## [[1]]
## [1] -1.202653
##
## [[2]]
## [,1] [,2]
## [1,] -1.2026530 -0.7556250
## [2,] -0.3581968 -0.9711227
## [3,] -0.3830408 -0.6716219
The dplyr package will not let you work with these exotic data
frames, and data.table won't let you create them (as far as I know).
You could convert the column to a list:
array_frame$m <- array_frame %$%
vapply(seq_len(nrow(m)), function (r) list(m[r, , ]), list(1))
array_frame[1, "m"]
## [[1]]
## [,1] [,2]
## [1,] -1.2026530 -0.7556250
## [2,] -0.3581968 -0.9711227
## [3,] -0.3830408 -0.6716219
In general, list columns in data frames are a very useful idiom. Here, though, by boxing each row you lose the advantages of the contiguous array storage you had in the first place.
Always use rowSums, colSums, rowMeans, and colMeans instead of
the equivalent apply call. Oddly enough, those four seem to be the only
such shortcuts: there is no built-in row- or column-wise version of
cumsum, for example.
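
If you do need, say, row-wise cumulative sums, you are back to apply;
the transpose undoes apply's habit of laying results out in columns
(row_cumsum is just a throwaway name):

row_cumsum <- function (m) t(apply(m, 1, cumsum))
row_cumsum(matrix(1:12, 3, 4))
## [,1] [,2] [,3] [,4]
## [1,] 1 5 12 22
## [2,] 2 7 15 26
## [3,] 3 9 18 30
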
I can never remember which way array permutations act.
test_array <- array(seq_len(48), dim = c(2, 2, 4, 3))
test_array_perm <- aperm(test_array, perm = c(2, 3, 4, 1))
assert_that(all(dim(test_array_perm) == c(2, 4, 3, 2)))
## Inline matrix version of expand.grid:
`%g%` <- function (m1, m2) {
if (is.null(dim(m1))) dim(m1) <- c(length(m1), 1)
if (is.null(dim(m2))) dim(m2) <- c(length(m2), 1)
n1 <- nrow(m1)
n2 <- nrow(m2)
cbind(m1[rep(seq_len(n1), n2), ],
m2[rep(seq_len(n2), each = n1), ])
}
test_indices <- 1:2 %g% 1:2 %g% 1:4 %g% 1:3
assert_that(all(test_array[test_indices] == seq_len(48)))
assert_that(test_array[1, 2, 4, 3] == test_array_perm[2, 4, 3, 1])
## More generally:
assert_that(all(test_array[test_indices] ==
test_array_perm[test_indices[, c(2, 3, 4, 1)]]))

This is confusing even in the discussion of |: in
J for C programmers:
"I wish I could give you an intuitive definition but I can't."
Incidentally, if you have an integer permutation v, then order(v)
is its inverse (i.e., if v is the integers 1 through n rearranged,
then v[order(v)] == order(v)[v] == seq_len(n)).
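
A quick check:

v <- sample(10)
assert_that(all(v[order(v)] == seq_len(10)),
            all(order(v)[v] == seq_len(10)))
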
Be careful with apply(X, MARGIN, FUN, ...) when FUN returns a
vector. The documentation explains:
If each call to 'FUN' returns a vector of length 'n', then 'apply'
returns an array of dimension 'c(n, dim(X)[MARGIN])' if 'n > 1'.
If 'n' equals '1', 'apply' returns a vector if 'MARGIN' has length
1 and an array of dimension 'dim(X)[MARGIN]' otherwise. If 'n' is
'0', the result has length 0 but not necessarily the 'correct'
dimension.
In fact, if MARGIN is a permutation of the indices and FUN is the
identity, apply acts the same as aperm:
test_array <- array(rnorm(48), c(2, 2, 4, 3))
assert_that(all(aperm(test_array, c(4, 2, 1, 3)) ==
apply(test_array, c(4, 2, 1, 3), function (x) x)))

What happens to the non-MARGIN axes? The key thing to remember is that
apply always lays the results of FUN along the first dimension of its
output, so when you use the row as the margin the result appears
transposed:
test_matrix <- matrix(1:12, 3, 4)
list(original = test_matrix,
row_margin = apply(test_matrix, 1, cumsum),
col_margin = apply(test_matrix, 2, cumsum))
## $original
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
##
## $row_margin
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 5 7 9
## [3,] 12 15 18
## [4,] 22 26 30
##
## $col_margin
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 3 9 15 21
## [3,] 6 15 24 33
This is in contrast to, say, J (which I don't know at all!). J has fewer surprises, but is difficult to read:
|: 1 + i. 4 3
1 4 7 10
2 5 8 11
3 6 9 12
+/\ |: 1 + i. 4 3
1 4 7 10
3 9 15 21
6 15 24 33
+/\"1 |: 1 + i. 4 3
1 5 12 22
2 7 15 26
3 9 18 30
2 # |: 1 + i. 4 3
1 4 7 10
1 4 7 10
2 5 8 11
2 5 8 11
3 6 9 12
3 6 9 12
2 #"1 |: 1 + i. 4 3
1 1 4 4 7 7 10 10
2 2 5 5 8 8 11 11
3 3 6 6 9 9 12 12
Use this to evaluate an expression while temporarily reseeding the RNG:
with_seed <- function (seed, expr) {
save_seed <-
if (exists(".Random.seed", .GlobalEnv)) .Random.seed else NULL
on.exit(
if (!is.null(save_seed))
assign(".Random.seed", save_seed, envir = .GlobalEnv),
add = TRUE)
set.seed(seed)
eval(expr, envir = parent.frame())
}
assert_that(
all(with_seed(1000, rnorm(9)[c(1:6, 1:3, 7:9)]) ==
with_seed(1000, c(rnorm(6), with_seed(1000, rnorm(3)), rnorm(3)))))

I almost always use
dplyr
instead of
data.table.
This is one task that's awkward with dplyr:
local({
x <- data.table(z = with_seed(12345, rnorm(1000)), key = "z")
x_breaks <- data.table(z = seq(min(x$z), max(x$z), length.out = 10), key = "z")
x_breaks[x, .(count = .N), roll = TRUE, by = z]
})

| z | count |
|---|---|
| -2.7783255 | 21 |
| -2.0995412 | 52 |
| -1.4207569 | 132 |
| -0.7419726 | 249 |
| -0.0631883 | 260 |
| 0.6155961 | 186 |
| 1.2943804 | 68 |
| 1.9731647 | 28 |
| 2.6519490 | 3 |
| 3.3307333 | 1 |
After working with dplyr and data.table for a while, though, you
forget what base R has to offer:
with_seed(12345, rnorm(1000)) %>%
cut(breaks = c(seq(min(.), max(.), length.out = 10), Inf),
include.lowest = TRUE, right = FALSE) %>%
table

| [-2.78,-2.1) | [-2.1,-1.42) | [-1.42,-0.742) | [-0.742,-0.0632) | [-0.0632,0.616) | [0.616,1.29) | [1.29,1.97) | [1.97,2.65) | [2.65,3.33) | [3.33,Inf] |
|---|---|---|---|---|---|---|---|---|---|
| 21 | 52 | 132 | 249 | 260 | 186 | 68 | 28 | 3 | 1 |
The semantics of this operation are slightly different than a rolling
join (which is more efficient when everything is sorted, but that's
another issue…).
For some reason, R has chol2inv but not chol2solve.
chol2solve <- function (R, b) {
assert_that(is.matrix(R))
if (to_drop <- is.null(dim(b))) b <- matrix(b)
assert_that(nrow(b) == nrow(R))
pivot <- attr(R, "pivot")
w <- if (is.null(pivot)) {
backsolve(R, backsolve(R, b, transpose = TRUE))
} else {
assert_that(attr(R, "rank") == nrow(R))
inv_pivot <- order(pivot)
backsolve(R, backsolve(R, b[pivot, , drop = FALSE],
transpose = TRUE))[inv_pivot, ]
}
if (to_drop) w <- drop(w)
w
}
test_that("chol2solve handles linear systems", {
m <- matrix(c(14, -11, 5, -11, 17, -5, 5, -5, 13), 3, 3)
b <- c(1, 2, 5)
expect_that(chol2solve(chol(m), b), equals(solve(m, b)))
})
test_that("chol2solve handles pivots", {
m <- matrix(c(14, -11, 5, -11, 17, -5, 5, -5, 13), 3, 3)
b <- c(1, 2, 5)
m_ch <- chol(m, pivot = TRUE)
expect_that(attr(m_ch, "pivot"), equals(c(2, 3, 1)))
expect_that(chol2solve(m_ch, b), equals(solve(m, b)))
})

It almost goes without saying that you should never invert matrices:
never write solve(a); use solve(a, v) instead of solve(a) %*% v. The
chol2inv function makes an appearance in
summary.lm where it is used to compute standard errors of regression
coefficients.
Use crossprod(x, y) instead of t(x) %*% y. There is an analogous
function tcrossprod(x, y) to replace x %*% t(y).
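
A quick check that the shortcuts agree with the spelled-out products:

x <- matrix(rnorm(6), 3, 2)
y <- matrix(rnorm(12), 3, 4)
z <- matrix(rnorm(8), 4, 2)
assert_that(isTRUE(all.equal(crossprod(x, y), t(x) %*% y)),
            isTRUE(all.equal(tcrossprod(x, z), x %*% t(z))))
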
The
R documentation on QR decompositions
is worth reading. But it never, as far as I know, explains the content
of the qraux vector in the output of qr. At some point I dug into
the source code for the QR decomposition routines (look at, say,
src/appl/dqrdc2.f),
which contain an ominous comment that
dates back to 1997:
c i am very nervous about modifying linpack code in this way.
c if you are a computational linear algebra guru and you really
c understand how to solve this problem please feel free to
c suggest improvements to this code.

As it happens, qraux stores different data depending on whether you
use LAPACK or LINPACK routines. The i-th Householder vector is stored
in the last n - i entries of the i-th column of qr. The first
i - 1 entries of the Householder vector are zero. Here's where the
LINPACK and LAPACK decompositions in R differ:
- In LINPACK (the default), the i-th entry of the Householder vector is
  stored in qraux[i] and β is 1 / qraux[i].
- When you use LAPACK = TRUE, the i-th entry of the Householder vector
  is 1 and β is qraux[i].
In R notation, the Householder transformation associated to a pair
(beta, v) is the function
function (w) w - beta * v %*% crossprod(v, w)

For details on the algorithm, see Golub and Van Loan.
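
As a sanity check on the description above, here is a sketch that
rebuilds Q from the default (LINPACK) output; reconstruct_Q is my name
for it, and it assumes a full-rank input with no pivoting:

reconstruct_Q <- function (qr_obj) {
  QR <- qr_obj$qr
  qraux <- qr_obj$qraux
  n <- nrow(QR)
  k <- ncol(QR)
  house <- function (beta, v) function (w) w - beta * v %*% crossprod(v, w)
  M <- diag(n)
  ## Q = H_1 %*% H_2 %*% ... %*% H_k, so apply H_k to the identity first.
  for (i in rev(seq_len(k))) {
    v <- rep(0, n)
    v[i] <- qraux[i]
    if (i < n) v[(i + 1):n] <- QR[(i + 1):n, i]
    M <- house(1 / qraux[i], v)(M)
  }
  M[, seq_len(k), drop = FALSE]
}
test_that("reconstruct_Q matches qr.Q", {
  a <- with_seed(271828, matrix(rnorm(20), 5, 4))
  a_qr <- qr(a)  ## LINPACK is the default
  expect_that(reconstruct_Q(a_qr), equals(qr.Q(a_qr)))
})
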
For the inverse logit, there is no need to write 1 / (1 + exp(-x)).
Use plogis (in the stats package, which is loaded by default) or
boot::inv.logit (which just calls plogis). For the logit, use
stats::qlogis or boot::logit (which is just an alias). There is no
real numerical reason to do this, but the internal R functions are
guaranteed to handle NaN and infinity correctly, and the functions
are more readable.
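
For example, the extremes and NaN propagate the way you would hope:

c(plogis(Inf), plogis(-Inf), plogis(NaN), qlogis(1), qlogis(0))
## [1] 1 0 NaN Inf -Inf
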
Never write 1 - pfoobar(...). All of the
distribution functions accept an optional lower.tail argument, so
1 - pfoobar(...) can be replaced by pfoobar(..., lower.tail = FALSE).
It looks like more typing, but it is more precise because it avoids
catastrophic cancellation:
c(1 - pnorm(10), pnorm(10, lower.tail = FALSE))
## [1] 0.000000e+00 7.619853e-24
If you find yourself writing log(pfoobar(...)), use pfoobar(..., log.p = TRUE) instead (the argument is log.p for the p- and q-functions, log for the densities).
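
Unlike the logit case, here there is a genuine numerical payoff once the
probability underflows:

c(log(pnorm(-40)), pnorm(-40, log.p = TRUE))
## [1] -Inf -804.6084
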
When I miss MATLAB:
ones <- function (...) array(1, dim = c(...))

This is convenient when combined with %o%, but indexing a vector by
an array drops dimensions, so it isn't as flexible.
Here is a useful utility for viewing a huge amount of data in the console:
pp <- function (x) page(x, method = "print", max = 99999L)

This isn't really about R per se, but here are my notes on editing R code.
Here's my ESS config from (some version of) my
.emacs:
(defun sbi/ess-setup ()
"Set up ESS customizations."
(ess-toggle-underscore nil)
(ess-add-style 'sbi-custom
'((ess-indent-level . 4)
(ess-first-continued-statement-offset . 4)
(ess-continued-statement-offset . 0)
(ess-brace-offset . 0)
(ess-arg-function-offset . 4)
(ess-arg-function-offset-new-line . '(4))
(ess-expression-offset . 4)
(ess-else-offset . 0)
(ess-close-brace-offset . 0))))
(use-package ess-site
:ensure ess
:commands R
:mode (("\\.[rR]\\'" . R-mode)
("\\.Rnw\\'" . Rnw-mode))
:config (sbi/ess-setup))

Some people really don't like underscores because you have to press
the shift key to type them, and ESS has helpfully bound the underscore
key to insert <-, so to get a literal underscore you have to hit
shift-hyphen twice. Or you could just disable this feature: that's what
the (ess-toggle-underscore nil) above does.
Google
recommends using full stops or camelCase. But full stops in names
interfere with the S3 class system.
Hadley Wickham recommends
underscores, but beware of this convention. Some of his functions only
differ in name from core R by replacing . with _, but differ in
behavior: data.frame has stringsAsFactors = TRUE by default, while
dplyr::data_frame never converts strings to factors (as it should
be). I go for underscores on this one. There are no S3 surprises and
We_the_people_of_the_United_States is a lot more readable than
inOrderToFormAMorePerfectUnion.
ESS by default will indent magrittr/dplyr pipelines in a charming echelon fashion:
mtcars %>%
    group_by(cyl, gear) %>%
        summarise(mean(wt), mean(hp)) %>%
            ungroup %>%
                arrange(`mean(wt)`)

You have to mess with ess-continued-statement-offset and
ess-first-continued-statement-offset to get something sane like
mtcars %>%
    group_by(cyl, gear) %>%
    summarise(mean(wt), mean(hp)) %>%
    ungroup %>%
    arrange(`mean(wt)`)

- Hadleyverse:
- Advanced R
- R packages
- ggplot2
- github link
- R project
- R internals
- R language definition
- Writing R extensions
- github R mirror
- Rcpp
- Armadillo
- Rcpp documentation
- RcppArmadillo
- RcppEigen
- Netlib
- BLAS
- LAPACK
- data.table
- wiki
- cheat sheet
- Useful packages
- zoo
- Yihui Xie's github
- Stan (Bayesian modeling)
- Stan project
- Manual