Merged
1 change: 0 additions & 1 deletion .github/workflows/R-CMD-check.yaml
@@ -24,7 +24,6 @@ jobs:
config:
- {os: macOS-latest, r: 'release'}
- {os: windows-latest, r: 'release'}
# Use 3.6 to trigger usage of RTools35
- {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'}

env:
4 changes: 4 additions & 0 deletions .gitignore
@@ -8,3 +8,7 @@ Meta
^\revdep$
/doc/
/Meta/
.DS_Store
Rplots.pdf
*.Rcheck/
inst/doc/
5 changes: 2 additions & 3 deletions DESCRIPTION
@@ -1,11 +1,11 @@
Package: miceFast
Title: Fast Imputations Using 'Rcpp' and 'Armadillo'
Version: 0.9.0
Version: 0.9.1
Authors@R: person("Maciej", "Nasinski", email = "nasinski.maciej@gmail.com", role = c("aut", "cre"))
Description:
Fast imputations under the object-oriented programming paradigm.
Moreover there are offered a few functions built to work with popular R packages such as 'data.table' or 'dplyr'.
The biggest improvement in time performance could be achieve for a calculation where a grouping variable have to be used.
The biggest improvement in time performance can be achieved for a calculation where a grouping variable is used.
A single evaluation of a quantitative model for the multiple imputations is another major enhancement.
A new major improvement is one of the fastest predictive mean matching in the R world because of presorting and binary search.
Depends: R (>= 3.6.0)
@@ -20,7 +20,6 @@ Imports:
Suggests:
knitr,
rmarkdown,
pacman,
testthat,
mice,
magrittr,
23 changes: 23 additions & 0 deletions NEWS.md
@@ -1,3 +1,26 @@
# miceFast 0.9.1

## Bug fixes

* PMM returned predicted values instead of observed values (C++): The `pmm` model returned predicted $\hat{y}$ for missing rows instead of the nearest observed $y$ values. Now it follows Little and Rubin (2002).
* PMM with character/factor variables (R): `fill_NA_N()` with `model = "pmm"` and a character dependent variable failed because it attempted `as.numeric()` on non-numeric strings, producing all NAs.
* Character dependent variable with lm models: `fill_NA()` and `fill_NA_N()` with `model = "lm_pred"`, `"lm_bayes"`, or `"lm_noise"` silently returned all NAs when the dependent variable was character with non-numeric labels (e.g., `"apple"`, `"banana"`).
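
To make the first fix concrete, here is a minimal base-R sketch of predictive mean matching. It is not the package's C++ implementation: the helper `pmm_sketch()` and its `k` argument are hypothetical names, and the sketch uses a plain `lm()` fit without the posterior draw, presorting, or binary search that miceFast is described as using. The only point is that each imputed value is an observed `y` taken from one of the nearest donors, never the prediction itself.

```r
# Hypothetical helper: simplified PMM for one predictor, not miceFast's code.
pmm_sketch <- function(y, x, k = 3) {
  miss <- is.na(y)
  dat <- data.frame(y = y, x = x)
  fit <- lm(y ~ x, data = dat[!miss, ])   # fit on complete cases only
  pred <- predict(fit, newdata = dat)     # predictions for every row
  donors_pred <- pred[!miss]
  donors_y <- y[!miss]
  out <- y
  for (i in which(miss)) {
    # the k donors whose predictions are closest to the prediction for row i
    n_donors <- min(k, length(donors_y))
    nearest <- order(abs(donors_pred - pred[i]))[seq_len(n_donors)]
    # pick one donor at random and copy its OBSERVED value
    pick <- nearest[sample.int(length(nearest), 1)]
    out[i] <- donors_y[pick]
  }
  out
}

set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
y[sample(50, 10)] <- NA
imputed <- pmm_sketch(y, x)

# Check that every imputed value is one of the observed y values
all(imputed[is.na(y)] %in% y[!is.na(y)])
```

Returning `pred[i]` instead of `donors_y[pick]` in the loop would reproduce the behaviour described in the bug report.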

## Documentation

* README: added sequential-chain MI examples (dplyr and data.table) showing how to impute multiple variables and pool with Rubin's rules.
* Introduction vignette: added full imputation workflow with sequential ordering (impute variables whose predictors are complete first), FCS (chained equations) section with data.table example, and PMM note for the OOP interface.
* MI vignette: expanded Rubin's rules derivations, added PMM MI example using the OOP interface, expanded "Important caveat" section with OOP and data.table FCS code snippets for non-monotone patterns.
* Documented PMM as a proper MI method throughout vignettes and README.
* Improved prose throughout vignettes and README.

## Tests

* Added 20 PMM-specific tests (`test-pmm.R`): observed-value returns, factor/character support, weighted PMM, grouped data.table, reproducibility, stochasticity.
* Added 31 FCS tests (`test-fcs.R`): data.table, data.frame, and OOP FCS helpers; joint-missingness handling; MI+pool workflow; comparison with `mice` (pooled estimates and imputed means).
* Added tests for character dependent variables with non-numeric labels across all models and data types.
* Test suite expanded from 243 to 311 tests.

# miceFast 0.9.0

Kota Hattori, thank you for your feedback and for motivating me for this deep update.
12 changes: 2 additions & 10 deletions R/fill_NA.R
@@ -188,11 +188,7 @@ fill_NA.data.frame <- function(
f[f > length(l)] <- length(l)
ff <- factor(l[f])
} else if (is_character_y) {
yy <- if (model != "lda") {
factor(yy, levels = sort(as.numeric(unique(yy))))
} else {
factor(yy)
}
yy <- factor(yy)
l <- levels(yy)
yy <- as.numeric(yy)
f <- round(fill_NA_(cbind(yy, xx), model, 1, 2:(ncol(xx) + 1), ww, ridge))
@@ -277,11 +273,7 @@ fill_NA.data.table <- function(
f[f > length(l)] <- length(l)
ff <- factor(l[f])
} else if (is_character_y) {
yy <- if (model != "lda") {
factor(yy, levels = sort(as.numeric(unique(yy))))
} else {
factor(yy)
}
yy <- factor(yy)
l <- levels(yy)
yy <- as.numeric(yy)
f <- round(fill_NA_(cbind(yy, xx), model, 1, 2:(ncol(xx) + 1), ww, ridge))
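
The effect of the removed branch can be reproduced in isolation. In this base-R sketch the vector `yy` is made up for illustration; it shows why non-numeric labels ended up as all `NA`: `as.numeric()` turns the labels into `NA`, `sort()` drops them, and the factor is left with no levels.

```r
yy <- c("apple", "banana", "apple", "cherry")

# Old branch: labels are coerced to numbers before being used as levels.
# as.numeric() yields NA for every label and sort() silently drops NA values.
old_levels <- sort(suppressWarnings(as.numeric(unique(yy))))
old <- factor(yy, levels = old_levels)
old_codes <- as.numeric(old)   # every element is NA

# New branch: levels are taken from the labels themselves.
new <- factor(yy)
l <- levels(new)
codes <- as.numeric(new)       # integer codes 1..length(l)
restored <- l[codes]           # maps the codes back to the original labels
```

One thing that may be worth checking in review: for numeric-looking labels, `factor()` orders levels as strings by default (so `"10"` sorts before `"9"`), whereas the old branch ordered them numerically.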
20 changes: 6 additions & 14 deletions R/fill_NA_N.R
@@ -17,10 +17,10 @@
#' @return load imputations in a numeric/character/factor (similar to the input type) vector format
#'
#' @note
#' There is assumed that users add the intercept by their own.
#' The miceFast module provides the most efficient environment, the second recommended option is to use data.table and the numeric matrix data type.
#' The lda model is assessed only if there are more than 15 complete observations
#' and for the lms models if number of independent variables is smaller than number of observations.
#' It is assumed that users add the intercept column themselves.
#' The miceFast module provides the most efficient environment; the second recommended option is data.table with a numeric matrix.
#' Only \code{"lm_bayes"}, \code{"lm_noise"}, and \code{"pmm"} models are supported.
#' The model is fitted only when the number of complete observations exceeds the number of independent variables.
#'
#' @seealso \code{\link{fill_NA}} \code{\link{VIF}} \code{vignette("miceFast-intro", package = "miceFast")}
#'
@@ -187,11 +187,7 @@ fill_NA_N.data.frame <- function(
f[f > length(l)] <- length(l)
ff <- factor(l[f])
} else if (is_character_y) {
yy <- if (model != "lda") {
factor(yy, levels = sort(as.numeric(unique(yy))))
} else {
factor(yy)
}
yy <- factor(yy)
l <- levels(yy)
yy <- as.numeric(yy)
f <- round(fill_NA_N_(
@@ -295,11 +291,7 @@ fill_NA_N.data.table <- function(
f[f > length(l)] <- length(l)
ff <- factor(l[f])
} else if (is_character_y) {
yy <- if (model != "lda") {
factor(yy, levels = sort(as.numeric(unique(yy))))
} else {
factor(yy)
}
yy <- factor(yy)
l <- levels(yy)
yy <- as.numeric(yy)
f <- round(fill_NA_N_(
101 changes: 51 additions & 50 deletions README.md
@@ -26,13 +26,23 @@ For performance details, see `performance_validity.R` in the `extdata` folder.
- [Introduction and Advanced Usage](https://polkas.github.io/miceFast/articles/miceFast-intro.html)
- [Missing Data Mechanisms and Multiple Imputation](https://polkas.github.io/miceFast/articles/missing-data-and-imputation.html)


## Practical Advice

- **Only need a filled-in dataset for exploration or ML?** A single imputation with `fill_NA()` or averaging draws with `fill_NA_N()` is fast and convenient. For any inferential statement use full MI with `pool()`.
- **Little missing data + MCAR?** Consider using `complete.cases()`. Listwise deletion is unbiased under MCAR and may be sufficient when the fraction of incomplete rows is small.
- **For publication**, always run a **sensitivity analysis**: compare MI results against base methods (`complete.cases()`, mean imputation) and across different imputation models (`lm_bayes`, `lm_noise`, `pmm`). Vary the number of imputations. If conclusions change, investigate why. Report the imputation model, *m*, and any assumptions about the missing-data mechanism.
- See the [MI vignette](https://polkas.github.io/miceFast/articles/missing-data-and-imputation.html) for details on MCAR/MAR/MNAR mechanisms and a practical checklist.

## Multiple Imputation Workflow

[mice](https://cran.r-project.org/package=mice) implements the full MI pipeline (impute, analyze, pool). **miceFast** focuses on the computationally expensive partfitting the imputation models — and is typically **~10× faster** than mice for the imputation step alone (see [benchmarks](#performance-highlights)). Two usage modes:
[mice](https://cran.r-project.org/package=mice) implements the full MI pipeline (impute, analyze, pool). **miceFast** focuses on the computationally expensive part: fitting the imputation models. It is typically **~10× faster** than mice for the imputation step alone (see [benchmarks](#performance-highlights)). Two usage modes:

1. **MI with Rubin's rules** — call `fill_NA()` with a stochastic model (`lm_bayes`, `lm_noise`, or `lda` with a random `ridge`) in a loop to create *m* completed datasets, then `pool()` the fitted models.
1. **MI with Rubin's rules.** Call `fill_NA()` with a stochastic model in a loop to create *m* completed datasets, then `pool()` the fitted models. For continuous variables use `lm_bayes` (strictly **proper**; it draws from the posterior). For both continuous and categorical variables, `pmm` (Predictive Mean Matching) is also **proper**. It draws from the posterior and matches to observed values, preserving the data distribution. Use the OOP interface (`impute("pmm", ...)`) in a loop for MI with PMM. For categorical variables, `lda` with a random `ridge` is **approximate** (ad-hoc perturbation, not a posterior draw, but works well in practice). `lm_noise` is **improper** (no parameter uncertainty); useful for sensitivity checks. See the [MI vignette](https://polkas.github.io/miceFast/articles/missing-data-and-imputation.html).

2. **Single-dataset averaging** — `fill_NA_N()` returns the mean of *k* draws per missing value. Handy for exploration, but not for Rubin's rules (between-imputation variance is lost).
2. **Single-dataset imputation.** `fill_NA_N()` with `lm_bayes`/`lm_noise` returns the mean of *k* stochastic draws per missing value. With `pmm`, *k* is the number of nearest neighbours to sample from (no averaging). Handy for exploration, but not for Rubin's rules (between-imputation variance is lost).

3. **Iterative FCS (chained equations).** When multiple variables have interlocking (non-monotone) missingness, you can cycle through variables in a loop, restoring and re-imputing each one — the same algorithm mice uses. With a monotone pattern a single pass suffices and FCS is unnecessary. See the [Introduction vignette](https://polkas.github.io/miceFast/articles/miceFast-intro.html) for details.

See the [MI vignette](https://polkas.github.io/miceFast/articles/missing-data-and-imputation.html) for worked examples.
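
As a rough illustration of what pooling with Rubin's rules computes for a single coefficient, here is a base-R sketch. The function name `rubin_pool_sketch()` and the input numbers are invented for illustration. The package's `pool()` output also reports a Barnard-Rubin degrees-of-freedom adjustment, which is omitted here.

```r
# Hypothetical helper: Rubin's rules for one coefficient across m imputations.
rubin_pool_sketch <- function(est, se) {
  m <- length(est)
  q_bar <- mean(est)                  # pooled point estimate
  u_bar <- mean(se^2)                 # within-imputation variance
  b <- var(est)                       # between-imputation variance
  t_var <- u_bar + (1 + 1 / m) * b    # total variance
  c(estimate = q_bar, std.error = sqrt(t_var), within = u_bar, between = b)
}

# Made-up estimates and standard errors from m = 5 completed datasets
est <- c(1.70, 1.75, 1.68, 1.80, 1.74)
se <- c(0.24, 0.25, 0.24, 0.26, 0.25)
res <- rubin_pool_sketch(est, se)
res
```

Averaging draws into a single dataset leaves only one estimate, so the `between` term cannot be computed. That is the sense in which between-imputation variance is lost in the single-dataset mode.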

@@ -56,36 +66,35 @@ devtools::install_github("polkas/miceFast")
library(miceFast)
library(dplyr)

set.seed(1234)
data(air_miss)

# Visualize the NA structure
upset_NA(air_miss, 6)

# Model-based single imputation
air_miss %>%
mutate(Ozone_imp = fill_NA(
x = ., model = "lm_bayes",
posit_y = "Ozone", posit_x = c("Solar.R", "Wind", "Temp")
))

# Proper MI: impute m times, fit models, pool with Rubin's rules
completed <- lapply(1:5, function(i) {
air_miss %>%
mutate(Ozone_imp = fill_NA(
x = ., model = "lm_bayes",
posit_y = "Ozone", posit_x = c("Solar.R", "Wind", "Temp")
))
# Select the 4 core variables for regression: Ozone ~ Solar.R + Wind + Temp
# Ozone has 37 NAs, Solar.R has 7 NAs, Wind and Temp are complete.
df <- air_miss[, c("Ozone", "Solar.R", "Wind", "Temp")]

# MI with Rubin's rules: impute m = 10 datasets, fit model, pool.
# Impute Solar.R first (predictors fully observed), then Ozone
# (can now use the freshly imputed Solar.R). This sequential order
# resolves joint missingness in a single pass.
set.seed(1234)
completed <- lapply(1:10, function(i) {
df %>%
mutate(Solar.R = fill_NA(., "lm_bayes", "Solar.R", c("Wind", "Temp"))) %>%
mutate(Ozone = fill_NA(., "lm_bayes", "Ozone", c("Solar.R", "Wind", "Temp")))
})
fits <- lapply(completed, function(d) lm(Ozone_imp ~ Wind + Temp, data = d))
fits <- lapply(completed, function(d) lm(Ozone ~ Solar.R + Wind + Temp, data = d))
pool(fits)
#> Pooled results from 5 imputed datasets
#> Pooled results from 10 imputed datasets
#> Rubin's rules with Barnard-Rubin df adjustment
#>
#> term estimate std.error statistic df p.value
#> (Intercept) -62.771 23.9022 -2.626 46.95 1.162e-02
#> Wind -3.087 0.6857 -4.502 37.24 6.420e-05
#> Temp 1.736 0.2498 6.951 58.54 3.400e-09
#> term estimate std.error statistic df p.value
#> (Intercept) -49.50313 21.74948 -2.276 78.41 2.557e-02
#> Solar.R 0.05771 0.02294 2.516 72.83 1.407e-02
#> Wind -3.44033 0.62721 -5.485 76.15 5.185e-07
#> Temp 1.47603 0.23404 6.307 97.50 8.345e-09
```

### data.table
@@ -94,27 +103,28 @@ pool(fits)
library(miceFast)
library(data.table)

set.seed(1234)
data(air_miss)
setDT(air_miss)

# Single imputation
air_miss[, Ozone_imp := fill_NA(
x = .SD, model = "lm_bayes",
posit_y = "Ozone", posit_x = c("Solar.R", "Wind", "Temp")
)]

# Grouped imputation — fits a separate model per group
air_miss[, Solar_R_imp := fill_NA(
x = .SD, model = "lm_bayes",
posit_y = "Solar.R", posit_x = c("Wind", "Temp", "Intercept")
), by = .(groups)]
dt <- as.data.table(air_miss[, c("Ozone", "Solar.R", "Wind", "Temp")])

# MI with Rubin's rules: same sequential chain as above.
set.seed(1234)
completed <- lapply(1:10, function(i) {
d <- copy(dt)
d[, Solar.R := fill_NA(.SD, "lm_bayes", "Solar.R", c("Wind", "Temp"))]
d[, Ozone := fill_NA(.SD, "lm_bayes", "Ozone", c("Solar.R", "Wind", "Temp"))]
d
})
fits <- lapply(completed, function(d) lm(Ozone ~ Solar.R + Wind + Temp, data = d))
pool(fits)
```

For iterative FCS (chained equations) with non-monotone missingness,
see the [Introduction vignette](https://polkas.github.io/miceFast/articles/miceFast-intro.html#iterative-fcs-chained-equations-with-micefast).

### Naive imputation (baseline only)

```r
# Quick baseline — biased, does not account for relationships between variables
# Quick baseline. Biased; does not account for relationships between variables.
naive_fill_NA(air_miss)
```

@@ -127,7 +137,7 @@ See the [Introduction vignette](https://polkas.github.io/miceFast/articles/miceF
- **Object-Oriented Interface** via `miceFast` objects (Rcpp modules).
- **Convenient Helpers**:
- `fill_NA()`: Single imputation (`lda`, `lm_pred`, `lm_bayes`, `lm_noise`).
- `fill_NA_N()`: Averaged multiple imputations (mean of N draws) (`pmm`, `lm_bayes`, `lm_noise`).
- `fill_NA_N()`: Multiple imputations. Averaged draws for `lm_bayes`/`lm_noise`; nearest-neighbour sampling for `pmm`.
- `pool()`: Pool multiply imputed results using Rubin's rules.
- `VIF()`: Variance Inflation Factor calculations.
- `naive_fill_NA()`: Automatic naive imputations.
@@ -140,7 +150,7 @@ See the [Introduction vignette](https://polkas.github.io/miceFast/articles/miceF
|-----------------|-----------------------------------------------------------------------------|
| `new(miceFast)` | Creates an OOP instance with numerous imputation methods (see the vignette). |
| `fill_NA()` | Single imputation: `lda`, `lm_pred`, `lm_bayes`, `lm_noise`. |
| `fill_NA_N()` | Averaged multiple imputations (mean of N draws): `pmm`, `lm_bayes`, `lm_noise`. |
| `fill_NA_N()` | `lm_bayes`/`lm_noise`: averages *k* draws. `pmm`: samples from *k* nearest observed values (works for both continuous and categorical). |
| `pool()` | Pools estimates from *m* imputed datasets using Rubin's rules. Works with any model that has `coef()` and `vcov()`. |
| `VIF()` | Computes Variance Inflation Factors. |
| `naive_fill_NA()` | Performs automatic, naive imputations. |
@@ -149,15 +159,6 @@ See the [Introduction vignette](https://polkas.github.io/miceFast/articles/miceF

---

## Practical Advice

- **Only need a filled-in dataset for exploration or ML?** A single imputation with `fill_NA()` or averaging draws with `fill_NA_N()` is fast and convenient. For any inferential statement use full MI with `pool()`.
- **Little missing data + MCAR?** Consider using `complete.cases()` — listwise deletion is unbiased under MCAR and may be sufficient when the fraction of incomplete rows is small.
- **For publication**, always run a **sensitivity analysis**: compare MI results against base methods (`complete.cases()`, mean imputation) and across different imputation models (`lm_bayes`, `lm_noise`, `pmm`). Vary the number of imputations. If conclusions change, investigate why. Report the imputation model, *m*, and any assumptions about the missing-data mechanism.
- See the [MI vignette](https://polkas.github.io/miceFast/articles/missing-data-and-imputation.html) for details on MCAR/MAR/MNAR mechanisms and a practical checklist.

---

## Performance Highlights

Median timings on 100k rows, 10 variables, 100 groups (R 4.4.3, macOS M3 Pro, [optimized BLAS/LAPACK](https://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html#Which-BLAS-is-used-and-how-can-it-be-changed_003f)):
8 changes: 3 additions & 5 deletions cran-comments.md
@@ -2,11 +2,9 @@

github actions:

* {os: macOS-latest, r: 'release'}
* {os: windows-latest, r: 'release'}
* {os: windows-latest, r: '3.6'}
* {os: ubuntu-18.04, r: 'devel', http-user-agent: 'release'}
* {os: ubuntu-18.04, r: 'release'}
- {os: macOS-latest, r: 'release'}
- {os: windows-latest, r: 'release'}
- {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'}

and:

8 changes: 4 additions & 4 deletions man/fill_NA_N.Rd

Some generated files are not rendered by default.