Replace missing values (NA) in a data frame, column by column, using a summary statistic (mean, median, or mode), a constant, or a custom function.
Arguments
- data
A data frame. The dataset in which missing values should be imputed.
- method
A list of two-sided formulas of the form <selector> ~ <value>. Supported <value> options are:
"mean": replace with the column mean (numeric columns only).
"median": replace with the column median (numeric columns only).
"mode": replace with the most frequent value (works for numeric, character, or factor columns).
A numeric constant: replace with that constant (numeric columns).
A character constant: replace with that value (character/factor columns).
A function: a function(col) that receives the column and returns a single value to be used as the replacement for NA.
The default is list(dplyr::where(is.numeric) ~ "mean", dplyr::where(is.character) ~ "mode", dplyr::where(is.factor) ~ "mode").
- filter_by
Character vector of column names. If provided, only rows that have all specified columns non-NA are kept (applied before imputation).
- drop_all_na
Logical; if TRUE, rows where all columns are NA are removed before imputation.
- verbose
Logical; if TRUE (default), print a concise final summary of what was imputed. Set to FALSE to suppress messages.
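To make the "mean" and "mode" options concrete, here is a minimal base R sketch of the replacement they describe. It does not call impute_missing() and is not the package's implementation; stat_mode is a hypothetical helper, and how the package breaks ties between equally frequent values is not stated in this page.

```r
# Hypothetical helper (not part of the package): most frequent non-NA value.
# On ties, this sketch returns the value that appears first.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

df <- data.frame(
  age   = c(30, NA, 50, 40),
  group = c("a", "b", NA, "a"),
  stringsAsFactors = FALSE
)

# "mean" on a numeric column: NA becomes the mean of the observed values.
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)

# "mode" on a character column: NA becomes the most frequent observed value.
df$group[is.na(df$group)] <- stat_mode(df$group)

df
```

Each column is filled from its own observed values only, which is what "column-wise" means here.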
Details
Before imputation, you can remove rows that are entirely NA with drop_all_na, or keep only rows that have no NA in specific variables with filter_by.
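The two row filters can be pictured with plain base R. This is a sketch of the documented behaviour on a toy data frame, assuming nothing about how the package implements it; the column names are made up for illustration.

```r
df <- data.frame(
  x = c(1, NA, NA, 4),
  y = c("a", NA, "c", NA),
  z = c(TRUE, NA, FALSE, TRUE)
)

# Like drop_all_na = TRUE: drop rows in which every column is NA (row 2 here).
all_na <- rowSums(is.na(df)) == ncol(df)
no_empty_rows <- df[!all_na, , drop = FALSE]

# Like filter_by = c("x", "y"): keep rows where both x and y are observed
# (only row 1 here).
filter_cols <- c("x", "y")
keep <- stats::complete.cases(df[, filter_cols, drop = FALSE])
complete_rows <- df[keep, , drop = FALSE]

nrow(no_empty_rows)
nrow(complete_rows)
```

Because both filters run before imputation, the rows they remove do not contribute to the means, medians, or modes used as replacements.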
The method argument uses tidyselect helpers. For example, where(is.numeric) ~ "median" imputes all numeric columns by their medians. "mode" works for numeric, character and factor columns.
When imputing factors with a character constant, the constant is added as a new level if needed.
A custom function should return at least one value; if it returns multiple values, only the first is used (with a warning).
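The last two points can be illustrated in base R. This is a rough sketch of the described behaviour, not the package's code; two_quantiles is a hypothetical custom function chosen because it returns more than one value.

```r
# Factor plus character constant: the level has to exist before assignment,
# otherwise base R would insert NA and warn.
f <- factor(c("low", NA, "high"))
levels(f) <- c(levels(f), "unknown")
f[is.na(f)] <- "unknown"

# Hypothetical custom function that returns two values.
two_quantiles <- function(col) {
  stats::quantile(col, probs = c(0.25, 0.5), na.rm = TRUE)
}

x <- c(1, 2, 3, 4, NA)
fill <- two_quantiles(x)

# Only the first returned value is used as the replacement.
if (length(fill) > 1) fill <- fill[1]
x[is.na(x)] <- unname(fill)

levels(f)
x
```

A function that already returns a single value, such as function(col) min(col, na.rm = TRUE), avoids the truncation step entirely.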
Examples
# Default method: impute numeric columns by their means and character/factor columns by their modes:
impute_missing(icu)
#> Imputation summary: Numeric variables imputed with mean; Categorical/Factor variables imputed with mode
#> # A tibble: 5,813 × 12
#> record_id center covid_wave icu icu_enter_days icu_exit_days vent_mec
#> <int> <fct> <fct> <fct> <dbl> <dbl> <fct>
#> 1 1 Hospital C Wave 2 No 3.61 22.8 No
#> 2 2 Hospital A Wave 2 No 3.61 22.8 No
#> 3 3 Hospital A Wave 3 No 3.61 22.8 No
#> 4 4 Hospital A Wave 3 No 3.61 22.8 No
#> 5 5 Hospital C Wave 2 No 3.61 22.8 No
#> 6 6 Hospital C Wave 1 Yes 2 28 Yes
#> 7 7 Hospital E Wave 1 No 3.61 22.8 No
#> 8 8 Hospital B Wave 1 No 3.61 22.8 No
#> 9 9 Hospital C Wave 2 No 3.61 22.8 No
#> 10 10 Hospital E Wave 1 No 3.61 22.8 No
#> # ℹ 5,803 more rows
#> # ℹ 5 more variables: vent_mec_start_days <dbl>, vent_mec_end_days <dbl>,
#> # vent_mec_no_inv <fct>, vent_mec_no_inv_start_days <dbl>,
#> # vent_mec_no_inv_end_days <dbl>
# Impute numeric columns by median:
impute_missing(
icu,
method = list(where(is.numeric) ~ "median")
)
#> Imputation summary: Numeric variables imputed with median
#> # A tibble: 5,813 × 12
#> record_id center covid_wave icu icu_enter_days icu_exit_days vent_mec
#> <int> <fct> <fct> <fct> <dbl> <dbl> <fct>
#> 1 1 Hospital C Wave 2 No 3 16 No
#> 2 2 Hospital A Wave 2 No 3 16 No
#> 3 3 Hospital A Wave 3 No 3 16 No
#> 4 4 Hospital A Wave 3 No 3 16 No
#> 5 5 Hospital C Wave 2 No 3 16 No
#> 6 6 Hospital C Wave 1 Yes 2 28 Yes
#> 7 7 Hospital E Wave 1 No 3 16 No
#> 8 8 Hospital B Wave 1 No 3 16 No
#> 9 9 Hospital C Wave 2 No 3 16 No
#> 10 10 Hospital E Wave 1 No 3 16 No
#> # ℹ 5,803 more rows
#> # ℹ 5 more variables: vent_mec_start_days <dbl>, vent_mec_end_days <dbl>,
#> # vent_mec_no_inv <fct>, vent_mec_no_inv_start_days <dbl>,
#> # vent_mec_no_inv_end_days <dbl>
# Keep only rows where both "vent_mec_no_inv" and "vent_mec" are non-missing:
impute_missing(
icu,
filter_by = c("vent_mec_no_inv", "vent_mec")
)
#> Removed 6 rows that had NA in at least one of the 'filter_by' variables
#> Imputation summary: Numeric variables imputed with mean; Categorical/Factor variables imputed with mode
#> # A tibble: 5,807 × 12
#> record_id center covid_wave icu icu_enter_days icu_exit_days vent_mec
#> <int> <fct> <fct> <fct> <dbl> <dbl> <fct>
#> 1 1 Hospital C Wave 2 No 3.61 22.9 No
#> 2 2 Hospital A Wave 2 No 3.61 22.9 No
#> 3 3 Hospital A Wave 3 No 3.61 22.9 No
#> 4 4 Hospital A Wave 3 No 3.61 22.9 No
#> 5 5 Hospital C Wave 2 No 3.61 22.9 No
#> 6 6 Hospital C Wave 1 Yes 2 28 Yes
#> 7 7 Hospital E Wave 1 No 3.61 22.9 No
#> 8 8 Hospital B Wave 1 No 3.61 22.9 No
#> 9 9 Hospital C Wave 2 No 3.61 22.9 No
#> 10 10 Hospital E Wave 1 No 3.61 22.9 No
#> # ℹ 5,797 more rows
#> # ℹ 5 more variables: vent_mec_start_days <dbl>, vent_mec_end_days <dbl>,
#> # vent_mec_no_inv <fct>, vent_mec_no_inv_start_days <dbl>,
#> # vent_mec_no_inv_end_days <dbl>
