Assess when and how to fill in missing data without biasing models. Compare simple and model-based strategies and their assumptions.
Mean imputation can bias models mainly because ______.
it shrinks variance and distorts feature relationships
it always increases variance and noise
it requires labeled outcomes for fitting
it deletes rows with missing values
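For reference, a minimal NumPy sketch of the effect behind this question (synthetic data, all numbers illustrative): filling with the mean shrinks the variance of the filled column and attenuates its correlation with other features.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 0.8 * x + rng.normal(scale=0.6, size=10_000)

# Knock out 40% of y completely at random, then fill with the observed mean.
missing = rng.random(y.size) < 0.4
y_obs = np.where(missing, np.nan, y)
y_filled = np.where(missing, np.nanmean(y_obs), y)

print(np.var(y), np.var(y_filled))        # variance shrinks after mean-filling
print(np.corrcoef(x, y)[0, 1],
      np.corrcoef(x, y_filled)[0, 1])     # correlation with x attenuates
```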
Missing completely at random (MCAR) implies ______.
values are imputed using a generative model
data are missing due to model residuals
missingness depends only on the unobserved value
missingness is independent of observed and unobserved data
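For reference, a small synthetic sketch contrasting MCAR (a pure coin flip) with MAR (missingness driven by an observed column); the variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=1_000)
income = 1_000 * age + rng.normal(scale=5_000, size=1_000)

# MCAR: missingness is independent of everything, observed or unobserved.
mcar_mask = rng.random(age.size) < 0.2

# For contrast, MAR: missingness depends only on the *observed* age column.
mar_mask = rng.random(age.size) < np.where(age > 60, 0.5, 0.1)

income_mcar = np.where(mcar_mask, np.nan, income)
income_mar = np.where(mar_mask, np.nan, income)
```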
MICE (multiple imputation by chained equations) improves over single imputation by ______.
generating multiple plausible datasets and pooling estimates
requiring no model fitting per variable
dropping all rows with any missing values
fitting a single global mean per feature
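For reference, a MICE-style sketch using scikit-learn's IterativeImputer: draw several plausible completed datasets and pool the estimates. The per-dataset "analysis" here is just a column mean, standing in for whatever model you would actually fit.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] += 0.7 * X[:, 0]                      # some structure for the imputer to use
X[rng.random(X.shape) < 0.2] = np.nan

estimates = []
for seed in range(5):                          # 5 plausible completed datasets
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    X_complete = imp.fit_transform(X)
    estimates.append(X_complete.mean(axis=0))  # stand-in for fitting the analysis model

pooled = np.mean(estimates, axis=0)            # Rubin's rules would also pool variances
print(pooled)
```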
For tree-based models, a common pragmatic approach is to ______.
use simple imputation and add missingness indicators
impute with random noise only
discard all rows with any NA values
one-hot encode every continuous feature
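For reference, a minimal scikit-learn sketch of the pragmatic approach: median fill plus missingness-indicator columns, feeding a gradient-boosted tree. (Note that HistGradientBoostingClassifier can also consume NaNs natively, which is often simpler.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.1] = np.nan

pipe = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),  # one 0/1 flag per column with NaNs
    HistGradientBoostingClassifier(random_state=0),
)
pipe.fit(X, y)
```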
KNN imputation risks leakage when ______.
neighbors are computed on data that includes held-out fold rows or target-derived features
K is odd rather than even
the distance metric is standardized
categorical variables are ordinal-encoded
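For reference, a minimal sketch of the leak-safe setup: keep KNNImputer inside the cross-validation pipeline, so neighbor lookups are fit on each training fold and never see held-out rows (synthetic data, illustrative settings).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X[np.random.default_rng(1).random(X.shape) < 0.1] = np.nan

pipe = make_pipeline(KNNImputer(n_neighbors=5), LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5))  # imputer refit per training fold: no leakage
```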
Model-based imputation with LightGBM or XGBoost can help because ______.
it always outperforms simple methods on small data
it guarantees unbiased estimates under MNAR without assumptions
it eliminates the need for validation
it captures nonlinear relations among features to predict missing values
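For reference, a sketch of model-based imputation with a tree ensemble as the per-feature regressor, so fills can track nonlinear interactions. A RandomForestRegressor stands in here; LightGBM/XGBoost can also simply consume NaNs natively.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[:, 3] = np.sin(X[:, 0]) * X[:, 1]           # nonlinear structure a mean fill would miss
X[rng.random(X.shape) < 0.15] = np.nan

imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_filled = imp.fit_transform(X)
```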
Under missing at random (MAR), valid imputation requires ______.
filling with group medians only
including predictors related to the missingness mechanism
ignoring auxiliary variables to avoid collinearity
using the target variable as a donor in all cases
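For reference, a sketch of the MAR requirement: give the imputation model the variables that drive missingness (here a made-up `age` column drives the gaps in `income`), even if the downstream model ignores them.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.uniform(20, 80, size=500)})
df["income"] = 1_000 * df["age"] + rng.normal(scale=5_000, size=500)
# Missingness in income depends on observed age: a MAR mechanism.
df.loc[rng.random(500) < np.where(df["age"] > 60, 0.5, 0.05), "income"] = np.nan

# Impute income *with* age in the design matrix; dropping age would bias the fill.
filled = IterativeImputer(random_state=0).fit_transform(df[["age", "income"]])
```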
Before imputing, it is good practice to ______.
remove all constant features regardless of coverage
convert numeric features to ranks first
analyze missingness patterns by feature, cohort, and time
randomly shuffle target labels
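For reference, a quick pandas profile of missingness by feature and by cohort/time; the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "signup_month": rng.choice(["2024-01", "2024-02", "2024-03"], size=300),
    "income": np.where(rng.random(300) < 0.3, np.nan, rng.normal(size=300)),
    "age": np.where(rng.random(300) < 0.05, np.nan, rng.uniform(18, 70, size=300)),
})

print(df.isna().mean().sort_values(ascending=False))  # share missing per feature
print(df.isna().groupby(df["signup_month"]).mean())   # missingness drift by cohort/time
```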
For production scoring, a key risk of fancy imputation is ______.
using immutable model artifacts
needing to reindex model coefficients daily
mismatch between training-time imputers and real-time data pipelines
having too many categorical features
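For reference, a minimal sketch of one mitigation: persist the fitted imputer and model together as one artifact, so real-time scoring replays exactly the training-time preprocessing (joblib; all names illustrative).

```python
import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.1] = np.nan

pipe = make_pipeline(SimpleImputer(strategy="median", add_indicator=True),
                     LogisticRegression(max_iter=1000))
pipe.fit(X, y)
joblib.dump(pipe, "model.joblib")     # one artifact: learned medians + indicators + model

served = joblib.load("model.joblib")  # serving side loads the same artifact,
print(served.predict_proba(X[:3]))    # so train/serve imputation cannot diverge
```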
A safe cross-validation practice with imputation is to ______.
skip scaling within the pipeline entirely
fit the imputer inside each training fold and apply to its validation fold only
impute only the validation folds for fairness
fit the imputer on full data before splitting
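For reference, a minimal sketch of the safe pattern: with the imputer inside the pipeline, cross_val_score refits it on each training fold and only transforms that fold's validation split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.1] = np.nan

# WRONG: SimpleImputer().fit_transform(X) before splitting leaks fold statistics.
# RIGHT: the imputer lives inside the pipeline, refit per training fold.
pipe = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5))  # no fold's statistics leak into another fold
```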
Starter
Keep practicing: revisit assumptions like MCAR/MAR and where imputers sit in the pipeline.
Solid
Good choices—document imputer fit inside folds and monitor drift.
Expert!
Outstanding—balance rigor and runtime with production-safe imputers.