Spot the subtle ways information from the future can sneak into your training data. Master pipelines and evaluation practices that keep your metrics honest.
Which setup best prevents data leakage during preprocessing?
Fit preprocessing on the full dataset before splitting
Use the test set to pick imputation strategies
Standardize by the mean of train+test together
Fit scalers and imputers inside a Pipeline that is cross-validated
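A minimal scikit-learn sketch of the recommended setup, using an illustrative synthetic dataset: the imputer and scaler live inside the Pipeline, so each cross-validation fold refits them on its own training rows only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X[::10, 0] = np.nan  # introduce some missing values for the imputer to handle

# Both preprocessing steps are refit on each fold's training rows only,
# so no statistics from the validation rows leak into the transform.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

print(cross_val_score(pipe, X, y, cv=5).mean())
```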
Why can using future-dated features in a training row be leakage?
The model learns information that would not be available at prediction time
It always reduces variance regardless of target
It merely increases training time
It guarantees perfect calibration
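To make the point concrete, here is a point-in-time feature build in pandas; the column names (event_time, prediction_time) and values are illustrative, not from any particular schema. Only events observed at or before each row's prediction point are allowed into the features.

```python
import pandas as pd

# Hypothetical event log: each raw feature event carries its own timestamp.
events = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-05", "2024-03-20", "2024-02-10"]),
    "amount": [100.0, 250.0, 80.0],
})

# The prediction point for each customer: only events at or before this moment
# would have been observable, so anything later must be excluded from features.
cutoffs = pd.DataFrame({
    "customer_id": [1, 2],
    "prediction_time": pd.to_datetime(["2024-02-01", "2024-03-01"]),
})

joined = events.merge(cutoffs, on="customer_id")
observable = joined[joined["event_time"] <= joined["prediction_time"]]
print(observable.groupby("customer_id")["amount"].sum())
```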
Where should feature selection based on target correlation happen?
On the test set to pick the best features
On the full dataset before any split
Inside each training fold during cross-validation
Only after deployment
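One way to keep target-based selection honest is to place the selector inside a cross-validated Pipeline; a minimal sketch with synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=100, n_informative=5,
                           random_state=0)

# The selector is refit inside every training fold, so the validation rows
# never influence which features are kept.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

print(cross_val_score(pipe, X, y, cv=5).mean())
```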
For time-series forecasting, which cross-validation approach reduces leakage?
Shuffle all timestamps randomly
Stratified k-fold on the target values
Standard k-fold with random shuffling
Use forward-chaining splits that respect time order
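scikit-learn's TimeSeriesSplit implements forward-chaining; a small sketch with synthetic, time-ordered rows (the data here is purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # rows assumed to be in time order
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

# Forward-chaining: each split trains on an earlier prefix of the series and
# validates on the block that follows it, so the model never sees the future.
tscv = TimeSeriesSplit(n_splits=5)
print(cross_val_score(Ridge(), X, y, cv=tscv).mean())
```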
Which is an example of target leakage in healthcare claims modeling?
Using age and gender as features
Including a post-visit billing code that is created after the prediction point
One-hot encoding diagnosis categories
Scaling continuous lab results
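A toy illustration of screening features by availability; the metadata dictionary and field names are hypothetical, but the idea is to drop anything created after the prediction point.

```python
# Hypothetical feature metadata: when each field becomes available relative to
# the visit, which is the prediction point in this scenario.
feature_availability = {
    "age": "at_visit",
    "gender": "at_visit",
    "diagnosis_code": "at_visit",
    "billing_code": "post_visit",   # created after the prediction point: leakage
}

allowed = [name for name, when in feature_availability.items()
           if when == "at_visit"]
print(allowed)  # ['age', 'gender', 'diagnosis_code']
```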
How can you prevent text vectorization leakage?
Fit on test to maximize vocabulary coverage
Build the vocabulary from train+test jointly
Fit the vectorizer on training data in each fold and transform validation/test separately
Remove rare words after seeing test performance
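A minimal sketch of fold-safe text vectorization: the vectorizer sits inside the Pipeline, so its vocabulary and IDF weights come only from each fold's training documents (the tiny corpus is illustrative).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

texts = ["good service", "terrible wait times", "friendly staff",
         "would not recommend", "excellent care", "awful experience"]
labels = [1, 0, 1, 0, 1, 0]

# The vectorizer's vocabulary and IDF weights are learned only from each fold's
# training documents; validation documents are merely transformed.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

print(cross_val_score(pipe, texts, labels, cv=3).mean())
```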
Which evaluation choice best guards against tuning decisions overfitting to your validation data?
Use the same validation fold for model selection and final report
Increase epochs until the test score stops improving
Choose the best model using test-set performance repeatedly
Keep a final untouched test set after cross-validation
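A sketch of that holdout discipline, with illustrative data and a simple grid: all tuning happens inside GridSearchCV on the training split, and the test split is scored exactly once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, random_state=0)

# Carve out a holdout that is never used during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)          # all model selection happens here

print(search.score(X_test, y_test))   # scored once, on the untouched holdout
```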
Why is GroupKFold useful to avoid leakage?
It keeps all rows from the same entity in the same fold
It guarantees equal fold sizes regardless of groups
It stratifies by the target mean only
It prevents class imbalance automatically
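A minimal sketch of grouped cross-validation; the patient-style group IDs are synthetic and illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
groups = np.repeat(np.arange(50), 4)   # hypothetical patient IDs, 4 rows each

# GroupKFold keeps every row from a given patient in a single fold, so the
# model is never validated on an entity it already saw during training.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=groups, cv=cv)
print(scores.mean())
```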
What is a red flag that your preprocessing likely leaked?
Validation scores collapse when you move the transformer into a Pipeline
Feature names are longer
Model coefficients change after refit
Training time decreases
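A small experiment that reproduces this red flag on pure noise: selecting features on the full dataset before cross-validation inflates the score, while the same selector inside a Pipeline does not (exact numbers will vary by seed).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))    # pure noise features
y = rng.integers(0, 2, size=100)    # random labels: there is no real signal

# Leaky: features are chosen using all rows (including future validation rows),
# then the already-selected matrix is cross-validated.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# Honest: the same selector is refit inside each training fold via a Pipeline.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("clf", LogisticRegression(max_iter=1000))])
honest = cross_val_score(pipe, X, y, cv=5)

print(leaky.mean(), honest.mean())  # leaky score looks good; honest score is ~chance
```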
Which logging practice helps catch leakage before deployment?
Disable logging in production
Aggregate logs weekly without detail
Only log final predictions
Record data timestamps and feature availability windows for every prediction
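A rough sketch of such a logging record; the function name, fields, and values are hypothetical. The point is that every prediction carries timestamps for when each feature value was observed, so an audit can flag anything dated after the prediction time.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("prediction_audit")

def log_prediction(request_id, features, feature_observed_at, prediction):
    """Log one prediction with the timestamp at which each feature value was
    observed, so audits can flag any feature dated after the prediction time."""
    record = {
        "request_id": request_id,
        "prediction_time": datetime.now(timezone.utc).isoformat(),
        "features": features,
        "feature_observed_at": feature_observed_at,
        "prediction": prediction,
    }
    logger.info(json.dumps(record))

log_prediction(
    request_id="req-001",
    features={"claims_last_90d": 3, "avg_claim_amount": 412.5},
    feature_observed_at={"claims_last_90d": "2024-03-31T23:59:59+00:00",
                         "avg_claim_amount": "2024-03-31T23:59:59+00:00"},
    prediction=0.42,
)
```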
Starter
Good start—focus on proper splits and fold-wise preprocessing.
Solid
Solid! Tighten your process with grouped/time splits and final holdouts.
Expert
Excellent—your evaluation hygiene resists leakage in real projects.