Test how well you can balance utility and privacy when generating synthetic datasets. From model choices to leakage tests, see which safeguards matter most.
Which method is commonly used to generate realistic tabular data with mixed variable types?
CTGAN-style conditional GANs
Simple bootstrap resampling
Pure rule-based simulators
Autoencoders with L2 loss only
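For context on the conditional-GAN option above, here is a minimal sketch assuming the open-source ctgan package; the file name, column names, and epoch count are illustrative, not prescribed by this quiz.

```python
# Minimal sketch: fit a CTGAN-style conditional GAN on mixed-type tabular data.
# Assumes the open-source `ctgan` package; file and column names are hypothetical.
import pandas as pd
from ctgan import CTGAN

real_df = pd.read_csv("patients.csv")        # mixed numeric + categorical columns (hypothetical file)
discrete_columns = ["gender", "diagnosis"]   # tell the model which columns are categorical

model = CTGAN(epochs=300)                    # conditional GAN designed for tabular data
model.fit(real_df, discrete_columns)         # learns the joint distribution across mixed types
synthetic_df = model.sample(len(real_df))    # draw a synthetic dataset of the same size
```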
A key privacy risk of synthetic data arises when the generator ______.
uses latent variables
outputs in CSV format
trains with mini-batches
memorizes and reproduces individual records
Which evaluation checks utility by training on synthetic data and testing on real data?
BLEU score on metadata
Holdout AUC on synthetic only
Silhouette score of embeddings
TSTR (Train on Synthetic, Test on Real)
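A minimal TSTR sketch with scikit-learn, assuming feature matrices and labels for the synthetic data (X_syn, y_syn) and a real holdout (X_real, y_real) already exist; the classifier choice is arbitrary.

```python
# Train on Synthetic, Test on Real (TSTR): fit only on synthetic records,
# then score on a real holdout to check whether task utility is preserved.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def tstr_auc(X_syn, y_syn, X_real, y_real):
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_syn, y_syn)                     # train on synthetic data only
    scores = clf.predict_proba(X_real)[:, 1]  # evaluate on real holdout data
    return roc_auc_score(y_real, scores)      # compare against a model trained on real data
```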
Differential privacy aims to limit the influence of any single person’s data by ______.
sampling fewer training epochs
adding calibrated noise to learning or outputs
encrypting the entire dataset at rest
using only open datasets
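As a small illustration of "calibrated noise", here is a Laplace-mechanism sketch for a count query (not DP-SGD); the epsilon value and query are illustrative assumptions.

```python
# Laplace mechanism: noise scale is calibrated to the query's sensitivity and epsilon,
# so any single person's record has a bounded influence on the released statistic.
import numpy as np

def dp_count(records, epsilon=1.0):
    true_count = len(records)
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise
```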
For time-series synthesis that preserves temporal dynamics, a proven approach is ______.
row-wise shuffling
adversarial or transformer-based time-series generators (e.g., TimeGAN)
per-feature Gaussian noise
k-means clustering then sampling centroids
To protect rare but sensitive categories in synthetic data, teams often ______.
remove all categorical features
always one-hot encode everything
release the raw stratified samples
apply conditional generation with minimum group counts or DP thresholds
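One simplified way to enforce a minimum group count is post-hoc suppression of the synthetic output, sketched below; real pipelines may instead condition the generator or apply DP thresholds at generation time. The column name and threshold are hypothetical.

```python
# Suppress rare sensitive categories: any category appearing fewer than
# `min_count` times in the synthetic output is collapsed into "OTHER" before release.
import pandas as pd

def suppress_rare_categories(synthetic_df: pd.DataFrame, column: str, min_count: int = 20):
    counts = synthetic_df[column].value_counts()
    rare = counts[counts < min_count].index
    out = synthetic_df.copy()
    out.loc[out[column].isin(rare), column] = "OTHER"
    return out
```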
Which metric family is commonly used to compare real vs. synthetic variable distributions?
two-sample tests or distances (e.g., KS, Wasserstein)
CPU utilization while training
JPEG compression ratio
edit distance on headers
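A minimal sketch of the two-sample comparisons named above, using SciPy's KS test and Wasserstein distance on one numeric column; the inputs are assumed to be 1-D arrays of real and synthetic values.

```python
# Per-column distributional comparison between real and synthetic data.
from scipy.stats import ks_2samp, wasserstein_distance

def compare_column(real_values, synthetic_values):
    ks_stat, ks_pvalue = ks_2samp(real_values, synthetic_values)   # two-sample KS test
    w_dist = wasserstein_distance(real_values, synthetic_values)   # earth mover's distance
    return {"ks_stat": ks_stat, "ks_pvalue": ks_pvalue, "wasserstein": w_dist}
```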
A practical safeguard before releasing synthetic data is to run ______.
schema auto-formatting
k-fold CV on synthetic only
only visual t-SNE plots
membership-inference and nearest-neighbor leakage tests
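A minimal nearest-neighbor leakage sketch, assuming numeric arrays for the synthetic set, the generator's real training set, and a real holdout it never saw; the median-ratio summary is one common heuristic, not a standard.

```python
# Nearest-neighbor leakage check: compare how close synthetic records sit to the
# real training set versus a real holdout never seen by the generator.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_distance(query, reference):
    nn = NearestNeighbors(n_neighbors=1).fit(reference)
    dist, _ = nn.kneighbors(query)
    return dist.ravel()

def leakage_ratio(synthetic, train_real, holdout_real):
    d_train = nn_distance(synthetic, train_real)
    d_holdout = nn_distance(synthetic, holdout_real)
    # Ratios well below 1 suggest synthetic points hug the training data (memorization risk).
    return np.median(d_train) / np.median(d_holdout)
```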
Compared with simple oversampling like SMOTE, model-based synthesis can ______.
capture nonlinear feature interactions across many variables
eliminate the need for validation
replace real data entirely
guarantee zero privacy risk
A common pitfall when evaluating synthetic data quality is relying only on ______.
visual clustering plots without quantitative tests
distributional tests
task performance
privacy audits
Starter
Great beginning—keep exploring the core ideas and key trade-offs.
Solid
Strong grasp—practice applying these choices to real data and workloads.
Expert!
Excellent—your decisions reflect production-grade mastery.