Benchmarking missing-values approaches for predictive models on health databases
CIMD presentation

Alexandre Perez-Lebel 1,2, Gaël Varoquaux 1,2, Marine Le Morvan 2, Julie Josse 2, Jean-Baptiste Poline 1

1 McGill University, Canada
2 Inria, France

October 11, 2021
Content

1. Overview
2. Introduction
3. Benchmark: methods benchmarked, datasets, protocol
4. Results: prediction performance, computational time, significance
5. Discussion: interpretation, limitations, conclusion
Overview
• Benchmark on real-world data.
• 4 health databases with missing values.
• Arbitrarily defined prediction tasks.
• Methods to handle missing values: Missing Incorporated in Attribute (MIA) vs. imputation.
• Evaluate prediction score and computational time.
Overview
Results:
• MIA performed better at little cost, but not always significantly.
• Conditional imputation is on a par with constant imputation.
• Complex imputation can be intractable at large scale.
Introduction
Introduction: scope of the study
Focus on supervised learning with missing values. The trade-offs differ from inference: risk minimization instead of parameter estimation.
In supervised learning, most statistical models and machine learning algorithms are not designed for incomplete data.
How to deal with missing values in this framework?
• Delete samples having missing values → to avoid.
• Use imputation:
  • Constant imputation (mean, median).
  • Conditional imputation (KNN, MICE).
• Adapt or create predictive models that handle missing values natively:
  • Boosted trees with the Missing Incorporated in Attribute (MIA) adaptation [Twala et al., 2008].
  • NeuMiss networks in the regression setting [Le Morvan et al., 2020].
Introduction: problem
• How does MIA experimentally compare to imputation?
• Constant imputation vs conditional imputation.
Imputation
Replace missing values with plausible values.
• Constant imputation: mean or median.
• Conditional imputation: X_mis ← E[X_mis | X_obs]; MICE [Buuren and Groothuis-Oudshoorn, 2010] or KNN.
Add a binary mask to keep track of imputed values.
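As an illustration, these strategies can be sketched with scikit-learn's imputers; this is a minimal sketch, not the benchmark's actual code (`IterativeImputer` is scikit-learn's MICE-style imputer):

```python
# Illustrative sketch of the imputation strategies (not the benchmark code).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# Constant imputation: column mean (or median); add_indicator appends the
# binary missingness mask as extra columns.
X_mean = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

# Conditional imputation: estimate E[X_mis | X_obs].
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
X_mice = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

print(X_mean.shape)  # 2 imputed features + 2 mask columns -> (4, 4)
```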
Missing Incorporated in Attribute (MIA)
Adaptation of boosted-trees to account for missing values.
Idea: for each split on a variable, all samples with a missing value in this variable are sent either to the left or to the right child node, depending on which option leads to the lowest risk.
Methods benchmarked
Table: Methods compared in the main experiment.
| In-article name | Imputer | Mask | Predictive model |
|---|---|---|---|
| MIA | - | - | Gradient-boosted trees |
| Mean | Mean | No | Gradient-boosted trees |
| Mean+mask | Mean | Yes | Gradient-boosted trees |
| Median | Median | No | Gradient-boosted trees |
| Median+mask | Median | Yes | Gradient-boosted trees |
| Iterative | MICE | No | Gradient-boosted trees |
| Iterative+mask | MICE | Yes | Gradient-boosted trees |
| KNN | KNN | No | Gradient-boosted trees |
| KNN+mask | KNN | Yes | Gradient-boosted trees |
Datasets
• Traumabase [The Traumabase Group, ], 20 000 samples.
• UK BioBank [Sudlow et al., ], 500 000 samples.
• MIMIC-III [Johnson et al., ], 60 000 samples.
• NHIS [National Center for Health Statistics, 2017], 88 000 samples.
Defined 13 prediction tasks (10 classification, 3 regression). Outcomes chosen arbitrarily.
Feature selection:
• ANOVA (11 tasks)
• Expert knowledge (2 tasks).
Figure: Types of features (categorical, ordinal, numerical) before encoding, per task and database (Traumabase, UKBB, MIMIC, NHIS).
Figure: Missing-values distribution: proportion of missing values per feature, for each task and database (Traumabase, UKBB, MIMIC, NHIS).
Table: Correlation between features (proportion above each threshold).

| Database | Task | # features | 0.1 | 0.2 | 0.3 |
|---|---|---|---|---|---|
| Traumabase | death screening | 92 | 68% | 41% | 22% |
| Traumabase | hemo | 12 | 50% | 23% | 12% |
| Traumabase | hemo screening | 76 | 65% | 36% | 20% |
| Traumabase | platelet screening | 90 | 67% | 40% | 22% |
| Traumabase | septic screening | 76 | 68% | 37% | 18% |
| UKBB | breast 25 | 11 | 40% | 20% | 19% |
| UKBB | breast screening | 100 | 26% | 12% | 8% |
| UKBB | fluid screening | 100 | 21% | 10% | 6% |
| UKBB | parkinson screening | 100 | 28% | 16% | 11% |
| UKBB | skin screening | 100 | 24% | 11% | 8% |
| MIMIC | hemo screening | 100 | 22% | 6% | 3% |
| MIMIC | septic screening | 100 | 21% | 6% | 2% |
| NHIS | income screening | 78 | 15% | 6% | 4% |
| Average | | 79 | 40% | 20% | 12% |
Experimental protocol
• 9 methods, 13 prediction tasks.
• One-hot encode categorical features.
• Feature selection trained on 1/3 of the samples: 5 trials, 100 features.
• Sub-sampled the tasks: 2 500, 10 000, 25 000 and 100 000 samples.
• Cross-validation.
• Tuned the hyper-parameters of the predictive model.
• Evaluate prediction with accuracy or R².
Results - Prediction performance
Figure: Prediction performance. Relative prediction scores of the 9 methods (MIA, Mean, Mean+mask, Median, Median+mask, Iterative, Iterative+mask, KNN, KNN+mask) at n = 2 500, 10 000, 25 000 and 100 000 samples, across the four databases (Traumabase, UKBB, MIMIC, NHIS).

Mean ranks: MIA 1.9, Mean+mask 2.8, Median+mask 3.1, Iterative+mask 3.7, Mean 5.0, Iterative 6.4, KNN+mask 6.4, Median 6.7, KNN 8.8.

For a specific task: relative prediction score = prediction score − average prediction score of the 9 methods.
Iterative = MICE.
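The relative prediction score is just a per-task centering, which can be sketched as follows (scores below are made up for illustration):

```python
# Relative prediction score: per task, subtract the average score of the
# methods so results are comparable across tasks.
import numpy as np

# Rows: methods, columns: tasks (hypothetical numbers).
scores = np.array([[0.80, 0.70],
                   [0.78, 0.72],
                   [0.76, 0.68]])
relative = scores - scores.mean(axis=0, keepdims=True)
print(relative)  # each column now sums to zero
```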
Results - Computational time
Figure: Computational time. Relative total training time of the 9 methods (MIA, Mean, Mean+mask, Median, Median+mask, Iterative, Iterative+mask, KNN, KNN+mask) at n = 2 500, 10 000, 25 000 and 100 000 samples, across the four databases.

For a specific task: relative training time = training time relative to the average training time of the 9 methods.
Results - Significance
Friedman test [Friedman, 1937]. Null hypothesis: “methods are equivalent”.

| Size | p-value |
|---|---|
| 2 500 | 1.6e-10 |
| 10 000 | 2.6e-10 |
| 25 000 | 2.8e-04 |
| 100 000 | 8.5e-03 |

Nemenyi test [Nemenyi, 1963]. Once the Friedman test is rejected, the Nemenyi test can be applied. It provides a critical difference CD, the minimal difference between the average ranks of two algorithms for them to be significantly different (N: number of datasets, k: number of algorithms):

CD = q_α · √(k(k + 1) / (6N))
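The critical difference is straightforward to compute. The q_α value below (3.102 for k = 9 methods at α = 0.05) is the commonly tabulated critical value and is an assumption here, not a number taken from the slides:

```python
# Nemenyi critical difference: CD = q_alpha * sqrt(k(k + 1) / (6N)).
import math

def critical_difference(q_alpha, k, n_datasets):
    """Minimal mean-rank gap for two of k algorithms to differ significantly."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

# k = 9 methods compared on N = 13 tasks; q_alpha assumed from the usual
# alpha = 0.05 table.
print(critical_difference(3.102, k=9, n_datasets=13))
```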
![Page 19: Benchmarking missing-values approaches for predictive](https://reader031.vdocuments.net/reader031/viewer/2022022606/6218878f855f3431387ece91/html5/thumbnails/19.jpg)
Figure: Mean ranks by method and by dataset size, with the Nemenyi critical distance. Panels: Size=2500 (N=13), Size=10000 (N=12), Size=25000 (N=7), Size=100000 (N=4). MIA ranks first at every size, followed by the masked imputation variants (Mean+mask, Median+mask, Iterative+mask); KNN ranks last where it ran.
Results - Significance
Same conclusions hold with the Wilcoxon one-sided signed-rank test [Wilcoxon, 1945].
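With SciPy, such a paired per-task comparison reads as follows (the scores below are made up for illustration):

```python
# Wilcoxon one-sided signed-rank test on paired per-task scores.
# H1: the first method scores higher than the second.
import numpy as np
from scipy.stats import wilcoxon

mia   = np.array([0.81, 0.77, 0.90, 0.65, 0.72, 0.88, 0.70])
other = np.array([0.79, 0.75, 0.89, 0.66, 0.70, 0.85, 0.69])

stat, p = wilcoxon(mia, other, alternative="greater")
print(p)
```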
Findings and interpretation
Findings:
• MIA takes the lead at little cost, although not always significantly.
• Adding the mask improves prediction.
Interpretation:
• Good imputation does not imply good prediction:
  • Low correlation between features.
  • Strong non-linear mechanisms.
  • Constant imputation provides a simple structure that the learner can exploit.
• The missingness is informative (MNAR, or the outcome depends on missingness) → imputation is not applicable.
Strengths and limitations
Limitations:
• Not every difference is significant.
• Would benefit from more datasets, and from more datasets with a large number of samples.
Strengths of the benchmark:
• 12 000 CPU hours.
• Lots of datasets (only 6% of empirical NeurIPS articles build upon more than 10 datasets [Bouthillier and Varoquaux, 2020]).
• Real data.
Conclusion
Conclusion
• Using MIA provides a small but systematic improvement over imputation.
• Complex imputation is intractable at large scale.
• Experiments suggest that missingness is informative: imputation is not well grounded.
• Directly handling missing values in the predictive model is worth considering.
• Change habits in practice: there are better choices than imputation.
Reviewers’ feedback

Manuscript submitted to GigaScience.

Some reviewer comments:
• What about multiple imputation?
• Break boxplots by task.
• Relative prediction score difficult to interpret.
Acknowledgments
Thank you for your attention.
Appendix
Introduction: the problem of missing values
• Missing values are omnipresent in real-world problems.
• They have long been studied in the statistical literature, within the inferential framework.
[Rubin, 1976] defined several missing values mechanisms:
• Missing At Random (MAR): the probability of a value being missing depends only on the observed variables.
• Missing Not At Random (MNAR): the missingness can depend on both the observed and unobserved values.
Most missing-values methods in inference rely on the MAR hypothesis, since theoretical results show that the mechanism can then be ignored. In practice, real data is often MNAR.
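A toy simulation makes the distinction concrete (the formulation below is assumed for illustration, not taken from the slides):

```python
# MAR vs MNAR in one toy example: the missingness of x2 depends either on
# the observed x1 (MAR) or on x2's own, unobserved value (MNAR).
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)  # always observed
x2 = rng.normal(size=1000)

x2_mar = np.where(x1 > 1.0, np.nan, x2)   # MAR: driven by observed x1
x2_mnar = np.where(x2 > 1.0, np.nan, x2)  # MNAR: driven by x2 itself

# Under MNAR the observed part of x2 is biased low (large values removed).
print(np.nanmean(x2_mar), np.nanmean(x2_mnar))
```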
Supplementary experiment
Figure: Prediction performance of a linear predictive model after imputation, compared to MIA. Methods: MIA, Linear+Mean, Linear+Mean+mask, Linear+Med, Linear+Med+mask, Linear+Iter, Linear+Iter+mask, Linear+KNN, Linear+KNN+mask, at n = 2 500, 10 000, 25 000 and 100 000 samples, across the four databases.

Mean ranks: MIA 1.3, Linear+Mean+mask 3.9, Linear+Med+mask 4.0, Linear+Iter+mask 4.6, Linear+Mean 5.5, Linear+Med 5.9, Linear+Iter 5.9, Linear+KNN+mask 6.2, Linear+KNN 7.8.
Supplementary experiment
Figure: Computational time. Relative total training time of MIA and the linear-model-with-imputation methods at n = 2 500, 10 000, 25 000 and 100 000 samples, across the four databases.
Results - Significance
Table: Wilcoxon one-sided signed rank test.
| Method | 2500 | 10000 | 25000 | 100000 |
|---|---|---|---|---|
| Mean | 1.2e-03** | 4.6e-02* | 2.3e-02* | 6.2e-02 |
| Mean+mask | 4.0e-02* | 2.3e-01 | 1.5e-01 | 6.2e-02 |
| Median | 5.2e-03* | 1.7e-03** | 2.3e-02* | 6.2e-02 |
| Median+mask | 4.0e-02* | 2.1e-01 | 1.5e-01 | 1.2e-01 |
| Iterative | 5.2e-03* | 3.2e-02* | 3.9e-02* | 6.2e-02 |
| Iterative+mask | 2.4e-02* | 2.1e-01 | 4.7e-01 | 6.2e-02 |
| KNN | 1.2e-04** | 2.4e-04** | 3.1e-02* | |
| KNN+mask | 1.2e-04** | 7.3e-04** | 3.1e-02* | |
| Linear+Mean | 6.1e-04** | 4.9e-04** | 7.8e-03* | 6.2e-02 |
| Linear+Mean+mask | 8.5e-04** | 7.3e-04** | 1.6e-02* | 6.2e-02 |
| Linear+Med | 6.1e-04** | 4.9e-04** | 7.8e-03* | 6.2e-02 |
| Linear+Med+mask | 6.1e-04** | 4.9e-04** | 1.6e-02* | 6.2e-02 |
| Linear+Iter | 3.1e-03** | 1.2e-03** | 1.6e-02* | 6.2e-02 |
| Linear+Iter+mask | 2.3e-03** | 1.2e-03** | 1.6e-02* | 6.2e-02 |
| Linear+KNN | 1.2e-04** | 2.4e-04** | 1.6e-02* | 5.0e-01 |
| Linear+KNN+mask | 1.2e-04** | 2.4e-04** | 3.1e-02* | 5.0e-01 |
References I
Bouthillier, X. and Varoquaux, G. (2020). Survey of machine-learning experimental methods at NeurIPS 2019 and ICLR 2020. Research report, Inria Saclay Île-de-France.

Buuren, S. v. and Groothuis-Oudshoorn, K. (2010). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, pages 1–68.

Friedman, M. (1937). The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. Journal of the American Statistical Association, 32(200):675–701.
References II
Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L.-w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Anthony Celi, L., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):160035.

Le Morvan, M., Josse, J., Moreau, T., Scornet, E., and Varoquaux, G. (2020). NeuMiss networks: differentiable programming for supervised learning with missing values. Advances in Neural Information Processing Systems, 33:5980–5990.

National Center for Health Statistics (2017). National Health Interview Survey (NHIS).

Nemenyi, P. (1963). Distribution-free Multiple Comparisons. Princeton University.
References III
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3):581–592.

Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M., Liu, B., Matthews, P., Ong, G., Pell, J., Silman, A., Young, A., Sprosen, T., Peakman, T., and Collins, R. UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLOS Medicine, 12(3):e1001779.

The Traumabase Group. Traumabase.
References IV
Twala, B. E. T. H., Jones, M. C., and Hand, D. J. (2008). Good methods for coping with missing data in decision trees. Pattern Recognition Letters, 29:950–956.

Wilcoxon, F. (1945). Individual Comparisons by Ranking Methods. Biometrics Bulletin, 1(6):80–83.