Advanced Statistical Learning
Chapter 0: Organisation & General Information
Bernd Bischl, Julia Moosbauer, Andreas Groll
Department of Statistics – TU Dortmund
Winter term 2020/21
Acknowledgement
Special thanks go to Prof. Dr. Bernd Bischl and Ms. Julia Moosbauer from LMU Munich for providing materials from their courses on Advanced
computer-intensive methods and Deep learning.
Bernd Bischl, Julia Moosbauer, Andreas Groll © Winter term 2020/21 Advanced Statistical Learning – 1 / 6
ORGANISATION
Lecture
Tuesday, 4-6 pm, temporarily on Zoom (EF 50/HS3)
Thursday, 4-6 pm, temporarily on Zoom (EF 50/HS3)
Andreas Groll ([email protected])
Exercises
Monday, 12-2 pm, M/E 21 (still planned in physical presence)
Thursday, 2-4 pm, Zoom
Guillermo Briseno Sánchez ([email protected]), Dmitri Artjuch, Jonas Glowinski, Yuliya Shrub
ORGANISATION
Website
https://moodle.tu-dortmund.de/course/view.php?id=21855
Password: ASL2021
Written exams (in physical presence)
Date 1: ??
Date 2: ??
Exercises
At least 50% of the points on the exercise sheets are required; students should form groups of 3-4 members, which should remain the same throughout the semester. For more details see the info sheet in Moodle.
GOALS
learn to understand advanced models and analysis methods and be aware of their limitations
become able to adapt methods to unusually structured data
learn to choose appropriate methods for real data and apply them by means of statistical software
understand the underlying mathematical theory
Partly builds upon the BA course “Introduction to Statistical Learning”.
⇒ For a short recap consult the pre-course eLearning slides with videos by Alexander Gerharz:
https://moodle.tu-dortmund.de/course/view.php?id=16736
(Password: vlds)
(PRELIMINARY PLANNED) CONTENT
1 Introduction and Formalization
2 Classification Tasks
3 Hypothesis Spaces and Capacity
4 Risk Minimization
5 Univariate and Linear Modeling
6 Information Theory
7 (Multiclass Classification)
8 Curse of Dimensionality
9 Boosting
10 Advanced Performance Evaluation
11 Automatic Machine Learning
12 Feature Selection
13 Imbalanced Classification Problems
14 Advanced Trees and Forests
15 Deep Learning / Neural Networks
LITERATURE
References
Hastie, T., Tibshirani, R., and Friedman, J. (2009). "The elements of statistical learning: data mining, inference, and prediction." New York: Springer.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). "An introduction to statistical learning." New York: Springer.
Alpaydin, E. (2009). "Introduction to machine learning." MIT Press.
Bishop, C. M. (2006). "Pattern recognition and machine learning." Springer.
Shalev-Shwartz, S., and Ben-David, S. (2014). "Understanding machine learning: From theory to algorithms." Cambridge University Press.
Murphy, K. P. (2012). "Machine learning: A probabilistic perspective." MIT Press.
Chapter 0: Data Descriptions
Boston Housing Dataset
BOSTON HOUSING DATA SET
A widely used dataset for benchmarking algorithms is the Boston housing dataset. The data was originally published in 1978 by David Harrison and Daniel Rubinfeld in "Hedonic Housing Prices and the Demand for Clean Air". This paper investigates the methodological problems associated with the use of housing market data to measure the willingness to pay for clean air.
BOSTON HOUSING DATA SET
Example Data: Boston Housing
Variable  Description
medv      median value of owner-occupied homes in USD 1000's
crim      per capita crime rate by town
zn        proportion of residential land zoned for lots over 25,000 sq. ft.
indus     proportion of non-retail business acres per town
chas      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox       nitric oxides concentration (parts per 10 million)
rm        average number of rooms per dwelling
age       proportion of owner-occupied units built prior to 1940
dis       weighted distances to five Boston employment centres
rad       index of accessibility to radial highways
tax       full-value property-tax rate per USD 10,000
ptratio   pupil-teacher ratio by town
b         1000(B − 0.63)^2, where B is the prop. of blacks by town
lstat     percentage of lower status of the population
506 obs., 13 features, medv numerical target.
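As a small worked example of the transformed feature b, the following sketch evaluates 1000(B − 0.63)^2 for a few proportions. The helper name transform_b and the input values are assumptions chosen for illustration; they are not taken from the dataset.

```r
# Transformation behind the feature `b` in the Boston housing data:
# b = 1000 * (B - 0.63)^2, where B is a proportion in [0, 1].
# The inputs below are illustrative values, not observations from the data.
transform_b = function(B) {
  stopifnot(all(B >= 0 & B <= 1))
  1000 * (B - 0.63)^2
}

B = c(0, 0.13, 0.63, 1)
b = transform_b(B)
print(round(b, 2))
```

Note that the map is not one-to-one: because of the square, two different proportions on either side of 0.63 can yield the same value of b, so B cannot be recovered from b alone.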
IMPORTING THE DATA
We use OpenML (R package) to download the dataset in a machine-readable format and convert it into a data.frame:
# load the dataset from OpenML Library
d = OpenML::getOMLDataSet(data.id = 531)
# convert the OpenML object to a tibble (enhanced data.frame)
bh = tibble::as_tibble(d)
EXPLORATORY DATA ANALYSIS
print(bh)
## # A tibble: 506 x 14
##       CRIM    ZN INDUS CHAS    NOX    RM   AGE   DIS RAD     TAX
##      <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
##  1 0.00632  18    2.31 0     0.538  6.58  65.2  4.09 1       296
##  2 0.0273    0    7.07 0     0.469  6.42  78.9  4.97 2       242
##  3 0.0273    0    7.07 0     0.469  7.18  61.1  4.97 2       242
##  4 0.0324    0    2.18 0     0.458  7.00  45.8  6.06 3       222
##  5 0.0690    0    2.18 0     0.458  7.15  54.2  6.06 3       222
##  6 0.0298    0    2.18 0     0.458  6.43  58.7  6.06 3       222
##  7 0.0883   12.5  7.87 0     0.524  6.01  66.6  5.56 5       311
##  8 0.145    12.5  7.87 0     0.524  6.17  96.1  5.95 5       311
##  9 0.211    12.5  7.87 0     0.524  5.63 100    6.08 5       311
## 10 0.170    12.5  7.87 0     0.524  6.00  85.9  6.59 5       311
## # ... with 496 more rows, and 4 more variables: PTRATIO <dbl>,
## #   B <dbl>, LSTAT <dbl>, MEDV <dbl>
EXPLORATORY DATA ANALYSIS
Factor variables
library(magrittr)  # provides the pipe operator %>%
skimr::partition(skimr::skim(bh)) %>%
  .$factor %>%
  knitr::kable(format = 'latex', booktabs = TRUE) %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 7)
skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
CHAS           0          1              FALSE    2         0: 471, 1: 35
RAD            0          1              FALSE    9         24: 132, 5: 115, 4: 110, 3: 38
EXPLORATORY DATA ANALYSIS
Barplots of discrete features
DataExplorer::plot_bar(bh, ggtheme = ggpubr::theme_pubr())
[Figure: bar plots of the discrete features CHAS and RAD; x-axis: Frequency]
EXPLORATORY DATA ANALYSIS
Numeric variables
skimr::partition(skimr::skim(bh)) %>%
  .$numeric %>%
  dplyr::select(skim_variable, n_missing, mean, sd) %>%
  knitr::kable(format = 'latex', booktabs = TRUE) %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 6)
skim_variable  n_missing  mean     sd
CRIM           0          3.614    8.602
ZN             0          11.364   23.322
INDUS          0          11.137   6.860
NOX            0          0.555    0.116
RM             0          6.285    0.703
AGE            0          68.575   28.149
DIS            0          3.795    2.106
TAX            0          408.237  168.537
PTRATIO        0          18.456   2.165
B              0          356.674  91.295
LSTAT          0          12.653   7.141
MEDV           0          22.533   9.197
EXPLORATORY DATA ANALYSIS
Histograms of numerical features
DataExplorer::plot_histogram(bh, ggtheme = ggpubr::theme_pubr())
[Figure: histograms of the numerical features AGE, B, CRIM, DIS, INDUS, LSTAT, MEDV, NOX, PTRATIO, RM, TAX, ZN; x-axis: value, y-axis: Frequency]
EXPLORATORY DATA ANALYSIS
It is always useful to check the correlation among variables:
DataExplorer::plot_correlation(bh, ggtheme = ggpubr::theme_pubr(base_size = 10),
  type = "c", cor_args = list(use = "complete.obs"))
[Figure: correlation heatmap of the numerical features; colour scale "Correlation Meter" from −1.0 to 1.0; values as follows]
          CRIM    ZN      INDUS   NOX     RM      AGE     DIS     TAX     PTRATIO B       LSTAT   MEDV
CRIM      1       -0.2    0.41    0.42    -0.22   0.35    -0.38   0.58    0.29    -0.39   0.46    -0.39
ZN        -0.2    1       -0.53   -0.52   0.31    -0.57   0.66    -0.31   -0.39   0.18    -0.41   0.36
INDUS     0.41    -0.53   1       0.76    -0.39   0.64    -0.71   0.72    0.38    -0.36   0.6     -0.48
NOX       0.42    -0.52   0.76    1       -0.3    0.73    -0.77   0.67    0.19    -0.38   0.59    -0.43
RM        -0.22   0.31    -0.39   -0.3    1       -0.24   0.21    -0.29   -0.36   0.13    -0.61   0.7
AGE       0.35    -0.57   0.64    0.73    -0.24   1       -0.75   0.51    0.26    -0.27   0.6     -0.38
DIS       -0.38   0.66    -0.71   -0.77   0.21    -0.75   1       -0.53   -0.23   0.29    -0.5    0.25
TAX       0.58    -0.31   0.72    0.67    -0.29   0.51    -0.53   1       0.46    -0.44   0.54    -0.47
PTRATIO   0.29    -0.39   0.38    0.19    -0.36   0.26    -0.23   0.46    1       -0.18   0.37    -0.51
B         -0.39   0.18    -0.36   -0.38   0.13    -0.27   0.29    -0.44   -0.18   1       -0.37   0.33
LSTAT     0.46    -0.41   0.6     0.59    -0.61   0.6     -0.5    0.54    0.37    -0.37   1       -0.74
MEDV      -0.39   0.36    -0.48   -0.43   0.7     -0.38   0.25    -0.47   -0.51   0.33    -0.74   1
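The correlation plot passes use = "complete.obs" to the underlying correlation computation. A minimal base-R sketch of what that option does is given below. It assumes R's built-in airquality data as a stand-in, because that data frame ships with R and contains missing values; it is not one of the course datasets.

```r
# Pairwise Pearson correlations with base R, restricted to complete rows.
# `airquality` is a stand-in dataset shipped with R, not a course dataset.
num = airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# use = "complete.obs" drops every row that has at least one NA
cor_mat = cor(num, use = "complete.obs")
print(round(cor_mat, 2))

# number of rows that actually enter the computation
n_used = sum(complete.cases(num))
print(n_used)
```

With this option every coefficient is computed on the same set of rows, at the price of discarding any observation with a missing value in any of the selected columns.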
Credit Data Set
CREDIT DATA SET
The German Credit Dataset is a research dataset from the University of Hamburg, dating from 1994 and donated by Prof. Hans Hofmann.
Each entry represents a person who takes out a loan from a bank. Each person is classified as a "good" or "bad" credit risk according to the set of attributes.
CREDIT DATA SET
Variable                Description
class                   "good" | "bad"
checking_status         Status of existing checking account
duration                Duration in months
credit_history          Credit history
credit_amount           Amount of the desired credit
savings_status          Amount in savings account
employment              Present employment since, in years
installment_commitment  Installment rate in percentage of disposable income
personal_status         Personal status and sex
other_parties           Other debtors or guarantors
residence_since         Current residence since, in years
age                     Age in years
other_payment_plans     Other installment plans
existing_credits        Number of existing credits at this bank
job                     Current job
num_dependents          Number of people being liable to provide maintenance for
own_telephone           Telephone (yes|no)
foreign_worker          Foreign worker (yes|no)
1000 obs., 21 columns (20 features plus the binary target class).
IMPORTING THE DATA
We use OpenML (R package) to download the dataset in a machine-readable format and convert it into a data.frame:
# load the dataset from OpenML Library
d = OpenML::getOMLDataSet(data.id = 31)
# convert the OpenML object to a tibble (enhanced data.frame)
credit = tibble::as_tibble(d)
EXPLORATORY DATA ANALYSIS
print(head(credit))
## # A tibble: 6 x 21
##   checking_status duration credit_history purpose credit_amount savings_status employment
##   <fct>              <dbl> <fct>          <fct>           <dbl> <fct>          <fct>
## 1 <0                     6 critical/othe~ radio/~          1169 no known savi~ >=7
## 2 0<=X<200              48 existing paid  radio/~          5951 <100           1<=X<4
## 3 no checking           12 critical/othe~ educat~          2096 <100           4<=X<7
## 4 <0                    42 existing paid  furnit~          7882 <100           4<=X<7
## 5 <0                    24 delayed previ~ new car          4870 <100           1<=X<4
## 6 no checking           36 existing paid  educat~          9055 no known savi~ 1<=X<4
## # ... with 14 more variables: installment_commitment <dbl>, personal_status <fct>,
## #   other_parties <fct>, residence_since <dbl>, property_magnitude <fct>, age <dbl>,
## #   other_payment_plans <fct>, housing <fct>, existing_credits <dbl>, job <fct>,
## #   num_dependents <dbl>, own_telephone <fct>, foreign_worker <fct>, class <fct>
EXPLORATORY DATA ANALYSIS
Factor Variables
skimr::partition(skimr::skim(credit)) %>%
  .$factor %>%
  knitr::kable(format = 'latex', booktabs = TRUE) %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 5)
skim_variable        n_missing  complete_rate  ordered  n_unique  top_counts
checking_status      0          1              FALSE    4         no : 394, <0: 274, 0<=: 269, >=2: 63
credit_history       0          1              FALSE    5         exi: 530, cri: 293, del: 88, all: 49
purpose              0          1              FALSE    10        rad: 280, new: 234, fur: 181, use: 103
savings_status       0          1              FALSE    5         <10: 603, no : 183, 100: 103, 500: 63
employment           0          1              FALSE    5         1<=: 339, >=7: 253, 4<=: 174, <1: 172
personal_status      0          1              FALSE    4         mal: 548, fem: 310, mal: 92, mal: 50
other_parties        0          1              FALSE    3         non: 907, gua: 52, co : 41
property_magnitude   0          1              FALSE    4         car: 332, rea: 282, lif: 232, no : 154
other_payment_plans  0          1              FALSE    3         non: 814, ban: 139, sto: 47
housing              0          1              FALSE    3         own: 713, ren: 179, for: 108
job                  0          1              FALSE    4         ski: 630, uns: 200, hig: 148, une: 22
own_telephone        0          1              FALSE    2         non: 596, yes: 404
foreign_worker       0          1              FALSE    2         yes: 963, no: 37
class                0          1              FALSE    2         goo: 700, bad: 300
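The counts reported for the target class (700 good, 300 bad) show that the two classes are not balanced. The sketch below rebuilds a factor with exactly these counts and derives the class proportions in base R. The variable names and the majority-class baseline are illustrative additions, not part of the original slides.

```r
# Class frequencies and proportions for a binary target.
# The counts are the ones reported for `class` on the slide (700 good, 300 bad);
# the factor is rebuilt here for illustration instead of being downloaded.
target = factor(rep(c("good", "bad"), times = c(700, 300)),
                levels = c("good", "bad"))

counts = table(target)
props = prop.table(counts)
print(counts)
print(props)

# accuracy obtained by always predicting the majority class
baseline_acc = max(props)
print(baseline_acc)
```

Any classifier for this task should be compared against such a trivial baseline rather than against an accuracy of 50%.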
EXPLORATORY DATA ANALYSIS
Barplots of discrete features
DataExplorer::plot_bar(credit[, 1:14], ggtheme = ggpubr::theme_pubr(base_size = 7.5))
[Figure: bar plots of checking_status, credit_history, purpose, savings_status, employment, personal_status, other_parties, property_magnitude, other_payment_plans; x-axis: Frequency]
EXPLORATORY DATA ANALYSIS
Barplots of discrete features
DataExplorer::plot_bar(credit[, 15:21], ggtheme = ggpubr::theme_pubr(base_size = 9))
[Figure: bar plots of housing, job, own_telephone, foreign_worker, class, num_dependents; x-axis: Frequency]
EXPLORATORY DATA ANALYSIS
Numerical Variables
skimr::partition(skimr::skim(credit)) %>%
  .$numeric %>%
  dplyr::select(skim_variable, n_missing, mean, sd) %>%
  knitr::kable(format = 'latex', booktabs = TRUE)
skim_variable           n_missing  mean     sd
duration                0          20.90    12.059
credit_amount           0          3271.26  2822.737
installment_commitment  0          2.97     1.119
residence_since         0          2.85     1.104
age                     0          35.55    11.375
existing_credits        0          1.41     0.578
num_dependents          0          1.16     0.362
EXPLORATORY DATA ANALYSIS
Histograms of numerical features
DataExplorer::plot_histogram(credit, ggtheme = ggpubr::theme_pubr())
[Figure: histograms of age, credit_amount, duration, existing_credits, installment_commitment, residence_since; x-axis: value, y-axis: Frequency]
EXPLORATORY DATA ANALYSIS
It is always useful to check the correlation among variables:
DataExplorer::plot_correlation(credit, ggtheme = ggpubr::theme_pubr(base_size = 10),
  type = "c", cor_args = list(use = "complete.obs"))
[Figure: correlation heatmap of the numerical features; colour scale "Correlation Meter" from −1.0 to 1.0; values as follows]
                        duration  credit_amount  installment_commitment  residence_since  age    existing_credits  num_dependents
duration                1         0.62           0.07                    0.03             -0.04  -0.01             -0.02
credit_amount           0.62      1              -0.27                   0.03             0.03   0.02              0.02
installment_commitment  0.07      -0.27          1                       0.05             0.06   0.02              -0.07
residence_since         0.03      0.03           0.05                    1                0.27   0.09              0.04
age                     -0.04     0.03           0.06                    0.27             1      0.15              0.12
existing_credits        -0.01     0.02           0.02                    0.09             0.15   1                 0.11
num_dependents          -0.02     0.02           -0.07                   0.04             0.12   0.11              1
Iris Data Set
IRIS DATA SET
The iris dataset was introduced by the statistician Ronald Fisher and is one of the most frequently used datasets. Originally it was designed for linear discriminant analysis. The set is a typical test case for many statistical classification techniques and has its own Wikipedia page.
[Figure: photographs of the three species Setosa, Versicolor, Virginica]
Source: https://en.wikipedia.org/wiki/Iris_flower_data_set
IMPORTING THE DATA
We use OpenML (R package) to download the dataset in a machine-readable format and convert it into a data.frame:
# load the dataset from OpenML Library
d = OpenML::getOMLDataSet(data.id = 61)
# convert the OpenML object to a tibble (enhanced data.frame)
iris = tibble::as_tibble(d)
IMPORTING THE DATA
150 iris flowers (50 setosa, 50 versicolor, 50 virginica); the species is to be predicted.
Sepal length / width and petal length / width in [cm].
Source: https://holgerbrandl.github.io/kotlin4ds_kotlin_night_frankfurt//krangl_example_report.html
IMPORTING THE DATA
print(iris)
## # A tibble: 150 x 5
##    sepallength sepalwidth petallength petalwidth class
##          <dbl>      <dbl>       <dbl>      <dbl> <fct>
##  1         5.1        3.5         1.4        0.2 Iris-setosa
##  2         4.9        3           1.4        0.2 Iris-setosa
##  3         4.7        3.2         1.3        0.2 Iris-setosa
##  4         4.6        3.1         1.5        0.2 Iris-setosa
##  5         5          3.6         1.4        0.2 Iris-setosa
##  6         5.4        3.9         1.7        0.4 Iris-setosa
##  7         4.6        3.4         1.4        0.3 Iris-setosa
##  8         5          3.4         1.5        0.2 Iris-setosa
##  9         4.4        2.9         1.4        0.2 Iris-setosa
## 10         4.9        3.1         1.5        0.1 Iris-setosa
## # ... with 140 more rows
EXPLORATORY DATA ANALYSIS
Factor variables
skimr::partition(skimr::skim(iris)) %>%
  .$factor %>%
  knitr::kable(format = 'latex', booktabs = TRUE) %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 7)
skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
class          0          1              FALSE    3         Iri: 50, Iri: 50, Iri: 50
EXPLORATORY DATA ANALYSIS
Barplots of discrete features
DataExplorer::plot_bar(iris, ggtheme = ggpubr::theme_pubr())
[Figure: bar plot of class with the levels Iris-setosa, Iris-versicolor, Iris-virginica; x-axis: Frequency]
EXPLORATORY DATA ANALYSIS
Numeric variables
skimr::partition(skimr::skim(iris)) %>%
  .$numeric %>%
  dplyr::select(skim_variable, n_missing, mean, sd) %>%
  knitr::kable(format = 'latex', booktabs = TRUE) %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 8)
skim_variable  n_missing  mean  sd
sepallength    0          5.84  0.828
sepalwidth     0          3.05  0.434
petallength    0          3.76  1.764
petalwidth     0          1.20  0.763
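The same kind of summary (number of missing values, mean, standard deviation) can be sketched without extra packages. The example below assumes R's built-in iris data frame, whose column names (Sepal.Length, ...) and a few individual values differ slightly from the OpenML copy used above, so the results need not match the table digit for digit.

```r
# Base-R sketch of the summary table (n_missing, mean, sd).
# Uses R's built-in `iris`, not the OpenML copy from the slides.
num_cols = iris[sapply(iris, is.numeric)]

summary_tab = data.frame(
  skim_variable = names(num_cols),
  n_missing = sapply(num_cols, function(x) sum(is.na(x))),
  mean = sapply(num_cols, mean, na.rm = TRUE),
  sd = sapply(num_cols, sd, na.rm = TRUE),
  row.names = NULL
)
print(summary_tab, digits = 3)
```

The column name skim_variable is reused here only to mirror the table layout above.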
EXPLORATORY DATA ANALYSIS
Histograms of numerical features
DataExplorer::plot_histogram(iris, ggtheme = ggpubr::theme_pubr())
[Figure: histograms of petallength, petalwidth, sepallength, sepalwidth; x-axis: value, y-axis: Frequency]
Ozone Data Set
OZONE DATA SET
All measurements were taken in the area of Upland, CA, east of Los Angeles.
The aim is to predict the daily maximum one-hour-average ozone reading (V4).
Introduced in Breiman, L. and Friedman, J. (1985), "Estimating Optimal Transformations for Multiple Regression and Correlation", Journal of the American Statistical Association, 80, 580-598.
IMPORTING THE DATA
We load the data from the mlbench package:
library(mlbench)
data(Ozone)
# convert the data.frame to a tibble (enhanced data.frame)
ozone = tibble::as_tibble(Ozone)
The variables are: date (month, day of month, day of week), ozone, pressure height, wind speed, humidity, temperature (at two sites), inversion base height and temperature, pressure gradient, and visibility.
EXPLORATORY DATA ANALYSIS
print(ozone)
## # A tibble: 366 x 13
##    V1    V2    V3       V4    V5    V6    V7    V8    V9   V10   V11   V12   V13
##    <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 1     1     4         3  5480     8    20    NA  NA    5000   -15  30.6   200
##  2 1     2     5         3  5660     6    NA    38  NA      NA   -14  NA     300
##  3 1     3     6         3  5710     4    28    40  NA    2693   -25  47.7   250
##  4 1     4     7         5  5700     3    37    45  NA     590   -24  55.0   100
##  5 1     5     1         5  5760     3    51    54  45.3  1450    25  57.0    60
##  6 1     6     2         6  5720     4    69    35  49.6  1568    15  53.8    60
##  7 1     7     3         4  5790     6    19    45  46.4  2631   -33  54.1   100
##  8 1     8     4         4  5790     3    25    55  52.7   554   -28  64.8   250
##  9 1     9     5         6  5700     3    73    41  48.0  2083    23  52.5   120
## 10 1     10    6         7  5700     3    59    44  NA    2654    -2  48.4   120
## # ... with 356 more rows
EXPLORATORY DATA ANALYSIS
Variable  Description
V1        Month: 1 = January, ..., 12 = December
V2        Day of month
V3        Day of week: 1 = Monday, ..., 7 = Sunday
V4        Daily maximum one-hour-average ozone reading
V5        500 millibar pressure height (m) measured at Vandenberg AFB
V6        Wind speed (mph) at Los Angeles International Airport (LAX)
V7        Humidity (%) at LAX
V8        Temperature (degrees F) measured at Sandburg, CA
V9        Temperature (degrees F) measured at El Monte, CA
V10       Inversion base height (feet) at LAX
V11       Pressure gradient (mm Hg) from LAX to Daggett, CA
V12       Inversion base temperature (degrees F) at LAX
V13       Visibility (miles) measured at LAX
366 obs., 13 columns (12 features plus the numeric target V4).
EXPLORATORY DATA ANALYSIS
Factor Variables
skimr::partition(skimr::skim(ozone)) %>%
  .$factor %>%
  knitr::kable(format = 'latex', booktabs = TRUE) %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 7)
skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
V1             0          1              FALSE    12        1: 31, 3: 31, 5: 31, 7: 31
V2             0          1              FALSE    31        1: 12, 2: 12, 3: 12, 4: 12
V3             0          1              FALSE    7         4: 53, 5: 53, 1: 52, 2: 52
EXPLORATORY DATA ANALYSIS
Numerical Variables
skimr::partition(skimr::skim(ozone)) %>%
  .$numeric %>%
  dplyr::select(skim_variable, n_missing, mean, sd) %>%
  knitr::kable(format = 'latex', booktabs = TRUE) %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 7)
skim_variable  n_missing  mean     sd
V4             5          11.53    7.92
V5             12         5752.97  104.99
V6             0          4.87     2.12
V7             15         58.48    19.76
V8             2          61.91    14.28
V9             139        56.85    11.66
V10            15         2590.94  1796.83
V11            1          17.80    36.11
V12            14         60.93    13.87
V13            0          123.30   80.28
EXPLORATORY DATA ANALYSIS
Histograms of numerical features
DataExplorer::plot_histogram(ozone, ggtheme = ggpubr::theme_pubr(base_size = 9))
[Figure: histograms of V4, V5, V6, V7, V8, V9, V10, V11, V12, V13; x-axis: value, y-axis: Frequency]
EXPLORATORY DATA ANALYSIS
It seems that quite a few values are missing. Let's take a closer look:
DataExplorer::plot_missing(ozone, ggtheme = ggpubr::theme_pubr())
[Figure: percentage of missing rows per feature: V9 37.98%, V7 4.1%, V10 4.1%, V12 3.83%, V5 3.28%, V4 1.37%, V8 0.55%, V11 0.27%, and 0% for V1, V2, V3, V6, V13; x-axis: Missing Rows; legend "Band": Good, OK]
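The percentages shown in the missing-value plot can also be obtained with base R alone. The sketch below assumes R's built-in airquality data as a stand-in for the ozone data, since it ships with R and also contains missing values; the numbers therefore differ from those in the plot.

```r
# Share of missing values per column, in percent, sorted in decreasing order.
# `airquality` (shipped with R) is used as a stand-in for the ozone data.
missing_pct = sort(colMeans(is.na(airquality)) * 100, decreasing = TRUE)
print(round(missing_pct, 2))

# number of rows without any missing value
n_complete = sum(complete.cases(airquality))
print(n_complete)
```

Comparing the per-column shares with the number of complete rows shows how much data a complete-case analysis would discard.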
PID Data Set
PID DATA SET
The response (diabetes) indicates whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2-hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care).
Variable  Description
diabetes  "neg" = No, "pos" = Yes
pregnant  Number of times pregnant
glucose   Plasma glucose concentration at 2 hours in an oral glucose tolerance test
pressure  Diastolic blood pressure (mm Hg)
triceps   Triceps skin fold thickness (mm)
insulin   2-hour serum insulin
mass      Body mass index (weight in kg / (height in m)^2)
pedigree  Diabetes pedigree function
age       Age in years
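The feature mass is a derived quantity. As a brief worked example, the body mass index formula (weight in kg divided by the squared height in m) can be written as follows; the function name bmi and the input values are illustrative assumptions, not values from the dataset.

```r
# Body mass index: weight in kg divided by the squared height in m.
# The example values are illustrative, not taken from the dataset.
bmi = function(weight_kg, height_m) {
  stopifnot(all(weight_kg > 0), all(height_m > 0))
  weight_kg / height_m^2
}

print(round(bmi(weight_kg = 70, height_m = 1.75), 2))
```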
Sonar Data Set
SONAR DATA SET
This is the data set used by Gorman and Sejnowski in their study ("Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets" in Neural Networks, Vol. 1, pp. 75-89) of the classification of sonar signals using a neural network. The task is to train a network to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock.
attribute_[1-60]: Each variable represents the energy within a particular frequency band, integrated over a certain period of time. The integration aperture for higher frequencies occurs later in time, since these frequencies are transmitted later during the chirp.
class: "Rock" / "Mine (metal cylinder)"
The numbers in the labels are in increasing order of aspect angle, butthey do not encode the angle directly.
IMPORTING THE DATA
We use OpenML (R package) to download the dataset in a machine-readable format and convert it into a data.frame:
# load the dataset from OpenML Library
d = OpenML::getOMLDataSet(data.id = 40)
# convert the OpenML object to a tibble (enhanced data.frame)
sonar = tibble::as_tibble(d)
EXPLORATORY DATA ANALYSIS
print(head(sonar, n = 2L))
## # A tibble: 2 x 61
##   attribute_1 attribute_2 attribute_3 attribute_4 attribute_5 attribute_6 attribute_7
##         <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
## 1      0.02        0.0371      0.0428      0.0207      0.0954      0.0986       0.154
## 2      0.0453      0.0523      0.0843      0.0689      0.118       0.258        0.216
## # ... with 54 more variables: attribute_8 <dbl>, attribute_9 <dbl>, attribute_10 <dbl>,
## #   attribute_11 <dbl>, attribute_12 <dbl>, attribute_13 <dbl>, attribute_14 <dbl>,
## #   attribute_15 <dbl>, attribute_16 <dbl>, attribute_17 <dbl>, attribute_18 <dbl>,
## #   attribute_19 <dbl>, attribute_20 <dbl>, attribute_21 <dbl>, attribute_22 <dbl>,
## #   attribute_23 <dbl>, attribute_24 <dbl>, attribute_25 <dbl>, attribute_26 <dbl>,
## #   attribute_27 <dbl>, attribute_28 <dbl>, attribute_29 <dbl>, attribute_30 <dbl>,
## #   attribute_31 <dbl>, attribute_32 <dbl>, attribute_33 <dbl>, attribute_34 <dbl>,
## #   attribute_35 <dbl>, attribute_36 <dbl>, attribute_37 <dbl>, attribute_38 <dbl>,
## #   attribute_39 <dbl>, attribute_40 <dbl>, attribute_41 <dbl>, attribute_42 <dbl>,
## #   attribute_43 <dbl>, attribute_44 <dbl>, attribute_45 <dbl>, attribute_46 <dbl>,
## #   attribute_47 <dbl>, attribute_48 <dbl>, attribute_49 <dbl>, attribute_50 <dbl>,
## #   attribute_51 <dbl>, attribute_52 <dbl>, attribute_53 <dbl>, attribute_54 <dbl>,
## #   attribute_55 <dbl>, attribute_56 <dbl>, attribute_57 <dbl>, attribute_58 <dbl>,
## #   attribute_59 <dbl>, attribute_60 <dbl>, Class <fct>
EXPLORATORY DATA ANALYSIS
Factor Variables
skimr::partition(skimr::skim(sonar)) %>%
  .$factor %>%
  knitr::kable(format = 'latex', booktabs = TRUE) %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 5)
skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Class          0          1              FALSE    2         Min: 111, Roc: 97
EXPLORATORY DATA ANALYSIS
Barplots of discrete features
DataExplorer::plot_bar(sonar, ggtheme = ggpubr::theme_pubr(base_size = 7))
[Figure: bar plot of Class with the levels Rock and Mine; x-axis: Frequency]
EXPLORATORY DATA ANALYSIS
Numerical Variables
skimr::partition(skimr::skim(sonar[, 1:15])) %>%
  .$numeric %>%
  dplyr::select(skim_variable, n_missing, mean, sd) %>%
  knitr::kable(format = 'latex', booktabs = TRUE) %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 4)
skim_variable n_missing mean sd
attribute_1 0 0.029 0.023
attribute_2 0 0.038 0.033
attribute_3 0 0.044 0.038
attribute_4 0 0.054 0.047
attribute_5 0 0.075 0.056
attribute_6 0 0.105 0.059
attribute_7 0 0.122 0.062
attribute_8 0 0.135 0.085
attribute_9 0 0.178 0.118
attribute_10 0 0.208 0.134
attribute_11 0 0.236 0.133
attribute_12 0 0.250 0.140
attribute_13 0 0.273 0.141
attribute_14 0 0.297 0.164
attribute_15 0 0.320 0.205
EXPLORATORY DATA ANALYSIS
Histograms of numerical features
DataExplorer::plot_histogram(sonar[, 1:16], ggtheme = ggpubr::theme_pubr(base_size = 7))
[Figure: histograms of attribute_1 to attribute_16; x-axis: value, y-axis: Frequency]
EXPLORATORY DATA ANALYSIS
Histograms of numerical features
DataExplorer::plot_histogram(sonar[, 17:32], ggtheme = ggpubr::theme_pubr(base_size = 7))
[Figure: histograms of attribute_17 to attribute_32; x-axis: value, y-axis: Frequency]
EXPLORATORY DATA ANALYSISHistograms of numerical features
plot_histogram(sonar[, 33:48], ggtheme = ggpubr::theme_pubr(base_size = 7))
[Figure: histograms of attribute_33 to attribute_48; x-axis: value, y-axis: Frequency]
EXPLORATORY DATA ANALYSIS
Histograms of numerical features
plot_histogram(sonar[, 49:61], ggtheme = ggpubr::theme_pubr(base_size = 7))
[Figure: histograms of attribute_49 to attribute_60; x-axis: value, y-axis: Frequency]
EXPLORATORY DATA ANALYSIS
Let’s take a look at the correlation among the variables:
DataExplorer::plot_correlation(sonar, ggtheme = ggpubr::theme_pubr(base_size = 7))
[Figure: correlation heatmap of attribute_1 to attribute_60, Class_Mine and Class_Rock; both axes: Features; colour scale: correlation from -1.0 to 1.0]
Spam Data Set
SPAM DATA SET
A data set collected at Hewlett-Packard Labs that classifies 4601 e-mails as spam or non-spam (variable "class").
The spam dataset is one of the datasets used in The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
Besides the option to import it from OpenML, it also comes as an example dataset in the packages ElemStatLearn and kernlab.
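As an alternative to OpenML, the copy bundled with kernlab can be loaded directly. The following is only a sketch and assumes that the kernlab package is installed; note that in the kernlab copy the target column is called `type` (levels `nonspam` and `spam`) rather than `class`.

```r
# Sketch: load the spam data shipped with kernlab (assumes kernlab is installed)
if (requireNamespace("kernlab", quietly = TRUE)) {
  data("spam", package = "kernlab")
  # 4601 e-mails, 57 features plus the target column `type`
  print(dim(spam))
  print(table(spam$type))
}
```

The `requireNamespace()` guard only keeps the snippet from failing on machines without kernlab; it is not needed once the package is installed.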
SPAM DATA SET
class: 0 = no spam, 1 = spam
word_freq_*: 48 features giving the relative frequency of a specific word in an e-mail
char_freq_*: 6 features measuring the percentage of characters in an e-mail that match a specific character
capital_run_length_[average, longest, total]: 3 features giving the average length, the longest length, and the total length of uninterrupted sequences of capital letters
IMPORTING THE DATA
We use the OpenML R package to download the dataset in a machine-readable format and convert it into a data.frame:
# load the dataset from OpenML Library
d = OpenML::getOMLDataSet(data.id = 44)
# convert the OpenML object to a tibble (enhanced data.frame)
spam = tibble::as_tibble(d)
EXPLORATORY DATA ANALYSIS
print(head(spam, n = 5L))
## # A tibble: 5 x 58
##   word_freq_make word_freq_addre~ word_freq_all word_freq_3d word_freq_our word_freq_over
##            <dbl>            <dbl>         <dbl>        <dbl>         <dbl>          <dbl>
## 1           0                0.64          0.64            0          0.32           0
## 2           0.21             0.28          0.5             0          0.14           0.28
## 3           0.06             0             0.71            0          1.23           0.19
## 4           0                0             0               0          0.63           0
## 5           0                0             0               0          0.63           0
## # ... with 52 more variables: word_freq_remove <dbl>, word_freq_internet <dbl>,
## #   word_freq_order <dbl>, word_freq_mail <dbl>, word_freq_receive <dbl>,
## #   word_freq_will <dbl>, word_freq_people <dbl>, word_freq_report <dbl>,
## #   word_freq_addresses <dbl>, word_freq_free <dbl>, word_freq_business <dbl>,
## #   word_freq_email <dbl>, word_freq_you <dbl>, word_freq_credit <dbl>,
## #   word_freq_your <dbl>, word_freq_font <dbl>, word_freq_000 <dbl>,
## #   word_freq_money <dbl>, word_freq_hp <dbl>, word_freq_hpl <dbl>,
## #   word_freq_george <dbl>, word_freq_650 <dbl>, word_freq_lab <dbl>,
## #   word_freq_labs <dbl>, word_freq_telnet <dbl>, word_freq_857 <dbl>,
## #   word_freq_data <dbl>, word_freq_415 <dbl>, word_freq_85 <dbl>,
## #   word_freq_technology <dbl>, word_freq_1999 <dbl>, word_freq_parts <dbl>,
## #   word_freq_pm <dbl>, word_freq_direct <dbl>, word_freq_cs <dbl>,
## #   word_freq_meeting <dbl>, word_freq_original <dbl>, word_freq_project <dbl>,
## #   word_freq_re <dbl>, word_freq_edu <dbl>, word_freq_table <dbl>,
## #   word_freq_conference <dbl>, char_freq_.3B <dbl>, char_freq_.28 <dbl>,
## #   char_freq_.5B <dbl>, char_freq_.21 <dbl>, char_freq_.24 <dbl>, char_freq_.23 <dbl>,
## #   capital_run_length_average <dbl>, capital_run_length_longest <dbl>,
## #   capital_run_length_total <dbl>, class <fct>
EXPLORATORY DATA ANALYSIS
Factor variables
skimr::partition(skimr::skim(spam)) %>%
  .$factor %>%
  knitr::kable(format = 'latex', booktabs = TRUE) %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 7)
skim_variable n_missing complete_rate ordered n_unique top_counts
class 0 1 FALSE 2 0: 2788, 1: 1813
EXPLORATORY DATA ANALYSIS
Barplots of discrete features
DataExplorer::plot_bar(spam, ggtheme = ggpubr::theme_pubr())
[Figure: bar plot of class (levels 0 and 1); x-axis: Frequency]
EXPLORATORY DATA ANALYSIS
The distribution of most variables is highly skewed:
DataExplorer::plot_histogram(spam[, 1:16], ggtheme = ggpubr::theme_pubr())
[Figure: histograms of the 16 features in spam[, 1:16] (word_freq_make to word_freq_free); x-axis: value, y-axis: Frequency]
EXPLORATORY DATA ANALYSIS
Histograms of numerical features
DataExplorer::plot_histogram(spam[, 17:32], ggtheme = ggpubr::theme_pubr())
[Figure: histograms of the 16 features in spam[, 17:32] (word_freq_business to word_freq_857); x-axis: value, y-axis: Frequency]
EXPLORATORY DATA ANALYSIS
Histograms of numerical features
DataExplorer::plot_histogram(spam[, 33:48], ggtheme = ggpubr::theme_pubr())
[Figure: histograms of the 16 features in spam[, 33:48] (word_freq_data to word_freq_conference); x-axis: value, y-axis: Frequency]
EXPLORATORY DATA ANALYSIS
Histograms of numerical features
DataExplorer::plot_histogram(spam[, 49:58], ggtheme = ggpubr::theme_pubr())
[Figure: histograms of the 9 numerical features in spam[, 49:58] (char_freq_.3B to capital_run_length_total); x-axis: value, y-axis: Frequency]
EXPLORATORY DATA ANALYSIS
Let’s take a look at the correlation among the variables:
DataExplorer::plot_correlation(spam, ggtheme = ggpubr::theme_pubr(base_size = 7))
[Figure: correlation heatmap of all 57 spam features plus class_0 and class_1; both axes: Features; colour scale: correlation from -1.0 to 1.0]
Titanic Data Set
TITANIC DATA SET
The original Titanic dataset describes the survival status of the 1309 individual passengers on the Titanic. The titanic data does not contain information on the crew, but it does contain the actual ages of half of the passengers. The principal source for data about Titanic passengers is the Encyclopedia Titanica.
One of the original sources is Eaton & Haas (1994), Titanic: Triumph and Tragedy, Patrick Stephens Ltd. It includes a passenger list created by many researchers (edited by Michael A. Findlay).
TITANIC DATA SET
Variable    Description
survived    0 = No, 1 = Yes
pclass      1 = 1st; 2 = 2nd; 3 = 3rd
name        First and last name
sex         Sex
age         Age
sibsp       Number of siblings/spouses aboard
parch       Number of parents/children aboard
ticket      Ticket number
fare        Passenger fare
cabin       Cabin
embarked    Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
body        Body identification number
boat        Boat number
home.dest   Home destination
IMPORTING THE DATA
We use the OpenML R package to download the dataset in a machine-readable format and convert it into a data.frame:
# load the dataset from OpenML Library
d = OpenML::getOMLDataSet(data.id = 40945)
# convert the OpenML object to a tibble (enhanced data.frame)
titanic = tibble::as_tibble(d)
EXPLORATORY DATA ANALYSIS
print(titanic)
## # A tibble: 1,309 x 14
##    pclass survived name  sex      age sibsp parch ticket  fare cabin embarked boat   body
##     <dbl> <fct>    <chr> <fct>  <dbl> <dbl> <dbl> <chr>  <dbl> <chr> <fct>    <chr> <dbl>
##  1      1 1        Alle~ fema~ 29         0     0 24160  211.  B5    S        2        NA
##  2      1 1        Alli~ male   0.917     1     2 113781 152.  C22 ~ S        11       NA
##  3      1 0        Alli~ fema~  2         1     2 113781 152.  C22 ~ S        <NA>     NA
##  4      1 0        Alli~ male  30         1     2 113781 152.  C22 ~ S        <NA>    135
##  5      1 0        Alli~ fema~ 25         1     2 113781 152.  C22 ~ S        <NA>     NA
##  6      1 1        Ande~ male  48         0     0 19952   26.6 E12   S        3        NA
##  7      1 1        Andr~ fema~ 63         1     0 13502   78.0 D7    S        10       NA
##  8      1 0        Andr~ male  39         0     0 112050   0   A36   S        <NA>     NA
##  9      1 1        Appl~ fema~ 53         2     0 11769   51.5 C101  S        D        NA
## 10      1 0        Arta~ male  71         0     0 PC 17~  49.5 <NA>  C        <NA>     22
## # ... with 1,299 more rows, and 1 more variable: home.dest <chr>
EXPLORATORY DATA ANALYSIS
Factor variables
skimr::partition(skimr::skim(titanic)) %>%
  .$factor %>%
  knitr::kable(format = 'latex', booktabs = TRUE) %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 8)
skim_variable n_missing complete_rate ordered n_unique top_counts
survived 0 1.000 FALSE 2 0: 809, 1: 500
sex 0 1.000 FALSE 2 mal: 843, fem: 466
embarked 2 0.998 FALSE 3 S: 914, C: 270, Q: 123
EXPLORATORY DATA ANALYSIS
Barplots of discrete features
DataExplorer::plot_bar(titanic, ggtheme = ggpubr::theme_pubr())
[Figure: bar plots of survived, sex, embarked and boat; x-axis: Frequency]
EXPLORATORY DATA ANALYSIS
Numeric variables
skimr::partition(skimr::skim(titanic)) %>%
  .$numeric %>%
  dplyr::select(skim_variable, n_missing, mean, sd) %>%
  knitr::kable(format = 'latex', booktabs = TRUE) %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 8)
skim_variable n_missing mean sd
pclass 0 2.295 0.838
age 263 29.881 14.413
sibsp 0 0.499 1.042
parch 0 0.385 0.866
fare 1 33.295 51.759
body 1188 160.810 97.697
EXPLORATORY DATA ANALYSIS
Histograms of numerical features
DataExplorer::plot_histogram(titanic, ggtheme = ggpubr::theme_pubr())
[Figure: histograms of age, body, fare, parch, pclass and sibsp; x-axis: value, y-axis: Frequency]
EXPLORATORY DATA ANALYSIS
It seems that we have quite a few missing observations. Let’s take a closer look:
DataExplorer::plot_missing(titanic, ggtheme = ggpubr::theme_pubr())
[Figure: share of missing rows per feature. body 90.76%, cabin 77.46%, boat 62.87%, home.dest 43.09%, age 20.09%, embarked 0.15%, fare 0.08%, all other features 0%. x-axis: Missing Rows, y-axis: Features; bands: Good, OK, Bad, Remove]
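The percentages shown by a missing-value plot can also be computed directly in base R. The sketch below uses a small made-up data frame (`toy` is a hypothetical stand-in, not the Titanic data) and reports the share of missing values per column.

```r
# Toy data frame standing in for titanic (values are made up for illustration)
toy <- data.frame(
  age  = c(29, NA, 2, 30, NA),
  fare = c(211.3, 151.6, NA, 151.6, 26.6),
  sex  = c("female", "male", "female", "male", "male")
)

# Share of missing values per column, in percent, sorted in decreasing order
missing_pct <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)
print(round(missing_pct, 2))
```

Applied to the real data, the same expression with `titanic` in place of `toy` should reproduce the numbers in the plot.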
EXPLORATORY DATA ANALYSIS
It is always useful to check the correlation among variables:
DataExplorer::plot_correlation(titanic, ggtheme = ggpubr::theme_pubr(base_size = 10),
  type = "c", cor_args = list(use = "complete.obs"))
[Figure: correlation heatmap of the continuous features; both axes: Features; colour scale: correlation from -1.0 to 1.0. Values shown in the cells:]
         pclass    age  sibsp  parch   fare   body
pclass     1     -0.49   0.06   0.09  -0.59  -0.04
age       -0.49   1     -0.17  -0.02   0.27   0.04
sibsp      0.06  -0.17   1      0.22   0.22  -0.1
parch      0.09  -0.02   0.22   1      0.14   0.05
fare      -0.59   0.27   0.22   0.14   1     -0.04
body      -0.04   0.04  -0.1    0.05  -0.04   1
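The `use = "complete.obs"` setting passed through `cor_args` changes which rows enter the correlation. The sketch below illustrates the effect with base R's `cor()` on made-up numbers (`toy` is hypothetical): without the argument, a pair of columns containing an NA yields NA, whereas with it every row that has at least one NA is dropped first.

```r
# Toy numeric data with missing values (made-up numbers, not the Titanic data)
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, NA),
  b = c(2, 4, 6, 8, NA, 12),
  c = c(5, 3, 4, 1, 2, 0)
)

# Default (use = "everything"): an NA in either column gives NA for that pair
cor_default <- cor(toy)

# "complete.obs": drop every row with at least one NA, then correlate
cor_complete <- cor(toy, use = "complete.obs")

# Equivalent to filtering the rows by hand first
cor_manual <- cor(toy[complete.cases(toy), ])
```

One consequence is that with many missing values the correlations are based on a small subset of the rows, which is worth keeping in mind when reading the heatmap.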
Breast Cancer Data Set
BREAST CANCER DATA SET
Dataset with information on the diagnosis of breast tissue (Class: M = malignant, B = benign). Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
Source: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
IMPORTING THE DATA
We use the OpenML R package to download the dataset in a machine-readable format and convert it into a data.frame:
# load the dataset from OpenML Library
d = OpenML::getOMLDataSet(data.id = 15)
# convert the OpenML object to a tibble (enhanced data.frame)
bc = tibble::as_tibble(d)
EXPLORATORY DATA ANALYSIS
print(bc)
## # A tibble: 699 x 10
##    Clump_Thickness Cell_Size_Unifo~ Cell_Shape_Unif~ Marginal_Adhesi~ Single_Epi_Cell~
##              <dbl>            <dbl>            <dbl>            <dbl>            <dbl>
##  1               5                1                1                1                2
##  2               5                4                4                5                7
##  3               3                1                1                1                2
##  4               6                8                8                1                3
##  5               4                1                1                3                2
##  6               8               10               10                8                7
##  7               1                1                1                1                2
##  8               2                1                2                1                2
##  9               2                1                1                1                2
## 10               4                2                1                1                2
## # ... with 689 more rows, and 5 more variables: Bare_Nuclei <dbl>, Bland_Chromatin <dbl>,
## #   Normal_Nucleoli <dbl>, Mitoses <dbl>, Class <fct>
EXPLORATORY DATA ANALYSIS
Factor variables
skimr::partition(skimr::skim(bc)) %>%
  .$factor %>%
  knitr::kable(format = 'latex', booktabs = TRUE) %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 7)
skim_variable n_missing complete_rate ordered n_unique top_counts
Class 0 1 FALSE 2 ben: 458, mal: 241
EXPLORATORY DATA ANALYSIS
Numeric variables
skimr::partition(skimr::skim(bc)) %>%
  .$numeric %>%
  dplyr::select(skim_variable, n_missing, mean, sd) %>%
  knitr::kable(format = 'latex', booktabs = TRUE) %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 8)
skim_variable n_missing mean sd
Clump_Thickness 0 4.42 2.82
Cell_Size_Uniformity 0 3.13 3.05
Cell_Shape_Uniformity 0 3.21 2.97
Marginal_Adhesion 0 2.81 2.85
Single_Epi_Cell_Size 0 3.22 2.21
Bare_Nuclei 16 3.54 3.64
Bland_Chromatin 0 3.44 2.44
Normal_Nucleoli 0 2.87 3.05
Mitoses 0 1.59 1.72
EXPLORATORY DATA ANALYSIS
Barplots of discrete features
DataExplorer::plot_bar(bc, ggtheme = ggpubr::theme_pubr())
[Figure: bar plot of Class (levels benign and malignant); x-axis: Frequency]
EXPLORATORY DATA ANALYSIS
Histograms of numerical features
DataExplorer::plot_histogram(bc, ggtheme = ggpubr::theme_pubr())
[Figure: histograms of the nine numerical features (Bare_Nuclei, Bland_Chromatin, Cell_Shape_Uniformity, Cell_Size_Uniformity, Clump_Thickness, Marginal_Adhesion, Mitoses, Normal_Nucleoli, Single_Epi_Cell_Size); x-axis: value, y-axis: Frequency]
EXPLORATORY DATA ANALYSIS
Let’s take a look at the correlation among the variables:
DataExplorer::plot_correlation(bc, ggtheme = ggpubr::theme_pubr(base_size = 8))
[Figure: correlation heatmap; both axes: Features; colour scale: correlation from -1.0 to 1.0. Values shown in the cells, with the columns in the same order as the rows. For Bare_Nuclei, which contains missing values, only the diagonal entry is displayed; the other entries are marked NA here.]
Clump_Thickness         1     0.64  0.65  0.49  0.52  NA  0.56  0.54  0.35  -0.72  0.72
Cell_Size_Uniformity    0.64  1     0.91  0.71  0.75  NA  0.76  0.72  0.46  -0.82  0.82
Cell_Shape_Uniformity   0.65  0.91  1     0.68  0.72  NA  0.74  0.72  0.44  -0.82  0.82
Marginal_Adhesion       0.49  0.71  0.68  1     0.6   NA  0.67  0.6   0.42  -0.7   0.7
Single_Epi_Cell_Size    0.52  0.75  0.72  0.6   1     NA  0.62  0.63  0.48  -0.68  0.68
Bare_Nuclei             NA    NA    NA    NA    NA    1   NA    NA    NA    NA     NA
Bland_Chromatin         0.56  0.76  0.74  0.67  0.62  NA  1     0.67  0.34  -0.76  0.76
Normal_Nucleoli         0.54  0.72  0.72  0.6   0.63  NA  0.67  1     0.43  -0.71  0.71
Mitoses                 0.35  0.46  0.44  0.42  0.48  NA  0.34  0.43  1     -0.42  0.42
Class_benign           -0.72 -0.82 -0.82 -0.7  -0.68  NA -0.76 -0.71 -0.42   1    -1
Class_malignant         0.72  0.82  0.82  0.7   0.68  NA  0.76  0.71  0.42  -1     1
EXPLORATORY DATA ANALYSIS
It seems that we have some missing observations. Let’s take a closer look:
DataExplorer::plot_missing(bc, ggtheme = ggpubr::theme_pubr())
[Figure: share of missing rows per feature. Bare_Nuclei 2.29%, all other features 0%. x-axis: Missing Rows, y-axis: Features; band: Good]
Forbes Data
FORBES DATA
The Forbes2000 data contain the list of the world's 2000 leading companies from the year 2004. The data were collected from the well-known Forbes magazine.
The magazine is well known for its lists and rankings, including of the richest Americans (the Forbes 400), of the world’s top companies (the Forbes Global 2000), and The World’s Billionaires.
Source: https://en.wikipedia.org/wiki/Forbes
FORBES DATA
The HSAUR2 package provides the data set, which can be imported into R from there. The first six rows are:
##   rank                name        country             category sales profits assets marketvalue
## 1    1           Citigroup  United States              Banking  94.7   17.85   1264         255
## 2    2    General Electric  United States        Conglomerates 134.2   15.59    627         329
## 3    3 American Intl Group  United States            Insurance  76.7    6.46    648         195
## 4    4          ExxonMobil  United States Oil & gas operations 222.9   20.96    167         277
## 5    5                  BP United Kingdom Oil & gas operations 232.6   10.27    178         174
## 6    6     Bank of America  United States              Banking  49.0   10.81    736         118
Note: Not all data sets contained in R packages have help files.
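The import call itself is not reproduced in the transcript. A plausible sketch, by analogy with the `data(..., package = ...)` call used for the heptathlon data and assuming the HSAUR2 package is installed, might look as follows.

```r
# Sketch: load Forbes2000 from HSAUR2 (assumes the package is installed)
if (requireNamespace("HSAUR2", quietly = TRUE)) {
  data("Forbes2000", package = "HSAUR2")
  print(head(Forbes2000))   # first six companies
  str(Forbes2000)           # column types at a glance
}
```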
FORBES DATA
For each company, the following eight variables are available:
rank: the ranking of the company,
name: the name of the company,
country: the country in which the company is situated,
category: the products the company produces,
sales: the amount of sales of the company in billion US dollars,
profits: the profit of the company in billion US dollars,
assets: the assets of the company in billion US dollars,
marketvalue: the market value of the company in billion US dollars.
Heptathlon Data
HEPTATHLON DATA
The heptathlon data include 25 athletes of the Olympic heptathlon in Seoul 1988. The dataset can be loaded after installing the HSAUR3 package via:
data("heptathlon", package = "HSAUR3")
Source: https://en.wikipedia.org/wiki/1988_Summer_Olympics
HEPTATHLON DATA
The heptathlon dataset includes the results for each of the seven disciplines:
##                     hurdles highjump shot run200m longjump javelin run800m score
## Joyner-Kersee (USA)    12.7     1.86 15.8    22.6     7.27    45.7     129  7291
## John (GDR)             12.8     1.80 16.2    23.6     6.71    42.6     126  6897
## Behmer (GDR)           13.2     1.83 14.2    23.1     6.68    44.5     124  6858
## Sablovskaite (URS)     13.6     1.80 15.2    23.9     6.25    42.8     132  6540
## Choubenkova (URS)      13.5     1.74 14.8    23.9     6.32    47.5     128  6540
## Schulz (GDR)           13.8     1.83 13.5    24.6     6.33    42.8     126  6411
Flights Data
FLIGHTS DATA
hflights is a dataset of all plane departures from two Houston airports (IAH and HOU) in 2011.
It contains 227,496 rows and 21 variables with flight details such as delays, flight time, and carrier.
[Figure with the labels HOU and IAH]
FLIGHTS DATA
The dataset is contained within an additional package also called hflights.
The package needs to be installed and loaded to access the data:
##       Year          Month         DayofMonth
##  Min.   :2011   Min.   : 1.00   Min.   : 1.0
##  1st Qu.:2011   1st Qu.: 4.00   1st Qu.: 8.0
##  Median :2011   Median : 7.00   Median :16.0
##  Mean   :2011   Mean   : 6.51   Mean   :15.7
##  3rd Qu.:2011   3rd Qu.: 9.00   3rd Qu.:23.0
##  Max.   :2011   Max.   :12.00   Max.   :31.0
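The code that produced the summary is not part of the transcript. A sketch of how the data might be loaded and summarised, assuming the hflights package is installed (the choice of the three columns is taken from the output shown), could be:

```r
# Sketch: load hflights and summarise the first three columns
# (assumes the hflights package is installed)
if (requireNamespace("hflights", quietly = TRUE)) {
  data("hflights", package = "hflights")
  print(dim(hflights))  # number of rows and variables
  print(summary(hflights[, c("Year", "Month", "DayofMonth")]))
}
```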
Munich Rent Index
MUNICH RENT INDEX 2015
The Munich rent index is published by Prof. Kauermann and Michael Windmann from the Department of Statistics at LMU Munich. The data we provide are anonymised and reduced to 1000 observations:
kable(rbind(head(rent15), c("...")))
rent    size  rooms  year    good  best  hw0  ch0  bathtiled0  bathextra  kitchen
822.6   110   5      1957.5  0     0     0    1    1           1          0
595     70    3      1972    0     0     0    0    0           0          0
1100    91    4      2000.5  0     0     0    0    1           1          0
693.36  64    3      1957.5  0     0     0    0    1           0          0
700     56    2      1992.5  0     0     0    0    1           1          1
686     55    2      2000.5  0     0     0    0    1           0          1
...     ...   ...    ...     ...   ...   ...  ...  ...         ...        ...
MUNICH RENT INDEX 2015
The data contains the following variables:
rent: the net rent in euros.
size: living space in square metres.
rooms: number of rooms.
year: year of construction.
good: good address / residential area (yes: 1, no: 0).
best: best address / residential area (yes: 1, no: 0).
hw0: hot water supply (1: no, 0: yes).
ch0: central heating (1: no, 0: yes).
bathtiled0: tiled bathroom (1: no, 0: yes).
bathextra: special furniture in bathroom (1: yes, 0: no).
kitchen: well equipped kitchen (1: yes, 0: no).
Advanced Statistical Learning
Chapter 0: Notation and definitions
Bernd Bischl, Julia Moosbauer, Andreas Groll
Department of Statistics – TU Dortmund
Winter term 2020/21
FUNDAMENTAL DEFINITIONS AND NOTATION
$\mathcal{X}$ : $p$-dimensional input space; usually we assume $\mathcal{X} = \mathbb{R}^p$, but categorical features can occur as well.
$\mathcal{Y}$ : target space, e.g. $\mathcal{Y} = \mathbb{R}$, $\mathcal{Y} = \{0, 1\}$, $\mathcal{Y} = \{-1, 1\}$, $\mathcal{Y} = \{1, \ldots, g\}$ or $\mathcal{Y} = \{\text{label}_1, \ldots, \text{label}_g\}$
$\mathbf{x}$ : feature vector, $\mathbf{x} = (x_1, \ldots, x_p)^T \in \mathcal{X}$
$y$ : target / label / output, $y \in \mathcal{Y}$
$\mathbb{P}_{xy}$ : joint probability distribution on $\mathcal{X} \times \mathcal{Y}$
$p(\mathbf{x}, y)$ or $p(\mathbf{x}, y \mid \theta)$ : joint pdf for $\mathbf{x}$ and $y$
FUNDAMENTAL DEFINITIONS AND NOTATION
Remark: This lecture is mainly developed from a frequentist perspective. If parameters appear behind the "|", this is for better readability and does not imply that we condition on them in a Bayesian sense (although this notation would actually make a Bayesian treatment simple). So formally, $p(\mathbf{x} \mid \theta)$ should be read as $p_\theta(\mathbf{x})$ or $p(\mathbf{x}, \theta)$ or $p(\mathbf{x}; \theta)$.
FUNDAMENTAL DEFINITIONS AND NOTATION
$(\mathbf{x}^{(i)}, y^{(i)})$ : $i$-th observation or instance
$\mathcal{D} = \{(\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(n)}, y^{(n)})\}$ : data set with $n$ observations
$\mathcal{D}_{\text{train}}$, $\mathcal{D}_{\text{test}}$ : data for training and testing, often $\mathcal{D} = \mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{test}}$
$f(\mathbf{x})$ or $f(\mathbf{x} \mid \theta) \in \mathbb{R}$ or $\mathbb{R}^g$ : prediction function (model) learned from data; we might suppress $\theta$ in the notation
$h(\mathbf{x})$ or $h(\mathbf{x} \mid \theta) \in \mathcal{Y}$ : discrete prediction for classification (see later)
$\theta \in \Theta$ : model parameters (some models may traditionally use different symbols)
$\mathcal{H}$ : hypothesis space; $f$ lives here, and it restricts the functional form of $f$
$\epsilon = y - f(\mathbf{x})$ or $\epsilon^{(i)} = y^{(i)} - f(\mathbf{x}^{(i)})$ : residual in regression
$y f(\mathbf{x})$ or $y^{(i)} f(\mathbf{x}^{(i)})$ : margin for binary classification with $\mathcal{Y} = \{-1, 1\}$ (see later)
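The residual and the margin defined above are simple elementwise quantities. The sketch below computes both on made-up numbers; the linear model `f`, its parameters and all data values are hypothetical and only serve as illustration.

```r
# Regression: residuals epsilon^(i) = y^(i) - f(x^(i)) for a hypothetical linear model
f <- function(x, theta = c(1, 2)) theta[1] + theta[2] * x  # theta is made up
x <- c(0, 1, 2)
y <- c(1.5, 2.5, 5.0)
residuals_reg <- y - f(x)

# Binary classification with labels in {-1, +1}: margin y^(i) * f(x^(i))
score  <- c(2.0, -0.5, -1.5)  # hypothetical scores f(x^(i))
label  <- c(1, 1, -1)
margin <- label * score

# A positive margin means the sign of the score agrees with the label
correct <- margin > 0
```

The margin is large and positive for confident correct predictions and negative for misclassified observations, which is why it only makes sense with the -1/+1 coding.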
$\pi_k(\mathbf{x}) = \mathbb{P}(y = k \mid \mathbf{x})$ : posterior probability for class $k$, given $\mathbf{x}$; in case of binary labels we might abbreviate $\pi(\mathbf{x}) = \mathbb{P}(y = 1 \mid \mathbf{x})$
$\pi_k = \mathbb{P}(y = k)$ : prior probability for class $k$; in case of binary labels we might abbreviate $\pi = \mathbb{P}(y = 1)$
$\mathcal{L}(\theta)$ and $\ell(\theta)$ : likelihood and log-likelihood for a parameter $\theta$, based on a statistical model
$\hat{f}$, $\hat{h}$, $\hat{\pi}_k(\mathbf{x})$, $\hat{\pi}(\mathbf{x})$ and $\hat{\theta}$ : learned functions and parameters
Remark: With a slight abuse of notation we write random variables, e.g., $\mathbf{x}$ and $y$, in lowercase, as normal variables or function arguments. The context will make clear what is meant.
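As a rough illustration of the likelihood notation, the sketch below evaluates the likelihood and log-likelihood for binary labels. It assumes a simple Bernoulli model in which the only parameter is the prior $\pi = \mathbb{P}(y = 1)$; this model, the label values and the function names are assumptions made for the example, not something the slides prescribe.

```python
import math

# Hypothetical binary labels, Y = {0, 1}.
y = [1, 0, 1, 1]


def log_likelihood(pi, y):
    """Bernoulli log-likelihood: sum_i [y_i * log(pi) + (1 - y_i) * log(1 - pi)]."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi) for yi in y)


def likelihood(pi, y):
    """Likelihood, obtained as the exponential of the log-likelihood."""
    return math.exp(log_likelihood(pi, y))


# Learned prior: the relative frequency of class 1, which maximises the
# log-likelihood under the assumed Bernoulli model.
pi_hat = sum(y) / len(y)

print(pi_hat)
print(log_likelihood(pi_hat, y), log_likelihood(0.5, y))
```

The learned value `pi_hat` attains a higher log-likelihood than, for instance, 0.5, which is what "learned from data" means for this one-parameter model.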
In the simplest case we have i.i.d. data $\mathcal{D}$, where the input and output space are both real-valued and one-dimensional.
[Figure: plot of the data with x on the horizontal axis (ticks 2 to 5) and y on the vertical axis (ticks 10 to 35)]
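Design matrix (with or w/o intercept term):
$$
\mathbf{X} = \begin{pmatrix}
x_1^{(1)} & \cdots & x_p^{(1)} \\
\vdots & \vdots & \vdots \\
x_1^{(n)} & \cdots & x_p^{(n)}
\end{pmatrix}
\qquad
\mathbf{X} = \begin{pmatrix}
1 & x_1^{(1)} & \cdots & x_p^{(1)} \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_1^{(n)} & \cdots & x_p^{(n)}
\end{pmatrix}
$$
$\mathbf{x}_j = \left(x_j^{(1)}, \ldots, x_j^{(n)}\right)^T$ : $j$-th observed feature vector.
$\mathbf{y} = \left(y^{(1)}, \ldots, y^{(n)}\right)^T$ : vector of target values.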
The design matrix on the right demonstrates the trick of encoding the intercept via an additional constant-1 feature, so the feature space becomes $(p + 1)$-dimensional. This simplifies notation, e.g., we can write $f(\mathbf{x}) = \theta^T \mathbf{x}$ instead of $f(\mathbf{x}) = \theta_0 + \theta^T \mathbf{x}$.
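The intercept trick can be checked numerically with a minimal sketch. The feature values, the parameter vector `theta` and the helper names `f` and `f_explicit` are hypothetical; plain Python lists stand in for matrices to keep the example dependency-free.

```python
# Hypothetical data: n = 3 observations, p = 2 features.
X_raw = [
    [2.0, 1.0],
    [0.0, 3.0],
    [4.0, 5.0],
]

# Design matrix with intercept: prepend a constant-1 column, giving p + 1 columns.
X = [[1.0] + row for row in X_raw]

# Parameter vector (theta_0, theta_1, theta_2); the values are arbitrary.
theta = [0.5, 2.0, -1.0]


def f(x, theta):
    """Linear model as a single inner product; x already contains the leading 1."""
    return sum(t * v for t, v in zip(theta, x))


def f_explicit(x_raw, theta):
    """The same model with the intercept theta_0 written out separately."""
    return theta[0] + sum(t * v for t, v in zip(theta[1:], x_raw))


predictions = [f(x, theta) for x in X]
predictions_explicit = [f_explicit(x, theta) for x in X_raw]

# j-th observed feature vector: the j-th column of the raw feature matrix (here j = 1).
x_1 = [row[0] for row in X_raw]

print(predictions)
print(predictions_explicit)
```

Both formulations give identical predictions; the version with the constant-1 column simply absorbs the intercept into the inner product.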
BINARY LABEL CODING
Remark: Notation in binary classification can sometimes be confusing because of different coding styles, and because we have to talk about predicted scores, classes and probabilities.
A binary variable can take only two possible values. For probability / likelihood-based model derivations a 0/1 coding is often preferred, for geometric / loss-based models a -1/+1 coding:
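$\mathcal{Y} = \{0, 1\}$. Here, the approach often models $\pi(\mathbf{x})$, the posterior probability for class 1 given $\mathbf{x}$. Usually, we then define $h(\mathbf{x}) = [\pi(\mathbf{x}) \geq 0.5] \in \mathcal{Y}$.
$\mathcal{Y} = \{-1, 1\}$. Here, the approach often models $f(\mathbf{x})$, a real-valued score from $\mathbb{R}$ given $\mathbf{x}$. Usually, we define $h(\mathbf{x}) = \operatorname{sign}(f(\mathbf{x})) \in \mathcal{Y}$, and we interpret $|f(\mathbf{x})|$ as "confidence" for the predicted class $h(\mathbf{x})$.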
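The two coding styles and their decision rules can be sketched as follows. All function names and input values are hypothetical. One assumption goes beyond the slides: a score of exactly 0 is mapped to class +1 so that the prediction stays inside {-1, 1}.

```python
def h_from_probability(pi_x, threshold=0.5):
    """0/1 coding: indicator of pi(x) >= 0.5, i.e. 1 if true and 0 otherwise."""
    return 1 if pi_x >= threshold else 0


def h_from_score(f_x):
    """-1/+1 coding: sign of the score f(x).

    Assumption: a score of exactly 0 is mapped to +1 so the result stays in
    {-1, 1}; the slides do not specify a tie-breaking rule.
    """
    return 1 if f_x >= 0 else -1


def to_plus_minus(y01):
    """Convert a label from {0, 1} to {-1, 1}."""
    return 2 * y01 - 1


def to_zero_one(ypm):
    """Convert a label from {-1, 1} to {0, 1}."""
    return (ypm + 1) // 2


probabilities = [0.1, 0.5, 0.9]   # hypothetical values of pi(x)
scores = [-2.3, 0.4, 1.7]         # hypothetical values of f(x)

classes_01 = [h_from_probability(p) for p in probabilities]
classes_pm = [h_from_score(s) for s in scores]
confidences = [abs(s) for s in scores]   # |f(x)| read as "confidence"

print(classes_01)
print(classes_pm)
print(confidences)
```

The conversion helpers show that the two codings carry the same information; only the arithmetic that is convenient differs (probabilities and likelihoods for 0/1, signs and margins for -1/+1).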