Introducing the data
EXPLORATORY DATA ANALYSIS IN R
Andrew Bray
Assistant Professor, Reed College
Email data set
email
# A tibble: 3,921 × 21
       spam to_multiple  from    cc sent_email                time image
     <fctr>       <dbl> <dbl> <int>      <dbl>              <dttm> <dbl>
1  not-spam           0     1     0          0 2012-01-01 01:16:41     0
2  not-spam           0     1     0          0 2012-01-01 02:03:59     0
3  not-spam           0     1     0          0 2012-01-01 11:00:32     0
4  not-spam           0     1     0          0 2012-01-01 04:09:49     0
5  not-spam           0     1     0          0 2012-01-01 05:00:01     0
6  not-spam           0     1     0          0 2012-01-01 05:04:46     0
7  not-spam           1     1     0          1 2012-01-01 12:55:06     0
8  not-spam           1     1     1          1 2012-01-01 13:45:21     1
9  not-spam           0     1     0          0 2012-01-01 16:08:59     0
10 not-spam           0     1     0          0 2012-01-01 13:12:00     0
# ... with 3,911 more rows, and 14 more variables: attach <dbl>,
#   dollar <dbl>, winner <fctr>, inherit <dbl>, viagra <dbl>,
#   password <dbl>, num_char <dbl>, line_breaks <int>, format <dbl>,
#   re_subj <dbl>, exclaim_subj <dbl>, urgent_subj <dbl>,
#   exclaim_mess <dbl>, number <fctr>
Histograms
ggplot(data, aes(x = var1)) +
  geom_histogram()
Histograms
ggplot(data, aes(x = var1)) +
  geom_histogram() +
  facet_wrap(~var2)
Boxplots
ggplot(data, aes(x = var2, y = var1)) +
  geom_boxplot()
Boxplots
ggplot(data, aes(x = 1, y = var1)) +
  geom_boxplot()
Density plots
ggplot(data, aes(x = var1)) +
  geom_density()
Density plots
ggplot(data, aes(x = var1, fill = var2)) +
  geom_density(alpha = .3)
Let's practice!
Check-in 1
Review
Zero inflation strategies
Analyze the two components separately
Collapse into a two-level categorical variable
email %>%
  mutate(zero = exclaim_mess == 0) %>%
  ggplot(aes(x = zero)) +
  geom_bar() +
  facet_wrap(~spam)
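The first strategy listed above, analyzing the two components separately, might look like the following minimal sketch. The vector `exclaim_toy` is hypothetical stand-in data, not values from the email data set, and the sketch uses base R only.

```r
# Hypothetical counts of exclamation points per message (toy data)
exclaim_toy <- c(0, 0, 0, 1, 2, 0, 8, 0, 3, 0)

# Component 1: how often is the count exactly zero?
share_zero <- mean(exclaim_toy == 0)

# Component 2: the distribution of the strictly positive counts
nonzero <- exclaim_toy[exclaim_toy > 0]
nonzero_median <- median(nonzero)

share_zero      # proportion of zeros
summary(nonzero) # five-number summary and mean of the nonzero part
```

Summarizing the zeros as a proportion and the positive values with their own summary keeps the spike at zero from dominating a single histogram or density plot.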
Barchart options
email %>%
  mutate(zero = exclaim_mess == 0) %>%
  ggplot(aes(x = zero, fill = spam)) +
  geom_bar()
Barchart options
email %>%
  mutate(zero = exclaim_mess == 0) %>%
  ggplot(aes(x = zero, fill = spam)) +
  geom_bar(position = "fill")
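To see which numbers a `position = "fill"` bar chart displays, the proportions can be computed directly. This is a minimal base R sketch on a hypothetical data frame `toy` (not the real email data): each bar corresponds to one value of `zero`, so the proportions are conditional on the row variable.

```r
# Hypothetical stand-in for the email data set (toy values)
toy <- data.frame(
  exclaim_mess = c(0, 0, 3, 1, 0, 5, 0, 2),
  spam = c("not-spam", "spam", "not-spam", "not-spam",
           "spam", "not-spam", "spam", "not-spam")
)
toy$zero <- toy$exclaim_mess == 0

# Counts: what geom_bar() stacks by default
counts <- table(zero = toy$zero, spam = toy$spam)

# Proportions within each value of zero: what position = "fill" shows
props <- prop.table(counts, margin = 1)

counts
props
```

Each row of `props` sums to 1, which is why every bar in the filled chart has the same height and only the split between spam and not-spam can be compared.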
Let's practice!
Check-in 2
Spam and images
email %>%
  mutate(has_image = image > 0) %>%
  ggplot(aes(x = as.factor(has_image), fill = spam)) +
  geom_bar(position = "fill")
Spam and images
email %>%
  mutate(has_image = image > 0) %>%
  ggplot(aes(x = spam, fill = has_image)) +
  geom_bar(position = "fill")
Ordering bars
email <- email %>%
  mutate(zero = exclaim_mess == 0)
levels(email$zero)
NULL
email$zero <- factor(email$zero, levels = c("TRUE", "FALSE"))
email %>%
  ggplot(aes(x = zero)) +
  geom_bar() +
  facet_wrap(~spam)
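The reordering step can be checked on its own with a small base R sketch. The vector `zero_toy` is hypothetical; the point is that a logical vector has no levels (hence the `NULL` shown earlier), so by default the bars would appear in the order FALSE, TRUE, whereas an explicit `levels` argument fixes the order.

```r
# Hypothetical logical vector standing in for email$zero
zero_toy <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# A logical vector carries no levels attribute
levels(zero_toy)  # NULL

# Converting to a factor with explicit levels sets the display order
zero_fct <- factor(zero_toy, levels = c("TRUE", "FALSE"))
levels(zero_fct)

# Tabulation (and bar order in a plot) now follows the level order
table(zero_fct)
```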
Let's practice!
Conclusion
Pie chart vs. bar chart
Faceting vs. stacking
Histogram
ggplot(data, aes(x = var1)) + geom_histogram()
Density plot
cars %>%
  filter(eng_size < 2.0) %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_density()
Side-by-side box plots
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()
Warning message:
Removed 11 rows containing non-finite values (stat_boxplot).
Center: mean, median, mode
x
76 78 75 74 76 72 74 73 73 75 74
table(x)
x
72 73 74 75 76 78
 1  2  3  2  2  1
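The three measures of center for the vector `x` shown above can be computed as follows. Base R has no built-in function for the statistical mode, so this sketch reads it off the frequency table; `mode_x` is a name introduced here for illustration, and `which.max()` returns only the first value if there is a tie.

```r
# The values of x from the slide
x <- c(76, 78, 75, 74, 76, 72, 74, 73, 73, 75, 74)

mean_x <- mean(x)      # arithmetic average
median_x <- median(x)  # middle value of the sorted data

# Mode: the most frequent value, taken from the frequency table
counts <- table(x)
mode_x <- as.numeric(names(counts)[which.max(counts)])

c(mean = mean_x, median = median_x, mode = mode_x)
```

Here the median and the mode are both 74, while the mean is slightly higher (820 / 11, about 74.5) because of the single large value 78.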
Shape of income
ggplot(life, aes(x = income, fill = west_coast)) +
  geom_density(alpha = .3)
ggplot(life, aes(x = log(income), fill = west_coast)) +
  geom_density(alpha = .3)
With group_by()
life %>%
  slice(240:247) %>%
  group_by(west_coast) %>%
  summarize(mean(expectancy))
# A tibble: 2 x 2
  west_coast mean(expectancy)
       <lgl>            <dbl>
1      FALSE         79.26125
2       TRUE         79.29375
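As a cross-check on what the grouped summary computes, the same per-group mean can be sketched in base R with `tapply()`. This is a swapped-in technique, not the dplyr pipeline from the slide, and `life_toy` is a hypothetical data frame, so the numbers differ from the output above.

```r
# Hypothetical stand-in for a few rows of the life data (toy values)
life_toy <- data.frame(
  west_coast = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE),
  expectancy = c(78.0, 80.0, 79.5, 80.5, 79.0, 78.5)
)

# Split expectancy by west_coast and take the mean of each group,
# which is the computation group_by() + summarize(mean(...)) performs
group_means <- tapply(life_toy$expectancy, life_toy$west_coast, mean)

group_means
```

The result is a named vector with one entry per group ("FALSE" and "TRUE"), mirroring the two rows of the tibble.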
Spam and exclamation points
email %>%
  mutate(zero = exclaim_mess == 0) %>%
  ggplot(aes(x = zero, fill = spam)) +
  geom_bar()
Spam and images
email %>%
  mutate(has_image = image > 0) %>%
  ggplot(aes(x = as.factor(has_image), fill = spam)) +
  geom_bar(position = "fill")
Let's practice!