deconstructing data sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...computational...
TRANSCRIPT
![Page 1: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/1.jpg)
Deconstructing Data ScienceDavid Bamman, UC Berkeley
Info 290
Lecture 4: Regression overview
Feb 1, 2016
![Page 2: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/2.jpg)
Regression
x = the empire state building y = 17444.5625”
A mapping from input data x (drawn from instance space 𝓧) to a point y in ℝ
(ℝ = the set of real numbers)
![Page 3: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/3.jpg)
task 𝓧 𝒴
predicting box office revenue movie ℝ
![Page 4: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/4.jpg)
Experiment designtraining development testing
size 80% 10% 10%
purpose training models model selectionevaluation;
never look at it until the very
end
![Page 5: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/5.jpg)
Metrics• Measure difference between the prediction ŷ and
the true y
1N
N�
i=1(yi � yi)2Mean squared error
(MSE)
1N
N�
i=1|yi � yi|Mean absolute error
(MAE)
![Page 6: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/6.jpg)
Linear regression
y =F�
i=1xiβi
β ∈ ℝF
(F-dimensional vector of real numbers)
0
5
10
15
20
0 5 10 15 20x
y
![Page 7: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/7.jpg)
Polynomial regression
y =F�
i=1xiβa,i +
F�
i=1x2i βb,i
βa, βb ∈ ℝF
(F-dimensional vector of real numbers)
0
100
200
300
400
-20 -10 0 10 20x
x^2
![Page 8: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/8.jpg)
Polynomial regression
βa, βb, βc ∈ ℝF
(F-dimensional vector of real numbers)
y =F�
i=1xiβa,i +
F�
i=1x2i βb,i +
F�
i=1x3i βc,i
-5000
0
5000
-20 -10 0 10 20x
x^3
![Page 9: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/9.jpg)
Support vector machines (regression)
Probabilistic graphical models
Networks
Neural networks
Deep learning
Decision trees
Random forests
Nonlinear regression
![Page 10: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/10.jpg)
Number of Parametersorder 1
(linear reg.)
order 2
order 3 y =F�
i=1xiβa,i +
F�
i=1x2i βb,i +
F�
i=1x3i βc,i
y =F�
i=1xiβa,i +
F�
i=1x2i βb,i
y =F�
i=1xiβa,i
![Page 11: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/11.jpg)
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
![Page 12: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/12.jpg)
labeled data
labeled data
labeled data
𝓧instance space
![Page 13: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/13.jpg)
-20
0
20
0 5 10 15 20x
y
degree 1, training MSE = 73.4
![Page 14: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/14.jpg)
-20
0
20
0 5 10 15 20x
y
degree 2, training MSE = 71.9
![Page 15: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/15.jpg)
-20
0
20
0 5 10 15 20x
y
degree 3, training MSE = 60.9
![Page 16: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/16.jpg)
-20
0
20
0 5 10 15 20x
y
degree 4, training MSE = 60.6
![Page 17: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/17.jpg)
-20
0
20
0 5 10 15 20x
y
degree 5, training MSE = 59.1
![Page 18: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/18.jpg)
-20
0
20
0 5 10 15 20x
y
degree 6, training MSE = 50.2
![Page 19: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/19.jpg)
-20
0
20
0 5 10 15 20x
y
degree 7, training MSE = 49.6
![Page 20: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/20.jpg)
-20
0
20
0 5 10 15 20x
y
degree 8, training MSE = 46.8
![Page 21: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/21.jpg)
-20
0
20
0 5 10 15 20x
y
degree 9, training MSE = 41.2
![Page 22: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/22.jpg)
-20
0
20
0 5 10 15 20x
y
degree 10, training MSE = 35.8
![Page 23: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/23.jpg)
-20
0
20
0 5 10 15 20x
y
degree 11, training MSE = 21.1
![Page 24: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/24.jpg)
-20
0
20
0 5 10 15 20x
y
degree 12, training MSE = 18.4
![Page 25: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/25.jpg)
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
18.4
![Page 26: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/26.jpg)
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
18.4 118.8
![Page 27: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/27.jpg)
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
18.4 118.8
136.5
![Page 28: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/28.jpg)
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
18.4 118.8
136.5 87.7
![Page 29: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/29.jpg)
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
73.4 86.3
65.0 94.7
![Page 30: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/30.jpg)
Overfitting• Memorizing the nuances (and noise) of the training
data that prevents generalizing to unseen data
-20
0
20
0 5 10 15 20x
y
-20
0
20
0 5 10 15 20x
y
![Page 31: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/31.jpg)
Sources of error• Bias: Error due to mis-specifying the relationship
between input and the output. [too few parameters, or the wrong kinds]
• Variance: Error due to sensitivity to random fluctuations in the training data. If you train on different data, do you get radically different predictions?[too many parameters]
![Page 32: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/32.jpg)
X XX
XX
X
X
X
X
X
X
X
X
X
X
X XX
XX
Low variance High variance
Low bias
High bias
Image from Flach 2012
![Page 33: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/33.jpg)
X XX
XX High bias, low variance: Always
predict “Berkeley”
X
X
X
X
X
High bias, high variance: Predict most frequent city in training data
X
X
X
X
X
Low bias, high variance: many features, some of which capture true signal but capture random noise
X XX
XX
Low bias, low variance: enough features to capture the true signal
Example: geolocation on
![Page 34: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/34.jpg)
Ordinal regression• In between classification and regression
• 𝒴 is categorical (e.g.,✩, ✩✩, ✩✩✩)
• Elements of 𝒴 are ordered
• ✩ < ✩✩ • ✩✩ < ✩✩✩ • ✩ < ✩✩✩
![Page 35: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/35.jpg)
task 𝓧 𝒴
predicting star ratings movie {✩, ✩✩, ✩✩✩}
Ordinal regression
![Page 36: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/36.jpg)
• Sarah Cohen, James T. Hamilton, and Fred Turner, “Computational Journalism,” Communications of the ACM (2011)
• Sylvain Parasie, “Data-Driven Revelation? Epistemological tensions in investigative journalism in the age of ‘big data,’” Digital Journalism (2015)
Computational Journalism
![Page 37: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/37.jpg)
Computational Journalism
• “Changing how stories are discovered, presented, aggregated, monetized and archived” (Cohen et al. 2012)
• Draws on earlier tradition of computer-assisted reporting and “precision journalism” (Meyer 1972)
![Page 38: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/38.jpg)
• Database linking, e.g.:
• voting records to the deceased • press releases from different members of
congress • indictments/settlements from U.S. attorneys • documents from SEC, Pentagon, defense
contractors to note movement to industry (Cohen 2012)
• DSA database of safety status of CA public schools + US seismic zones + school list from CA Dept of (Parasie 2015)
Computational Journalism
![Page 39: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/39.jpg)
• Information extraction: need to pull out people, places, organizations and their relationship from large (often sudden) dumps of documents.
• Analyzing the relationship between entities
Computational Journalism
![Page 40: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/40.jpg)
Computational Journalism• Data-driven stories about large-scale trends
40
Change in insured Americans under the ACA, NY Times (Oct 29, 2014)
Relationship between birth year and political views NY Times (July 7, 2014)
![Page 41: Deconstructing Data Sciencecourses.ischool.berkeley.edu/i290-dds/s16/dds/slides/4...Computational Journalism • Data-driven stories about large-scale trends 40 Change in insured Americans](https://reader033.vdocuments.net/reader033/viewer/2022042322/5f0bf9e67e708231d43325b3/html5/thumbnails/41.jpg)
• Data-driven lead generation; the outliers in analysis that point to a story
Computational Journalism