Clustering. Computational Journalism, week 2
Jonathan Stray, Columbia University, Fall 2014
Syllabus at http://www.compjournalism.com/?p=113
Frontiers of Computational Journalism
Columbia Journalism School
Week 2: Clustering
September 12, 2014
Classification and Clustering
“Classification is arguably one of the most central and generic of all our conceptual exercises. It is the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis in general.”
– Kenneth D. Bailey, Typologies and Taxonomies: An Introduction to Classification Techniques
Vector representation of objects

    x = ( x1, x2, x3, …, xN )

Each xi is a numerical or categorical feature; N = number of features, or “dimension”.
Examples of vector representations

Obvious:
– movies watched / items purchased
– legislative voting history for a politician
– crime locations

Less obvious, but standard:
– document vector space model
– psychological survey results

Tricky research problem: disparate field types
– corporate filing document
– Wikileaks SIGACT
What can we do with vectors?

Predict one variable based on others
– this is called “regression”
– supervised machine learning

Group similar items together
– this is classification or clustering
– we may or may not know pre-existing classes
Distance metric
Intuitively: how (dis)similar are two items? Formally:

    d(x, y) ≥ 0
    d(x, x) = 0
    d(x, y) = d(y, x)
    d(x, z) ≤ d(x, y) + d(y, z)
Distance metric
d(x, y) ≥ 0 – distance is never negative
d(x, x) = 0 – “reflexivity”: zero distance to self
d(x, y) = d(y, x) – “symmetry”: x to y same as y to x
d(x, z) ≤ d(x, y) + d(y, z) – “triangle inequality”: going direct is shorter
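As a quick sanity check, the four axioms can be verified in code for one concrete metric. A minimal sketch, assuming Euclidean distance (one common choice; the toy points are illustrative):

```python
import math

def euclid(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x, y, z = [0.0, 0.0], [3.0, 4.0], [6.0, 0.0]

assert euclid(x, y) >= 0                             # non-negativity
assert euclid(x, x) == 0                             # reflexivity
assert euclid(x, y) == euclid(y, x)                  # symmetry
assert euclid(x, z) <= euclid(x, y) + euclid(y, z)   # triangle inequality
```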
Distance matrix

Data matrix for M objects of N dimensions:

    X = ( x1, x2, …, xM )ᵀ = | x1,1  x1,2  …  x1,N |
                             | x2,1  x2,2  …  x2,N |
                             |  ⋮     ⋮         ⋮  |
                             | xM,1  xM,2  …  xM,N |

Distance matrix:

    Dij = Dji = d(xi, xj) = | d1,1  d1,2  …  d1,M |
                            | d2,1  d2,2  …  d2,M |
                            |  ⋮     ⋮         ⋮  |
                            | dM,1  dM,2  …  dM,M |
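Building the M × M distance matrix from a data matrix can be sketched in a few lines of NumPy; this assumes Euclidean distance and a small toy X:

```python
import numpy as np

# Toy data matrix: M = 4 objects, N = 2 features.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [5.0, 5.0]])

# Pairwise Euclidean distance matrix D, shape M x M.
diff = X[:, None, :] - X[None, :, :]   # shape (M, M, N)
D = np.sqrt((diff ** 2).sum(axis=-1))

assert D.shape == (4, 4)
assert np.allclose(D, D.T)             # symmetry: Dij = Dji
assert np.allclose(np.diag(D), 0.0)    # zero distance to self
```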
We think of a cluster like this…
Real data isn’t so simple…
Many possible definitions of a cluster
• “every point inside is closer to this cluster’s center than to the center of any other”
• “no point outside this cluster is closer than ε to any point inside”
• “every point in this cluster is closer to every point inside than to any point outside”
Different clustering algorithms

• Partitioning – keep adjusting clusters until convergence – e.g. K-means
• Agglomerative hierarchical – start with leaves, repeatedly merge clusters – e.g. MIN and MAX approaches
• Divisive hierarchical – start with root, repeatedly split clusters – e.g. binary split
K-means demo

http://www.paused21.net/off/kmeans/bin/
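The partitioning idea behind the demo can be sketched in plain NumPy. A minimal k-means, not the demo’s actual code; the deterministic farthest-point initialization and the two-blob toy data are assumptions for reproducibility:

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Minimal k-means: alternate assignment and centroid update until convergence."""
    # Farthest-point initialization (deterministic): start from the first
    # point, then repeatedly add the point farthest from existing centers.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each point to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Two well-separated blobs should come out as two clusters.
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10])
labels, centers = kmeans(X, 2)
```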
Agglomerative – combining clusters

    put each item into a leaf node
    while num clusters > 1:
        find two closest clusters
        merge them
– single link, or “min”
– complete link, or “max”
– average
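All three linkage rules are available in SciPy’s hierarchical clustering. A short sketch, assuming a toy two-blob dataset where the rules agree:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight groups of points in 2D.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)

for method in ("single", "complete", "average"):      # min, max, average linkage
    Z = linkage(X, method=method)                     # the merge tree (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    # On this easy example all three linkage rules find the same two groups.
    assert len(set(labels[:3])) == 1 and len(set(labels[3:])) == 1
```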
UK House of Lords voting clusters

Algorithm instructed to separate MPs into five clusters. Output:

    1 2 2 1 3 2 2 2 1 4
    1 1 1 1 1 1 5 2 1 1
    2 2 1 2 3 2 2 4 2 1
    2 3 2 1 3 1 1 2 1 2
    1 5 2 1 4 2 2 1 2 1
    1 4 1 1 4 1 2 2 1 5
    1 1 1 2 3 3 2 2 2 5
    2 3 1 2 1 4 1 1 4 4
    1 1 2 1 1 2 2 2 2 1
    2 1 2 1 2 2 1 3 2 1
    1 2 2 1 2 3 4 2 2 2
    …
Voting clusters with parties

    LDem XB   Lab  LDem XB   Lab  XB   Lab  Con  XB
    1    2    2    1    3    2    2    2    1    4
    Con  Con  LDem Con  Con  Con  LDem Lab  Con  LDem
    1    1    1    1    1    1    5    2    1    1
    Lab  Lab  Con  Lab  XB   XB   Lab  XB   Lab  Con
    2    2    1    2    3    2    2    4    2    1
    Lab  XB   Lab  Con  XB   XB   LDem Lab  XB   Lab
    2    3    2    1    3    1    1    2    1    2
    Con  Con  Lab  Con  XB   Lab  Lab  Con  XB   XB
    1    5    2    1    4    2    2    1    2    1
    Con  XB   Con  Con  XB   Con  Lab  XB   LDem Con
    1    4    1    1    4    1    2    2    1    5
    Con  Con  Con  Lab  Bp   XB   Lab  Lab  Lab  LDem
    1    1    1    2    3    3    2    2    2    5
    Lab  XB   Con  Lab  Con  XB   Con  Con  XB   XB
    2    3    1    2    1    4    1    1    4    4
    Con  Con  Lab  Con  Con  XB   Lab  Lab  Lab  Con
    1    1    2    1    1    2    2    2    2    1
    Lab  LDem Lab  Con  Lab  Lab  Con  XB   Lab  Con
    2    1    2    1    2    2    1    3    2    1
    Con  Lab  XB   Con  XB   XB   XB   Lab  Lab  Lab
    1    2    2    1    2    3    4    2    2    2
    …
Clustering Algorithm
Input: data points (feature vectors). Output: a set of clusters, each of which is a set of points.
Visualization
Input: data points (feature vectors). Output: a picture of the points.
Dimensionality reduction

Problem: vector space is high-dimensional, up to thousands of dimensions, but the screen is two-dimensional. We have to go from x ∈ R^N to much lower-dimensional points y ∈ R^K, with K << N. Probably K = 2 or K = 3.
This is called "projection"

Projection from 3 to 2 dimensions
Linear projections

Projects in a straight line to the closest point on the "screen." Mathematically,

    y = Px

where P is a K × N matrix.

Projection from 2 to 1 dimensions
Think of this as rotating to align the "screen" with the coordinate axes, then simply throwing out the values of the higher dimensions.

Projection from 3 to 2 dimensions
Which direction should we look from? Principal components analysis: find a linear projection that preserves the greatest variance.

Take the first K eigenvectors of the covariance matrix, corresponding to the largest eigenvalues. This gives a K-dimensional subspace for projection.
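This recipe can be sketched directly with NumPy: center the data, take the covariance matrix’s top eigenvectors, and project. The toy data (3-D points lying on a line) is an assumption for illustration; production code typically uses an SVD instead:

```python
import numpy as np

def pca_project(X, k):
    """Project rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)              # center the data
    C = np.cov(Xc, rowvar=False)         # N x N covariance matrix
    vals, vecs = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
    P = vecs[:, ::-1][:, :k]             # top-k eigenvectors as columns
    return Xc @ P                        # K-dimensional coordinates

# 3-D points that really lie along one line: one component captures them all.
t = np.linspace(0, 1, 20)
X = np.column_stack([t, 2 * t, 3 * t])
Y = pca_project(X, 1)
```

Because the points are exactly one-dimensional, the single component preserves all of the variance.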
Sometimes overlap is unavoidable
Real data isn’t so simple…
Nonlinear projections

Still going from high-dimensional x to low-dimensional y, but now

    y = f(x)

for some function f(), not linear. So it may not preserve relative distances, angles, etc.

Fish-eye projection from 3 to 2 dimensions
Multidimensional scaling

Idea: try to preserve distances between points "as much as possible." If we have the distances between all points in a distance matrix,

    Dij = |xi − xj| for all i, j

we can recover the original {xi} coordinates exactly (up to rigid transformations). Like working out a country's map if you know how far each city is from every other.
Multidimensional scaling

Torgerson's "classical MDS" algorithm (1952)
Reducing dimension with MDS

Notice: the dimension N is not encoded in the distance matrix D (it is M × M, where M is the number of points). The MDS formula (theoretically) lets us recover point coordinates {x} in any number of dimensions K.
MDS stress minimization

The formula actually minimizes "stress." Think of "springs" between every pair of points; the spring between xi and xj has rest length dij. Stress is zero if all high-dimensional distances are matched exactly in low dimension.

    stress(x) = Σi,j ( |xi − xj| − dij )²
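Torgerson's classical MDS and the stress measure can both be sketched in NumPy. A sketch under assumptions: the toy points are genuinely 2-D, so exact recovery (zero stress) is possible; classical MDS technically minimizes a related "strain" criterion rather than this stress directly, but it recovers exact coordinates when they exist:

```python
import numpy as np

def classical_mds(D, k=2):
    """Torgerson-style classical MDS: recover k-D coordinates from a distance matrix."""
    M = D.shape[0]
    J = np.eye(M) - np.ones((M, M)) / M   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]      # largest eigenvalues first
    L = np.sqrt(np.clip(vals[idx], 0, None))
    return vecs[:, idx] * L               # coordinates, up to rigid motion

def stress(Y, D):
    """Sum of squared mismatches between low-D distances and the originals."""
    DY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    return ((DY - D) ** 2).sum()

# Points that are genuinely 2-D: classical MDS should reproduce their distances.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Y = classical_mds(D, k=2)
```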
Multi-dimensional Scaling

Like "flattening" a stretchy structure into 2D, so that distances between points are preserved (as much as possible).
House of Lords MDS plot
Robustness of results

Regarding these analyses of congressional voting, we could still ask:
• Are we modeling the right thing? (What about other legislative work, e.g. in committee?)
• Are our underlying assumptions correct? (Do representatives really have “ideal points” in a preference space?)
• What are we trying to argue? What will be the effect of pointing out this result?
Why do clusters have meaning?

What is the connection between mathematical and semantic properties?
No unique “right” clustering

Different distance metrics and clustering algorithms give different results. Should we sort incident reports by location, time, actor, event type, author, cost, casualties…? There is only context-specific categorization. And the computer doesn’t understand your context.
Different libraries, different categories