Learning Similarity Metrics for Event Identification in Social Media
WSDM 2010 talk.
![Page 1: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/1.jpg)
Learning Similarity Metrics
for Event Identification in
Social Media
Hila Becker, Luis Gravano (Columbia University)
Mor Naaman (Rutgers University)
![Page 3: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/3.jpg)
“Event” Content in Social Media Sites
“Event” = something that occurs at a certain time in a certain place [Yang et al. '99]
Smaller events, without traditional news coverage
Popular, widely known events
![Page 5: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/5.jpg)
Identifying Events and Associated Social Media Documents
Applications
Event browsing
Local event search
…
General approach: group similar documents via clustering
Each cluster corresponds to one event and its associated social media documents
![Page 9: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/9.jpg)
Event Identification: Challenges
Uneven data quality
Missing, short, uninformative text
… but revealing structured context available:
tags, date/time, geo-coordinates
Scalability
Dynamic data stream of event information
Number of events unknown
Difficult to estimate
Constantly changing
![Page 14: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/14.jpg)
Clustering Social Media Documents
Social media document representation
Social media document similarity
Social media document clustering framework
Similarity metric learning for clustering
Ensemble-based
Classification-based
Evaluation results
![Page 30: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/30.jpg)
Social Media Document Features
Title
Description
Tags
Date/Time
Location
All-Text
![Page 35: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/35.jpg)
Social Media Document Similarity
Features compared: Title, Description, Tags, Location, All-Text, Date/Time
Text: cosine similarity of tf-idf vectors (tf-idf version?; stemming?; stop-word elimination?)
Time: proximity in minutes
Location: geo-coordinate proximity
[Figure: documents labeled A and B along a time axis]
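The three similarity families named on this slide can be sketched as follows. This is a minimal illustration, not the speakers' implementation: the idf formula, the one-year time window, and the half-circumference distance cap are assumptions chosen only so that every score lands in [0, 1], and all function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, num_docs):
    """Sparse tf-idf vector {term: weight}, using idf = log(N / df) (assumed variant)."""
    counts = Counter(tokens)
    return {t: tf * math.log(num_docs / doc_freq[t])
            for t, tf in counts.items() if doc_freq.get(t, 0) > 0}

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

MINUTES_PER_YEAR = 365 * 24 * 60  # assumed normalization window

def time_similarity(minutes_a, minutes_b, window=MINUTES_PER_YEAR):
    """Proximity in minutes, scaled to [0, 1]; 0.0 beyond the window."""
    return max(0.0, 1.0 - abs(minutes_a - minutes_b) / window)

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def location_similarity(loc_a, loc_b, max_km=math.pi * EARTH_RADIUS_KM):
    """Geo-coordinate proximity scaled to [0, 1] (assumed cap: half the circumference)."""
    return max(0.0, 1.0 - haversine_km(*loc_a, *loc_b) / max_km)
```

Keeping each score in the same range makes the per-feature similarities directly comparable when they are later combined.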
![Page 36: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/36.jpg)
General Clustering Framework
Document feature
representation
Social media
documents Event clusters
63
![Page 43: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/43.jpg)
Clustering Algorithm
Many alternatives possible! [Berkhin 2002]
Single-pass incremental clustering algorithm
Scalable, online solution
Used effectively for event identification in textual news
Does not require a priori knowledge of number of clusters
Parameters:
Similarity Function σ
Threshold μ
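A single-pass incremental clusterer with these two parameters can be sketched as below. The callbacks `similarity` (playing the role of σ) and `make_centroid`, and the toy one-dimensional data, are illustrative assumptions rather than details taken from the talk.

```python
def single_pass_cluster(documents, similarity, threshold, make_centroid):
    """Visit each document once, in arrival order.

    A document joins the cluster whose centroid is most similar to it,
    provided that similarity exceeds `threshold` (the role of μ);
    otherwise it starts a new cluster, so the number of clusters is
    never fixed in advance.
    """
    clusters = []   # clusters[i] is the list of member documents
    centroids = []  # centroids[i] summarizes clusters[i]
    for doc in documents:
        best_index, best_score = None, threshold
        for i, centroid in enumerate(centroids):
            score = similarity(doc, centroid)
            if score > best_score:
                best_index, best_score = i, score
        if best_index is None:
            clusters.append([doc])
            centroids.append(make_centroid([doc]))
        else:
            clusters[best_index].append(doc)
            centroids[best_index] = make_centroid(clusters[best_index])
    return clusters

# Toy usage: documents are timestamps, similarity decays with the gap.
stream = [0.0, 1.0, 0.5, 100.0, 101.0]
clusters = single_pass_cluster(
    stream,
    similarity=lambda x, c: 1.0 / (1.0 + abs(x - c)),
    threshold=0.3,
    make_centroid=lambda members: sum(members) / len(members),
)
print(clusters)  # [[0.0, 1.0, 0.5], [100.0, 101.0]]
```

Because each document is compared only against the current centroids, the work per document grows with the number of clusters rather than the number of documents seen so far.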
![Page 46: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/46.jpg)
Cluster Representation and Parameter Tuning
Centroid cluster representation
Average tf-idf scores
Average time
Geographic mid-point
Parameter tuning in supervised training phase
Clustering quality metrics to optimize:
Normalized Mutual Information (NMI) [Strehl et al. 2002]
B-Cubed [Amigó et al. 2008]
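The three centroid components listed on this slide might be computed as in the sketch below. The function names are hypothetical, and computing the geographic mid-point by averaging unit vectors on the sphere is one common choice that the slides do not confirm.

```python
import math

def average_tfidf(vectors):
    """Mean of sparse tf-idf vectors; a term absent from a vector counts as 0."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    n = len(vectors)
    return {term: w / n for term, w in total.items()}

def average_time(minutes):
    """Mean timestamp, in the same units as the inputs."""
    return sum(minutes) / len(minutes)

def geographic_midpoint(points):
    """Mid-point of (lat, lon) pairs in degrees via averaged 3-D unit vectors."""
    x = y = z = 0.0
    for lat, lon in points:
        la, lo = math.radians(lat), math.radians(lon)
        x += math.cos(la) * math.cos(lo)
        y += math.cos(la) * math.sin(lo)
        z += math.sin(la)
    n = len(points)
    x, y, z = x / n, y / n, z / n
    lon = math.atan2(y, x)
    lat = math.atan2(z, math.hypot(x, y))
    return math.degrees(lat), math.degrees(lon)
```

Averaging on the sphere avoids the wrap-around error that a plain mean of longitudes produces near the 180° meridian.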
![Page 55: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/55.jpg)
Clustering Quality Metrics
Characteristics of clusters:
Homogeneity
Completeness
Captured by both NMI and B-Cubed
Optimize both metrics using a single (Pareto optimal) objective function: NMI + B-Cubed
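Both metrics, and their sum as a tuning objective, can be sketched over two parallel label lists (predicted cluster and true event per document). The slides give no formulas, so the NMI normalization used here, 2·I(C;E) / (H(C) + H(E)), and the use of the B-Cubed F-score are assumptions.

```python
import math
from collections import Counter

def nmi(clusters, events):
    """Normalized mutual information: 2 * I(C;E) / (H(C) + H(E))."""
    n = len(clusters)
    c_counts, e_counts = Counter(clusters), Counter(events)
    joint = Counter(zip(clusters, events))

    def entropy(counts):
        return -sum((k / n) * math.log(k / n) for k in counts.values())

    mutual = sum((k / n) * math.log(k * n / (c_counts[c] * e_counts[e]))
                 for (c, e), k in joint.items())
    denom = entropy(c_counts) + entropy(e_counts)
    return 1.0 if denom == 0 else 2.0 * mutual / denom

def b_cubed(clusters, events):
    """Harmonic mean of per-document averaged B-Cubed precision and recall."""
    n = len(clusters)
    c_counts, e_counts = Counter(clusters), Counter(events)
    joint = Counter(zip(clusters, events))
    # Each of the k documents in cell (c, e) has precision k/|c| and recall k/|e|.
    precision = sum(k * k / c_counts[c] for (c, e), k in joint.items()) / n
    recall = sum(k * k / e_counts[e] for (c, e), k in joint.items()) / n
    return 2 * precision * recall / (precision + recall)

def tuning_objective(clusters, events):
    """Single objective used to pick parameters: NMI + B-Cubed."""
    return nmi(clusters, events) + b_cubed(clusters, events)
```

Lumping every document into one cluster is complete but not homogeneous, and both scores drop accordingly: for two equally sized events, NMI falls to 0 and B-Cubed to 2/3.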
![Page 58: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/58.jpg)
Learning a Similarity Metric for Clustering
Ensemble-based similarity
Training a cluster ensemble
Computing a similarity score by:
Combining individual partitions
Combining individual similarities
Classification-based similarity
Training data sampling strategies
Modeling strategies
![Page 63: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/63.jpg)
Overview of a Cluster Ensemble Algorithm
Clusterers: C_title, C_tags, C_time
Weights (learned in a training step): W_title, W_tags, W_time
Consensus function f(C, W): combine ensemble similarities
Output: ensemble clustering solution
![Page 68: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/68.jpg)
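Overview of a Cluster Ensemble Algorithm
Weights: W_title, W_tags, W_time
Consensus function: f(C, W)
For each document d_i and cluster c_j, each clusterer tests its own similarity against its own threshold:
σ_Ctitle(d_i, c_j) > μ_Ctitle
σ_Ctags(d_i, c_j) > μ_Ctags
σ_Ctime(d_i, c_j) > μ_Ctime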
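One plausible reading of the two consensus options (combining individual partitions versus combining individual similarities) is sketched below. The weighted-vote and weighted-average forms, the function names, and every number are assumptions for illustration; the slides show only the per-clusterer threshold tests and the weights.

```python
def partition_consensus(similarities, thresholds, weights):
    """Combine partitions: each clusterer casts a weighted yes/no vote,
    voting yes when its similarity exceeds its own threshold."""
    total = sum(weights.values())
    votes = sum(weights[name] for name in weights
                if similarities[name] > thresholds[name])
    return votes / total

def similarity_consensus(similarities, weights):
    """Combine similarities: weighted average of the raw per-clusterer scores."""
    total = sum(weights.values())
    return sum(weights[name] * similarities[name] for name in weights) / total

# Hypothetical scores for one (document, cluster) pair.
weights = {"title": 0.5, "tags": 0.3, "time": 0.2}
thresholds = {"title": 0.5, "tags": 0.5, "time": 0.5}
similarities = {"title": 0.8, "tags": 0.1, "time": 0.9}

vote_score = partition_consensus(similarities, thresholds, weights)
average_score = similarity_consensus(similarities, weights)
```

Either score can then be used as the similarity function of the clustering step and compared against a single ensemble-level threshold.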
![Page 71: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/71.jpg)
Classification-based Similarity Metrics
Classify pairs of documents as similar/dissimilar
Feature vector
Pairwise similarity scores
One feature per similarity metric (e.g., time-proximity, location-proximity, …)
Modeling strategies
Document pairs
Document-centroid pairs
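The idea can be sketched as follows: each pair becomes a vector of per-feature similarity scores, and a binary classifier's predicted probability serves as the learned similarity. The feature order, the toy training rows, and the plain stochastic-gradient logistic regression are illustrative assumptions, not the configuration used in the talk.

```python
import math

# Assumed feature order: one entry per individual similarity metric.
FEATURES = ("title", "description", "tags", "all_text", "time", "location")

def pair_features(similarities):
    """Turn a {metric_name: score} mapping into a fixed-order feature vector."""
    return [similarities[name] for name in FEATURES]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(rows, labels, epochs=2000, lr=0.5):
    """Fit weights and bias with per-example gradient steps on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            err = y - p
            bias += lr * err
            weights = [w + lr * err * v for w, v in zip(weights, x)]
    return weights, bias

def learned_similarity(weights, bias, x):
    """Predicted probability that the pair belongs to the same event."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

# Toy training pairs (made-up scores): label 1 = same event, 0 = different.
rows = [
    [0.9, 0.8, 0.9, 0.9, 1.0, 1.0],
    [0.8, 0.7, 0.6, 0.8, 0.99, 0.98],
    [0.1, 0.0, 0.1, 0.1, 0.5, 0.2],
    [0.0, 0.1, 0.0, 0.05, 0.3, 0.1],
]
labels = [1, 1, 0, 0]
weights, bias = train_logistic_regression(rows, labels)
```

The same feature vector works for document-centroid pairs: the scores are simply computed between a document and a cluster centroid instead of between two documents.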
![Page 75: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/75.jpg)
Training Classification-based Similarity
Challenge: most document pairs do not correspond to the same event
Skewed label distribution
Small, highly homogeneous clusters
Sampling strategies
Random
Select a document at random
Randomly create one positive and one negative example
Time-based
Create examples for all N×N pairs among the first N documents
Resample such that the label distribution is balanced
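Both sampling strategies can be sketched as below, given a mapping from document to event label. The function names, the seed handling, the use of unordered pairs, and balancing by downsampling the majority label are assumptions; the slides state only the outline of each strategy.

```python
import random

def random_pair_sample(doc_events, num_draws, seed=0):
    """Random strategy: pick a document, then one same-event partner
    (positive example) and one different-event partner (negative example)."""
    rng = random.Random(seed)
    docs = sorted(doc_events)
    examples = []
    for _ in range(num_draws):
        anchor = rng.choice(docs)
        same = [d for d in docs
                if d != anchor and doc_events[d] == doc_events[anchor]]
        other = [d for d in docs if doc_events[d] != doc_events[anchor]]
        if not same or not other:
            continue  # this anchor cannot yield both a positive and a negative
        examples.append((anchor, rng.choice(same), 1))
        examples.append((anchor, rng.choice(other), 0))
    return examples

def time_based_pair_sample(docs_by_time, doc_events, n, seed=0):
    """Time-based strategy: label every pair among the first n documents,
    then downsample the majority label so both labels are equally frequent."""
    rng = random.Random(seed)
    first = docs_by_time[:n]
    positives, negatives = [], []
    for i, a in enumerate(first):
        for b in first[i + 1:]:
            label = 1 if doc_events[a] == doc_events[b] else 0
            (positives if label else negatives).append((a, b, label))
    k = min(len(positives), len(negatives))
    return rng.sample(positives, k) + rng.sample(negatives, k)
```

Both functions return a balanced set of `(doc_a, doc_b, label)` triples, which counters the skew toward different-event pairs noted above.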
![Page 78: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/78.jpg)
Experiments: Alternative Similarity Metrics
Ensemble-based techniques
Combining individual partitions (ENS-PART)
Combining individual similarities (ENS-SIM)
Classification-based techniques
Modeling: document-document vs. document-centroid pairs
Sampling: time-based vs. random
Logistic Regression (CLASS-LR), Support Vector Machines (CLASS-SVM)
Baselines
Title, Description, Tags, All-Text, Time-Proximity, Location-Proximity
![Page 82: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/82.jpg)
Experimental Setup
Datasets:
Upcoming
>270K Flickr photos
Event labels from the “upcoming” event database (upcoming:event=12345)
Split into 3 parts for training/validation/testing
LastFM
>594K Flickr photos
Event labels from last.fm music catalog (lastfm:event=6789)
Used as an additional test set
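Ground-truth labels of this form can be read off a photo's tag list with a small parser. This sketch assumes tags follow exactly the `namespace:event=id` pattern shown in the two examples above; the function name and return shape are hypothetical.

```python
import re

# Matches tags such as "upcoming:event=12345" or "lastfm:event=6789".
EVENT_TAG = re.compile(r"^(upcoming|lastfm):event=(\d+)$")

def event_label(tags):
    """Return (source, event_id) from the first event tag found, else None."""
    for tag in tags:
        match = EVENT_TAG.match(tag)
        if match:
            return match.group(1), match.group(2)
    return None
```

Photos that share the same `(source, event_id)` pair form one ground-truth cluster for evaluation.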
![Page 91: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/91.jpg)
Clustering Accuracy over Upcoming Test Set
All similarity learning techniques outperform the baselines
Classification-based techniques perform better than ensemble-based techniques

| Algorithm | NMI | B-Cubed |
| --- | --- | --- |
| All-Text | 0.9240 | 0.7697 |
| Tags | 0.9229 | 0.7676 |
| ENS-PART | 0.9296 | 0.7819 |
| ENS-SIM | 0.9322 | 0.7861 |
| CLASS-SVM | 0.9425 | 0.8095 |
| CLASS-LR | 0.9444 | 0.8155 |
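
The two quality measures reported on this slide can be sketched as follows. This is a hedged illustration, not the authors' evaluation code: it assumes NMI is normalized by the arithmetic mean of the two entropies and that B-Cubed is reported as the harmonic mean of per-document precision and recall averaged over all documents. The function names are hypothetical.

```python
import math
from collections import Counter


def nmi(clusters, events):
    """NMI = I(C; E) / ((H(C) + H(E)) / 2). The normalization is an assumption."""
    n = len(clusters)
    c_counts = Counter(clusters)
    e_counts = Counter(events)
    joint = Counter(zip(clusters, events))

    def entropy(counts):
        return -sum((v / n) * math.log(v / n) for v in counts.values())

    # Mutual information between cluster assignments and true event labels.
    mutual = sum(
        (v / n) * math.log(v * n / (c_counts[c] * e_counts[e]))
        for (c, e), v in joint.items()
    )
    denom = (entropy(c_counts) + entropy(e_counts)) / 2
    return mutual / denom if denom > 0 else 1.0


def b_cubed(clusters, events):
    """Harmonic mean of averaged per-document B-Cubed precision and recall."""
    n = len(clusters)
    c_counts = Counter(clusters)
    e_counts = Counter(events)
    joint = Counter(zip(clusters, events))
    # Each of the v documents in cell (c, e) has precision v / |c|
    # and recall v / |e|, so the cell contributes v * v / size to the sum.
    precision = sum(v * v / c_counts[c] for (c, _), v in joint.items()) / n
    recall = sum(v * v / e_counts[e] for (_, e), v in joint.items()) / n
    return 2 * precision * recall / (precision + recall)


# Toy input: four documents, two true events, everything put in one cluster.
clusters = [0, 0, 0, 0]
events = ["a", "a", "b", "b"]
print(nmi(clusters, events), b_cubed(clusters, events))
```

Both functions take one cluster ID and one true event label per document, in the same order.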
![Page 95: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/95.jpg)
Statistical Significance Analysis
Clustering results for 10 partitions of the Upcoming test set
Differences significant under the Friedman test, p < 0.05
Post-hoc analysis:
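
The Friedman test named on this slide ranks the algorithms within each partition and checks whether their rank sums differ more than chance would allow. A minimal sketch of the test statistic, assuming the textbook form without tie correction; the scores below are made-up toy values, not results from the talk, and the function names are hypothetical:

```python
def average_ranks(row):
    """Rank the values in one row (1 = smallest), averaging ranks over ties."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """scores[p][a] is the score of algorithm a on partition p."""
    n = len(scores)      # number of partitions (blocks)
    k = len(scores[0])   # number of algorithms (treatments)
    rank_sums = [0.0] * k
    for row in scores:
        for a, rank in enumerate(average_ranks(row)):
            rank_sums[a] += rank
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)


# Toy scores: 4 partitions x 3 algorithms (illustrative values only).
toy_scores = [
    [0.70, 0.75, 0.80],
    [0.71, 0.76, 0.81],
    [0.69, 0.74, 0.79],
    [0.72, 0.77, 0.82],
]
print(friedman_statistic(toy_scores))
```

Under the null hypothesis the statistic is approximately chi-square distributed with k − 1 degrees of freedom, so it would be compared against that distribution to obtain the p-value.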
![Page 96: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/96.jpg)
NMI: Clustering Accuracy over Both Test Sets
[Figure: NMI on the Upcoming and LastFM test sets]
Similarity learning models trained on Upcoming data show similar trends when tested on LastFM data
![Page 98: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/98.jpg)
Conclusions
Structured context features of social media documents
Effective complementary cues for social media document similarity
Tags, Time-Proximity among highest weighted features
Domain-appropriate similarity metrics
Weighted combination yields high quality clustering results
Significantly outperform text-only techniques
Similarity learning models generalize to unseen data sets
![Page 101: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/101.jpg)
Current and Future Work
Improving clustering accuracy with social media “links” [SSM ’10 poster]
Capturing event content across sites (YouTube, Flickr, Twitter)
Designing event search strategies
![Page 102: Learning Similarity Metrics for Event Identification in Social Media](https://reader035.vdocuments.net/reader035/viewer/2022062617/54c6cc3c4a795933718b4591/html5/thumbnails/102.jpg)
Thank You!