Visual Analytics for Interactive Exploration of Large-Scale Documents via Nonnegative Matrix Factorization
Jaegul Choo*, Barry L. Drake†, and Haesun Park*
*Georgia Institute of Technology
†Georgia Tech Research Institute
Big Data Innovators Gathering (BIG) 2014
What is Visual Analytics?
Data Mining                 Visualization
Automated                   Interactive (human in the loop)
Clearly defined tasks       Exploratory analysis
Fast computation            Deeper understanding
>Millions of data items     Thousands of data items
What is Visual Analytics? Leveraging Both Worlds
Visual Analytics = Data Mining + Visualization
Visual Analytics for Large-Scale Documents
Topic merging
Topic splitting
Doc-induced topic creation
Keyword-induced topic creation
UTOPIAN: User-driven Topic Modeling based on Interactive NMF
VisIRR: Information Retrieval and Personalized Recommender System
Motivation: Too Many Documents to Read
Product reviews
• Which tablet to buy?
• iPad (2,000 reviews) vs. Galaxy Tab (1,300 reviews)
Research papers
• Which sub-area in data mining to focus on?
• >Thousands of new papers every year
Patent search
Many other applications
Topic Modeling: Summarizing Documents
Topic: distribution over keywords
Document: distribution over topics
[Figure: Documents 1-4 summarized by Topics 1-3 over keywords such as gene, dna, life, evolve, organism, brain, neuron, nerve]
Nonnegative Matrix Factorization (NMF)
Low-rank approximation via matrix factorization
Why nonnegativity constraints?
Better interpretation (vs. better approximation, e.g., SVD)
$\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F$
[Figure: A ≈ WH]
NMF as Topic Modeling
Topic: distribution over keywords
Document: distribution over topics
[Figure: A ≈ WH, with Documents 1-4 and keywords (gene, dna, life, evolve, organism, brain, neuron, nerve) factored into Topics 1-3]
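A minimal sketch of how NMF can act as a topic model, under several assumptions not stated on the slides: A is taken to be a keyword-by-document matrix, so columns of W play the role of topics and columns of H give each document's topic weights; the counts in A are made up for illustration; and the solver is the Lee-Seung multiplicative update rule, chosen for brevity rather than because the authors use it.

```python
import numpy as np

def nmf(A, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate A (keywords x documents) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates never flip the sign of an entry,
        # so W and H stay nonnegative throughout.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

keywords = ["gene", "dna", "life", "evolve", "organism", "brain", "neuron", "nerve"]
# Toy keyword-by-document counts (hypothetical data, one column per document).
A = np.array([
    [3, 0, 0, 2],
    [2, 0, 0, 1],
    [0, 3, 0, 1],
    [0, 2, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 3, 0],
    [0, 0, 2, 0],
    [0, 0, 2, 0],
], dtype=float)

W, H = nmf(A, k=3)

# Topic summary: the highest-weighted keywords in each column of W.
for t in range(W.shape[1]):
    top = np.argsort(W[:, t])[::-1][:3]
    print(f"Topic {t + 1}:", [keywords[i] for i in top])

# Document summary: normalize each column of H into topic proportions.
doc_topics = H / H.sum(axis=0, keepdims=True)
print(np.round(doc_topics, 2))
```

The topic order depends on the random initialization, so the printed topic numbering is arbitrary.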
Why NMF (instead of LDA)? Consistency from Multiple Runs
[Figure: Documents’ topical membership changes among 10 runs; panels: InfoVis/VAST paper data set, 20 newsgroup data set]
Why NMF (instead of LDA)? Empirical Convergence
[Figure: Documents’ topical membership changes between iterations, InfoVis/VAST paper data set; LDA: 10 minutes, NMF: 48 seconds]
NMF vs. LDA: Topic Summary (Top Keywords)
InfoVis/VAST paper data set

NMF
Topic     Run #1                    Run #2
Topic 1   visualization, design     visualization, design
Topic 2   information, user         information, user
Topic 3   analysis, system          analysis, system
Topic 4   graph, layout             graph, layout
Topic 5   visual, analytics         visual, analytics
Topic 6   data, sets                data, sets
Topic 7   color, weaving            color, weaving

LDA
Topic     Run #1                    Run #2
Topic 1   document, similarities    document, query
Topic 2   knowledge, edge           analysts, scatterplot
Topic 3   query, collaborative      spatial, collaborative
Topic 4   social, tree              text, document
Topic 5   measures, multivariate    multidimensional, high
Topic 6   tree, animation           tree, aggregation
Topic 7   dimension, treemap        dimension, treemap
Topics are more consistent in NMF than in LDA. Topic quality is comparable between NMF and LDA.
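One way the run-to-run consistency could be quantified is sketched here. The slides do not say how "topical membership changes" were computed, so the measure is an assumption: each document is assigned to a single topic per run, run B's topic labels are renamed to best match run A's, and the fraction of documents that still disagree is reported. The function name and the toy label lists are hypothetical.

```python
from itertools import permutations

def membership_change(labels_a, labels_b, k):
    """Fraction of documents whose topic differs between two runs,
    after relabeling run B's topics to best agree with run A's."""
    best_agree = 0
    # Brute force over all relabelings; fine for small k only.
    for perm in permutations(range(k)):
        agree = sum(1 for a, b in zip(labels_a, labels_b) if a == perm[b])
        best_agree = max(best_agree, agree)
    return 1 - best_agree / len(labels_a)

# Toy per-document topic assignments from two runs (made-up labels).
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [2, 2, 0, 0, 1, 0]  # same grouping with renamed topics, except the last document

print(membership_change(run_1, run_2, k=3))
```

A value near 0 means two runs place documents in the same topics up to renaming; for larger numbers of topics an assignment solver would be a better fit than enumerating permutations.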
UTOPIAN: User-Driven Topic Modeling Based on Interactive NMF
[Choo et al., TVCG’13]
Topic merging
Topic splitting
Doc-induced topic creation
Keyword-induced topic creation
Visualization Example: Car Reviews
Topic summaries are NOT perfect. UTOPIAN allows user interactions for improving them.
Weakly Supervised NMF: Supporting User Interactions
Weakly supervised NMF [Choo et al., DMKD, accepted with rev.]
$\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F^2 + \alpha \|(W - W_r) M_W\|_F^2 + \beta \|M_H (H - D_H H_r)\|_F^2$
$W_r$, $H_r$: reference matrices for $W$ and $H$ (user input)
$M_W$, $M_H$: diagonal matrices for weighting/masking columns of $W$ and rows of $H$
Algorithm: block-coordinate descent framework
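To make the three terms of the weakly supervised objective concrete, here is a small sketch that only evaluates it; it does not implement the block-coordinate descent solver. The shapes are assumptions: W is m x k, H is k x n, and M_W, M_H, and D_H are k x k diagonal matrices. The slide does not define D_H, so it is treated here as a diagonal scaling matrix set to the identity. The function name `ws_nmf_objective` is hypothetical.

```python
import numpy as np

def ws_nmf_objective(A, W, H, W_r, H_r, M_W, M_H, D_H, alpha, beta):
    """Value of ||A - WH||_F^2 + alpha ||(W - W_r) M_W||_F^2
    + beta ||M_H (H - D_H H_r)||_F^2."""
    fit = np.linalg.norm(A - W @ H, "fro") ** 2
    w_term = alpha * np.linalg.norm((W - W_r) @ M_W, "fro") ** 2
    h_term = beta * np.linalg.norm(M_H @ (H - D_H @ H_r), "fro") ** 2
    return fit + w_term + h_term

rng = np.random.default_rng(0)
m, n, k = 6, 5, 2
A = rng.random((m, n))
W = rng.random((m, k))
H = rng.random((k, n))

W_r = W.copy()
W_r[:, 0] += 1.0             # user reference differs from W in topic 1 only
H_r = H.copy()
M_W = np.diag([1.0, 0.0])    # supervise only the first column of W
M_H = np.zeros((k, k))       # no supervision on H in this example
D_H = np.eye(k)              # assumed diagonal scaling (identity here)

base = ws_nmf_objective(A, W, H, W_r, H_r, M_W, M_H, D_H, alpha=0.0, beta=0.0)
penalized = ws_nmf_objective(A, W, H, W_r, H_r, M_W, M_H, D_H, alpha=2.0, beta=1.0)
print(base, penalized)
```

With alpha and beta set to zero the objective reduces to plain NMF; the masks decide which topics the user's reference matrices are allowed to influence.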
Interaction Demo Video
After topic splitting (triangle) and topic merging (circle)
Before interaction
InfoVis-VAST Paper Data
http://tinyurl.com/UTOPIAN2013
VisIRR: Information Retrieval and Personalized Recommender System
Features: Efficient Large-scale Data Processing
Document corpus: ~400,000 academic papers in CS
Data management
• Structured data: author, year, venue, keywords, citation/reference count
• Unstructured data: bag-of-words vectors of title, abstract, keywords
• Graph data: content, citation, and co-authorship
Efficient data handling
• Dynamic loading from disk to memory via cache-like strategy
• Scalable data expansion in O(n)
Features: Personalized Recommendation
Works from user preferences on documents
• Preference scale of 1 (highly dislike) to 5 (highly like)
Various recommendation schemes
• Based on content, citation network, and co-authorship
Algorithm
• Preference propagation on graph using heat kernel
$r_\alpha = \alpha \sum_k (1 - \alpha)^k f W^k$
$r_\alpha$ is the recommendation score vector with control parameter $\alpha$, $f$ is the vector of user-assigned ratings, and $W$ is the input graph.
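The propagation formula can be sketched as a truncated sum. Several details are assumptions, since the slide does not give them: the series is cut off after `n_terms` terms, W is a row-normalized adjacency matrix of a toy four-document chain graph, and a single liked document is encoded as a 1 in f. The function name `propagate_preferences` is hypothetical.

```python
import numpy as np

def propagate_preferences(f, W, alpha, n_terms=50):
    """Truncated evaluation of r = alpha * sum_k (1 - alpha)^k f W^k."""
    f = np.asarray(f, dtype=float)
    r = np.zeros_like(f)
    term = f.copy()  # holds f @ W^k, starting at k = 0
    for k in range(n_terms):
        r += alpha * (1 - alpha) ** k * term
        term = term @ W
    return r

# Toy graph: documents 0-1-2-3 connected in a chain (made-up data).
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
W = adjacency / adjacency.sum(axis=1, keepdims=True)  # row-normalized (assumption)

f = np.array([1.0, 0.0, 0.0, 0.0])  # the user rated only document 0
r = propagate_preferences(f, W, alpha=0.5)
print(np.round(r, 3))
```

Larger values of alpha keep the scores concentrated on the rated documents, while smaller values let the preference spread further along the graph.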
VisIRR Demo: Citation-based Recommendation
Preference-assigned item (‘highly like’): ‘Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations’
Most of the recommended items are highly cited. Computational zoom-in shows sub-areas relevant to the article.
http://tinyurl.com/VisIRR
VisIRR Demo: Co-authorship-based Recommendation
http://tinyurl.com/VisIRR
Preference-assigned item (‘highly like’): ‘Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations’
The recommendations cover other research areas of this paper's authors.
Computational zoom-in on recommended items
Retrieved + recommended items
[Figure: Kiva loan flow among Borrower, Field Partner, Kiva, and Lender: 1. Request loan; 2. Disburse loan; 3. Post loan; 4. Fund loan; 5. Send money; 6. Pay money]
Interested in learning Micro-Financing Analysis in Kiva.org?
Check out my presentation at Room 104, Wed 4pm
Thank you! Jaegul Choo [email protected] (Currently on the Academic Job Market)
Selected Papers
Choo et al., Document Topic Modeling and Discovery in Visual Analytics via Nonnegative Matrix Factorization, TVCG, 2013
Choo et al., VisIRR: Interactive Visual Information Retrieval and Recommendation for Large-scale Document Data, Tech Report, Georgia Tech, 2013
UTOPIAN: Interactions and Key Techniques
Visualization
• Supervised t-SNE
Topic modeling
• NMF
Interaction
• Weakly-supervised NMF
• Refining topic keywords
• Merging topics
• Splitting a topic
• Creating new topics from seed documents/keywords
Per-iteration Visualization Framework
Supervised t-SNE: Visualizing documents
Original t-SNE
• Documents do not have clear topic clusters.
Supervised t-SNE
• $d(x_i, x_j) \leftarrow \alpha \cdot d(x_i, x_j)$ if $x_i$ and $x_j$ belong to the same topic (e.g., $\alpha = 0.3$).
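The distance adjustment used by supervised t-SNE can be sketched as follows. Euclidean distance is an assumption here, since the slide does not name the underlying metric, and the function name `supervised_distances` and the toy points are hypothetical. The sketch covers only the distance scaling, not t-SNE itself.

```python
import numpy as np

def supervised_distances(X, topics, alpha=0.3):
    """Pairwise distances, shrunk by alpha for pairs sharing a topic."""
    X = np.asarray(X, dtype=float)
    topics = np.asarray(topics)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))           # pairwise Euclidean distances
    same_topic = topics[:, None] == topics[None, :]
    return np.where(same_topic, alpha * D, D)

# Three toy documents in 2-D; the first two share a topic.
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
topics = [0, 0, 1]
D_sup = supervised_distances(X, topics, alpha=0.3)
print(D_sup)
```

The resulting matrix could then be handed to a t-SNE implementation that accepts precomputed distances, which pulls same-topic documents closer together in the layout.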
PIVE (Per-iteration Visualization Environment)
[Figure: Standard approach: input data, computational method (per-iteration routine, ...), visualization, interaction]
[Figure: PIVE approach: input data, per-iteration routine (Thread 1), visualization (Thread 2), interaction]
Integration methodology of Iterative Methods for Real-Time Interactive Visualization [Choo et al., VAST’14, to submit]
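The two-thread idea behind PIVE can be sketched with the standard library alone. This is an illustrative assumption about the structure, not the authors' implementation: one thread runs an iterative method and publishes every intermediate result, while a second thread consumes the results as they arrive. The iterative routine is a stand-in (Newton's method for the square root of 2), user interaction is left out, and all function names are hypothetical.

```python
import queue
import threading

def iterative_method(n_iter, results):
    """Thread 1: run the per-iteration routine and publish each intermediate result."""
    x = 10.0
    for it in range(n_iter):
        x = 0.5 * (x + 2.0 / x)   # stand-in routine: Newton step toward sqrt(2)
        results.put((it, x))
    results.put(None)             # sentinel: computation finished

def visualize(results, frames):
    """Thread 2: consume intermediate results as they arrive."""
    while True:
        item = results.get()
        if item is None:
            break
        frames.append(item)       # a real system would redraw the view here

results = queue.Queue()
frames = []
worker = threading.Thread(target=iterative_method, args=(8, results))
viewer = threading.Thread(target=visualize, args=(results, frames))
worker.start()
viewer.start()
worker.join()
viewer.join()

print(len(frames), frames[-1][1])
```

Because the view is updated from partial results, a user can see the solution take shape without waiting for the method to converge.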