

Visual Analytics for Interactive Exploration of Large-Scale Documents via Nonnegative Matrix Factorization

Jaegul Choo*, Barry L. Drake†, and Haesun Park*

*Georgia Institute of Technology

†Georgia Tech Research Institute

Big Data Innovators Gathering (BIG) 2014

Page 2: What is Visual Analytics?

Data Mining: automated; clearly defined tasks; fast computation; >millions of data items

Visualization: interactive (human in the loop); exploratory analysis; deeper understanding; thousands of data items

Page 3: What is Visual Analytics? Leveraging Both Worlds

Visual Analytics: Data Mining + Visualization

Page 4: Visual Analytics for Large-Scale Documents

Topic merging

Topic splitting

Doc-induced topic creation

Keyword-induced topic creation

UTOPIAN: User-driven Topic Modeling based on Interactive NMF

VisIRR: Information Retrieval and Personalized Recommender System

Page 5: Motivation: Too Many Documents to Read

Product reviews
• Which tablet to buy?
• iPad (2,000 reviews) vs. Galaxy Tab (1,300 reviews)

Research papers
• Which sub-area in data mining to focus on?
• >Thousands of new papers every year

Patent search

Many other applications

Page 6: Topic Modeling: Summarizing Documents

[Figure: Documents 1-4 with keywords gene, dna, life, evolve, organism, brain, neuron, nerve]

Page 7: Topic Modeling: Summarizing Documents

Topic: distribution over keywords

[Figure: Documents 1-4 and Topics 1-3 with keywords gene, dna, life, evolve, organism, brain, neuron, nerve]

Page 8: Topic Modeling: Summarizing Documents

Topic: distribution over keywords

Document: distribution over topics

[Figure: Documents 1-4 and Topics 1-3 with keywords gene, dna, life, evolve, organism, brain, neuron, nerve]

Page 9: Nonnegative Matrix Factorization (NMF)

Low-rank approximation via matrix factorization: A ≈ WH

$$\min_{W \ge 0,\ H \ge 0} \| A - WH \|_F$$

Why nonnegativity constraints?
• Better interpretation (vs. better approximation, e.g., SVD)
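A minimal sketch of the factorization above, assuming multiplicative update rules as the solver. That is a common textbook choice, and this slide does not name the solver the authors use. The function names (`nmf`, `matmul`, `transpose`, `frobenius_error`) and the use of plain Python lists are illustrative assumptions, not part of the talk.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def transpose(X):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]

def frobenius_error(A, W, H):
    """Return ||A - WH||_F."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5

def nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Approximate nonnegative A (m x n) by W (m x k) times H (k x n)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    # Strictly positive random start so every entry can be rescaled.
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * p / (q + eps) for h, p, q in zip(rh, rp, rq)]
             for rh, rp, rq in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * p / (q + eps) for w, p, q in zip(rw, rp, rq)]
             for rw, rp, rq in zip(W, num, den)]
    return W, H

# Toy matrix with two obvious blocks (made-up data).
A = [[1.0, 1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0],
     [0.0, 0.0, 1.0, 1.0]]
W, H = nmf(A, k=2)
print("approximation error:", frobenius_error(A, W, H))
```

Because every update multiplies a nonnegative entry by a nonnegative ratio, W and H stay nonnegative without an explicit projection step.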

Page 10: NMF as Topic Modeling

A ≈ WH

Topic: distribution over keywords

Document: distribution over topics

[Figure: factors W and H shown against Documents 1-4 and Topics 1-3, with keywords gene, dna, life, evolve, organism, brain, neuron, nerve]
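To make the reading of the two factors concrete, here is a small sketch. It assumes A is a keyword-by-document matrix, so that each column of W is a topic (weights over keywords) and each column of H describes one document (weights over topics). The slide does not state that orientation explicitly. The numbers in `W` and `H` are made up for illustration, and `top_keywords` and `topic_distribution` are hypothetical helper names.

```python
keywords = ["gene", "dna", "life", "evolve", "organism", "brain", "neuron", "nerve"]

# Made-up factors: W is keywords x topics, H is topics x documents.
W = [[0.9, 0.0, 0.0],   # gene
     [0.8, 0.1, 0.0],   # dna
     [0.1, 0.7, 0.0],   # life
     [0.0, 0.9, 0.0],   # evolve
     [0.1, 0.6, 0.1],   # organism
     [0.0, 0.0, 0.9],   # brain
     [0.0, 0.0, 0.8],   # neuron
     [0.0, 0.1, 0.7]]   # nerve
H = [[2.0, 1.0, 0.0, 0.0],
     [0.0, 1.0, 3.0, 0.0],
     [0.0, 0.0, 1.0, 4.0]]

def top_keywords(W, keywords, top_n=2):
    """For each column (topic) of W, return its highest-weighted keywords."""
    topics = []
    for t in range(len(W[0])):
        ranked = sorted(range(len(W)), key=lambda i: W[i][t], reverse=True)
        topics.append([keywords[i] for i in ranked[:top_n]])
    return topics

def topic_distribution(H, doc):
    """Normalize column `doc` of H so that its topic weights sum to 1."""
    col = [row[doc] for row in H]
    total = sum(col)
    return [v / total for v in col] if total > 0 else col

for t, words in enumerate(top_keywords(W, keywords), start=1):
    print(f"Topic {t}:", ", ".join(words))
print("Document 2 topic weights:", topic_distribution(H, doc=1))
```

Normalizing a column of H is only one way to read it as a distribution; NMF itself does not force the columns to sum to 1.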

Page 11: Why NMF (instead of LDA)? Consistency from Multiple Runs

Documents’ topical membership changes among 10 runs

[Figure panels: InfoVis/VAST paper data set; 20 newsgroup data set]

Page 12: Why NMF (instead of LDA)? Empirical Convergence

Documents’ topical membership changes between iterations

LDA: 10 minutes
NMF: 48 seconds

InfoVis/VAST paper data set

Page 13: NMF vs. LDA: Topic Summary (Top Keywords)

NMF
Run | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7
#1 | visualization, design | information, user | analysis, system | graph, layout | visual, analytics | data, sets | color, weaving
#2 | visualization, design | information, user | analysis, system | graph, layout | visual, analytics | data, sets | color, weaving

LDA
Run | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7
#1 | document, similarities | knowledge, edge | query, collaborative | social, tree | measures, multivariate | tree, animation | dimension, treemap
#2 | document, query | analysts, scatterplot | spatial, collaborative | text, document | multidimensional, high | tree, aggregation | dimension, treemap

InfoVis/VAST paper data set

Topics are more consistent in NMF than in LDA. Topic quality is comparable between NMF and LDA.

Page 14: UTOPIAN: User-Driven Topic Modeling Based on Interactive NMF

[Choo et al., TVCG’13]

Topic merging

Topic splitting

Doc-induced topic creation

Keyword-induced topic creation

Page 15: Visualization Example: Car Reviews

Topic summaries are NOT perfect. UTOPIAN allows user interactions for improving them.

Page 16: Weakly Supervised NMF: Supporting User Interactions

Weakly supervised NMF [Choo et al., DMKD, accepted with rev.]

$$\min_{W \ge 0,\ H \ge 0} \| A - WH \|_F^2 + \alpha \| (W - W_r) M_W \|_F^2 + \beta \| M_H (H - D_H H_r) \|_F^2$$

$W_r$, $H_r$: reference matrices for W and H (user-input)

$M_W$, $M_H$: diagonal matrices for weighting/masking columns and rows of W and H

Algorithm: block-coordinate descent framework
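The sketch below only evaluates the weakly supervised objective for given factors; it is not the block-coordinate descent solver mentioned on the slide. It assumes the diagonal matrices are passed as lists of their diagonal entries (`mw`, `mh`, `dh`), and that D_H, which the slide does not define, is a diagonal scaling applied to the rows of H_r. The function names are hypothetical.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def sq_frobenius(X):
    """Return the squared Frobenius norm of X."""
    return sum(v * v for row in X for v in row)

def ws_nmf_objective(A, W, H, Wr, Hr, mw, mh, dh, alpha, beta):
    """Evaluate ||A-WH||_F^2 + alpha*||(W-Wr)M_W||_F^2 + beta*||M_H(H-D_H Hr)||_F^2.

    mw, mh, dh hold the diagonals of M_W, M_H, D_H (assumed representation).
    """
    WH = matmul(W, H)
    fit = sq_frobenius([[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, WH)])
    # Right-multiplying by diagonal M_W scales column j of (W - Wr) by mw[j].
    w_pen = sq_frobenius([[(w - r) * mw[j] for j, (w, r) in enumerate(zip(rw, rr))]
                          for rw, rr in zip(W, Wr)])
    # Left-multiplying by diagonal M_H scales row i of (H - D_H Hr) by mh[i].
    h_pen = sq_frobenius([[mh[i] * (h - dh[i] * r) for h, r in zip(rh, rr)]
                          for i, (rh, rr) in enumerate(zip(H, Hr))])
    return fit + alpha * w_pen + beta * h_pen

# Made-up example: the user pins only topic 2 (second column of W) to a reference.
W = [[1.0, 0.0], [0.0, 1.0]]
H = [[1.0, 2.0], [3.0, 4.0]]
A = matmul(W, H)
Wr = [[1.0, 0.0], [0.0, 0.0]]
value = ws_nmf_objective(A, W, H, Wr, H, mw=[0.0, 1.0], mh=[1.0, 1.0],
                         dh=[1.0, 1.0], alpha=2.0, beta=1.0)
print(value)
```

Setting an entry of `mw` or `mh` to zero masks that topic out of the penalty, which is how a user constraint can apply to one topic while leaving the others free.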

Page 17: Interaction Demo Video

After topic splitting (triangle) and topic merging (circle)

Before interaction

InfoVis-VAST Paper Data

http://tinyurl.com/UTOPIAN2013

Page 18: VisIRR: Information Retrieval and Personalized Recommender System

Page 19: Features: Efficient Large-scale Data Processing

Document corpus: ~400,000 academic papers in CS

Data management
• Structured data: author, year, venue, keywords, citation/reference count
• Unstructured data: bag-of-words vectors of title, abstract, keywords
• Graph data: content, citation, and co-authorship

Efficient data handling
• Dynamic loading from disk to memory via cache-like strategy
• Scalable data expansion in O(n)

Page 20: Features: Personalized Recommendation

Works based on user preference on documents
• Preference scale of 1 (highly dislike) to 5 (highly like)

Various recommendation schemes
• Based on content, citation network, and co-authorship

Algorithm
• Preference propagation on graph using heat kernel

$$r_\alpha = \alpha \sum_k (1 - \alpha)^k f W^k$$

$r_\alpha$ is a recommendation score vector with a control parameter $\alpha$, $f$ is a user-assigned rating, and $W$ is an input graph.
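A rough sketch of the propagation formula, under assumptions the slide leaves open: the sum is taken from k = 0 and truncated after `num_terms` terms, f is a row vector of ratings, and W is a row-normalized adjacency matrix. The function name `propagate` and the toy graph are illustrative only.

```python
def propagate(f, W, alpha, num_terms=50):
    """Return alpha * sum_{k=0}^{num_terms-1} (1 - alpha)^k * f W^k as a list."""
    n = len(f)
    term = list(f)        # current f W^k, starting with k = 0
    decay = 1.0           # current (1 - alpha)^k
    scores = [0.0] * n
    for _ in range(num_terms):
        for j in range(n):
            scores[j] += alpha * decay * term[j]
        # Advance to the next power: (f W^k) W
        term = [sum(term[i] * W[i][j] for i in range(n)) for j in range(n)]
        decay *= 1.0 - alpha
    return scores

# Made-up path graph 0 - 1 - 2 with row-normalized weights.
W = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
f = [1.0, 0.0, 0.0]       # the user rated only document 0
print(propagate(f, W, alpha=0.5))
```

A larger alpha keeps scores concentrated on the rated documents, while a smaller alpha lets the preference spread further along the graph.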

Page 21: VisIRR Demo: Citation-based Recommendation

Preference-assigned item as ‘highly like’: ‘Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations’

Most of the recommended items are highly cited. Computational zoom-in shows sub-areas relevant to the article.

http://tinyurl.com/VisIRR

Page 22: VisIRR Demo: Co-authorship-based Recommendation

http://tinyurl.com/VisIRR

Preference-assigned item as ‘highly like’: ‘Automatic Classification System for the Diagnosis of Alzheimer Disease Using Component-Based SVM Aggregations’

It shows other areas of the authors of this paper.

Computational zoom-in on recommended items

Retrieved + recommended items

Page 23: Kiva

[Figure: Kiva loan flow among Lender, Field Partner, and Borrower: 1. Request loan; 2. Disburse loan; 3. Post loan; 4. Fund loan; 5. Send money; 6. Pay money]

Interested in learning Micro-Financing Analysis in Kiva.org?

Check out my presentation at Room 104, Wed 4pm

Page 24: Thank you!

Jaegul Choo [email protected] (Currently on the Academic Job Market)

Selected Papers
• Choo et al., Document Topic Modeling and Discovery in Visual Analytics via Nonnegative Matrix Factorization, TVCG, 2013
• Choo et al., VisIRR: Interactive Visual Information Retrieval and Recommendation for Large-scale Document Data, Tech Report, Georgia Tech, 2013

Topic merging

Topic splitting

Doc-induced topic creation

Keyword-induced topic creation

UTOPIAN: User-driven Topic Modeling based on Interactive NMF

VisIRR: Information Retrieval and Personalized Recommender System

Micro-Financing Analysis in Kiva.org: Room 104, Wed 4pm

Page 25: UTOPIAN: Interactions and Key Techniques

• Refining topic keywords
• Merging topics
• Splitting a topic
• Creating new topics from seed documents/keywords

Visualization
• Supervised t-SNE

Topic modeling
• NMF

Interaction
• Weakly-supervised NMF

Per-iteration Visualization Framework

Page 26: Supervised t-SNE: Visualizing documents

Original t-SNE
• Documents do not have clear topic clusters.

Supervised t-SNE
• $d(x_i, x_j) \leftarrow \alpha \cdot d(x_i, x_j)$ if $x_i$ and $x_j$ belong to the same topic (e.g., $\alpha = 0.3$).
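The distance adjustment can be sketched as a preprocessing step on a pairwise distance matrix. This sketch assumes Euclidean distances and one topic label per document; the slide specifies neither. `pairwise_distances` and `supervise_distances` are hypothetical helper names, and the t-SNE embedding itself is not implemented here.

```python
import math

def pairwise_distances(points):
    """Return the matrix of Euclidean distances between all pairs of points."""
    return [[math.dist(p, q) for q in points] for p in points]

def supervise_distances(dist, topics, alpha=0.3):
    """Shrink d(x_i, x_j) by alpha whenever x_i and x_j share a topic label."""
    n = len(dist)
    return [[dist[i][j] * alpha if i != j and topics[i] == topics[j] else dist[i][j]
             for j in range(n)]
            for i in range(n)]

# Made-up 2-D points; the first two documents share topic 0.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
topics = [0, 0, 1]
adjusted = supervise_distances(pairwise_distances(points), topics)
for row in adjusted:
    print(row)
```

The adjusted matrix could then be handed to any t-SNE implementation that accepts precomputed distances, so that same-topic documents are pulled closer together in the layout.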

Page 27: PIVE (Per-Iteration Visualization Environment)

Integration methodology of Iterative Methods for Real-Time Interactive Visualization [Choo et al., VAST’14, to submit]

[Figure: Standard approach: Input data; Computational method (per-iteration routine, ...); Visualization; Interaction]

[Figure: PIVE approach: Input data; Per-iteration routine, ... (Thread 1); Visualization (Thread 2); Interaction]
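The two-thread idea can be illustrated with a small sketch: one thread runs an iterative method and publishes every intermediate result, and a second thread consumes those results as they arrive. This is only an illustration of the pattern, not the PIVE implementation. The per-iteration routine is a stand-in (Newton's method for the square root of 2), user interaction is not modeled, and all names are hypothetical.

```python
import queue
import threading

def iterative_method(x0, num_iters, out):
    """Thread 1: run the per-iteration routine and publish every iterate."""
    x = x0
    for it in range(num_iters):
        x = 0.5 * (x + 2.0 / x)   # stand-in routine: Newton step for sqrt(2)
        out.put((it, x))
    out.put(None)                 # sentinel: the computation has finished

def visualize(inbox, frames):
    """Thread 2: consume intermediate results as soon as they are available."""
    while True:
        item = inbox.get()
        if item is None:
            break
        frames.append(item)       # a real system would redraw the view here

def run_pive_sketch(num_iters=6):
    """Run both threads and return the sequence of visualized iterates."""
    results = queue.Queue()
    frames = []
    worker = threading.Thread(target=iterative_method, args=(1.0, num_iters, results))
    viewer = threading.Thread(target=visualize, args=(results, frames))
    worker.start()
    viewer.start()
    worker.join()
    viewer.join()
    return frames

for it, x in run_pive_sketch():
    print(f"iteration {it}: x = {x}")
```

In the standard approach the view would be updated only once, after the final iterate; here the viewer sees every iterate without waiting for the method to converge.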