fodava-lead : visual analytics for large-scale high...

FODAVA-Lead : Visual Analytics for Large -scale High Dimensional Data:from Algorithms to Software Systems

Presented by Haesun Park and Alex Gray School of Computational Science and Engineering

Georgia Institute of TechnologyAtlanta, GA, U.S.A.

FODAVA Annual Meeting, Dec. 2012(PIs: H. Park, A. Gray, J. Stasko, V. Koltchinskii, R. Monteiro)

Contributors• Jaegul Choo (Georgia Tech)• Changhyun Lee (Georgia Tech)• Hanseung Lee (Univ. of Maryland)• Zhicheng Liu (Stanford University)• Fuxin Li (Georgia Tech)• Yunlong He (Georgia Tech)• Jaeyeon Kihm (Cornell University)• Jingu Kim (Nokia)• Da Kuang (Georgia Tech)• Sen Yang (Arizona State University)• Ed Clarkson (Georgia Tech Research Institute)• Polo Chau (Georgia Tech)• Alexander Gray ( and many of his students, Georgia Tech)• Vladimir Koltchinskii (Georgia Tech)• Renato Monteiro (Georgia Tech)• John Stasko (Georgia Tech)• Jieping Ye (Arizona State University)

Challenges in Computational Methods for High Dimensional Large-scale Data

on Visual Analytics System • Data challenges

– Massive, High-dimensional, Nonlinear– Vast majority of data is unstructured– Noisy, errors and missing values are inevitable in real data set– Heterogeneous format/sources/reliability – Time varying, dynamic, …

• Visualization challenges– Screen Space and Visual Perception

• High dimensional data: Effective dimension reduction • Large data sets: Informative representation of data

– Speed: necessary for real-time, interactive use• Scalable algorithms• Adaptive algorithms

• Dimension Reduction • Dimension reduction with prior info/interpretability constraints• Manifold learning

• Informative Presentation of Large Scale Data• Sparse recovery by L1 penalty• Clustering, semi-supervised clustering• Multi-resolution data approximation

• Fast Algorithms • Large-scale optimization/matrix decompositions• Adaptive updating algorithms for dynamic and time-varying data,

and interactive vis.

• Information Fusion • Fusion of different types of data from various sources, vis. comparisons

• Integration with DAVA systems • Testbed , Jigsaw , iVisClassifier, iVisClustering, VisIRR ..

Key Foundational Components forVA System Development

FODAVA Research Test Bed for Visual Analytics of High Dimensional Data

• Library of key computational methods for visual analytics of high dimensional large scale data

• With visual representations and interactions • Easily accessible for DAVA researchers

and readily available for applications• Identifies effective methods for specific problems (evaluation)• Modular: A base for specialized VA systems

(e.g. iVisClassifier, iVisClustering, VisIRR)

FODAVAFundamentalResearch

ApplicationsApplications

Test Bed

FODAVA Research Testbed Software: Available at http://fodava.gatech.edu/fodava-

testbed-software • Supports various dimension reduction, clustering, and their visual representations

and comparisons through alignments for high-dimensional data• Application domains: document analysis, bioinformatics, seismic data analysis,

healthcare, communications, computer vision, …• Language used: backend library in Matlab, GUI in JAVA (no need for Matlab installed)• System support: Windows 32/64 bit, Linux 32/64 bit

Testbed Modules• Computational modules

• Vector encoding• Pre-processing• Clustering• Dimension reduction

• Interactive visualization modules• Parallel coordinates• Scatter plot• Cluster summary• Brushing and Linking• Space alignment• Raw data view

Dimension Reduction

• Visualizes high-dimensional data by parallel coordinates and/or scatter plot

• Methods• Linear methods

• PCA, FA, ProbPCA, LDA, OCM, NPE, LPP, LLTSA, NCA, MCML

• Nonlinear methods• MDS, Isomap, LLE, LTSA, Sammon, HessLLE, MVU,

LandMVU, KernPCA, GDA, DiffMaps, SPE, AutoEnc, LLC, ManiChart, CFA, GPLVM, SNE, T-SNE

• Provides initial parameters that can be changed interactively• Can recursively apply dimension reduction

on user-selected data• Fast algorithms implemented

Clustering and Classification

• Generates cluster/class labels of data,which are color-coded in visualization.

• Methods• Clustering

• Hierarchical clustering, K-means, spherical K-means, GMM, NMF, constrained K-means, DisCluster/ DisKmeans [J. Ye]

• Classification (on-going work)• K-nearest neighbors classifier, SVM, Logistic

regression, Naïve Bayes• Provides cluster summary• Provides GUI for semi-supervision, e.g., must/cannot link• Can hierarchically construct cluster structures

Computational Zoom -in

Normal Zoom-in

Comp. Zoom-in

Computational zoom-in by recursive dimension reduction on selected data

Controls the level of focus on locality

Interactive Parameter Change e.g. in Isomap k value in k-NN Graph

Fast Comp. Modules for Interactive Vis.

• Essential for real-time interaction

• Let computational precision be governed by visual precision/resolution

• Hierarchical refinement• Adaptive algorithms

p-Isomap computing time vs.k value in k-NN graph

−0.2 −0.15 −0.1 −0.05 0 0.05 0.1 0.15 0.2 0.25 0.3

−0.2

−0.1

0

0.1

0.2

0.3

−0.2 −0.15 −0.1 −0.05 0 0.05 0.1 0.15 0.2 0.25 0.3

−0.2

−0.1

0

0.1

0.2

0.3

10 20 30 40 50 6010

12

14

16

18

20

22

24

26

Initial K

Co

mp

utin

g t

ime

s in

se

co

nd

s

ISOMAP (k=initial K)p−ISOMAP (k:initial K−>initial K−2)p−ISOMAP (k:initial K−>initial K+2)

1000 2000 3000 4000 5000 6000 7000 8000 9000 100000

50

100

150

200

250

300

350

400

450

500

Data Size

Tim

e in

Sec

onds

Double precisionSingle precision

48x36 vs 80x60

PCA timing: double vs single precision computation time and results

Key Computational Methods

• NMF (Nonnegative Matrix Factorization) and its variations:for dimension reduction and clustering

• LDA/GSVD (Linear Discriminant Analysis) and its variations:for informative 2D representation of clustered and large scale data

• Orthogonal Procrustes and MDS (Multi-Dimensional Scaling): for space alignment and comparisons of visual representations

Nonnegative Matrix Factorization (NMF)(Paatero&Tappa 94, Lee&Seung NATURE 99, Pauca et al. SIAM DM 04, Hoyer 04, Lin 05, Berry 06, Kim and Park 08 SIAM Journal on Matrix Analysis and Applications, Kim and Park 11 SISC…)

A W

H~=

�min || A – WH ||FW>=0, H>=0

• Why Nonnegativity Constraints?•Better Approx. vs. Better Representation/Interpretation•Nonnegative Constraints often physically meaningful• Interpretation of analysis results possible

• One of the Fastest Algorithms for NMF & theoretical convergence analysis

• Matlab codes publicly available (J. Kim and H. Park, IDCM08, SISC11)

http://www.cc.gatech.edu/~hpark/nmfsoftware.php• NMF is better and faster than K-means in clustering

• K-means: W: k cluster centroids, hi: cluster membership indicator• NMF: W: basis vectors for rank-k approx., hi: k-dim rep. of ai• SymNMF (Kuang, Ding, Park, SDM12), Sparse NMF for clustering (Kim and Park, Bioinfo., 07)

• NMF more accurate and faster on document and image data • (Xu et al. 03; Pauca et al. 04; Li et al. 07; Kim & Park, 08; Ding et al. 10 ...)

– Clustering accuracy averaged over 20 runs:

– Problem sizes:

TDT2 Reuters COIL20 NIPS ORL PIE Overall

K-means 0.6734 0.4289 0.6184 0.4650 0.6499 0.7384 0.5957

NMF/ANLS 0.8534 0.3770 0.6312 0.4877 0.7020 0.7912 0.6404

SymNMF 0.8979 0.5305 0.7258 0.5129 0.7798 0.7517 0.6998

NMF for Clustering

K K-means SphKmeans NMF/ANLS GNMF Spectral SymNMF

TDT2

4 .7994 .7978 .9440 .9150 .9093 .9668

8 .6147 .6208 .8292 .8200 .7357 .8819

16 .5286 .5305 .6709 .6812 .5959 .7635

Reuters

4 .5755 .5738 .7737 .7798 .7171 .8077

8 .5170 .5049 .6747 .6758 .6452 .7343

16 .3712 .3687 .4608 .5338 .5001 6688

TDT2 Reuters COIL20 NIPS ORL PIE 20 Newsgroups

m 26618 12998 4096 17583 5796 4096 36982

n 8741 8095 1440 420 400 232 18669

k 20 20 20 9 40 68 20

On average, NMF is faster than k-means by a factor of at least 2

Linear Discriminant Analysis for2D/3D Representation of Clustered Data

(J. Choo, S. Bohn, HP, VAST09)

Max trace (GT Sb G) min trace(GT Sw G)&

max trace (GT (Sw + µI)G)-1 (GT Sb G)

• Regularization in LDA for Computational Zooming-in

Small regularization Large regularization

LDA/GSVDα 2Hb Hb

Tx = β 2Hw HwTx

−0.25 −0.2 −0.15 −0.1 −0.05 0 0.05 0.1 0.15 0.2

−0.2

−0.15

−0.1

−0.05

0

0.05

0.1

0.15

0.2

aa

a

aa

a

aaa

a

aa

a

aaa

aaa

aaaaa

aaa

aaaa

aaaaa

a

aaa

a

a

aaa

aa

a

aa

bbb

bb

b

bbb

b

bbb

bbb

bbb

bbbb

b

bbb

bbb

b

bbbbbb

bbb

b

b

bbb

bbb

bb

cc

c

ccc

c cc

c

cc

c

ccc

c cc

cc

ccc

c cc

ccc

c

ccc

ccc

ccc

c

c

ccc

c

c

c

cc

d d

d

dd d

dd

d

d

ddd

dd d

ddd

dd dd d

ddd

dddd

ddd d

dd

ddd

d

d

ddd

ddd

dd

e

ee

e e

e ee

e

e

eee

ee

e

e ee

ee

e ee

ee

e

ee

ee

ee

e

eee

eee

ee

ee

e

ee

e

eefff

fff

fff

f

fff

fff

fff

ff

fff

fff

fff

f

fff fff

ff

f

f

f

fff

f f

f

ff

g

g

g

gg

gggg

g

gg g

ggg

gg g

gg

g gg

ggg

g ggg

ggg

ggg

ggg

gg

ggg

gg

g

gg

hhh

hh

hh h h

h

hhh

hh h

hhh

hhh

hh

hhh

hhh

h

hh

hhh

h

hhh

h

h

hhh

hhh

hh

ii i

iii

i ii

i

ii i

i i i

iii

ii ii

iii

iiii

i ii i ii

iii

i

i

iii

iii

iii

jjj

jj j

jjj

j

j jj

jjj

jjj

jj jjj

jj j

jjjj

jjj

jjj

jjj

j

j

jjj

jjj

jj

kk

k

kk k kkk

k

kk

k

kk k

kk k

kkk

k

k

kk

k

k

kk

k

kk

k

kkk

kk

k

kk

k

kk

k

kk

k

kll

l

ll

l

ll

l

l

lll

ll

l

ll

l

ll

lll

l l

l

l l

l

l

ll

l

ll

l

ll

l

l

l

l l

l

ll

l

ll

mmm

mm m

mm m

m

mmm

mm m

mmm

mm m

mm

mm

m

mmm

m

mmm m

mm

mmm

m

m

mm

m

mmm

m m

nnn

nnn

nn

n

n

nnn

nnn

nnn

nn

nnn

nnn

nnnn

nnn

nnn

n nn

nn

nnn

nnn

n noo o

ooo

ooo

o

oo o

ooo

ooo

ooooo

ooo

ooo

o

ooo oo o

ooo

oo

oo

o

oo

o

oo

pp

p

pp

p

p pp

p

pp p

ppp

ppp

ppppp

ppp

pppp

ppp

ppp

p pp

p

p

pp p

p

pp

p

p

q

qq

qqq

qqq

q

q

qq

qqq

qqq

qq

q qq

qqq

qqqq

qqq q

qq

qqq

qq

qqqq

qq

qq

r

r

r

rr

r

rr r

r

rr

r

rr r

rr r

rrrr

r

rr r

rr

rr

rr r

rr r

rrr

rr

rr r

rr r

r

r

ss

s

ss

s sss

s

sss

ss s

sss

ss sss

sss

ssss

sss sss

sss

s

s

sss

ss

s

ss

t

tt

ttt tt

t

t

t

t tt

ttt

t tt

ttt

tt

ttt

t tt

t

t

ttt t t

t

t tt

t

t

t tt

ttt

u

uu

uuu

u

uu

u

uu

u

uuu

uuu

u

uuu

u

uuu

uuuu

uuu

uuu

uuu

uu

uuu

uuu

uu

vvv

vvv

vvv

v

vvv

vvv

vvv

vv

vvv

vvv

vv

v

v

vvv

vv v

vvv

vv

vvv

vvv

v

v

www

www

www

w

ww

w

ww

w

ww

w

wwww

w

www

ww

ww

ww

w www

www

w

w

www

www

ww

xx

x

x

xx

x

xx

x

xxx

xxx

xx

x

xx xx

x

xxx

xxx

x

xxx xx

x

xxx

x

x

xxx

xxx

x x

yy

y

yy y

y y y

y

yy y

y yy

yy y

yy

yyy

yyy

yy

y

y

yyy yyy

yyy

yy

yyy

yyy

yy

z z

z

zzz

zzz

z

zz

zz

zzz

zz

zzz

zzz

zzz

z

zzzzzz

z zz

zz

zzz

zzz

z z

z

z

1

1

1

11 1

11 1

1

11

1

111

11

1

11

111

111

111

1

111 1 1

1

111

1

1

1 11

1 11

11

a

bc

d

efgh

ij kl

m

n o

pqr

stuv

w

xy

z1

−0.06 −0.05 −0.04 −0.03 −0.02 −0.01 0 0.01 0.02 0.03 0.04

−0.04

−0.02

0

0.02

0.04

0.06

aa

a

aa

aaaa

aaa

a

aa

aaa

a

aaaa

a

aa

a

aa aa

aa a

aaaaa a

aa

aaa

aaa

aa

bbbb bbbbb

b

bbb

bbb

bbb bb

bbb

bbbbbbb

bbbbbb

bbbbbb bb

bbbbb

ccc

ccc

ccccc

cc

cc

cc

cc

cc ccc

c

c

c

c

cc

ccc c

c

cc

ccc

c ccc

ccc

ccc

ddd

dd d

ddddd

ddd

d d

dddd ddd

d

d

dd

dd d

ddd

d

dd

d

dd

ddddd

d

ddd

d

d

ee

ee e

e

ee

e

e ee

e

e e

eee

e

eeeee

ee

e

ee

e

e

ee

eee

e

e

e

eee

eee ee

e

ee

fff

ff

f

ff

f

f

ff

fff

ff

ff

ff

ff

ff

ff

f

f

f

f

ff

fff f

fff

ff

ff

ffff

f fg

gg gg

gggg

g

gg

g

ggggg

ggg

ggg ggggg

g

ggggggg

gg

g

gg ggggg

gg

g

hhh

h hh

h hh

hh

hh h

h

hhhh

hhhhhh

h

hh

hhh h

hh

h

h

h

hh

hh

hh

hhhhh

hh

i i i

i ii

ii iiiiii i i

iiiiiii

i i iiiii i

iii

ii

iii

iiii i

i iii i

i

jjj

jjj

jjj

jjjjj j

jj jj

jj

jj jjj j

jjj j

jjj

jj j

jjj

jj

j jjj j

j

jj

kk kkk

k

k kkkk

k k kkk

k kk

kkkk

kkk

k

k k

k

kk kk

kk

k

k k

k

k kkkk

k

kk

kk

l l

l

lll

ll llll

lll

lll

l

lll l

l lll

lll

l

lll

ll lll ll l ll l

ll l

ll

m mm

m mm

mmm

m mmmm mm

m mm

mmm

m mmm

m

mmm

mmm m

mmmm

mmm

mmm

mmm m

mm

nnn

nnn

n

nn

nnnn

n nn

nnn

nn

nnn

nnn

nnn

n

n nn

n nn

n nnn

n

n nnn nn

n n

oo

oooooo ooo

oo

oo o

oo oooooo

ooo ooo

ooo

ooooooo

oooooo o

o

oo

p

p p

p

p p

pp

p

pp

ppp

p pp

pppp

p

p ppp p

pp p

p p

p pp

p p

pp

ppp

p

pp

pp pp p

qq q

qqq

qqqqq

qq

qqq

qqq

qq qq

qqqq

qqq

qqqqqq

q

qqq

qq

qq

q

qq

qqqr

r

r

rr rrrrr

rrr rr rrr

r rr r

rr

r

rrr

r rr r

rr

r rr

rrr

rrr rrr

rrr

r

sss

ss

s

sss s

ss

sss s

sss s

s

ss s

sss s

ssssss

sss s

sss s

sss

sss ss t

t

t

tt t

t

t tt

t

tt t

t tt

ttt

tt

ttt

ttt

t

tt

t

t

t

tt

t

tt

t

t t

t t

tt

t

t tt

u u

u

uu

u

u

uu

uuu

u

uu uu

u uuuu u

uuuu

uu uu

uuuuuuu

uuuuu uuu

u uu

u

vvv

vvv

vvvv

vvv vvv

vv

vv

v

vvvv

vvvv v

vvvv

v

vv

vvvv

vvvv

vvv

vv

w

ww

wwwww

ww

www

www

ww

www

www

www

ww

w wwww w w

ww ww

wwww

w

ww www

xxx

x

xxx

xx

xxxx

x

xx

x

xx

xxx x

xxxx

xx

x

xxxxx x

x

xx

x

xxxx x

xxx

xx yy

y

yyyyy

y

yyy

y

yyy

y y

yyy yy

yyyy yyy

y yyy yyyyy

y

y

y

yy

y

y

y

yyyzz

z

zzzzzz

z zzz

zzzz

zzzzz

zzzz

zzz

zz

z zz z

zzz

zz

zz

zzzz

zz z

z

11 111

111 1

1 11 1

111

11

1

111

1 11

1

111

1

1

11

1

11 1

111

11 11 111 1

1

1

a

bc

d

e

f

g

h

i

j

k

l

m

n

o

p

q

r

st

u

v

w

xy

z1

2D Visualization of Clustered Text, Image, Audio Data

20news Data (Text)

LDA+PCA

PCA

Rank-2 LDA

PCA

−0.04 −0.02 0 0.02 0.04 0.06−0.08

−0.06

−0.04

−0.02

0

0.02

0.04

0.06

aa

a

a

a

a

a

a

aa

a

a

aa

a

a

aa

aa

a

aa

a

a

aaa

aa

a aaa

a

a

a

a

aa

a

a

aa

a

a

a

a

a

a

a

a aa

aa

aaaa

b

b

b

bb

b

b

b

b

b

b

b

b b

b

b

b

b

b

bb

b

b

b

b

b

b

b

bb

bbb

bb

b

b

b

b

bb

b

b

b

b

b

bbb

b

b

bb

b

b

bb

b

b

b

c

c

c

c

c

c

cc

c

cc

c

cc

c

c

ccc

cc

cc cccc c

cc

c

c

c

ccc

c

cc

c

c

c

c

cc

c

c

cc

c

cc

cccccc

c

c

dd

dd

dd

dd dd

d

d

d

d

ddd d

dd

d

d d

d

d d

d

d

dd

d

d

dd

d

d

d

dd dd

dd

dd

d

d

d

d

d

ddd

d

dd

d

d

d

d

ee

e

e

e

ee

ee

e

e

e

e

e

e

e

ee

e

ee

e

ee

ee e

e

e

e

ee

ee

e

ee

e

e

eeee

ee

e

ee

ee

ee

ee

ee

e

ee

e

f

f

f

f

f

f

f ff

f

fff

fff

f f

f

ff f

f

f

f

f

f

f

f

ff

ff

f

ff

ff

f

fff

f

f

f

f

f

f

f

f

f

f

f

f

f

f

f

f

f

f

g

g

gg

g

g

g

g

g

g

gg

g

g

g

g

g

g

g

g

g

g

g

g

g

g

g

gg

g

g

gg

g

g

g

g

g

g

gg

g

gg

g

gg

g

g

g

g

g

gg

g

g

g

gg

g

h

h

h

hh h

h

h

h

h hh

h

h

hh

h hh

h

h hh

h

h

h

h

h

h

h

h

h

h

h

hh

h

h

h

h

h

h

h

hh

hh

h

h

h

h

h

h

h

hh

h

h

h

h

i

i

i

i

i

i

i

i

i

i

ii

i

i

ii

ii

i

i

ii

i

i

i

ii i

i

i

i

i

i ii

i

ii

i

ii i

i

i

ii

i

i

ii

i

i i

i

i

ii i

i

i

j

j

j

jj

j

j

j

j

jj

j

j

jj

jjjj

j

j

j j

jj

j

j

j

j

j

j

jj

j

jjj

j

j

jj

jjj

jj

jj j

j

j

jj

jjj

j

j

jj

k

k

k

k

k

kk

k

k

k

k

k

kk

k k

k

kkk

k k

k

k

k

k

k

k

kk

k

k

k

kk

kk

k

kk

k

k k

k

k

k

k

k

k

k

k

k

k

k kk

k

kk

k

l

l

l

ll

ll

l

l

l

l

l

ll

l

l

l

l

l

l

l

ll l

l

l

ll ll

l

l

l

l

l

l

ll

l

l

l

ll

l l

ll

l

l

l

l

l

ll

ll

l

lll

m

m

m

m

m

m

m

m

m

m

mm

m

m

mm

mm

m m

m

m

m

m

mm

mm

m

m

m

m

m m

m

mm

mm

m

m

m

m

m

m

m

m

m

m

m

mm

m

m

m

m

m

m

mnn

nn

n

n

n

n

nnnn

n

n

n

nn

n

n

n

nnn

n

n

nnn

n

n

n

n

nn

n

n

n n

n

n

nn

n

n

n

n

nnn

n n

n

n

nnn

n

nn

n

o oo

o

o

o

oo

o

o

oo

o

oo

o

oo

o

oo

oo

o

o

o oo

oo

o

oo

oo

oo

oo

o

o

ooo

o

o

o

o

oo

oo

o

o

o

o

ooo

o

pp

p

ppp

p

p

p

p

p

p

p

p

p ppp

p

p

pp

pp

pp

pp

p pp p

pp

p

pp

p

p

p

p

p

p

p

pp

p

p

p

pp

p

p

p

pp

pp

p

p qq

q

q

q

q

q

q

q

q

q

q

q

qq

qq

q

qq

q

q q

q

q

q

qq

q

q

q

q

qqq

q

q q

q

q

q q

qq

q

q

q q

q

q

q

qq

q

q

qq

q

q

q

r

r

r

rrr

r

r

rr

r

r

r

rr

rr

r

rrrr

r

rr

r

r

r

r

rr

r

r

rr

r

rr rrr r

r

r

r

r

r

r

r

r

r

rr

r

rr r

rr

rs

s

ss

s sss

s

s

ss

s

s

s

ss

s

s

s ss

s

ss

ss

s

ss

ss

s

s

s

ss

ss

ss

s

ss

s

ss

s

s

s

s

s s

s

ss

s

s s

s

t

t

tt

t

t

t

tt

t

t

t

ttt

t

t

t

t

t

t

t

t

t

t tt

t

tt

t

t

t

t

t

t

t

t

t

tttt t

t

t

t

tt t

t

t

t t

tt

t

t

t

t

u

u

uu

u

u

u

u

uuu

u

u

u u

uu

u

u

uu u

u

u

u u

u

uu

u

u

uu

u

u

u

u

u

u

u

u

u

u

u

u

u

u

u

u

u

u

u u

u

uu

u

uu

u

v

v

vv

v

v

v

vv

v

v

v

v

v

v

v

v

v

v

vvv

v

v

v

v

v v

v

v v

v

v

v

vv

v

v

vv

v

v

v

v v

v

v v

v v

vv

vvv

v

v

vv

vw

w

w

w

ww

w

w

w

w

w

w

w

w

w

w

w

ww

w

w

w

w

w

w

w

w

ww

w w

w

w

w

w w

w

ww

w

w

w

ww

w

w

w

w

w

w

w

w

w

w

w

ww

ww

w

x

x

x

xx x

xxx

x

xxx

x

x

xx

xx

x

xx

xx

xx

x

x

x

xx

xx

x

xx

xx

x

x

x

x

x

x

x

x

xx

x x

x

xx

x

x

x

x

x

x

x yy

yy

y

y y

y

y

y

yy

y

yy

y

y

y

y

y

y

yy

y

y yy

y

y

yy

y

yy

y

y

y

y

y

y

y

yy

yyy

y

y

y

y

y y y

yy

y y

y

yy

z

z

z

z

zz

z

zzz

z

zz

z

z

z

z

z

z

z z

z

zz

z

z

z

z

zz

z

z

zz

zz

z

z z

z

z

z

zzz

z

z z

zz

z

z

z

z

z

z

zz

z

ab

c

d

e

f

g

h

ij

k

l m

n

op q

rs

t

uv

w

x y

z

−0.05 −0.04 −0.03 −0.02 −0.01 0 0.01 0.02 0.03 0.04 0.05

−0.04

−0.02

0

0.02

0.04

0.06

aa

aa a a

aa

a

aa

a

aaa

a

a

a

a a

aa

aaa

a

aaa aaa

a

a aa

aa

aa aa

aa

aa

aa

aa

aaaa

aa

aa

aa

bb

bbbbb

bbbb

b

bb

bb

b

b b

b

b

b b

bb

b

b

bb

bbbb

bb

bb

b

bbbbbb

bbb

b

bb

b

bb

b

b

bb bb

b

c

c

c

ccc

c

c

cc

ccc

c

cc

c

cc

c

c

cc

ccc

c

cc

ccc

c

c

cc c

c

ccc

cc

cc

c c

ccc c

cc

c

ccc

c

c c

dd

dd

ddd

dd ddd

dd

d ddd

d

d

dd

dd

d

dd

d

dd

dd

ddddd

d d dd

dd

dd

ddd

d

d

dd

d ddd

d

d

dd ee

eeee

e

ee

e

e

e

ee

e eee

eeee e

ee

e eee

eee

eeee eeee

ee

ee

ee e

e

e

ee e

ee

e

e

eee

e

ff

f

ff

ff f

ff

fff

f ff f

ff

f

f ff f

fffff

ff

f

f

f

f

ff

ff

f

ff

f f

f

f

ff

f

f

ff

f ff

f

ff

f

f

g

ggg

gggg

g

gg

g ggg

ggg

g

ggg

g

g

gg g

gg

g

gg gg gg

g

gg

g

g

g

ggg

g g g

g

g

gggg

gg

gg

gg

hh

h

h

h

h

h

h

hh

hh

hh

hh

hhhh

hhhh

h

hhh

h

hhh

h h hhh

h

h h

h

hhhh

hh

h

hhh h

hh

h

h h

h

h h

i

iii

ii

i

i

ii

i

ii iii

i

i

i ii i

iiii

ii

iiii i iiii

i

i iii

i ii i iiii

iii

i

ii i

ii

i

j j

jj

jj

jj j j

j

j j

j

jj

j

j

jj

jj jj j

jj

j

jjj

jj j

jjj

jj

jj jj

jj

j

jj

j

jjj

jj jj

jj

j

j

kkk

kk

kkk

kk

kkk kk

kkk k

kk

k

kk

k kkkkkkk kk kkkk

kkk

k

k kk k

kk k

k

k

kkk

kk k

kkk

lll

ll

l

ll

l

lll

ll

lllll

lll ll

ll

l

llll

lll

ll

lll lll

l

lll

ll

l

ll

l ll

lll l

l

l

mmm

m

mmmm

m

mm mm

mmm

m m

m

m

mm mm

mm

mm

m

mm

mmm

mm

mmm m

m mmm

m

mmmm

mmmmm

mm

m

m

m

nnn nn

n nn n

n

nn

nn

nnnn

nnn

nn n

nn

nn

n

nn

n

n

n

n

n

n nnnnn

nn

nn

nnn

n

n

nn nn

n

nnn

n

o

o o

oo

oo

o

oo

ooo o

o

ooo

ooo

oo

ooo

ooo

ooo

oo

oo

o

oo

oo

ooo

o o

oo

o

oo

oo o

o

oo

oo o

p

p

pp p

ppppp pp

p

pp

ppp

ppppp

p

p

p

p

pp

pp

p p ppp pppp

pp

pp

p

pp ppp

p

pppppp

pp

p

qqq

qq q

qq q

q q

qq q

qq

q

qq

q

qqq

qq qq

qq

q

qqq q

qqq qq

q

q

q

qqqq

qqq

q

qq q q

qqq

qqq

r

rrr

rr

rr

rr

rrrrr

r

r

rr rr rr r rr r

rrr

rr

rr rr

r rr

rr

rrrrr

rr r

r

rr

rr

rr

rr

r

r

s

ssss

sss

s

s

s

ss

ss

s

s

s

s

s

sss

sss

s

sss

s

s

ss

sss

s

ssss

ss

sss

ssssss

s

s s

s

s

ss

tt

tt

tt

t tt

t

tt

ttt

ttt t

tt

t tt

tt

tt

tt

t

t t

tt

t

tt tt

ttttt t

ttt tt

t

t

t

ttt

tt t

u uu

uu

uu

uuuu

u

uuu u

u

uu

uuuu

uu u uu

u

uu uuuu

u

uuuu

u

u

uu

u u uuu

u u uuuuuu

u

uu

v

vv

vv vv v

v

v

v

vv v

vv

v

v

vv

v

v

v vv vvv

vvv v

v

vv

vv

vv

v vv

vv

vvv v

vv

v

vv

v

v vv

v

v v

ww

www www

w w wwwww

ww

w ww

ww

ww

w

w

wwww

w

ww

w w

w

ww

w

ww

w

w

w

w

w

w

www www

w w

w

w

ww

w

xxx x

x

xxx

xxx

xxx

x

x

x

x

xxx

x

xxxxx

xx

x

xxx

x

x

xx

x

x

xx

xxxx

x

x

x

xxx

xx

xx

x

x

x

xx

yyy

y

yy

yy

yy

yy

y

y

yy

y

yy

y yy

yyyy

y

y

yy

yy

y

yy

yy

y

yy yy yyyy

y

yy

yyy

y y

y

yy

y

yy

zzz

zz zz

z zz

z

zz

zzz

z

zzz

z

zzz

zzz

zz

zz

zzz

z

z

z

zzz

zz

z

zz zzz

zz

zz

zzz z

z

z

z

a

b

c

d e

f

g

h

i

j k

lmn

o

p

q

r

s

t

u

v

w

x

y

z

Rank-2 LDA

PCA

Facial Data (Image) Spoken Letters (Audio)

Two-stage Dimension Reduction for 2D Vis. of Clustered Data

• LDA + LDA = Rank2 LDA • LDA + PCA• OCM + PCA • OCM + Rank-2 PCA on SF

b = Rank-2 PCA on Sb

Information Fusion and Visual Comparisons based on Space Alignment (J. Choo, S. Bohn, G. Nakamura, A. White, HP)

• Want: Unified visual representations of different results• Assume: Reference correspondence information between data

pairs or cluster correspondence• Two conflicting criteria: maximize alignment and minimize

deformation

Data sets Similarity graph

Fused data

• Procrustes analysismin || (A-µA1T)-kQ(B-µB1T)||FQTQ=I

• Graph embedding approach (MDS)

Fusion and Alignment in Testbed

AlignedReference Un-Aligned

Data set 1

−0.08 −0.06 −0.04 −0.02 0 0.02 0.04 0.06 0.08

−0.06

−0.04

−0.02

0

0.02

0.04

0.06

0.08

0.1

e

e e

ee

e

ee

eee

e ee e

e

e

eee

e

ee

ee

e

ee

e

ee

e

ee

ee

e

ee e

e

e e ee

ee

e

e

e

e

e

e

eee

eee

e ee

e

ee

eee

ee

bbb

bb

b

bb

b

b

b

bb

b

b

bb

b

b

bb

b

b

b

b

b

b

b

b

bb

b

b

b

bb

bb

b

b

b

b b

b

bb

bb

b

bbbb

b

b

bb

b

bb

b

b

b

b

bb

bb

b

b ff

f

ff

ff

ff

f

f

ff

f

f

f

ff

f

f

f

f

f

f

f ff

f

ff

f

ff

f

f

ff

ffff f

f

f ff

f

f

ff

ff

f

f

f f

ff

f

f

f fff

f

f

f

f

f

faa a

a

a

a

a aa

aa

a

aaa

a

aa a

a

a a

aa

a

aa

aa

a

a

a

a

aa

a

aa

a

a

a aa

aa

aaa aa

a

a

a

aa

a

a

a

a

a

a

a

a

aa

aa

aa

a

d

d

d

dd

d

d

dd

dd

d

d

d

ddd

d

d

dd

d

d

ddd

dd

d

d

d

dd

dd

d

d

d

d

dd

dd

d

dd

d

d d

dd

dd

dd

d

d

d

d

dd dd

d

d

dd

d

d

d

rr

r

r

r

r

r

rrr

r

r

r

r

r

r

r

r

r

r

r

rr

r

r

rr

r

r

rrr

r

r

r

rr

r

r

r

r r

r

r

r

r

r

rrr

rr

rrr

r

r

rr

rr

r

r

r

r

rr

rr

r

gg

g

g

gg

g

g

gg

g

gg g

g

g

gg

g

g

g g

g

g g

g gg

g

g

gg

g

g

g

g

g

gg

gg

g

gg

ggg

g

gg g

g

gg

g

g ggg

ggg

g

g

g

gg g

g

g

b

ae

f

g

d

r

−0.06 −0.04 −0.02 0 0.02 0.04 0.06

−0.08

−0.06

−0.04

−0.02

0

0.02

0.04

0.06

ee

e

e eee

e

e

e

e

ee

e

ee

e

e

ee

ee

ee

e

eee

eee

e

e

e

ee

e

e

e

e

e

ee ee

e

ee e

e

e

e

e

e

e

ee

e

ee

eee

e

ee

e

e ee

b b

bb

b

bb

b

b

b

b

bb

b

bbb

b

bb

b

b

b

b

bb

b b

b

b

b

bb

b

bb

b

b

bb

b

b

bb

bb

b

b

b

bb

bb

b

bb

b

b

b

bb

bb b

b

bbbb

b

f

f f

f

f ff

f

f f

fff

f ff

f ff f

f f

fff

f

f

ff

ff

f

ff

ff

ff

f f

ff

f

f f

f

ff

fff

f

f

ffff

ff

ff

f

f

f

f

f

f

fffp

p

p

p

p

pp

p

p

p

p p

p

p

pp

p

p

p

p

p

p p

p

p p

p

p

pp

p

p p

p

p

p

p

p

p

p

p

p p

p

p

pp

pp

pp p

p

p

p

p

pp

p

p

p

p

p

p

p

p p

pp

p

yy y

yy

y

y

y

y

y

y

y

y

y

y

yy

y yy

y

y

yy

y

y

y

y

y

y

y

y

yy

y y

y

y

y

y

y

y

y

y y

y

y

yy

y

y

y

y y

y

yy

y

y

y

yy

yy

y

y

yy

y

y

cc

c

cc

cc

cc

cc cc

cc

c

c

c

c

c

c

c

c

cc

cc

cc

cc

c

c

c

c

c

c

cc

cc

c

c

c

cc c

ccc

c

c

c c

cc

c

cc

c

c

c

c

c c

ccc

cc m

m

m m

m

m

mm

m

m

m

m

m

m

mm

m

mm

m

m

m

m m

m

mm

mmm

m

m

mm

mm

m

m

m

m

m m m

mm

mmmm

m

m

m mmm

m

m

m

m

m

m

m

m

m

mm

m

mmm e

e

e

e eee

e

e e

e

ee

eee

e

e

eee

ee

ee eeee

e

e

ee

e

e

e

e

e

e e

e

ee eee

ee e

e

e

e

e

e

e

eeee

e

e ee

e

ee

e

eee

bb

bbb

bb

bb

b

b

bb

b

b

bb b

b bb

bb

b

b

b

b b

b

b

b

b

b

b

b

b

bb

b bb

b

b b

bb

b

b

bbb

bb

b

b

bbb

b

bb

b

b b

b

bb bb

b

ff f

f

f ff

f

f f

f

f

f

ff

f

ff

f

f

ff

ff

ff

fff f

f

ff f

fff f

ff

f

f

f

ff

f

ff

ffff

f

fff

f

ff f

f

f

f

f

f

f

f

f

ff

a

a

a

aa

aaa

a

aa

a

a

a

aa

aa

a

a

a

a

a

a

aa

a

a

a

a

a

a

a

aaa

a

a

a

a

a

a

a

a

aa

a

aa

aa

a aa

a

aa

a

a

a

a

a

a

a

aa

aa

a

a

d

d

d

ddd

d d

dd

d

d

d

d

d

d

dd

d

d

dd

d

dd

d

dd

d

d

d

d

dd

dd

d

d

ddd

dd

d

d

ddd

d

dd

d

d

d

d

d

d

d

dd

d

d

dd

d

dd

d

dd

r

r

r

r

r

rrr

rr

r

rr

r

r

rr

r

r r

r

r

r

rr

r

r

rr

rr

r r

r

r

r

r r

rrr

rr

rr

r

r

r

rr

r

rr

r

r

r

r

rr

r r

r

r

rr

rr

rr r

g

g

g

g gg

g

gg

g

g

g

g

gg

g

g

g

g

g

gg

g

g

g

g

gg

gg

g

g

g

g

g

g

gg

gg

g g

g

g

g

gg

g

g

g

gg

gg

g

gg

g

g

g

g

g

g

g

ggg

g

g

g

p

b

c

ay

em

f

g

d

r

−0.08 −0.06 −0.04 −0.02 0 0.02 0.04 0.06 0.08−0.1

−0.08

−0.06

−0.04

−0.02

0

0.02

0.04

0.06

0.08

0.1

e

e

e

ee

ee

e

e

e

e

eee e

e

e

e

e ee

e

e e

e

e e

e

e ee

ee

e

e

e ee

e

e

ee

e e

eee

ee e

eee

e

e

e

ee

eee e

e

ee

e

e

e

e

e

b

b

bb b

b

bb

bb

b

bb

b

b

b

bbb

b

b bb

b

b

bb

b

b

bb

b

b

b

b

b

b

b

b

b

b

b

b

bb

bb

bb

b

b

bb

bb

b

bb

b

b

b

b

b

b

bb

bb

b

b

f

f fff f

f

f

f

fff

ff f

ffff

ff

ff

f ff

ff

fff

f

ff

f

f

f

f

f

f

ff

ff f

f

f

ff f

ff

f

ff

f

ff

ff

f

f f

ff

f ff ffp

p

p

p

p ppp

p

pp

ppp

pp

pp

p

pp

p p

pp

p

p

p

p

p

p pp ppp

p

pp

pp

p

pp

pp p

p

pp

pp

pp

p

pppppp

pp

p

p

pp

p

pp

yy

y

yy y

y

y

y

yy y

y

y

y

y

yy

yy

y

y

yy y

y

y

y

y

y

y

y

yy

y y

y

y y y

y

y

y

y

yy

yy

y

y

y

y

y

y

y

y

y

y

yy

y y

y

y

y

y

y yy

y

cc

c

c

cc

ccc

cc

c

c

ccc

c

c

c

c

c

c cc

c

c

c

cc

c

c

cc

c

c

c

ccc

c

c

cc

c

ccc c

cc

c

c

c

cc

c

cc

c c

c

cc

c

c

ccc cc

mm

mm m

mm

m

m

m

m

mm

m

m

m

m

m

m

m

m

m

m mm

mm

mmm

m

m m m

mm

mm mm

m

mmm

mmm mm m

m

m

m mm

m

m

m

m

mmm

m

m

m

mm

mmm

p

b

c

y

em f

Data set 2

Fused• Data set 1 only: comp.sys.ibm.pc.hardware (‘p’) , sci_crypt (‘y’), soc.religion.christian (‘c’), talk.politics.misc (‘m’)• Data set 2 only: comp.sys.mac.hardware (‘a’), sci.med (‘d’), talk.religion.misc (‘r’), talk.politics.guns (‘g’)• Shared: rec.sport.baseball (‘b’), sci.electronics (‘e’), misc.forsale (‘f’)

Cluster Alignment: Label Matching and Space Alignment

Reference Un-Aligned

• InfoVis and VAST paper data set• Help refine cluster results and obtain consensus clustering

Graph vis

Social net.

Graph vis

High-dim vis

High-dim vis + applications

Aligned

AlignedReference Un-Aligned

Testbed Overview

Interactive visual analytics system for classification of high-dim. data (image, text, etc) and search space reduction

iVisClassifier(J. Choo, H. Lee, J. Kihm, HP, VAST10)

• Interactive visual document clustering system using topic modeling

• Refines clusters and supports hierarchical cluster structure in an interactive way

iVisClustering(H. Lee, J. Kihm, J. Choo,, J. Stasko, HP, EuroVis12)

Key Interactions with LDA: Topic Refinement and Noisy Data Filtering

After filtering noisy data

After topic refinement

Jigsaw• Combining computational text analysis (text mining) with interactive visualization• Placed system on web in Fall ‘12 where anyone can download it http://www.cc.gatech.edu/gvu/ii/jigsaw

• Created video tutorials• Many sample data sets provided

• Working on opening up architecture

• Interviewed six people who had been using Jigsaw for 2-14 months

• fraud, law enforcement, intelligence analysis, research

• Understand how they are using Jigsaw, different domains• Learn about strengths of system and its limitations

Case Study of System Usage(Y. Kang & J. Stasko, VAST12)

VisIRR: Visual Information Retrieval and Recommendation System for Document Discovery

Improves personalization and understandability viaintegrated visualizations of

document retrieval and recommendation

• Visual IR: beyond Google-like keyword search:– See more documents– See relationships : topical, inter-document– Whole content -based, not keyword-based

• Visual Recommendation: enables discovery– Personalized based on user feedback, persistent– Understand “why” due to visualized relationships

• Only possible due to new/fast ML algorithms

Related workCommercial tools for researchers:�Mendeley - www.mendeley.com – Free reference manager and PDF organizer

– Offline client, personal homepage, social features (Community, follow researchers etc), recommendation engine (people/paper), plug in support. Naive collaborative filtering based recommendations, no visualization.

�Arnetminer - www.arnetminer.com – Academic researcher Social network search

– Metrics (uptrend, longevity, diversity etc), Authorship Network. Author specific website. Research is beyond only authors.

�Microsoft academic – academic.research.microsoft.com – A free academic search engine

– Innovative ways to explore academic publications, authors, conferences, journals, organizations and keywords, connecting millions of scholars, students, librarians, and other users, very rich visualization features. Very limited set of domains (only for computer science), No social features

�Google scholar – scholar.google.com – Search engine for scholarly articles.

– A simple search interface to search all scholarly articles, multiple disciplines, multiple sources (books, patents, articles, university websites, etc). No recommendation, No visualization, Irrelevant search results, Very limited research specific information (number of citations alone).

�Braque.cc - informs researchers of others' research. - Academic launch. Not successful commercially.

�Exlibris - BxRecommenderSystems - http://www.exlibrisgroup.com/category/bXOverview -Discovery, Distribution and management of print/electronic and digital materials. Recommendations for librarians.

– Official librarian tool, encompasses huge repository of data from all the university libraries, Web based tool, multi disciplinary recommendations. No author level information (h-index), journal/conference level (rating of the journal), paper specific information(citation etc). Even though recommends papers, does not provide the statistics that researchers relies up on.

Related workCommercial conference management systems:Web based software that supports organization of scientific conferences.

•Easychair•Confmaster.net

•CMT – Microsoft academic conference management site

•Openconf•PCS

Academic systems:•C. Basu, H. Hirsh, W. Cohen, and C. Nevill-Manning. Technical paper recommendation: a study in combining multiple information sources. Journal of AI Research, pages 231–252, 2001•D. Mimno and A. McCallum. Expertise modeling for matching papers with reviewers. In Proc. KDD’07, pages 500–509, 2007

•J. Goldsmith and R. Sloan. The conference paper assignment problem. In Proc. AAAI Workshop on Preference Handling for Artificial Intelligence, 2007•C. J. Taylor. On the optimal assignment of conference papers to reviewers. Technical Report MS-CIS-08-30, University of Pennsylvania, 2008

•Don Conry, Yehuda Koren, Naren Ramakrishnan: Recommender systems for the conference paper assignment problem. RecSys 2009: 357-360

•Naveen Garg, Telikepalli Kavitha, Amit Kumar, Kurt Mehlhorn, Julián Mestre: Assigning Papers to Referees. Algorithmica 58(1): 119-136 (2010)•Chong Wang, David M. Blei: Collaborative topic modeling for recommending scientific articles. KDD 2011: 448-456

Our differentiators• Visual big-picture interface

– See more documents: utilize screen space limit better– See relationships : inter-paper, topic/clusters, – Relevance based on full content , not just a few keywords

• Personalized and persistent– Feedback from the user– Machine learning under the hood:

1. 2-d projection2. topical clustering3. recommendation4. neighborhood graphs5. classification

• Interactive speed : Only possible due to fast algorithms

VisIRRAn interactive visual information retrieval and recommender system for large-scale document data

•

Visualization Example of Queried Set

Clear topics

Computational zoom -in

QueryKeyword query, ‘dimension reduction’

Recommendation ExamplePreference-assigned item as ‘highly like’ : ‘Enhancing the visualization process with principal component analysis to support the exploration of trends’

Recommended docswith re-clustering

Recommended docs in existing view (in rectangles)

Features•Dynamic query-retrieval

• Keyword search on contents such as title, abstract, and keywords as well as author and venue fields

• Filtering on year, citation/reference count• Different queries created either separately or jointly with their own

visualization snapshots•Interactive visualization

• Multiple visualizations via dimension reduction (for 2d coordinate) and clustering (for color-coded summary) on dynamically retrieved sets

• Support for easy comparison between views via clustering and dimension reduction alignment

•Preference feedback and recommendation• Document preference assigned by users• Recommendation performed based on document contents, citation,

or co-authorship information• Recommended items projected into the same space along with their

predicted cluster labels

Large -scale Data Collection/Ingestion•Data collection

• Starting with DBLP data set (432,605 data items)• Data cleanup and missing value handling via Microsoft Academic

Search API• Title, author, year, venue, abstract, keywords, citation/reference

count, and citation network info•Data management

• Structured information stored in database• Term-document information pre-computed• Top K Cosine similarity pre-computed• Citation network and co-authorship network pre-built• Scalable streaming data handling with efficient update in O(n)

•Dynamic memory loading• Document information dynamically loaded on the fly depending on

user queries/interactions• Cache-like memory management using “least recently used”

approach

Graph -based Recommendation•Various recommendation schemes

• Content-, co-authorship-, and citation-based recommendation supported

•Heat-kernel-based propagation algorithm• Weighted graphs as an input (for content-based, k-NN cosine-

similarity graph)• User preference propagated efficiently on large-scale sparse graphs

rα

= α ∑k (1- α)kfWk

• rα

is a recommendation score vector with a control parameter α, and f is a user-assigned rating, and W is an input graph.

•Embedding on existing visualization• Out-of-sample embedding into previously computed dimension

reduction• Color-coding using k-NN classification on previous clusters

Summary

• Foundational algorithms for visual representations of high dimensional, large scale, heterogeneous data (dimension reduction, clustering,

space alignment)• Fast algorithms for real time interaction• Development of VA testbed• Development of proof-of-concept VA system

fodava-lead : visual analytics for large-scale high...

Documents