fodava-lead : visual analytics for large-scale high...
TRANSCRIPT
FODAVA-Lead : Visual Analytics for Large -scale High Dimensional Data:from Algorithms to Software Systems
Presented by Haesun Park and Alex Gray School of Computational Science and Engineering
Georgia Institute of TechnologyAtlanta, GA, U.S.A.
FODAVA Annual Meeting, Dec. 2012(PIs: H. Park, A. Gray, J. Stasko, V. Koltchinskii, R. Monteiro)
Contributors• Jaegul Choo (Georgia Tech)• Changhyun Lee (Georgia Tech)• Hanseung Lee (Univ. of Maryland)• Zhicheng Liu (Stanford University)• Fuxin Li (Georgia Tech)• Yunlong He (Georgia Tech)• Jaeyeon Kihm (Cornell University)• Jingu Kim (Nokia)• Da Kuang (Georgia Tech)• Sen Yang (Arizona State University)• Ed Clarkson (Georgia Tech Research Institute)• Polo Chau (Georgia Tech)• Alexander Gray ( and many of his students, Georgia Tech)• Vladimir Koltchinskii (Georgia Tech)• Renato Monteiro (Georgia Tech)• John Stasko (Georgia Tech)• Jieping Ye (Arizona State University)
Challenges in Computational Methods for High Dimensional Large-scale Data
on Visual Analytics System • Data challenges
– Massive, High-dimensional, Nonlinear– Vast majority of data is unstructured– Noisy, errors and missing values are inevitable in real data set– Heterogeneous format/sources/reliability – Time varying, dynamic, …
• Visualization challenges– Screen Space and Visual Perception
• High dimensional data: Effective dimension reduction • Large data sets: Informative representation of data
– Speed: necessary for real-time, interactive use• Scalable algorithms• Adaptive algorithms
• Dimension Reduction • Dimension reduction with prior info/interpretability constraints• Manifold learning
• Informative Presentation of Large Scale Data• Sparse recovery by L1 penalty• Clustering, semi-supervised clustering• Multi-resolution data approximation
• Fast Algorithms • Large-scale optimization/matrix decompositions• Adaptive updating algorithms for dynamic and time-varying data,
and interactive vis.
• Information Fusion • Fusion of different types of data from various sources, vis. comparisons
• Integration with DAVA systems • Testbed , Jigsaw , iVisClassifier, iVisClustering, VisIRR ..
Key Foundational Components forVA System Development
FODAVA Research Test Bed for Visual Analytics of High Dimensional Data
• Library of key computational methods for visual analytics of high dimensional large scale data
• With visual representations and interactions • Easily accessible for DAVA researchers
and readily available for applications• Identifies effective methods for specific problems (evaluation)• Modular: A base for specialized VA systems
(e.g. iVisClassifier, iVisClustering, VisIRR)
FODAVAFundamentalResearch
ApplicationsApplications
Test Bed
FODAVA Research Testbed Software: Available at http://fodava.gatech.edu/fodava-
testbed-software • Supports various dimension reduction, clustering, and their visual representations
and comparisons through alignments for high-dimensional data• Application domains: document analysis, bioinformatics, seismic data analysis,
healthcare, communications, computer vision, …• Language used: backend library in Matlab, GUI in JAVA (no need for Matlab installed)• System support: Windows 32/64 bit, Linux 32/64 bit
Testbed Modules• Computational modules
• Vector encoding• Pre-processing• Clustering• Dimension reduction
• Interactive visualization modules• Parallel coordinates• Scatter plot• Cluster summary• Brushing and Linking• Space alignment• Raw data view
Dimension Reduction
• Visualizes high-dimensional data by parallel coordinates and/or scatter plot
• Methods• Linear methods
• PCA, FA, ProbPCA, LDA, OCM, NPE, LPP, LLTSA, NCA, MCML
• Nonlinear methods• MDS, Isomap, LLE, LTSA, Sammon, HessLLE, MVU,
LandMVU, KernPCA, GDA, DiffMaps, SPE, AutoEnc, LLC, ManiChart, CFA, GPLVM, SNE, T-SNE
• Provides initial parameters that can be changed interactively• Can recursively apply dimension reduction
on user-selected data• Fast algorithms implemented
Clustering and Classification
• Generates cluster/class labels of data,which are color-coded in visualization.
• Methods• Clustering
• Hierarchical clustering, K-means, spherical K-means, GMM, NMF, constrained K-means, DisCluster/ DisKmeans [J. Ye]
• Classification (on-going work)• K-nearest neighbors classifier, SVM, Logistic
regression, Naïve Bayes• Provides cluster summary• Provides GUI for semi-supervision, e.g., must/cannot link• Can hierarchically construct cluster structures
Computational Zoom -in
Normal Zoom-in
Comp. Zoom-in
Computational zoom-in by recursive dimension reduction on selected data
Controls the level of focus on locality
Interactive Parameter Change e.g. in Isomap k value in k-NN Graph
Fast Comp. Modules for Interactive Vis.
• Essential for real-time interaction
• Let computational precision be governed by visual precision/resolution
• Hierarchical refinement• Adaptive algorithms
p-Isomap computing time vs.k value in k-NN graph
−0.2 −0.15 −0.1 −0.05 0 0.05 0.1 0.15 0.2 0.25 0.3
−0.2
−0.1
0
0.1
0.2
0.3
−0.2 −0.15 −0.1 −0.05 0 0.05 0.1 0.15 0.2 0.25 0.3
−0.2
−0.1
0
0.1
0.2
0.3
10 20 30 40 50 6010
12
14
16
18
20
22
24
26
Initial K
Co
mp
utin
g t
ime
s in
se
co
nd
s
ISOMAP (k=initial K)p−ISOMAP (k:initial K−>initial K−2)p−ISOMAP (k:initial K−>initial K+2)
1000 2000 3000 4000 5000 6000 7000 8000 9000 100000
50
100
150
200
250
300
350
400
450
500
Data Size
Tim
e in
Sec
onds
Double precisionSingle precision
48x36 vs 80x60
PCA timing: double vs single precision computation time and results
Key Computational Methods
• NMF (Nonnegative Matrix Factorization) and its variations:for dimension reduction and clustering
• LDA/GSVD (Linear Discriminant Analysis) and its variations:for informative 2D representation of clustered and large scale data
• Orthogonal Procrustes and MDS (Multi-Dimensional Scaling): for space alignment and comparisons of visual representations
Nonnegative Matrix Factorization (NMF)(Paatero&Tappa 94, Lee&Seung NATURE 99, Pauca et al. SIAM DM 04, Hoyer 04, Lin 05, Berry 06, Kim and Park 08 SIAM Journal on Matrix Analysis and Applications, Kim and Park 11 SISC…)
A W
H~=
�min || A – WH ||FW>=0, H>=0
• Why Nonnegativity Constraints?•Better Approx. vs. Better Representation/Interpretation•Nonnegative Constraints often physically meaningful• Interpretation of analysis results possible
• One of the Fastest Algorithms for NMF & theoretical convergence analysis
• Matlab codes publicly available (J. Kim and H. Park, IDCM08, SISC11)
http://www.cc.gatech.edu/~hpark/nmfsoftware.php• NMF is better and faster than K-means in clustering
• K-means: W: k cluster centroids, hi: cluster membership indicator• NMF: W: basis vectors for rank-k approx., hi: k-dim rep. of ai• SymNMF (Kuang, Ding, Park, SDM12), Sparse NMF for clustering (Kim and Park, Bioinfo., 07)
• NMF more accurate and faster on document and image data • (Xu et al. 03; Pauca et al. 04; Li et al. 07; Kim & Park, 08; Ding et al. 10 ...)
– Clustering accuracy averaged over 20 runs:
– Problem sizes:
TDT2 Reuters COIL20 NIPS ORL PIE Overall
K-means 0.6734 0.4289 0.6184 0.4650 0.6499 0.7384 0.5957
NMF/ANLS 0.8534 0.3770 0.6312 0.4877 0.7020 0.7912 0.6404
SymNMF 0.8979 0.5305 0.7258 0.5129 0.7798 0.7517 0.6998
NMF for Clustering
K K-means SphKmeans NMF/ANLS GNMF Spectral SymNMF
TDT2
4 .7994 .7978 .9440 .9150 .9093 .9668
8 .6147 .6208 .8292 .8200 .7357 .8819
16 .5286 .5305 .6709 .6812 .5959 .7635
Reuters
4 .5755 .5738 .7737 .7798 .7171 .8077
8 .5170 .5049 .6747 .6758 .6452 .7343
16 .3712 .3687 .4608 .5338 .5001 6688
TDT2 Reuters COIL20 NIPS ORL PIE 20 Newsgroups
m 26618 12998 4096 17583 5796 4096 36982
n 8741 8095 1440 420 400 232 18669
k 20 20 20 9 40 68 20
On average, NMF is faster than k-means by a factor of at least 2
Linear Discriminant Analysis for2D/3D Representation of Clustered Data
(J. Choo, S. Bohn, HP, VAST09)
Max trace (GT Sb G) min trace(GT Sw G)&
max trace (GT (Sw + µI)G)-1 (GT Sb G)
• Regularization in LDA for Computational Zooming-in
Small regularization Large regularization
LDA/GSVDα 2Hb Hb
Tx = β 2Hw HwTx
−0.25 −0.2 −0.15 −0.1 −0.05 0 0.05 0.1 0.15 0.2
−0.2
−0.15
−0.1
−0.05
0
0.05
0.1
0.15
0.2
aa
a
aa
a
aaa
a
aa
a
aaa
aaa
aaaaa
aaa
aaaa
aaaaa
a
aaa
a
a
aaa
aa
a
aa
bbb
bb
b
bbb
b
bbb
bbb
bbb
bbbb
b
bbb
bbb
b
bbbbbb
bbb
b
b
bbb
bbb
bb
cc
c
ccc
c cc
c
cc
c
ccc
c cc
cc
ccc
c cc
ccc
c
ccc
ccc
ccc
c
c
ccc
c
c
c
cc
d d
d
dd d
dd
d
d
ddd
dd d
ddd
dd dd d
ddd
dddd
ddd d
dd
ddd
d
d
ddd
ddd
dd
e
ee
e e
e ee
e
e
eee
ee
e
e ee
ee
e ee
ee
e
ee
ee
ee
e
eee
eee
ee
ee
e
ee
e
eefff
fff
fff
f
fff
fff
fff
ff
fff
fff
fff
f
fff fff
ff
f
f
f
fff
f f
f
ff
g
g
g
gg
gggg
g
gg g
ggg
gg g
gg
g gg
ggg
g ggg
ggg
ggg
ggg
gg
ggg
gg
g
gg
hhh
hh
hh h h
h
hhh
hh h
hhh
hhh
hh
hhh
hhh
h
hh
hhh
h
hhh
h
h
hhh
hhh
hh
ii i
iii
i ii
i
ii i
i i i
iii
ii ii
iii
iiii
i ii i ii
iii
i
i
iii
iii
iii
jjj
jj j
jjj
j
j jj
jjj
jjj
jj jjj
jj j
jjjj
jjj
jjj
jjj
j
j
jjj
jjj
jj
kk
k
kk k kkk
k
kk
k
kk k
kk k
kkk
k
k
kk
k
k
kk
k
kk
k
kkk
kk
k
kk
k
kk
k
kk
k
kll
l
ll
l
ll
l
l
lll
ll
l
ll
l
ll
lll
l l
l
l l
l
l
ll
l
ll
l
ll
l
l
l
l l
l
ll
l
ll
mmm
mm m
mm m
m
mmm
mm m
mmm
mm m
mm
mm
m
mmm
m
mmm m
mm
mmm
m
m
mm
m
mmm
m m
nnn
nnn
nn
n
n
nnn
nnn
nnn
nn
nnn
nnn
nnnn
nnn
nnn
n nn
nn
nnn
nnn
n noo o
ooo
ooo
o
oo o
ooo
ooo
ooooo
ooo
ooo
o
ooo oo o
ooo
oo
oo
o
oo
o
oo
pp
p
pp
p
p pp
p
pp p
ppp
ppp
ppppp
ppp
pppp
ppp
ppp
p pp
p
p
pp p
p
pp
p
p
q
qqq
qqq
q
q
qqq
qqq
q qq
qqq
qqqq
qqq q
qqq
qqqq
r
r
r
rr
r
rr r
r
rr
r
rr r
rr r
rrrr
r
rr r
rr
rr
rr r
rr r
rrr
rr
rr r
rr r
r
r
ss
s
ss
s sss
s
sss
ss s
sss
ss sss
sss
ssss
sss sss
sss
s
s
sss
ss
s
ss
t
tt
ttt tt
t
t
t
t tt
ttt
t tt
ttt
tt
ttt
t tt
t
t
ttt t t
t
t tt
t
t
t tt
ttt
u
uu
uuu
u
uu
u
uu
u
uuu
uuu
u
uuu
u
uuu
uuuu
uuu
uuu
uuu
uu
uuu
uuu
uu
vvv
vvv
vvv
v
vvv
vvv
vvv
vv
vvv
vvv
vv
v
v
vvv
vv v
vvv
vv
vvv
vvv
v
v
www
www
www
w
ww
w
ww
w
ww
w
wwww
w
www
ww
ww
ww
w www
www
w
w
www
www
ww
xx
x
x
xx
x
xx
x
xxx
xxx
xx
x
xx xx
x
xxx
xxx
x
xxx xx
x
xxx
x
x
xxx
xxx
x x
yy
y
yy y
y y y
y
yy y
y yy
yy y
yy
yyy
yyy
yy
y
y
yyy yyy
yyy
yy
yyy
yyy
yy
z z
z
zzz
zzz
z
zz
zz
zzz
zz
zzz
zzz
zzz
z
zzzzzz
z zz
zz
zzz
zzz
z z
z
z
1
1
1
11 1
11 1
1
11
1
111
11
1
11
111
111
111
1
111 1 1
1
111
1
1
1 11
1 11
11
a
bc
d
efgh
ij kl
m
n o
pqr
stuv
w
xy
z1
−0.06 −0.05 −0.04 −0.03 −0.02 −0.01 0 0.01 0.02 0.03 0.04
−0.04
−0.02
0
0.02
0.04
0.06
aa
a
aa
aaaa
aaa
a
aa
aaa
a
aaaa
a
aa
a
aa aa
aa a
aaaaa a
aa
aaa
aaa
aa
bbbb bbbbb
b
bbb
bbb
bbb bb
bbb
bbbbbbb
bbbbbb
bbbbbb bb
bbbbb
ccc
ccc
ccccc
cc
cc
cc
cc
cc ccc
c
c
c
c
cc
ccc c
c
cc
ccc
c ccc
ccc
ccc
ddd
dd d
ddddd
ddd
d d
dddd ddd
d
d
dd
dd d
ddd
d
dd
d
dd
ddddd
d
ddd
d
d
ee
ee e
e
ee
e
e ee
e
e e
eee
e
eeeee
ee
e
ee
e
e
ee
eee
e
e
e
eee
eee ee
e
ee
fff
ff
f
ff
f
f
ff
fff
ff
ff
ff
ff
ff
ff
f
f
f
f
ff
fff f
fff
ff
ff
ffff
f fg
gg gg
gggg
g
gg
g
ggggg
ggg
ggg ggggg
g
ggggggg
gg
g
gg ggggg
gg
g
hhh
h hh
h hh
hh
hh h
h
hhhh
hhhhhh
h
hh
hhh h
hh
h
h
h
hh
hh
hh
hhhhh
hh
i i i
i ii
ii iiiiii i i
iiiiiii
i i iiiii i
iii
ii
iii
iiii i
i iii i
i
jjj
jjj
jjj
jjjjj j
jj jj
jj
jj jjj j
jjj j
jjj
jj j
jjj
jj
j jjj j
j
jj
kk kkk
k
k kkkk
k k kkk
k kk
kkkk
kkk
k
k k
k
kk kk
kk
k
k k
k
k kkkk
k
kk
kk
l l
l
lll
ll llll
lll
lll
l
lll l
l lll
lll
l
lll
ll lll ll l ll l
ll l
ll
m mm
m mm
mmm
m mmmm mm
m mm
mmm
m mmm
m
mmm
mmm m
mmmm
mmm
mmm
mmm m
mm
nnn
nnn
n
nn
nnnn
n nn
nnn
nn
nnn
nnn
nnn
n
n nn
n nn
n nnn
n
n nnn nn
n n
oo
oooooo ooo
oo
oo o
oo oooooo
ooo ooo
ooo
ooooooo
oooooo o
o
oo
p
p p
p
p p
pp
p
pp
ppp
p pp
pppp
p
p ppp p
pp p
p p
p pp
p p
pp
ppp
p
pp
pp pp p
qq q
qqq
qqqqq
qqq
qqq
qq qq
qqqq
qqq
qqqqqq
q
qqq
q
qqqr
r
r
rr rrrrr
rrr rr rrr
r rr r
rr
r
rrr
r rr r
rr
r rr
rrr
rrr rrr
rrr
r
sss
ss
s
sss s
ss
sss s
sss s
s
ss s
sss s
ssssss
sss s
sss s
sss
sss ss t
t
t
tt t
t
t tt
t
tt t
t tt
ttt
tt
ttt
ttt
t
tt
t
t
t
tt
t
tt
t
t t
t t
tt
t
t tt
u u
u
uu
u
u
uu
uuu
u
uu uu
u uuuu u
uuuu
uu uu
uuuuuuu
uuuuu uuu
u uu
u
vvv
vvv
vvvv
vvv vvv
vv
vv
v
vvvv
vvvv v
vvvv
v
vv
vvvv
vvvv
vvv
vv
w
ww
wwwww
ww
www
www
ww
www
www
www
ww
w wwww w w
ww ww
wwww
w
ww www
xxx
x
xxx
xx
xxxx
x
xx
x
xx
xxx x
xxxx
xx
x
xxxxx x
x
xx
x
xxxx x
xxx
xx yy
y
yyyyy
y
yyy
y
yyy
y y
yyy yy
yyyy yyy
y yyy yyyyy
y
y
y
yy
y
y
y
yyyzz
z
zzzzzz
z zzz
zzzz
zzzzz
zzzz
zzz
zz
z zz z
zzz
zz
zz
zzzz
zz z
z
11 111
111 1
1 11 1
111
11
1
111
1 11
1
111
1
1
11
1
11 1
111
11 11 111 1
1
1
a
bc
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
st
u
v
w
xy
z1
2D Visualization of Clustered Text, Image, Audio Data
20news Data (Text)
LDA+PCA
PCA
Rank-2 LDA
PCA
−0.04 −0.02 0 0.02 0.04 0.06−0.08
−0.06
−0.04
−0.02
0
0.02
0.04
0.06
aa
a
a
a
a
a
a
aa
a
a
aa
a
a
aa
aa
a
aa
a
a
aaa
aa
a aaa
a
a
a
a
aa
a
a
aa
a
a
a
a
a
a
a
a aa
aa
aaaa
b
b
b
bb
b
b
b
b
b
b
b
b b
b
b
b
b
b
bb
b
b
b
b
b
b
b
bb
bbb
bb
b
b
b
b
bb
b
b
b
b
b
bbb
b
b
bb
b
b
bb
b
b
b
c
c
c
c
c
c
cc
c
cc
c
cc
c
c
ccc
cc
cc cccc c
cc
c
c
c
ccc
c
cc
c
c
c
c
cc
c
c
cc
c
cc
cccccc
c
c
dd
dd
dd
dd dd
d
d
d
d
ddd d
dd
d
d d
d
d d
d
d
dd
d
d
dd
d
d
d
dd dd
dd
dd
d
d
d
d
d
ddd
d
dd
d
d
d
d
ee
e
e
e
ee
ee
e
e
e
e
e
e
e
ee
e
ee
e
ee
ee e
e
e
e
ee
ee
e
ee
e
e
eeee
ee
e
ee
ee
ee
ee
ee
e
ee
e
f
f
f
f
f
f
f ff
f
fff
fff
f f
f
ff f
f
f
f
f
f
f
f
ff
ff
f
ff
ff
f
fff
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
f
g
g
gg
g
g
g
g
g
g
gg
g
g
g
g
g
g
g
g
g
g
g
g
g
g
g
gg
g
g
gg
g
g
g
g
g
g
gg
g
gg
g
gg
g
g
g
g
g
gg
g
g
g
gg
g
h
h
h
hh h
h
h
h
h hh
h
h
hh
h hh
h
h hh
h
h
h
h
h
h
h
h
h
h
h
hh
h
h
h
h
h
h
h
hh
hh
h
h
h
h
h
h
h
hh
h
h
h
h
i
i
i
i
i
i
i
i
i
i
ii
i
i
ii
ii
i
i
ii
i
i
i
ii i
i
i
i
i
i ii
i
ii
i
ii i
i
i
ii
i
i
ii
i
i i
i
i
ii i
i
i
j
j
j
jj
j
j
j
j
jj
j
j
jj
jjjj
j
j
j j
jj
j
j
j
j
j
j
jj
j
jjj
j
j
jj
jjj
jj
jj j
j
j
jj
jjj
j
j
jj
k
k
k
k
k
kk
k
k
k
k
k
kk
k k
k
kkk
k k
k
k
k
k
k
k
kk
k
k
k
kk
kk
k
kk
k
k k
k
k
k
k
k
k
k
k
k
k
k kk
k
kk
k
l
l
l
ll
ll
l
l
l
l
l
ll
l
l
l
l
l
l
l
ll l
l
l
ll ll
l
l
l
l
l
l
ll
l
l
l
ll
l l
ll
l
l
l
l
l
ll
ll
l
lll
m
m
m
m
m
m
m
m
m
m
mm
m
m
mm
mm
m m
m
m
m
m
mm
mm
m
m
m
m
m m
m
mm
mm
m
m
m
m
m
m
m
m
m
m
m
mm
m
m
m
m
m
m
mnn
nn
n
n
n
n
nnnn
n
n
n
nn
n
n
n
nnn
n
n
nnn
n
n
n
n
nn
n
n
n n
n
n
nn
n
n
n
n
nnn
n n
n
n
nnn
n
nn
n
o oo
o
o
o
oo
o
o
oo
o
oo
o
oo
o
oo
oo
o
o
o oo
oo
o
oo
oo
oo
oo
o
o
ooo
o
o
o
o
oo
oo
o
o
o
o
ooo
o
pp
p
ppp
p
p
p
p
p
p
p
p
p ppp
p
p
pp
pp
pp
pp
p pp p
pp
p
pp
p
p
p
p
p
p
p
pp
p
p
p
pp
p
p
p
pp
pp
p
p qq
q
q
q
q
q
q
q
q
q
q
q
q
q
q q
q
q
q
q
q
q
q
qqq
q
q q
q
q
q q
q
q
q q
q
q
q
q
q
q
q
q
r
r
r
rrr
r
r
rr
r
r
r
rr
rr
r
rrrr
r
rr
r
r
r
r
rr
r
r
rr
r
rr rrr r
r
r
r
r
r
r
r
r
r
rr
r
rr r
rr
rs
s
ss
s sss
s
s
ss
s
s
s
ss
s
s
s ss
s
ss
ss
s
ss
ss
s
s
s
ss
ss
ss
s
ss
s
ss
s
s
s
s
s s
s
ss
s
s s
s
t
t
tt
t
t
t
tt
t
t
t
ttt
t
t
t
t
t
t
t
t
t
t tt
t
tt
t
t
t
t
t
t
t
t
t
tttt t
t
t
t
tt t
t
t
t t
tt
t
t
t
t
u
u
uu
u
u
u
u
uuu
u
u
u u
uu
u
u
uu u
u
u
u u
u
uu
u
u
uu
u
u
u
u
u
u
u
u
u
u
u
u
u
u
u
u
u
u
u u
u
uu
u
uu
u
v
v
vv
v
v
v
vv
v
v
v
v
v
v
v
v
v
v
vvv
v
v
v
v
v v
v
v v
v
v
v
vv
v
v
vv
v
v
v
v v
v
v v
v v
vv
vvv
v
v
vv
vw
w
w
w
ww
w
w
w
w
w
w
w
w
w
w
w
ww
w
w
w
w
w
w
w
w
ww
w w
w
w
w
w w
w
ww
w
w
w
ww
w
w
w
w
w
w
w
w
w
w
w
ww
ww
w
x
x
x
xx x
xxx
x
xxx
x
x
xx
xx
x
xx
xx
xx
x
x
x
xx
xx
x
xx
xx
x
x
x
x
x
x
x
x
xx
x x
x
xx
x
x
x
x
x
x
x yy
yy
y
y y
y
y
y
yy
y
yy
y
y
y
y
y
y
yy
y
y yy
y
y
yy
y
yy
y
y
y
y
y
y
y
yy
yyy
y
y
y
y
y y y
yy
y y
y
yy
z
z
z
z
zz
z
zzz
z
zz
z
z
z
z
z
z
z z
z
zz
z
z
z
z
zz
z
z
zz
zz
z
z z
z
z
z
zzz
z
z z
zz
z
z
z
z
z
z
zz
z
ab
c
d
e
f
g
h
ij
k
l m
n
op q
rs
t
uv
w
x y
z
−0.05 −0.04 −0.03 −0.02 −0.01 0 0.01 0.02 0.03 0.04 0.05
−0.04
−0.02
0
0.02
0.04
0.06
aa
aa a a
aa
a
aa
a
aaa
a
a
a
a a
aa
aaa
a
aaa aaa
a
a aa
aa
aa aa
aa
aa
aa
aa
aaaa
aa
aa
aa
bb
bbbbb
bbbb
b
bb
bb
b
b b
b
b
b b
bb
b
b
bb
bbbb
bb
bb
b
bbbbbb
bbb
b
bb
b
bb
b
b
bb bb
b
c
c
c
ccc
c
c
cc
ccc
c
cc
c
cc
c
c
cc
ccc
c
cc
ccc
c
c
cc c
c
ccc
cc
cc
c c
ccc c
cc
c
ccc
c
c c
dd
dd
ddd
dd ddd
dd
d ddd
d
d
dd
dd
d
dd
d
dd
dd
ddddd
d d dd
dd
dd
ddd
d
d
dd
d ddd
d
d
dd ee
eeee
e
ee
e
e
e
ee
e eee
eeee e
ee
e eee
eee
eeee eeee
ee
ee
ee e
e
e
ee e
ee
e
e
eee
e
ff
f
ff
ff f
ff
fff
f ff f
ff
f
f ff f
fffff
ff
f
f
f
f
ff
ff
f
ff
f f
f
f
ff
f
f
ff
f ff
f
ff
f
f
g
ggg
gggg
g
gg
g ggg
ggg
g
ggg
g
g
gg g
gg
g
gg gg gg
g
gg
g
g
g
ggg
g g g
g
g
gggg
gg
gg
gg
hh
h
h
h
h
h
h
hh
hh
hh
hh
hhhh
hhhh
h
hhh
h
hhh
h h hhh
h
h h
h
hhhh
hh
h
hhh h
hh
h
h h
h
h h
i
iii
ii
i
i
ii
i
ii iii
i
i
i ii i
iiii
ii
iiii i iiii
i
i iii
i ii i iiii
iii
i
ii i
ii
i
j j
jj
jj
jj j j
j
j j
j
jj
j
j
jj
jj jj j
jj
j
jjj
jj j
jjj
jj
jj jj
jj
j
jj
j
jjj
jj jj
jj
j
j
kkk
kk
kkk
kk
kkk kk
kkk k
kk
k
kk
k kkkkkkk kk kkkk
kkk
k
k kk k
kk k
k
k
kkk
kk k
kkk
lll
ll
l
ll
l
lll
ll
lllll
lll ll
ll
l
llll
lll
ll
lll lll
l
lll
ll
l
ll
l ll
lll l
l
l
mmm
m
mmmm
m
mm mm
mmm
m m
m
m
mm mm
mm
mm
m
mm
mmm
mm
mmm m
m mmm
m
mmmm
mmmmm
mm
m
m
m
nnn nn
n nn n
n
nn
nn
nnnn
nnn
nn n
nn
nn
n
nn
n
n
n
n
n
n nnnnn
nn
nn
nnn
n
n
nn nn
n
nnn
n
o
o o
oo
oo
o
oo
ooo o
o
ooo
ooo
oo
ooo
ooo
ooo
oo
oo
o
oo
oo
ooo
o o
oo
o
oo
oo o
o
oo
oo o
p
p
pp p
ppppp pp
p
pp
ppp
ppppp
p
p
p
p
pp
pp
p p ppp pppp
pp
pp
p
pp ppp
p
pppppp
pp
p
qqq
qq q
qq q
q q
qq q
q
q
qqq
qq qq
q
qqq q
qqq qq
q
q
q
qqqq
qqq
q
qq q q
qqq
qqq
r
rrr
rr
rr
rr
rrrrr
r
r
rr rr rr r rr r
rrr
rr
rr rr
r rr
rr
rrrrr
rr r
r
rr
rr
rr
rr
r
r
s
ssss
sss
s
s
s
ss
ss
s
s
s
s
s
sss
sss
s
sss
s
s
ss
sss
s
ssss
ss
sss
ssssss
s
s s
s
s
ss
tt
tt
tt
t tt
t
tt
ttt
ttt t
tt
t tt
tt
tt
tt
t
t t
tt
t
tt tt
ttttt t
ttt tt
t
t
t
ttt
tt t
u uu
uu
uu
uuuu
u
uuu u
u
uu
uuuu
uu u uu
u
uu uuuu
u
uuuu
u
u
uu
u u uuu
u u uuuuuu
u
uu
v
vv
vv vv v
v
v
v
vv v
vv
v
v
vv
v
v
v vv vvv
vvv v
v
vv
vv
vv
v vv
vv
vvv v
vv
v
vv
v
v vv
v
v v
ww
www www
w w wwwww
ww
w ww
ww
ww
w
w
wwww
w
ww
w w
w
ww
w
ww
w
w
w
w
w
w
www www
w w
w
w
ww
w
xxx x
x
xxx
xxx
xxx
x
x
x
x
xxx
x
xxxxx
xx
x
xxx
x
x
xx
x
x
xx
xxxx
x
x
x
xxx
xx
xx
x
x
x
xx
yyy
y
yy
yy
yy
yy
y
y
yy
y
yy
y yy
yyyy
y
y
yy
yy
y
yy
yy
y
yy yy yyyy
y
yy
yyy
y y
y
yy
y
yy
zzz
zz zz
z zz
z
zz
zzz
z
zzz
z
zzz
zzz
zz
zz
zzz
z
z
z
zzz
zz
z
zz zzz
zz
zz
zzz z
z
z
z
a
b
c
d e
f
g
h
i
j k
lmn
o
p
q
r
s
t
u
v
w
x
y
z
Rank-2 LDA
PCA
Facial Data (Image) Spoken Letters (Audio)
Two-stage Dimension Reduction for 2D Vis. of Clustered Data
• LDA + LDA = Rank2 LDA • LDA + PCA• OCM + PCA • OCM + Rank-2 PCA on SF
b = Rank-2 PCA on Sb
Information Fusion and Visual Comparisons based on Space Alignment (J. Choo, S. Bohn, G. Nakamura, A. White, HP)
• Want: Unified visual representations of different results• Assume: Reference correspondence information between data
pairs or cluster correspondence• Two conflicting criteria: maximize alignment and minimize
deformation
Data sets Similarity graph
Fused data
• Procrustes analysismin || (A-µA1T)-kQ(B-µB1T)||FQTQ=I
• Graph embedding approach (MDS)
Fusion and Alignment in Testbed
AlignedReference Un-Aligned
Data set 1
−0.08 −0.06 −0.04 −0.02 0 0.02 0.04 0.06 0.08
−0.06
−0.04
−0.02
0
0.02
0.04
0.06
0.08
0.1
e
e e
ee
e
ee
eee
e ee e
e
e
eee
e
ee
ee
e
ee
e
ee
e
ee
ee
e
ee e
e
e e ee
ee
e
e
e
e
e
e
eee
eee
e ee
e
ee
eee
ee
bbb
bb
b
bb
b
b
b
bb
b
b
bb
b
b
bb
b
b
b
b
b
b
b
b
bb
b
b
b
bb
bb
b
b
b
b b
b
bb
bb
b
bbbb
b
b
bb
b
bb
b
b
b
b
bb
bb
b
b ff
f
ff
ff
ff
f
f
ff
f
f
f
ff
f
f
f
f
f
f
f ff
f
ff
f
ff
f
f
ff
ffff f
f
f ff
f
f
ff
ff
f
f
f f
ff
f
f
f fff
f
f
f
f
f
faa a
a
a
a
a aa
aa
a
aaa
a
aa a
a
a a
aa
a
aa
aa
a
a
a
a
aa
a
aa
a
a
a aa
aa
aaa aa
a
a
a
aa
a
a
a
a
a
a
a
a
aa
aa
aa
a
d
d
d
dd
d
d
dd
dd
d
d
d
ddd
d
d
dd
d
d
ddd
dd
d
d
d
dd
dd
d
d
d
d
dd
dd
d
dd
d
d d
dd
dd
dd
d
d
d
d
dd dd
d
d
dd
d
d
d
rr
r
r
r
r
r
rrr
r
r
r
r
r
r
r
r
r
r
r
rr
r
r
rr
r
r
rrr
r
r
r
rr
r
r
r
r r
r
r
r
r
r
rrr
rr
rrr
r
r
rr
rr
r
r
r
r
rr
rr
r
gg
g
g
gg
g
g
gg
g
gg g
g
g
gg
g
g
g g
g
g g
g gg
g
g
gg
g
g
g
g
g
gg
gg
g
gg
ggg
g
gg g
g
gg
g
g ggg
ggg
g
g
g
gg g
g
g
b
ae
f
g
d
r
−0.06 −0.04 −0.02 0 0.02 0.04 0.06
−0.08
−0.06
−0.04
−0.02
0
0.02
0.04
0.06
ee
e
e eee
e
e
e
e
ee
e
ee
e
e
ee
ee
ee
e
eee
eee
e
e
e
ee
e
e
e
e
e
ee ee
e
ee e
e
e
e
e
e
e
ee
e
ee
eee
e
ee
e
e ee
b b
bb
b
bb
b
b
b
b
bb
b
bbb
b
bb
b
b
b
b
bb
b b
b
b
b
bb
b
bb
b
b
bb
b
b
bb
bb
b
b
b
bb
bb
b
bb
b
b
b
bb
bb b
b
bbbb
b
f
f f
f
f ff
f
f f
fff
f ff
f ff f
f f
fff
f
f
ff
ff
f
ff
ff
ff
f f
ff
f
f f
f
ff
fff
f
f
ffff
ff
ff
f
f
f
f
f
f
fffp
p
p
p
p
pp
p
p
p
p p
p
p
pp
p
p
p
p
p
p p
p
p p
p
p
pp
p
p p
p
p
p
p
p
p
p
p
p p
p
p
pp
pp
pp p
p
p
p
p
pp
p
p
p
p
p
p
p
p p
pp
p
yy y
yy
y
y
y
y
y
y
y
y
y
y
yy
y yy
y
y
yy
y
y
y
y
y
y
y
y
yy
y y
y
y
y
y
y
y
y
y y
y
y
yy
y
y
y
y y
y
yy
y
y
y
yy
yy
y
y
yy
y
y
cc
c
cc
cc
cc
cc cc
cc
c
c
c
c
c
c
c
c
cc
cc
cc
cc
c
c
c
c
c
c
cc
cc
c
c
c
cc c
ccc
c
c
c c
cc
c
cc
c
c
c
c
c c
ccc
cc m
m
m m
m
m
mm
m
m
m
m
m
m
mm
m
mm
m
m
m
m m
m
mm
mmm
m
m
mm
mm
m
m
m
m
m m m
mm
mmmm
m
m
m mmm
m
m
m
m
m
m
m
m
m
mm
m
mmm e
e
e
e eee
e
e e
e
ee
eee
e
e
eee
ee
ee eeee
e
e
ee
e
e
e
e
e
e e
e
ee eee
ee e
e
e
e
e
e
e
eeee
e
e ee
e
ee
e
eee
bb
bbb
bb
bb
b
b
bb
b
b
bb b
b bb
bb
b
b
b
b b
b
b
b
b
b
b
b
b
bb
b bb
b
b b
bb
b
b
bbb
bb
b
b
bbb
b
bb
b
b b
b
bb bb
b
ff f
f
f ff
f
f f
f
f
f
ff
f
ff
f
f
ff
ff
ff
fff f
f
ff f
fff f
ff
f
f
f
ff
f
ff
ffff
f
fff
f
ff f
f
f
f
f
f
f
f
f
ff
a
a
a
aa
aaa
a
aa
a
a
a
aa
aa
a
a
a
a
a
a
aa
a
a
a
a
a
a
a
aaa
a
a
a
a
a
a
a
a
aa
a
aa
aa
a aa
a
aa
a
a
a
a
a
a
a
aa
aa
a
a
d
d
d
ddd
d d
dd
d
d
d
d
d
d
dd
d
d
dd
d
dd
d
dd
d
d
d
d
dd
dd
d
d
ddd
dd
d
d
ddd
d
dd
d
d
d
d
d
d
d
dd
d
d
dd
d
dd
d
dd
r
r
r
r
r
rrr
rr
r
rr
r
r
rr
r
r r
r
r
r
rr
r
r
rr
rr
r r
r
r
r
r r
rrr
rr
rr
r
r
r
rr
r
rr
r
r
r
r
rr
r r
r
r
rr
rr
rr r
g
g
g
g gg
g
gg
g
g
g
g
gg
g
g
g
g
g
gg
g
g
g
g
gg
gg
g
g
g
g
g
g
gg
gg
g g
g
g
g
gg
g
g
g
gg
gg
g
gg
g
g
g
g
g
g
g
ggg
g
g
g
p
b
c
ay
em
f
g
d
r
−0.08 −0.06 −0.04 −0.02 0 0.02 0.04 0.06 0.08−0.1
−0.08
−0.06
−0.04
−0.02
0
0.02
0.04
0.06
0.08
0.1
e
e
e
ee
ee
e
e
e
e
eee e
e
e
e
e ee
e
e e
e
e e
e
e ee
ee
e
e
e ee
e
e
ee
e e
eee
ee e
eee
e
e
e
ee
eee e
e
ee
e
e
e
e
e
b
b
bb b
b
bb
bb
b
bb
b
b
b
bbb
b
b bb
b
b
bb
b
b
bb
b
b
b
b
b
b
b
b
b
b
b
b
bb
bb
bb
b
b
bb
bb
b
bb
b
b
b
b
b
b
bb
bb
b
b
f
f fff f
f
f
f
fff
ff f
ffff
ff
ff
f ff
ff
fff
f
ff
f
f
f
f
f
f
ff
ff f
f
f
ff f
ff
f
ff
f
ff
ff
f
f f
ff
f ff ffp
p
p
p
p ppp
p
pp
ppp
pp
pp
p
pp
p p
pp
p
p
p
p
p
p pp ppp
p
pp
pp
p
pp
pp p
p
pp
pp
pp
p
pppppp
pp
p
p
pp
p
pp
yy
y
yy y
y
y
y
yy y
y
y
y
y
yy
yy
y
y
yy y
y
y
y
y
y
y
y
yy
y y
y
y y y
y
y
y
y
yy
yy
y
y
y
y
y
y
y
y
y
y
yy
y y
y
y
y
y
y yy
y
cc
c
c
cc
ccc
cc
c
c
ccc
c
c
c
c
c
c cc
c
c
c
cc
c
c
cc
c
c
c
ccc
c
c
cc
c
ccc c
cc
c
c
c
cc
c
cc
c c
c
cc
c
c
ccc cc
mm
mm m
mm
m
m
m
m
mm
m
m
m
m
m
m
m
m
m
m mm
mm
mmm
m
m m m
mm
mm mm
m
mmm
mmm mm m
m
m
m mm
m
m
m
m
mmm
m
m
m
mm
mmm
p
b
c
y
em f
Data set 2
Fused• Data set 1 only: comp.sys.ibm.pc.hardware (‘p’) , sci_crypt (‘y’), soc.religion.christian (‘c’), talk.politics.misc (‘m’)• Data set 2 only: comp.sys.mac.hardware (‘a’), sci.med (‘d’), talk.religion.misc (‘r’), talk.politics.guns (‘g’)• Shared: rec.sport.baseball (‘b’), sci.electronics (‘e’), misc.forsale (‘f’)
Cluster Alignment: Label Matching and Space Alignment
Reference Un-Aligned
• InfoVis and VAST paper data set• Help refine cluster results and obtain consensus clustering
Graph vis
Social net.
Graph vis
High-dim vis
High-dim vis + applications
Aligned
AlignedReference Un-Aligned
Testbed Overview
Interactive visual analytics system for classification of high-dim. data (image, text, etc) and search space reduction
iVisClassifier(J. Choo, H. Lee, J. Kihm, HP, VAST10)
• Interactive visual document clustering system using topic modeling
• Refines clusters and supports hierarchical cluster structure in an interactive way
iVisClustering(H. Lee, J. Kihm, J. Choo,, J. Stasko, HP, EuroVis12)
Key Interactions with LDA: Topic Refinement and Noisy Data Filtering
After filtering noisy data
After topic refinement
Jigsaw• Combining computational text analysis (text mining) with interactive visualization• Placed system on web in Fall ‘12 where anyone can download it http://www.cc.gatech.edu/gvu/ii/jigsaw
• Created video tutorials• Many sample data sets provided
• Working on opening up architecture
• Interviewed six people who had been using Jigsaw for 2-14 months
• fraud, law enforcement, intelligence analysis, research
• Understand how they are using Jigsaw, different domains• Learn about strengths of system and its limitations
Case Study of System Usage(Y. Kang & J. Stasko, VAST12)
VisIRR: Visual Information Retrieval and Recommendation System for Document Discovery
Improves personalization and understandability viaintegrated visualizations of
document retrieval and recommendation
• Visual IR: beyond Google-like keyword search:– See more documents– See relationships : topical, inter-document– Whole content -based, not keyword-based
• Visual Recommendation: enables discovery– Personalized based on user feedback, persistent– Understand “why” due to visualized relationships
• Only possible due to new/fast ML algorithms
Related workCommercial tools for researchers:�Mendeley - www.mendeley.com – Free reference manager and PDF organizer
– Offline client, personal homepage, social features (Community, follow researchers etc), recommendation engine (people/paper), plug in support. Naive collaborative filtering based recommendations, no visualization.
�Arnetminer - www.arnetminer.com – Academic researcher Social network search
– Metrics (uptrend, longevity, diversity etc), Authorship Network. Author specific website. Research is beyond only authors.
�Microsoft academic – academic.research.microsoft.com – A free academic search engine
– Innovative ways to explore academic publications, authors, conferences, journals, organizations and keywords, connecting millions of scholars, students, librarians, and other users, very rich visualization features. Very limited set of domains (only for computer science), No social features
�Google scholar – scholar.google.com – Search engine for scholarly articles.
– A simple search interface to search all scholarly articles, multiple disciplines, multiple sources (books, patents, articles, university websites, etc). No recommendation, No visualization, Irrelevant search results, Very limited research specific information (number of citations alone).
�Braque.cc - informs researchers of others' research. - Academic launch. Not successful commercially.
�Exlibris - BxRecommenderSystems - http://www.exlibrisgroup.com/category/bXOverview -Discovery, Distribution and management of print/electronic and digital materials. Recommendations for librarians.
– Official librarian tool, encompasses huge repository of data from all the university libraries, Web based tool, multi disciplinary recommendations. No author level information (h-index), journal/conference level (rating of the journal), paper specific information(citation etc). Even though recommends papers, does not provide the statistics that researchers relies up on.
Related workCommercial conference management systems:Web based software that supports organization of scientific conferences.
•Easychair•Confmaster.net
•CMT – Microsoft academic conference management site
•Openconf•PCS
Academic systems:•C. Basu, H. Hirsh, W. Cohen, and C. Nevill-Manning. Technical paper recommendation: a study in combining multiple information sources. Journal of AI Research, pages 231–252, 2001•D. Mimno and A. McCallum. Expertise modeling for matching papers with reviewers. In Proc. KDD’07, pages 500–509, 2007
•J. Goldsmith and R. Sloan. The conference paper assignment problem. In Proc. AAAI Workshop on Preference Handling for Artificial Intelligence, 2007•C. J. Taylor. On the optimal assignment of conference papers to reviewers. Technical Report MS-CIS-08-30, University of Pennsylvania, 2008
•Don Conry, Yehuda Koren, Naren Ramakrishnan: Recommender systems for the conference paper assignment problem. RecSys 2009: 357-360
•Naveen Garg, Telikepalli Kavitha, Amit Kumar, Kurt Mehlhorn, Julián Mestre: Assigning Papers to Referees. Algorithmica 58(1): 119-136 (2010)•Chong Wang, David M. Blei: Collaborative topic modeling for recommending scientific articles. KDD 2011: 448-456
Our differentiators• Visual big-picture interface
– See more documents: utilize screen space limit better– See relationships : inter-paper, topic/clusters, – Relevance based on full content , not just a few keywords
• Personalized and persistent– Feedback from the user– Machine learning under the hood:
1. 2-d projection2. topical clustering3. recommendation4. neighborhood graphs5. classification
• Interactive speed : Only possible due to fast algorithms
VisIRRAn interactive visual information retrieval and recommender system for large-scale document data
•
Visualization Example of Queried Set
Clear topics
Computational zoom -in
QueryKeyword query, ‘dimension reduction’
Recommendation ExamplePreference-assigned item as ‘highly like’ : ‘Enhancing the visualization process with principal component analysis to support the exploration of trends’
Recommended docswith re-clustering
Recommended docs in existing view (in rectangles)
Features•Dynamic query-retrieval
• Keyword search on contents such as title, abstract, and keywords as well as author and venue fields
• Filtering on year, citation/reference count• Different queries created either separately or jointly with their own
visualization snapshots•Interactive visualization
• Multiple visualizations via dimension reduction (for 2d coordinate) and clustering (for color-coded summary) on dynamically retrieved sets
• Support for easy comparison between views via clustering and dimension reduction alignment
•Preference feedback and recommendation• Document preference assigned by users• Recommendation performed based on document contents, citation,
or co-authorship information• Recommended items projected into the same space along with their
predicted cluster labels
Large -scale Data Collection/Ingestion•Data collection
• Starting with DBLP data set (432,605 data items)• Data cleanup and missing value handling via Microsoft Academic
Search API• Title, author, year, venue, abstract, keywords, citation/reference
count, and citation network info•Data management
• Structured information stored in database• Term-document information pre-computed• Top K Cosine similarity pre-computed• Citation network and co-authorship network pre-built• Scalable streaming data handling with efficient update in O(n)
•Dynamic memory loading• Document information dynamically loaded on the fly depending on
user queries/interactions• Cache-like memory management using “least recently used”
approach
Graph -based Recommendation•Various recommendation schemes
• Content-, co-authorship-, and citation-based recommendation supported
•Heat-kernel-based propagation algorithm• Weighted graphs as an input (for content-based, k-NN cosine-
similarity graph)• User preference propagated efficiently on large-scale sparse graphs
rα
= α ∑k (1- α)kfWk
• rα
is a recommendation score vector with a control parameter α, and f is a user-assigned rating, and W is an input graph.
•Embedding on existing visualization• Out-of-sample embedding into previously computed dimension
reduction• Color-coding using k-NN classification on previous clusters
Summary
• Foundational algorithms for visual representations of high dimensional, large scale, heterogeneous data (dimension reduction, clustering,
space alignment)• Fast algorithms for real time interaction• Development of VA testbed• Development of proof-of-concept VA system