TRANSCRIPT
Amr Ahmed, Thesis Proposal
Modeling Users and Content: Structured Probabilistic Representation and Scalable Online Inference Algorithms
This thesis is about document collections. They are everywhere, and they cover many domains.
• Research Publications: conference proceedings, journal transactions, ArXiv, PubMed Central
• News and Social Media: Yahoo! News, Google News, CNN, BBC
• Blogs: Daily Kos, Red State
[Motivating examples]
Temporal Dynamics: topics in research publications (CS, Bio, Phy) rise and fall over time; a news story evolves over time, e.g. the BP oil spill ("BP wasn't prepared for an oil spill at such depths"; BP: "We will make this right.").
Structural Correspondence: the same issue is addressed differently across communities, e.g. "Ban abortion with Constitutional amendment" vs. "Choice is a fundamental, constitutional right".
Thesis Question
• How to build a structured representation of document collections that reveals
– Temporal Dynamics: how ideas/events evolve over time
– Structural Correspondence: how ideas are addressed across modalities and communities
Thesis Approach
• Models
– Probabilistic graphical models: topic models and non-parametric Bayes
– Principled, expressive, and modular
• Algorithms
– Distributed, to deal with large-scale datasets
– Online, to update the representation as new data arrives
Outline
• Background
• Temporal Dynamics
– Timelines for research publications
– Storylines from news streams
– User interest-lines
• Structural Correspondence
– Across modalities
– Across ideologies
What is a Good Model for Documents?
• Clustering: the mixture-of-unigrams model
• How do we specify a model? With a generative process:
– Assume some hidden variables
– Use them to generate the documents
• Inference: invert the process
– Given the documents, recover the hidden variables

Mixture of Unigrams: Generative Process
- For each document w_i:
- Sample a cluster c_i ~ Multi(pi)
- Sample each word of w_i ~ Multi(phi_{c_i})

Is this a good model for documents? Only when documents are single-topic, which is not true in our settings.
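The mixture-of-unigrams generative process above can be sketched in a few lines. This is a minimal illustration: the vocabulary size, number of clusters, and Dirichlet hyper-parameters are assumed values, not numbers from the proposal.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K = 1000, 5                             # vocabulary size and number of clusters (assumed)
pi = rng.dirichlet(np.ones(K))             # cluster proportions pi
phi = rng.dirichlet(np.ones(V), size=K)    # one unigram distribution phi_k per cluster

def generate_document(n_words):
    """Mixture of unigrams: one cluster per document, words i.i.d. given it."""
    c = rng.choice(K, p=pi)                        # c_i ~ Multi(pi)
    words = rng.choice(V, size=n_words, p=phi[c])  # each word ~ Multi(phi_{c_i})
    return c, words

c, words = generate_document(50)
```

Note that the single draw of `c` per document is exactly the "single-topic" assumption criticized above.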
What Do We Need to Model?
• Q: What is this paper about?
• A: Mainly MT, with syntax, and some learning.

"A Hierarchical Phrase-Based Model for Statistical Machine Translation"
We present a statistical phrase-based translation model that uses hierarchical phrases: phrases that contain sub-phrases. The model is formally a synchronous context-free grammar but is learned from a bitext without any syntactic information. Thus it can be seen as a shift to the formal machinery of syntax-based translation systems without any linguistic commitment. In our experiments using BLEU as a metric, the hierarchical phrase-based model achieves a relative improvement of 7.5% over Pharaoh, a state-of-the-art phrase-based system.

[Figure: the abstract's words grouped into three topics: MT (source, target, SMT, alignment, score, BLEU), Syntax (parse, tree, noun, phrase, grammar, CFG), Learning (likelihood, EM, hidden, parameters, estimation, argmax)]
Topic Models (Mixed-Membership Models)
• Each topic is a unigram distribution over the vocabulary (phi_1 ... phi_k)
• Each document w_i has a mixing proportion over topics (theta_1 ... theta_k), e.g. 0.6 / 0.3 / 0.1
Example: the same abstract, now modeled as a mixture of topics rather than a single cluster.

Generative Process (plate diagram: theta, z, w; plates over N words, D documents, K topics; a prior over theta)
- For each document d:
- Sample theta_d ~ Prior
- For each word w in d:
- Sample z ~ Multi(theta_d)
- Sample w ~ Multi(phi_z)
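A minimal sketch of this generative process, with LDA's Dirichlet prior plugged in for the generic `Prior`; the sizes and hyper-parameters are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, alpha, beta = 1000, 10, 0.1, 0.01        # assumed sizes and hyper-parameters
phi = rng.dirichlet(np.full(V, beta), size=K)  # topic-word distributions phi_k

def generate_document(n_words):
    """LDA: per-document topic mixture, one topic indicator per word."""
    theta = rng.dirichlet(np.full(K, alpha))   # theta_d ~ Dirichlet(alpha)
    z = rng.choice(K, size=n_words, p=theta)   # z ~ Multi(theta_d)
    w = np.array([rng.choice(V, p=phi[zi]) for zi in z])  # w ~ Multi(phi_z)
    return theta, z, w

theta, z, w = generate_document(100)
```

Unlike the mixture of unigrams, every word gets its own topic draw `z`, which is what makes the model mixed-membership.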
Topic Models: Design Choices
• Prior over the topic vector
– Latent Dirichlet Allocation (LDA)
– Correlated priors (CTM)
– Hierarchical priors
• Topics
– Unigrams, bigrams, etc.
• Document structure
– Bag of words
– Multi-modal
– Side information
Outline (recap): Background; Temporal Dynamics (timelines for research publications, storylines from news streams, user interest-lines); Structural Correspondence (across modalities, across ideologies)
[Figure: research papers from 1900 to 2009, with topics (CS, Bio, Phy) rising and falling over time]

Problem Statement
• Given the stream of research papers, discover the topics:
– A potentially infinite number of topics
– With time-varying trends
– And time-varying distributions
– And variable durations
• Topics can die, and new topics can be born
The Big Picture
Two axes: Time (does the model evolve over time?) and Model Dimension (is the number of topics fixed or infinite?).
Quadrants: LDA (fixed K, static); dynamic clustering / Dynamic LDA (fixed K, evolving); HDPM (infinite, static); Infinite Dynamic Topic Models (infinite, evolving).
LDA: The Generative Process
- For each document d:
- Sample theta_d ~ Dirichlet(alpha)
- For each word w in d:
- Sample z ~ Multi(theta_d)
- Sample w ~ Multi(phi_z)

Open questions: Do topics' distributions evolve over time? Do topics' trends evolve over time? Does the number of topics grow with the data?
Dynamic LDA: The Generative Process (epoch 1)
- For each document d:
- Sample theta_d ~ Normal(alpha_1, lambda * I)
- For each word w in d:
- Sample z ~ Multi(Logistic(theta_d))
- Sample w ~ Multi(Logistic(phi_z))

The logistic transformation maps the natural (Gaussian) parameters onto the simplex; it is necessary to let trends evolve.
Dynamic LDA: The Generative Process (chaining epochs)
- alpha_t ~ Normal(alpha_{t-1}, sigma * I)
- phi_{k,t} ~ Normal(phi_{k,t-1}, rho * I)
- For each document d at epoch t:
- Sample theta_d ~ Normal(alpha_t, lambda * I)
- For each word w_{d,i} in d:
- Sample z_{d,i} ~ Multi(Logistic(theta_d))
- Sample w_{d,i} ~ Multi(Logistic(phi_{z_{d,i}}))
The chain unrolls over epochs t = 1, ..., T: each epoch has its own alpha_t and phi_{k,t}, evolving from the previous epoch's values.

Dynamic LDA answers the first two questions: topics' distributions and topics' trends evolve over time. It does not answer the third: the number of topics K is fixed and cannot grow with the data.
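A sketch of the epoch-to-epoch evolution just described. The sizes and variances are assumed for illustration, and the softmax here plays the role of the logistic transformation.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, T = 1000, 10, 5                 # assumed sizes
sigma, rho, lam = 0.1, 0.05, 1.0      # assumed evolution/document variances

def logistic(x):
    """Map natural parameters to the simplex (softmax)."""
    e = np.exp(x - x.max())
    return e / e.sum()

alpha = np.zeros((T, K))
phi = np.zeros((T, K, V))
for t in range(1, T):
    alpha[t] = rng.normal(alpha[t - 1], sigma)   # alpha_t ~ N(alpha_{t-1}, sigma I)
    phi[t] = rng.normal(phi[t - 1], rho)         # phi_{k,t} ~ N(phi_{k,t-1}, rho I)

def generate_document(t, n_words):
    theta = rng.normal(alpha[t], lam)            # theta_d ~ N(alpha_t, lambda I)
    z = rng.choice(K, size=n_words, p=logistic(theta))
    return np.array([rng.choice(V, p=logistic(phi[t, zi])) for zi in z])

doc = generate_document(t=3, n_words=50)
```

The Gaussian chains over `alpha` and `phi` are what let topic trends and topic distributions drift smoothly across epochs.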
The Chinese Restaurant Franchise Process
• The HDP mixture (HDPM) automatically determines the number of topics in LDA.
• We will focus on the Chinese Restaurant Franchise construction: a set of restaurants sharing a global menu.
• Metaphor:
– Restaurant = document
– Customer = word
– Dish = topic
– Global menu = set of topics (phi_1, phi_2, phi_3, phi_4, ...)
• In the figure: customers at the same table share the same dish; m_1 is the number of tables (across restaurants) serving dish 1; phi_4 is the distribution for topic 4.
The Chinese Restaurant Franchise Process: Generative Process
- For customer w entering restaurant 3:
- Choose an existing table j with probability proportional to N_j (the number of customers already at table j); then w ~ Multi(Logistic(phi_k)) for the dish k served at that table (e.g. w ~ Multi(Logistic(phi_3))).
- Or choose a new table with probability proportional to alpha; then sample a dish for the new table from the global menu:
- an existing dish k, with probability proportional to m_k (the number of tables serving it), or
- a new dish, with probability proportional to gamma; the new dish is drawn fresh from the base measure, phi_5 ~ H, and w ~ Multi(Logistic(phi_5)).
- After a new dish is created, the global menu grows: phi_1, phi_2, phi_3, phi_4, phi_5.
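The seating scheme above can be sketched as follows; the concentration parameters are assumed values, and the dish contents (the phi distributions) are omitted so the sketch shows only the table/dish bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, gamma = 1.0, 1.0   # assumed concentration parameters

menu_counts = []          # m_k: number of tables (all restaurants) serving dish k

def seat_customer(tables):
    """One step of the Chinese Restaurant Franchise.

    `tables` is a list of [n_customers, dish_id] for one restaurant;
    returns the dish assigned to the new customer."""
    weights = np.array([n for n, _ in tables] + [alpha], dtype=float)
    choice = rng.choice(len(weights), p=weights / weights.sum())
    if choice < len(tables):                 # join existing table, prob ∝ N_j
        tables[choice][0] += 1
        return tables[choice][1]
    # New table, prob ∝ alpha: pick its dish from the global menu.
    dish_w = np.array(menu_counts + [gamma], dtype=float)
    dish = int(rng.choice(len(dish_w), p=dish_w / dish_w.sum()))
    if dish == len(menu_counts):             # brand-new dish, prob ∝ gamma
        menu_counts.append(0)
    menu_counts[dish] += 1                   # one more table serves this dish
    tables.append([1, dish])
    return dish

restaurants = [[] for _ in range(3)]
for _ in range(200):
    seat_customer(restaurants[rng.integers(3)])
```

Because dishes are chosen proportionally to `m_k` across all restaurants, popular topics are shared between documents, and the menu (number of topics) grows with the data.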
HDPM answers the third question: the number of topics grows with the data. But in a static HDPM, topics' distributions and trends do not evolve over time. The remaining quadrant of the big picture combines both:
Infinite Dynamic Topic Models
Recurrent Chinese Restaurant Franchise Process
Epoch 1: documents are generated as before (static CRF), yielding a global menu phi_{1,1}, ..., phi_{5,1}.
Topics at the end of epoch 1:
- the height m_{k,1} represents topic k's popularity (number of tables serving it);
- phi_{k,1} represents topic k's distribution.
Observations:
- topics popular at epoch 1 are likely to be popular at epoch 2;
- phi_{k,2} is likely to evolve smoothly from phi_{k,1}.
At epoch 2 the menu is therefore initialized with pseudo-counts: m'_{k,2} = decay factor * m_{k,1}.
Recurrent Chinese Restaurant Franchise Process: Generative Process (epoch 2)
- For customer w in restaurant 1:
- [as in the static case] choose an existing table j with probability proportional to N_j, or a new table with probability proportional to alpha.
- For a new table, sample its dish:
- an existing, already-inherited dish k, with probability proportional to m'_{k,2} + m_{k,2};
- an existing but not-yet-inherited dish k, with probability proportional to m'_{k,2}; it then becomes a real dish served at epoch 2, with phi_{k,2} ~ Normal(phi_{k,1}, rho * I) (e.g. phi_{3,2} ~ Normal(phi_{3,1}, rho * I), and later phi_{1,2} ~ Normal(phi_{1,1}, rho * I));
- a new dish, with probability proportional to gamma; then phi_new ~ H (e.g. phi_{6,2} ~ H).
Proceeding to epoch 3, the menu evolves again: topics whose inherited pseudo-counts are never used die out, and newly born topics (such as phi_{6,2}) carry forward. This answers all three questions: topics' distributions evolve over time, topics' trends evolve over time, and the number of topics grows with the data.

We just described a first-order RCRF process; in a general D-order process, the pseudo-counts at an epoch aggregate decayed counts from the previous D epochs.
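A small sketch of how the epoch-level pseudo-counts might be carried forward in the first-order case; the decay factor and counts are assumed values for illustration.

```python
import numpy as np

decay = 0.5   # assumed decay factor

def epoch_prior_counts(prev_real_counts):
    """Pseudo-counts m'_{k,t} inherited from epoch t-1 (first-order RCRF)."""
    return decay * np.asarray(prev_real_counts, dtype=float)

def dish_probabilities(pseudo, real, gamma=1.0):
    """P(dish for a new table): ∝ m'_{k,t} + m_{k,t} for existing dishes,
    ∝ gamma for a brand-new dish (appended as the last entry)."""
    weights = np.append(np.asarray(pseudo) + np.asarray(real), gamma)
    return weights / weights.sum()

m_prev = [10, 4, 1]                   # m_{k,1}: tables per dish at epoch 1 (assumed)
pseudo = epoch_prior_counts(m_prev)   # m'_{k,2} = 0.5 * m_{k,1}
real = np.zeros(3)                    # no tables seated yet at epoch 2
p = dish_probabilities(pseudo, real)
```

Topics popular at the previous epoch start the new epoch with high prior mass, while a topic whose pseudo-count is never reinforced fades away, matching the birth/death behavior described above.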
Inference
• Gibbs sampling:
– Sample a table for each word
– Sample a topic for each table
– Sample the topic parameters over time
– Sample hyper-parameters
• Dealing with non-conjugacy: Algorithm 8 of Neal (1998), combined with Metropolis-Hastings
• Efficiency: the Markov blanket contains only the previous and following D epochs
Sampling a Topic for a Table
The conditional probability of a table's dish factors into three parts:
- Past: the prior contribution from the previous epochs (the inherited pseudo-counts and topic chains);
- Emission: the likelihood of the words at the table under the candidate topic;
- Future: the effect of the choice on the following epochs.
For a brand-new dish, phi ~ H = N(0, sigma * I), weighted by gamma.
Non-conjugacy is handled with Metropolis-Hastings; for efficiency, the past and future contributions are pre-computed and updated incrementally.
Sampling Topic Parameters
• The emission is v | phi ~ Multi(Logistic(phi)): a linear state-space model (phi_1 -> phi_2 -> ... -> phi_T) with non-Gaussian emissions.
• Use a Laplace approximation inside the forward-backward algorithm.
• Use the resulting distribution as a proposal in a Metropolis-Hastings step.
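The Laplace-approximation idea can be illustrated in one dimension: find the mode of a non-Gaussian posterior by Newton's method and fit a Gaussian there, which then serves as the proposal. This is only a 1-D illustration with a logistic-binomial emission and made-up numbers, not the thesis's multivariate implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def laplace_approx(mu, s2, c, n, iters=50):
    """Gaussian (Laplace) approximation to a 1-D posterior with Gaussian
    prior N(mu, s2) and a logistic-binomial emission (c successes of n).

    Returns (mode, variance) of the approximating Gaussian, usable as a
    Metropolis-Hastings proposal."""
    phi = mu
    for _ in range(iters):
        g = -(phi - mu) / s2 + c - n * sigmoid(phi)            # grad of log-posterior
        h = -1.0 / s2 - n * sigmoid(phi) * (1 - sigmoid(phi))  # Hessian (negative)
        phi -= g / h                                           # Newton step
    return phi, -1.0 / h

mode, var = laplace_approx(mu=0.0, s2=1.0, c=30, n=100)
```

Because the log-posterior here is concave, Newton's method converges to the unique mode, and the negative inverse Hessian gives the proposal variance.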
Experiments
• Simulated data: 20 epochs with 100 data points in each epoch
• Timeline of the NIPS conference: 13 years, 1,740 documents, ~950 words per document, vocabulary of ~3,500 words
Simulation Experiment
[Figure: ground-truth topic trends (topic index 1-9 vs. time over 20 epochs) alongside the recovered trends and sample documents.]
[Figure: NIPS topic timeline. Topics present in 1987: speech, neuroscience, NN, classification, methods, control, probabilistic models, image, SOM, RL, Bayesian, mixtures, generalization. Later births: boosting (1990), clustering (1991), ICA (1995), kernels (1996); memory (until 1994). Topics alive at the end include speech, kernels, ICA, probabilistic models, classification, mixtures, control.]
[Topic evolution: Probabilistic Models, 1987-1999]
1987: field, code, temperature, tree, boltzmann, energy, annealing, node, probability
1990: field, tree, level, energy, probability, node, annealing, boltzmann, variables
1993: tree, variables, node, level, probability, field, distribution, structure, graph, energy
1996: variables, graph, tree, probability, field, structure, node, distribution, energy
1999: probability, variables, tree, field, distribution, graph, nodes, belief, node, inference, propagation
[Topic evolution: Mixtures, 1990-1999]
1990: em, expert, mixture, gating, missing, experts, gaussian, parameters, density
1994: mixture, em, likelihood, missing, experts, mixtures, gaussian, parameters
1999: mixture, gaussian, em, likelihood, parameters, analysis, density, factor, variables, distribution
[Topic evolution: ICA, 1995-1999]
1995: wavelet, natural, separation, source, ica, coefficients, independent, basis
1999: source, ica, blind, separation, coefficients, natural, independent, basis, wavelet
[Topic evolution: Methods (optimization)]
Early: method, solution, energy, values, gradient, convergence, equation, algorithms
Then: gradient, weight, method, methods, local, rate, optimal, descent, solution
Then: gradient, matrix, weight, algorithms, local, rate, problems, point, equation
Late: matrix, algorithms, gradient, convergence, equation, optimal, method, parameter
[Topic birth: Kernels, 1996-1999]
1996: support, kernel, svm, regularization, sv, vectors, feature, regression
1997: kernel, support, sv, svm, machines, regression, vapnik, feature, solution
1998: kernel, svm, support, regression, feature, machines, solution, margin, pca
1999: kernel, svm, support, regression, solution, machines, matrix, feature, regularization

Representative papers:
- "Support Vector Method for Function Approximation, Regression Estimation, and Signal Processing", V. Vapnik, S. E. Golowich and A. Smola
- "Support Vector Regression Machines", H. Drucker, C. Burges, L. Kaufman, A. Smola and V. Vapnik
- "Improving the Accuracy and Speed of Support Vector Machines", C. Burges and B. Scholkopf
- "From Regularization Operators to Support Vector Kernels", A. Smola and B. Schoelkopf
- "Prior Knowledge in Support Vector Kernels", B. Schoelkopf, P. Simard, A. Smola and V. Vapnik
- "Uniqueness of the SVM Solution", C. Burges and D. Crisp
- "An Improved Decomposition Algorithm for Regression Support Vector Machines", P. Laskov
- ... and many more
Quantitative Analysis
[Figure: analyzing the NIPS corpus; panels (a)-(c) compare a start state against a posterior sample.]
Outline (recap): Background; Temporal Dynamics (timelines for research publications, storylines from news streams, user interest-lines); Structural Correspondence (across modalities, across ideologies)
Problem Statement
• Rapid growth of social media and news outlets, with lots of redundancy
• How do we get the big picture?
– What are the stories?
– Who are the main entities?
– When and how do they develop over time?
– How are they categorized (sports, economics, etc.)?

Proposed Solution
• Topic models discover long-term, high-level themes (sports, health, politics)
• Dynamic clustering discovers short-term, ephemeral themes (a cricket match, the SARS epidemic)
• Inference: an online algorithm using Sequential Monte Carlo
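A toy sketch of the Sequential Monte Carlo idea for assigning streaming documents to stories: each particle carries one hypothesis over story assignments, is advanced per document, and the population is resampled by weight. The CRP-style prior and the uniform likelihood are stand-ins for the thesis's actual model.

```python
import copy
import numpy as np

rng = np.random.default_rng(0)

class Particle:
    """One hypothesis over story assignments (CRP-style prior, assumed)."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.story_counts = []   # number of documents per story
        self.weight = 1.0

def story_prior(p):
    """Prior over existing stories plus one brand-new story."""
    w = np.array(p.story_counts + [p.alpha], dtype=float)
    return w / w.sum()

def step(particles, doc_likelihood):
    """Advance every particle by one document, then resample by weight.

    `doc_likelihood(k)` returns the new document's likelihood under story k
    (the last index means a brand-new story)."""
    for p in particles:
        prior = story_prior(p)
        like = np.array([doc_likelihood(k) for k in range(len(prior))])
        post = prior * like
        p.weight *= post.sum()
        k = int(rng.choice(len(post), p=post / post.sum()))
        if k == len(p.story_counts):
            p.story_counts.append(0)
        p.story_counts[k] += 1
    w = np.array([p.weight for p in particles])
    idx = rng.choice(len(particles), size=len(particles), p=w / w.sum())
    resampled = [copy.deepcopy(particles[i]) for i in idx]
    for p in resampled:
        p.weight = 1.0
    return resampled

particles = [Particle() for _ in range(20)]
for _ in range(10):                              # stream of 10 documents
    particles = step(particles, lambda k: 1.0)   # uniform likelihood stand-in
```

The key property for the online setting is that each document is processed once, in order, with bounded work per document.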
Preliminary Results
Long-term topics:
- Sports: games, won, team, final, season, league, held
- Politics: government, minister, authorities, opposition, officials, leaders, group
- Accidents: police, attack, run, man, group, arrested, move

Short-term storylines (words; entities):
- Border-Tension: nuclear, border, dialogue, diplomatic, militant, insurgency, missile; Pakistan, India, Kashmir, New Delhi, Islamabad, Musharraf, Vajpayee
- UEFA-soccer: champions, goal, leg, coach, striker, midfield, penalty; Juventus, AC Milan, Real Madrid, Milan, Lazio, Ronaldo, Lyon
- Tax-bills: tax, billion, cut, plan, budget, economy, lawmakers; Bush, Senate, US, Congress, Fleischer, White House, Republican
Structure Browsing
- Middle-east-conflict storyline (peace, roadmap, suicide, violence, settlements, bombing; Israel, Palestinian, West Bank, Sharon, Hamas, Arafat): "more like this story", based on topics
- Nuclear-programs storyline (nuclear, summit, warning, policy, missile, program; North Korea, South Korea, U.S., Bush, Pyongyang): retrieved via nuclear + topics [politics]
- Example structured queries: India in any topic; Pakistan in any topic; India and Pakistan in any topic; ...
Outline (recap): Background; Temporal Dynamics (timelines for research publications, storylines from news streams, user interest-lines); Structural Correspondence (across modalities, across ideologies)
Modeling Dynamic User Intent
• How do we model users' intents?
– Long-term, short-term, and spurious
• Input: queries issued by the user; documents viewed by the user
• Output: a dynamic distribution over intents
The Big Picture
A user's query stream over time: car deals, van; job, hiring, diet; hiring, salary, diet, calories; auto, price, used, inception; flight, London, hotel, weather; movies, theatre, art, gallery; diet, calories, recipe, chocolate; school, supplies, loan, college.
These are grouped into dynamic intents: Cars, Jobs, Diet, Travel, Art, College, Finance.
Highlights
• Applications
– Behavioral targeting: matching users to ads
– But you can also match users to stories or new research papers
• Challenges
– Large scale: ~35M users
– Incremental data
Outline (recap): Background; Temporal Dynamics (timelines for research publications, storylines from news streams, user interest-lines); Structural Correspondence (across modalities, across ideologies)
Biological Images
• High-throughput devices have proliferated in recent years
• An important source of information for biologists
• A pressing need to manage and organize this information for retrieval and visualization tasks
• Figures are embedded within research papers and pose challenges to mainstream text-image systems
[Examples: FMI images and gel images within papers]
Biological Figures Are Challenging
• Hierarchical organization: multiple panels, with image labels and image pointers
• Scoped captions and a global caption
• Protein annotations and free-text annotations
By contrast, mainstream image retrieval datasets pair whole images with flat keyword captions, e.g. "market, people"; "Scotland, water"; "bridge, sky, water"; "fish, water"; "clouds, jet, plane".
The Big Picture
A query-handling module supports queries such as "mice + antibodies", "cancer + tubulin", or "Actin", feeding image retrieval, textual retrieval, multi-modal retrieval, annotation, and visualization.

High-Level Overview: Tasks
• High-level overview: summary
• Retrieval across modalities: image retrieval; text-based retrieval; text + protein based retrieval; annotation
• Mixed granularity: input can be either a panel or a figure, and so can the output
Why Are Queries Hard?
• What if I only want to retrieve figures that address the role of vha-8 during the larva stage, and it is only addressed in panel E?
• How can we compare figures with vastly different numbers of panels, e.g. the same study at different time resolutions?
The Big Picture: Extraction System
Example caption text: "Affinity-purified rabbit anti-rmnp 41 antibodies ... Monoclonal anti-cPABP antibodies ... Double immunofluorescence confocal microscopy using mAB against cPABP ... and the bound antibodies were visualized ..."
The extraction system produces the scoped captions, the global caption, and the protein entities:
- Segment the figure into panels
- Detect panel image pointers: a, b
- Detect mentions of pointers in the text, like "(a)"
- Match image pointers to text labels (CRF)
- Detect named entities in the text
- See paper for references
The Big Picture (full pipeline)
The extraction system feeds a topic-modeling stage, which produces a semantic representation (at both figure and panel granularity) and learnt topics (Topic 1 ... Topic K) for visualization. The query-handling module then operates across modality and granularity over this representation, serving image retrieval, textual retrieval, multi-modal retrieval, and annotation for queries like "mice + antibodies", "cancer + tubulin", or "Actin".
Topic Models
• Each topic has a triplet of distributions:
– a multinomial distribution over words
– a multinomial distribution over protein mentions
– a Gaussian distribution over image features (texture and histograms)
• Each topic models the correspondence between its facets (shown with each topic's top panels and features, Feature 1 ... Feature M)
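The triplet structure can be sketched as a small class; the sizes are assumed, and the emissions are simplified (e.g. a diagonal Gaussian) relative to the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration only
V_WORDS, V_PROTEINS, IMG_DIM, K = 500, 50, 16, 8

class FacetedTopic:
    """One topic as a triplet of distributions, as described above."""
    def __init__(self):
        self.word_dist = rng.dirichlet(np.ones(V_WORDS))        # multinomial over words
        self.protein_dist = rng.dirichlet(np.ones(V_PROTEINS))  # multinomial over proteins
        self.img_mean = rng.normal(size=IMG_DIM)                # Gaussian over image features
        self.img_var = np.ones(IMG_DIM)

    def sample_panel(self, n_words=20, n_proteins=2):
        """Generate one panel's facets from the same topic; drawing all three
        facets from one topic is what models their correspondence."""
        words = rng.choice(V_WORDS, size=n_words, p=self.word_dist)
        proteins = rng.choice(V_PROTEINS, size=n_proteins, p=self.protein_dist)
        features = rng.normal(self.img_mean, np.sqrt(self.img_var))
        return words, proteins, features

topics = [FacetedTopic() for _ in range(K)]
words, proteins, features = topics[0].sample_panel()
```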
Structured Correspondence LDA
[Plate diagram: a document-level topic vector theta ~ Dirichlet(alpha); per-panel topic indicators z generate each panel's image features (Gaussian parameters mu, sigma) and its scoped-caption words; annotation variables y couple protein mentions and free-text annotations to panel topics; plates over panels, words, features, and the K topics.]
Learnt Topics (Topic 1 ... Topic K), each with protein, word, and SLIF image-feature facets, plus a background topic; the background topic's weight depends on the panel's annotation ratio.

A Sample Topic: Tumorigenesis
[Figure: the topic's top panels and top proteins, which include known tumor suppressors, a gene coding for a protein with tumor-suppressing effect, and a member of the caspase family with a role in apoptosis (programmed cell death).]
Figure Embedding
[Figure: figures embedded in the learnt topic space; recap of the full pipeline from extraction through topic modeling to query handling.]
Protein Annotations
• Task: given a figure, produce a ranked list of proteins (e.g. Actin, mAB, Tubulin, Vhat-8, MTP-1, cPABP) and evaluate the ranking.
• How to rank: by similarity between the latent representation of the figure and the latent representation of each protein.
• How to evaluate the ranking: best rank, average rank, and rank at full recall.
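The three evaluation measures above are straightforward to compute from the positions of the gold proteins in the ranked list; the protein names here are placeholders.

```python
def ranking_metrics(ranked_proteins, true_proteins):
    """Best rank, average rank, and rank at full recall for one figure.

    `ranked_proteins` is the system's ranked list; `true_proteins` are the
    gold annotations. Ranks are 1-based; lower is better."""
    ranks = sorted(ranked_proteins.index(p) + 1 for p in true_proteins)
    best = ranks[0]                    # rank of the first correct protein
    average = sum(ranks) / len(ranks)  # mean rank of the correct proteins
    full_recall = ranks[-1]            # depth needed to retrieve all of them
    return best, average, full_recall

best, avg, full = ranking_metrics(
    ["actin", "tubulin", "cpabp", "mtp-1"], {"tubulin", "mtp-1"})
# best = 2, avg = 3.0, full = 4
```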
Text-Based Image Retrieval
• Input: words (w) + proteins (r); output: a ranked list of figures
– Using a query language model over the latent figure, word, and protein representations
• Measure precision-recall trade-offs
Transfer Learning from Partial Figures
Many figures are partial: the caption text and protein annotations survive, but not the image. Train jointly on full figures, modeling p(Image, Words, Proteins), and on partial figures, modeling q(Words, Proteins), with tied parameters.

Does it help? The partial figures yield a better marginal distribution over (Words, Proteins), which is lifted into a better joint distribution for the full model, improving protein annotation.
Outline (recap): Background; Temporal Dynamics (timelines for research publications, storylines from news streams, user interest-lines); Structural Correspondence (across modalities, across ideologies)
Problem Statement
Given collections written from different ideological perspectives, build a model that can answer the following:

Visualization
• How does each ideology view mainstream events?
• On which topics do they differ, and on which do they agree?

Classification
• Given a new news article or blog post, the system should decide from which side it was written, and justify its answer on a topical level (e.g. because its view on abortion coincides with the pro-choice stance).

Structured browsing
• Given a new news article or blog post, the user can ask for examples of other articles from the same ideology about the same topic, or for documents that exemplify alternative views from other ideologies.
Approach: Build a Factored Model
Each topic k has a shared distribution beta_k, plus ideology-specific views: phi_{1,k} for ideology 1 and phi_{2,k} for ideology 2. A document's words (w1, w2, ...) are drawn by combining the shared topic with its own ideology's view of that topic.
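A sketch of one way such a factored model could generate words; the mixing of the shared topic with the ideology view via a fixed weight is an assumed formulation for illustration, not the thesis's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K = 500, 4          # assumed vocabulary size and number of topics
mix = 0.7              # assumed weight on the shared topic vs. the ideology view

beta = rng.dirichlet(np.ones(V), size=K)        # shared topics beta_k
phi = rng.dirichlet(np.ones(V), size=(2, K))    # phi_{ideology, k}: ideology views

def word_distribution(ideology, k):
    """Word distribution for topic k as seen by one ideology: a convex
    mixture of the shared topic and the ideology-specific view."""
    return mix * beta[k] + (1 - mix) * phi[ideology, k]

def generate_document(ideology, n_words, alpha=0.5):
    theta = rng.dirichlet(np.full(K, alpha))
    z = rng.choice(K, size=n_words, p=theta)
    return np.array([rng.choice(V, p=word_distribution(ideology, zi)) for zi in z])

doc = generate_document(ideology=0, n_words=40)
```

The factorization is what supports all three tasks above: the shared `beta_k` gives the common topic, while the gap between `phi[0, k]` and `phi[1, k]` is exactly where the ideologies differ.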
Example: Bitterlemons Corpus (Palestinian view vs. Israeli view, as shown in the figure)

Topic: US role
- Shared: palestinian, israeli, peace, year, political, process, state, end, right, government, need, conflict, way, security, people, time, force, negotiation, occupation
- One view: bush, US, president, american, sharon, administration, prime, settlement, pressure, policy, washington, ariel, new, middle, unit, state, george, powell, minister, colin, visit, internal, statement, express, pro, previous, package, work, transfer, european
- Other view: arafat, state, leader, roadmap, george, election, month, iraq, week, peace, june, realistic, yasir, senior, involvement, clinton, november, post, mandate, terrorism

Topic: Roadmap process
- Shared: roadmap, phase, security, ceasefire, state, plan, international, step, authority, final, quartet, issue, map, effort
- One view: roadmap, end, settlement, implementation, obligation, stop, expansion, commitment, fulfill, unit, illegal, present, previous, assassination, meet, forward
- Other view: process, force, terrorism, unit, road, demand, provide, confidence, element, interim, discussion, want, union, succeed, point, build, positive, recognize, present, timetable

Topic: Arab Involvement
- Shared: syria, syrian, negotiate, lebanon, deal, conference, concession, asad, agreement, regional, october, initiative, relationship
- One view: track, negotiation, official, leadership, position, withdrawal, time, victory, present, second, stand, circumstance, represent, sense, talk, strategy, issue, participant, party, negotiator
- Other view: peace, strategic, plo, hizballah, islamic, neighbor, territorial, radical, iran, relation, think, obvious, country, mandate, greater, conventional, intifada, affect, jihad, time
Outline (recap): Background; Temporal Dynamics; Structural Correspondence; Summary and Timeline
Summary
• Topic models are a flexible framework, very useful if you
– care about the hidden structure,
– want to leverage the hidden structure in tasks for which you have few labels, or
– have partially labeled data (many-to-many).
• Bayesian and hierarchical models are not inherently slow: they can be scaled, and they can be made to work online.

Main Contributions
• Models: a time-varying non-parametric framework
• Inference: distributed incremental inference algorithms; online SMC algorithms
• Applications: research publications and social media
Thanks!
Questions?
Backup slides
Hyper-parameter Sensitivity
[Figure: the state-space chain phi_1 -> phi_2 -> ... -> phi_T with emissions v; and a plot of the weight given to past epochs as a function of their distance, for decay kernel widths 0.5, 1, 2, 4, and 6.]
Structured cLDA vs. cLDA
Can we use cLDA (correspondence LDA; Blei and Jordan, SIGIR 2003) instead? cLDA pairs one image with one flat caption (e.g. "market, people"), so biological figures would require either whole-caption replication across panels or scoped-caption replication, leading to over- or under-representation of the caption words. Either way we lose the structure and can no longer answer figure-level queries.
Mixtures and MM-Models
[Figure: two graphical models side by side. Mixture model: a single proportion vector pi selects one component phi_c per document. Mixed-membership model: per-document proportions theta mix all components phi_1 ... phi_k. These are two orthogonal dimensions: mixture vs. mixed-membership.]
Example Story
• Story: Obama's controversial pastor
– Topics: politics, religion, race
– Entities: Obama, Wright, Illinois

Storyline Models
• We could use clustering: each document belongs to a story (cluster). But this lacks global structure: what is shared across stories? How about story classification?
• We could use topic models, but they ignore the notion of a story (tightly focused, short-term); topics are high-level concepts (coarse-grained, long-term).