Amr Ahmed Thesis Proposal Modeling Users and Content: Structured Probabilistic Representation and Scalable Online Inference Algorithms

Upload: justyn-hodges

Post on 30-Mar-2015


Page 1: Amr Ahmed Thesis Proposal Modeling Users and Content: Structured Probabilistic Representation and Scalable Online Inference Algorithms

Amr Ahmed

Thesis Proposal

Modeling Users and Content: Structured Probabilistic Representation and Scalable Online Inference Algorithms

Page 2

This thesis is about document collections

• They are everywhere
• They cover many domains

Page 3

Research Publications

Social Media

Conference proceedings

Journal transactions

ArXiv

PubMed Central

Yahoo! news

Google news

CNN

BBC

Blogs

Daily KOS

Red state

Page 5

Thesis Question

• How to build a structured representation of document collections that reveals
– Temporal Dynamics
  • How ideas/events evolve over time
– Structural Correspondence
  • How ideas are addressed across modalities and communities

Page 6

Thesis Approach

• Models
– Probabilistic graphical models
  • Topic models and non-parametric Bayes
– Principled, expressive and modular
• Algorithms
– Distributed
  • To deal with large-scale datasets
– Online
  • To update the representation with new data

Page 7

Outline

• Background
• Temporal Dynamics
– Timelines for research publications
– Storylines from news streams
– User interest-lines
• Structural Correspondence
– Across modalities
– Across ideologies

Page 8

What is a Good Model for Documents?

• Clustering
– Mixture of unigram model
• How to specify a model?
• Generative process
– Assume some hidden variables
– Use them to generate documents
• Inference
– Invert the process
  • Given documents → hidden variables

[Figure: plate diagram with cluster indicator c_i, document w_i (plate N), mixing proportions π and topics φ (plate K)]

Page 9

Mixture of Unigram

[Figure: plate diagram with c_i, w_i (plate N), π and φ (plate K); a document w_i drawn from one of the topics φ_1, …, φ_k with mixing proportions π_1, …, π_j, …, π_k]

Generative Process
- For document w_i
  - Sample c_i ~ Multi(π)
  - Sample w_i ~ Multi(φ_{c_i})

Is this a good model for documents?

When is this a good model for documents?
- When documents are single-topic
- Not true in our settings
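The generative process on this slide can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name, the toy values of π and φ, and the fixed document lengths are all made up for the example and do not come from the slides.

```python
import numpy as np

def sample_mixture_of_unigrams(pi, phi, doc_lengths, rng):
    """Sample documents in which every word of a document shares one cluster.

    pi:  (K,) mixing proportions over clusters.
    phi: (K, V) one unigram distribution over the vocabulary per cluster.
    """
    docs, clusters = [], []
    for n_words in doc_lengths:
        c = int(rng.choice(len(pi), p=pi))                 # c_i ~ Multi(pi)
        words = rng.choice(phi.shape[1], size=n_words, p=phi[c])  # w_i ~ Multi(phi_{c_i})
        docs.append(words)
        clusters.append(c)
    return docs, clusters

# Toy setup (assumed): two clusters with disjoint vocabularies {0, 1} and {2, 3}.
rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])
phi = np.array([[0.5, 0.5, 0.0, 0.0],
                [0.0, 0.0, 0.5, 0.5]])
docs, clusters = sample_mixture_of_unigrams(pi, phi, [5, 8, 3], rng)
```

Because each document draws a single c_i, all of its words come from one topic, which is exactly the single-topic restriction the slide questions.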

Page 10

What Do We Need to Model?

• Q: What is it about?
• A: Mainly MT, with syntax, some learning

A Hierarchical Phrase-Based Model for Statistical Machine Translation

We present a statistical phrase-based translation model that uses hierarchical phrases (phrases that contain sub-phrases). The model is formally a synchronous context-free grammar but is learned from a bitext without any syntactic information. Thus it can be seen as a shift to the formal machinery of syntax-based translation systems without any linguistic commitment. In our experiments using BLEU as a metric, the hierarchical phrase-based model achieves a relative improvement of 7.5% over Pharaoh, a state-of-the-art phrase-based system.

Topic Models

• Topics: unigram distributions over the vocabulary
– MT: Source, Target, SMT, Alignment, Score, BLEU
– Syntax: Parse, Tree, Noun, Phrase, Grammar, CFG
– Learning: likelihood, EM, Hidden, Parameters, Estimation, argMax
• Mixing proportion: 0.6 (MT), 0.3 (Syntax), 0.1 (Learning)

Page 11

Mixed-Membership Models

[Figure: a document w_i drawn from topics φ_1, …, φ_k with document-specific proportions θ_1, …, θ_j, …, θ_k; plate diagram with Prior, θ, z, w, φ and plates N, D, K]

A Hierarchical Phrase-Based Model for Statistical Machine Translation

We present a statistical phrase-based translation model that uses hierarchical phrases. Thus it can be seen as a shift to the formal machinery of syntax-based translation systems without any linguistic commitment. In our experiments using BLEU as a metric, the hierarchical phrase-based model achieves a relative improvement of 7.5% over Pharaoh, a state-of-the-art phrase-based system.

Generative Process
- For each document d
  - Sample θ_d ~ Prior
  - For each word w in d
    - Sample z ~ Multi(θ_d)
    - Sample w ~ Multi(φ_z)
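A minimal sketch of the mixed-membership process above, with the prior passed in as a function so that any prior over topic vectors can be plugged in. The function names, the toy topics, and the symmetric Dirichlet used in the usage lines are assumptions for illustration; the slide only says "Prior".

```python
import numpy as np

def sample_mixed_membership(sample_prior, phi, doc_lengths, rng):
    """Each word gets its own topic indicator z drawn from the document's theta.

    sample_prior: callable rng -> (K,) probability vector (theta_d ~ Prior).
    phi:          (K, V) per-topic unigram distributions.
    """
    docs, assignments = [], []
    for n_words in doc_lengths:
        theta = sample_prior(rng)                              # theta_d ~ Prior
        z = rng.choice(len(theta), size=n_words, p=theta)      # z ~ Multi(theta_d)
        w = np.array([rng.choice(phi.shape[1], p=phi[k]) for k in z], dtype=int)  # w ~ Multi(phi_z)
        docs.append(w)
        assignments.append(z)
    return docs, assignments

def dirichlet_prior(rng, alpha=(0.5, 0.5)):
    # Assumed choice of prior: a Dirichlet, which gives LDA.
    return rng.dirichlet(alpha)

rng = np.random.default_rng(0)
phi = np.array([[0.5, 0.5, 0.0, 0.0],
                [0.0, 0.0, 0.5, 0.5]])
docs, assignments = sample_mixed_membership(dirichlet_prior, phi, [6, 4], rng)
```

Unlike the mixture of unigrams, a single document can now mix words from several topics, in proportions given by θ_d.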

Page 12

Topic Models

• Prior over topic vector
– Latent Dirichlet Allocation (LDA)
– Correlated priors (CTM)
– Hierarchical priors
• Topics
– Unigram, bigrams, etc.
• Document structure
– Bag of words
– Multi-modal
– Side information

[Figure: plate diagram with Prior, θ, z, w, φ and plates N, D, K]

Page 13

Outline

• Background
• Temporal Dynamics
– Timelines for research publications
– Storylines from news streams
– User interest-lines
• Structural Correspondence
– Across modalities
– Across ideologies

Page 14

Problem Statement

[Figure: given research papers from 1900 to 2009 (CS, Bio, Phy), discover topics]

• Potentially infinite number of topics
– With time-varying trends
– And time-varying distributions
– And variable durations
• Topics can die
• New topics can be born

Page 15

The Big Picture

[Figure: models placed along two axes, Time and Model Dimension: LDA (plate diagram with α, θ, z, w, φ and plates N, D, K), HDPM, Dynamic clustering, Dynamic LDA, and Infinite Dynamic Topic Models]

Page 16

LDA: The Generative Process

[Figure: plate diagram with α, θ, z, w, φ and plates N, D, K]

Generative Process
- For each document d
  - Sample θ_d ~ Dirichlet(α)
  - For each word w in d
    - Sample z ~ Multi(θ_d)
    - Sample w ~ Multi(φ_z)

• Topics’ distributions evolve over time?
• Topics’ trends evolve over time?
• Number of topics grows with the data?

Page 17

The Big Picture

[Figure: models placed along two axes, Time and Model Dimension: LDA (plate diagram with α, θ, z, w, β and plates N, D, K), HDPM, Dynamic clustering, Dynamic LDA, and Infinite Dynamic Topic Models]

Page 18

Dynamic LDA: The Generative Process

[Figure: plate diagram for the first epoch with α_1, θ, z, w, φ_1 and plates N, D, K, over research papers from 1900 to 2009]

- For each document d
  - Sample θ_d ~ Normal(α, λI)
  - For each word w in d
    - Sample z ~ Multi(L(θ_d))
    - Sample w ~ Multi(L(φ_z))

The Normal parameterization is necessary to evolve trends.

Logistic transformation: L(x)_k = exp(x_k) / Σ_j exp(x_j)

Page 19

Dynamic LDA: The Generative Process

[Figure: plate diagrams for epochs 1 and 2 with parameters (α_1, φ_1) and (α_2, φ_2), over research papers from 1900 to 2009]

- α_t ~ Normal(.| α_{t-1}, σ)
- φ_{k,t} ~ Normal(.| φ_{k,t-1}, ρ)
- For each document d
  - Sample θ_d ~ Normal(α_t, λI)
  - For each word w in d
    - Sample z_{d,i} ~ Multi(L(θ_d))
    - Sample w_{d,i} ~ Multi(L(φ_{z(d,i)}))
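The chained process can be sketched as follows. Several things here are assumptions rather than content of the slides: the logistic transformation L is taken to be the usual softmax, σ, ρ and λ are treated as standard deviations, and the first epoch starts from α_1 = 0 and standard-normal φ_{k,1}. All names are illustrative.

```python
import numpy as np

def softmax(x):
    """Assumed form of the logistic transformation L: exp(x_k) / sum_j exp(x_j)."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sample_dynamic_lda(T, K, V, docs_per_epoch, words_per_doc, sigma, rho, lam, rng):
    alpha = np.zeros(K)                       # assumed starting trend alpha_1
    phi = rng.normal(0.0, 1.0, size=(K, V))   # assumed starting topics phi_{k,1}
    corpus = []
    for t in range(T):
        if t > 0:
            alpha = rng.normal(alpha, sigma)  # alpha_t ~ Normal(alpha_{t-1}, sigma)
            phi = rng.normal(phi, rho)        # phi_{k,t} ~ Normal(phi_{k,t-1}, rho)
        topics = softmax(phi)                 # L(phi_{k,t}), one row per topic
        epoch_docs = []
        for _ in range(docs_per_epoch):
            theta = rng.normal(alpha, lam)    # theta_d ~ Normal(alpha_t, lam * I)
            z = rng.choice(K, size=words_per_doc, p=softmax(theta))
            w = np.array([rng.choice(V, p=topics[k]) for k in z], dtype=int)
            epoch_docs.append(w)
        corpus.append(epoch_docs)
    return corpus

rng = np.random.default_rng(0)
corpus = sample_dynamic_lda(T=3, K=4, V=10, docs_per_epoch=2, words_per_doc=20,
                            sigma=0.1, rho=0.1, lam=1.0, rng=rng)
```

The random walks on α_t and φ_{k,t} live in unconstrained real space, which is why the Dirichlet of static LDA is replaced by a Normal followed by the logistic map.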

Page 20

Dynamic LDA: The Generative Process

[Figure: a chain of LDA plate diagrams over research papers from 1900 to 2009, one per epoch, with parameters (α_1, φ_1), (α_2, φ_2), …, (α_T, φ_T)]

Page 21

Dynamic LDA: The Generative Process

[Figure: a chain of LDA plate diagrams, one per epoch, with parameters (α_1, φ_1), (α_2, φ_2), …, (α_T, φ_T)]

• Topics’ distributions evolve over time?
• Topics’ trends evolve over time?
• Number of topics grows with the data?

Page 22

The Big Picture

[Figure: models placed along two axes, Time and Model Dimension: LDA (plate diagram with α, θ, z, w, β and plates N, D, K), Dynamic clustering, Dynamic LDA, HDPM, and Infinite Dynamic Topic Models]

Page 23

The Chinese Restaurant Franchise Process

• HDPM automatically determines the number of topics in LDA
• We will focus on the Chinese Restaurant Franchise process construction
– A set of restaurants that share a global menu
• Metaphor
– Restaurant = document
– Customer = word
– Dish = topic
– Global menu = set of topics

Page 24

The Chinese Restaurant Franchise Process

[Figure: Restaurant 1 and Restaurant 2, each with tables, the dish served at each table, and the customers sharing the same dish; global menu φ_1, φ_2, φ_3, φ_4]

- m_1: number of tables serving this dish (topic)
- φ_4: distribution for topic 4

Page 25

The Chinese Restaurant Franchise Process

[Figure: global menu φ_1, φ_2, φ_3, φ_4; Restaurant 1, Restaurant 2, Restaurant 3]

Generative Process
- For customer w in restaurant 3
  - Choose table j ∝ N_j
  - Choose a new table ∝ α
    - Sample a new dish for this table

Page 26

The Chinese Restaurant Franchise Process

[Figure: global menu φ_1, φ_2, φ_3, φ_4; Restaurant 1, Restaurant 2, Restaurant 3]

Generative Process
- For customer w in restaurant 3
  - Choose table j ∝ N_j
  - Choose a new table ∝ α
    - Sample a new dish for this table

w ~ Multi(L(φ_3))

Page 28

The Chinese Restaurant Franchise Process

[Figure: global menu φ_1, φ_2, φ_3, φ_4 plus a slot for a new dish; Restaurant 1, Restaurant 2, Restaurant 3]

Generative Process
- For customer w in restaurant 3
  - Choose table j ∝ N_j
  - Choose a new table ∝ α
    - Sample a new dish for this table
      - Existing dish k ∝ m_k
      - A new dish ∝ γ

Page 29

The Chinese Restaurant Franchise Process

[Figure: global menu φ_1, φ_2, φ_3, φ_4 plus a slot for a new dish; Restaurant 1, Restaurant 2, Restaurant 3]

Generative Process
- For customer w in restaurant 3
  - Choose table j ∝ N_j
  - Choose a new table ∝ α
    - Sample a new dish for this table
      - Existing dish k ∝ m_k
      - A new dish ∝ γ

w ~ Multi(L(φ_3))

Page 30

The Chinese Restaurant Franchise Process

[Figure: global menu φ_1, φ_2, φ_3, φ_4 plus a new dish φ_5; Restaurant 1, Restaurant 2, Restaurant 3]

Generative Process
- For customer w in restaurant 3
  - Choose table j ∝ N_j
  - Choose a new table ∝ α
    - Sample a new dish for this table
      - Existing dish k ∝ m_k
      - A new dish ∝ γ

φ_5 ~ H

w ~ Multi(L(φ_5))
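The seating rule can be sketched as a small function that seats one customer and updates the counts in place. This is an illustrative sketch, not the author's implementation: the function name, the list-based state, and the concentration values α = γ = 1 in the usage lines are assumptions, and drawing the parameters of a new dish from H is left to the caller.

```python
import numpy as np

def crf_seat_customer(table_counts, table_dishes, dish_table_counts, alpha, gamma, rng):
    """Seat one customer (word) in one restaurant (document).

    table_counts:      customers per table in this restaurant (N_j).
    table_dishes:      dish index served at each table of this restaurant.
    dish_table_counts: tables serving each dish across all restaurants (m_k).
    Returns the index of the dish the customer ends up eating.
    """
    # Existing table j with weight N_j, or a new table with weight alpha.
    weights = np.array(table_counts + [alpha], dtype=float)
    j = int(rng.choice(len(weights), p=weights / weights.sum()))
    if j < len(table_counts):
        table_counts[j] += 1
        return table_dishes[j]
    # New table: existing dish k with weight m_k, or a new dish with weight gamma.
    dish_weights = np.array(dish_table_counts + [gamma], dtype=float)
    k = int(rng.choice(len(dish_weights), p=dish_weights / dish_weights.sum()))
    if k == len(dish_table_counts):
        dish_table_counts.append(0)   # new dish; the caller would draw phi_new ~ H
    dish_table_counts[k] += 1
    table_counts.append(1)
    table_dishes.append(k)
    return k

rng = np.random.default_rng(0)
menu_counts = []        # shared global menu, one entry per dish
restaurants = []
for _ in range(3):
    counts, dishes = [], []
    for _ in range(50):
        crf_seat_customer(counts, dishes, menu_counts, alpha=1.0, gamma=1.0, rng=rng)
    restaurants.append((counts, dishes))
```

Because the menu counts are shared across restaurants, a dish that is popular in one document is more likely to be picked up by new tables in another, and the number of dishes grows with the data.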

Page 31

The Chinese Restaurant Franchise Process

[Figure: global menu φ_1, φ_2, φ_3, φ_4, φ_5; Restaurant 1, Restaurant 2, Restaurant 3]

• Topics’ distributions evolve over time?
• Topics’ trends evolve over time?
• Number of topics grows with the data?

Page 32

The Big Picture

[Figure: models placed along two axes, Time and Model Dimension: LDA (plate diagram with α, θ, z, w, β and plates N, D, K), Dynamic clustering, Dynamic LDA, HDPM, and Infinite Dynamic Topic Models]

Page 33

Recurrent Chinese Restaurant Franchise Process

Epoch 1 (Global Menu T=1)
- Documents in epoch 1 are generated as before

Topics at end of epoch 1 (φ_{1,1}, φ_{2,1}, φ_{3,1}, φ_{4,1}, φ_{5,1})
- Height (m_{k,1}) represents topic popularity
- φ_{k,1} represents topic k's distribution

Observations
- Popular topics at epoch 1 are likely to be popular at epoch 2
- φ_{k,2} is likely to evolve smoothly from φ_{k,1}

Global Menu T=2
- Pseudo counts = epoch-1 counts * decay factor

Page 34

Recurrent Chinese Restaurant Franchise Process

[Figure: Global Menu T=1 (Epoch 1) with φ_{1,1}, φ_{2,1}, φ_{3,1}, φ_{4,1}, φ_{5,1}; Global Menu T=2 with new real dishes served, φ_{2,2} and φ_{3,2}, and the remaining dishes inherited but not yet used]

φ_{3,2} ~ Normal(.| φ_{3,1}, ρ)

Page 35

Recurrent Chinese Restaurant Franchise Process

[Figure: Global Menu T=1 (Epoch 1) with φ_{1,1}, …, φ_{5,1}; Global Menu T=2 with φ_{2,2}, φ_{3,2}]

Generative Process
- For customer w in restaurant 1
  - [as in static case] Choose table j ∝ N_j
  - Choose a new table ∝ α
    - Sample a new dish for this table
      - Existing and inherited dish k ∝ m'_{k,2} + m_{k,2}
      - Existing but NOT inherited dish k ∝ m'_{k,2}, then φ_{k,2} ~ Normal(.| φ_{k,1}, ρ)
      - A new dish ∝ γ, then φ_new ~ H

Page 37

Recurrent Chinese Restaurant Franchise Process

[Figure: Global Menu T=1 (Epoch 1) with φ_{1,1}, …, φ_{5,1}; Global Menu T=2 with φ_{1,2}, φ_{2,2}, φ_{3,2}]

Generative Process
- For customer w in restaurant 1
  - [as in static case] Choose table j ∝ N_j
  - Choose a new table ∝ α
    - Sample a new dish for this table
      - Existing and inherited dish k ∝ m'_{k,2} + m_{k,2}
      - Existing but NOT inherited dish k ∝ m'_{k,2}, then φ_{k,2} ~ Normal(.| φ_{k,1}, ρ)
      - A new dish ∝ γ, then φ_new ~ H

φ_{1,2} ~ Normal(.| φ_{1,1}, ρ)

Page 38

Recurrent Chinese Restaurant Franchise Process

[Figure: Global Menu T=1 (Epoch 1) with φ_{1,1}, …, φ_{5,1}; Global Menu T=2 with φ_{2,2}, φ_{3,2}, φ_{6,2}]

Generative Process
- For customer w in restaurant 1
  - [as in static case] Choose table j ∝ N_j
  - Choose a new table ∝ α
    - Sample a new dish for this table
      - Existing and inherited dish k ∝ m'_{k,2} + m_{k,2}
      - Existing but NOT inherited dish k ∝ m'_{k,2}, then φ_{k,2} ~ Normal(.| φ_{k,1}, ρ)
      - A new dish ∝ γ, then φ_new ~ H

φ_{6,2} ~ H

Page 39

Recurrent Chinese Restaurant Franchise Process

[Figure: Global Menu T=1 (Epoch 1) with φ_{1,1}, …, φ_{5,1}; Global Menu T=2 (Epoch 2) with φ_{1,2}, φ_{2,2}, φ_{3,2}, φ_{6,2}; Global Menu T=3; annotations mark died-out topics and newly born topics]

Page 40

Recurrent Chinese Restaurant Franchise Process

[Figure: Global Menu T=1 (Epoch 1) with φ_{1,1}, …, φ_{5,1}; Global Menu T=2 (Epoch 2) with φ_{1,2}, φ_{2,2}, φ_{3,2}, φ_{6,2}; Global Menu T=3]

• Topics’ distributions evolve over time?
• Topics’ trends evolve over time?
• Number of topics grows with the data?

Page 41

Recurrent Chinese Restaurant Franchise Process

[Figure: Global Menu T=1 (Epoch 1) with φ_{1,1}, …, φ_{5,1}; Global Menu T=2 (Epoch 2) with φ_{1,2}, φ_{2,2}, φ_{3,2}, φ_{6,2}; Global Menu T=3]

- We just described a first-order RCRF process
- The same construction extends to a general D-order process
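The dish-selection weights for a new table at epoch t can be sketched as below. The slides label the pseudo counts only as counts times a decay factor, so the exponential kernel exp(-lag / decay_rate), the function names, and the toy counts are all assumptions made for illustration; other decay kernels would fit the same slot.

```python
import numpy as np

def pseudo_counts(past_counts, decay_rate):
    """Decayed counts m'_{k,t} from up to D past epochs.

    past_counts: list of (K,) arrays of table counts, oldest first, so
                 past_counts[-1] holds m_{k,t-1}. len(past_counts) is the order D.
    decay_rate:  assumed width of an exponential decay kernel.
    """
    total = np.zeros_like(past_counts[-1], dtype=float)
    for lag, m in enumerate(reversed(past_counts), start=1):
        total += np.exp(-lag / decay_rate) * m
    return total

def rcrf_dish_probabilities(m_prime, m_current, gamma):
    """Probabilities over dishes for a new table at epoch t.

    Dish k gets weight m'_{k,t} + m_{k,t}; an inherited dish not yet used at
    epoch t has m_{k,t} = 0 and so gets weight m'_{k,t}. The last entry is the
    weight gamma for a brand-new dish.
    """
    weights = np.append(m_prime + m_current, gamma)
    return weights / weights.sum()

past = [np.array([4.0, 2.0, 0.0]),   # m_{k,t-2}
        np.array([1.0, 3.0, 2.0])]   # m_{k,t-1}
m_prime = pseudo_counts(past, decay_rate=1.0)
probs = rcrf_dish_probabilities(m_prime, np.array([0.0, 1.0, 0.0]), gamma=1.0)
```

A topic whose decayed counts shrink toward zero is effectively dead, while the γ entry keeps a path open for newly born topics.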

Page 42

Inference

• Gibbs Sampling
– Sample a table for each word
– Sample a topic for each table
– Sample the topic parameters over time
– Sample hyper-parameters
• How to deal with non-conjugacy
– Algorithm 8 of Neal (1998) plus Metropolis-Hastings
• Efficiency
– The Markov blanket contains the previous and following D epochs
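For readers unfamiliar with the Metropolis-Hastings ingredient named above, a generic accept/reject step looks like the sketch below. It is not Neal's Algorithm 8 and not the sampler used in the thesis; the function names, the standard-normal target, and the random-walk proposal are assumptions chosen to keep the example self-contained.

```python
import math
import numpy as np

def metropolis_hastings_step(x, log_target, propose, log_proposal, rng):
    """One Metropolis-Hastings update.

    log_target(x):      unnormalized log density of the target.
    propose(x, rng):    draws a candidate given the current state.
    log_proposal(a, b): log q(a | b), the log density of proposing a from b.
    Returns (new_state, accepted).
    """
    x_new = propose(x, rng)
    log_ratio = (log_target(x_new) + log_proposal(x, x_new)
                 - log_target(x) - log_proposal(x_new, x))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return x_new, True
    return x, False

def log_target(x):
    return -0.5 * x * x              # standard normal, up to a constant

def propose(x, rng):
    return x + rng.normal(0.0, 1.0)  # symmetric random walk

def log_proposal(a, b):
    return 0.0                       # symmetric proposal: the terms cancel

rng = np.random.default_rng(0)
x, samples, accepted = 0.0, [], 0
for _ in range(5000):
    x, ok = metropolis_hastings_step(x, log_target, propose, log_proposal, rng)
    samples.append(x)
    accepted += ok
```

In a non-conjugate model the target would be the conditional posterior of a topic parameter, and a better-informed proposal raises the acceptance rate.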

Page 43

Sampling a Topic for a Table

[Figure: Global Menu T=1 (φ_{1,1}, …, φ_{5,1}), Global Menu T=2 (φ_{1,2}, φ_{2,2}, φ_{3,2}, φ_{6,2}) and Global Menu T=3; terms labeled Past, Future and Emission, annotated with Efficiency and Non-Conjugacy]

Page 44

Sampling a Topic for a Table

[Figure: Global Menu T=1 (φ_{1,1}, …, φ_{5,1}), Global Menu T=2 (φ_{1,2}, φ_{2,2}, φ_{3,2}, φ_{6,2}) and Global Menu T=3; terms labeled Past, Future and Emission, annotated with Efficiency and Non-Conjugacy; a new topic is drawn from H = N(0, σI); annotation γ/3]

Page 45

Sampling a Topic for a Table

[Figure: Global Menu T=1 (φ_{1,1}, …, φ_{5,1}), Global Menu T=2 (φ_{1,2}, φ_{2,2}, φ_{3,2}, φ_{6,2}) and Global Menu T=3; terms labeled Past, Future and Emission, annotated with "Pre-compute and update" and Non-Conjugacy]

Page 46

Sampling Topic Parameters

• v | φ ~ Mult(Logistic(φ))
• Linear state-space model with non-Gaussian emission
• Use Laplace approximation inside the Forward-Backward algorithm
• Use the resulting distribution as a proposal

[Figure: chain φ_1 → φ_2 → … → φ_T, each state emitting v]

Page 47

Experiments

• Simulated data
– Simulated 20 epochs with 100 data points in each epoch
• Timeline of the NIPS conference
– 13 years
– 1740 documents
– 950 words per document
– Vocabulary of ~3500 words

Page 48

Simulation Experiment

[Figure: ground truth, topic index (1 to 9) against time (epochs 1 to 20), with sample documents]

Page 49

[Figure: ground truth versus recovered topics over epochs 0 to 20]

Page 50

[Figure: timeline of topics in the NIPS corpus, with the years 1987, 1990, 1991, 1994, 1995 and 1996 marked. Topic labels: speech, Neuroscience, NN, Classification, Methods, Control, Prob. Models (PM), image, SOM, RL, Bayesian, Mixtures, Generalization, boosting, Clustering, ICA, Kernels, Memory]

Page 51

[Figure: top words of four topics at successive points in time]

PM
– 1987: field, code, temperature, tree, boltzmann, energy, annealing, node, probability
– 1990: field, tree, level, energy, probability, node, annealing, boltzmann, variables
– 1993: tree, variables, node, level, probability, field, distribution, structure, graph, energy
– 1996: variables, graph, tree, probability, field, structure, node, distribution, energy
– 1999: probability, variables, tree, field, distribution, graph, nodes, belief, node, inference, propagation

Mixtures
– 1990: em, expert, mixture, gating, missing, experts, gaussian, parameters, density
– 1994: mixture, em, likelihood, missing, experts, mixtures, gaussian, parameters
– 1999: mixture, gaussian, em, likelihood, parameters, analysis, density, factor, variables, distribution

ICA
– 1995: wavelet, natural, separation, source, ica, coefficients, independent, basis
– 1999: source, ica, blind, separation, coefficients, natural, independent, basis, wavelet

Methods (successive snapshots, years not shown)
– method, solution, energy, values, gradient, convergence, equation, algorithms
– gradient, weight, method, methods, local, rate, optimal, descent, solution
– gradient, matrix, weight, algorithms, local, rate, problems, point, equation
– matrix, algorithms, gradient, convergence, equation, optimal, method, parameter

Page 52

Kernels
– 1996: support, kernel, svm, regularization, sv, vectors, feature, regression
– 1997: kernel, support, sv, svm, machines, regression, vapnik, feature, solution
– 1998: kernel, support, svm, regression, feature, machines, solution, margin, pca
– 1999: kernel, svm, support, regression, solution, machines, matrix, feature, regularization

- Support Vector Method for Function Approximation, Regression Estimation, and Signal Processing, V. Vapnik, S. E. Golowich and A. Smola
- Support Vector Regression Machines, H. Drucker, C. Burges, L. Kaufman, A. Smola and V. Vapnik
- Improving the Accuracy and Speed of Support Vector Machines, C. Burges and B. Schoelkopf
- From Regularization Operators to Support Vector Kernels, A. Smola and B. Schoelkopf
- Prior Knowledge in Support Vector Kernels, B. Schoelkopf, P. Simard, A. Smola and V. Vapnik
- Uniqueness of the SVM Solution, C. Burges and D. Crisp
- An Improved Decomposition Algorithm for Regression Support Vector Machines, P. Laskov
- … many more

Page 53

The Big Picture

[Figure: models placed along two axes, Time and Model Dimension: LDA (plate diagram with α, θ, z, w, β and plates N, D, K), Dynamic clustering, Dynamic LDA, HDPM, and Infinite Dynamic Topic Models]

Page 54

Quantitative Analysis

Page 55

Analyzing the NIPS Corpus

[Figure: panels (a), (b) and (c), with labels "Start state" and "Posterior sample"]

Page 57

Outline

• Background
• Temporal Dynamics
– Timelines for research publications
– Storylines from news streams
– User interest-lines
• Structural Correspondence
– Across modalities
– Across ideologies

Page 58

Problem Statement

• Rapid growth of social media and news outlets
• Lots of redundancy
• How to get the big picture?
– What are the stories?
– Who are the main entities?
– When and how do they develop over time?
– How are they categorized? (sports, economics, etc.)

Page 59

Proposed Solution

• Topic models
– Discover long-term high-level themes
  • Sports
  • Health
  • Politics
• Dynamic clustering
– Discover short-term ephemeral themes
  • Cricket match
  • SARS epidemic
• Inference
– Online algorithm using Sequential Monte Carlo

Page 60

Preliminary Result

• Sports: games, Won, Team, Final, Season, League, held
• Politics: Government, Minister, Authorities, Opposition, Officials, Leaders, group
• Accidents: Police, Attack, run, man, group, arrested, move

• Border-Tension
– Nuclear, Border, Dialogue, Diplomatic, militant, Insurgency, missile
– Pakistan, India, Kashmir, New Delhi, Islamabad, Musharraf, Vajpayee
• UEFA-soccer
– Champions, Goal, Leg, Coach, Striker, Midfield, penalty
– Juventus, AC Milan, Real Madrid, Milan, Lazio, Ronaldo, Lyon
• Tax-bills
– Tax, Billion, Cut, Plan, Budget, Economy, lawmakers
– Bush, Senate, US, Congress, Fleischer, White House, Republican

Page 61

Structure Browsing

More Like this Story

• Middle-east-conflict
– Peace, Roadmap, Suicide, Violence, Settlements, bombing
– Israel, Palestinian, West bank, Sharon, Hamas, Arafat

Based on topics

• Nuclear programs
– Nuclear, summit, warning, policy, missile, program
– North Korea, South Korea, U.S, Bush, Pyongyang

Nuclear + topics [politics]

- India in any topic
- Pakistan in any topic
- India and Pakistan in any topic
- ……

Page 62

Outline

• Background
• Temporal Dynamics
– Timelines for research publications
– Storylines from news streams
– User interest-lines
• Structural Correspondence
– Across modalities
– Across ideologies

Page 63

Modeling Dynamic User Intent

• How to model users’ intents?
– Long-term
– Short-term
– Spurious
• Input
– Queries issued by the user
– Documents viewed by the user
• Output
– Dynamic distribution over intents

Page 64

The Big Picture

[Figure: a user's activity over time as groups of words: (Car, Deals, van); (job, Hiring, diet)]

Page 65

The Big Picture

[Figure: a user's activity over time as groups of words: (Car, Deals, van); (job, Hiring, diet); (Hiring, Salary, Diet, calories); (Auto, Price, Used, inception); (Flight, London, Hotel, weather)]

Page 66

The Big Picture

[Figure: a user's activity over time as groups of words: (Car, Deals, van); (job, Hiring, diet); (Hiring, Salary, Diet, calories); (Auto, Price, Used, inception); (Flight, London, Hotel, weather); (Movies, Theatre, Art, gallery)]

Page 67

The Big Picture

[Figure: a user's activity over time as groups of words: (Car, Deals, van); (job, Hiring, diet); (Hiring, Salary, Diet, calories); (Auto, Price, Used, inception); (Flight, London, Hotel, weather); (Diet, Calories, Recipe, chocolate); (Movies, Theatre, Art, gallery); (School, Supplies, Loan, college)]

Page 68

The Big Picture

[Figure: a user's activity over time as groups of words: (Car, Deals, van); (job, Hiring, diet); (Hiring, Salary, Diet, calories); (Auto, Price, Used, inception); (Flight, London, Hotel, weather); (Diet, Calories, Recipe, chocolate); (Movies, Theatre, Art, gallery); (School, Supplies, Loan, college). Intent labels: Cars, Art, Diet, Jobs, Travel, College finance]

Page 69

Highlights

• Applications
– Behavioral targeting
  • Matching users to ads
– But you can also match users to
  • Stories
  • New research papers
• Challenges
– Large scale: ~35 M users
– Incremental data

Page 70

Outline

• Background
• Temporal Dynamics
– Timelines for research publications
– Storylines from news streams
– User interest-lines
• Structural Correspondence
– Across modalities
– Across ideologies

Page 72

Biological Images

• High throughput devices in recent years
• Important source of information for biologists
• A pressing need to manage and organize this information for retrieval and visualization tasks
• Embedded within research papers
• Pose challenges to mainstream text-image systems

[Figure: FMI images, gel images, papers]

Page 73

Biological Figures are challenging

• Hierarchical Organization
– Multiple panels
– Image labels and image pointers
• Scoped caption
• Global caption
• Protein annotations
• Free text annotations

[Figure: mainstream image retrieval datasets, with example image annotations: (Market, people); (Scotland, water); (Bridge, sky, water); (fish, water); (Clouds, jet, plane)]

Page 74

The Big Picture

[Figure: a Query Handling Module takes queries such as "Mice + antibodies", "Cancer + tubulin" and "Actin", and supports image retrieval, textual retrieval, MM retrieval, annotation, visualization and a high-level overview]

Tasks

• High level overview: summary
• Retrieval across modalities
– Image retrieval
– Text-based retrieval
– Text + protein based retrieval
– Annotation
• Mixed Granularity
– Input can be either panel or figure
– Output can be either panel or figure

Page 75

Why Are Queries Hard?

• What if I only want to retrieve figures that address the role of vha-8 during Larva state?
– Only addressed in panel E
• How can we compare figures with vastly different numbers of panels?
– Same study but with different time resolution?

Page 76

The Big Picture

[Figure: an Extraction System turns figures into scoped captions, global captions and protein entities, which feed a Query Handling Module (across modality and granularity) for queries such as "Mice + antibodies", "Cancer + tubulin" and "Actin"; outputs: image retrieval, textual retrieval, MM retrieval, annotation, visualization, high-level overview]

Example captions
- Affinity-purified rabbit antir mnp 41 antibodies
- Monoclonal anti-cPABP antibodies
- Double immunofluorescence confocal microscopy using mAB against cPABP …….. And the bound antibodies were visualized

Extraction System
- Segment the figure into panels
- Detect panel image pointers: a, b
- Detect mentions of a pointer in the text, like (a)
- Match image pointers to text labels (CRF)
- Detect named entities in the text
- See paper for reference

Page 77

The Big Picture

[Figure: an Extraction System feeds a Query Handling Module (across modality and granularity) for queries such as "Mice + antibodies", "Cancer + tubulin" and "Actin"; outputs: image retrieval, textual retrieval, MM retrieval, annotation, visualization, high-level overview]

Example captions
- Affinity-purified rabbit antir mnp 41 antibodies
- Monoclonal anti-cPABP antibodies
- Double immunofluorescence confocal microscopy using mAB against cPABP …….. And the bound antibodies were visualized

Page 78: Amr Ahmed Thesis Proposal Modeling Users and Content: Structured Probabilistic Representation and Scalable Online Inference Algorithms

The Big Picture

Extraction System

Topic Modeling

Affinity-purified rabbit antir mnp 41 antibodies

Monocolonal anti-cPAPB antibodies

Double immunofluorescence confocal microscopy using mAB against cPABP …….. And the bound antibodies were visualized

Mice + antibodies

Cancer + tubulin

Query Handling ModuleAcross Modality and granularity

Imageretrieval

Textualretrieval

MMretrieval

Anno-tation

Visualization

High level Overview

Actin

Page 79: Amr Ahmed Thesis Proposal Modeling Users and Content: Structured Probabilistic Representation and Scalable Online Inference Algorithms

The Big Picture

[Diagram: Extraction System, Topic Modeling, and the Query Handling Module (across modality and granularity) with image retrieval, textual retrieval, MM retrieval, and annotation. Labels: Semantic Representation (Figure, Panel); Learnt Topics for Visualization (Topic 1 … Topic K). Example queries: "Mice + antibodies", "Cancer + tubulin", "Actin".]

Example caption text:
– Affinity-purified rabbit antir mnp 41 antibodies
– Monoclonal anti-cPABP antibodies
– Double immunofluorescence confocal microscopy using mAB against cPABP … and the bound antibodies were visualized


Topic Models

• Each topic has a triplet of distributions
– Multinomial distribution over words
– Multinomial distribution over protein words
– Gaussian distribution over image features (texture and histograms)
• Each topic models the correspondence between its facets

[Figure: per-topic top panels and image features (Feature 1 … Feature M)]
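The triplet of distributions above can be sketched in a few lines of Python. This is only an illustration, not the thesis implementation: the class name `FacetedTopic`, the diagonal (per-feature) Gaussian, the probability floor for unseen terms, and all numbers are assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FacetedTopic:
    """Hypothetical container for one topic's three facets."""
    word_probs: Dict[str, float]     # multinomial over caption words
    protein_probs: Dict[str, float]  # multinomial over protein words
    feat_mean: List[float]           # Gaussian mean per image feature
    feat_var: List[float]            # Gaussian variance per feature (diagonal: an assumption)

    def log_likelihood(self, words, proteins, features, floor=1e-12):
        """Log-probability of one panel's words, proteins and image features."""
        ll = sum(math.log(self.word_probs.get(w, floor)) for w in words)
        ll += sum(math.log(self.protein_probs.get(p, floor)) for p in proteins)
        for x, mu, var in zip(features, self.feat_mean, self.feat_var):
            ll += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        return ll


# Two made-up topics and one made-up panel.
topic_a = FacetedTopic({"actin": 0.6, "cell": 0.4}, {"tubulin": 1.0}, [0.0], [1.0])
topic_b = FacetedTopic({"tumor": 0.7, "cell": 0.3}, {"p53": 1.0}, [5.0], [1.0])

panel_words, panel_proteins, panel_features = ["actin", "cell"], ["tubulin"], [0.2]
score_a = topic_a.log_likelihood(panel_words, panel_proteins, panel_features)
score_b = topic_b.log_likelihood(panel_words, panel_proteins, panel_features)
```

Because all three facets contribute to the same score, a topic is preferred only when the panel's words, proteins, and image features all agree with it, which is one way to read "correspondence between its facets".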


Structured Correspondence LDA

[Plate diagram of the Structured Correspondence LDA model, shown next to the example figure and its caption text; the variable labels are not recoverable from the extraction. Learnt topics (Topic 1 … Topic K) cover protein, word, and SLIF image-feature facets, plus a background topic.]


Structured Correspondence LDA

[Plate diagram of the same model with annotations: panel; number of panels; background: annotation ratio. Learnt topics: Topic 1 … Topic K.]


A Sample Topic: Tumorigenesis

[Figure: top panels for the topic, with annotations]

• Known tumor suppressors
• Codes for a protein with tumor-suppressing effect
• Member of the caspase family with a role in apoptosis (programmed cell death)


Figure Embedding


Protein Annotations

• How to rank
– Based on similarity between the latent representation of the figure and that of the protein

[Diagram: a figure with its caption text goes to the Query Handling Module (across modality and granularity), which returns a ranked list of proteins; the ranking is then evaluated. The latent figure representation is compared with the latent protein representation.]
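A minimal sketch of ranking proteins against a figure in a shared latent space. The slide only says "similarity"; cosine similarity is an assumption here, as are the function names and the three-dimensional vectors, which are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_proteins(figure_vec, protein_vecs):
    """Return protein names sorted from most to least similar to the figure."""
    return sorted(protein_vecs,
                  key=lambda name: cosine(figure_vec, protein_vecs[name]),
                  reverse=True)


# Hypothetical latent (topic-space) representations.
figure = [0.7, 0.2, 0.1]
proteins = {
    "actin": [0.8, 0.1, 0.1],
    "tubulin": [0.1, 0.8, 0.1],
    "cPABP": [0.3, 0.3, 0.4],
}
ranking = rank_proteins(figure, proteins)
```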


Protein Annotations

• How to rank
– Based on similarity between the latent representation of the figure and that of the protein
• How to evaluate the ranking
– Best rank
– Average rank
– Rank at full recall

[Diagram: a figure with its caption text goes to the Query Handling Module (across modality and granularity), which returns a ranked list of proteins; the ranking is then evaluated. Example ranked list: Actin, mAB, Tubulin, Vhat-8, MTP-1, cPABP.]
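The three evaluation measures can be sketched as follows. The definitions are one plausible reading of the slide, not taken from it: best rank is the position of the highest-ranked correct protein, average rank is the mean position of the correct proteins, and rank at full recall is the position at which the last correct protein appears. The ranked list reuses the names shown on the slide; the set of correct proteins is made up.

```python
def ranking_metrics(ranked, relevant):
    """Compute best rank, average rank and rank at full recall (1-based positions)."""
    relevant = set(relevant)
    positions = [i for i, item in enumerate(ranked, start=1) if item in relevant]
    if not relevant or len(positions) < len(relevant):
        raise ValueError("every relevant item must appear in the ranked list")
    return {
        "best_rank": min(positions),
        "average_rank": sum(positions) / len(positions),
        "rank_at_full_recall": max(positions),
    }


ranked_list = ["Actin", "mAB", "Tubulin", "Vhat-8", "MTP-1", "cPABP"]
correct = {"mAB", "cPABP"}  # hypothetical ground truth for illustration
metrics = ranking_metrics(ranked_list, correct)
```

Lower values are better for all three measures under these definitions.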


Text-based Image Retrieval

• Input: words (w) + protein (r)
• Output: ranked list of figures
– Use query language model
• Measure precision-recall tradeoffs

[Diagram: the latent figure representation is compared with the latent word and latent protein representations.]
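One common way to realize a query language model with topic models is query-likelihood scoring: each figure is scored by the probability its topic mixture assigns to the query terms. The sketch below assumes that form; the slides do not give the exact formula, and the function name, topic tables, and figure mixtures are all hypothetical.

```python
import math


def query_log_likelihood(theta, word_topics, protein_topics, words, proteins, floor=1e-12):
    """log p(query | figure) = sum over terms of log sum_k theta[k] * p(term | topic k)."""
    score = 0.0
    terms = [(w, word_topics) for w in words] + [(r, protein_topics) for r in proteins]
    for term, table in terms:
        p = sum(weight * topic.get(term, 0.0) for weight, topic in zip(theta, table))
        score += math.log(max(p, floor))  # floor avoids log(0) for unseen terms
    return score


# Two made-up topics, each with a word facet and a protein facet.
word_topics = [{"antibody": 0.5, "mice": 0.5}, {"tumor": 0.6, "cell": 0.4}]
protein_topics = [{"actin": 1.0}, {"tubulin": 1.0}]

# Made-up topic mixtures for two figures.
figures = {"fig1": [0.9, 0.1], "fig2": [0.2, 0.8]}

query_words, query_proteins = ["tumor"], ["tubulin"]
ranked_figures = sorted(
    figures,
    key=lambda f: query_log_likelihood(figures[f], word_topics, protein_topics,
                                       query_words, query_proteins),
    reverse=True,
)
```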


Transfer Learning from Partial Figures

[Diagram: full figures and partial figures, with the parameters tied between the two.]


Does it Help?

[Two plots, both titled "Protein annotation"; the plotted values are not recoverable.]


Transfer Learning from Partial Figures

[Diagram: full figures and partial figures, with labels as extracted:]
• p( , Words, Protein) [first argument lost in extraction]
• q(Words, Protein)
• Tie the parameters
• p(Words, Protein): better marginal
• Better distribution, lifted


Outline

• Background
• Temporal Dynamics
– Timelines for research publications
– Storylines from news streams
– User interest-lines
• Structural Correspondence
– Across modalities
– Across ideologies


Problem Statement

Given: [not recoverable from the slide text]

Build a model that can answer the following:

Visualization
• How does each ideology view mainstream events?
• On which topics do they differ?
• On which topics do they agree?


Problem Statement

Given: [not recoverable from the slide text]

Build a model that can answer the following:

Classification
• Given a new news article or blog post, the system should:
– Decide from which side it was written
– Justify its answer at a topical level
– E.g., because its view on abortion coincides with the pro-choice stance


Problem Statement

Given: [not recoverable from the slide text]

Build a model that can answer the following:

Structured browsing
• Given a new news article or blog post, the user can ask for:
– Examples of other articles from the same ideology about the same topic
– Documents that exemplify alternative views from other ideologies


Approach: Build a Factored Model

[Diagram, labels as extracted: W1, W2; topics b1 … bk; Ideology 1 views f1,1, f1,2, …, f1,k; Ideology 2 views f2,1, f2,2, …, f2,k.]


Example: Bitterlemons corpus

[Figure: learnt topics, each shown with a Palestinian view and an Israeli view. Word lists follow as extracted, with stemmed forms kept; which list belongs to which view is not recoverable.]

Two unlabeled word lists:
– palestinian israeli peace year political process state end right government need conflict way security
– palestinian israeli peace political occupation process end security conflict way government people time year force negotiation

US role:
– bush US president american sharon administration prime settlement pressure policy washington ariel new middle
– unit state american george powell minister colin visit internal policy statement express pro previous package work transfer european administration
– arafat state leader roadmap george election month iraq week peace june realistic yasir senior involvement clinton november post mandate terrorism

Roadmap process:
– roadmap phase security ceasefire state plan international step authority final quartet issue map effort
– roadmap end settlement implementation obligation stop expansion commitment fulfill unit illegal present previou assassination meet forward
– process force terrorism unit road demand provide confidence element interim discussion want union succee point build positive recognize present timetable

Arab Involvement:
– syria syrian negotiate lebanon deal conference concession asad agreement regional october initiative relationship
– track negotiation official leadership position withdrawal time victory present second stand circumstance represent sense talk strategy issue participant parti negotiator
– peace strategic plo hizballah islamic neighbor territorial radical iran relation think obviou countri mandate greater conventional intifada affect jihad time


Outline

• Background
• Temporal Dynamics
– Timelines for research publications
– Storylines from news streams
– User interest-lines
• Structural Correspondence
– Across modalities
– Across ideologies
• Summary and Timeline


Summary

• Topic models are a flexible framework
• Very useful if you
– Care about the hidden structure
– Want to leverage the hidden structure in tasks for which you have few labels
– Have partially labeled data (many-many)
• Bayesian and hierarchical models are not slow
– They can be scaled
– They can be made to work online


Main Contributions

• Models
– Time-varying non-parametric framework
• Inference
– Distributed incremental inference algorithms
– Online SMC algorithms
• Applications
– In research publications
– Social media


Thanks!

Questions?


Backup slides


Hyper-parameter Sensitivity

[Diagram, labels as extracted: f1, f2, …, fT, each paired with v.]


Hyper-parameter Sensitivity

[Plot: Global Menu, T=3. x-axis: Past, from -14 to 0; y-axis: Weight, from 0 to 0.9. Legend values as extracted: 0.5, 124, 6.]


Structured cLDA and cLDA

[Figure: a cLDA example image annotated "Market people" (Blei and Jordan, SIGIR 2003), next to a multi-panel figure with the caption text "Affinity-purified rabbit antir mnp 41 antibodies", "Monoclonal anti-cPABP antibodies", and "Double immunofluorescence confocal microscopy using mAB against cPABP … and the bound antibodies were visualized".]


Can we use cLDA instead?

[Figure: the cLDA "Market people" example next to the multi-panel figure and its caption text.]

• Options: whole-caption replication or scoped-caption replication
• Lose structure: can no longer answer figure queries
• Under-representation or over-representation


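Mixtures and MM-models

[Two diagrams, labels as extracted: components f1 … fk generating word wi, one with weights q1 … qj … qk and one with weights p1 … pj … pk.]

- Two orthogonal dimensions
- Mixtures
- Membership models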
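The textbook contrast between a mixture model and a mixed-membership model can be sketched generatively: a mixture picks one component per document, while a mixed-membership model picks a component per word. This is a generic illustration and may not match what the slide's diagrams showed; the function names, toy topics, and weights are assumptions.

```python
import random


def draw_word(topic, rng):
    """Sample one word from a topic given as {word: probability}."""
    vocab = list(topic)
    return rng.choices(vocab, weights=[topic[w] for w in vocab])[0]


def sample_mixture_doc(weights, topics, n_words, rng):
    """Mixture: a single component is chosen once and generates every word."""
    z = rng.choices(range(len(topics)), weights=weights)[0]
    return [z] * n_words, [draw_word(topics[z], rng) for _ in range(n_words)]


def sample_mixed_membership_doc(weights, topics, n_words, rng):
    """Mixed membership: each word chooses its own component."""
    zs = rng.choices(range(len(topics)), weights=weights, k=n_words)
    return zs, [draw_word(topics[z], rng) for z in zs]


# Toy topics with made-up words and probabilities.
topics = [{"election": 0.5, "vote": 0.5}, {"church": 0.5, "pastor": 0.5}]
rng = random.Random(0)
mix_z, mix_words = sample_mixture_doc([0.5, 0.5], topics, 6, rng)
mm_z, mm_words = sample_mixed_membership_doc([0.5, 0.5], topics, 6, rng)
```

In the mixture sample every entry of `mix_z` is the same component; in the mixed-membership sample `mm_z` may vary from word to word.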


Example Story

• Story: Obama's controversial pastor
– Topics
• Politics
• Religion
• Race
– Entities
• Obama, Wright, Illinois


Storyline Models

• We can use clustering
– Each document belongs to a story (cluster)
– Lacks global structure
• What is shared across stories?
• How about story classification?
• We can use topic models
– Ignore the notion of story
• Tightly focused, short-term
– Topics are high-level concepts
• Coarse-grained, long-term