
Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data

Frank Lin and William W. Cohen
School of Computer Science, Carnegie Mellon University

KDD-MLG 2011, 2011-08-21, San Diego, CA, USA

Overview

• Graph-based SSL
• The Problem with Text Data
• Implicit Manifolds
– Two GSSL Methods
– Three Similarity Functions
• A Framework
• Results
• Related Work

Graph-based SSL

• Graph-based semi-supervised learning methods can often be viewed as methods for propagating labels along edges of a graph.

• Naturally, they are a good fit for network data

[Figure: a graph with some nodes labeled class A and some labeled class B. How to label the rest of the nodes in the graph?]

The Problem with Text Data

• Documents are often represented as feature vectors of words:

The importance of a Web page is an inherently subjective matter, which depends on the readers…

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use…

You're not cool just because you have a lot of followers on twitter, get over yourself…

        cool  web  search  make  over  you
Doc 1:    0    4     8      2     5     3
Doc 2:    0    8     7      4     3     2
Doc 3:    1    0     0      0     1     2

The Problem with Text Data

• Feature vectors are often sparse
• But the similarity matrix is not!

Feature matrix (mostly zeros: any document contains only a small fraction of the vocabulary):

        cool  web  search  make  over  you
Doc 1:    0    4     8      2     5     3
Doc 2:    0    8     7      4     3     2
Doc 3:    1    0     0      0     1     2

Similarity matrix (mostly non-zero: any two documents are likely to have a word in common):

        Doc 1  Doc 2  Doc 3
Doc 1:    -     23     27
Doc 2:   23      -    125
Doc 3:   27    125      -

The Problem with Text Data

• A similarity matrix is the input to many GSSL methods
• How can we apply GSSL methods efficiently?

        Doc 1  Doc 2  Doc 3
Doc 1:    -     23     27
Doc 2:   23      -    125
Doc 3:   27    125      -

O(n²) time to construct
O(n²) space to store
> O(n²) time to operate on

Too expensive! Does not scale up to big datasets!

The Problem with Text Data

• Solutions:
1. Make the matrix sparse: a lot of cool work has gone into how to do this… (If you do SSL on large-scale non-network data, there is a good chance you are using this already.)
2. Implicit Manifold: but this is what we'll talk about!

Implicit Manifolds

• A pair-wise similarity matrix is a manifold under which the data points "reside".
• It is implicit because we never explicitly construct the manifold (the pair-wise similarity matrix), although the results we get are exactly the same as if we did.

What do you mean by…? Sounds too good. What's the catch?

Implicit Manifolds

• Two requirements for using implicit manifolds on text(-like) data:
1. The GSSL method can be implemented with matrix-vector multiplications
2. We can decompose the dense similarity matrix into a series of sparse matrix-matrix multiplications (see the sketch below)

As long as these are met, we can obtain the exact same results without ever constructing a similarity matrix!

Let's look at some specifics, starting with two GSSL methods.
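Before that, to make requirement 2 concrete: if S = FF^T for a sparse feature matrix F, then any product Sv can be computed as F(F^T v), i.e. two sparse matrix-vector products. A minimal sketch in Python with scipy.sparse (my own illustration using the toy counts from the earlier slide, not code from the talk):

```python
import numpy as np
from scipy import sparse

# Sparse document-feature matrix F (3 docs x 6 words, from the earlier slide).
F = sparse.csr_matrix(np.array([
    [0, 4, 8, 2, 5, 3],
    [0, 8, 7, 4, 3, 2],
    [1, 0, 0, 0, 1, 2],
], dtype=float))

v = np.ones(F.shape[0])

# Explicit way: build the dense similarity matrix S = F F^T (O(n^2) space).
S = (F @ F.T).toarray()
explicit = S @ v

# Implicit way: two sparse matrix-vector products; S is never materialized.
implicit = F @ (F.T @ v)

assert np.allclose(explicit, implicit)
```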

Two GSSL Methods

• Harmonic functions method (HF)
• MultiRankWalk (MRW)

If you've never heard of them, you might have heard of one of these:

HF's cousins: Gaussian fields and harmonic functions classifier (Zhu et al. 2003); weighted-vote relational network classifier (Macskassy & Provost 2007); weakly-supervised classification via random walks (Talukdar et al. 2008); Adsorption (Baluja et al. 2008); learning on diffusion maps (Lafon & Lee 2006); and others…

MRW's cousins: partially labeled classification using Markov random walks (Szummer & Jaakkola 2001); learning with local and global consistency (Zhou et al. 2004); graph-based SSL as a generative model (He et al. 2007); ghost edges for classification (Gallagher et al. 2008); and others…

HF: propagation via backward random walks. MRW: propagation via forward random walks (with restart).

Two GSSL Methods

• Harmonic functions method (HF)
• MultiRankWalk (MRW)

In both of these iterative implementations, the core computation is matrix-vector multiplication!

Let's look at their implicit manifold qualification…

Three Similarity Functions

• Inner product similarity: simple, good for binary features

• Cosine similarity: often used for document categorization

• Bipartite graph walk: can be viewed as a bipartite graph walk, or as feature reweighting by relative frequency; related to TF-IDF term weighting (one possible formula is sketched after this list)

Side note: perhaps all the cool work on matrix factorization could be useful here too…
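The slides name the bipartite graph walk but do not spell out its formula here; one plausible instantiation consistent with "feature reweighting by relative frequency" (an assumption on my part, not a formula from the talk) divides each feature column by its total count before the inner product:

```python
import numpy as np
from scipy import sparse

def apply_bipartite_walk_similarity(F, v):
    """One reading of bipartite-graph-walk similarity: S = F D_c^{-1} F^T,
    where D_c holds the column (feature) totals of F. Applied implicitly,
    this is a doc -> feature -> doc walk; S is never materialized."""
    col_totals = np.asarray(F.sum(axis=0)).ravel()
    col_totals[col_totals == 0] = 1.0     # guard against unused features
    return F @ ((F.T @ v) / col_totals)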

Putting Them Together

• Example: HF + inner product similarity:

S = FF^T

• The iteration update becomes:

v^(t+1) = D^(-1) (F (F^T v^t))

Parentheses are important!

The diagonal matrix D can be calculated as D(i,i) = d(i), where d = FF^T 1 (1 is the all-ones vector).

Construction: O(n). Storage: O(n). Operation: O(n).

How about a similarity function we actually use for text data?
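Before moving on, a minimal sketch of this inner-product HF iteration (my own illustration, not code from the talk; the seed clamping and fixed iteration count are simplifying assumptions):

```python
import numpy as np
from scipy import sparse

def hf_inner_product(F, labels, iters=50):
    """Harmonic functions with implicit inner-product similarity S = F F^T.

    F      -- sparse n x m document-feature matrix
    labels -- length-n array: +1 / -1 for seed documents, 0 for unlabeled
    Assumes every document has at least one nonzero feature.
    """
    d = F @ (F.T @ np.ones(F.shape[0]))  # row sums of S, computed implicitly
    v = labels.astype(float)
    seeds = labels != 0
    for _ in range(iters):
        # v <- D^{-1} (F (F^T v)); the parentheses keep every product
        # sparse-matrix-times-vector, so S = F F^T is never materialized.
        v = (F @ (F.T @ v)) / d
        v[seeds] = labels[seeds]         # re-clamp the labeled seeds
    return np.sign(v)
```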

Putting Them Together

• Example: HF + cosine similarity:

S = N_c FF^T N_c

where N_c is the diagonal cosine-normalizing matrix.

• The iteration update becomes:

v^(t+1) = D^(-1) (N_c (F (F^T (N_c v^t))))

Compact storage: we don't need a cosine-normalized version of the feature vectors.

The diagonal matrix D can be calculated as D(i,i) = d(i), where d = N_c FF^T N_c 1.

Construction: O(n). Storage: O(n). Operation: O(n).
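The only new ingredient relative to the previous sketch is N_c, whose diagonal holds the reciprocal row norms of F, so it can be applied as an elementwise scaling. A hedged sketch (again my own illustration):

```python
import numpy as np

def cosine_diag(F):
    """Diagonal of N_c: 1 / ||F(i,:)||, one entry per document
    (assumes no document has an all-zero feature vector)."""
    return 1.0 / np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())

def apply_cosine_similarity(F, nc, v):
    """Compute S v for S = N_c F F^T N_c without ever forming S."""
    return nc * (F @ (F.T @ (nc * v)))
```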

A Framework

• So what's the point?
1. Towards a principled way for researchers to apply GSSL methods to text data, and the conditions under which they can do so efficiently
2. Towards a framework within which researchers can develop and discover new methods (and recognize old ones)
3. Building an SSL tool set: pick one that works for you, according to all the great work that has been done

A Framework

Choose your SSL method…

• Harmonic Functions
• MultiRankWalk

"Hmm… I have a large dataset with very few training labels. What should I try?"
"How about MultiRankWalk with a low restart probability?"

A Framework

… and pick your similarity function:

• Inner Product
• Cosine Similarity
• Bipartite Graph Walk

"But the documents in my dataset are kinda long…"
"Can't go wrong with cosine similarity!"

Results

• On 2 commonly used document categorization datasets
• On 2 less common NP classification datasets
• Goal: to show they do work on large text datasets, with results consistent with what we know about these SSL methods and similarity functions

Results

• Document categorization


Results

• Noun phrase classification dataset:

Questions?


Additional Information

MRW: RWR for Classification

We refer to this method as MultiRankWalk: it classifies data with multiple rankings using random walks
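The transcript does not spell out the update here, but MRW is built on random walk with restart (RWR), run once per class from that class's seeds. A minimal sketch under that reading (my own illustration; the restart probability and iteration count are arbitrary assumptions):

```python
import numpy as np

def multirankwalk(W, seed_lists, restart=0.15, iters=50):
    """MRW sketch: one random walk with restart per class.

    W          -- sparse n x n column-stochastic transition matrix
    seed_lists -- dict mapping class label -> list of seed node indices
    Returns, for each node, the class whose RWR score ranks it highest.
    """
    n = W.shape[0]
    rankings = {}
    for label, seeds in seed_lists.items():
        u = np.zeros(n)
        u[seeds] = 1.0 / len(seeds)     # restart distribution on this class's seeds
        r = u.copy()
        for _ in range(iters):
            r = (1 - restart) * (W @ r) + restart * u
        rankings[label] = r
    labels = list(rankings)
    scores = np.vstack([rankings[c] for c in labels])
    return [labels[i] for i in scores.argmax(axis=0)]
```

Note that each class's walk runs independently, which is one of the contrasts with wvRN drawn later in these slides.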

MRW: Seed Preference

• Obtaining labels for data points is expensive
• We want to minimize the cost of obtaining labels
• Observations:
– Some labels are inherently more useful than others
– Some labels are easier to obtain than others

Question: "authoritative" or "popular" nodes in a network are typically easier to obtain labels for. But are these labels also more useful than others?

Seed Preference

• Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label
• The list (seeds) can be generated uniformly at random, or we can have a seed preference, according to simple properties of the unlabeled data
• We consider 3 preferences (a sketch of the latter two follows):
– Random
– Link Count: nodes with the highest counts make the list
– PageRank: nodes with the highest scores make the list
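Here is a minimal sketch of the two non-random preferences (my own illustration; the talk gives no code, and the damping factor and iteration count are arbitrary assumptions):

```python
import numpy as np
from scipy import sparse

def seed_candidates(A, k, method="pagerank", damping=0.85, iters=50):
    """Rank nodes of adjacency matrix A (sparse n x n) and return the top k."""
    n = A.shape[0]
    degree = np.asarray(A.sum(axis=1)).ravel()
    if method == "linkcount":
        scores = degree                       # link-count preference
    else:                                     # PageRank via power iteration
        out = np.where(degree > 0, degree, 1.0)
        W = sparse.diags(1.0 / out) @ A       # row-stochastic transitions
        r = np.full(n, 1.0 / n)
        for _ in range(iters):
            r = (1 - damping) / n + damping * (W.T @ r)
        scores = r
    return np.argsort(-scores)[:k]            # highest scores make the list
```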

MRW: The Question

• What really makes MRW and wvRN different?
• Network-based SSL often boils down to label propagation.
• MRW and wvRN represent two general propagation methods; note that they are called by many names:

MRW                              wvRN
Random walk with restart         Reverse random walk
Regularized random walk          Random walk with sink nodes
Personalized PageRank            Hitting time
Local & global consistency       Harmonic functions on graphs
                                 Iterative averaging of neighbors

Great… but we still don't know why their behavior differs on these network datasets!

MRW: The Question

• It's difficult to answer exactly why MRW does better with a smaller number of seeds.
• But we can gather probable factors from their propagation models:

    MRW                                       wvRN
1.  Centrality-sensitive                      Centrality-insensitive
2.  Exponential drop-off / damping factor     No drop-off / damping
3.  Propagation of different classes          Propagation of different classes
    done independently                        interacts

MRW: The Question

• An example from a political blog dataset: wvRN vs. MRW scores for how much a blog is politically conservative. (Seed labels were underlined on the original slide; they are the three blogs that wvRN scores at exactly 1.000.)

wvRN:
1.000 neoconservatives.blogspot.com
1.000 strangedoctrines.typepad.com
1.000 jmbzine.com
0.593 presidentboxer.blogspot.com
0.585 rooksrant.com
0.568 purplestates.blogspot.com
0.553 ikilledcheguevara.blogspot.com
0.540 restoreamerica.blogspot.com
0.539 billrice.org
0.529 kalblog.com
0.517 right-thinking.com
0.517 tom-hanna.org
0.514 crankylittleblog.blogspot.com
0.510 hasidicgentile.org
0.509 stealthebandwagon.blogspot.com
0.509 carpetblogger.com
0.497 politicalvicesquad.blogspot.com
0.496 nerepublican.blogspot.com
0.494 centinel.blogspot.com
0.494 scrawlville.com
0.493 allspinzone.blogspot.com
0.492 littlegreenfootballs.com
0.492 wehavesomeplanes.blogspot.com
0.491 rittenhouse.blogspot.com
0.490 secureliberty.org
0.488 decision08.blogspot.com
0.488 larsonreport.com

MRW:
0.020 firstdownpolitics.com
0.019 neoconservatives.blogspot.com
0.017 jmbzine.com
0.017 strangedoctrines.typepad.com
0.013 millers_time.typepad.com
0.011 decision08.blogspot.com
0.010 gopandcollege.blogspot.com
0.010 charlineandjamie.com
0.008 marksteyn.com
0.007 blackmanforbush.blogspot.com
0.007 reggiescorner.blogspot.com
0.007 fearfulsymmetry.blogspot.com
0.006 quibbles-n-bits.com
0.006 undercaffeinated.com
0.005 samizdata.net
0.005 pennywit.com
0.005 pajamahadin.com
0.005 mixtersmix.blogspot.com
0.005 stillfighting.blogspot.com
0.005 shakespearessister.blogspot.com
0.005 jadbury.com
0.005 thefulcrum.blogspot.com
0.005 watchandwait.blogspot.com
0.005 gindy.blogspot.com
0.005 cecile.squarespace.com
0.005 usliberals.about.com
0.005 twentyfirstcenturyrepublican.blogspot.com

1. Centrality-sensitive: seeds have different scores, and not necessarily the highest.
2. Exponential drop-off: much less sure about nodes further away from seeds.
3. Classes propagate independently: charlineandjamie.com is both very likely a conservative and very likely a liberal blog (good or bad?).

We still don't fully understand it.

MRW: Related Work

• MRW is very much related to:
– "Local and global consistency" (Zhou et al. 2004): similar formulation, different view
– "Web content categorization using link information" (Gyongyi et al. 2006): RWR ranking as features to an SVM
– "Graph-based semi-supervised learning as a generative model" (He et al. 2007): random walk without restart, heuristic stopping
• Seed preference is related to the field of active learning:
– Active learning chooses which data point to label next based on previous labels; the labeling is interactive
– Seed preference is a batch labeling method

Authoritative seed preference is a good baseline for active learning on network data!

Results

• How much better is MRW using authoritative seed preference?

[Figure: y-axis: MRW F1 score minus wvRN F1 score; x-axis: number of seed labels per class.]

The gap between MRW and wvRN narrows with authoritative seeds, but remains prominent on some datasets with small numbers of seed labels.

A Web-Scale Knowledge Base

• Read the Web (RtW) project: build a never-ending system that learns to extract information from unstructured web pages, resulting in a knowledge base of structured information.

Noun Phrase and Context Data

• As part of the RtW project, two kinds of noun phrase (NP) and context co-occurrence data were produced:
– NP-context co-occurrence data
– NP-NP co-occurrence data
• These datasets can be treated as graphs

Noun Phrase and Context Data

• NP-context data:

"… know that drinking pomegranate juice may not be a bad …"

NP: pomegranate juice
Before context: know that drinking _
After context: _ may not be a bad

[Diagram: the NP-context graph. NPs such as "pomegranate juice", "JELL-O", and "Jagermeister" link to contexts such as "_ may not be a bad", "_ is made from", and "_ promotes responsible"; co-occurrence counts (e.g. 32, 5821) weight the edges.]
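Since this data is exactly a bipartite graph, it maps directly onto the implicit-manifold machinery above: the sparse NP-context count matrix plays the role of F. A hedged sketch of the construction (my own illustration; the triples and counts are toy values echoing the slide, not real RtW data):

```python
from scipy import sparse

# Toy (NP, context, count) co-occurrence triples.
triples = [
    ("pomegranate juice", "know that drinking _", 32),
    ("pomegranate juice", "_ may not be a bad", 32),
    ("JELL-O", "_ is made from", 5821),
    ("Jagermeister", "_ promotes responsible", 7),
]

nps = sorted({p for p, _, _ in triples})
contexts = sorted({c for _, c, _ in triples})
np_idx = {p: i for i, p in enumerate(nps)}
ctx_idx = {c: j for j, c in enumerate(contexts)}

# Sparse NP-context matrix: rows are NPs, columns are contexts, entries
# are co-occurrence counts. This matrix *is* the bipartite graph, and it
# serves as F in the implicit-manifold updates above.
rows = [np_idx[p] for p, _, _ in triples]
cols = [ctx_idx[c] for _, c, _ in triples]
counts = [n for _, _, n in triples]
F = sparse.csr_matrix((counts, (rows, cols)),
                      shape=(len(nps), len(contexts)))
```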

Noun Phrase and Context Data

• NP-NP data:

"… French Constitution Council validates burqa ban …"

NP: French Constitution Council
Context: validates
NP: burqa ban

[Diagram: the NP-NP graph, with nodes such as "French Constitution Council", "burqa ban", "French Court", "hot pants", "veil", "JELL-O", and "Jagermeister".]

The context can be used for weighting edges or for making a more complex graph.

Noun Phrase Categorization

• We propose using MRW (with path folding) on the NP-context data to categorize NPs, given a handful of seed NPs.
• Challenges:
– Large, noisy dataset (10m NPs, 8.6m contexts from 500m web pages)
– What's the right function for NP-NP categorical similarity?
– Which learned category assignments should we "promote" to the knowledge base?
– How to evaluate it?


Noun Phrase Categorization

• Preliminary experiment:
– Small subset of the NP-context data:
• 88k NPs
• 99k contexts
– Find the category "city":
• Start with a handful of seeds
– Ground truth set of 2,404 city NPs created using top related