Data Mining, Search and Other Stuff. Amos Fiat, Tel Aviv University. Mostly based on joint work with Azar, Karlin, McSherry and Saia (STOC 2001), and Achlioptas, Karlin and McSherry (FOCS 2001).


Page 1: Data Mining, Search and Other Stuff

Data Mining, Search and Other Stuff

Amos Fiat

Tel Aviv University

Mostly based on joint work with

Azar, Karlin, McSherry and Saia (STOC 2001), and

Achlioptas, Karlin and McSherry (FOCS 2001)

Page 2: Data Mining, Search and Other Stuff

What this talk is about

• Introduce Data Mining and specific problems:
– Document Classification
– Collaborative Filtering
– Web Search

• Describe LSA

• Provable probabilistic generative models:
– Papadimitriou et al.
– Generalizations, capturing document indexing and other problems (collaborative filtering)

• Web search:
– Google, HITS
– New Web search algorithm (Smarty Pants)
– Generative model from which Smarty Pants is derived
– Sketch of proof

Page 3: Data Mining, Search and Other Stuff

What is Data Mining?

• First SIAM International Conference on Data Mining, April 5-7, 2001 (from the Call for Papers):

– Advances in information technology and data collection methods have led to the availability of large data sets in commercial enterprises and in a wide variety of scientific and engineering disciplines…

– …The field of data mining draws upon extensive work in areas such as statistics, machine learning, pattern recognition, databases, and high performance computing to discover interesting and previously unknown information in data sets…

Page 4: Data Mining, Search and Other Stuff

From SIAM CFP:

Topics of Interest:

– Methods and Paradigms:
• …
• Mining high-dimensional data …
• Collaborative filtering …
• Data cleaning and pre-processing …

– Applications:
• …
• Web data …
• Financial and e-commerce data …
• Text, document, and multimedia data …

– Human Factors and Social Issues …

Page 5: Data Mining, Search and Other Stuff

Example: Document Classification / Search / Similarity

• Classify documents in some meaningful way

• Find documents by search terms, find similar documents

Page 6: Data Mining, Search and Other Stuff

Example: Collaborative Filtering

• Gather information on Supermarket purchases

• Make good recommendations to customer at checkout. Good = likely to purchase

• With/Without customer identification

Page 7: Data Mining, Search and Other Stuff

2nd Example: Collaborative Filtering

• Movie recommendations

• User gives some input on movies he/she likes/dislikes

• The user does not know his/her grade for movies not yet seen

Page 8: Data Mining, Search and Other Stuff

Example: Web Search / Scientific Citation Search

• Classify documents in some meaningful way

• Find documents by search terms, find similar documents

• Find High Quality documents

Page 9: Data Mining, Search and Other Stuff

Latent Semantic Analysis

• Deerwester, Dumais, Landauer, Furnas, Harshman, Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990.

• Supposed to work rather well in practice. See http://lsa.colorado.edu -

Latent Semantic Analysis @ CU Boulder

Page 10: Data Mining, Search and Other Stuff

Idea of LSA

• Embed a corpus of documents into a low-dimensional "semantic" space by computing a good low-rank approximation to the term-document matrix.

• Compute document/keyword relationships in this low-dimensional space.

• Intuition: forcing a low-rank representation maintains only usage patterns that correspond to strong linear trends.

• Every action must be due to one or other of seven causes: chance, nature, compulsion, habit, reasoning, anger, or appetite

—Aristotle, Rhetoric, Bk. II
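The LSA recipe above can be sketched with a truncated SVD. The toy term-document matrix below is invented for illustration (any real corpus matrix would do); documents 0-1 are about cars, 2-3 about cats:

```python
import numpy as np

# Invented toy term-document matrix (terms x documents).
A = np.array([
    [2, 1, 0, 0],   # "car"
    [1, 2, 0, 0],   # "auto"
    [1, 1, 0, 0],   # "engine"
    [0, 0, 2, 1],   # "cat"
    [0, 0, 1, 2],   # "kitten"
], dtype=float)

# LSA: replace A by its best rank-k approximation (truncated SVD).
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def cos(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Document similarity in the low-rank "semantic" space: the two car
# documents collapse onto one direction, the cat documents onto another.
print(cos(A_k[:, 0], A_k[:, 1]))   # near 1
print(cos(A_k[:, 0], A_k[:, 2]))   # near 0
```

The low-rank projection is exactly the "strong linear trends" intuition: after truncation, each document is represented only by its weight on the dominant usage patterns.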

Page 11: Data Mining, Search and Other Stuff

Let's prove that LSA works

• What's there to prove?

• Papadimitriou, Raghavan, Tamaki, Vempala, Latent Semantic Indexing: A Probabilistic Analysis, PODS, 1997

– Introduce a probabilistic generative model for documents. Real documents are an instantiation of this probabilistic process.

– LSA effectively reconstructs the probabilistic model (if the model is very, very simple: a block matrix). If you know all the probabilistic parameters used to generate documents, then classification, similarity, etc. are obvious.

• Our contribution (STOC 2001):
– very, very simple -> simple
– block matrix -> arbitrary matrix

Page 12: Data Mining, Search and Other Stuff

Our (somewhat) More General Model

[Pipeline diagram: a generative data model produces entries mij of a matrix M; an error process Z adds noise, giving mij + zij; a probabilistic omission step then outputs mij + zij with probability pij and "?" with probability 1-pij, yielding the observed matrix A*.]

• Entries mij are generated from arbitrary (unknown) distributions with bounded deviation

• Errors are introduced, entries are omitted

• Can be used to model documents/terms, customers/preferences, web sites/links and web sites/terms (we will use yet another model for the web)

• Theorem: Given A*, we can compute the expected values of the mij's, under certain conditions.
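The pipeline can be simulated end to end. The sketch below is illustrative only (the rank, noise, and omission parameters are invented, and omitted entries are stored as 0); it rescales the observed matrix by 1/p and projects onto the top singular subspace to recover the expectations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, p = 400, 3, 0.8

# Hypothetical generative model: a rank-k matrix of expected values m_ij.
M = rng.uniform(0.5, 1.5, (n, k)) @ rng.uniform(0.5, 1.5, (k, n))

# Error process: bounded independent noise z_ij.
Z = rng.uniform(-1, 1, (n, n))

# Probabilistic omission: each noisy entry survives with probability p
# (the "?" entries are stored as 0 here).
mask = rng.random((n, n)) < p
A_star = np.where(mask, M + Z, 0.0)

# Recovery sketch: rescale for the omission and keep only the top-k
# singular directions, which suppresses the additive error.
U, s, Vt = np.linalg.svd(A_star / p, full_matrices=False)
M_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Mean per-entry error of the recovered expectations, relative to the
# typical entry size of M.
err = np.abs(M_hat - M).mean() / np.abs(M).mean()
print(err)
```

The "certain conditions" of the theorem show up concretely here: recovery works because the signal's singular values grow like n while the noise's 2-norm grows only like sqrt(n).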

Page 13: Data Mining, Search and Other Stuff

Use for Collaborative Filtering

[Same generative pipeline as before, instantiated for collaborative filtering: the model M produces entries mij; the error process Z perturbs some of them; probabilistic omission (with probabilities P) hides some entries, yielding the observed matrix A*.]

Generative model (customer preferences), the matrix of probabilities:

0.5  0.7  0.3  0.01 0.2
0.1  0.2  0.3  0.4  0.5
0.15 0.25 0.35 0.45 0.55
0.1  0.3  0.4  0.5  0.5
0.2  0.2  0.2  0.2  0.2

Cart product matrix or movie like/dislike matrix, an instantiation mij:

0 1 1 0 0
0 1 0 0 1
0 0 0 1 1
0 0 1 1 0
1 0 0 0 0

The error process Z then flips some entries, and items go missing, giving the observed A*:

0 ? 1 ? 0
0 1 ? 0 1
0 1 ? ? 1
0 ? 1 1 0
1 0 ? 0 ?

Page 14: Data Mining, Search and Other Stuff

Supermarket Collaborative Filtering

• Build an n x m 0/1 matrix with one row per cart and one column per product

• Place m/(m-ri) in entry i,j if cart i contained product j, 0 otherwise
– ri = number of items in cart i

• Add a last row for the current cart

• Take the SVD of the matrix; discard all singular values less than (#carts + #products)^{1/2}

• Read out the current customer's preferences from the last row of the low-rank matrix

• Theorem: If customer preferences are a low-rank matrix (with large singular values), then this algorithm is guaranteed to give approximately the correct result.
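A minimal numpy rendering of these steps on synthetic carts. The rank-2 preference matrix is invented, and reading the slide's cutoff as sqrt(#carts + #products) (the Furedi-Komlos scale) is an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 2000, 50, 2     # carts, products, (assumed) preference rank

# Hypothetical rank-k purchase-probability matrix, scaled into [0, 1].
P = rng.uniform(0, 1, (n, k)) @ rng.uniform(0, 1, (k, m))
P /= P.max()

carts = (rng.random((n, m)) < P).astype(float)    # instantiated carts

# Weighted matrix from the slide: entry i,j = m/(m - r_i) if cart i
# contained product j, where r_i = number of items in cart i.
r = carts.sum(axis=1)
B = carts * (m / (m - r))[:, None]

# Current (partial) cart, appended as the last row.
current = (rng.random(m) < P[0]).astype(float)
B_full = np.vstack([B, current * (m / (m - current.sum()))])

# SVD; discard singular values below sqrt(#carts + #products).
U, s, Vt = np.linalg.svd(B_full, full_matrices=False)
keep = s >= np.sqrt(B_full.shape[0] + B_full.shape[1])
low_rank = (U[:, keep] * s[keep]) @ Vt[keep, :]

# Read the current customer's estimated preferences off the last row and
# recommend the highest-scoring product not already in the cart.
prefs = low_rank[-1]
recommend = int(np.argmax(np.where(current == 0, prefs, -np.inf)))
print(recommend)
```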

Page 15: Data Mining, Search and Other Stuff

Main Idea:

• The matrix of user product preferences can be viewed as proportional to the sum of the matrix of expected cart contents plus an additive "error" matrix.

• The additive error matrix may be (relatively) large in (say) Frobenius norm. However, it will be small (with high probability) in terms of 2-norm.
– Requires Furedi, Komlos, (and later) Boppana's result that a matrix of independent random variables has small 2-norm.
– 2-norm: max{ |Mu| : |u| = 1 }

• Discarding the singular vectors with small singular values effectively removes the "error" contribution from the matrix.
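The norm gap is easy to see numerically. This is a generic illustration (a random sign matrix, not data from the paper): the Frobenius norm of an n x n matrix of independent +-1 errors is exactly n, while its 2-norm concentrates around 2*sqrt(n):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
E = rng.choice([-1.0, 1.0], size=(n, n))   # independent bounded errors

fro = np.linalg.norm(E, 'fro')   # Frobenius norm: exactly n for +-1 entries
two = np.linalg.norm(E, 2)       # 2-norm = largest singular value

print(fro, two)   # fro = 500; two is only about 2*sqrt(500), roughly 45
```

This is exactly why truncating small singular values removes the error: the signal's singular values scale like n, the noise's like sqrt(n).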

Page 16: Data Mining, Search and Other Stuff

Web Search Issues

• What pages are relevant?

• What pages are high quality?

• Huge amount of research….

Page 17: Data Mining, Search and Other Stuff

Google [Brin & Page] and HITS [Kleinberg]

• Relevance:
– Google: Documents are potentially relevant if they contain the search terms.
– HITS: Documents are potentially relevant if they are in a "neighborhood" of pages containing the search terms.

• Quality:
– Google: Universal query-independent measure of quality called PageRank; essentially "normalized popularity".
– HITS: Quality is a more complex function of the "associated documents"; compute an "authority" and a "hub" score for each page. Quality = authority.

Page 18: Data Mining, Search and Other Stuff

Determining Quality: Google and HITS

• Google (Simplified): Quality is derived from the quality of the pages linking into a page. For a page p with in-links from pages q and s:

Q(p) = Q(q)/outdegree(q) + Q(s)/outdegree(s)

• HITS: Quality is derived from a subset of "associated pages", where every page has 2 quality measures: authority quality A and hub quality H. For a page p with in-links from q and s, and out-links to r and t:
– A(p) = H(q) + H(s)
– H(p) = A(r) + A(t)

[Diagram: example graph with pages q, s linking into p, and p linking out to r and t]
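Both recurrences can be sketched as power iterations on a toy graph (the 4-page graph and the 0.85 damping factor, standing in for PageRank's periodic restarts, are illustrative assumptions):

```python
import numpy as np

# Toy web graph (hypothetical): W[i, j] = 1 if page i links to page j.
W = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
n = W.shape[0]

# Simplified PageRank: power iteration on the random-walk matrix,
# with a damping factor standing in for the periodic restarts.
d = 0.85
P = W / W.sum(axis=1, keepdims=True)     # row-stochastic walk matrix
pr = np.full(n, 1.0 / n)
for _ in range(100):
    pr = (1 - d) / n + d * (pr @ P)

# HITS: alternate the mutual recurrence a = W^T h, h = W a,
# normalizing each round; "a" converges to the authority scores.
a = np.ones(n)
h = np.ones(n)
for _ in range(100):
    a = W.T @ h
    a /= np.linalg.norm(a)
    h = W @ a
    h /= np.linalg.norm(h)

print(pr)   # page 2, with the most in-links, scores highest
print(a)    # authority scores; again page 2 dominates
```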

Page 19: Data Mining, Search and Other Stuff

Potential Issues with Google and HITS

– Both may have problems with:

• Polysemy: “Bug” could be an insect, a software problem, a listening device, to bother someone, etc.

• Synonymy: “Terrorist” and “Freedom Fighter” refer to the same thing.

– What is the basis for the heuristic used to choose "associated pages" in HITS?

– Does it make sense to determine quality based on contributions from pages on irrelevant topics, as Google does?

Key question: what are the mathematical conditions under which these algorithms work?

Page 20: Data Mining, Search and Other Stuff

Key Questions

• What would the web have to look like for these algorithms to give the “right” answer?

• What are the mathematical conditions under which these algorithms work?

Preview of Answer:

if the web is rank 1.

Page 21: Data Mining, Search and Other Stuff

The Rest of this Talk

• A new (entirely untested) web search algorithm: Smarty Pants

• A unified mathematical model for web link structure, document content, and human generated queries

• Proof that algorithm gives an approximately correct result given the model

Page 22: Data Mining, Search and Other Stuff

Modeling the Web and a new Algorithm (Smarty Pants)

• We define a common probabilistic generative model for:
– the link structure of the web
– the term content of web documents
– the query generation process

• Each component can be generated by the previous "Our (somewhat) more general model"

• Our algorithm is entirely derived from the model.

• If the model describes reality, our algorithm is guaranteed to give the correct answer.

Page 23: Data Mining, Search and Other Stuff

New Algorithm

Inputs:
– n, the total number of web pages
– l, the total number of terms
– W, the web graph: W(i,j) = #links from page i to page j
– S, the document/term matrix: S(i,j) = #occurrences of term j in document i
– q, the query vector: q(j) = #occurrences of term j in the query

[Diagram: small example web graph on pages v1, v2, v3]

Page 24: Data Mining, Search and Other Stuff

Smarty Pants

Query Independent Part:

– Find a "good" low rank approximation to the matrix W, say Wr.

– Find a "good" low rank approximation to the matrix M = (W^T | S), say Mm.

– Compute the pseudo-inverse of Mm, say Mm^-1.

Page 25: Data Mining, Search and Other Stuff

Smarty Pants Query Independent Part:

– Compute the Singular Value Decompositions of W and of M = (W^T | S) in R^{n x (n+l)}:

W = U_W Σ_W V_W^T ;  M = U_M Σ_M V_M^T

– Let m be the cutoff rank for M, and r the cutoff rank for W, chosen so that the discarded singular values are small relative to the retained ones

– Let Mm be the rank-m SVD approximation to M

– Let Wr be the rank-r SVD approximation to W

– Let Mm^-1 be the pseudo-inverse of Mm, keeping only the top m singular triples:

Mm^-1 = V_M,m (Σ_M,m)^-1 U_M,m^T

Page 26: Data Mining, Search and Other Stuff

Smarty Pants, cont.

Query Dependent Part:

– Let q be the characteristic vector of the query: q(i) = #occurrences of term i in the query

– q'^T = [0^n | q^T]

– Compute w = q'^T Mm^-1 Wr

– Output pages p in order of decreasing w(p)
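The two parts combine into a short pipeline. The sketch below runs it in the idealized, noise-free case: it feeds in the model's expected matrices E(W), E(S), E(q) (with invented parameters) and assumes the cutoff ranks m = 2k, r = k are known by construction; in this setting w reproduces v^T A^T exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
n, l, k = 300, 120, 3

# Hypothetical model parameters: page topics H, A; term distributions SH, SA.
H  = rng.uniform(0, 1, (n, k))
A  = rng.uniform(0, 1, (n, k))
SH = rng.uniform(0, 1, (l, k))
SA = rng.uniform(0, 1, (l, k))

# Idealized inputs: the *expected* link, term, and query matrices.
W = H @ A.T                        # E(W)
S = H @ SH.T + A @ SA.T            # E(S)
M = np.hstack([W.T, S])            # M = (W^T | S)
v = rng.uniform(0, 1, k)           # the searcher's hidden topic
q = SH @ v                         # E(q), i.e. q'^T = v^T SH^T

# Query-independent part: truncated SVDs and the pseudo-inverse of Mm.
m, r = 2 * k, k                    # signal ranks, known here by construction
U, s, Vt = np.linalg.svd(M, full_matrices=False)
Mm_pinv = Vt[:m].T @ np.diag(1 / s[:m]) @ U[:, :m].T
Uw, sw, Vwt = np.linalg.svd(W, full_matrices=False)
Wr = Uw[:, :r] @ np.diag(sw[:r]) @ Vwt[:r]

# Query-dependent part: w = q'^T Mm^-1 Wr with q' = (0^n | q).
q_prime = np.concatenate([np.zeros(n), q])
w = q_prime @ Mm_pinv @ Wr

w_true = A @ v                     # the perfect hub v^T A^T
print(np.allclose(w, w_true))      # True in the noise-free case
```

The reason this works is the page-45 derivation: q'^T Mm^-1 recovers a row combination x with x^T A = 0 and x^T H = v^T, and applying x to W yields v^T A^T.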

Page 27: Data Mining, Search and Other Stuff

An Alternative View

• This algorithm is provably equivalent to:

– Take the query vector and determine the topic the human searcher is interested in

– Output the documents in order of their quality on the specific topic that the user is interested in

• We call this synthesizing a perfect hub for the topic.

• Topics, quality and hubs have not been defined.

• In fact, it is provably impossible to determine the topic from the inputs available.

Page 28: Data Mining, Search and Other Stuff

Inspirations for Model

• Latent Semantic Analysis [Deerwester et al] and models thereof [Papadimitriou et al]

• PLSI [Hofmann]
• PHITS [Cohn & Chang]
• Combined model of [Cohn & Hofmann]

• the list goes on and on…

Page 29: Data Mining, Search and Other Stuff

The Model: Concepts and Topics

• There exist k fundamental concepts (latent semantic categories) that can be used to describe reality

• How large k is and what the concepts mean is unknown

• A topic is a k-tuple describing the relative proportion of the fundamental concepts in the topic

• Two k-tuples that are scalar multiples of each other refer to the same topic

Page 30: Data Mining, Search and Other Stuff

The Model: Web Pages

• Every web page p has two k-tuples associated with it:

– Its authority topic A(p) captures the content on which this page is an authority, and therefore influences incoming links to this page.
• E.g., authority on Linux

– Its hub topic H(p), i.e., the topic of the outgoing links.
• E.g., hub on Microsoft Bashing

• H is the n by k matrix whose p-th row is H(p), with entries H_{1,1} … H_{n,k}.

• A is the n by k matrix whose p-th row is A(p), with entries A_{1,1} … A_{n,k}.

Page 31: Data Mining, Search and Other Stuff

The Model: Link Generation

• The model assumes that the number of links from page p to page q is a random variable with expected value <H(p), A(q)>

• Intuition: the more closely aligned the hub topic of page p is with the topic on which q is an authority, the more likely it is that there will be a link from p to q.

• The web link matrix W in {0,1}^{n x n} is an instantiation of the probabilistic process defined by H A^T

Page 32: Data Mining, Search and Other Stuff

Terms: Authority and Hub

• Model allows general term distributions for any topic.

• Model allows for the possibility of different uses of terminology in an authoritative sense and in a hub sense. Example:
– Hubs on Microsoft may refer to the "Evil Empire", whereas few of Microsoft's own sites will use this term.

• For hub terminology, think: anchor text.

Page 33: Data Mining, Search and Other Stuff

Terms and Topics

Associated with each term t are two distributions:

– Use as an authoritative term, given by the k-tuple SA(t): its i'th coordinate is the expected number of occurrences of term t in a pure authority on the i'th concept.

– Use as a hub term, given by the k-tuple SH(t): its i'th coordinate is the expected number of occurrences of term t in a pure hub on the i'th concept.

• SH (resp. SA) is the l by k matrix whose t-th row is SH(t) (resp. SA(t))

Page 34: Data Mining, Search and Other Stuff

Document/Term Structure

Terms on a page with authority topic A(p) and hub topic H(p) are generated from a distribution where

Expected(#occurrences of term t in p) = <H(p), SH(t)> + <A(p), SA(t)>

The document-term matrix S is an instantiation of the probabilistic process defined by the matrix H SH^T + A SA^T

Page 35: Data Mining, Search and Other Stuff

The Model: Query Generation

• The searcher chooses a k-tuple v representing the topic he wants to search, and computes q'^T = v^T SH^T

• q'[u] is the expected number of occurrences of term u in a pure hub page on topic v.

• The searcher decides whether or not to include term u among the search terms by sampling from a distribution with expectation q'[u]

• Result: a query q which is an instantiation of the random process.

Example: v = (0.001, 0.002, 0.001)
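This query-generation step can be sketched directly. Poisson sampling and the scaled-up topic vector are assumptions of the sketch (the slide's tiny v = (0.001, 0.002, 0.001) would give negligible expected counts); as a sanity check, with enough terms the hidden topic is recoverable from the query by least squares:

```python
import numpy as np

rng = np.random.default_rng(4)
l, k = 5000, 3

# Hypothetical hub term distributions: SH is l terms by k concepts.
SH = rng.uniform(0.5, 1.5, (l, k))

# Query topic, scaled up from the slide's example so that expected
# term counts are non-negligible.
v = np.array([1.0, 2.0, 1.0])

q_expected = SH @ v                          # q'[u] = <v, SH(u)>
q = rng.poisson(q_expected).astype(float)    # sampled search-term counts

# Sanity check of the model: recover the hidden topic (up to sampling
# noise) by least squares against SH.
v_hat, *_ = np.linalg.lstsq(SH, q, rcond=None)
print(v_hat)   # close to (1, 2, 1)
```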

Page 36: Data Mining, Search and Other Stuff

A Perfect Hub:

• The correct search results are the pages ordered by their authoritativeness on topic v

• w = v^T A^T gives the relative authority of all pages on topic v

Example: v = (0.001, 0.002, 0.001)

Page 37: Data Mining, Search and Other Stuff

Model Summary

• Documents have 2 k-tuples, one for the topic on which the document is an authority, one for the topic on which the document is a hub

• Terms also have 2 k-tuples, one for the use of the term in an authoritative context and one for the use of the term in a hub context

• Humans generate queries by first choosing a k-tuple representing the topic of the query, and then choosing search terms using the hub term dist'n for the topic.

• The correct answer is now well defined: it's the sites ordered by authoritativeness on the topic, i.e., a perfect hub

• The real web (links and content) and real queries are derived by an instantiation of the probabilistic model

Page 38: Data Mining, Search and Other Stuff

There exist: H (web hub topics), A (web authority topics), SH (hub term dist'n), SA (authority term dist'n), and v (query topic), such that:

• Link structure W: instantiation of H A^T = E(W)

• Doc-term S: instantiation of H SH^T + A SA^T = E(S)

• Query q: instantiation of v^T SH^T = E(q)

Goal: Given W, S, q, we want to compute a good approximation to v^T A^T, the vector of authoritativeness of pages on topic v.
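The full pipeline can be run on an instantiated model rather than on expectations. In the sketch below, W, S and q are independent Poisson draws around their expected values (the Poisson choice and the exaggerated magnitudes, chosen so the signal's singular values dominate the sampling noise, are assumptions of the sketch), and the algorithm's answer is compared against the goal v^T A^T:

```python
import numpy as np

rng = np.random.default_rng(5)
n, l, k = 2000, 400, 2

# Hypothetical model parameters with deliberately large magnitudes.
H  = rng.uniform(1, 10, (n, k))
A  = rng.uniform(1, 10, (n, k))
SH = rng.uniform(1, 10, (l, k))
SA = rng.uniform(1, 10, (l, k))
v  = rng.uniform(1, 10, k)

# Instantiate the model: W, S, q are independent Poisson draws around
# their expectations.
W = rng.poisson(H @ A.T).astype(float)
S = rng.poisson(H @ SH.T + A @ SA.T).astype(float)
q = rng.poisson(SH @ v).astype(float)

M = np.hstack([W.T, S])
m, r = 2 * k, k                    # cutoff ranks, known in this synthetic run

U, s, Vt = np.linalg.svd(M, full_matrices=False)
Mm_pinv = Vt[:m].T @ np.diag(1 / s[:m]) @ U[:, :m].T
Uw, sw, Vwt = np.linalg.svd(W, full_matrices=False)
Wr = Uw[:, :r] @ np.diag(sw[:r]) @ Vwt[:r]

w = np.concatenate([np.zeros(n), q]) @ Mm_pinv @ Wr   # algorithm's answer
w_true = A @ v                                         # v^T A^T, the goal

corr = np.corrcoef(w, w_true)[0, 1]
print(corr)
```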

Page 39: Data Mining, Search and Other Stuff

About the Model

• The model is fairly general

• This is an advantage, not a disadvantage

– The more powerful the model, the greater the flexibility in using the model to approximate reality

– If reality is indeed simpler than the full generality that the model allows, then the results still hold; we don't need to use the full flexibility of the model (e.g., the case H = A is particularly easy to deal with).

Page 40: Data Mining, Search and Other Stuff

Main Theorem

If the web link structure W and the document-term matrix S are generated according to the model, and some other technical conditions hold,

then w.h.p., for any query q generated according to the model with sufficiently many terms,

our algorithm produces an answer w' (= q'^T Mm^-1 Wr) such that

for 1 - o(1) of the entries, the correct answer is produced up to lower-order terms,

i.e., |w'(i) - w(i)| = o(|w(i)|), where w (= v^T A^T) is the correct answer to the query.

Page 41: Data Mining, Search and Other Stuff

Sufficiently Many Terms?

• We assume that k, the number of fundamental concepts, is a constant.

• The number of query terms required to guarantee success with high probability depends on k and the singular values of M =(WT|S) and W.

• If the singular values are sufficiently high then we only require a constant number of terms in the query.

• For example, if the singular values follow a Zipf law, i.e., the i'th singular value of the Web is proportional to n/i, then we only need a constant number of terms.

• The faster the singular values drop, the worse our guarantee, but the algorithm still works for a wide range (with ever increasing query term requirements).

Page 42: Data Mining, Search and Other Stuff

Proof Techniques:

• Key idea #1: Instantiation of a random variable can be viewed as an additive error process:

E(W) = H A^T ;  W = H A^T + Error

• Key idea #2: In many cases, the effect of a random error can be estimated (as a function of various spectral properties of the underlying matrices)

• Incredibly useful: A' = A + E, where A is rank k with large singular values => A'_k, the best rank-k approximation to A', is essentially equal to A
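The "incredibly useful" fact is easy to check numerically. The example below is synthetic (a random rank-k matrix with large singular values plus bounded noise, all parameters illustrative): the noise matrix E is far from zero, yet the best rank-k approximation of A + E essentially recovers A:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 400, 3

# A: rank-k matrix with large singular values (synthetic example).
A = rng.normal(0, 1, (n, k)) @ rng.normal(0, 1, (k, n)) * 3
E = rng.uniform(-1, 1, (n, n))     # bounded independent "instantiation" error

A_noisy = A + E
U, s, Vt = np.linalg.svd(A_noisy, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approx of A + E

# Relative sizes (Frobenius norm): the noise is sizeable, the residual
# after rank-k truncation is much smaller.
rel_recovery = np.linalg.norm(A_k - A) / np.linalg.norm(A)
rel_noise = np.linalg.norm(E) / np.linalg.norm(A)
print(rel_recovery, rel_noise)
```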

Page 43: Data Mining, Search and Other Stuff

Proof Techniques (Cont.)

• Key idea #3:
– Forget that the model has:
• the real web link structure derived via a random process,
• real document content derived via a random process,
• a real query derived via a random process.
– Imagine that we had the original (non-instantiated) Expected(web matrix), Expected(document/term matrix), and Expected(query vector)

• Key idea #4: Pray that the errors introduced by this "forgetfulness" are amenable to analysis and some magical matrix perturbation theorems from Stewart and Sun can be applied.

Page 44: Data Mining, Search and Other Stuff

Proof Techniques – Synthesizing a Perfect Hub

• So, imagine that we have:
– Expected(web matrix) = E(W) = H A^T
– Expected(document/term matrix) = E(S) = H SH^T + A SA^T
– Expected(query) = v^T SH^T
– E(M) = (E(W^T) | E(S))

• What the algorithm tries to do is:
– Find a linear combination of the rows of E(M) that gives (0^n | v^T SH^T).
– Apply the same linear combination to the rows of E(W).

Page 45: Data Mining, Search and Other Stuff

Proof Techniques – Synthesizing a Perfect Hub

What? I.e., find a hub term distribution (with no authority content) giving the query distribution. Simultaneously derive the required perfect hub on the query topic.

In expectation, E(M) = ( E(W^T) | E(S) ) = ( A H^T | H SH^T + A SA^T ).

A typical linear combination of the rows of E(M), with coefficient vector x, is

x^T E(M) = ( x^T A H^T | x^T H SH^T + x^T A SA^T )

and the target is ( 0^n | q^T ) = ( 0^n | v^T SH^T ).

Since A and H each have rank k, there is an x with x^T A = 0 and x^T H = v^T, and for this x the linear combination above equals the target. Applying the same x to the rows of E(W) gives

x^T E(W) = x^T H A^T = v^T A^T,

which is exactly the perfect hub for topic v.

Page 46: Data Mining, Search and Other Stuff

Revisiting Google and HITS

• Google's authority vector is the primary left eigenvector of the stochastic matrix representing a random walk on the web matrix W (ignoring periodic restarts).

• HITS's authority vector is the primary right singular vector of the web matrix W (ignoring the "associated pages" issue), which is essentially the same as the primary right singular vector of the expected web matrix, E(W), under our model when k = 1.

• But, if rank(E(W)) = 1, then the primary right singular vector of E(W) and the primary left eigenvector of the stochastic matrix associated with E(W) are one and the same.

• I.e., for a rank-one model, our algorithm, HITS and Google all coincide.
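The rank-one coincidence is easy to verify numerically. Below, a rank-1 expected web E(W) = h a^T is assumed (h, a are invented positive vectors); the primary right singular vector and the random-walk stationary distribution both come out proportional to a:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50

# Assumed rank-1 model: E(W) = h a^T for positive vectors h, a.
h = rng.uniform(0.5, 1.5, n)
a = rng.uniform(0.5, 1.5, n)
EW = np.outer(h, a)

# HITS-style authority: primary right singular vector of E(W).
U, s, Vt = np.linalg.svd(EW)
hits_auth = Vt[0] / Vt[0].sum()    # fix scale (and sign) for comparison

# PageRank-style (no restarts): stationary distribution of the random walk.
P = EW / EW.sum(axis=1, keepdims=True)     # row-stochastic walk matrix
pi = np.full(n, 1.0 / n)
for _ in range(100):
    pi = pi @ P

print(np.allclose(hits_auth, pi))   # True: both equal a / a.sum()
```

The walk matrix of h a^T has every row equal to a / sum(a), so the stationary distribution is reached in one step, and it is also the top right singular direction.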

Page 47: Data Mining, Search and Other Stuff

A Few Obvious Limitations of the Model

• All correlations are linear

• Entries in various matrices are instantiated independently.

• The inner product measure of quality means that more authoritative pages on a related topic may be chosen over less authoritative pages more closely aligned with the topic. We have a heuristic suggestion as to how to deal with this issue using a recursive approach.

• The model simply disallows a page like the "50 Worst Computer Science Sites". Such a site has a hub topic of Computer Science and therefore, by the model, will more likely point to good authorities in Computer Science.

Page 48: Data Mining, Search and Other Stuff

Summary

Use of generative probabilistic models to:

– Prove the correctness of algorithms (LSA)

– Understand when algorithms work (Google, HITS)

– Generate new provably correct algorithms:
• Collaborative Filtering
• Web Search: Smarty Pants

• Generative models can have varying complexity; the stronger the better

Page 49: Data Mining, Search and Other Stuff

Future Work

• Test the algorithm….

Page 50: Data Mining, Search and Other Stuff

Some Bibliography (Apologies Extended to all Omitted References)

• Kleinberg, Authoritative Sources in a Hyperlinked Environment, JACM, 1999

• Page, Brin, Motwani, Winograd, The PageRank Citation Ranking: Bringing Order to the Web, 1998

• Brin, Page, The Anatomy of a Large-Scale Web Search Engine, 1998

• Chakrabarti, Dom, Gibson, Kleinberg, Kumar, Raghavan, Rajagopalan, Tomkins, Hypersearching the Web, Scientific American, June 1999 (also Computer, 1998)

• Deerwester, Dumais, Landauer, Furnas, Harshman, Indexing by Latent Semantic Analysis, 1990

• Papadimitriou, Raghavan, Tamaki, Vempala, Latent Semantic Indexing: A Probabilistic Analysis, PODS, 1997

• Azar, Fiat, Karlin, McSherry, Saia, Spectral Analysis of Data, STOC 2001

• Boppana, Eigenvalues and Graph Bisection, 28th FOCS, 1987

• Stewart and Sun, Matrix Perturbation Theory, 1990