Help People Find What They Don't Know
Hao Ma, 16-10-2007, CSE, CUHK



TRANSCRIPT

Page 1: Help People Find What They Don't Know

Help People Find What They Don't Know

Hao Ma
16-10-2007
CSE, CUHK

Page 2: Help People Find What They Don't Know

Questions

• How many times do you search every day?

• Have you ever clicked to the 2nd page of the search results, or the 3rd, 4th, ..., 10th page?

• Do you have trouble choosing the right words to represent your search queries?

• ...

Page 3: Help People Find What They Don't Know

Goal of a Search Engine

• Retrieve the documents that are "relevant" to the user query
  – Doc: Word or PDF file, web page, email, blog, book, ...
  – Query: the "bag of words" paradigm (see the sketch after this slide)
  – Relevant?
    • A subjective and time-varying concept
    • Users are lazy
    • Selective queries are difficult to compose
    • Web pages are heterogeneous, numerous, and change frequently

Web search is a difficult, cyclic process (Query → Results → Analysis → ...)
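A minimal sketch of the "bag of words" idea mentioned above: both the query and each document are reduced to term multisets, and relevance is approximated by raw term overlap (the documents, query, and toy scoring below are illustrative assumptions; real engines add stemming, IDF weighting, and link analysis).

```python
from collections import Counter

def bag_of_words(text):
    """Reduce text to a multiset of lowercase terms; word order is discarded."""
    return Counter(text.lower().split())

def score(query, doc):
    """Toy relevance: total frequency of the query terms inside the document."""
    q, d = bag_of_words(query), bag_of_words(doc)
    return sum(d[term] for term in q)

docs = {
    "d1": "CUHK CSE seminar on web search and search engines",
    "d2": "car rental in Finland: prices and booking",
}
query = "web search"
print(sorted(docs, key=lambda name: score(query, docs[name]), reverse=True))
# ['d1', 'd2'] -- d1 shares more query terms than d2
```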

Page 4: Help People Find What They Don't Know

User Needs

• Informational – want to learn about something (~40%)

• Navigational – want to go to that page (~25%)

• Transactional – want to do something (~35%)
  – Access a service
  – Downloads
  – Shop

• Gray areas
  – Find a good hub
  – Exploratory search: "see what's there"

Example queries: SVM, Cuhk, Haifa weather, Mars surface images, Canon DC, Car rental in Finland

Page 5: Help People Find What They Don't Know

Queries

• Ill-defined queries
  – Short
    • 2001: 2.54 terms on average
    • 80% have fewer than 3 terms
  – Imprecise terms
  – 78% are not modified

• Wide variance in
  – Needs
  – Expectations
  – Knowledge
  – Patience: 85% look at only 1 page

Page 6: Help People Find What They Don't Know

Different Coverage

Google vs. Yahoo:
• They share 3.8 results in the top 10, on average
• They share 23% of their results in the top 100, on average
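To make these overlap figures concrete, a small illustration of how such an overlap can be counted (the result lists below are invented; only the counting logic matters):

```python
def shared(results_a, results_b, k=10):
    """Count how many URLs appear in the top-k lists of both engines."""
    return len(set(results_a[:k]) & set(results_b[:k]))

# Invented top-10 result lists for the same query on two engines.
engine_a = [f"http://example.org/a{i}" for i in range(10)]
engine_b = [f"http://example.org/a{i}" for i in range(4)] + \
           [f"http://example.org/b{i}" for i in range(6)]

print(shared(engine_a, engine_b))  # 4 shared URLs, close to the reported 3.8 average
```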

Page 7: Help People Find What They Don't Know

In Summary

Current search engines face many difficulties:

• Link-based ranking may be inadequate: the bag-of-words paradigm, ambiguous queries, polarized queries, ...

• The coverage of a single search engine is poor; meta-search engines cover more, but it is "difficult" to fuse multiple sources

• User needs are subjective and time-varying

• Users are lazy and look at only a few results

Page 8: Help People Find What They Don't Know

Two “complementary” approaches

Web Search Results Clustering

&

Query Optimization

How important? Recently, at least 8 papers in total have been published at SIGIR and WWW each year!

Page 9: Help People Find What They Don't Know

Web Search Results Clustering

Page 10: Help People Find What They Don't Know

An Interesting Approach: Web-Snippet Clustering

The user no longer has to browse through tedious pages of results; instead, (s)he may navigate a hierarchy of folders whose labels give a glimpse of all the themes present in the returned results.

Page 11: Help People Find What They Don't Know

Web-Snippet Hierarchical Clustering

The folder hierarchy must be formed "on-the-fly from the snippets", because it must adapt to the themes of the results without any costly remote access to the original web pages or documents, "and its folders may overlap", because a snippet may deal with multiple themes. Canonical clustering, by contrast, is persistent and generated only once.

The folder labels must be formed "on-the-fly from the snippets", because labels must capture the potentially unbounded themes of the results without any costly remote access to the original web pages or documents, "and be intelligible sentences", because they must facilitate the user's post-navigation.

It may look like "document organization into topical contexts", but snippets are poorly composed, no "structural information" is available for them, and "static classification" into predefined categories would not be appropriate.
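As a rough, flat approximation of the idea only (SnakeT actually builds an overlapping hierarchy with sentence labels, which is not modeled here), snippets can be grouped on-the-fly by the terms they share; the scikit-learn calls and the toy snippets below are merely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy snippet set for an ambiguous query such as "jaguar".
snippets = [
    "jaguar speed and habitat in the rainforest",
    "jaguar cars: new models, dealers and prices",
    "buy a used jaguar car online",
    "the jaguar is a big cat native to the americas",
]

# Represent each snippet as a TF-IDF vector and group into two flat clusters.
X = TfidfVectorizer(stop_words="english").fit_transform(snippets)
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for cid, snippet in zip(cluster_ids, snippets):
    print(cid, snippet)  # each snippet tagged with its cluster id
```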

Page 12: Help People Find What They Don't Know

The Literature

We may identify four main approaches (i.e., a taxonomy):

• Single words and flat clustering – Scatter/Gather, WebCat, Retriever

• Sentences and flat clustering – Grouper, Carrot2, Lingo, Microsoft China

• Single words and hierarchical clustering – FIHC, Credo

• Sentences and hierarchical clustering – Lexical Affinities clustering, Hierarchical Grouper, SHOC, CIIRarchies, Highlight, IBM India

Conversely, there are many commercial proposals: Northern Light (stopped in 2002), Copernic, Mooter, Kartoo, Groxis, Clusty, Dogpile, iBoogie, ...

Vivisimo is surely the best!

Page 13: Help People Find What They Don't Know

To be presented at WWW 2005

Page 14: Help People Find What They Don't Know

SnakeT's Main Features

• Two knowledge bases for ranking/choosing the labels
  – DMOZ is used as a feature-selection and sentence-ranking index
  – Anchor texts are used for snippet enrichment

• Labels are gapped sentences of variable length (see the sketch after this list)
  – An extension of Grouper, to match sentences that are "almost the same"
  – An extension of Lexical Affinities clustering to k-long LAs
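Purely as an illustration of the gapped-sentence idea (the matching rule, the `max_gap` parameter, and the function name are assumptions, not SnakeT's actual definition): a label can be viewed as a sequence of terms that must appear in a snippet in the same order, with bounded gaps in between.

```python
def matches_gapped(label_terms, snippet, max_gap=2):
    """True if the label terms occur in the snippet in this order,
    allowing at most `max_gap` intervening words between consecutive terms."""
    words = snippet.lower().split()
    pos = None
    for term in label_terms:
        start = 0 if pos is None else pos + 1
        stop = len(words) if pos is None else min(len(words), pos + 2 + max_gap)
        for i in range(start, stop):
            if words[i] == term:
                pos = i
                break
        else:           # term not found within the allowed gap
            return False
    return True

snippet = "hierarchical clustering of web search engine results"
print(matches_gapped(["clustering", "search", "results"], snippet))  # True
print(matches_gapped(["results", "clustering"], snippet))            # False (wrong order)
```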

Page 15: Help People Find What They Don't Know

SnakeT's Main Features

• Hierarchy formation deploys the folder labels and their coverage (see the sketch after this list)
  – "Primary" and "secondary" labels for finer/coarser clustering
  – Syntactic and covering pruning rules for simplification and compaction

• 18 engines (Web, news and books) are queried on-the-fly
  – Google, Yahoo, Teoma, A9 Amazon, Google News, etc.
  – They are used as black boxes
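A hedged illustration of coverage-based hierarchy formation (the containment rule and the 0.8 threshold are assumptions, not SnakeT's actual criteria): a candidate label can be nested under another when the snippets it covers are almost all covered by the other label as well.

```python
def is_child(child_cov, parent_cov, containment=0.8):
    """Assumed nesting rule: child goes under parent if most of the snippets
    covered by the child label are also covered by the parent label."""
    if not child_cov:
        return False
    return len(child_cov & parent_cov) / len(child_cov) >= containment

# coverage = set of snippet ids in which each candidate label occurs (invented data)
coverage = {
    "clustering": {1, 2, 3, 4, 5},
    "hierarchical clustering": {2, 3, 4},
    "query optimization": {6, 7},
}
print(is_child(coverage["hierarchical clustering"], coverage["clustering"]))  # True
print(is_child(coverage["query optimization"], coverage["clustering"]))       # False
```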

Page 16: Help People Find What They Don't Know

Generation of the Candidate Labels

• Extract all word pairs occurring in the snippets within some proximity window

• Rank them by exploiting the knowledge bases plus their frequency within the snippets

• Discard the pairs whose rank falls below a threshold

• Repeatedly merge the remaining pairs, taking into account their original positions, their order, and the sentence boundaries within the snippets
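A rough sketch of the first three steps above (the scoring formula, the `kb_boost` stand-in for the DMOZ-based ranker, and the threshold are assumptions; the final merge into longer gapped sentences is omitted):

```python
from collections import Counter

def candidate_pairs(snippets, window=3):
    """Step 1: collect word pairs co-occurring within a small proximity window."""
    pairs = Counter()
    for snippet in snippets:
        words = snippet.lower().split()
        for i, w in enumerate(words):
            for j in range(i + 1, min(len(words), i + 1 + window)):
                pairs[(w, words[j])] += 1
    return pairs

def rank_and_filter(pairs, kb_boost, threshold=2.0):
    """Steps 2-3: score by snippet frequency plus a knowledge-base boost
    (a stand-in for the DMOZ ranker), then drop low-scoring pairs."""
    scored = {p: c + kb_boost.get(p, 0.0) for p, c in pairs.items()}
    return {p: s for p, s in scored.items() if s >= threshold}

snippets = [
    "web search results clustering with snippets",
    "clustering web search results on the fly",
]
pairs = candidate_pairs(snippets)
kept = rank_and_filter(pairs, kb_boost={("search", "results"): 1.0})
print(sorted(kept, key=kept.get, reverse=True))  # highest-scoring pairs first
```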

Page 17: Help People Find What They Don't Know
Page 18: Help People Find What They Don't Know
Page 19: Help People Find What They Don't Know
Page 20: Help People Find What They Don't Know
Page 21: Help People Find What They Don't Know
Page 22: Help People Find What They Don't Know

Query Optimization

Page 23: Help People Find What They Don't Know
Page 24: Help People Find What They Don't Know

Disadvantages

Page 25: Help People Find What They Don't Know

Can we do BETTER???

Page 26: Help People Find What They Don't Know

Q & A

The End