Web of People Improving Search on the Web Wolfgang Nejdl L3S Research Center Hannover, Germany


Page 1: Web of People Improving Search on the Web · Today we search for persons, videos, music 19/10/10 18 Web Search Entity Centric Search Rich Media Search. ... blog analysis Showcase

Web of People

Improving Search on the Web

Wolfgang Nejdl

L3S Research Center

Hannover, Germany


19/10/10 2

Overview

• Web Science / Web of People

• Research Questions and Topics

• Web Science @ L3S

• User Generated Content and Search


Web Science / Web of People

The World Wide Web is a Web of People

World Wide Web = 10^11 pages + 10^9 users

Research Questions:

• What is the Web and how does the Web develop? (Analysis)

• How do I find information on the Web? (Information)

• How do I communicate through the Web? (Communication)

• How do I support new applications over the Web? (Infrastructure)

Answers to these questions are not independent from each other …


Web Science @ L3S

Information – Communication – Infrastructure – Analysis



How can Web Science help?

Dec 08: Outbreak of swine flu in Mexico

Apr 27 09 (New York Times/WHO): Pandemic level 4, 81 deaths in Mexico

Jun 11 09 (WHO): Pandemic level 6, 30,000 cases in 74 countries

Jan 18 10 (CDC/WHO): 55 million cases, 14,000 deaths in 200 countries

M-Eco-Project 2010-2012

Goal: Recognize Pandemics through Web Intelligence


Web of People


Web Science @ L3S

• L3S Research Areas

• Web of People

• Web Search

• Web Information Management

• Middleware for Web Infrastructures

• Future Internet



Diversity of Information – “Global Warming”

Information is not neutral

• Schools of thought

• Opinions

• Culture

• Time

• Attractiveness

• Data Source (Corriere della Sera, L'espresso, Blogs)

AN INCONVENIENT TRUTH

Web of People


Information Diversity on the Web

FET Project LivingKnowledge (FET: Future and Emerging Technologies)

Goals

• Make diversity and opinion visible

• Improve search and navigation

• Provide scalable solutions

Together with Yahoo! Research Barcelona and others, coordinated by Trento

Web of People



LivingKnowledge – Example: Attractiveness of Web Content

Which attributes of Flickr photos are relevant for attractiveness? [World Wide Web Conference 2009]

[Diagram: a generator takes inputs from the Flickr photo stream: content (visual features), metadata (textual features, e.g. "cat, fence, house") and community feedback (the photo's interestingness: #views, #comments, #favorites, ...). Classification & regression yield attractiveness models.]

Web of People



Aggregated Search Results – “Barack Obama Inauguration”


Web Search


How important are events?

Events can be used to organize, aggregate and index content

• World wide: Inauguration of Barack Obama

• For groups: World Wide Web Conference

• Personal: my last birthday

What do we have to do?

• Extract relevant event metadata for indexing and search

• Map between local and global events


Web Search


Data collections in the GLOCAL project

Partner                      Data

Yahoo! Barcelona             50 million Flickr photos with metadata and GPS data
                             100 million Flickr photos with metadata
                             100 million queries

Agence France-Presse (AFP)   15 million text documents with metadata
                             10 million photos with metadata
                             150,000 graphics with metadata
                             50,000 videos with metadata

EXALEAD                      Web search engine

ALINARI                      Picture archive


Web Search


Future Information Management

The Web provides access to digital data of all kinds

• Very heterogeneous media and quality

• Structured and semi-structured data

• “Information Overload”

Personalization and filtering are important

• Provide relevant and high-quality information for your current task

• Digital information services for libraries and enterprises

Web Information Management



GDI-Grid – Collaborative Research in Geo Science

Investigating three scenarios:

1. Flood Simulation: How can we predict flooded areas along the Elbe river?

2. Noise Propagation: How does street noise influence urban living and work?

3. Evacuation: How can a densely populated area be evacuated in case of alarms?

Middleware for Web Infrastructures


Future Internet Architecture

The Internet is

• extremely scalable

• but only “best-effort”, without guarantees

  • whether data reach their destination

  • or with how much delay

Internet as critical infrastructure for new Web applications

• new requirements

  • quality of service

  • data traffic

• new solutions

  • flexible, future-proof architecture

  • application-optimized services

Future Internet


Web Science @ L3S


Overview

• Web Science / Web of People

• User Generated Content and Search

• Can all tags be used for search? Bischoff, Firan, Nejdl, Paiu. CIKM 2008.

• How Do You Feel about “Dancing Queen”? Deriving Mood and Theme Annotations from User Tags. Bischoff, Firan, Nejdl, Paiu. JCDL 2009.


The Web consisted of documents …

Today we search for persons, videos, music


Web Search

Rich Media Search

Entity Centric Search


Why can we exploit tagging?

The Wisdom of Crowds. James Surowiecki. Random House, 2004.

“under the right circumstances, groups are remarkably intelligent”

“large groups of people are smarter than an elite few, no matter how brilliant – better at solving problems, fostering innovation, coming to wise decisions, even predicting the future”



Social Media / Web 2.0 in PHAROS

Extend state-of-the-art search engine (FAST – now Microsoft) with user and context personalization derived from social media interactions (Last.FM, Flickr, etc.)

L3S directs these efforts, focusing on user-centric searching, robustness against spam, social network and blog analysis

Showcase results on France Telecom and Circom Regional content and communities

L3S involvement with 1 million euros


Rich Media Search – Using Tags in Last.FM et al.

Example: Analysis 2008/2009 for songs etc.

– Last.FM, Flickr, Del.icio.us

Which tags are used?

How can they be used for search?

General Research Question:

How can I make search more efficient and more personalized by using user generated data?



Tags in search


Tagging sites: Datasets

Last.fm – social music platform:

317,058 tracks with (user generated) metadata (05/2007)

21,177 different (popular) tags with frequencies, users, similar tags

Del.icio.us – bookmarking/tagging the web:

323,294 unique tags associated with 2,507,688 bookmarks (07/2007)

recursive crawl seeding from the start page & start page monitoring

Flickr – photo sharing:

Crawl expanded based on popular tags (01/2004-12/2005)

Here, subset of 100,000 pictures with 32,378 unique tags

Anchor text (AT) – visible, clickable text in a hyperlink:

8,453,043 web pages parsed from Stanford WebBase Crawl (01/2006),

10,348,807 unique AT: 7,902,047 internal; 2,756,377 external


Tag distribution across systems

Power law for all systems
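A quick way to check whether a tag frequency list looks like a power law is to fit a straight line to log(frequency) against log(rank). The sketch below is only an illustration of that diagnostic, not the procedure used for the slide; the function name `fit_power_law` and the synthetic counts are assumptions, and a log-log least-squares fit is a rough indicator rather than a rigorous estimator.

```python
import math

def fit_power_law(frequencies):
    """Least-squares fit of log(freq) = log(scale) - alpha * log(rank).

    Returns (alpha, scale). Zero counts are ignored.
    """
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)

# Synthetic (hypothetical) tag counts that follow freq = 1000 / rank.
counts = [1000 / rank for rank in range(1, 51)]
alpha, scale = fit_power_law(counts)
print(f"alpha = {alpha:.3f}, scale = {scale:.1f}")
```

On real tag counts the points rarely lie exactly on a line; a roughly linear log-log plot over several orders of magnitude is the usual informal sign of a power law.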


Tag usage: different kinds of tags

Category               Last.fm         Flickr         Del.icio.us      Anchor text

1. Topic               love            flowers        linux            security

2. Time                80s             2005, July     daily            previous years

3. Location            england         toronto        newcastle        nederlands

4. Type                pop, acoustic   portrait       movies, mp3      pdf, books

5. Author/Owner        the beatles     wright         alanmoore        musicmoz.org

6. Opinion/Qualities   great lyrics    scary          annoying         mobile essentials

7. Usage context       workout, lost   vacation       review.later     entertainment

8. Self reference      albums I own    me, 100views   frommyrssfeeds   about us, home


Tag type distributions across systems

Manual classification of 300 sample tags per system and for AT

100 top tags, 100 tags starting at 70% of probability density, 100 tags starting at 90%

[Bar chart: frequency (%) of each tag category (Topic, Time, Location, Type, Author, Opinions, Usage, Self reference) per system; y-axis from 0 to 60%]
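The sampling scheme described on this slide (top tags, then tags starting at 70% and at 90% of the probability density) can be sketched as follows. This is one possible reading, assuming tags are ranked by frequency and that "starting at 70%" means the first rank at which the tags above it already cover 70% of all tag occurrences; the function name `sample_by_mass` and the toy counts are hypothetical.

```python
def sample_by_mass(tag_counts, start_fractions=(0.0, 0.7, 0.9), window=100):
    """Pick `window` consecutive tags from the frequency ranking, starting
    where the cumulative share of occurrences first reaches each fraction."""
    # Rank by descending count; break ties alphabetically for determinism.
    ranked = sorted(tag_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    total = sum(count for _, count in ranked)
    samples = {}
    for fraction in start_fractions:
        cumulative = 0  # occurrences covered by tags ranked above `index`
        start = len(ranked)
        for index, (_, count) in enumerate(ranked):
            if cumulative >= fraction * total:
                start = index
                break
            cumulative += count
        samples[fraction] = [tag for tag, _ in ranked[start:start + window]]
    return samples

# Toy (hypothetical) counts; the slide uses windows of 100 tags per system.
tag_counts = {"rock": 50, "pop": 25, "jazz": 10, "indie": 8, "chill": 4, "live": 3}
samples = sample_by_mass(tag_counts, window=2)
for fraction, tags in samples.items():
    print(fraction, tags)
```

Sampling from the head, the middle and the tail of the distribution in this way keeps rare tags in the manually classified set instead of looking only at the most popular ones.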


Tag type distributions across samples


Reliability: trust users or experts only?

Tags in music reviews on the web

Google query [ “artist” “track” music review -lyrics ] for 8,130 random tracks

73.01% of track tags in web reviews – linear distribution

Tags in AllMusic reviews

3,600 reviews

46.14% overlap

linear distribution

probably restricted vocabulary of experts


Added value or redundancy?

Matching tags and AT to the web page text they annotate

44.85% of the Del.icio.us tags appear in the page text

(sample: 20,911 URLs)

44.7% of external AT and 81.24% of internal AT already present in the page

(sample: 8,614,990 AT)

For 77,756 URLs in both datasets, text matches between tags and AT

4.71% with at least one exact match

42.52% partial match

Numbers of tags and AT for the same page uncorrelated
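The redundancy figures above come down to counting how often an annotation already occurs in the text it annotates. A minimal sketch of such a count is shown below; the slide does not state the exact matching rule, so case-insensitive whole-word matching is an assumption here, and `tokenize`, `tag_overlap` and the sample pages are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def tag_overlap(pages):
    """pages: iterable of (page_text, tags).

    Returns the share of tag assignments whose tag occurs as a word
    in the page text it annotates.
    """
    matched = 0
    total = 0
    for text, tags in pages:
        words = tokenize(text)
        for tag in tags:
            total += 1
            if tag.lower() in words:
                matched += 1
    return matched / total if total else 0.0

# Hypothetical bookmarked pages with their tags.
pages = [
    ("Linux kernel security news", ["linux", "opensource"]),
    ("Photo tips for portrait shots", ["portrait", "toread"]),
]
overlap = tag_overlap(pages)
print(f"{overlap:.2%} of tags appear in the page text")
```

Tags that do not occur in the page (here "opensource" and "toread") are the ones that add information a full-text index would not already have.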


Added value or redundancy?

Matching tags in track lyrics

For 77,498 songs, search with tags in lyrics

Power law distribution:

3,000 songs have more than 1 tag, 10,000 have 1 tag, 63,000 have no matching tag

Avg.: 1.54% of tags are found in the lyrics

=> theme/topic tags are rare


Query log: Do users search like they tag?

Overlap: queries consisting of tags found in our datasets

Del.icio.us: 71.22% with at least 1 tag; 30.61% all terms

Flickr: 64.54% and 12.66% resp.

Last.fm: 58.43% and 6% resp.
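The two percentages reported per system correspond to two overlap measures: queries containing at least one known tag, and queries whose terms are all known tags. The sketch below shows how both could be computed, assuming whitespace tokenization and case-insensitive exact matching (neither is specified on the slide); `query_tag_overlap` and the sample data are hypothetical.

```python
def query_tag_overlap(queries, tag_vocabulary):
    """Return (share of queries with >= 1 term that is a tag,
               share of queries whose terms are all tags)."""
    vocabulary = {tag.lower() for tag in tag_vocabulary}
    any_match = 0
    all_match = 0
    counted = 0
    for query in queries:
        terms = query.lower().split()
        if not terms:
            continue  # skip empty queries
        counted += 1
        hits = [term in vocabulary for term in terms]
        any_match += any(hits)
        all_match += all(hits)
    if counted == 0:
        return 0.0, 0.0
    return any_match / counted, all_match / counted

# Hypothetical query log and tag vocabulary.
queries = ["linux security", "cheap flights toronto", "weather"]
tags = {"linux", "security", "toronto"}
at_least_one, all_terms = query_tag_overlap(queries, tags)
print(f"at least one tag: {at_least_one:.2%}, all terms: {all_terms:.2%}")
```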

[Bar chart “Query classification”: frequency (%) per category (Topic, Time, Location, Type, Author, Opinions, Usage, Self reference) for the query types Web, Images and Music; y-axis from 0 to 60%]


Which tags are useful for search? [1]

Tag distribution on Last.FM

• Genres (type): >60%

• Topics (usage context): 5%

Query distribution (AOL query log)

• Genres: 12% (“acoustic”, “pop”)

• Topics: 30% (“party music”, “wedding songs”)

[1] “Can All Tags be Used for Search?”, K. Bischoff, C. Firan, W. Nejdl, R. Paiu, CIKM'08

[Bar chart “Distribution of tags on Last.fm”: frequency (%) per category 1 to 8]

[Bar chart “Distribution of queries”: frequency (%) per category 1 to 8 for Web, Images and Music]


Semantic Enrichment

Enrichment with metadata to describe usage context, based on machine learning algorithms

[Diagram: the inputs are tags, lyrics, and tags + lyrics; themes are formed by [clustering] (manual, co-occurrence, WordNet; ≥ 30 songs); a Naïve Bayes classifier outputs a distribution over themes (1st, p=*****; 2nd, p=***; …)]

Input

• 73 topics from allmusic.com

• Tags and Text are good indicators for topics

Output

• suggest the k most probable topics as additional annotation
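The idea of suggesting the k most probable topics with a Naïve Bayes classifier can be illustrated with a small multinomial model over tag and lyric terms. This is a minimal sketch under stated assumptions, not the implementation behind the slide: the class `NaiveBayesThemes`, the Laplace smoothing, and the toy training data are all hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesThemes:
    """Multinomial Naive Bayes over bags of terms, with Laplace smoothing."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.theme_counts = Counter()            # songs per theme
        self.term_counts = defaultdict(Counter)  # term counts per theme
        self.vocabulary = set()

    def fit(self, examples):
        """examples: iterable of (terms, theme) pairs."""
        for terms, theme in examples:
            self.theme_counts[theme] += 1
            self.term_counts[theme].update(terms)
            self.vocabulary.update(terms)
        return self

    def top_k(self, terms, k=3):
        """Return the k themes with the highest posterior log-probability."""
        total_examples = sum(self.theme_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for theme, count in self.theme_counts.items():
            log_prob = math.log(count / total_examples)  # prior
            denominator = (sum(self.term_counts[theme].values())
                           + self.smoothing * vocab_size)
            for term in terms:
                if term not in self.vocabulary:
                    continue  # unseen terms carry no evidence
                numerator = self.term_counts[theme][term] + self.smoothing
                log_prob += math.log(numerator / denominator)
            scores[theme] = log_prob
        return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Toy (hypothetical) training data: terms from tags or lyrics, plus a theme.
training = [
    (["dance", "upbeat", "club"], "Party"),
    (["dance", "fun", "club"], "Party"),
    (["slow", "love", "ballad"], "Slow dance"),
    (["romantic", "slow", "love"], "Slow dance"),
    (["chill", "mellow", "jazz"], "Late night"),
]
model = NaiveBayesThemes().fit(training)
suggestions = model.top_k(["slow", "love"], k=2)
print(suggestions)
```

Returning a ranked list instead of a single label matches the output described above: the top k themes are offered as additional annotations rather than as one definitive classification.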


Evaluation [2]

[Bar chart: H@3 and R-P scores; y-axis from 0.00 to 0.90]

Examples

Theme              H@3

Slow dance         0.97

Romantic evening   0.89

Autumn             0.89

Party              0.72

Summertime         0.62

Late night         0.52

[2] “How Do You Feel about ‘Dancing Queen’? Deriving Mood and Theme Annotations from User Tags”, K. Bischoff, C. Firan, W. Nejdl, R. Paiu, JCDL'09
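The slide does not define the two measures in the chart. Assuming H@3 stands for hit rate at rank 3 (is at least one correct theme among the top three suggestions?) and R-P for R-precision (precision at rank R, where R is the number of correct themes), they could be computed per song as sketched below; the function names and the example ranking are hypothetical.

```python
def hit_at_k(ranked, relevant, k=3):
    """1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0

def r_precision(ranked, relevant):
    """Precision within the top R items, where R = number of relevant items."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for item in ranked[:r] if item in relevant) / r

# Hypothetical ranked theme suggestions and ground truth for one song.
ranked = ["Party", "Summertime", "Slow dance", "Autumn"]
relevant = {"Summertime", "Autumn"}
h3 = hit_at_k(ranked, relevant, k=3)
rp = r_precision(ranked, relevant)
print(h3, rp)
```

Averaging these per-song values over all songs of a theme would give per-theme scores of the kind listed in the table.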



Web Search – Search for Entities

Extract and aggregate information about persons and other entities

(entity based) data instead of documents

better overview over information related to an entity

using structured and semi-structured information on the Web

Research questions

How can we extract, aggregate and reference these data?

How can we index, search, order and present entity-centric information?

How can we integrate entities and sensor data from the real world?



Future: From Search to Answering Engines

Search becomes more effective

Better indexing of non-textual materials using “Wisdom of the Crowds” and “Wisdom of the Machine”

Search becomes more understandable

Entities, attributes and events as search results instead of sets of documents



Investigating the Future of Information and Communication

Now hiring for TERENCE and ARCOMEM!
