Web of People
Improving Search on the Web
Wolfgang Nejdl
L3S Research Center
Hannover, Germany
19/10/10 2
Overview
• Web Science / Web of People
• Research Questions and Topics
• Web Science @ L3S
• User Generated Content and Search
Web Science / Web of People
The World Wide Web is a Web of People
World Wide Web = 10^11 Pages + 10^9 Users
Research Questions:
• What is the Web and how does the Web develop? (Analysis)
• How do I find information on the Web? (Information)
• How do I communicate through the Web? (Communication)
• How do I support new applications over the Web? (Infrastructure)
Answers to these questions are not independent from each other …
Web Science @ L3S
Information – Communication – Infrastructure – Analysis
How can Web Science help?
Dec 08:
Outbreak of swine flu in Mexico
Apr 27 09: (New York Times/WHO)
Pandemic level 4, 81 deaths in Mexico
Jun 11 09: (WHO)
Pandemic level 6, 30,000 cases in 74 countries
Jan 18 10: (CDC/WHO)
55 million cases, 14,000 deaths in 200 countries
M-Eco-Project 2010-2012
Goal: Recognize Pandemics through Web Intelligence
Web of People
Web Science @ L3S
• L3S Research Areas
• Web of People
• Web Search
• Web Information Management
• Middleware for Web Infrastructures
• Future Internet
Diversity of Information – “Global Warming”
Information is not neutral
• Schools of thought
• Opinions
• Culture
• Time
• Attractiveness
• Data Source (Corriere della Sera, L'espresso, Blogs)
AN INCONVENIENT TRUTH
Web of People
Information Diversity on the Web
FET Project LivingKnowledge (FET - Future and Emerging Technologies)
Goals
• Make diversity and opinion visible
• Improve search and navigation
• Provide scalable solutions
Together with Yahoo! Research Barcelona and others, coordinated by Trento
Web of People
LivingKnowledge – Example: Attractiveness of Web Content
Which attributes of Flickr photos are relevant for attractiveness? [World Wide Web Conference 2009]
Inputs (Flickr photo stream)
• Content (visual features)
• Metadata (textual features), e.g. cat, fence, house
• Community feedback (photo's interestingness): #views, #comments, #favorites, ...
Attractiveness Models Generator
• Classification & Regression
Web of People
Aggregated Search Results – “Barack Obama Inauguration”
Web Search
How important are events?
Events can be used to organize, aggregate and index content
• Worldwide: inauguration of Barack Obama
• For groups: World Wide Web Conference
• Personal: my last birthday
What do we have to do?
• Extract relevant event metadata for indexing and search
• Map between local and global events
Web Search
Data collections in the GLOCAL project
Partners and their data
Yahoo! Barcelona
• 50 million Flickr photos with metadata and GPS data
• 100 million Flickr photos with metadata
• 100 million queries
Agence France-Presse (AFP)
• 15 million text documents with metadata
• 10 million photos with metadata
• 150,000 graphics with metadata
• 50,000 videos with metadata
EXALEAD
• Web search engine
ALINARI
• Picture archive
Web Search
Future Information Management
The Web provides access to digital data of all kinds
• Very heterogeneous media and quality
• Structured and semi-structured data
• “Information Overload”
Personalization and filtering are important
• Provide relevant and high quality information for your current task
• Digital information services for libraries and enterprises
Web Information Management
GDI-Grid – Collaborative Research in Geo Science
Investigating three scenarios:
1. Flood Simulation: How can we predict flooded areas along the Elbe river?
2. Noise Propagation: How does street noise influence urban living and work?
3. Evacuation: How can a densely populated area be evacuated in case of alarms?
Middleware for Web Infrastructures
Future Internet Architecture
The Internet is
• extremely scalable
• but only “best-effort”, without guarantees
  – whether data reach their destination
  – or with how much delay
Internet as critical infrastructure for new Web applications
• new requirements
• quality of service
• data traffic
• new solutions
• flexible future proof architecture
• application optimized services
Future Internet
Web Science @ L3S
Overview
• Web Science / Web of People
• User Generated Content and Search
• Can all tags be used for search? Bischoff, Firan, Nejdl, Paiu. CIKM 2008.
• How Do You Feel about 'Dancing Queen'? Deriving Mood and Theme Annotations from User Tags. Bischoff, Firan, Nejdl, Paiu. JCDL 2009.
The Web consisted of documents …
Today we search for persons, videos, music
Web Search
Rich Media Search
Entity Centric Search
Why can we exploit tagging?
The Wisdom of Crowds. James Surowiecki. Random House, 2004.
“under the right circumstances, groups are remarkably intelligent”
“large groups of people are smarter than an elite few, no matter how brilliant – better at solving problems, fostering innovation, coming to wise decisions, even predicting the future”
Social Media / Web 2.0 in PHAROS
Extend state of the art search engine (FAST – now Microsoft) with user and context personalization derived from social media interactions (Last.FM, Flickr, etc.)
L3S directs these efforts, focusing on user centric searching, robustness against spam, social network and blog analysis
Showcase results on France Telecom and Circom Regional content and communities
L3S involvement with 1 million euros
Rich Media Search – Using Tags in Last.FM et al.
Example: Analysis 2008/2009 for songs etc.
– Last.FM, Flickr, Del.icio.us
Which tags are used?
How can they be used for search?
General Research Question:
How can I make search more efficient and more personalized by using user generated data?
Tags in search
Tagging sites: Datasets
Last.fm – social music platform:
317,058 tracks with (user generated) metadata (05/2007)
21,177 different (popular) tags with frequencies, users, similar tags
Del.icio.us – bookmarking/tagging the web:
323,294 unique tags associated to 2,507,688 bookmarks (07/2007)
recursive crawl seeding from the start page & start page monitoring
Flickr – photo sharing:
Crawl expanded based on popular tags (01/2004-12/2005)
Here, subset of 100,000 pictures with 32,378 unique tags
Anchor text (AT) – visible, clickable text in a hyperlink:
8,453,043 web pages parsed from Stanford WebBase Crawl (01/2006),
10,348,807 unique AT: 7,902,047 internal; 2,756,377 external
Tag distribution across systems
Power law for all systems
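A power-law claim of this kind can be checked with a rank-frequency fit. The following is a minimal sketch, assuming tag frequencies are available as a plain list of counts. The function name `fit_power_law` and the synthetic counts are illustrative and not taken from the talk. It fits a straight line to log(frequency) versus log(rank) by ordinary least squares, which is a quick diagnostic rather than a rigorous estimator.

```python
import math

def fit_power_law(frequencies):
    """Estimate alpha and C in freq ~ C * rank**(-alpha) via least squares in log-log space."""
    # Sort descending so that rank 1 is the most frequent tag.
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)

# Synthetic counts (assumption): they follow freq = 1000 / rank, so alpha should be close to 1.
toy_counts = [1000 / rank for rank in range(1, 51)]
alpha, scale = fit_power_law(toy_counts)
print(f"alpha={alpha:.2f}, C={scale:.0f}")
```

On real tag data the head and the tail usually deviate from a straight line, so the fitted exponent depends on which ranks are included.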
Tag usage: different kinds of tags
Category | Last.fm | Flickr | Del.icio.us | Anchor text
1. Topic | love | flowers | linux | security
2. Time | 80s | 2005, July | daily | previous years
3. Location | england | toronto | newcastle | nederlands
4. Type | pop, acoustic | portrait | movies, mp3 | pdf, books
5. Author/Owner | the beatles | wright | alanmoore | musicmoz.org
6. Opinion/Qualities | great lyrics | scary | annoying | mobile essentials
7. Usage context | workout, lost | vacation | review.later | entertainment
8. Self reference | albums I own | me, 100views | frommyrssfeeds | about us, home
Tag type distributions across systems
Manual classification of 300 sample tags per system and for AT
100 top tags, 100 tags starting at 70% of probability density, 100 tags starting at 90%
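The sampling scheme can be sketched as follows. This is one possible reading, assuming "starting at 70% of probability density" means the first rank at which the cumulative share of tag occurrences reaches 70%. The function name `sample_by_mass`, its parameters, and the toy counts are hypothetical.

```python
def sample_by_mass(tag_counts, thresholds=(0.0, 0.7, 0.9), size=100):
    """Return one slice of `size` tags per threshold of cumulative occurrence mass."""
    # Rank tags by descending frequency; ties are broken alphabetically.
    ranked = sorted(tag_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    total = sum(count for _, count in ranked)
    samples = {}
    for threshold in thresholds:
        running = 0
        start = len(ranked)
        for index, (_, count) in enumerate(ranked):
            running += count
            # First rank whose cumulative mass reaches the threshold
            # (a threshold of 0.0 therefore selects the top tags).
            if running >= threshold * total:
                start = index
                break
        samples[threshold] = [tag for tag, _ in ranked[start:start + size]]
    return samples

# Toy counts (assumption), with a sample size of 2 instead of 100.
toy = {"rock": 40, "pop": 25, "jazz": 15, "indie": 8,
       "live": 6, "seen live": 4, "favs": 2}
for threshold, tags in sample_by_mass(toy, size=2).items():
    print(threshold, tags)
```

Sampling from the head, the middle and the tail keeps rare tags in the manual classification, which a sample of top tags alone would miss.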
[Figure: frequency (%) per tag category (Topic, Time, Location, Type, Author, Opinions, Usage, Self reference)]
Tag type distributions across samples
Reliability: trust users or experts only?
Tags in music reviews on the web
Google query [ “artist” “track” music review -lyrics ] for 8,130 random tracks
73.01% of track tags in web reviews – linear distribution
Tags in AllMusic reviews
3,600 reviews
46.14% overlap
linear distribution
probably restricted vocabulary of experts
Added value or redundancy?
Matching tags and AT to the web page text they annotate
44.85% of the Del.icio.us tags appear in the page text
(sample: 20,911 URLs)
44.7% of external AT and 81.24% internal AT already present in the page
(sample 8,614,990 AT)
For 77,756 URLs in both datasets, text matches between tags and AT
4.71% with at least one exact match
42.52% partial match
Numbers of tags and AT for the same page uncorrelated
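The redundancy measurements above can be illustrated with a small sketch. The slide does not say how "exact" and "partial" matches were defined. The code below assumes an exact match means a tag equals a full anchor text after lowercasing, and a partial match means they share at least one token. All function names and the toy data are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def share_of_tags_in_text(tags, page_text):
    """Fraction of tags whose tokens all occur in the page text."""
    if not tags:
        return 0.0
    words = tokenize(page_text)
    hits = 0
    for tag in tags:
        tokens = tokenize(tag)
        if tokens and tokens <= words:
            hits += 1
    return hits / len(tags)

def match_type(tags, anchor_texts):
    """Classify the tag / anchor text overlap for one URL (assumed definitions)."""
    normalized_tags = {tag.strip().lower() for tag in tags}
    normalized_anchors = {anchor.strip().lower() for anchor in anchor_texts}
    if normalized_tags & normalized_anchors:
        return "exact"
    tag_tokens = set().union(*(tokenize(tag) for tag in tags))
    anchor_tokens = set().union(*(tokenize(anchor) for anchor in anchor_texts))
    return "partial" if tag_tokens & anchor_tokens else "none"

# Toy page (assumption): two of the three tags occur in the page text.
tags = ["linux", "security", "howto"]
page_text = "Linux kernel security advisories and patches"
print(share_of_tags_in_text(tags, page_text))
print(match_type(tags, ["Security advisories", "Home"]))
```

A tag that is absent from the page text adds information that a full-text index does not already have, which is the "added value" case.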
Added value or redundancy?
Matching tags in track lyrics
For 77,498 songs, search with tags in lyrics
Power law distribution:
3,000 songs have more than 1 tag, 10,000 with 1 tag,
63,000 no matching tag
Avg.: 1.54% of tags found in the lyrics
=> theme/topic tags are rare
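A sketch of how such a per-song count could be produced, assuming each song is available as a pair of tag list and lyrics string. The matching rule (a tag must occur as a whole word or phrase), the function names, and the toy songs are assumptions and not the procedure from the talk.

```python
import re
from collections import Counter

def tags_found_in_lyrics(tags, lyrics):
    """Return the tags that occur as whole words or phrases in the lyrics."""
    normalized = " ".join(re.findall(r"[a-z0-9']+", lyrics.lower()))
    padded = f" {normalized} "
    return [tag for tag in tags
            if tag.strip() and f" {tag.strip().lower()} " in padded]

def match_histogram(songs):
    """Bucket songs by their number of matching tags."""
    buckets = Counter()
    for tags, lyrics in songs:
        n = len(tags_found_in_lyrics(tags, lyrics))
        buckets["none" if n == 0 else "one" if n == 1 else "more than one"] += 1
    return buckets

# Toy songs (assumption): (tags, lyrics) pairs.
songs = [
    (["love", "pop", "80s"], "All you need is love, love is all you need"),
    (["rock", "guitar"], "Instrumental"),
    (["dance", "night", "party"], "Dance all night, dance till the morning light"),
]
print(match_histogram(songs))
```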
Query log: Do users search like they tag?
Overlap: queries consisting of tags found in our datasets
Del.icio.us: 71.22% with at least 1 tag; 30.61% all terms
Flickr: 64.54% and 12.66% resp.
Last.fm: 58.43% and 6% resp.
[Figure: Query classification. Frequency (%) of query types per category (Topic, Time, Location, Type, Author, Opinions, Usage, Self reference) for Web, Images and Music]
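The two overlap figures per system ("at least 1 tag" and "all terms") can be computed along the following lines. This is a minimal sketch, assuming queries are split on whitespace and compared with a lowercased tag vocabulary. The function name `query_tag_overlap` and the toy data are hypothetical.

```python
def query_tag_overlap(queries, tag_vocabulary):
    """Return (share of queries with >= 1 tag term, share whose terms are all tags)."""
    vocabulary = {tag.lower() for tag in tag_vocabulary}
    at_least_one = all_terms = counted = 0
    for query in queries:
        terms = query.lower().split()
        if not terms:
            continue  # skip empty queries
        counted += 1
        hits = sum(term in vocabulary for term in terms)
        at_least_one += hits >= 1
        all_terms += hits == len(terms)
    if counted == 0:
        return 0.0, 0.0
    return at_least_one / counted, all_terms / counted

# Toy query log and tag vocabulary (assumption).
queries = ["linux security", "cheap flights berlin", "python tutorial", "weather"]
vocabulary = {"linux", "security", "python", "travel"}
print(query_tag_overlap(queries, vocabulary))
```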
Which tags are useful for search? [1]
Tag distribution on Last.FM
• Genres (type): >60%
• Topics (usage context): 5%
Query distribution (AOL query log)
• Genres: 12% (“acoustic”, “pop”)
• Topics: 30% (“party music”, “wedding songs”)
[1] “Can All Tags be Used for Search?”, K. Bischoff, C. Firan, W. Nejdl, R. Paiu, CIKM'08
[Figure: Distribution of tags on Last.fm. Frequency (%) per category 1-8]
[Figure: Distribution of queries. Frequency (%) per category 1-8 for Web, Images and Music]
Semantic Enrichment
Enrichment with metadata to describe usage context, based on machine learning algorithms
Input
• 73 topics from allmusic.com
• Tags and text are good indicators for topics
Output
• Suggest the k most probable topics as additional annotation
[Diagram: feature sets (Tags, Lyrics, Tags+Lyrics) → Naïve Bayes classifier → distribution over themes (1st, p=*****; 2nd, p=***; …)]
[Diagram, themes: clustering (manual, co-occurrence, WordNet); ≥30 songs]
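The idea of suggesting the k most probable topics can be sketched with a small multinomial Naive Bayes classifier. This is a simplified illustration, not the classifier used in the cited work. The class name `ThemeClassifier`, the add-one smoothing, and the toy training tags are assumptions. Only the theme names are borrowed from the talk's examples.

```python
import math
from collections import Counter, defaultdict

class ThemeClassifier:
    """Multinomial Naive Bayes over tag/word features with add-one smoothing."""

    def __init__(self):
        self.theme_counts = Counter()               # number of songs per theme
        self.feature_counts = defaultdict(Counter)  # theme -> feature -> count
        self.vocabulary = set()

    def fit(self, examples):
        for features, theme in examples:
            self.theme_counts[theme] += 1
            self.feature_counts[theme].update(features)
            self.vocabulary.update(features)
        return self

    def top_k(self, features, k=3):
        """Return the k themes with the highest posterior log-probability."""
        total_songs = sum(self.theme_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for theme, n_songs in self.theme_counts.items():
            counts = self.feature_counts[theme]
            denominator = sum(counts.values()) + vocab_size
            score = math.log(n_songs / total_songs)  # log prior
            for feature in features:
                if feature in self.vocabulary:  # ignore features never seen in training
                    score += math.log((counts[feature] + 1) / denominator)
            scores[theme] = score
        return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy training data (assumption): (tags, theme) pairs.
training = [
    (["slow", "ballad", "love"], "Slow dance"),
    (["love", "romantic", "mellow"], "Romantic evening"),
    (["dance", "upbeat", "fun"], "Party"),
    (["summer", "upbeat", "beach"], "Summertime"),
]
classifier = ThemeClassifier().fit(training)
print(classifier.top_k(["upbeat", "dance", "fun"], k=2))
```

The same class works for tags only, lyrics only, or both, because the features are just a bag of strings.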
Evaluation [2]
[Figure: H@3 and R-P scores, axis from 0.00 to 0.90]
Examples
Theme | H@3
Slow dance | 0.97
Romantic evening | 0.89
Autumn | 0.89
…
Party | 0.72
Summertime | 0.62
Late night | 0.52
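The slide does not define H@3 or R-P. The sketch below assumes that H@3 is the hit rate at rank 3 (the share of songs for which at least one of the top three suggested themes is correct) and that R-P is R-Precision (precision within the top R suggestions, where R is the number of correct themes for the song). Function names and toy predictions are hypothetical.

```python
def hit_at_k(ranked, relevant, k=3):
    """1.0 if any of the top-k suggestions is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0

def r_precision(ranked, relevant):
    """Precision within the top R suggestions, with R = number of relevant items."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(item in relevant for item in ranked[:r]) / r

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

# Toy predictions (assumption): (ranked suggestions, set of correct themes).
predictions = [
    (["Party", "Summertime", "Late night"], {"Party", "Late night"}),
    (["Autumn", "Slow dance", "Party"], {"Romantic evening"}),
]
print(mean(hit_at_k(ranked, relevant) for ranked, relevant in predictions))
print(mean(r_precision(ranked, relevant) for ranked, relevant in predictions))
```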
[2] “How Do You Feel about 'Dancing Queen'? Deriving Mood and Theme Annotations from User Tags”, K. Bischoff, C. Firan, W. Nejdl, R. Paiu, JCDL'09
Web Search – Search for Entities
Extract and aggregate information about persons and other entities
(entity based) data instead of documents
better overview over information related to an entity
using structured and semi-structured information on the Web
Research questions
How can we extract, aggregate and reference these data?
How can we index, search, order and present entity centric information?
How can we integrate entities and sensor data from the real world?
Future: From Search to Answering Engines
Search becomes more effective
Better indexing of non textual materials using “Wisdom of the Crowds” and “Wisdom of the Machine”
Search becomes more understandable
Entities, attributes and events as search results instead of sets of documents
Investigating the Future of Information and Communication
Now hiring for TERENCE and ARCOMEM !