Automatic Web Tagging and Person Tagging Using Language Models

Qiaozhu Mei (University of Illinois at Urbana-Champaign), Yi Zhang (University of California at Santa Cruz). Presented by Jessica Gronski (University of California at Santa Cruz).

Post on 31-Oct-2014

Page 1: Automatic Web Tagging and Person Tagging Using Language Models

Automatic Web Tagging and Person Tagging Using Language Models

Qiaozhu Mei†, Yi Zhang‡

Presented by Jessica Gronski‡

† University of Illinois at Urbana-Champaign
‡ University of California at Santa Cruz

Page 2

Tagging a Web Document

• The dual problem of search/retrieval [Mei et al. 2007]
– Retrieval: short description (query) → relevant documents
– Tagging: document → short description (tag)
  • To summarize the content of documents
  • To access the document in the future

[Diagram: Text Document → (tagging) → Query/Tag; Query/Tag → (retrieval) → Text Document]

Page 3

Social Bookmarking of Web Documents

[Figure: web documents and their social bookmarks (tags)]

Page 4

Existing Work on Social Bookmarking

• Social bookmarking systems
– Del.icio.us, Digg, Citeulike, etc.
• Enhancing social bookmarking systems
– Anti-spam [Koutrika et al. 2007]
– Searching and ranking tags [Hotho et al. 2006]
• Utilizing social bookmarks
– Visualization [Dubinko et al. 2006]
– Summarization [Boydell et al. 2007]
– Using tags to help web search [Heymann et al. 2008; Zhou et al. 2008]

Page 5

Research Questions

• Can we automatically generate tags for web documents?
– Meaningful, compact, relevant
• Can we generate tags for other web objects, such as web users?

Page 6

Applications of Automatic Tagging

• Summarizing documents and other web objects
• Suggesting social bookmarks
• Refining queries for web search
– Finding good queries for a document
• Suggesting good keywords for online advertising

Page 7

Rest of the Talk

• A probabilistic approach to tag generation
– Candidate tag selection
– Web document representation
– Tag ranking
• Experiments
– Web document tagging
– Web user tagging
• Summary

Page 8

Our Method

• Web documents → (representation) → multinomial word distribution
– data 0.1599, statistics 0.0752, tutorial 0.0660, analysis 0.0372, software 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, algorithm 0.0173, …
• User-generated corpus (e.g., Del.icio.us, Wikipedia) → candidate tag pool
– ipod nano, data mining, presidential campaign, index structure, statistics tutorial, computer science, …
• Ranking candidate tags
– data mining 0.26, statistics tutorial 0.19, computer science 0.17, index structure 0.01, …, ipod nano 0.00001, presidential campaign 0.0, …

Page 9

Candidate Tag Selection

• Meaningful, compact, user-oriented
• From social bookmarking data
– E.g., Del.icio.us
– Single tags → tags that other people used
– “Phrases” → statistically significant bigrams
• From other user-generated web content
– E.g., Wikipedia
– Titles of entries in Wikipedia
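The slide does not say which significance test was used to pick bigrams. As a minimal sketch, assuming each bookmark is an ordered list of tags and using the Student's t-score commonly applied to collocations, candidate phrases could be selected as follows. The function name `significant_bigrams` and the threshold `min_t` are hypothetical, not from the talk.

```python
import math
from collections import Counter

def significant_bigrams(tag_sequences, min_t=2.0):
    """Return adjacent tag pairs whose collocation t-score is at least min_t.

    Assumption: the t-score is an illustrative choice; the talk only says
    "statistically significant bigrams".
    """
    unigrams, bigrams = Counter(), Counter()
    for seq in tag_sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))  # adjacent pairs only
    n = sum(unigrams.values())
    scored = []
    for (a, b), observed in bigrams.items():
        # Count expected if a and b occurred independently of each other.
        expected = unigrams[a] * unigrams[b] / n
        t = (observed - expected) / math.sqrt(observed)
        if t >= min_t:
            scored.append((f"{a} {b}", t))
    return sorted(scored, key=lambda item: -item[1])
```

A pair that co-occurs far more often than independence would predict gets a high score; rare accidental pairs fall below the threshold.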

Page 10

Representation of Web Documents

• Multinomial distribution of words (unigram language model)
– Commonly used in retrieval and text mining
• Can be estimated from the content of the document, or from social bookmarks (our approach)
– What other people used to tag that document
• Example: text 0.16, mining 0.08, data 0.07, probabilistic 0.04, independence 0.03, model 0.03, …
• Baseline: use the top words in that distribution to tag a document
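A minimal sketch of this representation, assuming a document is described by the flat list of tags that users assigned to it and that p(w|d) is the maximum-likelihood (relative frequency) estimate. The function names are hypothetical; the talk does not specify the estimator.

```python
from collections import Counter

def language_model_from_tags(tag_assignments):
    """Estimate p(w|d) as the relative frequency of each tag on document d."""
    counts = Counter(tag_assignments)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def top_words(lm, k=3):
    """Baseline tagger: the k most probable words (ties broken alphabetically)."""
    ranked = sorted(lm.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]
```

For example, the assignments `["text", "text", "mining", "data"]` give p(text|d) = 0.5 and 0.25 for each of the other two words, and the baseline with k = 1 returns `["text"]`.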

Page 11

Tag Ranking: A Probabilistic Approach

• Web document d → a language model p(w|d)
• A candidate tag t → a language model p(w|t), estimated from its co-occurring tags
• Score and rank t by the KL-divergence of these two language models

$$f(t, d) = -D(d \,\|\, t) = -\sum_w p(w|d) \log \frac{p(w|d)}{p(w|t)}$$

• p(w|t) is estimated as p(w|t, C) from a social bookmark collection C
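The scoring function above can be sketched directly. One assumption is added here that the slide does not mention: the tag model is linearly interpolated with a background model so that the logarithm is defined for words the tag never co-occurs with. The names `kl_score`, `background_lm`, and the weight `lam` are hypothetical.

```python
import math

def kl_score(doc_lm, tag_lm, background_lm, lam=0.5):
    """Return f(t, d) = -D(d || t), with p(w|t) smoothed by a background model.

    Assumption: interpolation smoothing is an illustrative choice.
    background_lm must give a positive probability to every word in doc_lm.
    """
    score = 0.0
    for w, p_wd in doc_lm.items():
        p_wt = (1 - lam) * tag_lm.get(w, 0.0) + lam * background_lm[w]
        score -= p_wd * math.log(p_wd / p_wt)
    return score
```

Scores are at most zero, and a tag whose language model is closer to the document's model scores higher.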

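Page 12

Rewriting and Efficient Computation

$$
\begin{aligned}
f(t, d) &= -\sum_w p(w|d) \log \frac{p(w|d)}{p(w|t)} \\
&= \sum_w p(w|d) \log \frac{p(w|t, C)}{p(w|C)} - \sum_w p(w|d) \log \frac{p(w|d)}{p(w|C)} - \sum_w p(w|d) \log \frac{p(w|t, C)}{p(w|t)} \\
&= \sum_w p(w|d) \log \frac{p(w, t|C)}{p(w|C)\, p(t|C)} - D(d \,\|\, C) + \mathrm{Bias}(t, C)
\end{aligned}
$$

• The log ratio in the first term is PMI(w, t|C), the pointwise mutual information of w and t in C
• D(d ‖ C): bias of using C (e.g., del.icio.us) to represent document d; it is the same for every candidate tag
• Bias(t, C) = −Σ_w p(w|d) log [p(w|t, C) / p(w|t)]: bias of using C to represent candidate tag t

Dropping the two bias terms gives a rank-equivalent score:

$$f(t, d) \overset{\text{rank}}{=} E_d[\mathrm{PMI}(w, t|C)] = \sum_w p(w|d)\, \mathrm{PMI}(w, t|C)$$

1. PMI(w, t|C) can be pre-computed from the corpus
2. Only store those PMI(w, t|C) > 0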
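The two efficiency notes can be sketched as a precomputed table of positive PMI values followed by a dot product with the document model. This is a sketch under stated assumptions: probabilities are estimated as the fraction of bookmarks containing a word or a tag, a multi-word candidate counts as present when all of its words appear in the bookmark, and pairs missing from the table contribute zero. The function names are hypothetical.

```python
import math
from collections import Counter

def positive_pmi_table(bookmarks, candidates):
    """Precompute PMI(w, t | C) over a collection of tag sets, keeping only values > 0."""
    n = len(bookmarks)
    word_df, tag_df, joint_df = Counter(), Counter(), Counter()
    for tags in bookmarks:
        tags = set(tags)
        # A candidate is present if every word of the phrase is in the bookmark.
        present = [t for t in candidates if set(t.split()) <= tags]
        word_df.update(tags)
        tag_df.update(present)
        for t in present:
            for w in tags:
                joint_df[(w, t)] += 1
    table = {}
    for (w, t), c in joint_df.items():
        pmi = math.log(c * n / (word_df[w] * tag_df[t]))
        if pmi > 0:  # note 2: store only positive entries
            table[(w, t)] = pmi
    return table

def rank_tags(doc_lm, candidates, table):
    """Rank candidates by the expected PMI under p(w|d); missing pairs count as 0."""
    scores = {
        t: sum(p * table.get((w, t), 0.0) for w, p in doc_lm.items())
        for t in candidates
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Because the table depends only on the collection, it is built once and reused for every document or user that needs tags.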

Page 13

Tagging Web Users

• Summarize the interests and bias of a user
• Web user → a pseudo-document
• Estimate a language model from all tags that the user has used
• The rest is similar to web document tagging
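As a minimal sketch of the pseudo-document idea, assuming each user's history is a list of bookmarks and each bookmark is a list of tags, the user's language model pools every tag the user has assigned. The name `user_language_model` and the input layout are hypothetical.

```python
from collections import Counter

def user_language_model(bookmarks_by_user, user):
    """Pool all tags a user has assigned into one pseudo-document and normalize."""
    counts = Counter(
        tag for tags in bookmarks_by_user[user] for tag in tags
    )
    total = sum(counts.values())
    return {tag: c / total for tag, c in counts.items()}
```

The resulting distribution plays the role of p(w|d), so the same candidate ranking used for documents applies unchanged.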

Page 14

Experiments

• Dataset:
– Two-week tagging records from Del.icio.us
  • Time span: 02/13/07 ~ 02/26/07
  • Bookmarks: 579,652
  • Distinct tags: 111,381
  • Distinct users: 20,138
– Candidate tags:
  • Top 15,000 significant 2-grams from del.icio.us
  • Titles of all Wikipedia entries (5,836,166 entries, of which around 48,000 appeared in del.icio.us)

Page 15

Tagging Web Documents

http://kuler.adobe.com/ (158 bookmarks)
– LM p(w|d): color, design, webdesign, tools, adobe, graphics, flash
– Tag = word: color, colour, palette, colorscheme, colours, picker, cor
– Tag = bigram: adobe color, color design, color colour, color colors, colour design, inspiration palette, webdesign color
– Tag = Wikipedia title: color, colour, palette, web color, colours, cor, rgb

http://www.youtube.com/watch?v=6gmP4nk0EOE (157 bookmarks)
– LM p(w|d): web2.0, video, youtube, web, internet, xml, community
– Tag = word: youtube, revver, vodcast, primer, comunidad, participation, ethnograpy
– Tag = bigram: xml youtube, web2.0 youtube, video web2.0, web2.0 xml, online presentation, social video, youtube video
– Tag = Wikipedia title: internet video, youtube, revver, research video, vodcast, primer, p2p TV

Observations:
– LM p(w|d): too general, sometimes not relevant
– Tag = word: relevant, precise, but sometimes not meaningful
– Tag = bigram: meaningful, relevant, but overfits the data, not real phrases
– Tag = Wikipedia title: meaningful, relevant, real, but only partially covers good tags

Page 16

Tagging Web Documents (Cont.)

http://pipes.yahoo.com (386 bookmarks)
– LM p(w|d): yahoo, rss, web2.0, mashup, feeds, programming, pipes
– Tag = word: pipes, feeds, yahoo, mashup, rss, syndication, mashups
– Tag = bigram: feeds mashups, mashup pipes, web2.0 yahoo, rss web2.0, mashup rss, api feeds, pipes programming
– Tag = Wikipedia title: pipes, yahoo, mashups, rss, syndication, mashups, blog feeds

http://www.miniajax.com/ (349 bookmarks)
– LM p(w|d): ajax, javascript, web2.0, webdesign, programming, code, webdev
– Tag = word: ajax, dhtml, javascript, moo.fx, dragdrop, phototype, autosuggest
– Tag = bigram: ajax code, code javascript, javascript ajax, javascript web2.0, css ajax, javascript programming
– Tag = Wikipedia title: ajax, dhtml, javascript, moo.fx, javascript library, javascript framework

Observations:
– LM p(w|d): too general, sometimes not relevant
– Tag = word: relevant, precise, but sometimes not meaningful
– Tag = bigram: meaningful, relevant, but overfits the data, not real phrases
– Tag = Wikipedia title: meaningful, relevant, real

Page 17

Tagging Web Users

User 1
– LM p(w|d): photography, art, portraits, tools, web, design, geek
– Tag = bigram: art photography, photography portraits, digital flickr, photoblog photography, art photo, flickr photography, weblog wordpress
– Tag = Wikipedia title: art photography, photoblog, portraits, photography, landscapes, flickr, art contest

User 2
– LM p(w|d): humor, programming, photography, blog, webdesign, security, funny
– Tag = bigram: geek hack, humor programming, hack hacking, networking programming, geek html, geek hacking, reference security
– Tag = Wikipedia title: network programming, tweak, hacking, security, geek humor, sysadmin, digitalcamera

Observations:
– LM p(w|d): partially covers the interest
– Tag = bigram: overfits the data, not real phrases
– Tag = Wikipedia title: meaningful, relevant, real

Page 18

Tagging Web Users (Cont.)

User 3
– LM p(w|d): games, arg, tools, programming, sudoku, cryptography, software
– Tag = bigram: arg games, games puzzles, games internet, arg code, games sudoku, code generator, community games
– Tag = Wikipedia title: arg, games research, games, puzzles, storytelling, code generator, community games

User 4
– LM p(w|d): web, reference, css, development, rubyonrails, tools, design
– Tag = bigram: rubyonrails web, css development, brower development, development editor, development forum, development firefox, javascript tools
– Tag = Wikipedia title: javascript, css, webdev, xhtml, dhtml, css3, dom

Observation: missed many good tags

Page 19

Discussions

• Using top tags: too general, sometimes not relevant
• Ranking tags by labeling language models:
– Candidate = social bookmarking words
  • Pros: relevant, compact
  • Cons: ambiguous, not so meaningful
– Candidate = social bookmarking bigrams
  • Pros: more meaningful, relevant
  • Cons: overfitting the data, sometimes not real phrases
– Candidate = Wikipedia titles
  • Pros: meaningful, relevant, real phrases
  • Cons: biased, missed potentially good tags (Bias(t, C))

Page 20

Summary

• Automatic tagging of web documents and web users
• A probabilistic approach based on labeling language models
• Effective when the candidate tags are of high quality
• Future work:
– A robust way of generating candidate tags
– Large-scale evaluation

Page 21

Thanks!