![Page 1: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/1.jpg)
Web Behavior Analysis
![Page 2: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/2.jpg)
Your Last Words? (in 22nd century)
• To family
• To your best friend?
![Page 3: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/3.jpg)
Web Behavior Analysis
• Why important?
• Why scary?
![Page 4: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/4.jpg)
Part I: Why Important?
![Page 5: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/5.jpg)
Q. In the past six months have you used a search engine to help inform your decisions for the following tasks?
66% of people are using search more frequently to make decisions
• We rely more and more on search for our real-life decisions
– Opportunities for business
– Concerns for privacy
![Page 6: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/6.jpg)
Length of Sessions by Type
What should be done?
• Focus on new territory
![Page 7: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/7.jpg)
Taxonomy of Web queries
• Navigational (we are good at this)
– to reach a particular site
– E.g., searching for the top page of a company
• Informational
– to acquire pages that provide knowledge for the user’s information need
– Conventional ad hoc retrieval
• Transactional
– to perform a Web-mediated activity
– E.g., online shopping
![Page 8: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/8.jpg)
Example: Good and Bad
Navigational Queries vs. Pseudo-Navigational Queries
![Page 9: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/9.jpg)
Example of “Hard Queries”: Informational/Transactional
• Car GPS around $300
• Four day trip to Bhutan from Delhi to visit important Buddhist places
![Page 10: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/10.jpg)
Game Consoles
Party Site
![Page 11: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/11.jpg)
What do we want?
![Page 12: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/12.jpg)
Current research directions
• How to classify queries?
• Then what?
– Search engines trying to reduce clicks for “hard queries”
– Extracting info from forums
![Page 13: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/13.jpg)
Importance of query classification: “obama”
• Informational: people may search to learn more about Barack Obama
• Navigational: visit his official website
• Transactional: perhaps the user’s goal is to donate money online to support Mr. Obama’s campaign
![Page 14: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/14.jpg)
Yahoo numbers
• ~25 informational: content text?
• ~40 navigational: anchor text?
• ~35 transactional: site template?
![Page 15: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/15.jpg)
Can you tell if a query is “navigational” or not?
![Page 16: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/16.jpg)
Lee et al. [WWW05]: Overview
• Analyzing how the query term is used in anchor texts
• Q = “WWW2008”: anchors reading “WWW2008” all link to the top page of WWW2008
– Destinations are identical → Navigational
• Q = “search”: anchors reading “search” link to a description in Wikipedia, a search engine, and so on
– Destinations are diverse → Informational
![Page 17: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/17.jpg)
Anchor-link distribution (ALD)
• P(d|t): probability that the page linked by anchor text t is d
• t = WWW2008: destinations concentrate on the top page of WWW2008
– ALD is skewed → Navigational
• t = search: destinations spread over Google, Yahoo!, Wikipedia
– ALD is uniform → Informational
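The anchor-link distribution P(d|t) on this slide can be estimated by counting, for each anchor text t, how often it points to each destination d. The sketch below is a minimal illustration under assumptions: the input format (a list of `(anchor_text, destination)` pairs), the function name, and the toy URLs are all hypothetical and not taken from Lee et al.

```python
from collections import Counter, defaultdict


def anchor_link_distribution(anchor_links):
    """Estimate P(d|t) from (anchor_text, destination) pairs by relative frequency."""
    counts = defaultdict(Counter)
    for anchor_text, destination in anchor_links:
        counts[anchor_text][destination] += 1

    ald = {}
    for t, dest_counts in counts.items():
        total = sum(dest_counts.values())
        ald[t] = {d: n / total for d, n in dest_counts.items()}
    return ald


# Hypothetical toy anchor links, for illustration only.
links = [
    ("WWW2008", "www2008.example/top"),
    ("WWW2008", "www2008.example/top"),
    ("WWW2008", "www2008.example/top"),
    ("search", "google.example"),
    ("search", "yahoo.example"),
    ("search", "wikipedia.example/search"),
]

ald = anchor_link_distribution(links)
# "WWW2008" puts all its mass on one destination (skewed),
# while "search" spreads its mass evenly (uniform).
```

A skewed distribution suggests a navigational term and a flat one an informational term, matching the two cases on the slide.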
![Page 18: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/18.jpg)
Lee et al.: Problem
• Targeting only anchor texts that are exactly the same as the query
– If the same anchor text as the query does not exist on the Web, ALD cannot be computed
• Problematic queries
– Long phrase, e.g., “information retrieval system research”
– Multiple keywords, e.g., “trec, nist, test collection”
![Page 19: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/19.jpg)
Multi-query solution
• Query Q = “trec, test collection”
• Terms T = {trec, test, collection}
• Destinations D = {d1, d2, …}
• P(d|t) is computed for t = trec, t = test, t = collection
• Compute ALD on a term-by-term basis and integrate them
![Page 20: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/20.jpg)
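Computation of classification score
• Entropy of D

$$H(D \mid T) = -\sum_{t \in T} P(t) \sum_{d \in D} P(d \mid t) \log P(d \mid t)$$

– Inner sum, $-\sum_{d \in D} P(d \mid t) \log P(d \mid t)$: entropy of a single term t
– Outer sum over t, weighted by P(t): weighted average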
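The classification score H(D|T) is a weighted average of per-term entropies, which the following sketch computes. It rests on assumptions the slide does not settle: the log base (base 2 here), how P(t) is obtained (uniform weights unless supplied), and the function names are all illustrative choices.

```python
import math


def term_entropy(p_d_given_t):
    """Entropy of one term's anchor-link distribution P(d|t), in bits."""
    return -sum(p * math.log2(p) for p in p_d_given_t.values() if p > 0)


def classification_score(ald, term_weights=None):
    """H(D|T): average of per-term entropies, weighted by P(t).

    ald maps each term t to a dict {destination: P(d|t)}.
    term_weights maps each term to P(t); uniform weights are an assumption.
    """
    terms = list(ald)
    if term_weights is None:
        term_weights = {t: 1.0 / len(terms) for t in terms}
    return sum(term_weights[t] * term_entropy(ald[t]) for t in terms)


# Hypothetical distributions for the terms of Q = "trec, test collection".
ald = {
    "trec": {"d1": 1.0},                    # skewed: entropy 0
    "test": {"d1": 0.5, "d2": 0.5},         # two-way uniform: entropy 1 bit
    "collection": {"d2": 0.5, "d3": 0.5},   # two-way uniform: entropy 1 bit
}
score = classification_score(ald)
```

A low score means the terms' destinations are concentrated (navigational); a high score means they are diverse (informational). Any decision threshold would have to be tuned on labeled queries.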
![Page 21: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/21.jpg)
Now what?
• For “WWII”
– Google: http://www.google.com/search?q=WWII&hl=en&tbo=1&output=search&tbs=ww:1
– Microsoft: http://www.bing.com/reference/semhtml/World_War_II?fwd=1&qpvt=wwii&src=abop&q=wwii
– Wolfram: http://www.wolframalpha.com/input/?i=wwII
• Can you tell informational vs. transactional?
![Page 22: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/22.jpg)
Challenges/Opportunities
• Slightly subtle/interleaved
• But huge advertisement revenue (yet to be explored)!!!!
• Classic query log + clicks on surface web not enough..
• Any ideas?
![Page 23: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/23.jpg)
More signals?
• Eye movement?
• Brain signal?
![Page 24: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/24.jpg)
More corpus? (social corpus for polls? expert advice?)
![Page 25: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/25.jpg)
More signal
![Page 26: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/26.jpg)
CS: Client Simple
• First representation:
– Trajectory length
– Horizontal range
– Vertical range
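The three Client Simple features can be computed directly from a recorded trajectory. The sketch below assumes the trajectory is a list of (x, y) points in screen coordinates (for example cursor positions; the slide does not name the input device), and the function and key names are illustrative.

```python
import math


def client_simple_features(points):
    """Compute trajectory length, horizontal range, and vertical range.

    points is a list of (x, y) positions in the order they were recorded.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Trajectory length: sum of straight-line distances between consecutive points.
    length = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    return {
        "trajectory_length": length,
        "horizontal_range": max(xs) - min(xs),
        "vertical_range": max(ys) - min(ys),
    }


# Hypothetical trajectory: right and down, then straight down.
features = client_simple_features([(0, 0), (3, 4), (3, 10)])
```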
![Page 27: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/27.jpg)
CF: Client Full
• Second representation:
– 5 segments: initial, early, middle, late, and end
– Each segment: speed, acceleration, rotation, slope, etc.
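A rough sketch of the Client Full idea: cut the trajectory into five segments and compute a feature per segment. The slide does not say how segment boundaries are chosen, so splitting into contiguous chunks of near-equal sample count is an assumption here, as is the `(t, x, y)` sample format. Only mean speed is shown; acceleration, rotation, and slope would follow the same per-segment pattern.

```python
import math

SEGMENT_NAMES = ["initial", "early", "middle", "late", "end"]


def split_into_segments(samples, n_segments=5):
    """Split samples into contiguous chunks of near-equal size (assumed rule)."""
    size, extra = divmod(len(samples), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        stop = start + size + (1 if i < extra else 0)
        segments.append(samples[start:stop])
        start = stop
    return segments


def mean_speed(segment):
    """Path length divided by elapsed time within one segment."""
    if len(segment) < 2:
        return 0.0
    dist = sum(math.dist(a[1:], b[1:]) for a, b in zip(segment, segment[1:]))
    elapsed = segment[-1][0] - segment[0][0]
    return dist / elapsed if elapsed > 0 else 0.0


def client_full_speeds(samples):
    """Map each of the five segment names to its mean speed."""
    segments = split_into_segments(samples, len(SEGMENT_NAMES))
    return {name: mean_speed(seg) for name, seg in zip(SEGMENT_NAMES, segments)}


# Hypothetical samples (t, x, y): constant motion of 2 units per time step along x.
samples = [(t, 2.0 * t, 0.0) for t in range(10)]
speeds = client_full_speeds(samples)
```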
![Page 28: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/28.jpg)
Navigational query: “facebook”
![Page 29: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/29.jpg)
Informational query: “spanish wine”
![Page 30: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/30.jpg)
Transactional query: “integrator”
![Page 31: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/31.jpg)
More corpus
• cQA successful, as “additional corpus”, not as “additional means”
• Challenges?
![Page 32: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/32.jpg)
cQA (Yahoo Answers)
![Page 33: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/33.jpg)
![Page 34: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/34.jpg)
![Page 35: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/35.jpg)
![Page 36: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/36.jpg)
How Yahoo Answers works
![Page 37: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/37.jpg)
![Page 38: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/38.jpg)
![Page 39: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/39.jpg)
![Page 40: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/40.jpg)
![Page 41: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/41.jpg)
![Page 42: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/42.jpg)
Good questions draw good answers
![Page 43: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/43.jpg)
Good Q/A? -- Text
Check also: http://www.addedbytes.com/code/readability-score/
![Page 44: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/44.jpg)
Good Q/A? -- Clicks
![Page 45: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/45.jpg)
Good Q/As? -- Community
![Page 46: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/46.jpg)
Why scary?
![Page 47: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/47.jpg)
Useful beyond imagination
• Spell checker: SIGMOD Did you mean “sigmoid”?
• Entity relation: SIGMOD ~ SIGIR
• Translation: SIGMOD, 씨그모드 (Korean transliteration of “SIGMOD”), sigmod.com
• Query suggestion: 영일대 (Yeongildae), 호텔 영일대 (Hotel Yeongildae)
• Rank learning: a top-10 entry is visited all the time, what should we do?
• Cause of migraine?
![Page 48: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/48.jpg)
Companies need YOUR HELP
• AOL released logs
• Guess what happened?
![Page 49: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/49.jpg)
More scientific observations (Yahoo Research)
• X = {query1, query2, query3}
• Y = age, gender, area
• X → Y (how likely?)
• Validate with ground-truth info (Yahoo account)
![Page 50: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/50.jpg)
See if you can do it?
• You observe yourself:
http://aolpsycho.com/user/5826-kallemeyn
![Page 51: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/51.jpg)
Gender
• Female: fanfiction, bridal, makeup, women’s, knitting, hair, ecards, glitter, yoga, and diet
• Male: nfl, poker, espn, ufc, railroad, prostate, football, golf, male, wrestling, compusa, as well as a variety of adult terms
Accuracy: 80+%
![Page 52: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/52.jpg)
Age
• YOUNG: myspace, pregnancy, wikipedia, lyrics, quotes, apartments, torrent, baby, wedding, mall, soundtrack;
• OLD: aarp, telephone, lottery, amazon.com, retirement, funeral, senior, mapquest, medicare, newspapers, repair.
![Page 53: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/53.jpg)
Place
• A user’s zip code (US postal code) or other identifier of location may be detectable from place names used in
• Check out YahooGEO Apis
![Page 54: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/54.jpg)
Name?
• 50+% issued their name
• (but other names too)
Ref: "Vanity Fair: Privacy in Querylog Bundles"
![Page 55: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/55.jpg)
User Solutions?
• TrackMeNot (TMN): an extension to the Firefox web browser that initiates randomized search queries in the background to a number of commercial search engines
• Tor: change IP/cookie (prevents aggregation)
– Losing services, e.g., personalization
![Page 56: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/56.jpg)
Company Solution
• K-anonymity (bundling)
– Reported to be unsafe for vanity search + geo-query, long-tail keywords
– [so far, it is considered to be TOO RISKY]
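To make the bundling idea concrete, here is a naive sketch that assumes “bundling” means merging the query logs of at least k users so that no released query can be tied to fewer than k people. The grouping rule (consecutive chunks of sorted user ids), the function name, and the toy logs are purely illustrative and do not represent any company's actual procedure.

```python
def bundle_query_logs(logs, k):
    """Merge per-user query logs into bundles that each cover at least k users.

    logs maps user_id -> list of queries. Returns one merged, sorted query
    list per bundle, with user ids dropped.
    """
    users = sorted(logs)
    if len(users) < k:
        raise ValueError("need at least k users to form one bundle")
    groups = [users[i:i + k] for i in range(0, len(users), k)]
    # Fold a too-small trailing group into the previous one so every bundle has >= k users.
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    return [sorted(q for u in group for q in logs[u]) for group in groups]


# Hypothetical logs, for illustration only.
logs = {
    "u1": ["nfl scores"],
    "u2": ["knitting patterns"],
    "u3": ["medicare"],
    "u4": ["lyrics"],
    "u5": ["mapquest"],
}
bundles = bundle_query_logs(logs, k=2)
```

As the slide warns, this kind of merging does not help when a single query is identifying on its own, such as a vanity search combined with a geo-query or a rare long-tail keyword.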
![Page 57: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/57.jpg)
Summary
• You are leaving trails in the cyber world, which align more and more with real-life trails
• Companies are interested in predicting as much as possible of your next behaviors
• More signals? More corpus?
• Can you hide enough to protect privacy, while revealing enough to enable such prediction? (privacy dilemma)
• But it is ok even if we can’t know (product state-of-the-art)
![Page 58: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/58.jpg)
Search UI? Visualization?
![Page 59: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/59.jpg)
What are query aspects?
![Page 60: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/60.jpg)
Challenge
• Intentions are hidden
– omission of key information makes intent in queries ambiguous
– e.g., omitting “reviews” when searching for reviews of “Canon EOS 40D SLR”
– e.g., omitting “location/city” when searching for “jobs”
• Queries are often too broad
![Page 61: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/61.jpg)
Goal
• Mine broad latent aspects from search logs
– Formulate the problem based on a real-world model of user interaction with search engine (session = 10 mins)
– Bring interesting aspects to the attention of editors who can then determine saliency and usefulness
![Page 62: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/62.jpg)
User interaction model
• User enters original query “Canon EOS 40D”
• Without query aspects:
– Search engine (SE) returns general results
– User reformulates query by adding qualifier “reviews”
• With query aspects:
– Search engine (SE) returns general results + query aspects
– User reformulates query by selecting “reviews” aspect
• SE returns reviews of the camera
• User’s query is satisfied, e.g., she clicks on a result.
• Learning of query aspects from reformulations
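Learning aspects from reformulations can be sketched as counting the qualifiers users append to a query within one session. This is a minimal illustration under assumptions: “session = 10 mins” is read as two consecutive queries by the same user at most 10 minutes apart, only qualifiers appended at the end are detected, and the log format and names are hypothetical.

```python
from collections import Counter

SESSION_GAP_SECONDS = 10 * 60  # assumed reading of "session = 10 mins"


def mine_aspects(log):
    """Count (original query, appended qualifier) pairs.

    log is a list of (user_id, timestamp_seconds, query) tuples,
    sorted by user and then by time.
    """
    aspects = Counter()
    for (u1, t1, q1), (u2, t2, q2) in zip(log, log[1:]):
        if u1 != u2 or t2 - t1 > SESSION_GAP_SECONDS:
            continue  # different user or outside the session window
        base, reformulated = q1.lower().split(), q2.lower().split()
        # A reformulation here is the same query with extra words appended.
        if len(reformulated) > len(base) and reformulated[:len(base)] == base:
            qualifier = " ".join(reformulated[len(base):])
            aspects[(" ".join(base), qualifier)] += 1
    return aspects


# Hypothetical log entries, for illustration only.
log = [
    ("u1", 0, "Canon EOS 40D"),
    ("u1", 60, "Canon EOS 40D reviews"),
    ("u2", 0, "jobs"),
    ("u2", 30, "jobs seattle"),
    ("u3", 0, "Canon EOS 40D"),
    ("u3", 5000, "Canon EOS 40D price"),  # too late: outside the session
]
aspects = mine_aspects(log)
```

Frequent qualifiers for a query (such as “reviews” for a camera model) are candidate aspects that editors could then review for saliency and usefulness.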
![Page 63: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/63.jpg)
Results: Examples of aspects found
![Page 64: Web Behavior Analysis](https://reader036.vdocuments.net/reader036/viewer/2022062517/56813ac6550346895da2d946/html5/thumbnails/64.jpg)
New directions might be
• Taking target web page clicked into account while constructing aspects
• Or visualization techniques helping to visually/perceptually/cognitively mine such “aspects”
– Visualization/refinement iterations to narrow down

Tomorrow 4:15pm (B2 102)
Title: Using Information Visualization to Understand Data
Abstract: Information Visualization is the art and science of representing abstract information in a visual form that enables users to understand data through their perceptual and cognitive capabilities.
Dr. Bongshin Lee (Microsoft Research)