

Online Advertising
Open lecture at Warsaw University

February 25/26, 2011

Ingmar Weber
Yahoo! Research Barcelona

ingmar@yahoo-inc.com

Please interrupt me at any point!

Disclaimers & Acknowledgments

• This talk presents the opinions of the author. It does not necessarily reflect the views of Yahoo! Inc. or any other entity.

• Algorithms, techniques, features, etc. mentioned here might or might not be in use by Yahoo! or any other company.

• Many of the slides in this lecture are based on tables/graphs from the referenced papers. Please see the actual papers for more details.

Review from last lecture

• Lots of money
– Ads essentially pay for the WWW

• Mostly sponsored search and display ads
– Sponsored search: sold using variants of GSP
– Display ads: sold in GD contracts or on the spot

• Many computational challenges
– Finding relevant ads, predicting CTRs, new/tail content and queries, detecting fraud, …

Plan for today and tomorrow

• So far
– Mostly introductory, “text book material”

• Now
– Mostly recent research papers
– Crash course in machine learning, information retrieval, economics, …

Hopefully more “think-along” (not sing-along) and not “shut-up-and-listen”

But first …

• Third party cookies

www.bluekai.com (many others …)

Efficient Online Ad Serving in a Display Advertising Exchange

Kevin Lang, Joaquin Delgado, Dongming Jiang, et al.

WSDM’11

Not so simple landscape for display advertising

Advertisers: “Buy shoes at nike.com”, “Visit asics.com today”, “Rolex is great.”

Publishers: a running blog, “The legend of Cliff Young”, celebrity gossip

Users: 32m likes running, 50f loves watches, 16m likes sports

Basic problem: given a (user, publisher) pair, find a good ad(vertiser)

Ad networks and Exchanges

• Ad networks
– Bring together supply (publishers) and demand (advertisers)
– Have bilateral agreements via revenue sharing to increase market fluidity

• Exchanges
– Do the actual real-time allocation
– Implement the bilateral agreements

User constraints: no alcohol ads to minors
Supply constraints: conservative network doesn’t want left-wing publishers
Demand constraints: premium blogs don’t want spammy ads

Middle-aged, middle-income New Yorker visits the web site of Cigar Magazine (P1)

D only known at end.

Valid Paths & Objective Function

Algorithm A

Worst case running time?
Typical running time?

Depth-first search enumeration
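The depth-first enumeration can be sketched in a few lines. This is a toy illustration, not the paper’s algorithm: the graph, entity names, and revenue-share weights are invented, and a real exchange additionally checks user/supply/demand constraints along each edge.

```python
# Toy ad-exchange graph: nodes are entities, edges carry the revenue
# share passed along. All names and numbers here are invented.
graph = {
    "advertiser": [("network1", 0.9), ("network2", 0.85)],
    "network1": [("publisher", 0.8)],
    "network2": [("network1", 0.95), ("publisher", 0.75)],
    "publisher": [],
}

def enumerate_paths(node, sink, share=1.0, path=None):
    """Depth-first search: yield (path, cumulative share) for node -> sink."""
    path = (path or []) + [node]
    if node == sink:
        yield path, share
        return
    for nxt, s in graph[node]:
        yield from enumerate_paths(nxt, sink, share * s, path)

paths = list(enumerate_paths("advertiser", "publisher"))
best_path, best_share = max(paths, key=lambda ps: ps[1])
```

Worst case this is exponential in the number of intermediaries, which is exactly why the pruning and reusable precomputation discussed next matter.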

Algorithm B

Worst case running time?
Sum vs. product?
Optimizations?

D pruning

Upper bound
Why?

US pruning

Reusable Precomputation

What if space limitations?
How would you prioritize?

Cannot fully enforce D
Depends on reachable sink …
… which depends on U

Experiments – artificial data

Experiments – real data

Competing for Users’ Attention: On the Interplay between Organic and Sponsored Search Results

Christian Danescu-Niculescu-Mizil, Andrei Broder, et al.

WWW’10

What would you investigate?
What would you suspect?

Things to look at

• General bias for near-identical things
– Ads are preferred (as further “North”)
– Organic results are preferred

• Interplay between ad CTR and result CTR
– Better search results, fewer ad clicks?
– Mutually reinforcing?

• Dependence on type
– Navigational query vs. informational query
– Responsive ad vs. incidental ad

Data

• One month of traffic for a subset of Y! search servers

• Only North ads, served at least 50 times

• For each query qi: most clicked ad Ai* and most clicked organic result Oi*

• 63,789 (qi, Oi*, Ai*) triples

• Bias?

(Non-)Commercial bias?

• Look at A* and O* with identical domain

• Probably similar quality …

• … but (North) ad is higher

• What do you think?

• In 52% of cases ctrO > ctrA

Correlation

[Scatter plots: average ctrA as a function of ctrO, and average ctrO as a function of ctrA. For a given (range of) ctrO, bucket all ads.]

Navigational vs. non-navigational

[Plots: average ctrA as a function of ctrO, separately for navigational and non-navigational queries.]

Navigational: antagonistic effectNon-navigational: (mild) reinforcement

Dependence on similarity

Bag of words for title terms

(“Free Radio”, “Pandora Radio – Listen to Free Internet Radio, Find New Music”) = 2/9
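The 2/9 above is consistent with a Jaccard-style overlap of the distinct title terms; a small sketch (the paper’s exact tokenization and normalization may differ):

```python
import re

def title_overlap(a, b):
    # Jaccard overlap of distinct lower-cased word terms.
    ta = set(re.findall(r"[a-z]+", a.lower()))
    tb = set(re.findall(r"[a-z]+", b.lower()))
    return len(ta & tb) / len(ta | tb)

sim = title_overlap(
    "Free Radio",
    "Pandora Radio – Listen to Free Internet Radio, Find New Music")
# {free, radio} shared; 9 distinct terms in the union -> 2/9
```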

Dependence on similarity

[Plot: average ctrA as a function of title similarity.]

A simple model

Want to model

Also need:

A simple model

Explains the basic (quadratic) shape of overlap vs. ad click-through rate

Improving Ad Relevance in Sponsored Search

Dustin Hillard, Stefan Schroedl, Eren Manavoglu, et al.

WSDM’10

Ad relevance ≠ ad attractiveness

• Relevance
– How related is the ad to the search query
– q=“cocacola”, ad=“Buy Coke Online”

• Attractiveness
– Essentially click-through rate
– q=“cocacola”, ad=“Coca Cola Company Job”
– q=*, ad=“Lose weight fast and easy”

Hope: decoupling leads to better (cold-start) CTR predictions

Basic setup

• Get relevance from editorial judgments
– Perfect, excellent, good, fair, bad
– Treat non-bad as relevant

• Machine learning approach
– Compare query to the ad: title, description, display URL
– Word overlap (uni- and bigram), character overlap (uni- and bigram), cosine similarity, ordered bigram overlap
– Query length

• Data
– 7k unique queries (stratified sample)
– 80k judged query-ad pairs
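A sketch of the text-similarity features listed above (the paper’s exact feature set and normalizations differ; the query and ad title below are invented examples):

```python
import math, re

def tokens(text, n=1):
    # Lower-cased word n-grams of a string.
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def cosine(a, b):
    # Cosine similarity between term-count vectors.
    ca, cb = {}, {}
    for t in a: ca[t] = ca.get(t, 0) + 1
    for t in b: cb[t] = cb.get(t, 0) + 1
    dot = sum(ca[t] * cb.get(t, 0) for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def features(query, ad_title):
    q1, t1 = tokens(query), tokens(ad_title)
    q2, t2 = tokens(query, 2), tokens(ad_title, 2)
    return {
        "unigram_overlap": len(set(q1) & set(t1)),
        "bigram_overlap": len(set(q2) & set(t2)),
        "cosine": cosine(q1, t1),
        "query_length": len(q1),
    }

f = features("cheap running shoes", "Cheap Running Shoes at nike.com")
```

Such features then feed a standard supervised learner trained on the editorial relevance labels.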

Basic results – text only

What other features?

Precision = (“said ‘yes’ and was ‘yes’”) / (“said ‘yes’”)
Recall = (“said ‘yes’ and was ‘yes’”) / (“was ‘yes’”)
Accuracy = (“said the right thing”) / (“said something”)
F1-score = 2/(1/P + 1/R), the harmonic mean (≤ arithmetic mean)
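The same definitions in code, on toy 0/1 relevance labels:

```python
def prf1(predicted, actual):
    # predicted, actual: lists of 0/1 labels ("is this ad relevant?")
    tp = sum(p and a for p, a in zip(predicted, actual))        # said yes, was yes
    fp = sum(p and not a for p, a in zip(predicted, actual))    # said yes, was no
    fn = sum(not p and a for p, a in zip(predicted, actual))    # said no, was yes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = prf1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```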

Incorporating user clicks

• Can use historic CTRs
– Assumes the (ad, query) pair has been seen

• Useless for new ads
– Also evaluate in a blanked-out setting

Translation Model

In search, translation models are common

Here D = ad

Good translation = ad click

Typical model

Maximum likelihood (for historic data)

Any problem with this?

[Figure: translation probabilities between query terms and ad terms.]

Digression on MLE

• Maximum likelihood estimator
– Pick the parameter that’s most likely to generate the observed data

Example: Draw a single number from a hat with numbers {1, …, n}.

You observe 7.
Maximum likelihood estimator?

Underestimates size (cf. # of species)
Underestimates the unknown/impossible

Unbiased estimator?
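The hat example in code: with a single uniform draw x from {1, …, n}, the likelihood 1/n is maximized at n̂ = x, which systematically underestimates n; since E[x] = (n+1)/2, the estimator 2x − 1 is unbiased. A quick simulation (n = 100 is an arbitrary choice):

```python
import random

def mle(x):
    # With one uniform draw from {1, ..., n}, L(n) = 1/n for n >= x,
    # so the maximum likelihood estimate is n_hat = x.
    return x

def unbiased(x):
    # E[x] = (n + 1) / 2, so 2x - 1 is an unbiased estimator of n.
    return 2 * x - 1

random.seed(0)
n = 100
draws = [random.randint(1, n) for _ in range(100_000)]
avg_mle = sum(map(mle, draws)) / len(draws)            # ~ (n + 1) / 2
avg_unbiased = sum(map(unbiased, draws)) / len(draws)  # ~ n
```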

Remove position bias

• Train one model as described before
– But with smoothing

• Train a second model using expected clicks

• Ratio of model for actual and expected clicks

• Add these as additional features for the learner

Filtering low quality ads

Showing fewer ads gave more clicks per search!

• Use to remove irrelevant ads
– Don’t show ads below a relevance threshold

Second part of Part 2

Estimating Advertisability of Tail Queries for Sponsored Search

Sandeep Pandey, Kunal Punera, Marcus Fontoura, et al.

SIGIR’10

Two important questions

• Query advertisability
– When to show ads at all
– How many ads to show

• Ad relevance and clickability
– Which ads to show
– Which ads to show where

Focus on the first problem.
Predict: will there be an ad click?
Difficult for tail queries!

Word-based Model

s(q) = # instances of q with an ad click
n(q) = # instances of q without an ad click

Query q has words {wi}. Model q’s click propensity as:

Good/bad?

Variant w/o bias for long queries:

Maximum likelihood attempt to learn these:

Word-based Model

Then give up … each q has only one word
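A minimal sketch in the spirit of the word-based model above. The toy counts, the add-one smoothing, and the geometric-mean combination (to avoid the length bias mentioned above) are my assumptions, not the paper’s exact estimator:

```python
# Word-level click propensity: pool per-query click / no-click counts
# into per-word counts, then score a query by combining its words.
from collections import defaultdict

clicks = {"cheap flights": (80, 20), "flights paris": (30, 70),
          "cheap hotel": (50, 50)}   # q -> (s(q), n(q)), toy data

s_w, n_w = defaultdict(int), defaultdict(int)
for q, (s, n) in clicks.items():
    for w in q.split():
        s_w[w] += s
        n_w[w] += n

def word_ctr(w, alpha=1.0):
    # Smoothed per-word click rate; alpha is an assumed smoothing constant.
    return (s_w[w] + alpha) / (s_w[w] + n_w[w] + 2 * alpha)

def query_propensity(q):
    # Geometric mean over words avoids a bias against long queries.
    ps = [word_ctr(w) for w in q.split()]
    prod = 1.0
    for p in ps:
        prod *= p
    return prod ** (1.0 / len(ps))

score = query_propensity("cheap paris")   # unseen query, scored via its words
```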

Linear regression model

Problem?

Different model: words contribute linearly

Add regularization to avoid overfitting of an underdetermined problem

Digression

Taken from: http://www.dtreg.com/svm.htm and http://www.teco.edu/~albrecht/neuro/html/node10.html

Topical clustering

• Latent Dirichlet Allocation
– Implicitly uses co-occurrence patterns

• Incorporate the topic distributions as features in the regression model

Evaluation

• Why not use the observed c(q) directly?
– “Ground truth” is not trustworthy for tail queries

• Sort queries by predicted c(q)
– Should have included the optimal ordering!

Learning Website Hierarchies for Keyword Enrichment in Contextual Advertising

Pavan Kumar GM, Krishna Leela, Mehul Parsana, Sachin Garg

WSDM’11

The problem(s)

• Keywords extracted for contextual advertising are not always perfect

• Many pages are not indexed – no keywords available. Still have to serve ads

• Want a system that for a given URL (indexed or not) outputs good keywords

• Key observation: use in-site similarity between pages and content

Preliminaries

• Mapping URLs u to key-value pairs

• Represent webpage p as a vector of keywords
– tf, df, and section where found

Goals:
1. Use u to introduce new keywords and/or update existing weights
2. For unindexed pages, get keywords via other pages from the same site

Latency constraint!

What they do

• Conceptually:
– Train a decision tree with keys K as attribute labels, V as attribute values, and pages P as class labels
– Too many classes (sparseness, efficiency)

• What they do:
– Use clusters of web pages as labels

Digression: Large scale clustering

• How (and why) to detect mirror pages?
– “ls man”

• Want a summarizing “fingerprint”?
– Direct hashing won’t work

What would you do?

Syntactic clustering of the Web, Broder et al., 1997

Shingling

w-shingles of a document (say, w=4)

“If you are lonely when you are alone, you are in bad company.” (Sartre)

{(if you are lonely), (you are lonely when), (are lonely when you), (lonely when you are), …}

Resemblance

rw(A,B) = |S(A,w) ∩ S(B,w)| / |S(A,w) ∪ S(B,w)|

Works well, but how to compute efficiently?!
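Resemblance computed naively from the full shingle sets (fine for two documents, far too slow pairwise at web scale, which motivates the sketches on the next slide). The punctuation handling is a simplification:

```python
def shingles(text, w=4):
    # w-shingles: all length-w runs of words (punctuation stripped naively).
    words = text.lower().replace(",", "").replace(".", "").split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    # Jaccard coefficient of the two shingle sets.
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)

quote = "If you are lonely when you are alone, you are in bad company."
r_same = resemblance(quote, quote)   # identical documents -> 1.0
r_edit = resemblance(
    quote, "If you are lonely when you are alone, you are in good company.")
```

A one-word edit only disturbs the w shingles that cover it, so near-duplicates keep high resemblance.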

Obtaining a “sketch”

• Fix shingle size w, shingle universe U.
• Each individual shingle is a number (by hashing).
• Let W be a set of shingles. Define:

MINs(W) = the set of the s smallest elements in W, if |W| ≥ s;
W otherwise

Theorem:
Let π: U → U be a permutation of U chosen uniformly at random.
Let M(A) = MINs(π(S(A))) and M(B) = MINs(π(S(B))).

The value |MINs(M(A) ∪ M(B)) ∩ M(A) ∩ M(B)| / |MINs(M(A) ∪ M(B))|
is an unbiased estimate of the resemblance of A and B.

Proof

Note: Mins(M(A)) has a fixed size (namely s).
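The theorem as a simulation. A truly random permutation of a large universe is impractical, so the sketch below simulates one by assigning each shingle a random value (a stand-in; in practice min-wise independent hash families are used). The sets and sketch size are toy choices:

```python
import random

random.seed(1)
_perm = {}
def perm(x):
    # Lazily built random ranking of the universe; simulates a random permutation.
    if x not in _perm:
        _perm[x] = random.random()
    return _perm[x]

def sketch(shingle_set, s):
    # M(A) = MIN_s(perm(S(A))): the s smallest permuted values.
    return set(sorted(perm(x) for x in shingle_set)[:s])

def estimate_resemblance(ma, mb, s):
    # |MIN_s(M(A) u M(B)) n M(A) n M(B)| / |MIN_s(M(A) u M(B))|
    pool = sorted(ma | mb)[:s]
    return sum(1 for v in pool if v in ma and v in mb) / len(pool)

a = set(range(0, 120))
b = set(range(40, 160))      # true resemblance: 80 / 160 = 0.5
s = 50
est = estimate_resemblance(sketch(a, s), sketch(b, s), s)
```

The point of the theorem: two fixed-size sketches suffice to estimate resemblance, so each document is summarized once and pairwise comparison is cheap.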

Back to where we were

• They (essentially) use agglomerative single-linkage clustering with a minimum-similarity stopping threshold

• Splitting criteria
– How would you do it?

Do you know agglomerative clustering?

Not the best criterion?

• IG (information gain) prefers attributes with many values
– They claim: high generalization error
– They use: Gain Ratio (GR)

Take impressions into account

• So far: (unweighted) pages
– Class probability ∝ number of pages

More weight for recent visits:
weight things by impressions.

Stopping criterion

• Stop splitting in tree construction when
– All children are part of the same class
– Too few impressions under the node
– Statistically not meaningful (chi-square test)

• Now we have a decision tree for URLs (leaves)
– What about interior nodes?

Obtaining keywords for nodes

• Belief propagation – from leaves up … and back down

Now we have keywords for nodes.
Keywords for matching nodes are used.

Evaluation

• Two state-of-the-art baselines
– Both use the content
– JIT uses only the first 500 bytes, syntactical
– “Semantic” uses topical page hierarchies
– All used with cosine similarity to find ads

• Relevance evaluation
– Human judges evaluated ad relevance

(Some) Results

[Table: nDCG results.]

Digression - nDCG

• Normalized Discounted Cumulative Gain

• CG: total relevance at positions 1 to p

• DCG: the higher the better

• nDCG: take problem difficulty into account
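A compact version of the three quantities, using the classic log2 discount (variants of the gain and discount exist across papers):

```python
import math

def dcg(rels):
    # DCG_p = rel_1 + sum_{i>=2} rel_i / log2(i): later positions count less.
    return rels[0] + sum(r / math.log2(i) for i, r in enumerate(rels[1:], start=2))

def ndcg(rels):
    # Normalize by the DCG of the ideal (sorted) ordering, so a
    # "hard" query with few relevant results can still score 1.0.
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0

score = ndcg([3, 2, 3, 0, 1])   # toy editorial grades, by rank position
```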

An Expressive Mechanism for Auctions on the Web

Paul Dütting, Monika Henzinger, Ingmar Weber

WWW’11

More general utility functions

• Usually
– ui,j(pj) = vi,j – pj
– Sometimes with (hard) budget bi

• We want to allow
– ui,j(pj) = vi,j – ci,j · pj, i.e. (i,j)-dependent slopes
– multiple slopes on different intervals
– non-linear utilities altogether

Why (i,j)-dependent slopes?

• Suppose mechanism uses CPC pricing …

• … but a bidder has CPM valuation

• Mechanism computes

• Guarantees

• Translating back to impressions …
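A worked instance of the mismatch: suppose the mechanism prices per click but a bidder values impressions. With click-through rate ctr, the expected payment per impression is ctr · p, so the bidder’s utility per impression is v − ctr · p, a slope c = ctr that differs per (bidder, item) pair. The numbers below are illustrative only:

```python
def utility_per_impression(v, p_click, ctr):
    # CPM-valuation bidder facing CPC pricing: u(p) = v - ctr * p.
    return v - ctr * p_click

# Same click price, different slopes for two (bidder, item) pairs:
u_high_ctr = utility_per_impression(v=0.02, p_click=0.50, ctr=0.02)
u_low_ctr = utility_per_impression(v=0.02, p_click=0.50, ctr=0.005)
```

With the standard form u = v − p this difference in slopes cannot be expressed, which is the point of the generalization.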

Why (i,j)-dependent slopes?

Why different slopes over intervals?

• Suppose bidding on a car on eBay
– Currently only 1-at-a-time (or dangerous)!
– Utility depends on loan rates

Why non-linear utilities?

• Suppose the drop is supra-linear
– The higher the price, the lower the profit …
– … and the higher the uncertainty
– Maybe log(Ci,j − pj)
– “Risk-averse” bidders

• Will use piece-wise linear utilities for approximation
– Give approximation guarantees

Input definition

Set of n bidders I, set of k items J.

Items contain a dummy item j0.

Each bidder i has an outside option oi.

Each item j has a reserve price rj.

• Compute an outcome

• Outcome is feasible if

• Outcome is envy free if for all (i,j) ∈ I×J

• Bidder optimal if, for every other envy free outcome and for all bidders i, bidder i’s utility is at least as high (strong!)

Problem statement

Bidder optimality vs. truthfulness

Two bidders i ∈ {1,2} and two items j ∈ {1,2}.

rj = 0 for j ∈ {1,2}, and oi = 0 for i ∈ {1,2}

What’s a bidder optimal outcome?
What if bidder 1 underreports u1,1(·)?

Note: “degenerate” input!

Theorem: General position ⇒ truthfulness.
[See paper for definition of “general position”.]

Main Results

Definition:

Overdemand-preserving directions

• Basic idea
– Algorithm iteratively increases the prices
– Price increases required to resolve overdemand

• Tricky bits
– preserve overdemand (will explain)
– show necessity (for bidder optimality)
– accounting for unmatching (for running time)

Overdemand-preserving directions

[Figure: bipartite graph with bidders {1,2,3} and items {1,2,3}. Edge utilities: 10−p1, 12−p1, 8−p1, 9−p2, 7−p2, 11−p2, 5−p3, 3−p3, 2−p3. Current prices: p1=1, p2=0, p3=0.]

Explain: first choice graph

Explain: increase required

The simple case

Explain: path augmentation


Overdemand-preserving directions

[Figure: the same bipartite graph with (i,j)-dependent slopes. Edge utilities: 11−2p1, 12−3p1, 8−4p1, 9−p2, 4−3p2, 9−7p2, 5−p3, 3−p3, 2−p3. Current prices: p1=1, p2=0, p3=0.]

Explain: the ci,j matter!

The not-so-simple case

Finding overdemand-preserving directions

• Key observation (not ours!):
– minimize
– or equivalently

• No longer preserves the full first choice graph
– But an alternating tree

• Still allows path augmentation

The actual mechanism

Effects of Word-of-Mouth Versus Traditional Marketing: Findings from an Internet Social Networking Site

Michael Trusov, Randolph Bucklin, Koen Pauwels

Journal of Marketing, 2009

The growth of a social network

• Driving factors
– Paid event marketing (101 events in 36 weeks)
– Media attention (236 mentions in 36 weeks)
– Word-of-mouth (WOM)

• Can observe
– Organized marketing events
– Mentions in the media
– WOM referrals (through email invites)
– Number of new sign-ups

What could cause what?

• Media coverage => new sign-ups?

• New sign-ups => media coverage?

• WOM referrals => new sign-ups?

• ….

Time series modeling

[Diagram: time series model relating sign-ups, WOM referrals, media appearances, and promo events; controls: intercept, linear trend, holidays, day of week.]

Up to 20 days of lags
Lots of parameters

Time series modeling

Overfitting?

Granger Causality

• Correlation ≠ causality
– Regions with more storks have more babies
– Families with more TVs live longer

• Granger causality attempts more
– Works for time series
– Y and (possible) cause X
– First, explain (= linear regression) Y by lagged Y
– Explain the rest using lagged X
– Significant improvement in fit?
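The last three bullets as a sketch: fit Y on lagged Y (restricted), then on lagged Y and lagged X (unrestricted), and compare residual sums of squares. One lag, pure-Python OLS, and toy data where X genuinely leads Y; a real Granger test uses an F-statistic and chooses the lag order carefully:

```python
import random

def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting for a small system.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * p for a, p in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def rss(rows, y):
    # OLS residual sum of squares via the normal equations.
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, y))

random.seed(42)
x = [random.gauss(0, 1) for _ in range(300)]
y = [0.0]
for t in range(1, 300):
    y.append(0.3 * y[t - 1] + 0.8 * x[t - 1] + random.gauss(0, 0.1))

restricted = [[1.0, y[t - 1]] for t in range(1, 300)]            # lagged Y only
unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in range(1, 300)] # + lagged X
target = y[1:]
rss_r, rss_u = rss(restricted, target), rss(unrestricted, target)
# A large drop in RSS suggests X Granger-causes Y.
```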

What causes what?

Response of Sign-Ups to Shock

• IRF: impulse response function

New to me …

Digression: Bass diffusion

New “sales” at time t: n(t) = (p + q·N(t)/m) · (m − N(t)), where N(t) is cumulative adoption, p the coefficient of innovation, and q the coefficient of imitation.

Ultimate market potential m is given.
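A discrete-time simulation of the Bass model, with new adopters n(t) = (p + q·N/m)·(m − N). The p, q, m values below are typical textbook magnitudes, not the paper’s estimates:

```python
def bass(p, q, m, periods):
    # p: innovation coefficient, q: imitation coefficient,
    # m: ultimate market potential (given). Returns per-period
    # new adopters and the cumulative total.
    N, new = 0.0, []
    for _ in range(periods):
        n_t = (p + q * N / m) * (m - N)
        new.append(n_t)
        N += n_t
    return new, N

new, total = bass(p=0.03, q=0.38, m=100_000, periods=60)
peak = new.index(max(new))   # imitation produces the familiar adoption peak
```

With q >> p the curve is S-shaped: imitation dominates once enough adopters exist, and adoption eventually saturates at m.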

Model comparison

• 197 train (= in-sample)

• 61 test (= out-of-sample)

Monetary Value of WOM

• CPM about $0.40 (per ad)

• Impressions per visitor per month: about 130

• Say 2.5 ads per impression

• $0.13 per month per user, or about $1.50/yr

• IRF: 10 WOM referrals ≈ 5 new sign-ups over 3 weeks

1 WOM referral worth approx. $0.75/yr
