hybrid filtering. computational journalism week 6

Frontiers of Computational Journalism

Columbia Journalism School

Week 6: Hybrid Filtering

October 16, 2015

Filtering Comments

Thousands of comments: which are the “good” ones?

Comment voting

Problem: putting comments with most votes at top doesn’t work. Why?

Reddit Comment Ranking (old)

Upvotes minus downvotes, plus time decay

Reddit Comment Ranking (new)

Hypothetically, suppose all users voted on the comment, and v out of N up-voted. Then we could sort by proportion p = v/N of upvotes.

N = 16, v = 11, p = 11/16 = 0.6875

Reddit Comment Ranking

Actually, only n users out of N vote, giving an observed approximate proportion p’ = v’/n.

n = 3, v’ = 1, p’ = 1/3 = 0.333

Reddit Comment Ranking

Limited sampling can rank comments in the wrong order when we don’t have enough data.

p’ = 0.333, p = 0.6875

p’ = 0.75, p = 0.1875

Random error in sampling

If we observe a proportion p’ of upvotes from n random users, what is the distribution of the true proportion p?

Distribution of p’ when p=0.5
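The spread of p’ around a fixed true p can be seen by simulation. The following is a minimal sketch, not taken from the slides: the function name, the sample size n = 10, the number of trials, and the seed are all illustrative assumptions.

```python
import random
from collections import Counter

def simulate_observed_proportions(p=0.5, n=10, trials=10000, seed=0):
    """Repeatedly sample n voters who each upvote with probability p,
    and count how often each observed proportion p' occurs.
    All default values are illustrative assumptions."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        upvotes = sum(1 for _ in range(n) if rng.random() < p)
        counts[upvotes / n] += 1
    return counts

counts = simulate_observed_proportions()
# Crude text histogram: one '#' per 100 trials that produced this p'.
for p_obs in sorted(counts):
    print(f"{p_obs:.1f} {'#' * (counts[p_obs] // 100)}")
```

Even though the true p is fixed at 0.5, individual samples of 10 voters regularly give observed proportions well away from 0.5, and the spread narrows as n grows.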

Confidence interval

Given the observed p’, the interval that the true p has probability α of lying inside.

Rank comments by lower bound of confidence interval

p’ = observed proportion of upvotes

n = how many people voted

zα = how certain we want to be before we assume that p’ is “close” to true p

Analytic solution for confidence interval, known as “Wilson score”
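Ranking by the lower bound can be sketched in a few lines. This is a minimal illustration using the standard textbook form of the Wilson score interval; the function name, the default z = 1.96 (roughly 95% confidence), and the example vote counts are assumptions, not taken from the slides.

```python
import math

def wilson_lower_bound(upvotes, n, z=1.96):
    """Lower bound of the Wilson score interval for the true upvote proportion.

    upvotes: observed number of upvotes (v')
    n:       number of people who voted
    z:       normal quantile for the chosen confidence (assumed default: 1.96)
    """
    if n == 0:
        return 0.0  # no votes yet: assume nothing about the comment
    p_hat = upvotes / n  # observed proportion p'
    z2 = z * z
    centre = p_hat + z2 / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)

# Hypothetical comments as (upvotes, total votes).
comments = {"a": (3, 4), "b": (75, 100), "c": (1, 3)}
ranked = sorted(comments, key=lambda c: wilson_lower_bound(*comments[c]), reverse=True)
print(ranked)
```

Comments "a" and "b" have the same observed proportion, but "b" ranks higher because 100 votes give a much tighter interval than 4.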

User-item matrix

Stores “rating” of each user for each item. Could also be binary variable that says whether user clicked, liked, starred, shared, purchased...

User-item matrix

•  No content analysis. We know nothing about what is “in” each item.
•  Typically very sparse – a user hasn’t watched even 1% of all movies.
•  Filtering problem is guessing “unknown” entry in matrix. High guessed values are things user would want to see.

Filtering process

How to guess unknown rating?

Basic idea: suggest “similar” items. Similar items are rated in a similar way by many different users. Remember, “rating” could be a click, a like, a purchase.

o  “Users who bought A also bought B...”
o  “Users who clicked A also clicked B...”
o  “Users who shared A also shared B...”
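One way to make “rated in a similar way by many different users” concrete is to compare the columns of the user-item matrix. The sketch below uses cosine similarity on binary ratings; the choice of cosine similarity, the toy data, and all names are illustrative assumptions rather than the method of any particular system.

```python
import math

# Hypothetical sparse user-item matrix: only known interactions are stored.
# 1 means the user clicked/bought/shared the item; missing means unknown.
ratings = {
    "u1": {"A": 1, "B": 1},
    "u2": {"A": 1, "B": 1, "C": 1},
    "u3": {"C": 1},
    "u4": {"A": 1, "B": 1},
}
users = sorted(ratings)
items = sorted({item for row in ratings.values() for item in row})

def item_vector(item):
    """Column of the user-item matrix for one item, with 0 for unknown entries."""
    return [ratings[u].get(item, 0) for u in users]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(item):
    """Other item whose rating column is closest to the given item's column."""
    others = [i for i in items if i != item]
    return max(others, key=lambda i: cosine(item_vector(item), item_vector(i)))

print(most_similar("A"))
```

Here A and B were chosen by exactly the same users, so B is suggested to anyone who picked A, without looking at what either item contains.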
