Hybrid Filtering. Computational Journalism week 6

Frontiers of Computational Journalism Columbia Journalism School Week 6: Hybrid Filtering October 16, 2015


DESCRIPTION

Jonathan Stray, Columbia University, Fall 2015. Syllabus at http://www.compjournalism.com/?p=133

TRANSCRIPT

Page 1: Hybrid Filtering. Computational Journalism week 6

Frontiers of Computational Journalism

Columbia Journalism School

Week 6: Hybrid Filtering

October 16, 2015

Page 2: Hybrid Filtering. Computational Journalism week 6

Filtering Comments

Thousands of comments, what are the “good” ones?

Page 3: Hybrid Filtering. Computational Journalism week 6

Comment voting

Problem: putting comments with most votes at top doesn’t work. Why?

Page 4: Hybrid Filtering. Computational Journalism week 6

Reddit Comment Ranking (old)

Up – down votes plus time decay
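A minimal sketch of one way to combine net votes with time decay. The slide only says "up minus down votes plus time decay"; the log scaling, the 45000-second constant, and the name hot_score are assumptions chosen for illustration, not a statement of Reddit's exact code.

```python
import math

# Assumed constant: 45000 seconds (12.5 hours) of recency is worth
# the same as a tenfold increase in net votes.
DECAY_SECONDS = 45000

def hot_score(ups, downs, posted_at):
    """Score an item from its votes and its posting time (seconds since some epoch).

    Newer items have a larger posted_at, so they start with a higher score
    and older items are gradually outranked unless they have many more votes.
    """
    net = ups - downs
    magnitude = math.log10(max(abs(net), 1))   # diminishing returns on votes
    sign = (net > 0) - (net < 0)               # +1, 0 or -1
    return sign * magnitude + posted_at / DECAY_SECONDS

# Equal votes: the newer item ranks higher.
older = hot_score(ups=20, downs=5, posted_at=0)
newer = hot_score(ups=20, downs=5, posted_at=3600)
```

Under these assumptions an item with 100 net votes ties with one that has 10 net votes but was posted 45000 seconds later.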

Page 5: Hybrid Filtering. Computational Journalism week 6

Reddit Comment Ranking (new)

Hypothetically, suppose all users voted on the comment, and v out of N up-voted. Then we could sort by proportion p = v/N of upvotes.

N = 16, v = 11, p = 11/16 = 0.6875

Page 6: Hybrid Filtering. Computational Journalism week 6

Reddit Comment Ranking

Actually, only n users out of N vote, giving an observed approximate proportion p’ = v’/n

n = 3, v’ = 1, p’ = 1/3 ≈ 0.333

Page 7: Hybrid Filtering. Computational Journalism week 6

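Reddit Comment Ranking

Limited sampling can rank comments in the wrong order when we don’t have enough data. Here the comment with the higher true proportion p has the lower observed proportion p’:

p’ = 0.333, p = 0.6875

p’ = 0.75, p = 0.1875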

Page 8: Hybrid Filtering. Computational Journalism week 6

Random error in sampling

If we observe a proportion p’ of upvotes from n random users, what is the distribution of the true proportion p?

[Figure: distribution of p’ when p = 0.5]
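A small sketch of the sampling distribution of p’. It assumes each of the n voters upvotes independently with probability p, so the number of upvotes is binomial; the function name is hypothetical.

```python
from math import comb

def observed_proportion_pmf(n, p):
    """Map each possible observed proportion p' = k/n to its probability.

    Assumes n independent voters who each upvote with probability p,
    so the upvote count k follows a binomial distribution.
    """
    return {k / n: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

# With only 3 voters and a true p of 0.5, an observed p' of 1/3 is as
# likely as 2/3, and an observed 0 or 1 still happens a quarter of the time.
pmf_small = observed_proportion_pmf(3, 0.5)

# With 100 voters the distribution concentrates near the true p.
pmf_large = observed_proportion_pmf(100, 0.5)
mass_near_truth = sum(prob for prop, prob in pmf_large.items() if 0.4 <= prop <= 0.6)
```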

Page 9: Hybrid Filtering. Computational Journalism week 6

Confidence interval

Given an observed p’, the interval that the true p has probability α of lying inside.

Page 10: Hybrid Filtering. Computational Journalism week 6

Rank comments by lower bound of confidence interval

p’ = observed proportion of upvotes
n = how many people voted
zα = how certain do we want to be before we assume that p’ is “close” to true p

Analytic solution for confidence interval, known as “Wilson score”
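A sketch of ranking by the lower bound of the Wilson score interval, using the standard closed-form expression. The default z = 1.96 (roughly 95% confidence) and the function name are assumptions for illustration.

```python
import math

def wilson_lower_bound(upvotes, n, z=1.96):
    """Lower bound of the Wilson score interval for the true upvote proportion.

    upvotes: observed number of upvotes
    n:       total number of votes
    z:       normal quantile for the chosen confidence level (assumed 1.96)
    """
    if n == 0:
        return 0.0                      # no votes, no evidence
    p_hat = upvotes / n                 # observed proportion p'
    centre = p_hat + z * z / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)

# (upvotes, total votes) for three hypothetical comments
comments = {"a": (3, 4), "b": (60, 100), "c": (1, 1)}
ranked = sorted(comments, key=lambda c: wilson_lower_bound(*comments[c]), reverse=True)
```

With these inputs the comment with 60 of 100 upvotes ranks above the one with 3 of 4, even though its observed proportion is lower, because far more people have voted on it.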

Page 11: Hybrid Filtering. Computational Journalism week 6

User-item matrix

Stores “rating” of each user for each item. Could also be binary variable that says whether user clicked, liked, starred, shared, purchased...

Page 12: Hybrid Filtering. Computational Journalism week 6

User-item matrix

•  No content analysis. We know nothing about what is “in” each item.
•  Typically very sparse – a user hasn’t watched even 1% of all movies.
•  Filtering problem is guessing “unknown” entry in matrix. High guessed values are things user would want to see.
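A minimal sketch of a sparse user-item matrix stored as nested dictionaries, so only known ratings take up space. The users, items, and ratings are made up for illustration.

```python
# Hypothetical ratings: only the entries a user actually rated are stored.
ratings = {
    "ann": {"A": 5, "B": 4},
    "bob": {"A": 4, "B": 5, "C": 1},
    "cat": {"A": 1, "C": 5},
}

# All items that anyone has rated, in a stable order.
items = sorted({item for user_ratings in ratings.values() for item in user_ratings})

def unknown_entries(ratings, items):
    """List the (user, item) cells that the filter would have to guess."""
    return [(user, item)
            for user, user_ratings in ratings.items()
            for item in items
            if item not in user_ratings]

# Fraction of the matrix that is actually filled in.
known = sum(len(user_ratings) for user_ratings in ratings.values())
density = known / (len(ratings) * len(items))
```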

Page 13: Hybrid Filtering. Computational Journalism week 6

Filtering process

Page 14: Hybrid Filtering. Computational Journalism week 6

How to guess unknown rating?

Basic idea: suggest “similar” items. Similar items are rated in a similar way by many different users. Remember, “rating” could be a click, a like, a purchase.

o  “Users who bought A also bought B...”
o  “Users who clicked A also clicked B...”
o  “Users who shared A also shared B...”

Page 15: Hybrid Filtering. Computational Journalism week 6

Similar items

Page 16: Hybrid Filtering. Computational Journalism week 6

Item similarity

Cosine similarity!
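A sketch of cosine similarity between two items, where each item is represented by the ratings it received from each user. Treating a missing rating as zero is an assumption of this sketch, and the data is made up.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "ann": {"A": 5, "B": 4},
    "bob": {"A": 4, "B": 5, "C": 1},
    "cat": {"A": 1, "C": 5},
}

def item_vector(ratings, item):
    """Column of the user-item matrix: user -> rating, for users who rated the item."""
    return {user: r[item] for user, r in ratings.items() if item in r}

def cosine_similarity(x, y):
    """Cosine of the angle between two sparse vectors (missing entries count as 0)."""
    dot = sum(x[u] * y[u] for u in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

sim_ab = cosine_similarity(item_vector(ratings, "A"), item_vector(ratings, "B"))
sim_ac = cosine_similarity(item_vector(ratings, "A"), item_vector(ratings, "C"))
```

Here A and B come out more similar than A and C, because the same users rated A and B highly.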

Page 17: Hybrid Filtering. Computational Journalism week 6

Other distance measures

“adjusted cosine similarity”

Subtracts average rating for each user, to compensate for general enthusiasm (“most movies suck” vs. “most movies are great”)
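A sketch of adjusted cosine similarity. It assumes the common formulation in which only users who rated both items contribute, and each user's mean is taken over all of that user's ratings. The two users are made up: one rates everything low, the other rates everything high.

```python
import math

# Hypothetical ratings: a harsh rater and a generous rater with the same tastes.
ratings = {
    "grump": {"A": 1, "B": 2, "C": 1},
    "fan":   {"A": 4, "B": 5, "C": 4},
}

def adjusted_cosine(ratings, item_a, item_b):
    """Cosine similarity of two items after subtracting each user's mean rating."""
    num = den_a = den_b = 0.0
    for user_ratings in ratings.values():
        if item_a in user_ratings and item_b in user_ratings:
            mean = sum(user_ratings.values()) / len(user_ratings)
            dev_a = user_ratings[item_a] - mean   # how far above/below this user's norm
            dev_b = user_ratings[item_b] - mean
            num += dev_a * dev_b
            den_a += dev_a * dev_a
            den_b += dev_b * dev_b
    if den_a == 0 or den_b == 0:
        return 0.0
    return num / math.sqrt(den_a * den_b)

sim_ac = adjusted_cosine(ratings, "A", "C")
sim_ab = adjusted_cosine(ratings, "A", "B")
```

After removing each user's baseline, both users agree that A and C sit below their personal average and B sits above it, so A and C are maximally similar and A and B are opposites.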

Page 18: Hybrid Filtering. Computational Journalism week 6

Generating a recommendation

Predicted rating is the weighted average of the user’s ratings of other items, weighted by each item’s similarity to the target item.
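A sketch of the similarity-weighted average. The similarities are passed in as a precomputed dictionary; the numbers and the function name are hypothetical.

```python
def predict_rating(user_ratings, similarities):
    """Guess a user's rating for a target item.

    user_ratings: item -> rating, for items this user has already rated
    similarities: item -> similarity between that item and the target item
    Returns None when no rated item has a known, non-zero similarity.
    """
    rated = [item for item in user_ratings if item in similarities]
    weight_total = sum(abs(similarities[item]) for item in rated)
    if weight_total == 0:
        return None
    weighted_sum = sum(similarities[item] * user_ratings[item] for item in rated)
    return weighted_sum / weight_total

# The user loved A (very similar to the target) and disliked B (barely similar),
# so the guess lands close to the rating for A.
guess = predict_rating({"A": 5, "B": 2}, {"A": 0.9, "B": 0.1})
```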

Page 19: Hybrid Filtering. Computational Journalism week 6

Matrix factorization recommender

Page 20: Hybrid Filtering. Computational Journalism week 6

Matrix factorization recommender

Page 21: Hybrid Filtering. Computational Journalism week 6

Matrix factorization plate model

[Figure: plate diagram. r = user rating of item; u = topics for user; v = topics for item; λu = variation in user topics; λv = variation in item topics. Plates: i users, j items.]
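A rough sketch of a matrix factorization recommender: each user and each item gets a short "topic" vector, and a rating is approximated by their dot product. Fitting by stochastic gradient descent, the regularization weight (standing in for λu and λv), and all data and parameter values are assumptions of this sketch, not the method shown on the slides.

```python
import random

def factorize(ratings, n_users, n_items, k=2, epochs=2000, lr=0.01, reg=0.02, seed=0):
    """Fit user vectors U and item vectors V so that U[i] . V[j] approximates ratings[(i, j)]."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for (i, j), r in ratings.items():
            err = r - sum(U[i][f] * V[j][f] for f in range(k))
            for f in range(k):
                u_if, v_jf = U[i][f], V[j][f]
                # Gradient step on squared error plus an L2 penalty on both vectors.
                U[i][f] += lr * (err * v_jf - reg * u_if)
                V[j][f] += lr * (err * u_if - reg * v_jf)
    return U, V

def predict(U, V, i, j):
    """Predicted rating of item j by user i."""
    return sum(a * b for a, b in zip(U[i], V[j]))

# Hypothetical data: users 0-1 like items 0-1, users 2-3 like items 2-3.
# Entries (0, 1) and (3, 2) are deliberately left unknown.
ratings = {
    (0, 0): 5, (0, 2): 1, (0, 3): 1,
    (1, 0): 5, (1, 1): 5, (1, 2): 1, (1, 3): 1,
    (2, 0): 1, (2, 1): 1, (2, 2): 5, (2, 3): 5,
    (3, 0): 1, (3, 1): 1, (3, 3): 5,
}
U, V = factorize(ratings, n_users=4, n_items=4)
guess_for_missing = predict(U, V, 0, 1)
```

The unknown entry (0, 1) is filled in from the pattern shared with user 1, who rated the same items the same way.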

Page 22: Hybrid Filtering. Computational Journalism week 6

Combining collaborative filtering and topic modeling

Page 23: Hybrid Filtering. Computational Journalism week 6

Content modeling - LDA

[Figure: LDA plate diagram. Node labels: topic concentration parameter, topics in doc, topic for word, word in doc, words in topics, word concentration parameter. Plates: K topics, D docs, N words in doc.]

Page 24: Hybrid Filtering. Computational Journalism week 6

Collaborative Topic Modeling

[Figure: plate diagram in two parts. Content part: topic concentration, topics in doc, topic for word, word in doc, K topics. Collaborative part: weight of user selections, variation in per-user topics, topics for user, topics in doc, user rating of doc.]
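A sketch of how a prediction could combine the two parts, assuming the usual collaborative topic regression setup: a document's vector is its content topics plus an offset learned from user ratings, and a new document with no ratings falls back to content alone. The names and numbers are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def predict_rating(user_topics, doc_topics, doc_offset=None):
    """Predicted rating of a document by a user.

    user_topics: how much the user cares about each topic
    doc_topics:  topic proportions from the content model (e.g. LDA)
    doc_offset:  adjustment learned from other users' ratings, or None
                 for a document nobody has rated yet (content only)
    """
    if doc_offset is None:
        return dot(user_topics, doc_topics)
    doc_vector = [t + e for t, e in zip(doc_topics, doc_offset)]
    return dot(user_topics, doc_vector)

user = [1.0, 0.0]                    # only cares about topic 0
content_only = predict_rating(user, [0.5, 0.5])
content_plus_ratings = predict_rating(user, [0.5, 0.5], [0.25, -0.25])
```

In this toy case other users' behavior shifts the document toward topic 0, so the prediction for this user goes up.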

Page 25: Hybrid Filtering. Computational Journalism week 6

[Figure: two panels, “content only” and “content + social”]

Page 26: Hybrid Filtering. Computational Journalism week 6

Different Filtering Systems

Content: Newsblaster analyzes the topics in the documents. No concept of users.

Social: What I see on Twitter is determined by who I follow. Reddit comments are filtered using votes as input. Amazon: "people who bought X also bought Y." No content analysis.

Hybrid: Recommend based on both content and user behavior.

Page 27: Hybrid Filtering. Computational Journalism week 6

Item Content: text analysis, topic modeling, clustering...

My Data: who I follow, what I’ve read/liked

Other Users’ Data: social network structure, other users’ likes

Page 28: Hybrid Filtering. Computational Journalism week 6

How to evaluate/optimize?

Page 29: Hybrid Filtering. Computational Journalism week 6

How to evaluate/optimize?

•  Netflix: try to predict the rating that the user gives a movie after watching it.

•  Amazon: sell more stuff.

•  Google web search: human raters, A/B test every change

Page 30: Hybrid Filtering. Computational Journalism week 6

How to evaluate/optimize?

•  Does the user understand how the filter works?
•  Can they configure it as desired?
•  Can they correctly predict what they will and won't see?

Page 31: Hybrid Filtering. Computational Journalism week 6

How to evaluate/optimize?

•  Can it be gamed? Spam, "user-generated censorship," etc.

Page 32: Hybrid Filtering. Computational Journalism week 6

"During the 2012 election, The ~2000 members of an anti-Ron Paul subreddit discovered that anything they posted, anywhere on reddit, was being rapidly, repeatedly downvoted. They created a diagnostic subreddit and began posting otherwise meaningless text to verify this otherwise odd behavior."

Page 33: Hybrid Filtering. Computational Journalism week 6

Filter design problem

Formally, given

U = user preferences, history, characteristics
S = current story
{P} = results of function on previous stories
{B} = background world knowledge (other users?)

Define

r(S,U,{P},{B}) in [0...1]
relevance of story S to user U

Page 34: Hybrid Filtering. Computational Journalism week 6

Filter design problem, restated

When should a user see a story? Aspects to this question:

normative: personal (what I want), societal (emergent group effects)
UI: how do I tell the computer what I want?
technical: constrained by algorithmic possibility
economic: cheap enough to deploy widely

Page 35: Hybrid Filtering. Computational Journalism week 6

How to evaluate/optimize?

Does it improve the user's life?