social filtering. computational journalism week 5

Post on 05-Dec-2015


DESCRIPTION

Jonathan Stray, Columbia University, Fall 2015. Syllabus at http://www.compjournalism.com/?p=133

TRANSCRIPT

Frontiers of Computational Journalism

Columbia Journalism School

Week 5: Social Filtering

October 9, 2015

[Diagram: users, filtering, stories not covered]

Who the user chooses to follow = social filtering

Twitter follower network

“We have crawled the entire Twitter site and obtained 41.7 million user profiles, 1.47 billion social relations, 4,262 trending topics, and 106 million tweets. In its follower-following topology analysis we have found a non-power-law follower distribution, a short effective diameter, and low reciprocity, which all mark a deviation from known characteristics of human social networks”

- Kwak et al., What is Twitter, a Social Network or a News Media?

More “followings” than followers

Small average distance between nodes
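
The two network measurements named on these slides, reciprocity and average distance between nodes, can be sketched on a toy graph. This is only an illustration, not the method Kwak et al. used: the `follows` dictionary and its account names are invented, and distances are measured along the direction of following.

```python
from collections import deque

# Hypothetical toy follower graph: follows[u] = set of accounts u follows.
follows = {
    "alice": {"news_hub"},
    "bob": {"news_hub", "alice"},
    "carol": {"news_hub", "bob"},
    "news_hub": {"alice"},
}

def reciprocity(follows):
    """Fraction of follow edges that the followed account returns."""
    edges = [(u, v) for u, vs in follows.items() for v in vs]
    mutual = sum(1 for u, v in edges if u in follows.get(v, set()))
    return mutual / len(edges)

def average_distance(follows):
    """Mean shortest-path length over all reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in follows:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in follows.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs

print(reciprocity(follows), average_distance(follows))
```

In this toy graph only the alice/news_hub pair follows each other, so reciprocity is low, while the hub keeps every reachable pair within two hops.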

It’s a news network

Small number of high-degree hubs

Different network structure than e.g. Facebook.

Different uses.

Why?

- Zeynep Tufekci, What Happens to #Ferguson Affects Ferguson: Net Neutrality, Algorithmic Filtering and Ferguson

- John McDermott, Why Facebook is for ice buckets, Twitter is for Ferguson

Data from SocialReach, which works with many publishers

- Sunita, Why #Ferguson broke out on Twitter, not Facebook

Information flow on Facebook

Finding sources on social media

Classify Users

Classic machine learning problem. Classify each user as one of:
•  journalist/blogger
•  organization
•  ordinary individual

First, need to encode as a vector / select features...

Features for user classifier

•  # of followers / following
•  # of posts, favorites
•  percentage of posts that are RTs, @replies, links
•  presence/absence of named entities
•  topic distribution of tweets (IPTC top level topics)
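
Encoding a user as a vector might look roughly like the sketch below. It covers only the count and percentage features from the list; named entities and topic distribution are left out because they need extra tooling. The dictionary layout, the field names, and the string tests for RTs, replies, and links are all assumptions made for illustration.

```python
def user_features(user):
    """Encode one user as a list of numbers (a feature vector)."""
    posts = user["posts"]
    n = len(posts) or 1  # avoid division by zero for users with no posts
    return [
        user["followers"],
        user["following"],
        len(posts),
        user["favorites"],
        sum(p.startswith("RT ") for p in posts) / n,   # share of retweets
        sum(p.startswith("@") for p in posts) / n,     # share of @replies
        sum("http://" in p or "https://" in p for p in posts) / n,  # share with links
    ]

# Hypothetical user record, invented for this example.
example = {
    "followers": 1200, "following": 300, "favorites": 45,
    "posts": [
        "RT @cnn: Breaking story http://example.com/a",
        "@bob thanks for the tip",
        "On my way to the courthouse",
        "Full report here https://example.com/b",
    ],
}
vector = user_features(example)
print(vector)
```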

Digression: IPTC Media Topic Codes

International standard hierarchical taxonomy, part of the NewsML markup system. Defined by Reuters, AP, NYTimes...

K-nearest neighbor classifier

Take the K closest training points (in high-dimensional feature space) and choose the majority label.
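
A minimal sketch of that rule, assuming Euclidean distance and two made-up numeric features per user. The training points are invented. In practice features on very different scales (follower counts vs. percentages) would likely need normalizing first, which is not shown here.

```python
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """Return the majority label among the k training points nearest to query."""
    by_distance = sorted(training, key=lambda item: math.dist(query, item[0]))
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical labeled points: (feature vector, label).
training = [
    ((0.9, 0.8), "organization"),
    ((0.8, 0.9), "organization"),
    ((0.5, 0.2), "journalist/blogger"),
    ((0.6, 0.3), "journalist/blogger"),
    ((0.1, 0.1), "ordinary individual"),
    ((0.2, 0.1), "ordinary individual"),
]
prediction = knn_classify((0.55, 0.25), training, k=3)
print(prediction)
```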

Creating the training data

1,850 random users
1,532 known organizations
1,490 known journalists and bloggers

Hired Mechanical Turk workers to apply labels. Each user labeled by two workers, discarded if disagreement.
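
The agreement rule (two workers per user, discard on disagreement) could be applied with something like the following. The user IDs and labels are invented, and the pair-per-user layout is an assumption.

```python
def keep_agreed_labels(labels_by_user):
    """Keep users whose two workers gave the same label; drop the rest."""
    agreed = {}
    for user, (first, second) in labels_by_user.items():
        if first == second:
            agreed[user] = first
    return agreed

# Hypothetical worker labels: user -> (worker 1 label, worker 2 label).
raw = {
    "user_a": ("organization", "organization"),
    "user_b": ("journalist/blogger", "ordinary individual"),
    "user_c": ("ordinary individual", "ordinary individual"),
}
training_labels = keep_agreed_labels(raw)
print(training_labels)
```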

Classifier Accuracy

“Eyewitness” classifier

Goal is to find individual tweets that are eyewitness reports. Started with LIWC (“linguistic inquiry and word count”) dictionary that classifies English words along 70 different dimensions, including emotion, cognition, time, health...

Word Aspects

Used “perception” category words plus “insight” and “certainty” words

Eyewitness tweet classifier

It’s an eyewitness tweet if it contains any of these special words! (or their stems)

High precision! Low recall.
•  89% of tweets classified as eyewitness actually were.
•  But only 32% of eyewitness tweets detected.
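
A word-list classifier of this kind, together with how precision and recall are computed, can be sketched as follows. The word list is a small hypothetical stand-in (the LIWC dictionary is not reproduced here), stemming is skipped, and the four labeled tweets are invented, so the resulting numbers say nothing about the real system.

```python
import re

# Hypothetical stand-in word list; not the actual LIWC categories.
EYEWITNESS_WORDS = {"see", "saw", "seeing", "hear", "heard", "hearing", "watching"}

def is_eyewitness(tweet):
    """True if the tweet contains any word from the list."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return any(w in EYEWITNESS_WORDS for w in words)

def precision_recall(predicted, actual):
    """Precision and recall for parallel lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    return tp / (tp + fp), tp / (tp + fn)

# Invented tweets with hand-assigned labels (True = eyewitness report).
tweets = [
    ("I just saw smoke rising over the bridge", True),
    ("Police cars everywhere on my street right now", True),
    ("Officials say the bridge will reopen Monday", False),
    ("Heard the explosion from my apartment", True),
]
predicted = [is_eyewitness(text) for text, _ in tweets]
actual = [label for _, label in tweets]
precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

The second tweet is an eyewitness report with no perception word in it, so it is missed. That is the low-recall pattern described on the slide: what the rule flags is usually right, but it flags only part of what is there.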

Other dimensions

•  Tweet contains URL to photo or video (used table of domain names, e.g. flickr.com = photo)
•  Posted from mobile device (from tweet metadata naming posting app)
•  Geocode user’s stated location (this is painful and unreliable)
•  Distribution of friends’ locations. (Friend = mutual following)
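
The first two dimensions are simple table lookups. A rough sketch, assuming a tweet is a plain dictionary with a plain-text `source` field: only the flickr.com = photo entry comes from the slide, and every other table entry and field name here is hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical lookup tables; only flickr.com = photo is from the slide.
MEDIA_DOMAINS = {"flickr.com": "photo", "youtube.com": "video"}
MOBILE_APPS = {"Twitter for iPhone", "Twitter for Android"}

def media_type(url):
    """Return 'photo' or 'video' if the URL's domain is in the table, else None."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return MEDIA_DOMAINS.get(host)

def posted_from_mobile(tweet):
    """True if the tweet's posting app is in the mobile app table."""
    return tweet.get("source") in MOBILE_APPS

print(media_type("https://www.flickr.com/photos/x/1"))
print(posted_from_mobile({"text": "At the scene", "source": "Twitter for iPhone"}))
```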

Test user reactions

“This gives you context… you have the context for whether or not you think they’re reputable or whether or not they’re worth reaching out to.”

“It’s giving me a lot of context which is really useful when you’re trying to verify if someone is reputable or not.”

“I would tend to focus on the eyewitnesses and journalists/bloggers. Eventually I’d look at everyone else but I’d want to start my search with those two groups because they would normally provide me with the most information.”

Popular features: eyewitness filtering, user location, image/video filter

Unpopular features: entity extraction not helpful, no ability to filter by location and eyewitness status, focus on users instead of content

Social Software

Basic assumption: structure of software influences how groups use it.
Or: architecture influences behavior.

Three ways to influence behavior

Norms: culture, habits, etiquette, the user’s sense of what is “right” or “appropriate”
Laws: rules enforced by the administrator
Code: what it is actually possible to do

Design problem...

What do we want the users to accomplish together?
How do we encourage this?
We can write the code, but the culture is a separate issue.
