coalmine spie 2012 presentation - jsw -d3

Coalmine: An Experience in Building a System for Social Media Analytics

Joshua S. White, Jeanna N. Matthews, PhD


Page 1: Coalmine   spie 2012 presentation - jsw -d3

Coalmine: An Experience in Building a System for Social Media Analytics

Joshua S. White
Jeanna N. Matthews, PhD

Page 2

Outline

• Problem
• Method Overview
• Data Collection
• Analysis
• Case Studies
• Conclusion / Future Work

Page 3

Problem

• Social Media Networks
  – A means of communication for good and bad
    • Proven cases of malware / botnet use
    • Spam medium
• Our Goal
  – To provide a generalized tool for analysis of potential threats that use these networks for communications.

Page 4

Method Overview

Page 5

Data Collection

• Initially (Spring 2011)
  – Twitter-approved OAuth application
    • Firehose subscription with white-listing
  – ~20% of all Tweets
  – (No longer available)
    » Twitter no longer allows researchers to share datasets
    » We needed to develop a new collection method
    » It cannot violate the terms of use

Page 6

• Current
  – Distributed data collection infrastructure
  – Geographically dispersed IP addresses to simulate multiple users
  – Registered application with non-authenticated API access
    • ~80–100% of all Tweets (1 billion+ / week)

Page 7

Data Collection

• Storage
  – Collected as a streaming gzip of Python dict format (10:1 compression ratio)
    • Converted to JSON on the fly when needed
  – Initially stored in HDFS (had issues)
    » Recent work uses DDFS
  – Indexed using Lucene
    • New methods are being explored
  – Discodex w/ BSON store
  – Storing 1.5 TB a week
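The storage bullets above describe gzip-compressed Python dict records that are converted to JSON only when needed. A minimal sketch of that idea follows, assuming one dict literal per line; the function names and the one-record-per-line layout are illustrative assumptions, not details taken from Coalmine itself.

```python
import ast
import gzip
import json


def write_records(path, records):
    """Append-free writer: store each record as a Python dict literal, one per line, gzip-compressed."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for record in records:
            # repr() escapes embedded newlines, so each record stays on one line.
            out.write(repr(record) + "\n")


def iter_json(path):
    """Stream the compressed file and yield each record as a JSON string on the fly."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        for line in src:
            line = line.strip()
            if line:
                # literal_eval parses dict literals without executing arbitrary code.
                yield json.dumps(ast.literal_eval(line))
```

Because both functions work line by line, neither needs to hold the full dataset in memory, which matters at the weekly volumes quoted on the slide.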

Page 8

Analysis

• Two-part method
  – Manual inspection
    • Query panel front-end
  – Automated inspection

Page 9

Example Analysis

Field Name | Description | Example Data
name | User's real name | Text: "Robert Scoble"
screen_name | User's Twitter username | Text: "scobleizer"
profile_image_url | Link to user's profile image | Link: "http://a1.twimg.com/profile_images/456562836/scoblebuilding43crop-fanatiguy_normal.jpg"
url | Link to user's non-Twitter site | Link: "http://www.google.com/profiles/scobleizer"
followers_count | Number of followers user has | Number: "185496"
friends_count | Number of people user follows | Number: "31971"
utc_offset | Offset from GMT (in seconds) | Number: "-28800"
geo_enabled | Whether user has enabled location | Boolean: "True"
statuses_count | Number of statuses user has posted | Number: "53522"

Tweet Specific Fields
created_at | Tweet timestamp | Text: "Tue Jun 14 18:30:13 +0000 2011"
id | Tweet id (useful for URL creation) | Number: "80703603437875201"
text | Contains the actual text + any embedded URLs | Whatever text the person chooses to enter (could be in any supported language)
source | Links to Twitter client URL (not important) | HTML code: "<a href="http://www.echofon.com/" rel="nofollow">Echofon</a>"
in_reply_to_status_id | ID of the status that the user replied to | Number: "80671170374025220"
in_reply_to_screen_name | Screen name of user the current status replies to | Text: "danharmon"
retweet_count | Number of times this status is retweeted | Number: "0"
retweeted | Whether or not the status has been retweeted | Boolean: "false"

'geo' flag specific:
georss:point | Lat. & Long. location | Number: "43.21227199 -75.39866939"
url | Points to a JSON or XML file with further geo info | Link: "http://api.twitter.com/1/geo/id/00228ed265b1139e.xml"
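To make the field descriptions above concrete, here is a small sketch of how three of them might be used: parsing created_at, applying utc_offset, and building a link from id. The helper names are hypothetical, and the URL pattern is an assumption based on the common twitter.com status link layout rather than something stated on the slide.

```python
from datetime import datetime, timedelta, timezone

# Layout of the created_at example, e.g. "Tue Jun 14 18:30:13 +0000 2011".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Turn the created_at text field into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)


def user_local_time(created_at, utc_offset_seconds):
    """Shift a tweet timestamp by the user's utc_offset (given in seconds)."""
    user_zone = timezone(timedelta(seconds=utc_offset_seconds))
    return created_at.astimezone(user_zone)


def tweet_url(screen_name, tweet_id):
    """Build a status link from screen_name and id (assumed URL pattern)."""
    return f"https://twitter.com/{screen_name}/status/{tweet_id}"
```

With the example values from the table, a utc_offset of -28800 seconds places the 18:30:13 UTC timestamp at 10:30:13 in the user's configured zone.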

Page 10

Case Study: Botnet C2

• One well-known case:
  – Arbor Networks detected the first known incident in 2009
    • Base64-encoded control signals
  – Soon after:
    • A number of tools were released to do the same:
      – ControlMyPC, KreiosC2, etc.
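Since the 2009 incident used Base64-encoded control signals, one simple automated check is to look for long Base64-looking tokens in tweet text and see whether they decode to readable bytes. The sketch below is a heuristic written for illustration only, not the detection method used in Coalmine; the minimum token length and the printable-ratio threshold are assumed values.

```python
import base64
import binascii
import re

# Runs of at least 16 Base64 alphabet characters, optionally padded (assumed minimum length).
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decode_candidates(text, min_printable=0.9):
    """Return decoded strings for tokens in text that look like Base64-encoded ASCII."""
    results = []
    for token in B64_TOKEN.findall(text):
        if len(token) % 4 != 0:
            continue  # valid padded Base64 always has a length divisible by 4
        try:
            raw = base64.b64decode(token, validate=True)
        except (binascii.Error, ValueError):
            continue
        if not raw:
            continue
        # Keep only payloads that are mostly printable ASCII (assumed threshold).
        printable = sum(32 <= byte < 127 for byte in raw) / len(raw)
        if printable >= min_printable:
            results.append(raw.decode("ascii", errors="replace"))
    return results
```

A check like this produces false positives on long ordinary words and misses encrypted or binary payloads, so it is better suited to surfacing candidates for manual inspection than to making a final decision.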

Page 11

Case Study: Botnet C2

• Sample manual detection:

Page 12

Case Study: Spam

• Twitter's number one problem: it artificially increases traffic and bothers legitimate users
• Easily detected during manual analysis
• Automated detection based on wording and the rate at which messages are posted
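The automated approach above combines two signals: wording and posting rate. A minimal sketch of such a combination is shown below; the term list, the one-minute window, and both thresholds are hypothetical values chosen for illustration, and each tweet is assumed to be a dict with screen_name, text, and an already-parsed created_at datetime.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical wording signal; a real list would be derived from labeled data.
SPAM_TERMS = {"free", "winner", "click here"}


def flag_spammers(tweets, window=timedelta(minutes=1),
                  max_per_window=5, min_term_ratio=0.5):
    """Flag users whose peak posting rate and share of spam-worded tweets both exceed thresholds."""
    by_user = defaultdict(list)
    for tweet in tweets:
        by_user[tweet["screen_name"]].append(tweet)

    flagged = set()
    for user, items in by_user.items():
        items.sort(key=lambda t: t["created_at"])
        times = [t["created_at"] for t in items]

        # Sliding window: largest number of tweets posted within `window`.
        peak, start = 0, 0
        for end in range(len(times)):
            while times[end] - times[start] > window:
                start += 1
            peak = max(peak, end - start + 1)

        # Fraction of the user's tweets containing any listed term.
        spammy = sum(
            any(term in t["text"].lower() for term in SPAM_TERMS) for t in items
        )
        if peak > max_per_window and spammy / len(items) >= min_term_ratio:
            flagged.add(user)
    return flagged
```

Requiring both signals keeps a fast but legitimate poster, or a slow account that happens to use a listed word, from being flagged on one signal alone.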

Page 13

Conclusion / Future Work

• Coalmine - A tool for Social Media Analysis
  – Scales well based on initial tests
  – Useful for both manual and automated detection
• Future (Current) Work
  – Rebuild of the tool to fix scaling limitations
    • More extensible Map/Reduce method
    • Inclusion of native multi-threading capability
    • New storage and distribution method
    • New algorithms for automated opinion leader detection
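As a rough illustration of the Map/Reduce style mentioned in the future work, the sketch below counts how often each user is replied to, using the in_reply_to_screen_name field. Treating reply counts as a proxy for opinion leadership is an assumption made here for the example; the slides do not describe the actual algorithm, and the function names are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_replies(tweet):
    """Map step: emit a count of 1 for the user this tweet replies to, if any."""
    target = tweet.get("in_reply_to_screen_name")
    return Counter({target: 1}) if target else Counter()


def reduce_counts(left, right):
    """Reduce step: merge two partial tallies."""
    return left + right


def reply_counts(tweets):
    """Run the map and reduce steps over an iterable of tweet dicts."""
    return reduce(reduce_counts, map(map_replies, tweets), Counter())
```

Because the reduce step is associative, partial tallies could be computed on separate nodes and merged afterwards, which is the property a distributed Map/Reduce framework relies on.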

Page 14

Questions?
