coalmine spie 2012 presentation - jsw -d3
Coalmine: An Experience in Building a System for Social Media Analytics
Joshua S. White, Jeanna N. Matthews, PhD
Outline
• Problem
• Method Overview
• Data Collection
• Analysis
• Case Studies
• Conclusion / Future Work
Problem
• Social Media Networks
  – A communications means for good and bad
    • Proven cases of malware / botnet use
    • SPAM medium
• Our Goal
  – To provide a generalized tool for analysis of potential threats that use these networks for communications.
Method Overview
• Initially (Spring 2011)
  – Twitter-approved OAuth application
    • Firehose subscription with white-listing
  – ~20% of all Tweets
  – (No longer available)
    » Twitter no longer allows researchers to share datasets
    » We needed to develop a new collection method
    » The new method cannot violate the terms of use
Data Collection
• Current
  – Distributed data collection infrastructure
  – Geographically dissimilar IPs to simulate multiple users
  – Registered application with non-authenticated API access
    • ~80-100% of all Tweets (1 billion+ / week)
• Storage
  – Collection in streaming Gzip Python dict format (10:1 compression ratio)
    • Converted to JSON on the fly when needed
  – Initially stored in HDFS (had issues)
    » Recent work uses DDFS
  – Indexed using Lucene
    • New methods are being explored
  – Discodex w/ BSON store
  – Storing 1.5 TB a week
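The storage bullets above (gzip-compressed Python dict records, converted to JSON when needed) can be sketched roughly as follows. This is a minimal illustration and not the Coalmine code: it assumes one dict literal per line, and the names `iter_tweets` and `to_json` are hypothetical.

```python
import ast
import gzip
import io
import json


def iter_tweets(stream):
    """Yield one tweet dict per non-empty line of a gzip-compressed stream.

    Assumption: each line holds the repr() of a single Python dict.
    """
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                # literal_eval parses Python literals without executing code
                yield ast.literal_eval(line)


def to_json(tweet):
    """Convert one tweet dict to a JSON string on demand."""
    return json.dumps(tweet, ensure_ascii=False)


# Small in-memory round trip using made-up record contents
sample = {"id": 80703603437875201, "text": "hello", "retweeted": False}
buffer = io.BytesIO()
with gzip.open(buffer, mode="wt", encoding="utf-8") as out:
    out.write(repr(sample) + "\n")
buffer.seek(0)
json_lines = [to_json(t) for t in iter_tweets(buffer)]
```

Reading the stream line by line keeps memory use flat, which matters at the weekly volumes quoted above.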
Analysis
• Two-Part Method
  – Manual Inspection
    • Query Panel Front-end
  – Automated Inspection
Example Analysis
Field Name | Description | Example Data
name | User's real name | Text: "Robert Scoble"
screen_name | User's Twitter username | Text: "scobleizer"
profile_image_url | Link to user's profile image | Link: "http://a1.twimg.com/profile_images/456562836/scoblebuilding43crop-fanatiguy_normal.jpg"
url | Link to user's non-Twitter site | Link: "http://www.google.com/profiles/scobleizer"
followers_count | Number of followers user has | Number: "185496"
friends_count | Number of people user follows | Number: "31971"
utc_offset | Offset from GMT (in seconds) | Number: "-28800"
geo_enabled | Whether user has enabled location | Boolean: "True"
statuses_count | Number of statuses user has posted | Number: "53522"
Tweet Specific Fields
created_at | Tweet timestamp | Text: "Tue Jun 14 18:30:13 +0000 2011"
id | Tweet id (useful for URL creation) | Number: "80703603437875201"
text | Contains the actual text + any embedded URLs | Whatever text the person chooses to enter (could be any supported language)
source | Links to Twitter client URL (not important) | HTML code: "<a href="http://www.echofon.com/" rel="nofollow">Echofon</a>"
in_reply_to_status_id | ID of the status that the user replied to | Number: "80671170374025220"
in_reply_to_screen_name | Screen name of user the current status replies to | Text: "danharmon"
retweet_count | Number of times this status is retweeted | Number: "0"
retweeted | Whether or not the status has been retweeted | Boolean: "false"
'geo' flag specific
georss:point | Lat. & Long. location | Number: "43.21227199 -75.39866939"
url | Points to a JSON or XML file with further geo info | Link: "http://api.twitter.com/1/geo/id/00228ed265b1139e.xml"
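A few of the fields listed above can be combined as in the sketch below. It uses the example values from the table. The nesting of user fields under a `user` key, the status URL pattern, and the helper names are assumptions made for illustration; they are not taken from the presentation.

```python
from datetime import datetime, timedelta, timezone

# Example values from the field table; the "user" nesting is an assumption
tweet = {
    "id": 80703603437875201,
    "created_at": "Tue Jun 14 18:30:13 +0000 2011",
    "user": {"screen_name": "scobleizer", "utc_offset": -28800},
}


def tweet_url(tweet):
    """Build a status URL from screen_name and id (assumed URL pattern)."""
    return "https://twitter.com/{}/status/{}".format(
        tweet["user"]["screen_name"], tweet["id"]
    )


def created_at_utc(tweet):
    """Parse the created_at text into a timezone-aware datetime."""
    return datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")


def local_created_at(tweet):
    """Shift the timestamp by the user's utc_offset, which is in seconds."""
    user_zone = timezone(timedelta(seconds=tweet["user"]["utc_offset"]))
    return created_at_utc(tweet).astimezone(user_zone)


print(tweet_url(tweet))
print(local_created_at(tweet).isoformat())
```

An offset of -28800 seconds is eight hours behind GMT, so the 18:30 UTC timestamp corresponds to 10:30 local time for this user.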
Case Study: Botnet C2
• One well-known case:
  – Arbor Networks detected the first known incident in 2009
    • Base64-encoded control signals
  – Soon after:
    • A number of tools were released to do the same:
      – ControlMyPC, KreiosC2, etc.
• Sample Manual Detection:
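One way an automated pass might flag the Base64-encoded control signals mentioned above is a simple decode test on each token of the tweet text. This is a heuristic sketch and not the detection method used in Coalmine: the 16-character minimum, the printable-ASCII check, and the function names are assumptions, and the encoded payload in the usage lines is made up.

```python
import base64
import binascii
import re

# Assumed threshold: only consider tokens of at least 16 Base64 characters
BASE64_TOKEN = re.compile(r"^[A-Za-z0-9+/]{16,}={0,2}$")


def decode_if_base64(token):
    """Return the decoded text if token looks like Base64 of printable ASCII."""
    if len(token) % 4 != 0 or not BASE64_TOKEN.match(token):
        return None
    try:
        raw = base64.b64decode(token, validate=True)
        text = raw.decode("ascii")
    except (binascii.Error, ValueError):
        # Covers bad padding, bad characters, and non-ASCII payloads
        return None
    return text if text.isprintable() else None


def suspicious_tokens(tweet_text):
    """List (token, decoded) pairs for tokens that decode cleanly."""
    hits = []
    for token in tweet_text.split():
        decoded = decode_if_base64(token)
        if decoded is not None:
            hits.append((token, decoded))
    return hits


# Usage with a made-up payload
encoded = base64.b64encode(b"http://example.com/cmd").decode("ascii")
print(suspicious_tokens("status update " + encoded))
```

A check like this will produce false positives on long alphanumeric strings, so in practice it would only narrow the set of tweets passed to manual inspection.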
Case Study: SPAM
• Twitter's number one problem: it artificially increases traffic and bothers legitimate users
• Easily detected during manual analysis
• Automated detection based on wording and the rate at which messages are posted
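The SPAM case study describes automated detection based on wording and posting rate. A minimal sketch of that idea follows; the phrase list, the rate threshold, the flat record layout, and the function names are all hypothetical and would need tuning against real data.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical settings, chosen only for illustration
SPAM_PHRASES = ("free followers", "click here", "work from home")
MAX_PER_MINUTE = 5


def parse_time(created_at):
    """Parse a created_at string such as 'Tue Jun 14 18:30:13 +0000 2011'."""
    return datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")


def flag_accounts(tweets):
    """Map screen_name to the reasons ('rate', 'wording') it was flagged."""
    by_user = defaultdict(list)
    for tweet in tweets:
        by_user[tweet["screen_name"]].append(tweet)

    flagged = {}
    for user, items in by_user.items():
        times = sorted(parse_time(t["created_at"]) for t in items)
        # Treat any burst shorter than one minute as a one-minute window
        minutes = max((times[-1] - times[0]).total_seconds() / 60.0, 1.0)
        rate = len(items) / minutes
        wording_hits = sum(
            any(phrase in t["text"].lower() for phrase in SPAM_PHRASES)
            for t in items
        )
        reasons = []
        if rate > MAX_PER_MINUTE:
            reasons.append("rate")
        if wording_hits:
            reasons.append("wording")
        if reasons:
            flagged[user] = reasons
    return flagged
```

Keeping the two signals separate in the output lets a reviewer see whether an account was flagged for volume, for content, or for both.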
Conclusion / Future Work
• Coalmine - A tool for Social Media Analysis
  – Scales well based on initial tests
  – Useful for both manual and automated detection
• Future (Current) Work
  – Rebuild of the tool to fix scaling limitations
    • More extensible Map/Reduce method
    • Inclusion of native multi-threading capability
    • New storage and distribution method
    • New algorithms for automated opinion leader detection
Questions
?