Techniques for Automating Quality Assessment of Context-specific Content on Social Media Services

Prateek Dewan
PhD Thesis Defense

November 14, 2017

prateekd@iiitd.ac.in

Committee members
• Dr. Alessandra Sala
• Dr. Sanasam Ranbir Singh
• Dr. Aditya Telang
• Dr. Ponnurangam Kumaraguru (Advisor)

Who am I?

• Data Scientist at Apple
• PhD student since February 2012, IIIT-Delhi
• Masters (2010-2012), IIIT-Delhi
• Collaborations
  • IBM IRL (Delhi and Bengaluru), Symantec Research Labs (Pune), Dublin City University (Ireland), UFMG (Brazil)
  • Worked in Privacy and Security on Online Social Media
• Research interests
  • Applied Machine Learning
  • Natural Language Processing
  • Web Security

Online Social Media: The Big Picture

“With great power comes great responsibility”

Thesis statement

• To design and evaluate automated techniques for quality assessment of context-specific content on social media services in real time
• Focus: Facebook
  • Biggest Online Social Media service
  • 2.01 billion monthly active users
  • 2 out of every 7 human beings on the planet use Facebook
  • Most sought-after OSN for news

Proposed Solution

Identify, Characterize, Model

Prototype, Deploy, Evaluate

Facebook Inspector: Demo

• Establishing the definition of poor quality content
• What content counts as poor quality?
  • Untrustworthy
  • Child unsafe
  • Misleading information
  • Hoaxes, scams, clickbait
  • Violence, hate speech
• Definition conforming to
  • Facebook’s community standards¹
  • Definitions of page spam

¹ https://www.facebook.com/communitystandards

Approach

Characterize
• Poor quality posts published on Facebook
• Facebook pages publishing poor quality content
• Misinformation spread on Facebook through images

Model
• Ground truth extraction using URL blacklists and human annotation
• Experiments with multiple supervised learning techniques
• Two-fold model to identify malicious content in real time

Implement
• Facebook Inspector (FbI) architecture
• Live deployment via REST API and browser plug-ins for Chrome and Firefox
• 3,000+ downloads, 180+ daily active users, 1 million+ posts analyzed
• Evaluation in terms of response time, performance, and usability

Dataset

Data type                                            Quantity
Unique posts                                         4,465,371
Unique entities                                      3,373,953
Unique users                                         2,983,707
Unique pages                                         390,246
Unique URLs                                          480,407
Unique posts with one or more URLs                   1,222,137
Unique entities posting URLs                         856,758
Unique posts with one or more malicious URLs         11,217
Unique entities posting one or more malicious URLs   7,962
Unique malicious URLs                                4,622

Establishing Ground Truth

• Extracted posts containing one or more URLs
  • 1.2 million out of 4.4 million posts in total
  • 480k unique URLs
• Used six URL blacklists
  • Google Safebrowsing (malware/phishing)
  • VirusTotal (spam/malware/phishing)
  • Surbl (spam)
  • Web of Trust (trust score)*
  • SpamHaus (spam)
  • Phishtank (phishing)
• Posts containing one or more blacklisted URLs marked as poor quality posts (11,217 in all)

Web of Trust

A URL (e.g. http://www.domain.com) is marked Malicious if:
• Reputation: Unsatisfactory / Poor / Very poor (less than 60) with Confidence: High (greater than 10)
• OR Category: Negative
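The Web of Trust rule on this slide can be sketched as a small predicate. This is only an illustration: the `WotRating` container and its field names are hypothetical (they are not the real WOT API schema), and the sketch assumes the reputation and confidence conditions must hold together.

```python
from dataclasses import dataclass


@dataclass
class WotRating:
    """Hypothetical container for one WOT lookup result (not the real API schema)."""
    reputation: int          # trust score, assumed to be on a 0-100 scale
    confidence: int          # confidence in the score, assumed 0-100
    negative_category: bool  # True if WOT assigned any "Negative" category


def is_malicious(rating: WotRating) -> bool:
    """Apply the slide's rule: (reputation < 60 and confidence > 10) or negative category."""
    poor_reputation = rating.reputation < 60 and rating.confidence > 10
    return poor_reputation or rating.negative_category
```

For instance, `is_malicious(WotRating(40, 50, False))` flags the URL, while the same reputation with a confidence of 5 does not.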

Findings

• Facebook’s current techniques do not suffice
  • 65% of all poor quality posts still existed on Facebook after 4 (or more) months
  • Gathered likes from 52,169 unique users; comments from 8,784 unique users
• Facebook’s partnership with Web of Trust?
  • 88% of all malicious URLs had poor reputation on WOT
  • No warning pages

Platforms used to post

Distribution of poor quality posts

[Figure; recoverable labels: Pages, Users, Entities, Posts]

Facebook Pages posting poor quality content

Hiding in Plain Sight: Characterizing and Detecting Malicious Facebook Pages. Prateek Dewan, Shrey Bagroy, and Ponnurangam Kumaraguru (Short paper). Published at IEEE/ACM Conference on Advances in Social Networks Analysis and Mining (ASONAM), San Francisco, USA. 2016.

Ground Truth extraction: Facebook pages

4.4 million posts
→ 10,341 malicious posts (1,557 pages; 5,868 users)
→ 627 malicious pages (1 or more malicious URLs in the most recent 100 posts)

Dataset of pages posting poor quality content

WOT response     No. of pages   No. of posts
Child unsafe     387            10,891
Untrustworthy    317            8,057
Questionable     312            8,859
Negative         266            5,863
Adult content    162            3,290
Spam             124            4,985
Phishing         39             495
Total            627 (31)       20,999

• Numbers in brackets are Verified pages

Content analysis (page names)

• Sentence tokenization → Word tokenization → Case normalization → Stemming → Stopword removal
• N-gram analysis (n = 1, 2, 3)
• Politically polarized entities amongst poor quality pages
  • British National Party (BNP), The Tea Party, English Defense League, American Defense League, American Conservatives, Geert Wilders supporters…
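The page-name pipeline above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the tooling used in the thesis: the stopword list is a tiny illustrative one, `naive_stem` is a crude stand-in for a real stemmer, and sentence tokenization is skipped because page names are short.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; a real list would be much longer).
STOPWORDS = {"the", "of", "and", "a", "for", "in", "to"}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(name):
    """Word tokenization, case normalization, stemming, stopword removal."""
    tokens = re.findall(r"[A-Za-z]+", name)            # word tokenization
    tokens = [t.lower() for t in tokens]               # case normalization
    tokens = [naive_stem(t) for t in tokens]           # stemming
    return [t for t in tokens if t not in STOPWORDS]   # stopword removal


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(names, n):
    """Count n-grams across a collection of page names."""
    counts = Counter()
    for name in names:
        counts.update(ngrams(preprocess(name), n))
    return counts
```

Calling `ngram_counts(page_names, 2).most_common(10)` on a list of page names would surface the most frequent bigrams, which is one way recurring group names could be spotted.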

Network analysis

• Collusive behavior within pages posting poor quality content

[Figure panels: Shares, Likes, Comments]

Temporal activity

• Activity ratio: [formula not recoverable from the extracted text; only the trailing phrase "during complete observation period" survives]
• Malicious pages are more active than benign pages

Why?: The Human Brain - Images versus text

• Human brain processes images 60,000 times faster than text

Are we doing enough to "understand" images?

• Most research to analyze social media content focuses on text
  • Topic modelling
  • Sentiment analysis
• Does it capture everything?
• Studies related to images are limited to small scale
  • Few hundred images manually annotated and analyzed
• What can be done?
  • Automated techniques for image summarization; Deep Learning and Convolutional Neural Networks (CNNs) to scale across large no. of images
  • Domain transfer learning
  • Optical Character Recognition

Methodology

• Images posted on Facebook during the Paris Attacks, November 2015
• 3-tier pipeline for extracting high level image descriptors from images

Unique posts             131,548
Unique users             106,275
Posts with images        75,277
Total images extracted   57,748
Total unique images      15,123

[Figure: 3-tier pipeline from Images to Human understandable descriptors]
• Tier 1: Visual Themes (Inception v3), with manual calibration
• Tier 2: Image Sentiment (DeCAF trained on SentiBank)
• Tier 3: Text embedded in images (Optical Character Recognition), followed by Text Sentiment (LIWC) + Topics (TF)

Tier I: Visual Themes

• ImageNet Large Scale Visual Recognition Challenge (ILSVRC), 2012
  • 1.2 million images, 1,000 categories
• Winner: Google’s Inception-v3 (top-1 error: 17.2%)
  • 48-layer Deep Convolutional Neural Network

Tier I: Visual Themes contd.

• All images labeled using Inception-v3
• Validation:
  • Random sample of 2,545 images annotated by 3 human annotators
  • 38.87% accuracy (majority voting)
• Manual calibration
  • Renamed 7 out of the top 30 (most frequently occurring) labels
  • New accuracy: 51.3%
• Why rename? → e.g. Bolo Tie (Inception-v3) corresponds to Peace For Paris (our dataset)
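The majority-voting validation can be illustrated with a short sketch. It assumes (the slides do not spell this out) that each annotator records whether the predicted label for an image is correct, and that a label counts as correct when more than half of the annotators agree; the function names are illustrative.

```python
def majority_vote(votes):
    """Return True if more than half of the annotator votes are True."""
    return sum(votes) > len(votes) / 2


def accuracy_by_majority(annotations):
    """Fraction of images whose predicted label was accepted by a majority.

    `annotations` holds one tuple of booleans per image, one boolean per annotator.
    """
    accepted = sum(1 for votes in annotations if majority_vote(votes))
    return accepted / len(annotations)
```

With three annotators, an image is counted as correctly labeled when at least two of them accept the label.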

Tier II: Image Sentiment

• Domain Transfer Learning
• Inception-v3’s last layer retrained using SentiBank
• SentiBank
  • Images collected from Flickr using Adjective Noun Pairs (ANPs) as search query
  • ANPs: happy dog, adorable baby, abandoned house
  • Weakly labeled dataset of images carrying emotion
• Final training set: 133,108 negative + 305,100 positive sentiment images
• 10-fold random subsampling
• 69.8% accuracy
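Repeated random subsampling, as used for the evaluation above, can be sketched generically. The 10% held-out fraction, the seed, and the `evaluate` callback are assumptions made for illustration; the slides only state that 10 rounds were used.

```python
import random


def random_subsampling_splits(n_items, n_rounds=10, test_fraction=0.1, seed=0):
    """Yield (train_indices, test_indices) for repeated random subsampling.

    Each round draws a fresh random split; unlike k-fold cross-validation,
    test sets from different rounds may overlap.
    """
    rng = random.Random(seed)
    indices = list(range(n_items))
    n_test = max(1, int(n_items * test_fraction))
    for _ in range(n_rounds):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


def mean_accuracy(n_items, evaluate, **split_kwargs):
    """Average the score returned by `evaluate(train, test)` over all rounds."""
    scores = [evaluate(train, test)
              for train, test in random_subsampling_splits(n_items, **split_kwargs)]
    return sum(scores) / len(scores)
```

Here `evaluate` would train a classifier on the training indices and return its accuracy on the test indices; the reported figure is then the mean over rounds.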

Tier III: Text embedded in images

• Optical Character Recognition (OCR)
  • Tesseract OCR (Python)
  • 31,689 images had text
• Manually extracted text from a random sample of 1,000 images
• Compared with OCR output using string similarity metrics
• ~62% accuracy

Tesseract output:

No-one thinks that these people are representative of Christians. So why do so many think that these people are representative of Muslims?
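Comparing manually transcribed text with OCR output can be sketched with a standard-library similarity measure. The slides do not name the string similarity metrics that were used, so `difflib.SequenceMatcher` here is a stand-in, and the normalization step is an assumption.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())


def similarity(manual, ocr):
    """Similarity in [0, 1] between a manual transcription and OCR output."""
    return SequenceMatcher(None, normalize(manual), normalize(ocr)).ratio()


def mean_similarity(pairs):
    """Average similarity over (manual, ocr) pairs from a validation sample."""
    return sum(similarity(manual, ocr) for manual, ocr in pairs) / len(pairs)
```

Averaging `similarity` over the manually transcribed sample gives a single accuracy-style figure for the OCR step.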

Image and post text had different topics

• Text embedded in images depicted more negative sentiment than user generated textual content

[Figure panels: Text embedded in images; User generated text]

Sentiment: Images versus text

• Image sentiment was more positive than text sentiment

[Figure: Sentiment Value and Volume Fraction against No. of hours after the attacks (8 to 280); series: Post Text, Image Text, Image, Volume Fraction]

Poor quality image content popular on Facebook

Revisiting: Establishing Ground Truth

• Extracted posts containing one or more URLs
  • 1.2 million out of 4.4 million posts in total
  • 480k unique URLs
• Used six URL blacklists
  • Google Safebrowsing (malware/phishing)
  • VirusTotal (spam/malware/phishing)
  • Surbl (spam)
  • Web of Trust (trust score)*
  • SpamHaus (spam)
  • Phishtank (phishing)
• Posts containing one or more blacklisted URLs marked as poor quality posts (11,217 in all)

Ground Truth extraction: Dataset II

• What if a post does not have a URL?
• 500 random Facebook posts x 17 events x 3 annotators
• Definition of malicious post
  • “Any irrelevant or unsolicited messages sent over the Internet, typically to large numbers of users, for the purposes of advertising, phishing, spreading malware, etc. are categorized as spam. In terms of online social media, social spam is any content which is irrelevant / unrelated to the event under consideration, and / or aimed at spreading phishing, malware, advertisements, self promotion etc., including bulk messages, profanity, insults, hate speech, malicious links, fraudulent reviews, scams, fake information etc.”
• Final dataset (all 3 annotators agreed on the same label)
  • 571 malicious posts
  • 3,841 benign posts

Feature set: Facebook Posts

Source            Features
Entity (9)        isPage, gender, pageCategory, hasUsername, usernameLength, nameLength, numWordsInName, locale, pageLikes
Textual content   Presence of !, ?, !!, ??, emoticons (smile, frown), numWords, avgWordLength, numSentences, avgSentenceLength, numDictionaryWords, numHashtags, hashtagsPerWord, numCharacters, numURLs, URLsPerWord, numUppercaseCharacters, numWords / numUniqueWords
Metadata (10)     Application, Presence of facebook.com URL, Presence of apps.facebook.com URL, Presence of Facebook event URL, hasMessage, hasStory, hasPicture, hasLink, type, linkLength
Link (7)          http / https, numHyphens, numParameters, avgParameterLength, numSubdomains, pathLength
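A few of the textual and link features listed above could be computed along these lines. This is a sketch under assumptions: the exact feature definitions are not given on the slide, so the rules here (whitespace word splitting, subdomain counting by dots, and the extra `hasExclamation` and `isHttps` names) are illustrative choices.

```python
import re
from urllib.parse import urlparse, parse_qsl

URL_PATTERN = re.compile(r"https?://\S+")


def text_features(message):
    """Compute a handful of textual features from a post's message."""
    words = message.split()
    return {
        "numWords": len(words),
        "avgWordLength": sum(len(w) for w in words) / len(words) if words else 0.0,
        "numHashtags": sum(1 for w in words if w.startswith("#")),
        "numURLs": len(URL_PATTERN.findall(message)),
        "numUppercaseCharacters": sum(1 for c in message if c.isupper()),
        "hasExclamation": "!" in message,
    }


def link_features(url):
    """Compute link features from a URL shared in a post."""
    parsed = urlparse(url)
    params = parse_qsl(parsed.query, keep_blank_values=True)
    host = parsed.hostname or ""
    return {
        "isHttps": parsed.scheme == "https",
        "numHyphens": url.count("-"),
        "numParameters": len(params),
        "avgParameterLength": (sum(len(v) for _, v in params) / len(params)) if params else 0.0,
        # Assumption: everything left of the last two host labels is a subdomain.
        "numSubdomains": max(host.count(".") - 1, 0),
        "pathLength": len(parsed.path),
    }
```

The resulting dictionaries can be merged per post and passed to any of the classifiers compared in the following tables.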

Supervised learning: Dataset I

Classifier / Features   Entity   Text    Metadata   Link    All     Top 7
Naïve Bayes             54.79    52.41   71.60      69.25   56.15   74.72
Decision Tree           63.02    64.78   80.56      82.34   84.67   86.17
Random Forest           63.47    66.25   80.67      82.56   85.05   86.62
SVM rbf                 61.77    64.89   78.75      81.45   75.89   83.66

Supervised learning: Dataset II

Classifier / Features   Entity   Text    Metadata   Link    All
Naïve Bayes             51.67    51.60   72.45      77.58   67.63
Decision Tree           51.66    73.16   79.01      81.04   76.17
Random Forest           52.86    76.56   79.87      81.49   80.56
SVM rbf                 53.16    76.52   78.18      80.37   73.79

Feature set: Facebook Pages

Page features      Likes, talking about, description length, bio, category, name, location, check-ins, …
Posting behavior   Daily activity ratio, post types, post likes, post comments, post shares, post engagement ratio, post language, average post length, no. of unique URLs in posts, no. of unique domains in posts, etc.

• Supervised learning
  • Page + post features
    • 55 features from page information
    • 41 features from posting behavior
  • Bag of words
    • Content generated by pages

Supervised learning: Page + post features

Classifier            Feature set   Accuracy (%)   ROC AUC
Naïve Bayesian        Page          …95*           0.685
                      Post          69.61          0.753
                      Page + Post   70.81          0.776
Logistic Regression   Page          …38*           0.745
                      Post          76.55          0.825
                      Page + Post   76.71          0.846
Decision Trees        Page          …55*           0.668
                      Post          71.37          0.720
                      Page + Post   70.81          0.758
Random Forest         Page          …86*           0.750
                      Post          74.95          0.829
                      Page + Post   75.27          0.837

* Page-only accuracy values are truncated in the source; only the trailing digits survive.

Supervised learning: Bag of words

Classifier            Feature set   Accuracy (%)   ROC AUC
Naïve Bayesian        Unigrams      68.27          0.682
                      Bigrams       69.06          0.690
                      Trigrams      69.77          0.697
Logistic Regression   Bigrams       74.34          0.791
Decision Trees        Bigrams       67.05          0.678
Random Forest         Bigrams       71.80          0.802
Sparse NN             Bigrams       84.12          0.872
                      Trigrams      84.13          0.900

Model for real time detection

• Model for pages depends on posts published by pages
  • Can’t be used for detection in real time
• Two-fold supervised learning based model using post features
• Utilizing class probabilities for decision making

[Figure: decision boundary; recoverable labels: Classifier 1, Classifier 2, Low, Malicious, Benign]
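One way class probabilities from two classifiers could be combined is sketched below. The slides do not specify the combination rule or the threshold, so both are assumptions: this version flags a post when either classifier's malicious-class probability reaches a hypothetical threshold of 0.5.

```python
def two_fold_decision(p1, p2, threshold=0.5):
    """Combine malicious-class probabilities from two classifiers.

    Assumed rule (not taken from the slides): label the post malicious if
    either classifier is at least `threshold` confident, otherwise benign.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return "malicious" if max(p1, p2) >= threshold else "benign"
```

Raising `threshold` trades recall for precision; a deployed system would tune it on held-out data rather than fix it at 0.5.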

Facebook Inspector (FbI): Architecture

FbI stats

Date of public launch         August 23, 2015
Total incoming requests       9 million+
Total public posts analyzed   3.5 million+
Total downloads               5,000+
Daily active users            250+
Total unique browsers         1,250+
Posts marked as malicious     615,000+
Posts marked as benign        2.9 million+

FbI evaluation: Response time

• ~80% of posts processed within 3 seconds
• Average time per post: 2.635 seconds

FbI evaluation: Usability

• Usability study with 53 participants
  • SUS score: 81.36 (A grade)
  • Higher perceived usability than >90% of all systems evaluated using the SUS scale
• 98.1% of participants found FbI “easy to use”
• 67.9% of participants would like to use FbI frequently
• Quotes from users:
  • “Saves your time spent on spam links and hence enhances user experience.”
  • “[Facebook Inspector] Can be useful for minors and people who lack the judgement to decide how the post is.”
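For reference, a System Usability Scale (SUS) score for one participant is computed from ten 1-to-5 responses using the standard scoring rule, sketched below; the study-level score is then the mean over participants. The function name is illustrative and this is not the evaluation code used for FbI.

```python
def sus_score(responses):
    """Standard SUS scoring for one participant's ten responses (each 1-5).

    Odd-numbered items contribute (response - 1), even-numbered items
    contribute (5 - response); the sum is scaled by 2.5 to give 0-100.
    """
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("expected ten responses, each an integer from 1 to 5")
    total = 0
    for item, response in enumerate(responses, start=1):
        total += (response - 1) if item % 2 == 1 else (5 - response)
    return total * 2.5
```

A participant who answers 5 on every odd item and 1 on every even item scores the maximum of 100.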

Contributions summary

• Identified and characterized poor quality content spread on Facebook, with the purpose of identifying poor quality posts published during news-making events in real time
• Evaluated supervised learning approaches for identifying poor quality posts on Facebook in real time, using entity, textual, metadata, and URL features
• Deployed and evaluated a novel framework and system for real time detection of poor quality posts on Facebook during news-making events

How does it help?

• Social media services are the primary source of information for the majority of Internet users
  • Content is unmoderated and crowd-sourced; not everything you see may be true
• Facebook Inspector provides a useful and usable real world solution to assist users
• Methodology for fast and accurate summarization of image datasets pertaining to a given topic
  • Government agencies / brands can use this methodology to quickly produce high-level summaries of events / products and gauge the pulse of the masses

Real world impact

• Real time system Facebook Inspector, built to identify poor quality content, is used by 250+ Facebook users and has processed over 9 million requests
• A unique dataset of Facebook posts containing malicious URLs, pages posting malicious content, and images depicting misinformation from 20+ news-making events

Limitations and future work

• Current system does not incorporate user feedback
  • We would like to enable users to provide feedback to make a more personalized detection model
• Computer vision techniques have limited accuracy on social media content
  • Object detection, sentiment analysis, and optical character recognition techniques we used are not tested thoroughly on social media content
• Identify and rank users on the basis of degree of malice
  • More malicious content generated, higher the ranking

Acknowledgements

• NIXI for travel support (eCRS, 2014)
• IIIT-Delhi for travel support (ASONAM, 2017)
• Govt. of India for funding during PhD
• Collaborators and co-authors: Dr. Anand Kashyap, Shrey Bagroy, Anshuman Suri, Varun Bharadhwaj, Aditi Mithal
• Monitoring committee: Dr. Vinayak and Dr. Sambuddho
• Peers: Dr. Niharika Sachdeva, Anupama Aggarwal, Dr. Paridhi Jain, Dr. Aditi Gupta, Srishti Gupta, Rishabh Kaushal
• Members of Precog@IIITD and CERC
• Everyone else who has been part of my journey…

Publications: Part of thesis

• Dewan, P., Bagroy, S., and Kumaraguru, P. Hiding in Plain Sight: The Anatomy of Malicious Pages on Facebook. Book chapter, Lecture Notes in Social Networks, Springer 2017 (To appear)
• Dewan, P., Suri, A., Bharadhwaj, V., Mithal, A., and Kumaraguru, P. Towards Understanding Crisis Events On Online Social Networks Through Pictures. IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2017.
• Dewan, P., and Kumaraguru, P. Facebook Inspector (FbI): Towards Automatic Real Time Detection of Malicious Content on Facebook. Social Network Analysis and Mining Journal (SNAM), 2017. Volume 7, Issue 1.
• Dewan, P., Bagroy, S., and Kumaraguru, P. Hiding in Plain Sight: Characterizing and Detecting Malicious Facebook Pages. IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2016 (Short paper)
• Dewan, P., and Kumaraguru, P. Towards Automatic Real Time Identification of Malicious Posts on Facebook. Thirteenth Annual Conference on Privacy, Security and Trust (PST), 2015
• Dewan, P., Kashyap, A., and Kumaraguru, P. Analyzing Social and Stylometric Features to Identify Spear phishing Emails. APWG eCrime Research Symposium (eCRS), 2014

Publications: Other

• Kaushal, R., Chandok, S., Jain, P., Dewan, P., Gupta, N., and Kumaraguru, P. Nudging Nemo: Helping Users Control Linkability across Social Networks. 9th International Conference on Social Informatics (SocInfo), 2017 (Short paper).
• Deshpande, P., Joshi, S., Dewan, P., Murthy, K., Mohania, M., Agrawal, S. The Mask of ZoRRo: preventing information leakage from documents. Knowledge and Information Systems Journal, 2014
• Mittal, S., Gupta, N., Dewan, P., Kumaraguru, P. Pinned it! A large scale study of the Pinterest network. 1st ACM IKDD Conference on Data Sciences (CoDS), 2014
• Dewan, P., Gupta, M., Goyal, K., and Kumaraguru, P. MultiOSN: Realtime Monitoring of Real World Events on Multiple Online Social Media. IBM ICARE 2013
• Magalhães, T., Dewan, P., Kumaraguru, P., Melo-Minardi, R., and Almeida, V. uTrack: Track Yourself! Monitoring Information on Online Social Media. 22nd International World Wide Web Conference (WWW), 2013
• Conway, M., Dewan, P., Kumaraguru, P., McInerney, L. 'White Pride Worldwide': A Meta-analysis of Stormfront.org. Internet, Politics, Policy 2012: Big Data, Big Challenges?, Oxford Internet Institute, University of Oxford.

Thank you!

prateekd@iiitd.ac.in

http://precog.iiitd.edu.in/people/prateek
