an open source gate toolkit for social media analysis · an open source gate toolkit for social...
TRANSCRIPT
GATEfortextengineering• ToolfordevelopinganddeploymentofTextMiningtechnology
• http://gate.ac.uk
• 20yearsold(butnotpastityet!)
• establishedlargedevelopercommunity,incl.industrialcommitters(Ontotext,Intellius,TextMiningSolutions)
• Usedworldwidebymanyorganisations tobuildbespokesolutions,e.g.TNA,PressAssociation,BBC,NHS
• Afreeopensourceframework(LGPL)andGUI
• IncludesInformationExtractiontoolsinmanylanguages• Researchteamof15people• Lotsoftoolsalsoforsocialmediaanalysis
GATEopen-sourcesocialmediatools
Name/Description Description/Domain License
GATE Numeroustoolsfortextanalytics,includingsocialmedia
LGPL+pluginspecific
TwitIE NERfortweets LGPL
GATECloud Variouscloud-basedGATEservices,includingsocialmediaanalytics
SaaSmodel
Sentimentanalysis Generictwitter-basedsentimentanalysis LGPL
TweetCollector Socialmediaanalytics SaaSmodel
YODIE(indevelopment) Cloud-basednamedentitydisambiguationandlinking;versionforsocialmedia
SaaSmodel
Hashtaganalysis
● Weneedtochopupthehashtagsintoindividualwords,andfindanytermswithinthem●#palmoilhumanrights --> palm oil + human rights
GATECloudTools
• YoucantryvariousdemosofourtoolsontheGATECloud
• http://cloud.gate.ac.uk• Youcanprocesssmallamountsoftextforfree• Forlargervolumes,youcansetupanaccountandpayforthetimeyouuse
CollectyourownTweets!
• Twitter providesAPIstodownloadtweetsinrealtimebasedonvariousfilteringconditions
• Theycanbechallengingtosetupifyou'renotaprogrammer
• WehavetoolsontheGATECloudtoletyousetupaquerytocollecttweetsbyparticularpeople,matchingcertainkeywords,geo-taggedwithparticularlocations,andsoon.
• Youdon’tneedtoinstallanysoftware!• Moredetails availableonGATECloudpages
Exploringthecollecteddata• Youcanalsoget6reportsonyourdatacollectedsofar,updatedinrealtime.
1. Viewthetophashtagsasabarchartortagcloud,andaddhashtagstothetrackinglistinthecollector.
2. Viewalinegraphoftweetfrequencyperhour.3. Viewabarchartorcloudofthetoptopicsaccordingtolistsof
terms.Thelistscanbeeditedfromalinkonthetopicsgraphpage.Topicscanbeaddedastermstothetracker.
4. Viewabarchartorcloudofthemostmentionedusers.Thesecanalsobeaddedforthecollectortotrack.
5. Viewabarchartorcloudofthefrequencyofthetermsyouaretracking.
6. Viewabarchartorcloudofthetopwordsofonegrammaticaltype(adjective,verb,nounetc.)(onlyforEnglish)
Demo
Analysing Polarised SocietalDebates
• Public,polarised debatesaffectpolicymaking,thecourseofacountry,votingintents
• Historically,thepublicdebatestookplaceonTVandinvolvedonlypoliticiansormediafigures
Motivation
• Today,polarised debatesalsotakeplacepubliclyontheinternet,andanyonecanparticipate
• Analysing thesepublicstatementstellsus:• Whatparticipantsarethinking• Whatargumentstheyaremaking• Whoisthinkingwhat• Gaina“bigpicture”viewoftheresponsetoeventsorpolicies.
TheStory
• Analyse socialmediasurroundingpolarised societaldebates
• Indextheresultstounderstand,visualise andexplore:
• Whattopicsarebeingdiscussed?• Whatsentimentsarebeingexpressed?• Whoisparticipatinginthedebate?• Howdodebatesevolveovertime?
• EuropeanResearchInfrastructureforBigDataandSocialMining• 7Europeancountries,morethan100researchers• Collectionsofdatasets,creationoftools• Trans-nationalaccessprogramme (comeandvisitus!)• Researchorganised inverticalthemes(exploratories)ontopofSBDinfrastructure
• Performcross-disciplinarycutting-edgesocialmediaminingresearch
• CityofCitizens• Well-being&Economy• SocietalDebates• MigrationStudies
www.sobigdata.eu@sobigdata
Tweets
Tokenization
Normalization POS tagging
NErecognition
Leave/remainclassification
TopicDetection
TweetGeolocation
Userclassification
SemanticSearch
TweetsaretrackedinrealtimeusingthestreamingAPI
TweetCollection
Tweets
Tokenization
Normalization POS tagging
NErecognition
Leave/remainclassification
TopicDetection
TweetGeolocation
Userclassification
SemanticSearch
Individualtokens(analogoustoterms)areextracted
Tokenisation
Tweets
Tokenization
Normalization POS tagging
NErecognition
Leave/remainclassification
TopicDetection
TweetGeolocation
Userclassification
SemanticSearch
Spellingandabbreviationsarenormalised tohelplinguisticprocessingtools
Normalization
Tweets
Tokenization
Normalization POS tagging
NErecognition
Leave/remainclassification
TopicDetection
TweetGeolocation
Userclassification
SemanticSearch
Parts-of-speechareidentified.Thisisnecessarytosupportlaterprocessing
POStagging
Tweets
Tokenization
Normalization POS tagging
NErecognition
Leave/remainclassification
TopicDetection
TweetGeolocation
Userclassification
SemanticSearch
Wediscovermentionsofentitiessuchaspeople,locations,organisations andproducts
NamedEntities
Tweets
Tokenization
Normalization POS tagging
NErecognition
Leave/remainclassification
TopicDetection
TweetGeolocation
Userclassification
SemanticSearch
Intenttovoteleaveorremainclassifiedby#hashtag
VotingIntent
Tweets
Tokenization
Normalization POS tagging
NErecognition
Leave/remainclassification
TopicDetection
TweetGeolocation
Userclassification
SemanticSearch
Tweetsarematchedagainstadetailedlistoftopics
TopicDetection
Tweets
Tokenization
Normalization POS tagging
NErecognition
Leave/remainclassification
TopicDetection
TweetGeolocation
Userclassification
SemanticSearch
TweetsarelinkedtoNUTSregionsbasedonplacetagsanduserhomelocations
Geolocation
Tweets
Tokenization
Normalization POS tagging
NErecognition
Leave/remainclassification
TopicDetection
TweetGeolocation
Userclassification
SemanticSearch
Weclassifyusersintoegjournalist,charity,memberofpublic
UserClassification
Tweets
Tokenization
Normalization POS tagging
NErecognition
Leave/remainclassification
TopicDetection
TweetGeolocation
Userclassification
SemanticSearch
TweettextandannotationsareindexedinsemanticsearchengineMímirforsearchand
visualisation
SemanticSearch
SemanticsearchwithMímir
• Mímir:Multiparadigm IndexingandRetrieval
• Complexqueries– cansearchoverannotationslike
{DocumentTimestamp hour_timestamp >= 2016062223 hour_timestamp < 2016062323} OVER ( {DocumentKind tweet_kind = original} AND {NUTS2NUTS2 = "UKE3"})
• Canalsomixinfulltextandsemanticqueries.Verypowerful!
• WhichpoliticiansfromtheNorthofEnglandovertheageof40talkedmostpositivelyaboutclimatechange?
• Allowsustodrilldownandeasilyseetweetswithcertainproperties
Ego-networkanalysis
• AnalysisusingpreviousworkonselectedBrexitdata• ValerioArnaboldi• InstituteofInformaticsandTelematics(IIT)• ItalianNationalResearchCouncil(CNR)
• Understandingthequantityandqualityofrelationshipsofdebateparticipants
• Remain,leaveandneutraluserswereselectedbySheffieldusingtheirpipelineandprocessedbyCNR-IIT
• Niceexampleofbringingseveralsystemstogether
Remainers aremoresocial
• Theymaintainlargeregonetworks
• Largeractivenetworksize• numberofpeopleactivelycontactedbytheegos
• withafrequencyofatleastonedirecttweetperyear
• consideringmentions,replies,andretweets
• Effectremainsevenafterfilteringoutantisocialaccounts
Activenetworksizeinusers
Leavershadfewersocialcircles
• Socialcircle:• Groupofusersinanego-network
• Withsimilarlevelsofinteractiontooneanother
• Mostfrequentinteractionswithasmallersocialcircle
• Moreoccasionalinteractionswithalargercircle
Averagenumberofsocialcircles
Many“leave”accountsdidn’tsocialise atall
• Alotofaccountswerenotusedsocially
• Thisisreflectedinthenumberofaccountswithsocialcircles
• Theleavetwitterers werewellbelowtherandomsample. %ofaccountswithatleast1socialcircle
MIMIRquery
• Dataset:everytweetbyMP/Candidate/Party,plusallreplies/retweets
• FindalltweetswhereaConservativeMPtalkedabouttheeconomy
Parties/themesco-occurrence
UK economyEurope
Tax and revenueNHS
Borders and ImmigrationScotland
EmploymentCommunity and society
Public healthMedia and communications
Labo
ur P
arty
Can
dida
teLa
bour
Par
ty M
PCo
nser
vativ
e Pa
rty
Cand
idat
eCo
nser
vativ
e Pa
rty
MP
UKIP
Can
dida
teOt
her M
PSN
P Ca
ndid
ate
Gree
n Pa
rty
Cand
idat
eLi
bera
l Dem
ocra
ts C
andi
date
SNP
Othe
r
SemanticqueryingviaDBpedia
• Queryingforinformationnotexplicitinthetext
• “FindalltopicsintweetsbytheMPfromSheffieldHallam”
• Wedon’tneedtoknowwhothisMPis
• IftheSheffieldHallamMPchangedatsomepoint,thiswouldbecaptured
Analysis of the EarthHour campaign
l Analysis of hashtags and topics mentioned
l The main activities and themes of the campaign drove most of the social media conversations
l Users engaged in the campaign but did not necessarily engage with climate change and sustainability issues.
l Lack of correlation between Durex campaign and climate change engagement
BehaviourAnalysis• Basedontheassumptionthatusersindifferentbehaviouralstagescommunicatedifferently(differentemotions,directives,etc.)
Pajarito@lindopajarito .2h
Ourbuildingneeds40%ofallenergyconsumedinSwitzerland!L
DJPajarito@DJPajaritoGenial .12hI'msoproudwhenIremembertosaveenergyandIknowhoweversmallit'shelping.
Desirability:Negativesentiment(expressingpersonalfrustration-anger/sadness)
Buzz:Positivesentiment(happiness/joy).I/we+presenttense
HotelPajarito@HotelPajarito .18hJoinustodaytodaytoswitchofalightforEH!J
Invitation:Positivesentiment(happy)+useofvocativesYou are an organisation, not an individual!
What’snext?
• AnumberoftoolshavealreadybeenbroughttogetheraroundBrexitandotherdebates
• Moretoolsstillbeingdeveloped• ToolsavailablethroughtheSoBigData VREandGATECloud
• Focusonanalysisratherthanpredictingresults• 2017UKelections:imminentresultscomingthisweekonpartymanifestos!
• Othercurrentworkanalysing hatespeecharoundBrexit,andcorrelatingsocialmediaandnewsmediadiscussions
Linksformoreinfo
• GATEathttp://gate.ac.uk• GateCloud athttps://cloud.gate.ac.uk• ComeonaGATEtrainingcourseinJune!• BrexitVisualisations athttp://demos.gate.ac.uk/sobigdata/brexit/
• BrexitstudyblogpostfromNESTAathttp://www.nesta.org.uk/blog/network-analysis-top-eu-referendum-tweeters
• BrexitstudyblogpostsfromSheffieldathttp://gate4ugc.blogspot.co.uk/search/label/Brexit
• UKelectionsmonitorathttp://gate.ac.uk/projects/pft
Publications• DianaMaynard,IanRoberts,MarkA.Greenwood,DominicRoutandKalinaBontcheva.AFrameworkforReal-timeSemanticSocialMediaAnalysis.WebSemantics:Science,ServicesandAgentsontheWorldWideWeb,2017.
• K.Bontcheva,L.Derczynski,A.Funk,M.A.Greenwood,D.Maynard,N.Aswani.TwitIE:AnOpen-SourceInformationExtractionPipelineforMicroblogText.ProceedingsoftheInternationalConferenceonRecentAdvancesinNaturalLanguageProcessing(RANLP2013).
• D.MaynardandK.Bontcheva.Understandingclimatechangetweets:anopensourcetoolkitforsocialmediaanalysis.InProc.ofEnviroInfo 2015,Copenhagen,Sep.2015
• DianaMaynard,Kalina Bontcheva,IsabelleAugenstein.NaturalLanguageProcessingfortheSemanticWeb.MorganandClaypool,December2016.ISBN:9781627059091(containsachapteronsocialmediaanalysis)
• Available(andmore)athttps://gate.ac.uk/gate/doc/papers.html
Acknowledgements
Thisworksupportedby:• the European Union/EU under the Information and Communication
Technologies (ICT) theme of the 7th Framework and H2020 Programmes for R&D
• KNOWMAK (726992)https://gate.ac.uk/projects/knowmak/• DecarboNet (610829) http://www.decarbonet.eu● Pheme (611233) http://www.pheme.eu● SoBigData (654024) http://www.sobigdata.eu● COMRADES (687847) http://www.comrades-project.eu● OpenMinted (654021)http://openminted.eu● Kconnect (644753) http://www.kconnect.eu/
● Nesta http://nesta.org.uk