crime-sensing through social media - exploring the relationship between tweets about disorder and...
Upload: centro-de-investigacion-para-la-gestion-tecnologica-del-riesgo-cigtr
Post on 12-May-2015
Luke Sloan. Collaborator at COSMOS - Cardiff University. Summer Course "Disruptive innovation in security technologies". URJC's Vicálvaro Campus.
Crime-Sensing Through Social Media
Dr Luke Sloan
@drlukesloan
Collaborative Online Social Media ObServatory (COSMOS)
www.cosmosproject.net @cosmos_project
Matthew Williams, Luke Sloan, Pete Burnap, Jeffrey Morgan, William Housley, Adam Edwards, Omer Rana & Rob Procter
Outline
• Project Objectives
• COSMOS 2 – A Response
• Key Literature
• Research Question
• Data & Sampling
• Sensing Crime & Disorder
• Exploring Relationships
– Time
– Space
• Modelling Strategy
• Methodological Considerations
Project Objectives
• To evaluate the utility of crime and disorder related tweets in predicting patterns of crime in six London boroughs
• To develop an automated machine classifier for identifying tweets containing crime and disorder terms
• To develop statistical models that take into account temporal and spatial variation
• To compare conventional predictive models of crime with models containing social media derived data
COSMOS 2 – A Response I
• Every minute…
– 48 hours of video uploaded to YouTube
– 204,166,667 emails are sent
– 2,000,000 search queries on Google
– 217 users join the mobile web
– 571 new websites are created
– 100,000+ tweets are made
– 3,125 photos added to Flickr
– 684,478 pieces of content shared on Facebook
Source: http://mashable.com/2012/06/22/data-created-every-minute/ [accessed April 2014]
It’s all data!
COSMOS 2 – A Response II
• Social Media is potentially a rich source of naturally occurring data on beliefs, attitudes, reactions and opinions
• For example, Twitter can be used for…
– Brand tracking with sentiment (Scarfi 2012)
– Predicting movie revenue (Asur & Huberman 2010)
– Advance earthquake warning (Sakaki et al. 2010)
– Predicting election results (Tumasjan et al. 2010)
Unlike traditional social science data collection, this can all be gathered for free…
COSMOS 2 – A Response III
• The volume of data is phenomenal compared to a social survey…
– ‘Spritzer’ at 1% approx. 3.5m tweets a day (free)
– ‘Garden Hose’ at 10% approx. 35m tweets a day (make a case)
– ‘Fire Hose’ at 100% approx. 350m tweets a day (payment only)
COSMOS 2 – A Response IV
• The Collaborative Online Social Media ObServatory (COSMOS) funded by RCUK (ESRC & EPSRC), JISC and NCRM
• Develop a platform for data interoperability between:
– naturally occurring data (e.g. social media)
– curated data (e.g. social surveys)
– administrative data (e.g. A&E admissions, crime rates)
• Social media augments traditional social scientific investigation – it is not a surrogate! (Edwards et al. 2013)
• A key programme of work within COSMOS is making sense of social media data and extracting ‘useful’ content for social scientific analysis
The COSMOS Platform
The COSMOS Platform I–VIII
[Eight slides with no transcribed text]
Crime-Sensing Case Study
Key Literature I
• The interoperability afforded by COSMOS through spatial linkage enables us to identify associations between online and offline phenomena
• Social media is already being used as a preferential means of updating the public about crime in the US and Europe (Johnson 2012, Crookes 2010, Danef 2012, Philips 2011, Rawlinson 2012)
• Allowing the reporting of emergencies on Twitter is being considered in the UK
• A near ten-fold rise in crime related communication in 2012 (Warrell 2012)
• Behaviour in virtual worlds can map onto real world phenomena (Williams 2010)
Key Literature II
• Social and computational researchers have already begun to ‘repurpose’ social media data in their ‘predictive’ efforts
• Tumasjan et al. (2010) measured Twitter sentiment in relation to candidates in the German general election, concluding that this source of data was as accurate at predicting voting patterns as polls
• Asur & Huberman (2010) correlated frequency and sentiment related to movies on Twitter with their revenue, claiming that this method of prediction was more accurate than the Hollywood Stock Exchange
• Sakaki et al. (2010) found that the analysis of Twitter data produced estimates of the centres of earthquakes more accurately than conventional methods
• These studies illustrate how social media generates naturally occurring data that can be used to complement and augment conventional curated and administrative data
Key Literature III
• Another notable example is the association of social media and crime, such as the riots during August 2011 (Procter et al. 2013a)
• Malleson and Andresen (2014) use Twitter to estimate changing population densities as an alternative to the Census for identification of violent crime hotspots
• Gerber (2014) looks at the relationship in the US between reported crime and the prevalence of multiple topics on Twitter
Research Question
• Can crime and disorder related content on Twitter enhance our understanding of and our ability to predict crime patterns?
• If so, is Twitter content a better predictor of certain major crime types than others?
• Can this form of data be used as an alternative measure of feelings of insecurity in local communities?
Data & Sampling
• Comparative case study of London and Cardiff (this presentation focuses on London):
– Recorded crime (lat/long, HO crime category), split by month Aug 2013 to Aug 2014
– Collecting 100% of geotagged UK Tweets (approx. 500k per day)
– Census data including ethnic composition, educational attainment, employment, income, health, religiosity (ONS API)
NOTE: COSMOS archive contains all UK tweets since Sept 2011 (not all of which are geotagged) but potential for identification of higher (mundane) geographies…
Sensing Crime & Disorder I
• We need to identify tweets in our sample that relate to signatures of crime and disorder using key-word detection of ordinary language
• 500K tweets a day means that it is unfeasible to do this manually
• Develop machine classifier to identify tweets referencing crime and disorder
• References to anxiety, environmental deterioration, anti-social behaviour, night-time establishments etc.
• Use crowd-sourcing and human coders to develop a lexicon and algorithm…
Sensing Crime & Disorder II
INPUT: all UK geocoded tweets
Reduce sample of UK Tweets to London & Cardiff
Take random subsample (every nth tweet) and send for crowd-sourced human coding
Human coders identify tweets that contain (and do not contain) crime/disorder terms
Use 50% of human-annotated dataset to train classifier through machine learning
Validate classifier using remaining 50% of dataset (test precision and recall)
Run classifier over whole London and Cardiff dataset
OUTPUT: all London and Cardiff tweets with crime/disorder flag
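The workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say which learning algorithm COSMOS used, so a small naive Bayes classifier stands in for it, and the example tweets and every function name (`every_nth`, `train_naive_bayes`, `classify`, `precision_recall`) are hypothetical.

```python
import math
import random
from collections import Counter

def every_nth(items, n):
    """Systematic subsample: keep every nth item, starting with the first."""
    return items[::n]

def tokens(text):
    return text.lower().split()

def train_naive_bayes(labelled):
    """labelled: (text, flag) pairs from human coders; flag True = crime/disorder."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, flag in labelled:
        doc_counts[flag] += 1
        word_counts[flag].update(tokens(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return True when the crime/disorder class has the higher log score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for flag in (True, False):
        # Add-one smoothing on the prior and on word likelihoods.
        score = math.log((doc_counts[flag] + 1) / (total_docs + 2))
        denom = sum(word_counts[flag].values()) + len(vocab)
        for word in tokens(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[flag][word] + 1) / denom)
        scores[flag] = score
    return scores[True] > scores[False]

def precision_recall(model, labelled):
    """Precision and recall of the crime/disorder flag on held-out data."""
    tp = fp = fn = 0
    for text, flag in labelled:
        predicted = classify(model, text)
        tp += predicted and flag
        fp += predicted and not flag
        fn += (not predicted) and flag
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical human-coded sample (invented text, not project data).
coded = [
    ("someone smashed the bus stop window again", True),
    ("lovely coffee by the river this morning", False),
    ("gang of kids throwing bottles outside the pub", True),
    ("cannot wait for the match tonight", False),
    ("car broken into on our street last night", True),
    ("new bakery on the high street is great", False),
    ("graffiti all over the station again", True),
    ("train delayed again this morning", False),
]
random.Random(0).shuffle(coded)
half = len(coded) // 2
model = train_naive_bayes(coded[:half])               # 50% for training
precision, recall = precision_recall(model, coded[half:])  # 50% for validation
```

With a sample this small the precision and recall figures are not meaningful; the point is the shape of the process: subsample, human-code, split 50/50, train, validate, then run `classify` over the full London and Cardiff set.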
Exploring Relationships
• Simple correlation between tweets about crime/disorder and occurrence of recorded crime is too simplistic
• At what spatial and temporal level can social media be used to inform operational decision making?
• At what spatial and temporal level do we try to match tweets and crime?
• How to integrate existing curated data?
Exploring Relationships - Time I
• Certain variables are fixed (e.g. socio-economic characteristics of areas)
• Crime and tweets are dynamic (changing by the second!)
• Investigate relationship between tweets and crime/disorder at different levels of time:
– Annual?
– Monthly?
– Days of the week?
– Time of day?
– Seasonal variations in crime type?
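As a minimal sketch of what matching at different temporal levels involves, the same timestamped events (tweets or recorded crimes) can be counted per month, per day of the week, or per hour of the day. The function names and the three example timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def time_buckets(timestamp):
    """Return the bucket one event falls into at each temporal level."""
    return {
        "month": f"{timestamp.year}-{timestamp.month:02d}",
        "day_of_week": DAY_NAMES[timestamp.weekday()],
        "hour_of_day": timestamp.hour,
    }

def count_by(timestamps, level):
    """Count events per bucket at the chosen level."""
    return Counter(time_buckets(t)[level] for t in timestamps)

# Invented timestamps for illustration only.
events = [
    datetime(2013, 6, 1, 23, 15),
    datetime(2013, 6, 1, 23, 50),
    datetime(2013, 6, 3, 9, 5),
]
by_hour = count_by(events, "hour_of_day")
by_day = count_by(events, "day_of_week")
by_month = count_by(events, "month")
```

Running the same aggregation over tweets and over crimes gives two count series per level that can then be compared.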
Exploring Relationships – Time II
Simple frequency of reported crime commencement time varies depending on time of day (June 2013 data)…
Exploring Relationships – Time III
Type of crime, as proportion of all crime, differs by time of day…
Exploring Relationships – Time IV
Greater variability across time of day for some crimes than others (June 2013 data)…
Exploring Relationships – Time V
• Clearly time of day is important
• More tweets during daytime might mean that we can more accurately predict daytime crime
• Likely that Twitter data is better for predicting some crime types than others (explicit and hidden)
• How to account for ‘lag’ e.g. ‘the house down the road was burgled last night’
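One simple way to probe the lag problem is to correlate crime counts in period t with tweet counts in period t + lag and see which lag gives the strongest association. The sketch below assumes equal-length series of counts per period; the function names and the two series are invented for illustration and are not project data.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def lagged_correlation(tweets, crimes, lag):
    """Correlate crimes at period t with tweets at period t + lag (lag >= 0)."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    return pearson(crimes[:len(crimes) - lag], tweets[lag:])

# Invented daily counts: tweets echo the previous day's crimes.
crimes = [2, 5, 1, 7, 3, 6, 2]
tweets = [1, 2, 5, 1, 7, 3, 6]

correlations = {lag: lagged_correlation(tweets, crimes, lag) for lag in range(4)}
best_lag = max(correlations, key=correlations.get)
```

Here the tweet series was built to trail crime by one day, so `best_lag` comes out as 1. Real data would be noisier, and the best lag may differ by crime type.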
Exploring Relationships – Space I
• Size of London results in huge internal variance in crime type and rates
• Crime and tweets are point data that can be located in any geography (from OA to LA)
• Investigate relationship between tweets and crime/disorder at different levels of space:
– City wide
– Boroughs
– Wards
Exploring Relationships – Space II
Borough level geography is too high, variance largely due to population density (plus commuter and tourism movement)
Exploring Relationships – Space II
Knightsbridge & Hyde Park
Soho & Covent Garden
Exploring Relationships – Space III
Red = >14% ‘never worked’ or ‘long-term unemployed’
Dark Green = <5% ‘never worked’ or ‘long-term unemployed’
Exploring Relationships – Space IV
• Commuter and tourism patterns matter, although more people = more crimes = more tweets?
• Reduction in social media use for those living in deprived areas? Less likely to tweet about crime despite being more likely to know about it?
• Could go down to OA, but number of tweets and reported crimes per case/unit is cut
Modeling Strategy I
• A ward-based example:
– One month of data
– Treated as cross-sectional
– Crime and tweets aggregated over month
– Single time point allows inclusion of ward characteristics
– One ward = single case
• Use existing known predictors of crime to specify model, measure success
• Add tweet data to model and see if ‘prediction’ rate is significantly higher
• i.e. does social media data [x] enable better explanation of variance in crime rates [y]?
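The comparison described above can be illustrated with two nested regressions: a baseline using a known predictor, and a second model that adds the tweet count. This sketch assumes ordinary least squares on a ward-level crime rate purely for illustration (the slides do not fix the estimator), and the six wards and their values are invented.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def r_squared(rows, y):
    """Fit OLS with an intercept via the normal equations; return R^2."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Invented wards: (unemployment %, crime/disorder tweets, crime rate).
wards = [
    (4.0, 12, 30.0), (6.0, 30, 48.0), (9.0, 18, 52.0),
    (12.0, 45, 80.0), (14.0, 25, 75.0), (7.0, 40, 60.0),
]
y = [w[2] for w in wards]
r2_base = r_squared([[w[0]] for w in wards], y)        # known predictor only
r2_full = r_squared([[w[0], w[1]] for w in wards], y)  # plus tweet counts

# F statistic for the one added predictor (q = 1); n = 6 wards, k = 3 parameters.
n, k, q = len(wards), 3, 1
f_stat = ((r2_full - r2_base) / q) / ((1 - r2_full) / (n - k))
```

R² can never fall when a predictor is added, which is why the slide asks whether the improvement is significant; `f_stat` would be compared against an F distribution with (q, n − k) degrees of freedom.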
Modeling Strategy II
• Simple logit model ignores temporal order and spatial data
• Fixed effects model would account for changes over time (but fixed factors such as ward demographics would be excluded)
• Random effects model would enable inclusion of non-time variant predictors (but stringent assumptions)
• Spatial point data allows us to take into account spatial correlation (kernel density estimation?)
• Multilevel model would account for both ward and borough level variance
Modeling Strategy III
• Could control for time and space through dummy variables
• p-values and standard errors can be poorly estimated for dummy variables in single level models (Snijders & Bosker 2012)
• Not feasible to have a dummy variable for every hour of the day
• Suggested way forward:
– Test for spatial variance (MLM)
• Ward and borough level
– Test for temporal variance (FE/RE)
• Time of day, day of week and month
• If amount of spatial and temporal variance is significant then it must be accounted for in a multi-level longitudinal model (Yu et al. 2010)
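One possible form of such a multi-level longitudinal model, written as a three-level random-intercept specification, is sketched below. This is an assumed illustration, not the project's stated model: the symbols are introduced here, with $y_{tij}$ the crime measure at time $t$ in ward $i$ of borough $j$, $x_{tij}$ the crime/disorder tweet count, and $\mathbf{z}_{ij}$ the fixed ward characteristics.

```latex
\begin{align}
  y_{tij} &= \beta_0 + \beta_1 x_{tij} + \boldsymbol{\beta}_2^{\top}\mathbf{z}_{ij}
             + v_j + u_{ij} + \varepsilon_{tij}, \\
  v_j &\sim N(0, \sigma_v^2), \qquad
  u_{ij} \sim N(0, \sigma_u^2), \qquad
  \varepsilon_{tij} \sim N(0, \sigma_\varepsilon^2), \\
  \rho_{\text{borough}} &= \frac{\sigma_v^2}{\sigma_v^2 + \sigma_u^2 + \sigma_\varepsilon^2},
  \qquad
  \rho_{\text{ward}} = \frac{\sigma_u^2}{\sigma_v^2 + \sigma_u^2 + \sigma_\varepsilon^2}.
\end{align}
```

The two variance partition coefficients give the share of unexplained variance sitting at borough and ward level, which is one way to read the "test for spatial variance" step; the remaining share is variation over time within wards.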
Methodological Considerations
• Asynchronous relationship between tweeting about crime/disorder and experiencing/witnessing it?
• Commencement and finish time of a crime are rarely the same (e.g. events)
• Difference between when something happened and when it was reported
Discussion
Deriving Demographics
• Development of tools embedded within the COSMOS platform to identify signatures of demographic characteristics (Sloan et al. 2013)
• Location
• Gender
• Language
Deriving Demographics: Location I
• Three primary sources of location:
– User profile information
– Content of tweets (inc. ‘mundane geography’)
– Geo-tagged tweets
• Geo-tagged tweets are the gold standard
• Allows us to locate people at the time they tweeted in existing geographies (output area level!)
• RQ: do people tweet about crime in high crime areas?
Deriving Demographics: Location II
Deriving Demographics: Language
• Two methods of identifying language:
– The language of the Twitter interface
– The language of the Tweet (Java library)
• Detecting language is important for efficiency of other analytical tools (e.g. sentiment analysis)
• 40% of content is in English
• RQ: do spatial patterns of language use recorded on the Census correspond with Twitter maps?
• Note that ‘hard to reach groups’ may use Twitter…
Deriving Demographics: Gender I
• Use the name field of the Twitter profile
• Clean the data to extract a first name and compare against a large database of first names
• Important to categorise ‘unisex’ and ‘unknown’
• Of those we could identify: 48.8% male and 51.2% female… exactly the same as the 2011 Census
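The name-based approach above can be sketched as follows. The lookup table here is a tiny hypothetical stand-in for the large first-name database the slide mentions, and the cleaning rules (dropping punctuation, digits and a few titles) are assumptions rather than the method actually used in COSMOS.

```python
import re

# Hypothetical miniature lookup; a real database would hold many thousands of names.
NAME_GENDER = {
    "luke": "male",
    "matthew": "male",
    "sarah": "female",
    "jessica": "female",
    "alex": "unisex",
}
TITLES = {"dr", "mr", "mrs", "ms", "miss", "prof"}

def first_name(name_field):
    """Strip symbols and digits, skip titles, return the first remaining token."""
    cleaned = re.sub(r"[\W\d_]+", " ", name_field.lower())
    for token in cleaned.split():
        if token not in TITLES:
            return token
    return None

def infer_gender(name_field):
    """Return 'male', 'female', 'unisex' or 'unknown' for a profile name."""
    return NAME_GENDER.get(first_name(name_field), "unknown")

def gender_split(name_fields):
    """Share of male and female among profiles that could be identified."""
    labels = [infer_gender(n) for n in name_fields]
    identified = [g for g in labels if g in ("male", "female")]
    if not identified:
        return {}
    return {g: identified.count(g) / len(identified) for g in ("male", "female")}

profiles = ["Dr Luke Sloan", "★ Sarah ★", "alex99", "12345", "Jessica_T"]
split = gender_split(profiles)
```

Keeping ‘unisex’ and ‘unknown’ as explicit categories means the male/female split is computed only over names that could be identified, as on the slide.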
Deriving Demographics: Gender II
[Chart annotations: Farah start; Farah to 3rd; Ennis medal]
RQ: How does sentiment towards Team GB differ by gender?
Findings:
1) Sentiment peaks reflect real world events (relationship between social media and real world)
2) Sentiment differs between men and women (difference is so pronounced that gender detection method appears to work)
Current Work
• Identifying age from signature data:
– Preliminary analysis suggests usable age data for 0.35% of Twitter users
– Note that 0.35% of 645m is 2.25m (approx. 40% of which is English language)
• Identify occupation from signature data:
– Linked to SOC2010 codes
– Enables allocation into NS-SEC groups