mobility analysis from twitter data ntts 2015 - satellite workshop on big data

Post on 24-Dec-2015


Mobility analysis from Twitter data

NTTS 2015 - satellite Workshop on Big Data

Twitter as data source

NoSQL Database

Filter by: Geo-referenced Only

México

Real-time Tweets

INEGI

Twitter

Why Twitter?

• Availability
• 1% of tweets available without cost
• Around 12 M accounts in Mexico
• 700,000 accounts are geo-referenced
• Collection of 150 M tweets since January 2014

Devices generating tweets in Mexico

[Chart labels: Android, iPhone]

Tweet collection infrastructure

Unix “Red Hat”

NoSQL Database “Elasticsearch”

Cluster (Hydra)

Big Data Layers

Proof of Concept

General Process

Every Day Collection

Store Geo-Referenced Tweets

15M

?

Set an Objective

Filter and Process

Generate outputs
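The "Filter and Process" step can be sketched as follows. This is a minimal illustration, not INEGI's actual pipeline: it assumes each tweet is a dictionary whose `coordinates` field holds a GeoJSON-style point with `[longitude, latitude]`, and it uses a rough, hypothetical bounding box for Mexico. The names `MEXICO_BBOX`, `point_of`, `in_bbox` and `georeferenced` are invented for this example.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Assumption: approximate bounding box for Mexico as
# (lon_min, lat_min, lon_max, lat_max). Illustrative only.
MEXICO_BBOX = (-118.5, 14.5, -86.5, 32.8)


def point_of(tweet: dict) -> Optional[Tuple[float, float]]:
    """Return (lon, lat) if the tweet carries an exact point, else None."""
    geo = tweet.get("coordinates")
    if not geo or geo.get("type") != "Point":
        return None
    lon, lat = geo["coordinates"]
    return float(lon), float(lat)


def in_bbox(lon: float, lat: float, bbox=MEXICO_BBOX) -> bool:
    """True when the point falls inside the bounding box."""
    lon_min, lat_min, lon_max, lat_max = bbox
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max


def georeferenced(tweets: Iterable[dict]) -> Iterator[dict]:
    """Yield only tweets with a point located inside the bounding box."""
    for tweet in tweets:
        point = point_of(tweet)
        if point is not None and in_bbox(*point):
            yield tweet


# Hypothetical sample records.
sample = [
    {"id": 1, "coordinates": {"type": "Point", "coordinates": [-99.13, 19.43]}},
    {"id": 2, "coordinates": None},
    {"id": 3, "coordinates": {"type": "Point", "coordinates": [2.35, 48.85]}},
]
kept = [t["id"] for t in georeferenced(sample)]
print(kept)
```

A bounding box is only a coarse filter; a production system would more likely test points against official boundary polygons.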

Topics

• Mobility
– Internal flows
– Tourism
– Border commuting
– National Roads Network: use of roads (planned)
– Urban influence zones (planned)

• Subjective wellness
– Based on text
– Based on emoticons

Geo-referenced Tweets 2014

DF

Internal mobility (from-to)

México State

To Mexico City

From Mexico City
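One way to derive from-to flows like those on this slide is an origin-destination count. The sketch below is an assumption about how such a table could be built, not the method used in the presentation: it takes a user's "home" state to be the state where they tweet most often, then counts distinct users seen in any other state. The function names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def home_state(states):
    """Most frequent state among a user's tweets (ties: first seen wins)."""
    return Counter(states).most_common(1)[0][0]


def od_flows(observations):
    """Count distinct users per (origin, destination) pair.

    observations: iterable of (user_id, state) pairs, one per tweet.
    """
    by_user = defaultdict(list)
    for user, state in observations:
        by_user[user].append(state)

    flows = Counter()
    for states in by_user.values():
        origin = home_state(states)
        # Every other state the user tweeted from counts as one trip.
        for destination in set(states) - {origin}:
            flows[(origin, destination)] += 1
    return flows


# Hypothetical sample: users a and c live in México State, b in DF.
observations = [
    ("a", "México State"), ("a", "México State"), ("a", "DF"),
    ("b", "DF"), ("b", "DF"), ("b", "Puebla"),
    ("c", "México State"), ("c", "México State"), ("c", "DF"),
]
flows = od_flows(observations)
for (origin, destination), users in sorted(flows.items()):
    print(origin, "->", destination, users)
```

Counting distinct users rather than tweets keeps a few very active accounts from dominating the flow estimates.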

Where do we go when tweeting?

Internal Tourism

Origin of tourists visiting Guanajuato (1-3 February 2014)

Internal Tourism

Origin of tourists visiting Puebla (1-3 February 2014)

Use of Twitter on long weekends

Displacements to Puebla and Guanajuato before, during and after the 1-3 February period
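A before/during/after comparison of this kind could be computed as sketched below. The sketch assumes each record already carries the user's home state (however it was estimated) and the state and day of the tweet; the record layout, function names and sample rows are hypothetical, and only the 1-3 February 2014 window comes from the slides.

```python
from collections import defaultdict
from datetime import date

# Long-weekend window taken from the slides.
PERIOD_START = date(2014, 2, 1)
PERIOD_END = date(2014, 2, 3)


def phase(day):
    """Label a day as before, during or after the long weekend."""
    if day < PERIOD_START:
        return "before"
    if day > PERIOD_END:
        return "after"
    return "during"


def visitors_by_phase(records, destination):
    """Count distinct non-resident users tweeting from `destination`.

    records: iterable of (user_id, home_state, tweet_state, day).
    """
    seen = defaultdict(set)
    for user, home, state, day in records:
        if state == destination and home != destination:
            seen[phase(day)].add(user)
    return {name: len(seen[name]) for name in ("before", "during", "after")}


# Hypothetical sample rows.
records = [
    ("a", "DF", "Puebla", date(2014, 1, 30)),
    ("b", "DF", "Puebla", date(2014, 2, 1)),
    ("b", "DF", "Puebla", date(2014, 2, 2)),
    ("c", "Jalisco", "Puebla", date(2014, 2, 3)),
    ("d", "Puebla", "Puebla", date(2014, 2, 2)),      # resident, excluded
    ("e", "DF", "Guanajuato", date(2014, 2, 2)),      # other destination
]
counts = visitors_by_phase(records, "Puebla")
print(counts)
```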

Border commuting

• México

• USA

National Roads Network

Urban Influence zones

Subjective Wellness

• Complement of existing survey
– Subjective perceived wellness (monthly)
• Two approaches
– Based on emoticons (possible international comparability)
  • Netherlands experiments
– Based on text (diversity of analysis, regionalisms)
• Text analysis infrastructure development
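The emoticon-based approach can be illustrated with a very small scoring sketch. The emoticon sets below are hypothetical examples chosen for this illustration, not the lists used by INEGI or in the Netherlands experiments, and the whitespace tokenization is a deliberate simplification.

```python
# Assumption: tiny illustrative emoticon sets, not an official lexicon.
POSITIVE = {":)", ":-)", ":D", ";)"}
NEGATIVE = {":(", ":-(", ":'("}


def emoticon_score(text):
    """Positive emoticons minus negative emoticons in one tweet."""
    tokens = text.split()
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


def positive_share(texts):
    """Share of positive tweets among tweets with a non-zero score.

    Returns None when no tweet carries a usable emoticon signal.
    """
    scores = [emoticon_score(text) for text in texts]
    signed = [score for score in scores if score != 0]
    if not signed:
        return None
    return sum(score > 0 for score in signed) / len(signed)


# Hypothetical sample tweets.
sample_texts = [
    "Good morning :)",
    "Traffic again :(",
    "No emoticon here",
    "Holidays :D :)",
]
print([emoticon_score(text) for text in sample_texts])
print(positive_share(sample_texts))
```

Because emoticons are largely independent of language, an indicator of this kind is one plausible route to the international comparability mentioned above.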

Methods and Tools

• Pioanalisis: tool for collection of the training set (crowdsourcing)
• Machine learning (supervised and unsupervised), Support Vector Machines, Incremental Learning
• Random forest, Latent Dirichlet Allocation (LDA)
• SOM Neural Networks (SOM: Self-Organizing Map)
• Classification methods: Naive Bayes, Support Vector Machines (SVM), KNN, Word Count
• Dictionaries: Spanish Emotion Lexicon (SEL), KNN, AFINN, WordNet, ANEW
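Of the classification methods listed, Naive Bayes is the simplest to show end to end. The sketch below is a generic multinomial Naive Bayes with add-one (Laplace) smoothing, written in plain Python for illustration; it is not the presenters' implementation, and the class name and the four-document training set are hypothetical stand-ins for a crowdsourced training set.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(docs, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled examples.
train_docs = [
    "happy great day",
    "love this sunny day",
    "sad terrible traffic",
    "hate this awful traffic",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("great sunny day"))
print(model.predict("terrible traffic"))
```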

Partnerships

• International
– UNECE
  • ICHEC
– UNSD
– LAMBDoop
– University of Pennsylvania
• National
– KioNetworks
  • Dattlas
– TecMilenio
– INFOTEC
– Centro Geo
– CIDE
– CIMAT
– Sectur
• Internal
– INEGI General Directorates

Conclusions

• We are in a discovery stage:
– Findings going from ‘interesting’ to ‘valuable’
• A lot of research needed:
– … but we are gaining a lot of knowledge and experience
• Partnerships are a must
• Combining other big data sources is an imminent next step
• New challenges and threats will appear
– Cost increases?
– Legal issues?
– Re-engineering of methodologies and quality frameworks?
– Evolution of traditional statistics?
• A lot of etcetera?

New statistics production landscape?

Conociendo México

01 800 111 46 34

www.inegi.org.mx

atencion.usuarios@inegi.org.mx

@inegi_informa INEGI Informa
