big data visualization
Presentation given at Innovative Approaches to Turn Statistics into Knowledge, 2 - 4 December 2013, Aguascalientes, Mexico
Edwin de Jonge, December 3, 2013
Big Data Visualization
“Turning Statistics into Knowledge”, Aguascalientes
With thanks to Piet Daas, Martijn Tennekes and Alex Priem
Overview
• Big Data
  • Research ‘theme’ at Statistics Netherlands
  • Data-driven approach
• Visualization as a tool
  • Why?
  • Examples in our office
    • Census
    • Social Security
    • Social Media
    • Not shown: traffic loops, mobile phone data
Why Visualization?
October 1st 2013, Statistics Netherlands
Effective Display!
(see Tor Nørretranders, “Bandwidth of our senses”)
Anscombe’s quartet…
DS1        DS2        DS3        DS4
x   y      x   y      x   y      x   y
10  8.04   10  9.14   10  7.46   8   6.58
8   6.95   8   8.14   8   6.77   8   5.76
13  7.58   13  8.74   13  12.74  8   7.71
9   8.81   9   8.77   9   7.11   8   8.84
11  8.33   11  9.26   11  7.81   8   8.47
14  9.96   14  8.1    14  8.84   8   7.04
6   7.24   6   6.13   6   6.08   8   5.25
4   4.26   4   3.1    4   5.39   19  12.5
12  10.84  12  9.13   12  8.15   8   5.56
7   4.82   7   7.26   7   6.42   8   7.91
5   5.68   5   4.74   5   5.73   8   6.89
Anscombe’s quartet
Property                                   Value
Mean of x1, x2, x3, x4                     All equal: 9
Variance of x1, x2, x3, x4                 All equal: 11
Mean of y1, y2, y3, y4                     All equal: 7.50
Variance of y1, y2, y3, y4                 All equal: 4.1
Correlation for ds1, ds2, ds3, ds4         All equal: 0.816
Linear regression for ds1, ds2, ds3, ds4   All equal: y = 3.00 + 0.500x
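The summary values in the property table can be checked directly from the four data sets. The following is a minimal sketch using only the Python standard library; the names `QUARTET`, `summarize` and `SUMMARY` are illustrative and not part of the presentation, and the variance is assumed to be the sample variance (divisor n - 1).

```python
from statistics import mean

# Anscombe's quartet, copied from the data table in this section.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x values shared by DS1, DS2, DS3
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]        # x values of DS4
QUARTET = {
    "DS1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "DS2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]),
    "DS3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "DS4": (X4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return mean, sample variance, correlation and least-squares line for one data set."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)                    # sum of squares of x
    syy = sum((b - my) ** 2 for b in y)                    # sum of squares of y
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))   # cross products
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": sxx / (n - 1),
        "mean_y": my,
        "var_y": syy / (n - 1),
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }


SUMMARY = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

if __name__ == "__main__":
    for name, stats in SUMMARY.items():
        print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four data sets agree on every statistic to the precision shown in the table, which is why the summary alone cannot tell them apart.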
Looks the same, right?
Let’s plot!
Visualization
For Big Data:
Use appropriate:
- Summarization
- Granularity
- Noise filtering
Research: What works for big data?
Scatter plot with 100 data points
Scatter plot with 100 000 data points
Example 1: Census
Example Virtual Census
‐ Every 10 years a Census needs to be conducted
‐ No longer with surveys in the Netherlands
  • Last traditional census was in 1971
‐ Now by (re-)using existing information
  • Linking administrative sources and available sample survey data at a large scale
  • Check result
  • How?
  • With a visualisation method: the Tableplot
Making the Tableplot
1. Load file (17 million records)
2. Sort records according to key variable (17 million records)
   • Age in this example
3. Combine records into 100 groups (170,000 records each)
   • Numeric variables: calculate average (avg. age)
   • Categorical variables: ratio between categories present (male vs. female)
4. Plot figure of a select number of variables (up to 12)
   • Colours used are important
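Steps 2 and 3 above (sort, cut into 100 groups, summarise each group) can be sketched in a few lines. This is an illustrative sketch, not the code used at Statistics Netherlands: the function name `tableplot_bins`, the record layout and the synthetic 10,000-record test file are all assumptions, and the sort direction (ascending) is chosen arbitrarily.

```python
import random


def tableplot_bins(records, sort_key, numeric, categorical, n_bins=100):
    """Sort records on sort_key, cut them into n_bins groups and summarise each group."""
    ordered = sorted(records, key=lambda r: r[sort_key])
    n = len(ordered)
    bins = []
    for b in range(n_bins):
        group = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if not group:
            continue  # happens only when there are fewer records than bins
        row = {"size": len(group)}
        for var in numeric:
            # Numeric variable: average within the group.
            row[var] = sum(r[var] for r in group) / len(group)
        for var in categorical:
            # Categorical variable: share of each category within the group.
            counts = {}
            for r in group:
                counts[r[var]] = counts.get(r[var], 0) + 1
            row[var] = {cat: c / len(group) for cat, c in counts.items()}
        bins.append(row)
    return bins


# Synthetic stand-in for the census file (hypothetical values, fixed seed).
random.seed(1)
records = [
    {"age": random.randint(0, 99), "sex": random.choice(["male", "female"])}
    for _ in range(10_000)
]
bins = tableplot_bins(records, "age", numeric=["age"], categorical=["sex"])
```

Each entry of `bins` corresponds to one row of the plot: a bar length for the numeric variable and a stacked bar for the categorical one. Step 4, the drawing itself, is left out here.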
Tableplot of the census test file
Tableplot: Monitor data quality
– All data in the office passes through stages:
‐ Raw data (collected)
‐ Preprocessed (technically correct)
‐ Edited (completed data)
‐ Final (removal of outliers etc.)
Processing of data (figure panels): raw (unedited) data, edited data, final data
Example 2 : Social Security Register
Social Security Register
– Contains all financial data on jobs, benefits and pensions in the Netherlands
‐ Collected by the Dutch Tax office
‐ A total of 20 million records each month
‐ How to obtain insight into so much data?
  • With a visualisation method: a heat map
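A heat map of this kind rests on counting records per cell of a two-dimensional grid, so that 20 million points reduce to a few thousand counts that can be coloured. The sketch below illustrates that counting step only; the function name `heatmap_counts`, the bin widths (1 year, 1,000 euro) and the synthetic age and income values are assumptions for illustration, not the register's actual data.

```python
import random
from collections import Counter


def heatmap_counts(pairs, x_width, y_width):
    """Count (x, y) pairs per grid cell; a cell is identified by its integer bin indices."""
    counts = Counter()
    for x, y in pairs:
        cell = (int(x // x_width), int(y // y_width))
        counts[cell] += 1
    return counts


# Synthetic (age, income) pairs with a fixed seed; purely hypothetical values.
random.seed(3)
pairs = [(random.uniform(15, 90), random.uniform(0, 60_000)) for _ in range(50_000)]

# One-year age bins and 1,000-euro income bins (assumed widths).
cells = heatmap_counts(pairs, x_width=1, y_width=1_000)
```

The number of cells is bounded by the grid size and not by the number of records, which is what keeps the display readable as the data grow. The cell counts would then be mapped to a colour scale for plotting.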
Heat map: Age vs. ‘Income’
[Figure: heat map. Axis labels: Age, Income (euro). Legend: amount]
Example 3: Social media
Daily Sentiment in Dutch Social Media
Social media: daily sentiment in Dutch messages
Granularity: From day to week
Social media: daily & weekly sentiment in Dutch messages
Granularity: From day to month
Social media: daily, weekly & monthly sentiment in Dutch messages
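Changing the granularity amounts to averaging the daily values within each week or month. A minimal sketch follows, assuming a simple mean per period and a synthetic daily series; the names `aggregate`, `week_of` and `month_of` are illustrative, and the real sentiment data are not reproduced here.

```python
import random
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def aggregate(daily, period_of):
    """Average daily values per period; period_of maps a date to a period key."""
    groups = defaultdict(list)
    for day, value in daily.items():
        groups[period_of(day)].append(value)
    return {period: mean(values) for period, values in sorted(groups.items())}


def week_of(day):
    """ISO year and ISO week number of a date."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


def month_of(day):
    """Calendar year and month of a date."""
    return (day.year, day.month)


# Synthetic daily series for 2012: a slow upward trend plus day-to-day noise.
random.seed(2)
start = date(2012, 1, 1)
daily = {start + timedelta(days=i): i / 100 + random.gauss(0, 1) for i in range(366)}

weekly = aggregate(daily, week_of)
monthly = aggregate(daily, month_of)
```

Coarser periods average out more of the day-to-day noise, so the slow trend becomes easier to see at the cost of losing short-lived events.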
Enter: Consumer confidence!
Social media: monthly sentiment in Dutch messages & Consumer confidence
Corr: 0.88
Conclusions
Big data is a very interesting data source for official statistics
Visualisation is a great way of getting/creating insight
Not only for data exploration, but also for finding errors
The future of statistics?