Data Service Centre and Apache Spark at Statistics Netherlands

Upload: voanh

Post on 28-Jun-2019

Statistical process and DSC

[Diagram. Labels: DSC; Website, Statline, Open data, Articles, Books; Microdata services; RIN (four times).]

DSC not a traditional data warehouse

• Technical backend: document management system Documentum (OpenText)
• Only statistical data that can be stored in rows and columns (no documents, images, etc.)
• Data stored as text files (CSV, fixed-width): future-proof
• Primary focus was archiving, but now increasingly on data exchange
• Retrieve and process data in SPSS, R, Python, custom-built systems
• Almost 14,000 datasets, mostly microdata
• Covers all domains: social statistics, business statistics, national accounts, health statistics, energy statistics, agricultural statistics, etc.
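
Because DSC datasets are plain text (CSV or fixed-width), they can be read with very little tooling. The sketch below shows one way this might look in Python using only the standard library. The column layout, the `;` delimiter and the sample records are assumptions made for illustration; the slides do not describe the actual DSC file layouts.

```python
import csv
import io

# Hypothetical fixed-width layout: (field name, start offset, end offset).
# The real DSC layouts are not given in the slides.
LAYOUT = [("RINPERSOON", 0, 9), ("GEM", 9, 13), ("SBASISLOON", 13, 21)]


def parse_fixed_width(text, layout):
    """Split each non-empty line into named fields by character position."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        rows.append({name: line[start:end].strip() for name, start, end in layout})
    return rows


def parse_csv(text, delimiter=";"):
    """Read delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Made-up sample records, for demonstration only.
FIXED_SAMPLE = "000000001041700002500\n000000002037600003100\n"
CSV_SAMPLE = "RINPERSOON;GEM;SBASISLOON\n000000001;0417;2500\n"

fixed_rows = parse_fixed_width(FIXED_SAMPLE, LAYOUT)
csv_rows = parse_csv(CSV_SAMPLE)
print(fixed_rows[0])
print(csv_rows[0])
```

All values are kept as strings here, which preserves leading zeros in identifiers and codes; converting numeric fields is left to the caller.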

DSC Catalogue

System of Social statistical Datasets (SSD)

• Subset of data in DSC
• Highly coordinated
• Mostly based on administrative sources, some surveys
• ‘Backbones’ (persons, buildings, households, companies)
• Linkable datasets
• Widely used for statistical production and research: longitudinal, small groups, intergenerational, networks
• SSD tool set on top of DSC
• https://www.cbs.nl/NR/rdonlyres/98BFF618-D7A7-4897-85D6-6293CFB8EA75/0/systemofsocialstatisticaldatasets.pdf

Proof of concept ‘Data lake’

[Diagram. Data sources: DSC, Raw data, Big data, Other SN data, Other data. Layer between sources and users: Data virtualisation (Denodo). Users: four, spread over Statistics Netherlands and the ‘outside’. Supporting labels: Metadata, Governance, Governance+, Organisation.]

BIG DATA is of all times

ca. 1981–1975 B.C.

1888 Hackathon US Census Bureau

Contest: the person who could process and tabulate the data fastest would earn a contract for the 1890 Census.

Process:
• Participant A: 144 hrs
• Participant B: 100 hrs
• Participant C: 72 hrs

Tabulate:
• Participant A: 44 hrs
• Participant B: 55 hrs
• Participant C: 5 hrs

Herman Hollerith

1896 Tabulating Machine Company

1911 Computing-Tabulating-Recording Company

1924 International Business Machines Corporation

1908

2018: DSC contains about 14 thousand datasets (≈5 TB). Retrieving and processing data should go faster.

Can we build a tabulating machine based on contemporary technology?

Apache SPARK

Test case

[Architecture diagram. Labels: DSC, Authentication, SPARK, Spark programming (PySpark), Data control, Authorisation control, meta data.]

User story

After a CBS press release about average capital per municipality*, a journalist asks whether the top 10 would be the same when one looks at average wage per municipality.

Top 10 average capital per municipality, 2016

• Laren (NH.)
• Blaricum
• Bloemendaal
• Wassenaar
• Rozendaal
• Heemstede
• Bergen (NH.)
• Alphen-Chaam
• De Bilt
• Westvoorne

*https://www.cbs.nl/nl-nl/nieuws/2018/06/vermogen-huishoudens-bijna-10-procent-hoger-in-2016

DSC Datasets

SPOLIS2015: all jobs in NL in 2015
• Variables: SBASISLOON (wage), SREGULIEREUREN (hours)
• Filter: SDATUMAANVANGIKO >= 20150101 and SDATUMAANVANGIKO <= 20150131

GBAADRESOBJECT2015: all addresses 2015
• Variables: -
• Filter: GBADATUMAANVANGADRESHUISHOUDING <= 20150101 and GBADATUMEINDEADRESHUISHOUDING >= 20150101

VSLGWB2015: municipality-district-neighbourhood code of all addresses
• Variables: GEM, derived from GWBCODE2016 [1-4]

Links:
• SPOLIS2015 and GBAADRESOBJECT2015, link by: RINPERSOONS, RINPERSOON
• GBAADRESOBJECT2015 and VSLGWB2015, link by: SOORTOBJECTNUMMER, RINOBJECTNUMMER

Dataset sizes as listed on the slide: 10 mln records, 1.74 GB; 61 mln records, 3.45 GB; 110 mln records, 68.76 GB

Aggregate on GEM (MUN)

UURLOON (HOURLYWAGE) = Sum(SBASISLOON) / Sum(SREGULIEREUREN)
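
The filter, link and aggregate steps on this slide can be sketched in a few lines. The test case itself was programmed in PySpark; the version below uses plain Python dictionaries instead so that it runs without a cluster, and is only meant to show the logic. Several things are assumptions: dates are treated as yyyymmdd integers, `[1-4]` is read as the first four characters of GWBCODE2016, each person is assumed to have one address on the reference date, and all sample records and the function name `hourly_wage_per_municipality` are made up.

```python
from collections import defaultdict


def hourly_wage_per_municipality(jobs, addresses, regions):
    """Return {GEM: Sum(SBASISLOON) / Sum(SREGULIEREUREN)} for linked records."""
    # Person -> address object, keeping only addresses valid on 1 January 2015.
    person_to_object = {
        (a["RINPERSOONS"], a["RINPERSOON"]): (a["SOORTOBJECTNUMMER"], a["RINOBJECTNUMMER"])
        for a in addresses
        if a["GBADATUMAANVANGADRESHUISHOUDING"] <= 20150101
        and a["GBADATUMEINDEADRESHUISHOUDING"] >= 20150101
    }
    # Address object -> municipality code (assumed: first four characters).
    object_to_gem = {
        (r["SOORTOBJECTNUMMER"], r["RINOBJECTNUMMER"]): r["GWBCODE2016"][:4]
        for r in regions
    }
    wage = defaultdict(float)
    hours = defaultdict(float)
    for j in jobs:
        # Filter: income period starting in January 2015.
        if not 20150101 <= j["SDATUMAANVANGIKO"] <= 20150131:
            continue
        obj = person_to_object.get((j["RINPERSOONS"], j["RINPERSOON"]))
        gem = object_to_gem.get(obj)
        if gem is None:  # no address or no region code: record is dropped
            continue
        wage[gem] += j["SBASISLOON"]
        hours[gem] += j["SREGULIEREUREN"]
    return {gem: wage[gem] / hours[gem] for gem in wage if hours[gem] > 0}


# Made-up sample records, for demonstration only.
JOBS = [
    {"RINPERSOONS": "R", "RINPERSOON": "1", "SDATUMAANVANGIKO": 20150101, "SBASISLOON": 3000, "SREGULIEREUREN": 150},
    {"RINPERSOONS": "R", "RINPERSOON": "2", "SDATUMAANVANGIKO": 20150115, "SBASISLOON": 1000, "SREGULIEREUREN": 50},
    {"RINPERSOONS": "R", "RINPERSOON": "3", "SDATUMAANVANGIKO": 20150101, "SBASISLOON": 2400, "SREGULIEREUREN": 160},
    {"RINPERSOONS": "R", "RINPERSOON": "1", "SDATUMAANVANGIKO": 20150201, "SBASISLOON": 9999, "SREGULIEREUREN": 1},
]
ADDRESSES = [
    {"RINPERSOONS": "R", "RINPERSOON": "1", "SOORTOBJECTNUMMER": "B", "RINOBJECTNUMMER": "10",
     "GBADATUMAANVANGADRESHUISHOUDING": 20100101, "GBADATUMEINDEADRESHUISHOUDING": 88881231},
    {"RINPERSOONS": "R", "RINPERSOON": "2", "SOORTOBJECTNUMMER": "B", "RINOBJECTNUMMER": "11",
     "GBADATUMAANVANGADRESHUISHOUDING": 20100101, "GBADATUMEINDEADRESHUISHOUDING": 88881231},
    {"RINPERSOONS": "R", "RINPERSOON": "3", "SOORTOBJECTNUMMER": "B", "RINOBJECTNUMMER": "99",
     "GBADATUMAANVANGADRESHUISHOUDING": 20000101, "GBADATUMEINDEADRESHUISHOUDING": 20141231},
    {"RINPERSOONS": "R", "RINPERSOON": "3", "SOORTOBJECTNUMMER": "B", "RINOBJECTNUMMER": "20",
     "GBADATUMAANVANGADRESHUISHOUDING": 20150101, "GBADATUMEINDEADRESHUISHOUDING": 88881231},
]
REGIONS = [
    {"SOORTOBJECTNUMMER": "B", "RINOBJECTNUMMER": "10", "GWBCODE2016": "04170101"},
    {"SOORTOBJECTNUMMER": "B", "RINOBJECTNUMMER": "11", "GWBCODE2016": "04170203"},
    {"SOORTOBJECTNUMMER": "B", "RINOBJECTNUMMER": "20", "GWBCODE2016": "03760001"},
]

result = hourly_wage_per_municipality(JOBS, ADDRESSES, REGIONS)
print(result)
```

Note that the hourly wage is the ratio of the two sums per municipality, as on the slide, not the mean of per-job hourly wages; the two generally differ when jobs have different numbers of hours.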

User interface

Processing time of the syntax on the Spark cluster: approx. 1 minute

Other advantages:
- Open source
- Modern tool set
- Syntax based
- Sharing code
- Visualisations
- Commonly used, documentation

Disclaimer: the data shown are for demo purposes only; they are not official outcomes.