Big data now playing ..... at the sandbox
Overview
• Context
• How CSO got interested in big data
• The sandbox
• Learning from other industries
• Learning from the past
• The sandbox – looking to the future
• Concluding comments
Keywords – big data, modernisation, sandbox
Big data – working definition
Data that is difficult to collect, store or process within the conventional systems of statistical organizations.
Either their volume, velocity, structure or variety requires the adoption of new statistical software processing techniques and/or IT infrastructure to enable cost-effective insights to be made.
Do more with less
Mindset - Opportunities exist with secondary data sources
Legal environment
Data Protection
Official Statistics
Freedom of Information
Key: 3 legislative pillars
Modernisation and big data
2011 Conference of European Statisticians endorses modernisation strategy
2012 Big data on modernisation agenda
2013 ESSC Scheveningen memorandum on Big data and official statistics
2013 International Big data team gets going
2014 Big data on UNSC agenda
2014 The sandbox goes live at MSIS Dublin
2013 CSO Project - To determine household composition using smart metering data
Origin of data : Consumer Behaviour Trials in 2009 and 2010
• Over 5000 households in pilot
• 3 months baseline data (reading every 30 mins)
• Pre-trial survey using CATI
http://www.unece.org/stats/documents/2013.09.coll.html
Project with pilot data brought challenges
Pilot: 7 million data points per month
ICHEC helped out
Go live: 2160 million data points per month
Joe, we need a bigger computer
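The two volumes above can be checked with simple arithmetic. The sketch below assumes half-hourly readings (as in the trial), a 30-day month and a round 5,000 pilot households. The go-live meter count is not stated in the slides; it is back-derived here from the 2160 million figure, and the function names are illustrative only.

```python
READINGS_PER_DAY = 24 * 2   # one reading every 30 minutes
DAYS_PER_MONTH = 30         # assumption: a 30-day month


def monthly_data_points(meters: int) -> int:
    """Data points produced per month by a given number of smart meters."""
    return meters * READINGS_PER_DAY * DAYS_PER_MONTH


def meters_implied_by(monthly_points: int) -> float:
    """Number of meters that would produce the given monthly volume."""
    return monthly_points / (READINGS_PER_DAY * DAYS_PER_MONTH)


# Pilot: roughly 5,000 households (assumed round figure)
pilot_points = monthly_data_points(5_000)

# Go live: meter count back-derived from the quoted 2160 million data points
go_live_meters = meters_implied_by(2_160_000_000)

print(f"Pilot: {pilot_points:,} data points per month")
print(f"Go live implies about {go_live_meters:,.0f} meters")
```

Under these assumptions the pilot yields about 7.2 million data points per month, in line with the quoted 7 million, and the go-live figure is roughly 300 times larger.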
https://www.ichec.ie/
The hardware on which the sandbox system is based is a High Performance Computing cluster called Stoney. The cluster has been hosted at the National University of Ireland, Galway since April 2009 and is composed of 60 compute nodes, each of which has two 2.8GHz Intel (Nehalem EP) Xeon X5560 quad-core processors, 48GB of RAM and a 1TB local disk. Each node is connected to two networks: an InfiniBand network for accessing the shared Lustre filesystem and for high performance communications, and a Gigabit Ethernet network for management tasks. In addition, a 20TB shared filesystem is available to all nodes.
ICHEC will dedicate 20 compute nodes to enable a Hadoop cluster with 160 cores, almost 1TB of RAM and 20TB of HDFS distributed storage.
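The Hadoop cluster totals follow directly from the per-node specification. The sketch below reproduces them; the `NodeSpec` class and its field names are illustrative, not part of any ICHEC tooling, and the storage figure assumes that the whole 1TB local disk of each dedicated node counts toward raw HDFS capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node hardware as described for the Stoney cluster."""
    processors: int = 2          # two Xeon X5560 processors per node
    cores_per_processor: int = 4  # quad-core
    ram_gb: int = 48
    local_disk_tb: int = 1


def cluster_totals(nodes: int, spec: NodeSpec) -> dict:
    """Aggregate cores, RAM and raw local disk over a number of nodes."""
    return {
        "cores": nodes * spec.processors * spec.cores_per_processor,
        "ram_gb": nodes * spec.ram_gb,
        # Assumption: each node's full local disk is given to HDFS
        "raw_disk_tb": nodes * spec.local_disk_tb,
    }


hadoop = cluster_totals(nodes=20, spec=NodeSpec())
print(hadoop)
```

Twenty nodes give 160 cores, 960GB of RAM (the "almost 1TB" quoted) and 20TB of raw disk, matching the stated figures.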
The sandbox
The sandbox provides an environment to
o test feasibility of remote access and processing
o test whether existing standards/models/methods can be applied to big data
o evaluate the usefulness of big data software tools
o learn by doing with respect to potential uses, advantages and disadvantages of big data
o facilitate further collaboration in the international community
The toys (data sources)
o Twitter data
o mobile phone data
o satellite imagery / aerial photography
o price data / job vacancy data via scraping
o scanner data / price data sourced via large vendors
o data from road traffic sensors
o smart meter data on electricity/gas consumption
Learning from other industries - technical partners can have a role to play
Data Clearing Houses
Exchange of data for billing purposes
ROW Mobile Network Operators
Irish Mobile Network Operators (MNOs)
Learning from the past - think about the bigger picture
Nordbotten, Thygesen and the statistical archive concept
http://www.census.gov/history/pdf/kraus-natdatacenter.pdf
http://blog.modernmechanix.com/the-national-data-center-and-personal-privacy/
The National Data Center and Personal Privacy By Arthur R Miller
Learning from the past - do not underestimate privacy concerns
The sandbox - looking to the future
o Centres for Research and Development ?
o Centres of Excellence ?
o Partner organisations for collecting, processing or storing data of a less sensitive or non-sensitive nature ???
o Significant partner organisations enabling the collection, processing or storage of data of a sensitive nature ?????
Concluding remarks
• Think about the bigger picture / broader system
• Keep an open mind to the possibility of new partners
• Be open and transparent
• Don’t underestimate privacy concerns
• Continue to collaborate and share