Big Data is today: key issues for big data - Dr Ben Evans


Upload: australiannationaldataservice

Post on 13-Feb-2017


TRANSCRIPT

Slide 1

nci.org.au @NCInews

Big Data is today: key issues for big data

Dr Ben Evans
Associate Director
Research Engagements and Initiatives

Impact of Collaborations around Earth Systems Science Research

Societal impacts requiring cross-domain collaboration

Tropical Cyclones - Cyclone Winston, 20-21 Feb, 2016
Volcanic Ash - Manam Eruption, 31 July, 2015
Bush Fires - Wye Valley and Lorne Fires, 25-31 Dec, 2015
Agriculture - Flooding, St George, QLD, February, 2011

Modelling Extreme & High Impact events - BoM
NWP, Climate Coupled Systems & Data Assimilation - BoM, CSIRO, Research Collabs
Hazards - Geoscience Australia, BoM, States
Geophysics, Potential Fields, Seismic - Geoscience Australia, Universities
Monitoring the Environment & Ocean - ANU, BoM, CSIRO, GA, Research, Fed/State
International research - International agencies and Collaborative Programs

National Computational Infrastructure 2016

Ben Evans, Preparing for your data future, July 2016

Emerging Petascale Geophysics HPC codes

Assess priority Geophysics areas
3D/4D Geophysics: Magneto-tellurics, AEM
Hydrology, Groundwater, Carbon Sequestration
Forward and Inverse Seismic models and analysis (onshore and offshore)
Natural Hazard and Risk models: Tsunami, Ash-cloud

Issues
Data across domains, data resolution (points, lines, grids), data coverage
Provenance capture and query
Model maturity for running at scale
Ensemble, Uncertainty analysis and Inferencing

Growth of Genomics data generation and need for analysis

The arrival of the $1,000 genome

c/- Marcel Dinger, Garvan Inst.
Ref: Dinger_IMB_Winter_School_2014.pptx

Computational need to access big data

http://www.top500.org/statistics/perfdevel/

Current NCI
Next NCI

High-Performance Data (HPD) (Evans, ISESS 2015, Springer)

HPC turning compute into IO-bound problems
HPD turning IO-bound into ontology + semantic problems
Computational Performance increasing
Number of CPU cores increasing
Data needs to scale
Need compute to make full use of data

NCI National Platform to enable collaboration/transformation

NCI Proposal to NCRIS RDSI (RDS) for a High Performance Data Node to:
Enable dramatic increases in the scale and reach of Australian research by providing nationwide access to enabling data collections;
Specialise in nationally significant research collections requiring high-performance computational and data-intensive capabilities for their use in effective research methods;
Realise synergies with related national research infrastructure programs

As a result, Researchers will be able to:
share, use and reuse significant collections of data that were previously either unavailable to them or difficult to access
access the data in a consistent manner which will support a general interface as well as discipline specific access
use the consistent interface established/funded by this project for access to data collections at participating institutions and other locations as well as data held at the Nodes

NCI National Environment Research Data Collections

1. Climate/ESS Model Assets and Data Products
2. Earth and Marine Observations and Data Products
3. Geoscience Collections
4. Terrestrial Ecosystems Collections
5. Water Management and Hydrology Collections

Allocations and Review panels
Science Data Committee
Data Technical committee

Enable global and continental scale and to scale-down to local/catchment/plot
Water availability and usage over time
Catchment zone
Vegetation changes
Data fusion with point-clouds and local or other measurements
Statistical techniques on key variables

Preparing for:
Better programmatic access
Machine/Deep Learning
Better Integration through Semantic/Linked data technologies

Small Data to calibrate, validate and understand the Big Data

Image Credit: Japan Meteorological Agency (JMA)

Diabolical data: understanding data complexity

Data Collections
Data SubCollections
Data Sets (and granules)
Data subsetting and Dynamic data

Versioning, licensing, provenance, citation, sync, linked/semantic data
Social issues/responsibility mgt

Data Services
NERDIP Data Platform

NERDIP simplified view
Compute Intensive
Virtual Laboratories
Fast/Deep Data Access
Portal views
Machine Connected
Program access
Server-side functions

http://geonetwork.nci.org.au/ - access to metadata

Licensing and Access for Earth and Environmental

All metadata must be open and discoverable: through NCI, ANDS, FIND, data.gov.au and partner websites
Where possible, data will be CC-BY
Metadata and landing pages will document any access restrictions
NCI worked with Baden Appleyard QC of AusGOAL (Australian Governments Open Access and Licensing Framework)

National Computational Infrastructure 2015

Standards: Ensure compliant with Standards

[Diagram: national data programs, agencies and committees, each mapped to ISO/OGC standards. The labels below are recovered from the diagram text; groupings follow the order of the extracted text.]

NCRIS IMOS - Australian Ocean Data Network
Portals and Access
Aust. Ocean Data Centre Joint Facility (AODCJF): AIMS, CSIRO MAR, Geoscience Australia, BOM, Dept. of Defence, AAD
Data Integration: eMII, MACDDAP
Data Generation: ARGO, SOOP, SOTS, ANFOG, AUV, ANMN, AATAMS, FAIMMS, SRS

Data Management - Australian Research Data Commons
Data Management Components: ANDS, NCI, RDSI
Other Components: AAF, AARNet

NCRIS AuScope - AuScope Portal, Geoscience Portal
Govt Geoscience Info. Committee (GGIC): VIC, WA, GA, TAS, NT, QLD, SA, NSW
Data Integration: AuScope Grid, SISS, ARSDC
Data Generation: VCL, Geospatial, SAM, Earth Imaging, Earth Composition, Groundwater
Research & Development, Government Operational

ANZLIC Spatial Information Council
Australian Spatial Data Directory: VIC, WA, OSDM, TAS, NT, QLD, SA, NSW, ACT, NZ, ICSM

NCRIS Integrated Biological Systems - Atlas of Living Australia
Data Integration: Atlas of Living Australia, Aust Phenomics Network
Data Generation: Aust. Plant Phenomics Facility

Australian Govt Water
Aust Water Resources Information System: VIC, WA, BOM, TAS, NT, QLD, SA, NSW, ACT, CSIRO

Australian Spatial Consortium: ASIBA, SSI, PSMA, 43 Pty Ltd, CRC for Spatial Information

NCRIS TERN: e-MAST, BCCVL

Climate & Weather: NCRIS CWSLab

Australian Government: AGIMO, Gov 2.0, CSSDP, NAMF, NSS, AGLS, MDBC, NWC, Aust. Govt. Online Service Point, GANZ
NT, QLD, NSW, VIC, WA, ACT, TAS, SA, CSIRO, Bureau of Met

Standards shown against each group: ISO/OGC

Transform data to become transdisciplinary and born-connected

A call to action for a Transdisciplinary approach starting at the conception of data collections
Researchers across the science disciplines and broader
Then achieve interoperability and relevant information will be accessible to all sectors

Data moving to Born-Connected, which is part of the semantic and linked data world
Improves quality assurance of the data if linked

Getting Serious about Profiling Data Performance

Calltree analysis
Main General global profiling tools:
Scalasca/Score-P; TAU; OpenSpeedShop
HPCToolKit; mpiP; ITAC

IO analysis:
Compare to baselines
Darshan
Global profiling tools focused on IO

Performance Access Factors

Data packing
Variable ordering
Chunking/blocking
Compression
Caching
Subsetting/Sieving
Read vs Write
Parallel IO
Data conversion
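One of the access factors listed on this slide, chunking/blocking, can be illustrated with a minimal sketch. It counts how many storage chunks a subset read has to touch under two chunk layouts. The array dimensions, request shapes and chunk shapes are hypothetical values chosen for illustration; they are not taken from the presentation or from any NCI dataset.

```python
from math import prod


def chunks_touched(start, count, chunk_shape):
    """Count the chunks intersected by a rectangular (hyperslab) read.

    start, count: per-dimension offset and length of the requested subset.
    chunk_shape: per-dimension chunk size used when the array was stored.
    """
    per_dim = []
    for s, c, ch in zip(start, count, chunk_shape):
        first = s // ch            # index of the first chunk touched
        last = (s + c - 1) // ch   # index of the last chunk touched
        per_dim.append(last - first + 1)
    return prod(per_dim)


# Hypothetical (time, lat, lon) requests against a 1000 x 1000 x 1000 array.
time_series = {"start": (0, 500, 500), "count": (1000, 1, 1)}  # one pixel, all times
spatial_map = {"start": (0, 0, 0), "count": (1, 1000, 1000)}   # one time step, full map

# Two hypothetical chunk layouts for the same data.
map_chunks = (1, 100, 100)     # favours reading whole maps
series_chunks = (100, 10, 10)  # favours reading long time series

for name, req in (("time series", time_series), ("spatial map", spatial_map)):
    for label, shape in (("map-oriented", map_chunks), ("series-oriented", series_chunks)):
        n = chunks_touched(req["start"], req["count"], shape)
        print(f"{name:12s} {label:16s} chunks read: {n}")
```

In this sketch the same time-series request touches 1000 chunks under the map-oriented layout and 10 under the series-oriented one, while the full-map request shows the opposite trade-off. That is why the chunk shape is usually chosen to match the dominant access pattern.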

Data Classified Based On Processing Levels

Level* | Proposed Name | Description*
0 | Raw Data | Instrumental data as received from sensor. Includes any and all artefacts.
1 | Instrument Data | Instrument data that have been converted to sensor units but are otherwise unprocessed. Data includes appended time and platform georeferencing parameters (e.g., satellite ephemeris).
2 | Calibrated Data | Data that has undergone corrections or calibrations necessary to convert instrument data into geophysical value. Data includes calculated position.
3 | Gridded Data | Data that has been gridded and undergone minor processing for completeness and consistency (i.e., replacing missing data).
4 | Value-added Data Products | Analytical (modelled) data such as those derived from the application of algorithms to multiple measurements or sensors.
5 | Model-derived Data Products | Data resulting from the simulation of physical processes and/or application of expert knowledge and interpretation.

*The level numbers and descriptions above follow definitions used in satellite data processing, as defined by NASA.

HPD
points
grids
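The processing levels on this slide could be carried as a small lookup structure, so that each dataset is tagged with its level. The following is a minimal sketch only: the class name, member names and helper function are hypothetical, and only the level numbers and proposed names come from the slide.

```python
from enum import IntEnum


class ProcessingLevel(IntEnum):
    """Processing levels 0-5 as listed on the slide (member names are illustrative)."""
    RAW = 0
    INSTRUMENT = 1
    CALIBRATED = 2
    GRIDDED = 3
    VALUE_ADDED = 4
    MODEL_DERIVED = 5


# Proposed names exactly as given in the slide's table.
PROPOSED_NAMES = {
    ProcessingLevel.RAW: "Raw Data",
    ProcessingLevel.INSTRUMENT: "Instrument Data",
    ProcessingLevel.CALIBRATED: "Calibrated Data",
    ProcessingLevel.GRIDDED: "Gridded Data",
    ProcessingLevel.VALUE_ADDED: "Value-added Data Products",
    ProcessingLevel.MODEL_DERIVED: "Model-derived Data Products",
}


def describe(level):
    """Return a short label such as 'Level 3: Gridded Data' for an int or enum member."""
    lvl = ProcessingLevel(level)
    return f"Level {int(lvl)}: {PROPOSED_NAMES[lvl]}"


print(describe(3))
```

Because the enum is ordered, a catalogue could filter with comparisons such as `level >= ProcessingLevel.GRIDDED`.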

Quality Assurance, Conventions & Interoperable Standards

O&M ISO standards
CF and ACDD standards
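A quality-assurance check against the CF and ACDD conventions named on this slide might start with the file's global attributes. The sketch below works on a plain dictionary rather than a real file, and the function names are hypothetical. The four attribute names are those ACDD 1.3 lists as "highly recommended"; everything else a real compliance checker would test is omitted.

```python
# Global attributes that ACDD 1.3 marks as "highly recommended".
ACDD_HIGHLY_RECOMMENDED = ("title", "summary", "keywords", "Conventions")


def missing_global_attributes(attrs, required=ACDD_HIGHLY_RECOMMENDED):
    """Return the required attribute names that are absent or blank in attrs."""
    return [name for name in required if not str(attrs.get(name, "")).strip()]


def declares_convention(attrs, prefix):
    """True if the Conventions attribute lists an entry starting with prefix.

    The Conventions string may be comma- or space-separated, e.g. "CF-1.6, ACDD-1.3".
    """
    declared = str(attrs.get("Conventions", ""))
    entries = declared.replace(",", " ").split()
    return any(entry.startswith(prefix) for entry in entries)


# Hypothetical metadata for illustration only.
example = {"title": "Example gridded product", "Conventions": "CF-1.6, ACDD-1.3"}

print(missing_global_attributes(example))   # attributes still to be filled in
print(declares_convention(example, "CF-"))
print(declares_convention(example, "ACDD-"))
```

In practice the dictionary would be read from the dataset itself, and a check like this would be one small part of a fuller compliance report.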

Barriers: Like my Coordinate System?

Mercator grid in south
Tripolar grid in north

Standards on Nested Grids

Transforming data on-the-fly

Landsat: A mosaic composed from different scenes for the selected area, using the scenes closest to the selected date. An RGB image is composed by mapping three different bands to the RGB colours.

Himawari: A video for the selected date and area: 12 frames covering the period around noon, with frames 30 minutes apart. Each frame is an RGB image composed by mapping the three Himawari bands closest to those of Landsat, to give a similar image.

ERA-Interim: A video for the selected date and 2000 square kilometres around the selected region, representing "ERA-Interim Evaporation [m] forecast on surface": 8 frames covering one day (one every 3 hours). Each frame is an RGB image composed using a colormap to represent the different values of evaporation.
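The step of composing an RGB image by mapping three bands into the RGB colours can be sketched in a few lines. This is an illustrative sketch only, not the service's implementation: the function names, the linear stretch and the reflectance range of 0 to 1 are all assumptions.

```python
def stretch(band, lo, hi):
    """Linearly rescale a 2D band (list of rows) to integers 0-255.

    Values outside [lo, hi] are clipped to the ends of the range.
    """
    span = hi - lo
    return [
        [max(0, min(255, round(255 * (value - lo) / span))) for value in row]
        for row in band
    ]


def compose_rgb(red, green, blue, lo=0.0, hi=1.0):
    """Combine three equally sized bands into rows of (R, G, B) pixel tuples."""
    r, g, b = (stretch(band, lo, hi) for band in (red, green, blue))
    return [list(zip(rr, gg, bb)) for rr, gg, bb in zip(r, g, b)]


# Hypothetical 1 x 2 pixel bands, with values assumed to lie nominally in [0, 1].
red = [[0.0, 1.0]]
green = [[1.0, 0.0]]
blue = [[-0.5, 2.0]]   # out-of-range values are clipped

print(compose_rgb(red, green, blue))
```

Which sensor bands feed the red, green and blue channels is the choice the slide describes for Himawari, where the three bands closest to the Landsat ones are used so that the two images look similar.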


Examples of Virtual Labs and web tools

eReefs online analysis portal


Matching to the database of events

Input Image
Feature Maps
Convolution Layer 1

c/- Rahul Ramachandran, NASA / MSFC
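The diagram labels on this slide (input image, convolution layer, feature maps) describe the first stage of a convolutional network. A minimal pure-Python sketch of that stage follows. The kernels and the tiny image are hypothetical, and the network used in the work credited on the slide is not described here. As in most deep learning libraries, the operation is implemented as cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image with no padding and return the 2D response."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


def relu(feature_map):
    """Zero out negative responses."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def conv_layer(image, kernels):
    """Apply each kernel to the image, producing one feature map per kernel."""
    return [relu(conv2d_valid(image, k)) for k in kernels]


# Hypothetical 3 x 3 single-channel image and two 2 x 2 kernels.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernels = [
    [[1, 0], [0, -1]],   # difference along the main diagonal
    [[0, 1], [1, 0]],    # sum along the anti-diagonal
]

for feature_map in conv_layer(image, kernels):
    print(feature_map)
```

In a trained network the kernel weights are learned from labelled images rather than set by hand.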

Reasonable progress

Overall Accuracy = 87.88%
MODIS Rapid Response Test Images (Images to Trained scheme)

[Figure: example test images for the Hurricane, Dust and Smoke classes, labelled True Positive, False Negative or False Positive]
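The overall accuracy quoted on this slide is the fraction of test images whose predicted class matches the true class, and the true/false positive labels are counted per class. The sketch below shows both calculations. The label lists are made up for illustration and are not the test set behind the 87.88% figure.

```python
def overall_accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label equals the true label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def class_counts(true_labels, predicted_labels, cls):
    """True positive, false positive and false negative counts for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    return {
        "tp": sum(t == cls and p == cls for t, p in pairs),
        "fp": sum(t != cls and p == cls for t, p in pairs),
        "fn": sum(t == cls and p != cls for t, p in pairs),
    }


# Hypothetical labels for six test images.
truth = ["hurricane", "dust", "smoke", "smoke", "dust", "hurricane"]
predicted = ["hurricane", "dust", "smoke", "dust", "dust", "smoke"]

print(f"overall accuracy: {overall_accuracy(truth, predicted):.2%}")
for cls in ("hurricane", "dust", "smoke"):
    print(cls, class_counts(truth, predicted, cls))
```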

PROMS v3 uses an extension to the PROV ontology as its data model.
Entities
Activities
Agent

RD-Switchboard http://www.rd-switchboard.org/

Enabling transparency, reproducibility & informatics techniques
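The three PROV building blocks named on this slide (entities, activities, agents) and the relations between them can be sketched with plain data classes. This is a minimal illustration of the core PROV pattern only: it does not model the PROMS extension or use any PROMS API, and the class layout and identifiers are hypothetical. The relation names `prov:used`, `prov:wasGeneratedBy` and `prov:wasAssociatedWith` are standard PROV-O properties.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Agent:
    """Something responsible for an activity (a person, organisation or program)."""
    id: str


@dataclass(frozen=True)
class Entity:
    """A thing with provenance, such as a dataset or a file."""
    id: str


@dataclass
class Activity:
    """A process that uses input entities and generates output entities."""
    id: str
    was_associated_with: Agent
    used: list = field(default_factory=list)
    generated: list = field(default_factory=list)


def to_statements(activity):
    """Flatten one activity into (subject, PROV-O property, object) triples."""
    statements = [(activity.id, "prov:wasAssociatedWith", activity.was_associated_with.id)]
    statements += [(activity.id, "prov:used", e.id) for e in activity.used]
    statements += [(e.id, "prov:wasGeneratedBy", activity.id) for e in activity.generated]
    return statements


# Hypothetical identifiers for illustration only.
run = Activity(
    id="ex:regrid-run-1",
    was_associated_with=Agent("ex:analyst"),
    used=[Entity("ex:input-dataset")],
    generated=[Entity("ex:regridded-dataset")],
)

for statement in to_statements(run):
    print(statement)
```

Recording such triples for every processing step is what allows a provenance store to answer questions like which inputs a published product was derived from.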

Key Messages for raising a Data Centre in a Big Data World

Scientific Computing scales of today have to be built across collaborations of national facilities around national institutions that both scale up and scale down
Data needs to be born-connected, transdisciplinary, high quality, computationally ready
Needs expertise around usability and performance tuning to ensure getting the most out of the data.
No one [insert grouping] can do it alone. No one organisation, no one group, no one country has the required resources or the expertise.
Collaborative efforts across disciplines and collaboration across nations

Working Collaboratively in the era of Exascale and Big Data