nci.org.au © National Computational Infrastructure 2016
Ben Evans, WGISS, March 2016 nci.org.au @NCInews
Bridging the gap between HPC and High Performance Data Analysis
Ben Evans
NCI: Some of the HPC and HPD integrated activities
NCI REI Teams (that are directly relevant to this work)
• C. Richards, L. Wyborn - stakeholder engagement and mgt
• J. Wang, K. Gohar, W. Si - Data Collections Team
• J. Antony, P. Larraondo - High Performance Data Team
• D. Roberts, M. Ward, R. Yang - HPC and scaling analysis Team
• C. Trenham, K. Druken, A. Steer - Data Services Team
• J. Smillie, C. Allen, S. Pringle - Virtual Labs Team
NCI High Performance Data Collections

Data Collections                                                 Approx. Capacity
CMIP5, CORDEX, ACCESS Models                                     5 Pbytes
Earth Obs: Himawari-8, LANDSAT, Sentinel, MODIS, INSAR           2 Pbytes
Digital Elevation, Bathymetry, Onshore/Offshore Geophysics       1 Pbytes
Seasonal Climate                                                 700 Tbytes
Bureau of Meteorology Observations                               350 Tbytes
Bureau of Meteorology Ocean-Marine                               350 Tbytes
Terrestrial Ecosystem                                            290 Tbytes
Reanalysis products                                              100 Tbytes

1. Climate/ESS Model Assets and Data Products
2. Earth and Marine Observations and Data Products
3. Geoscience Collections
4. Terrestrial Ecosystems Collections
5. Water Management and Hydrology Collections

http://geonetwork.nci.org.au/
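As a quick consistency check, the approximate capacities in the table can be totalled. The sketch below assumes decimal units (1 Pbyte = 1000 Tbytes), which the slide does not state, and the dictionary keys are shortened labels chosen here for illustration.

```python
# Approximate collection capacities from the slide, in Tbytes.
# Assumption: decimal units, 1 Pbyte = 1000 Tbytes (not stated in the source).
CAPACITY_TB = {
    "CMIP5, CORDEX, ACCESS Models": 5000,
    "Earth Obs": 2000,
    "Elevation, Bathymetry, Geophysics": 1000,
    "Seasonal Climate": 700,
    "BoM Observations": 350,
    "BoM Ocean-Marine": 350,
    "Terrestrial Ecosystem": 290,
    "Reanalysis products": 100,
}


def total_capacity_tb(capacities):
    """Sum the per-collection capacities (Tbytes)."""
    return sum(capacities.values())


total_tb = total_capacity_tb(CAPACITY_TB)
print(f"Listed collections: {total_tb} Tbytes (~{total_tb / 1000:.2f} Pbytes)")
```

Under that assumption the listed collections add up to a little under 10 Pbytes.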
NCI's Integrated Scientific HPC/HPD Environment
10PB+ Research Data
Server-side analysis and visualization
Data Services: THREDDS
VDI: Cloud scale user desktops on data
Web-time analytics software
Enable global and continental scale as well as scale-down to local/catchment/plot scale
• NWP and Forecasts: UM, APS3 (Global, Regional, City), ACCESS-TC
• Coupled Seasonal and Decadal Climate: ACCESS-GC2/3 (GloSea5)
• Data Assimilation: 3D-VAR, 4D-VAR (Atmosphere), EnKF (Ocean)
• Ocean Forecasting and Research: OceanMaps, BlueLink, MOM5, CICE/SIS, WW3, ROMS
• Fully-Coupled Earth System Model: ACCESS-CM, ACCESS-ESM, CMIP5/6
• Water availability and usage over time
• Catchment zone
• Vegetation changes
• Data fusion with pt-clouds and local or other measurements
• Statistical techniques on key variables
Some of the foci for Satellite Imagery
Tropical Cyclones: Cyclone Winston, 20-21 Feb, 2016
Volcanic Ash: Manam Eruption, 31 July, 2015
Bush Fires: Wye Valley and Lorne Fires, 25-31 Dec, 2015
• Modelling Extreme and High Impact events - BoM
• NWP, Climate Coupled Systems and Data Assimilation - BoM, CSIRO, Uni's.
• Hazards - Geoscience Australia, BoM
• Monitoring the Environment and Ocean - ANU, BoM, CSIRO, GA, IMOS, TERN
Example of I/O for large-scale HPC: UM (atmosphere) with and without parallel I/O
[Figure: Write speed per pack. x-axis: Pack No. (1-43); y-axis: Write Speed (MB/s), 0-1200; series: Write Speed (MPI-IO), Write Speed (POSIX).]
I/O speeds per pack for UM files with and without MPI-IO. c/- Dale Roberts
NCI's National Environmental Research Data Interoperability Platform (NERDIP)
[Layer diagram, top to bottom:]
Workflow Engines, Virtual Laboratories (VL's), Science Gateways: Biodiversity & Climate Change VL; Climate & Weather Systems Lab; VGL; AGDC VL; VHIRL; Globe Claritas; eMAST Speddexes; eReefs; All Sky Virtual Observatory
Tools: Models (Fortran, C, C++, MPI, OpenMP); Python, R, MatLab, IDL; Visualisation (Drishti); Ferret, NCO, GDL, GDAL, GRASS, QGIS
Data Portals, Mobile Apps: AuScope Portal; TERN Portal; AODN/IMOS Portal; ANDS/RDA Portal; Digital Bathymetry & Elevation Portal; Data.gov.au
Services Layer (expose data models & semantics): OGC WFS; OGC W*S; OGC WPS; OGC WCS; OGC WMS; OGC SOS; OPeNDAP; RDF, LD; Direct Access; Fast "whole-of-library" catalogue
Metadata Layer: netCDF-CF; HDF-EOS; ISO 19115, RIF-CS, DCAT, etc.
Data Library Layer 1: netCDF-4 (Climate/Weather/Ocean); netCDF-4 (EO); Libgdal (EO); SEG-Y; Airborne Geophysics Line data; FITS; BAG; LAS LiDAR; Open Nav Surface
HP Data Library Layer 2: HDF5 MPI-enabled; HDF5 Serial
Storage: Lustre; Other Storage (options)
NERDIP: Enabling Multiple Ways to Interact with the Data
[Same NERDIP layer diagram, annotated: Infrastructure to Lower Barriers to Entry; Ace Users; Data Platform; Data Portals.]
NERDIP: Enabling Ace Users to Interact with the Data
[Same NERDIP layer diagram, annotated: Infrastructure to Lower Barriers to Entry; Data Platform; Data Portals.]
NERDIP: Applications Replicating Ways of Interacting with the Data
[Same NERDIP layer diagram.]
NERDIP: Loosely coupling Applications and Data via Services
[Same NERDIP layer diagram, annotated: Infrastructure to Lower Barriers to Entry; Ace Users; Data Platform; Data Discovery; Application Focussed Developers; Data Management Focussed Developers.]
Grid Diversity in CMIP5
Downstream communities may not wish to deal with different grids, but the modelling communities generate data appropriate to them.
Mercator grid in south
Tripolar grid in north
Data file layouts and performance analysis
• Global profiling tools focused on IO
• Compare to baselines
How is data stored? (source: www.hdfgroup.org, Extreme Scale Computing HDF5, August 7, 2013)
• Contiguous (default): data elements stored physically adjacent to each other
• Chunked: better access time for subsets; extendible
• Chunked & Compressed: improves storage efficiency, transmission speed
[Figure: buffer in memory versus data in the file for each layout.]
• Data is stored in chunks of predefined size
• A two-dimensional instance may be referred to as data tiling
• Chunking is matched to the cache size on the processor
Benchmark testing a Landsat scene in various file formats
• Read Source File
  • LC80771182015023LGN00_B1.nc
  • Block size: 7771*7841
  • Data type: Short (2 Bytes)
  • Libraries: GDAL, NetCDF, HDF5
  • 1~9 Variables/Bands
• Write Target Files
  • Library: Formats
    • GDAL: GeoTIFF, NetCDF Classic, NetCDF4, NetCDF4 Classic
    • NetCDF: NetCDF Classic, NetCDF4, NetCDF4 Classic
    • HDF5: HDF5
  • Data: 1~9 Bands
• IO Libraries
  • GDAL 2.0.2 (GTIFF, NC, NC4, NC4C): 2D array
  • NetCDF (4.4.0) (NC, NC4, NC4C): 2D & 3D array (for this study)
  • HDF5 (1.8.16) (NC4, HDF5): 2D & 3D array (for this study)
c/- Rui Yang

Formats versus the APIs used to access them:

Formats                            GDAL    NETCDF    HDF5
GDAL created GTIFF (GDAL_GTIFF)    ✔       -         -
GDAL_NC                            ✔       ✔         -
GDAL_NC4C                          ✔       ✔         ✔
GDAL_NC4                           ✔       ✔         ✔
NC                                 ✔       ✔         -
NC4C                               ✔       ✔         ✔
NC4                                ✔       ✔         ✔
HDF5                               ✔       ✔         ✔
Full access to 2D contiguous datasets
[Figure: Read Throughputs (MB/s), 0-1000, for varnum_1 to varnum_9; series: gdal_gtiff-gdal, gdal_nc-gdal, gdal_nc4-gdal, gdal_nc-netcdf, gdal_nc4-netcdf; full dataset 7771*7841.]
• GeoTIFF performance is impacted by the number of variables (reads the whole file for each variable)
• GDAL creates overhead on NetCDF3 Classic files (requires an additional mem_copy op.)
• GDAL and the NetCDF/HDF5 libraries access NetCDF4 files with similar performance
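The GeoTIFF remark above can be turned into rough arithmetic. The sketch below is a hypothetical model, not a measurement: it assumes that fetching one variable costs a pass over the whole n-band file and that no caching helps later reads. The function names are illustrative only.

```python
# Rough model of the GeoTIFF observation above. Assumption (hypothetical
# model, not a measurement): reading one variable costs a pass over the
# whole n-band file, and no caching helps later reads.
ROWS, COLS = 7771, 7841      # full dataset dimensions from the slide
BYTES_PER_ELEMENT = 2        # "Short" data type


def band_bytes():
    """Size of one band (variable) in bytes."""
    return ROWS * COLS * BYTES_PER_ELEMENT


def bytes_touched(n_vars, whole_file_per_read):
    """Bytes read from storage to fetch all n_vars variables once."""
    per_read = n_vars * band_bytes() if whole_file_per_read else band_bytes()
    return n_vars * per_read


def efficiency(n_vars, whole_file_per_read):
    """Useful bytes divided by bytes touched."""
    useful = n_vars * band_bytes()
    return useful / bytes_touched(n_vars, whole_file_per_read)


for n in (1, 3, 9):
    print(n, efficiency(n, True), efficiency(n, False))
```

In this model the useful fraction of the bytes read falls as 1/n when the whole file is read per variable, and stays at 1 when each variable can be read on its own.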
Per-block access of contiguous vs chunked datasets
• Per-block access is slower than the full access to contiguous datasets in the previous benchmark.
• But accessing a chunked/tiled dataset is faster than accessing a contiguous dataset.
[Figure: Read Throughputs (MB/s), 0-400, for varnum_1 to varnum_9; series: chunked_nc4-netcdf, contiguous_nc4-netcdf, contiguous nc-netcdf, contiguous gdal_gtiff-gdal, tiled gdal_gtiff-gdal.]
• Subset size: 2560*2560
• Chunk size: 640*640
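One plausible reason for the chunked result is the number of separate file regions a block read has to visit. The sketch below counts them for the subset and chunk sizes on the slide. It assumes each chunk is stored as one contiguous region in the file and that the contiguous dataset is row-major and wider than the block; the helper names are illustrative.

```python
def chunks_touched(start, shape, chunk):
    """Number of chunks overlapped by a block with the given start and shape."""
    count = 1
    for s, n, c in zip(start, shape, chunk):
        first = s // c               # index of the first chunk along this axis
        last = (s + n - 1) // c      # index of the last chunk along this axis
        count *= last - first + 1
    return count


def contiguous_segments(shape):
    """Separate row segments needed for a 2D block in a row-major
    contiguous dataset that is wider than the block."""
    return shape[0]


SUBSET = (2560, 2560)   # subset size from the slide
CHUNK = (640, 640)      # chunk size from the slide

aligned = chunks_touched((0, 0), SUBSET, CHUNK)
unaligned = chunks_touched((100, 100), SUBSET, CHUNK)
rows = contiguous_segments(SUBSET)
print(f"chunked, aligned block:   {aligned} chunks")
print(f"chunked, unaligned block: {unaligned} chunks")
print(f"contiguous layout:        {rows} row segments")
```

A block aligned to the chunk grid touches a handful of chunks, whereas the contiguous layout needs one short read per row of the block.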
Benchmark Configurations with Compression

Source File Attributes
• File Name: LC80990772015066LGN00.nc
• Dataset: Band1
• Data type: float
• Dimension (elements): 7701*7591
• Dataset Size: 233,833,164 bytes
• Chunk: 1*7591
• Shuffle: True
• Deflate Level: 1

Write Parameters
• Data Type: Float
• Chunk: 1*1*7591
• Compression Level: 0-9
• Shuffle: Disabled/Enable

Read Parameters
• Compressor: As above
• Hyperslab: 1*1*7591
• Chunk Cache Size: 1MB

• Shuffle: Blosc/Byte shuffle, Blosc/Bit shuffle
• Compression Level: 0-9

Library    Default           Dynamic Filter
NetCDF4    Deflate (Zlib)    N/A
HDF5       Deflate (Zlib)    Bzip2, mafisc, spdp, Blosc (blosclz, lz4hc, lz4, SNAPPY, ZLIB)
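To illustrate what the shuffle and deflate settings do, here is a minimal standard-library sketch of a byte shuffle followed by zlib deflate. It does not use NetCDF4, HDF5 or Blosc; the synthetic row of values is a hypothetical stand-in for one 7591-element chunk of float data, and the function names are illustrative.

```python
import struct
import zlib


def byte_shuffle(raw, itemsize):
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(raw[i::itemsize] for i in range(itemsize))


def byte_unshuffle(shuffled, itemsize):
    """Inverse of byte_shuffle."""
    n = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for i in range(itemsize):
        out[i::itemsize] = shuffled[i * n:(i + 1) * n]
    return bytes(out)


# Hypothetical smooth float32 row, 7591 elements wide like one chunk.
values = [250.0 + 0.01 * i for i in range(7591)]
raw = struct.pack(f"<{len(values)}f", *values)

# Compare compressed sizes with and without the shuffle at two deflate levels.
for level in (1, 9):
    plain = zlib.compress(raw, level)
    shuffled = zlib.compress(byte_shuffle(raw, 4), level)
    print(f"level {level}: raw={len(raw)} deflate={len(plain)} "
          f"shuffle+deflate={len(shuffled)}")

# The round trip must return the original bytes.
restored = byte_unshuffle(zlib.decompress(zlib.compress(byte_shuffle(raw, 4), 1)), 4)
print("round trip ok:", restored == raw)
```

The shuffle itself does not shrink anything; it only regroups bytes so that the compressor sees longer runs of similar values. How much it helps depends on the data.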
Read Performance vs File Size
[Figure panels: write TP vs File Size; read TP vs File Size; write TP vs Read TP. Labelled points: Default inflate, Blosc/LZ4.]
Another benchmark - internal data organisation
• Data layout used to write file
  • Coordinate y, Coordinate x, Time t
  • Time t, Coordinate y, Coordinate x
• Chunking
  • Along 2D (y, x) or 3D (t, y, x)
• Read access
  • Along yx or time t
  • Block subsets
• Choose an appropriate data layout and chunk shape to provide satisfactory performance for any subset selection
Layout, Chunking and Subset

Layout: tyx (6,7851,7761) for the first three value columns; yxt (7851,7761,6) for the last three.

Chunk size                       (1,256,256)     (3,256,256)     (6,256,256)     (256,256,1)     (256,256,3)     (256,256,6)
Full access T=6,Y=7851,X=7761    469.52          597.01          691.31          179.58          399.82          783.90
Along yx T=1,Y=7851,X=7761       483.95          239.92          133.14          217.30          165.77          104.05
Along t T=6,Y=2048,X=2048        365.16          430.04          493.11          159.82          333.94          539.49

Chunk size                       (1,512,512)     (3,512,512)     (6,512,512)     (512,512,1)     (512,512,3)     (512,512,6)
Full access T=6,Y=7851,X=7761    647.8           816.0           823.3           185.99          436.95          870.40
Along yx T=1,Y=7851,X=7761       607.01          267.71          150.78          267.55          164.02          110.47
Along t T=6,Y=2048,X=2048        408.26          679.47          642.62          173.13          400.51          710.93

Chunk size                       (1,1024,1024)   (3,1024,1024)   (6,1024,1024)   (1024,1024,1)   (1024,1024,3)   (1024,1024,6)
Full access T=6,Y=7851,X=7761    776.78          720.51          738.95          191.02          391.51          811.89
Along yx T=1,Y=7851,X=7761       617.40          263.45          150.13          396.57          163.45          103.57
Along t T=6,Y=2048,X=2048        560.33          596.83          701.69          163.50          396.87          663.34
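In the tyx columns, the "Along yx" figures drop as the chunk gets deeper in time (for example 483.95, 239.92, 133.14 for the 256-sized chunks). One plausible contributor is read amplification: a single time slice still has to pull in every time step stored in each chunk it touches. The sketch below estimates that ratio. It assumes the request starts at the origin and that edge chunks are stored at full size; the function name is illustrative.

```python
def read_amplification(request, chunk):
    """Elements in all touched chunks divided by elements requested.

    Assumes the request starts at the origin and that edge chunks are
    stored at full size.
    """
    touched = 1
    wanted = 1
    for n, c in zip(request, chunk):
        n_chunks = -(-n // c)        # ceiling division
        touched *= n_chunks * c
        wanted *= n
    return touched / wanted


ONE_TIME_SLICE = (1, 7851, 7761)     # the "Along yx" request, in tyx order

for chunk in [(1, 256, 256), (3, 256, 256), (6, 256, 256)]:
    ratio = read_amplification(ONE_TIME_SLICE, chunk)
    print(chunk, f"reads about {ratio:.2f}x the requested elements")
```

With these shapes the estimate grows in proportion to the time depth of the chunk, which matches the direction of the trend in the table without claiming to explain the exact figures.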
Parallel IO: HDF5 based on MPI-IO
[Figure: Independent Read. x-axis: Stripe count (1, 8, 16, 32, 64, 128); y-axis: MB/s (0-7000); series: HDF5, MPIIO, POSIX.]
IOR Benchmark: MPI size = 16; Stripe size = 1M; Block size = 8G; Transfer size = 32M
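To see how stripe count interacts with the transfer size in this benchmark, the sketch below maps a byte range onto stripes. It assumes round-robin (RAID-0 style) striping as used by Lustre, and it assumes "1M" and "32M" mean MiB; the function name is illustrative and nothing here calls a real file system API.

```python
MIB = 1024 * 1024


def stripe_targets(offset, length, stripe_size, stripe_count):
    """Set of stripe indices (0..stripe_count-1) touched by a byte range,
    assuming round-robin (RAID-0 style) striping."""
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return {s % stripe_count for s in range(first, last + 1)}


STRIPE_SIZE = 1 * MIB     # "Stripe size = 1M" (assumed MiB)
TRANSFER = 32 * MIB       # "Transfer size = 32M" (assumed MiB)

for count in (1, 8, 16, 32, 64, 128):
    touched = stripe_targets(0, TRANSFER, STRIPE_SIZE, count)
    print(f"stripe count {count:3d}: one transfer touches {len(touched)} targets")
```

Under these assumptions a single 32M transfer can be spread over at most 32 storage targets, so stripe counts above 32 only help when several transfers are in flight at once.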
Web Map Tile Servers instead of WMS
Serving Maps
[Diagram: THREDDS Server, WMS Server and Client (Browser), with numbered steps 1-4.]
Dynamic Web Map Tile Servers
Serving Maps
[Diagram: THREDDS Server, Dynamic WMTS Server and Client (Browser), with numbered steps 1-4.]
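A tile server differs from WMS in that clients request tiles on a fixed grid rather than arbitrary bounding boxes, so neighbouring requests resolve to the same tile. The sketch below shows the usual Web Mercator tile arithmetic. The slides do not say which tile grid NCI uses, so the grid, the function names and the `himawari8` layer name are all assumptions for illustration.

```python
import math


def lonlat_to_tile(lon_deg, lat_deg, zoom):
    """Column and row of the Web Mercator tile containing a point.

    Assumption: the common XYZ tile grid with 2**zoom tiles per axis
    and row 0 at the top.
    """
    n = 2 ** zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat)) / math.pi) / 2.0 * n)
    # Clamp so points on the grid edge still map to a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_key(layer, zoom, x, y):
    """Hypothetical cache key: identical tiles give identical keys."""
    return f"{layer}/{zoom}/{x}/{y}"


# Two nearby points fall in the same tile and so share one cache entry.
a = lonlat_to_tile(149.1, -35.3, 5)
b = lonlat_to_tile(149.2, -35.2, 5)
print(tile_key("himawari8", 5, *a))
print(tile_key("himawari8", 5, *b))
```

Because every client asking for the same area at the same zoom level produces the same key, a rendered tile can be cached or generated once on demand.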
Reprojecting raster data on-the-fly from multiple satellites
Landsat8:
• 2015, 25 meters resolution, 11 Bands, revisit period 16 days
• UTM projection
• Original USGS L1T scenes packed in HDF5 (chunked & compressed)
• Local API and CEPH access
Himawari-8:
• 500, 1000, 2000 meters (depending on the band), 16 Bands, image every 10 mins
• Geostationary projection
• BoM NetCDF4 files
• Access through NCI TDS (THREDDS) subsetting
ERA Interim:
• 2015, 75 km resolution, 45 different atmospheric variables, one field every 3 hours
• WGS84 projection
• ECMWF netCDF4 files
• Local API and CEPH access
c/- Pablo Larraondo, Joseph Antony