Datacube Presentation for SPEDDEXES Workshop 18/3/14: Welcome to the Cube…

Upload: aceas13tern

Posted on 20-May-2015


DESCRIPTION

Australia's National Computational Infrastructure (NCI) and Geoscience Australia: Datacube Presentation for SPEDDEXES Workshop 18/3/14

TRANSCRIPT

  • 1. Datacube Presentation for SPEDDEXES Workshop 18/3/14: Welcome to the Cube

  • 2. Why the AG-DC?
    - Amassed huge volumes of Landsat data and derived products through the successful ULA (Unlocking the Landsat Archive) project.
    - Needed to deliver urgent, large-scale analyses for the MDBA and the NFRIP (National Flood Risk Information Portal), among others.
    - Needed to free scientists from having to locate and arrange data prior to performing each analysis.
    - Were reliant on the successful delivery of a third-party product to conduct analyses; a suitable product could not be delivered in time.
    - Urgently needed to develop a means of leveraging compute power and storage at NCI to conduct temporal analyses.

  • 3. [Diagram] Trade-offs between the skill level required to access, the compute power required to process, the potential number of users and the data volume, for analyses ranging from a small piece (e.g. paddock scale), 1D cores (e.g. time series) and 2D slices (e.g. x-y spatial), through composite mosaics, to the whole cube at full-resolution continental scale.

  • 4. How to Catch, Cook and Eat an Elephant: Analysing the Landsat Archive (scary statistics)
    - 15 years of Landsat data (1998-2012) processed so far: 15,000 passes, 133,000 acquisitions, 636,000 available datasets (all processing levels).
    - 52 × 10^12 pixels (peta-pixels?) in all available datasets (>11x more if counting bands separately).
    - 0.5 PB, and growing rapidly in both directions.

  • 5-7. Problems with Traditional Monolithic Approaches
    - Remote sensing data is typically both spatially and temporally sparse and irregular, unlike modelling output.
    - Monolithic array approaches (e.g. x-y-t) do not work well: too many empty pixels, and temporal binning is required.
    - [Diagram, built up across slides 5-7: an x-y-t array.]

  • 8. Challenges
    - Remote sensing data (especially Landsat) is both spatially and temporally sparse and irregular.
    - The Landsat archival data collection is currently dynamic: growing both forwards and backwards in time, and also subject to modification (existing data) and insertion (new data).
    - Some use cases require ancillary data for the exact acquisition time (e.g. tides for shallow-water bathymetry).
    - We often have two satellites observing the same area in a given 24 h period, so we need a much finer temporal resolution than one day.

  • 9. Challenges (continued)
    - The scene-based USGS World Reference System (WRS) for Landsat imagery is only a nominal spatial reference system: scenes with the same path and row numbers have fuzzy, variable boundaries.
    - Due to orbital inclination, scenes are not orthogonal in any conventional projection.
    - Landsat WRS scenes overlap, so some data is duplicated between adjacent N-S scenes in the same pass.

  • 10-13. What's Different about the AG-DC Approach?
    - The AG-DC arranges 2D (spatial) data temporally and spatially to allow flexible but reasonably efficient large-scale analysis.
    - A "Dice and Stack" method is used to subdivide the data into spatially regular, time-stamped, band-aggregated tiles which can be managed as dense temporal stacks.
    - [Diagram, built up across slides 10-13: Dice and Stack.]
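
(A minimal sketch of the "dice" step described on slides 10-13, using the tile scheme from the tile_type_info structure shown on slide 27: 1-degree EPSG:4326 tiles of 4000 x 4000 pixels at 0.00025 degrees per pixel. The assumption that x_index/y_index are simply the integer degrees of a tile's south-west corner is inferred from the example indices on slides 26-27 and is not stated explicitly in the slides.)

    import math

    # Tile grid parameters taken from the tile_type_info structure on slide 27.
    TILE_SIZE_DEG = 1.0        # 'x_size' / 'y_size'
    PIXELS_PER_TILE = 4000     # 'x_pixels' / 'y_pixels'
    PIXEL_SIZE_DEG = 0.00025   # 'x_pixel_size' / 'y_pixel_size'

    def tile_index(lon, lat):
        """(x_index, y_index) of the 1-degree tile containing a point, assuming
        indices are the integer degrees of the tile's south-west corner."""
        return int(math.floor(lon / TILE_SIZE_DEG)), int(math.floor(lat / TILE_SIZE_DEG))

    def pixel_offset(lon, lat):
        """(column, row) of a point within its tile, assuming row 0 lies on the
        tile's northern edge (the array orientation is an assumption)."""
        x_index, y_index = tile_index(lon, lat)
        col = int((lon - x_index * TILE_SIZE_DEG) / PIXEL_SIZE_DEG)
        row = int(((y_index + 1) * TILE_SIZE_DEG - lat) / PIXEL_SIZE_DEG)
        return col, row

    print(tile_index(149.13, -35.28))    # (149, -36): the tile covering Canberra
    print(pixel_offset(149.13, -35.28))  # position within that 4000 x 4000 tile
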
  • 14-15. Data Provenance in the AG-DC
    - Tiles link to their source dataset records in the DB for provenance; the tiles themselves carry no metadata per se.
    - Dataset provenance must be provided by lookups to authoritative metadata.
    - Composite dataset outputs can contain pixel-based provenance, e.g. a four-month non-interpolated median NDVI for the entire Murray-Darling Basin in which each and every pixel can be traced back to its source observation through provenance information layers.
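
(The per-pixel provenance idea on slides 14-15 can be sketched as follows; this is an illustration only, not the AG-DC code. It assumes "non-interpolated median" means choosing, at each pixel, the actual observation closest to the median, so every output pixel maps to exactly one real input acquisition.)

    import numpy as np

    def median_with_provenance(stack, acquisition_ids, nodata=-999):
        """Per-pixel median of a (time, y, x) stack plus a provenance layer.

        Returns (median, provenance), where provenance[y, x] holds the id of the
        acquisition whose value was chosen, or -1 where no valid data exists."""
        data = np.ma.masked_equal(stack, nodata)
        median = np.ma.median(data, axis=0).filled(nodata)

        # Pick the observation closest to the median at each pixel, so the
        # composite stays traceable to real observations (no interpolation).
        diff = np.abs(data - median[np.newaxis, :, :])
        chosen = np.ma.argmin(diff, axis=0)
        ids = np.asarray(acquisition_ids)
        all_masked = np.ma.getmaskarray(data).all(axis=0)
        provenance = np.where(all_masked, -1, ids[chosen])
        return median, provenance

    # Tiny worked example: three acquisitions over a 2 x 2 tile.
    stack = np.array([[[100, -999], [300, 400]],
                      [[110,  210], [-999, 410]],
                      [[120,  220], [310, 420]]], dtype=np.int16)
    med, prov = median_with_provenance(stack, acquisition_ids=[2001, 2002, 2003])
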
  • 16-17. Current AG-DC Holdings
    - Landsat source scenes: currently approx. 636,000 scene datasets.
    - AG-DC tiles: currently approx. 4M tiles.

  • 18. Current Tile Contents (for Landsat 5 & 7)
    - Level 1 Topographic (ORTHO), Byte datatype: 1. LS5-B60 Thermal Infrared, or 1. LS7-B61 Thermal Infrared Low Gain and 2. LS7-B62 Thermal Infrared High Gain.
    - ARG-25 (NBAR), Int16 datatype: 1. LS5/7-B10 Visible Blue, 2. LS5/7-B20 Visible Green, 3. LS5/7-B30 Visible Red, 4. LS5/7-B40 Near Infrared, 5. LS5/7-B50 Middle Infrared 1, 6. LS5/7-B70 Middle Infrared 2.
    - Pixel Quality (PQA)*, UInt16 datatype: 1. PQ bit-array of PQ tests.
    - Fractional Cover (FC)**, Int16 datatype: 1. Photosynthetic Veg. (PV), 2. Non-Photosynthetic Veg. (NPV), 3. Bare Soil (BS), 4. Un-mixing Error (UE).
    - Digital Surface Model (DSM)***, Float32 datatype: 1. Elevation, 2. Slope, 3. Aspect.
    - * PQA: Geoscience Australia.
    - ** FC: QDERM. Currently only a 3x2 path/row test area of FC data is held in the AG-DC; the load is planned to be complete by end June 2014.
    - *** Single, static source dataset, i.e. not time varying; resampled from 1 DSM. Licensed for Government Use Only.

  • 19. [Image slide]

  • 20. Quality Assured Observations [image slide]

  • 21-24. Original DB Schema: Table Hierarchy
    - Acquisition (i.e. satellite / path / row / datetimes)
    - Dataset (e.g. L1T, NBAR, PQA, FC, etc.)
    - Tile (i.e. type / x_index / y_index)

  • 25. How It Works: What the AG-DC API Does

  • 26. derive_datasets Function and Data Structures
    def derive_datasets(self, input_dataset_dict, stack_output_info, tile_type_info):
    - input_dataset_dict: dict keyed by processing level (e.g. ORTHO, NBAR, PQA, DSM) containing all the tile info which can be used within the function.
      input_dataset_dict = {
          'NBAR': tile_info_dict,   # see schema below
          'ORTHO': tile_info_dict,  # see schema below
          'PQA': tile_info_dict     # see schema below
      }
      tile_info_dict = {
          'end_datetime': datetime.datetime(2000, 2, 9),
          'end_row': 77,
          'level_name': 'NBAR',
          'nodata_value': -999L,
          'path': 91,
          'satellite_tag': 'LS7',
          'sensor_name': 'ETM+',
          'start_datetime': datetime.datetime(2000, 2, 9),
          'start_row': 77,
          'tile_layer': 1,
          'tile_pathname': '/path/to/a/tile.tif',
          'x_index': 150,
          'y_index': -25
      }

  • 27. derive_datasets Data Structures (Cont'd)
    - stack_output_info: dict containing stack output information; obtained from the stacker object.
      stack_output_info = {
          'x_index': 144,
          'y_index': -36,
          'stack_output_dir': '/g/data/v10/tmp/ndvi',
          'start_datetime': None,  # Datetime object or None
          'end_datetime': None,    # Datetime object or None
          'satellite': None,       # String or None
          'sensor': None           # String or None
      }
    - tile_type_info: dict containing tile type information; obtained from the stacker object.
      tile_type_info = {
          'crs': 'EPSG:4326',
          'file_extension': '.tif',
          'file_format': 'GTiff',
          'format_options': 'COMPRESS=LZW,BIGTIFF=YES',
          'tile_directory': 'EPSG4326_1deg_0.00025pixel',
          'tile_type_id': 1L,
          'tile_type_name': 'Descriptive Name',
          'unit': 'degree',
          'x_origin': 0.0,
          'x_pixel_size': Decimal('0.00025000000000000000'),
          'x_pixels': 4000L,
          'x_size': 1.0,
          'y_origin': 0.0,
          'y_pixel_size': Decimal('0.00025000000000000000'),
          'y_pixels': 4000L,
          'y_size': 1.0
      }
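
(A sketch of what a derive_datasets implementation built on the structures from slides 26-27 might look like, computing an NDVI layer from an NBAR tile masked by PQA. This is an illustration only, not code from the AG-DC: the GDAL-based tile reading, the "all PQ bits set" quality test and the returned dict are assumptions; the NBAR band order, red = band 3 and NIR = band 4, follows slide 18.)

    import numpy as np
    from osgeo import gdal

    def derive_datasets(self, input_dataset_dict, stack_output_info, tile_type_info):
        nbar = input_dataset_dict.get('NBAR')
        pqa = input_dataset_dict.get('PQA')
        if nbar is None:
            return None  # no NBAR tile for this timeslice, nothing to derive

        nbar_ds = gdal.Open(nbar['tile_pathname'])
        red = nbar_ds.GetRasterBand(3).ReadAsArray().astype(np.float32)
        nir = nbar_ds.GetRasterBand(4).ReadAsArray().astype(np.float32)

        nodata = nbar['nodata_value']
        valid = (red != nodata) & (nir != nodata) & ((nir + red) != 0)
        if pqa is not None:
            pq = gdal.Open(pqa['tile_pathname']).GetRasterBand(1).ReadAsArray()
            valid &= (pq == np.uint16(0xFFFF))  # assumed test: every PQ bit passed

        ndvi = np.full(red.shape, -999.0, dtype=np.float32)
        ndvi[valid] = (nir[valid] - red[valid]) / (nir[valid] + red[valid])

        # The real API would write this layer into the output stack under
        # stack_output_info['stack_output_dir']; returning it keeps the sketch
        # self-contained.
        return {'NDVI': ndvi}
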
  • 28. What Now?
    - Harden bodgy prototype code; optimise the crufty DB schema.
    - Streamline the internal workflow to minimise the supporting (Python) logic required: we want a complete description of a temporal stack from a single SQL query.
    - Open-source all code.
    - Move from stacked 2D files to dense, contiguous, indexed NetCDF files (where appropriate).
    - Generalise dimensionality and parameterise array order, file and chunk size to achieve the best performance across all common use cases in a given environment.

  • 29. Large-Scale Analysis: NFRIP Water Detection
    - 15 years of data from LS5 & LS7 (1998-2012) at 25 m nominal pixel resolution.
    - Approx. 133,000 individual source scenes in approx. 12,400 passes.
    - Entire archive of 1,312,087 ARG25 tiles => 21 × 10^12 pixels visited (see the arithmetic sketch after slide 30).
    - Originally 2 days at NCI (elapsed time) to compute; now ~6 hrs.

  • 30. Menindee Lakes 1998-2012 (Water Management)
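
(A back-of-the-envelope check of the pixel count quoted on slide 29, using the 4000 x 4000 tile size from the tile_type_info on slide 27:)

    tiles = 1312087                  # ARG25 tiles in the archive (slide 29)
    pixels_per_tile = 4000 * 4000    # tile dimensions (slide 27)
    print(tiles * pixels_per_tile)   # 20,993,392,000,000, i.e. roughly 21 x 10^12 pixels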