Cloud Technologies for Computational Sciences
Sergey Berezin, Dmitry Grechka
Moscow State University and Microsoft Research


TRANSCRIPT

1. Cloud Technologies for Computational Sciences
Sergey Berezin, Dmitry Grechka
Moscow State University and Microsoft Research

2. Common tasks for computational scientists
Primary: build models; fit models with data; simulate using the fitted models; validate and adjust models; share results; reproduce the results of others.
Non-primary: find useful data; choose a dataset among those available; prepare the data for use.
Example: species occurrence probability related to temperature and precipitation.

3. Fetch function
(Figure: fetching data for Jan 2009 through Jan 2013.)

4. Defensible science
Uncertainty reflects incomplete knowledge of the quantity: noise standard deviation, confidence interval, credible interval.
Reproducibility: the fetch function always returns the same value v for the same values of its arguments.
Provenance: what type of source data d was used to compute the value.
(See the fetch-signature sketch after slide 14.)

5. Data cube fetch
(Figure: a data cube covering the years 2009 through 2013.)

6. Demo
FetchClimate web interface: http://fetchclimate2.cloudapp.net
FetchClimate API: http://jsfiddle.net/sergey77/swePD

7. Common Data Model (CDM)
A dataset is a set of constrained variables. A variable is an annotated multidimensional array, e.g. prate over the dimensions lon, lat and time.
http://www.unidata.ucar.edu/software/netcdf-java/CDM/
We use the Dmitrov package as the CDM layer: http://research.microsoft.com/en-us/um/cambridge/groups/science/tools/dmitrov/
(See the data-model sketch after slide 14.)

8. Data sets variety
Time series at scattered points; long-term averaged grids; time series grids.

9. Data sets variety: time series at scattered points
Global Historical Climatology Network (GHCN v2): 21310 stations, monthly averages.
[6] time of type DateTime (time:3732)
[5] id of type UInt64 (stations:21310)
[4] lon of type Single (stations:21310)
[3] lat of type Single (stations:21310)
[2] prate of type Int32 (stations:21310) (time:3732)
[1] temp of type Int32 (stations:21310) (time:3732)
Peterson, Thomas C. and Russell S. Vose (1997). "An overview of the Global Historical Climatology Network temperature data base". Bulletin of the American Meteorological Society 78 (12): 2837-2849.

10. Data sets variety: long-term averaged grids
CRU CL 2.0, WorldClim 1.4. WorldClim 1.4 (~34.7 GB): high spatial resolution (~1 km at the equator), 50-year averages for 12 separate months.
[5] time of type Int32 (time:12)
[4] lat of type Single (lat:18000)
[3] lon of type Single (lon:43200)
[2] prec of type Int16 (time:12) (lat:18000) (lon:43200)
[1] tmean of type Int16 (time:12) (lat:18000) (lon:43200)
Hijmans, R.J., S.E. Cameron, J.L. Parra, P.G. Jones and A. Jarvis, 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965-1978.

11. Data sets variety: time series grids
NCEP/NCAR Reanalysis 1: high temporal resolution (6-hour averages), ~200 km at the equator.
[4] lat of type Single (lat:94)
[3] lon of type Single (lon:192)
[2] time of type Double (time:92044)
[1] prate of type Int16 (lat:94) (lon:192) (time:92044)
http://www.cpc.ncep.noaa.gov/products/wesley/reanalysis.html

12. Fetch function logic
(Diagram only.)

13. Uncertainty evaluation
Statistical approach, for each dataset: find an external set of reference values at reference sites (considered to be exact); generate a sample of corresponding pairs (value computed using the dataset, reference value); discover how the difference depends on the spatiotemporal location of the reference site.
Uncertainty propagation: sequential use of methods that take uncertainty into account.
(See the uncertainty sketch after slide 14.)

14. Data handling
(Diagram only.)
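Sketch: fetch signature. Slides 2-5 describe the fetch function only through its contract: for the same arguments it must return the same value, together with an uncertainty estimate and the provenance of the source data. The C# sketch below is a hypothetical illustration of such a signature; FetchResult, IFetchService and the parameter names are assumptions, not the actual FetchClimate API.

using System;

// Hypothetical result type: a fetched value together with its uncertainty
// (standard deviation) and provenance (which source dataset produced it).
public record FetchResult(double Value, double StandardDeviation, string SourceDataset);

public interface IFetchService
{
    // Reproducibility: the same arguments must always yield the same FetchResult.
    FetchResult Fetch(string variable,
                      double latMin, double latMax,
                      double lonMin, double lonMax,
                      DateTime start, DateTime end);
}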
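Sketch: data model. Slide 7 defines a dataset as a set of constrained variables, where a variable is an annotated multidimensional array. The slides use the Dmitrov package for this layer; the C# sketch below is not Dmitrov's API, only a hypothetical rendering of the same data model, with the constraint that variables sharing a dimension name must agree on its length.

using System;
using System.Collections.Generic;

// A variable is an annotated multidimensional array: data plus the names
// of its dimensions plus free-form metadata (units, source, ...).
public class Variable
{
    public string Name;                                   // e.g. "prate"
    public string[] Dimensions;                           // e.g. { "time", "lat", "lon" }
    public Array Data;                                    // e.g. Int16[time, lat, lon]
    public Dictionary<string, string> Metadata = new();   // annotations
}

// A dataset is a set of variables constrained to share dimensions.
public class DataSet
{
    public Dictionary<string, int> Dimensions = new();    // dimension name -> length
    public List<Variable> Variables = new();

    public void Add(Variable v)
    {
        for (int i = 0; i < v.Dimensions.Length; i++)
        {
            string dim = v.Dimensions[i];
            int len = v.Data.GetLength(i);
            if (Dimensions.TryGetValue(dim, out int existing) && existing != len)
                throw new ArgumentException($"Dimension {dim} length mismatch");
            Dimensions[dim] = len;
        }
        Variables.Add(v);
    }
}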
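Sketch: uncertainty evaluation. For slide 13, a minimal version of the statistical approach can be written down directly: collect pairs of (value computed from the dataset, reference value considered exact) and summarize their differences. The sketch below, with assumed names, stops at a single standard deviation; the slides go further and model how the difference depends on the spatiotemporal location of the reference site.

using System;
using System.Collections.Generic;
using System.Linq;

public static class UncertaintyEvaluation
{
    // pairs: (value computed from the dataset, reference value considered exact).
    // Returns the sample standard deviation of the differences as a crude
    // uncertainty estimate for the dataset.
    public static double EstimateSigma(IEnumerable<(double computed, double reference)> pairs)
    {
        var diffs = pairs.Select(p => p.computed - p.reference).ToArray();
        if (diffs.Length < 2) throw new ArgumentException("Need at least two pairs");
        double mean = diffs.Average();
        double variance = diffs.Sum(d => (d - mean) * (d - mean)) / (diffs.Length - 1);
        return Math.Sqrt(variance);
    }
}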
15. Chunked array storage
(Diagram: the same array laid out as linear array storage versus chunked array storage, as in HDF5.)
(See the chunk-indexing sketch after slide 23.)

16. Choosing the right part size
(Chart only.)

17. MD array storage for Azure: DataSet interface
(Diagram only.)

18. FetchClimate requests
Very different processing times: from seconds to hours.
Low-latency request scheduler; node restart protection; partitioning & round-robin scheduler.
Generation of large datasets; repeatable requests; server-side cache.

19. FetchClimate top-level diagram
An Azure load balancer in front of several frontend roles; a pool of worker roles; a cache/queue in Azure blob storage; IDataHandler; a configuration database; Azure chunked array storage.
Per-request record: request hash; status = Pending | Running | Completed | Failed; submit time; touch time; part count / total parts.

20. Data source handler
interface IRequestContext
{
    FetchRequest Request { get; }
    Task GetMaskAsync(Array uncertainty);
    DataStorageDefinition StorageDefinition { get; }
    Task GetDataAsync(StorageRequest[] requests);
    Task FetchDataAsync(FetchRequest[] requests);
}

abstract class DataSourceHandler
{
    public abstract Task ProcessRequestAsync(IRequestContext ctx);
}
(See the example handler after slide 23.)

21. Partitioning
Splitting one big request into many small ones: take advantage of parallel processing; protect the system from huge tasks.
Open questions: who joins the multiple partitions back together (requires a concurrency-safe dataset)? How to choose the partition size?
(See the partitioning sketch after slide 23.)

22. Future work
Improving uncertainty handling: automatic uncertainty generation.
Improving the request scheduler: accounting for the differing complexity of uncertainty computation; choosing part size; predicting processing time; possibly using an off-the-shelf solution.
Averaging over large temporal-spatial regions: a data pyramid?

23. Questions?
[email protected]
[email protected]
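Sketch: chunk indexing. Slides 15-16 contrast linear and chunked (HDF5-style) array storage and the problem of choosing the right part size. The following hypothetical helper shows only the index arithmetic behind chunking for a 2D array: an element index is mapped to a tile and an offset inside that tile, so a small region can be read without scanning the whole linear layout.

// Hypothetical chunk addressing for a 2D array stored as fixed-size tiles,
// in the spirit of HDF5 chunked storage (slide 15).
public static class ChunkedIndex
{
    // A rows-by-cols array is split into chunkRows-by-chunkCols tiles.
    public static (int chunkRow, int chunkCol, int offsetInChunk) Locate(
        int row, int col, int chunkRows, int chunkCols)
    {
        int cr = row / chunkRows;          // which tile, vertically
        int cc = col / chunkCols;          // which tile, horizontally
        int localRow = row % chunkRows;    // position inside the tile
        int localCol = col % chunkCols;
        return (cr, cc, localRow * chunkCols + localCol);
    }
}

Larger chunks mean fewer storage requests per region but more unneeded data read for small regions, which is the kind of trade-off behind slide 16's "choosing the right part size".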
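Sketch: example handler. To make slide 20's interface concrete, here is a minimal hypothetical handler written against the types shown there (FetchRequest, StorageRequest, DataStorageDefinition, IRequestContext, DataSourceHandler). The slides do not show an implementation, so the body only illustrates the calling pattern and makes no claim about how a real FetchClimate data source handler behaves.

using System;
using System.Threading.Tasks;

// Hypothetical handler for a single data source, using the interface from slide 20.
class ExampleDataSourceHandler : DataSourceHandler
{
    public override async Task ProcessRequestAsync(IRequestContext ctx)
    {
        // The fetch request describes the variable, region and time range asked for.
        FetchRequest request = ctx.Request;

        // StorageDefinition describes the layout of this source's underlying storage.
        DataStorageDefinition storage = ctx.StorageDefinition;

        // Load the raw values needed to answer the request; an empty array keeps
        // the sketch compilable without assuming StorageRequest's real constructor.
        await ctx.GetDataAsync(Array.Empty<StorageRequest>());
    }
}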
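Sketch: partitioning. Slide 21 splits one big request into many small ones and asks how to choose the partition size. The simplest conceivable scheme is sketched below with assumed names: a time range is cut into consecutive parts of at most a fixed number of days, each of which can then be scheduled and processed independently. The real partitioning in FetchClimate is certainly richer than this, and, as the slide notes, the partial results must still be joined back through a concurrency-safe dataset.

using System;
using System.Collections.Generic;

public static class RequestPartitioner
{
    // Split [start, end) into consecutive sub-ranges of at most maxDays days,
    // so each part is a small, independently schedulable unit of work.
    public static IEnumerable<(DateTime start, DateTime end)> SplitByTime(
        DateTime start, DateTime end, int maxDays)
    {
        if (maxDays <= 0) throw new ArgumentOutOfRangeException(nameof(maxDays));
        var cursor = start;
        while (cursor < end)
        {
            var next = cursor.AddDays(maxDays);
            if (next > end) next = end;
            yield return (cursor, next);
            cursor = next;
        }
    }
}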