Anatomy of a Climate Science-centric Workflow

Harinarayan Krishnan, CAlibrated and Systematic Characterization, Attribution, and Detection of Extremes (CASCADE Team)

Kevin Bensema, Surendra Byna, Soyoung Jeon, Karthik Kashinath, Burlen Loring, Pardeep Pall, Prabhat, Alexandru Romosan, Oliver Ruebel, Daithi Stone, Travis O'Brien, Christopher Paciorek, Michael Wehner, Wes Bethel, William Collins

Upload: abbott

Post on 22-Feb-2016


Challenges

Scale of data is already at TBs and will only grow larger.

Frequent processing of data at three- to six-hour intervals.

The focus is now on high resolution, 1/4 to 1/8 degree, extensible to higher resolutions.

High-resolution and high-frequency analysis add several orders of magnitude to the data volume.

Proposed Strategy

Identification of use cases, extraction of common computational algorithms, and scaling & optimization of current work.

Template workflow configurations of common use cases.

Abstraction of services to HPC environments.

Easy-to-use archiving, distribution, and verification strategies.

Standardization of parallel work environment.
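As a rough illustration of the data scale noted under Challenges, the following sketch estimates the storage for a single field over one year at different resolutions and sampling intervals. Everything in it is an assumption made for the arithmetic (a regular global latitude-longitude grid, one vertical level, 32-bit floats, a 365-day year, and a 1-degree daily baseline); none of the numbers are CASCADE figures.

```python
# Back-of-the-envelope size of one single-level field for one year.
# All constants below are illustrative assumptions, not CASCADE figures.
BYTES_PER_VALUE = 4      # assume 32-bit floats
DAYS_PER_YEAR = 365      # assume a no-leap calendar


def grid_points(degrees):
    """Number of cells on a regular global lat-lon grid of the given spacing."""
    return round(360 / degrees) * round(180 / degrees)


def bytes_per_year(degrees, hours_between_steps):
    """Bytes needed to store one field for one year at the given sampling."""
    steps = DAYS_PER_YEAR * 24 // hours_between_steps
    return grid_points(degrees) * steps * BYTES_PER_VALUE


baseline = bytes_per_year(1.0, 24)     # 1 degree, daily
quarter = bytes_per_year(0.25, 6)      # 1/4 degree, six-hourly
eighth = bytes_per_year(0.125, 3)      # 1/8 degree, three-hourly

for label, size in [("1 deg daily", baseline),
                    ("1/4 deg 6-hourly", quarter),
                    ("1/8 deg 3-hourly", eighth)]:
    print(f"{label:>18}: {size / 1e9:8.2f} GB  ({size // baseline}x baseline)")
```

Under these assumptions the 1/8 degree, three-hourly field is 512 times the size of the baseline (64 times more grid points, 8 times more time steps), before multiplying by the number of variables, vertical levels, and ensemble members.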

What it is / What it is not

What it is not:
Not a general workflow
Not a general infrastructure

Balancing between performance & exploratory science.

What it is:
A highly customized climate-centric API (Zonal Mean Averages, GEV, etc.)
Workflow: Verification/Validation, Job scheduling, Staging, Deployment, etc.
Modules: Performance & Timing Support, Calendar Support, etc.
Template workflows

For example:

    t = cascade.Teca()
    t['filename'] = myfile
    writer = cascade.Writer(cascade.ESGF)
    writer['input'] = t['out']
    n = workflow.NERSC(, writer)
    n.execute()

Note: Active work in progress and ongoing.

Climate Science-centric Workflow

Workspace: A collaboration environment to share and track documents, visualize status, and update issues.

One-on-one: Identify use cases that require implementing new features or scaling & performance optimization of existing ones.

Software tools: Development and deployment of algorithms & software packages, as well as building & maintaining packages for target environments.

Workflow components: Connecting it all together.

Communication Infrastructure

Quick Note: Software Environment

Infrastructure: cascade.lbl.gov / esg02.nersc.gov

Confluence: Portal to publish and collaborate with team members.

Jira: Bug & issue tracking portal.

CDash/Jenkins: Infrastructure to report status of software build & regression tests.

BitBucket: Main software repository.

ESGF service: Service for distribution of data generated by CASCADE.

CASCADE Team

Detection & Attribution Team: Characterization, detection, and attribution of simulated and observed extremes in a variety of different contexts -- Analysis Algorithms

Model Fidelity: Evaluation and improvement of model fidelity in simulating extremes.

Statistics: Development of statistical frameworks for extremes analysis, uncertainty quantification, and model evaluation.

Formulation of highly parallel software for analysis and uncertainty quantification of extremes.

Analysis Infrastructure Tasks

Development of new climate-centric algorithms and evaluation of current ones. Implement scalable, parallel versions as needed.

Performance analysis and data management.

Deployment and Maintenance on HPC environments.

Creating a standardized environment: Provide the same execution environment on all deployed platforms, and seamlessly bridge different technologies (Python, R).

User Support.

Detection & Attribution

Single Program Multiple Data (SPMD): Scripts refactoring current algorithms to work in parallel.

Distribution/Staging: Functionality to distribute generated data through ESGF and to stage data at NERSC.

TECA: Active development of the Parallel Toolkit for Extreme Climate Analysis.

Teleconnections: Ensemble analysis & software solutions to investigate the frequency of teleconnection events.
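The SPMD refactoring mentioned above usually comes down to every process running the same script and using its rank to pick its share of the work. The sketch below shows one common way to split time steps into contiguous blocks. The function names (`block_range`, `analyze`) are hypothetical and are not taken from the CASCADE scripts, and the ranks are looped over serially so the example runs without MPI.

```python
def block_range(n_items, rank, size):
    """Return the half-open [start, stop) slice of work owned by `rank`.

    Items are split into `size` contiguous blocks whose lengths differ
    by at most one; the first `n_items % size` ranks get the extra item.
    """
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def analyze(time_steps):
    """Placeholder for a per-rank analysis; here it just counts steps."""
    return len(time_steps)


if __name__ == "__main__":
    # In a real SPMD script, rank and size would typically come from
    # mpi4py (MPI.COMM_WORLD.Get_rank() / Get_size()); here we loop over
    # the ranks serially so the sketch runs without MPI.
    n_steps, size = 10, 4
    steps = list(range(n_steps))
    for rank in range(size):
        start, stop = block_range(n_steps, rank, size)
        print(rank, steps[start:stop], analyze(steps[start:stop]))
```

With 10 steps and 4 ranks, the blocks have lengths 3, 3, 2, and 2, so no rank is more than one step busier than another.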

Model Fidelity

ILIAD workflow

The parallelization of the generation of initial conditions.
Dynamic building, compilation & execution of CESM.
Module verification: monitor execution status & successful completion.
Module for automation of archiving of output (initial conditions, namelist files, CESM output).

DepCache: External tool for speeding up execution of Python libraries.

Statistics

Integration of Statistical Algorithms: Working to deploy relevant statistical algorithms within the CASCADE framework.

Parallelization: Scaling statistics scripts to work in a parallel environment.

llex installation: Generalized Extreme Value (GEV) analysis & Peaks Over Threshold (POT) statistical analysis algorithms (developed by Stats team members).
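For readers unfamiliar with GEV analysis, the sketch below evaluates the textbook GEV distribution function and the return level derived from it for given location, scale, and shape parameters. It is not the llex package or its API, which the slides do not show; the function names and the parameter values are hypothetical, and fitting the parameters to data is out of scope here.

```python
import math


def gev_cdf(z, mu, sigma, xi):
    """GEV cumulative distribution function F(z) (sigma > 0)."""
    if xi == 0.0:                       # Gumbel limit
        return math.exp(-math.exp(-(z - mu) / sigma))
    t = 1.0 + xi * (z - mu) / sigma
    if t <= 0.0:                        # outside the support
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))


def gev_return_level(return_period, mu, sigma, xi):
    """Level exceeded on average once every `return_period` blocks."""
    p = 1.0 / return_period             # exceedance probability per block
    y = -math.log(1.0 - p)
    if xi == 0.0:
        return mu - sigma * math.log(y)
    return mu - (sigma / xi) * (1.0 - y ** (-xi))


# Hypothetical parameters, e.g. for annual maxima of some variable.
mu, sigma, xi = 30.0, 5.0, 0.1
z100 = gev_return_level(100, mu, sigma, xi)
print(z100, gev_cdf(z100, mu, sigma, xi))
```

The return level is the inverse of the distribution function at probability 1 - 1/T, so evaluating `gev_cdf` at the 100-block return level recovers 0.99 up to rounding.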

Software Suite

Python environment: IPython, mpi4py, numpy, CDAT-Core (cdms2, cdtime, ...), Rpy2 (Python-R bridge)

R environment:
extRemes, ismev
Llex: GEV & POT (Dr. Chris Paciorek's package)
pbdR: pbdMPI, pbdSLAP, pbdPROF, pbdNCDF (ORNL)

TECA: parallel toolkit developed at LBNL (TC, ETC, AR)

- Prototype deployment at NERSC (module load cascade)
- Transitioning maintenance of NERSC ESGF Node to CASCADE analysis group.

Workflow Infrastructure

Unified Workflow Service: Load-balanced services that handle job scheduling, validation & verification, and fault tolerance.

Core Modules

Calendar support
Data Reduction Operations (Sum, Max, Min, Average, etc.)
I/O services (Parallel Read/Write)
Threading/MPI wrapping (Map|Foreach)
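To make the reduction and Map|Foreach ideas above concrete, here is a minimal sketch that maps a named reduction over the rows of a small 2D field, which yields a zonal mean when the rows are latitudes. The names `REDUCTIONS`, `parallel_map`, and `reduce_rows` are hypothetical and are not the CASCADE core module API. A thread pool stands in for the Threading/MPI wrapping; in CPython, threads mainly help with I/O-bound work, so a real deployment would more likely use processes or MPI.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean

# Named reductions, mirroring the operations listed above.
REDUCTIONS = {"sum": sum, "max": max, "min": min, "average": fmean}


def parallel_map(func, items, workers=4):
    """Apply `func` to every item using a thread pool, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


def reduce_rows(field, op):
    """Reduce each row of a 2D field (list of rows) with the named op."""
    return parallel_map(REDUCTIONS[op], field)


# Tiny hypothetical field: 3 latitudes x 4 longitudes.
field = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 2.0, 2.0, 2.0],
    [0.0, 4.0, 8.0, 12.0],
]
zonal_mean = reduce_rows(field, "average")   # mean over longitude per latitude
zonal_max = reduce_rows(field, "max")
print(zonal_mean, zonal_max)
```

Keeping the reductions in a lookup table means a new operation can be added without touching the mapping code.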

Additional Services

MPO: A tool for recording scientific workflows, developed by General Atomics & LBNL.

Tigres: Template Interfaces for Agile Parallel Data-Intensive Science, developed by the Advanced Computing for Science Group at LBNL.

ESGF: Support for automated distribution through ESGF installation.

Modules & API

CoreModule
Timing, Logging
Standard definition of parameter inputs & outputs
All modules are inherently workflows of one, with implicit connectivity of the workflow.

BaseAPI (Pythonic)
__getitem__, __setitem__: param['input'] = val
cascade_static_{param|output}_spec: {name, value, type, user_defined}
cascade_execute: core execution function
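One way to read the BaseAPI description above is sketched below: a base class whose static specs declare parameters and outputs, dictionary-style access for setting and reading them, and an `execute` wrapper that adds timing and logging around `cascade_execute`. The method and attribute names come from the slide, but their behavior here is assumed, and the `Teca` class is a stand-in that does no real analysis.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)


class CascadeBase:
    """Hypothetical base class; names follow the slide, behavior is assumed."""

    # Each spec entry: {name, value, type, user_defined}
    cascade_static_param_spec = []
    cascade_static_output_spec = []

    def __init__(self):
        self._params = {s["name"]: dict(s) for s in self.cascade_static_param_spec}
        self._outputs = {s["name"]: dict(s) for s in self.cascade_static_output_spec}
        self.elapsed = None

    def __setitem__(self, name, value):
        spec = self._params[name]            # KeyError for unknown parameters
        if not isinstance(value, spec["type"]):
            raise TypeError(f"{name} expects {spec['type'].__name__}")
        spec["value"], spec["user_defined"] = value, True

    def __getitem__(self, name):
        store = self._params if name in self._params else self._outputs
        return store[name]["value"]

    def cascade_execute(self):
        raise NotImplementedError

    def execute(self):
        """Run cascade_execute with timing and logging around it."""
        start = time.perf_counter()
        self.cascade_execute()
        self.elapsed = time.perf_counter() - start
        logging.info("%s finished in %.6f s", type(self).__name__, self.elapsed)


class Teca(CascadeBase):
    """Stand-in module: 'analyzes' a file by reporting its name length."""

    cascade_static_param_spec = [
        {"name": "filename", "value": None, "type": str, "user_defined": False},
    ]
    cascade_static_output_spec = [
        {"name": "outputdata", "value": None, "type": int, "user_defined": False},
    ]

    def cascade_execute(self):
        self._outputs["outputdata"]["value"] = len(self["filename"])


t = Teca()
t["filename"] = "myfile"
t.execute()
print(t["outputdata"])
```

Declaring the specs statically lets a workflow service inspect what a module needs and produces without running it, which is one plausible reason for the `cascade_static_*_spec` naming.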

Example Workflow

Example use case: Running a single module

    t = Teca()  # where Teca is a derived class of CascadeBase
    filename = 'myfile'
    t['filename'] = filename
    t.execute()

    t1 = Teca()          # where Teca is a derived class of CascadeBase
    t2 = TecaAnalysis()  # where TecaAnalysis is a derived class of CascadeBase
    t2['inputdata'] = t1['outputdata']  # note, this establishes a link
    t2.execute()

Proposed Workflow

    t1 = Teca()
    t2 = TecaAnalysis()
    t3 = TecaAnalysis()
    s = Diff()

    t2['inputdata'] = t1['outputdata']
    t3['inputdata'] = t1['outputdata']

    s['inputdata1'] = t2['outputdata']
    s['inputdata2'] = t3['outputdata']
    s.write(prefix, file)

    s.execute()
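The proposed workflow above forms a diamond: one source feeds two analyses whose results are then differenced. The sketch below shows one possible way such links could be resolved, assuming that reading an output returns a link object and that executing the final module pulls its upstream modules on demand, each at most once. The classes `Link`, `Module`, `Source`, and `Scale` are hypothetical stand-ins and are not the CASCADE implementation.

```python
class Link:
    """Reference to a named output of an upstream module."""

    def __init__(self, module, name):
        self.module, self.name = module, name


class Module:
    """Hypothetical base: reading an output yields a Link, not a value."""

    def __init__(self):
        self.inputs, self.outputs, self.runs = {}, {}, 0

    def __setitem__(self, name, value):
        self.inputs[name] = value

    def __getitem__(self, name):
        return Link(self, name)

    def resolve(self, name):
        """Fetch an input, executing the upstream module first if needed."""
        value = self.inputs[name]
        if isinstance(value, Link):
            value.module.execute()
            return value.module.outputs[value.name]
        return value

    def execute(self):
        if self.runs == 0:          # run each module at most once
            self.run()
            self.runs += 1

    def run(self):
        raise NotImplementedError


class Source(Module):               # plays the role of Teca
    def run(self):
        self.outputs["outputdata"] = list(self.resolve("values"))


class Scale(Module):                # plays the role of TecaAnalysis
    def run(self):
        factor = self.resolve("factor")
        self.outputs["outputdata"] = [factor * v for v in self.resolve("inputdata")]


class Diff(Module):
    def run(self):
        a, b = self.resolve("inputdata1"), self.resolve("inputdata2")
        self.outputs["outputdata"] = [x - y for x, y in zip(a, b)]


t1, t2, t3, s = Source(), Scale(), Scale(), Diff()
t1["values"] = [1, 2, 3]
t2["factor"], t3["factor"] = 3, 1
t2["inputdata"] = t1["outputdata"]
t3["inputdata"] = t1["outputdata"]
s["inputdata1"] = t2["outputdata"]
s["inputdata2"] = t3["outputdata"]
s.execute()
print(s.outputs["outputdata"], t1.runs)
```

Only `s.execute()` is called explicitly; the links pull `t2`, `t3`, and `t1` in turn, and the run counter keeps the shared source from executing twice.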

Recap: Anatomy of Climate Science-centric Workflow

Software Environment: Development, Deployment, and Maintenance

Custom Use Case: Support for D&A, Model Fidelity, and Statistics team needs.

Software Suite: Scaling, Parallelism, Performance Management, Software Services (Python, R, TECA)

Workflow Development: Thin Client & Workflow service, Module development, Optimization (Data Movement, Workflow execution), Provenance.

Thanks

Questions?