RISICO on the GRID architecture
First implementation
Mirko D'Andrea, Stefano Dal Pra
Outline of the presentation
➲ Porting features;
➲ Job management;
➲ Implementation tests and results;
➲ Conclusions and further development.
Porting features
➲ Implemented entirely in Python.
➲ Uses the same executable as the RISICO system (no changes needed).
➲ Easily configurable through a configuration file.
The RISICO system
➲ Italy: 310,000 km^2.
➲ Current system: 300k regular cells, 1 km side.
➲ Grid version: 30M regular cells, 0.1 km side.
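The cell counts above follow from the area and the cell side. A minimal sketch of that arithmetic, assuming square cells that tile the whole area (the function name `cell_count` is illustrative, not part of RISICO):

```python
ITALY_AREA_KM2 = 310_000  # area quoted on the slide


def cell_count(area_km2: float, side_km: float) -> int:
    """Number of square cells with the given side needed to tile the area."""
    return round(area_km2 / (side_km ** 2))


current = cell_count(ITALY_AREA_KM2, 1.0)   # current system, 1 km cells
grid = cell_count(ITALY_AREA_KM2, 0.1)      # grid version, 0.1 km cells

print(current, grid, grid // current)
```

Reducing the cell side by a factor of 10 multiplies the number of cells by 100, which is consistent with the rounded figures of 300k and 30M cells.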
RISICO vs GRID-RISICO
RISICO:
➲ Get input from database;
➲ Run RISICO;
➲ Write output to database.
GRID-RISICO (after gridification):
➲ Get input from database;
➲ Upload input into catalog;
➲ Create n jobs;
➲ Job i (i = 1, ..., n): get input from catalog, run RISICO on dataset i, write output i to catalog;
➲ Collect outputs from catalog;
➲ Write outputs to database.
Job submission
➲ A RISICO job is fully defined by a JDL (Job Description Language) file and a parameter file.
➲ Each submitted job must terminate successfully within a defined time. Job activity is monitored by a software module called JobMonitor.
➲ Job submission is handled by a JobSubmitter, which creates a set of jobs and associates a JobMonitor with each job.
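As an illustration of what a per-job JDL file could contain, here is a minimal sketch that generates one as text. The attribute names (`Executable`, `Arguments`, `StdOutput`, `StdError`, `InputSandbox`, `OutputSandbox`) are standard JDL attributes; the wrapper script name and all file names are hypothetical, since the slides do not show the actual JDL used.

```python
def make_jdl(job_id: int, executable: str = "run_risico.sh") -> str:
    """Build the text of a JDL file for one job.

    The executable and file names are placeholders for illustration only.
    """
    tag = f"{job_id:02d}"
    params = f"params_{tag}.txt"   # hypothetical parameter file name
    stdout = f"job_{tag}.out"
    stderr = f"job_{tag}.err"
    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{params}";',
        f'StdOutput = "{stdout}";',
        f'StdError = "{stderr}";',
        f'InputSandbox = {{"{executable}", "{params}"}};',
        f'OutputSandbox = {{"{stdout}", "{stderr}"}};',
    ]
    return "\n".join(lines) + "\n"


jdl = make_jdl(1)
print(jdl)
```

Keeping the JDL generic and putting the job-specific details in the parameter file means the same executable can be reused unchanged for every job.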
Job Monitoring
➲ Every job is monitored by an instance of a module called JobMonitor.
➲ The JobMonitor checks the job status during execution and retrieves the job output from the catalog.
➲ If the job fails, the JobMonitor tries to resubmit it. It logs the error if the job fails to run correctly.
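The monitoring behaviour described above can be sketched as a poll-and-retry loop. This is not the actual JobMonitor code: the callables `submit`, `status` and `fetch_output`, the state names `"DONE"` and `"FAILED"`, and the retry limit are all assumptions made for illustration.

```python
import logging
import time
from typing import Callable, Optional

log = logging.getLogger("jobmonitor")


def monitor(
    submit: Callable[[], str],            # submits the job, returns a job id
    status: Callable[[str], str],         # returns the state of a job id
    fetch_output: Callable[[str], str],   # retrieves the output from the catalog
    timeout_s: float,
    max_retries: int = 3,
    poll_s: float = 1.0,
) -> Optional[str]:
    """Submit a job, poll it until done, and resubmit on failure or timeout."""
    for attempt in range(1, max_retries + 1):
        job_id = submit()
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            state = status(job_id)
            if state == "DONE":
                return fetch_output(job_id)
            if state == "FAILED":
                break  # stop polling and resubmit
            time.sleep(poll_s)
        log.error("job %s failed or timed out (attempt %d/%d)",
                  job_id, attempt, max_retries)
    return None  # every attempt failed


# Usage with stand-in callables: the first submission fails, the second succeeds.
ids = iter(["a", "b"])
states = {"a": "FAILED", "b": "DONE"}
result = monitor(lambda: next(ids), lambda j: states[j],
                 lambda j: f"output-{j}", timeout_s=1.0, poll_s=0.0)
print(result)
```

Passing the grid operations in as callables keeps the retry logic independent of the middleware commands used to submit and query jobs.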
Workflow: job creation, submission and data-collection
➲ Downloads the input from the remote meteo-data database, creates an archive and uploads it to the catalog;
➲ Creates a JDL file and a parameters file for each job;
➲ Submits the jobs;
➲ Waits for the job outputs;
➲ Gets the job outputs from the catalog and aggregates them.
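The workflow steps above follow a scatter/gather pattern. The sketch below mimics it locally, with a dictionary standing in for the catalog and a thread pool standing in for the grid; `run_workflow` and `run_job` are hypothetical names, and the real system submits grid jobs rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_workflow(meteo_input: str, n_jobs: int,
                 run_job: Callable[[int, str], str]) -> List[str]:
    """Upload the shared input, run n jobs in parallel, collect their outputs."""
    catalog: Dict[str, str] = {}          # stand-in for the grid catalog
    catalog["input"] = meteo_input        # upload the shared input once

    def job(i: int) -> None:
        # Each job reads the shared input and writes to its own known key.
        catalog[f"output_{i:02d}"] = run_job(i, catalog["input"])

    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        list(pool.map(job, range(1, n_jobs + 1)))   # wait for all jobs

    # Gather the outputs in job order for aggregation.
    return [catalog[f"output_{i:02d}"] for i in range(1, n_jobs + 1)]


outputs = run_workflow("meteo", 3, lambda i, data: f"{data}-{i}")
print(outputs)
```

Because every job writes to a key that is known in advance, the collection step needs no communication with the jobs themselves.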
Job definition (1)
➲ Each job works with a specific dataset defining a spatial domain (subset).
➲ Such subsets are created off-line and stored on the catalog.
➲ A parameters file states the association between a job and a dataset.
➲ Each job produces an output whose path in the catalog is known a priori.
Job definition (2)
Job 1:
Domain: celle/celle_01.tar.bz2
Status: celle/stato0_01.tar.bz2
Input: input/input_20070119.tar.bz2
Output: output/output_01_20071119.tar.bz2
➲ Each job has its own domain.
➲ Job domain, status information and output all refer to the same geographical domain.
➲ All jobs share the same input file.
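Since the catalog paths follow a fixed naming pattern, the association between a job and its files can be derived from the job index. A minimal sketch, assuming a two-digit job index as in the example above; the function name `job_paths` is illustrative, and the dates are passed in separately because the example shows different date stamps for input and output.

```python
from typing import Dict


def job_paths(job_id: int, input_date: str, output_date: str) -> Dict[str, str]:
    """Catalog paths for one job, following the pattern shown on the slide."""
    tag = f"{job_id:02d}"
    return {
        "Domain": f"celle/celle_{tag}.tar.bz2",
        "Status": f"celle/stato0_{tag}.tar.bz2",
        "Input": f"input/input_{input_date}.tar.bz2",     # shared by all jobs
        "Output": f"output/output_{tag}_{output_date}.tar.bz2",
    }


paths = job_paths(1, "20070119", "20071119")
for key, value in paths.items():
    print(f"{key}: {value}")
```

Only the domain, status and output paths depend on the job index; the input path is the same for every job.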
Job definition (3)
CATALOG
Job 1:
Domain: celle/celle_01.tar.bz2
Status: celle/stato0_01.tar.bz2
Input: input/input_20070119.tar.bz2
Output: output/output_01_20071119.tar.bz2
Job 2:
Domain: celle/celle_02.tar.bz2
Status: celle/stato0_02.tar.bz2
Input: input/input_20070119.tar.bz2
Output: output/output_02_20071119.tar.bz2
Job n:
Domain: celle/celle_nn.tar.bz2
Status: celle/stato0_nn.tar.bz2
Input: input/input_20070119.tar.bz2
Output: output/output_nn_20071119.tar.bz2
Final version
➲ Estimated performance on the complete data set (30M cells):
Total CPU time: about 2 hours and 30 minutes;
Optimal number of jobs: about 30 (5-10 minutes of CPU time for each job);
Storage: 30 GB/day.
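The relation between total CPU time and job count can be checked with a short calculation. This sketch assumes the total CPU time is split evenly among the jobs, which the slides do not state explicitly; `job_count` is an illustrative name.

```python
import math


def job_count(total_cpu_min: float, per_job_min: float) -> int:
    """Jobs needed if the total CPU time is split evenly among them."""
    return math.ceil(total_cpu_min / per_job_min)


TOTAL_CPU_MIN = 150  # about 2 hours and 30 minutes, from the slide

jobs_at_5_min = job_count(TOTAL_CPU_MIN, 5)
jobs_at_10_min = job_count(TOTAL_CPU_MIN, 10)
print(jobs_at_5_min, jobs_at_10_min)
```

Under this assumption, the figure of about 30 jobs corresponds to the lower end of the 5-10 minute range.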
Test Results
➲ The port has been tested with a subset (1M cells) of the final RISICO working set.
➲ 10 parallel jobs were used.
➲ Performance:
Job CPU time: 30 seconds;
Grid overhead: 2 minutes.
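To put the test figures in perspective, the share of time spent on grid overhead can be estimated. This is a rough sketch that assumes the overhead applies per job and simply adds to the CPU time; whether it stays at 2 minutes for longer jobs is also an assumption.

```python
def overhead_fraction(cpu_s: float, overhead_s: float) -> float:
    """Fraction of a job's elapsed time spent on grid overhead."""
    return overhead_s / (cpu_s + overhead_s)


# Test run: 30 s of CPU time and 2 minutes of overhead per job.
test_share = overhead_fraction(30, 120)

# Hypothetical final-version job: 5 minutes of CPU time, same overhead.
final_share = overhead_fraction(5 * 60, 120)

print(f"{test_share:.0%} {final_share:.0%}")
```

With such short jobs the overhead dominates the test run; under the stated assumptions it weighs far less on the longer jobs estimated for the complete data set.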
Conclusions
➲ RISICO is a feasible and significant test case.
➲ The Grid architecture provides valuable benefits to operational activities.