
RISICO on the GRID architecture

First implementation

Mirko D'Andrea, Stefano Dal Pra

Outline of the presentation

➲ Porting features;
➲ Jobs management;
➲ Implementation tests and results;
➲ Conclusions and further development.

Porting features

➲ Implemented entirely in Python.
➲ Uses the same executable as the RISICO system (no changes needed).
➲ Easily configurable through a configuration file.

The RISICO system

➲ Italy: 310000 km^2
➲ Current system: 300k regular cells, 1 km side.
➲ Grid version: 30M regular cells, 0.1 km side.
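The two cell counts follow from the area and the cell side. A minimal sketch of the arithmetic, assuming square cells tiling the whole area quoted above (the function name is illustrative only):

```python
# Cell counts for a regular square grid covering a given area.
ITALY_AREA_KM2 = 310_000  # area figure quoted on the slide


def cell_count(area_km2: float, side_km: float) -> int:
    """Number of square cells with the given side needed to tile the area."""
    return round(area_km2 / side_km ** 2)


current = cell_count(ITALY_AREA_KM2, 1.0)  # 1 km cells: roughly the 300k quoted
grid = cell_count(ITALY_AREA_KM2, 0.1)     # 0.1 km cells: roughly the 30M quoted
ratio = grid // current                    # a 10x finer side gives 100x more cells

print(current, grid, ratio)
```

Reducing the cell side by a factor of 10 multiplies the number of cells by 100, which is what motivates splitting the run into many jobs.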

GRIDIFICATION

RISICO vs GRID-RISICO

RISICO: Get input from database → Run RISICO → Write output to database

GRIDIFICATION

GRID-RISICO:

➲ Get input from database
➲ Upload input into catalog
➲ Create n jobs
➲ Job 1: get input from catalog → run RISICO on dataset 1 → write output 1 to catalog
➲ Job n: get input from catalog → run RISICO on dataset n → write output n to catalog
➲ Collect outputs from catalog
➲ Write outputs to database

Job submission

➲ A RISICO job is fully defined by a JDL (Job Description Language) file and a parameter file.

➲ Each submitted job must terminate successfully within a defined time. The job activity is monitored by a software module called JobMonitor.

➲ The job submission procedure is handled by a JobSubmitter, which creates a set of jobs and associates a JobMonitor with each job.
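The slides do not show the JDL file itself. As an illustration only, the sketch below renders a small JDL text using common JDL attributes (Executable, Arguments, StdOutput, StdError, InputSandbox, OutputSandbox). The script and parameter file names are hypothetical, and the attributes actually used by the RISICO porting may differ.

```python
from typing import Sequence


def render_jdl(executable: str, arguments: str,
               input_sandbox: Sequence[str],
               output_sandbox: Sequence[str]) -> str:
    """Return the text of a simple JDL job description."""

    def quoted_list(items: Sequence[str]) -> str:
        # JDL lists are written as {"a", "b"}
        return "{" + ", ".join(f'"{item}"' for item in items) + "}"

    lines = [
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        f"InputSandbox = {quoted_list(input_sandbox)};",
        f"OutputSandbox = {quoted_list(output_sandbox)};",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical file names: one wrapper script and one parameter file per job.
jdl_text = render_jdl(
    executable="run_risico.sh",
    arguments="params_01.txt",
    input_sandbox=["run_risico.sh", "params_01.txt"],
    output_sandbox=["std.out", "std.err"],
)
print(jdl_text)
```

A JobSubmitter along these lines would write one such text and one parameter file per job before submitting.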

Job Monitoring

➲ All the jobs are monitored by an instance of a module called JobMonitor.

➲ The JobMonitor:
   Checks the job status during execution;
   Retrieves the job output from the catalog;
   Tries to resubmit the job if it fails;
   Logs the error if the job fails to run correctly.
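The monitoring behaviour described above (poll the status, fetch the output, resubmit on failure, log errors, enforce a time limit) could be sketched as follows. This is not the authors' JobMonitor: the `submit`, `status` and `fetch` callables, the `"DONE"`/`"FAILED"` state labels, the retry count and the logger name are all assumptions made for the example.

```python
import logging
import time
from typing import Callable

log = logging.getLogger("risico.grid")  # hypothetical logger name


class JobMonitor:
    """Follows one job: polls status, fetches output, resubmits on failure."""

    def __init__(self, submit: Callable[[], str], status: Callable[[str], str],
                 fetch: Callable[[str], None], timeout_s: float,
                 max_retries: int = 2, poll_s: float = 5.0):
        self.submit, self.status, self.fetch = submit, status, fetch
        self.timeout_s = timeout_s      # time allowed for each attempt
        self.max_retries = max_retries  # resubmissions after the first attempt
        self.poll_s = poll_s            # pause between two status checks

    def run(self) -> bool:
        """Return True once the output is fetched, False if all attempts fail."""
        for attempt in range(1, self.max_retries + 2):
            job_id = self.submit()
            deadline = time.monotonic() + self.timeout_s
            while time.monotonic() < deadline:
                state = self.status(job_id)
                if state == "DONE":
                    self.fetch(job_id)  # retrieve the job output
                    return True
                if state == "FAILED":
                    break               # stop polling, go to the next attempt
                time.sleep(self.poll_s)
            log.error("job %s failed or timed out (attempt %d)", job_id, attempt)
        return False
```

Passing the grid operations in as callables keeps the retry logic independent of any particular middleware command and makes it easy to exercise with stand-in functions.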

Workflow: job creation, submission and data-collection

➲ Downloads input from the remote meteo-data database, creates an archive and uploads it to the catalog;
➲ Creates a JDL file and a parameters file for each job;
➲ Submits the jobs;
➲ Waits for the jobs' output;
➲ Gets the jobs' output from the catalog and aggregates it.
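The archive steps of this workflow can be illustrated with the standard `tarfile` module. The sketch assumes bzip2-compressed tar archives (the catalog file names shown in these slides end in `.tar.bz2`). The function names are hypothetical, and the upload to and download from the catalog are left out because the slides do not name the tool used for them.

```python
import tarfile
from pathlib import Path
from typing import Iterable, List


def pack_input(files: Iterable[str], archive_path: str) -> Path:
    """Bundle the downloaded input files into one .tar.bz2 archive."""
    with tarfile.open(archive_path, "w:bz2") as archive:
        for name in files:
            # Store each file under its base name only.
            archive.add(name, arcname=Path(name).name)
    return Path(archive_path)


def unpack_outputs(archives: Iterable[str], target_dir: str) -> List[str]:
    """Extract every job output archive into one directory (aggregation)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for name in archives:
        with tarfile.open(name, "r:bz2") as archive:
            archive.extractall(target)
    return sorted(p.name for p in target.iterdir())
```

In this sketch aggregation simply means collecting the extracted files in one place; how the real system merges the per-job results before writing them to the database is not described in the slides.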

Job definition (1)

[Figure: datasets assigned to job 1 ... job n]

➲ Each job works with a specific dataset defining a spatial domain (subset).

➲ Such subsets are created off-line and stored on the catalog.

➲ A parameters file states the association between a job and a dataset.

➲ Each job produces an output whose path in the catalog is known a priori.

Job definition (2)

Job 1:
Domain: celle/celle_01.tar.bz2
Status: celle/stato0_01.tar.bz2
Input: input/input_20070119.tar.bz2
Output: output/output_01_20071119.tar.bz2

➲ Each job has its own domain.

➲ Job domain, status information and output refer to the same geographical domain.

➲ All jobs share the same input file.
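Because the catalog paths follow a fixed naming pattern, they can be derived from the job number. The sketch below reproduces the paths listed for Job 1. The two-digit zero padding is inferred from the `01` in the example, and the input and output date stamps are kept as separate arguments because they differ in the listing. The `JobFiles` and `job_files` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobFiles:
    """Catalog paths used by one job."""
    domain: str
    status: str
    input: str
    output: str


def job_files(index: int, input_date: str, output_date: str) -> JobFiles:
    """Build the catalog paths for job number `index`."""
    tag = f"{index:02d}"  # assumed two-digit job number, as in celle_01
    return JobFiles(
        domain=f"celle/celle_{tag}.tar.bz2",
        status=f"celle/stato0_{tag}.tar.bz2",
        input=f"input/input_{input_date}.tar.bz2",  # shared by all jobs
        output=f"output/output_{tag}_{output_date}.tar.bz2",
    )


job1 = job_files(1, input_date="20070119", output_date="20071119")
print(job1)
```

Only the domain, status and output paths depend on the job number; the input path is the same for every job.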

Job definition (3)

CATALOG

Job 1:
Domain: celle/celle_01.tar.bz2
Status: celle/stato0_01.tar.bz2
Input: input/input_20070119.tar.bz2
Output: output/output_01_20071119.tar.bz2

Job 2:
Domain: celle/celle_02.tar.bz2
Status: celle/stato0_02.tar.bz2
Input: input/input_20070119.tar.bz2
Output: output/output_02_20071119.tar.bz2

Job n:
Domain: celle/celle_nn.tar.bz2
Status: celle/stato0_nn.tar.bz2
Input: input/input_20070119.tar.bz2
Output: output/output_nn_20071119.tar.bz2

Final version

➲ Estimated performance on the complete set of data (30M cells):
   Total CPU time: about 2 hours and 30 minutes;
   Optimal job number: about 30 (5-10 minutes of CPU time for each job);
   Storage: 30 GByte / day.

Test Results

➲ The porting has been tested with a subset (1M cells) of the final working set of the RISICO system.

➲ 10 parallel jobs were used.
➲ Performance:
   Job CPU time: 30 seconds;
   Grid overhead: 2 minutes.
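These test figures are consistent with the estimate given for the complete 30M-cell data set. The sketch below is a back-of-the-envelope check, assuming that the 30 seconds are CPU time per job and that CPU time scales linearly with the number of cells. Neither assumption is stated explicitly in the slides.

```python
# Figures from the test run
TEST_CELLS = 1_000_000
TEST_JOBS = 10
TEST_CPU_S_PER_JOB = 30

# Figures from the full-scale estimate
FULL_CELLS = 30_000_000
FULL_JOBS = 30

# Total CPU time of the test, scaled linearly to the full data set.
test_cpu_s = TEST_JOBS * TEST_CPU_S_PER_JOB          # CPU seconds for 1M cells
full_cpu_s = test_cpu_s * FULL_CELLS // TEST_CELLS   # CPU seconds for 30M cells
full_cpu_min = full_cpu_s / 60                       # total CPU minutes
per_job_min = full_cpu_min / FULL_JOBS               # CPU minutes per job

print(full_cpu_s, full_cpu_min, per_job_min)
```

Under these assumptions the extrapolation gives 150 minutes of total CPU time, that is 2 hours and 30 minutes, and 5 minutes per job with 30 jobs, in line with the 5-10 minutes quoted for the final version.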

Conclusions

➲ RISICO represents a feasible and significant test case.

➲ The Grid architecture provides valuable benefits to operational activities.
