Virtual Data Toolkit

R. Cavanaugh, GriPhyN Analysis Workshop, Caltech, June 2003


DESCRIPTION

Virtual Data Toolkit. R. Cavanaugh, GriPhyN Analysis Workshop, Caltech, June 2003. An overview of the early GriPhyN data grid architecture (Globus, GRAM, GridFTP, SRM, GSI, CAS, MDS, MCAT and the GriPhyN catalogs, GDMP, DAGMan, Condor-G), the current VDT client and server packages, the Chimera virtual data system and Virtual Data Language, virtual data request planning, and typical VDT and EDG cluster configurations.

TRANSCRIPT

Page 1: Virtual Data  Toolkit

R. Cavanaugh

GriPhyN Analysis Workshop

Caltech, June 2003

Virtual Data Toolkit

Page 2: Virtual Data  Toolkit


Very Early GriPhyN Data Grid Architecture

Diagram: a layered architecture with an Application on top of a Planner and an Executor, backed by Catalog Services, Info Services, Monitoring, Replica Management, a Reliable Transfer Service, and Compute and Storage Resources, with Policy/Security cutting across the layers. Labels name the implementing technologies: DAGMan and Condor-G; GRAM; GridFTP, GRAM, and SRM; GSI and CAS; MDS; MCAT and the GriPhyN catalogs; GDMP; all built on Globus. A legend marks the components where an initial solution is operational.

Page 3: Virtual Data  Toolkit


Currently Evolved GriPhyN Picture

Picture Taken from Mike Wilde

Page 4: Virtual Data  Toolkit


Current VDT Emphasis

Current reality
– Easy grid construction
  > Strikes a balance between flexibility and "easibility"
  > Purposefully errs (just a little bit) on the side of "easibility"
– Long-running, high-throughput, file-based computing
– Abstract description of complex workflows
– Virtual Data Request Planning
– Partial provenance tracking of workflows

Future directions (current research) include:
– Policy-based scheduling
  > With notions of Quality of Service (advance reservation of resources, etc.)
– Dataset-based computing (arbitrary type structures)
– Full provenance tracking of workflows
– Several others…

Page 5: Virtual Data  Toolkit


Current VDT Flavors

Client
– Globus Toolkit 2
  > GSI
  > globusrun
  > GridFTP Client
– CA signing policies for DOE and EDG
– Condor-G 6.5.1 / DAGMan
– RLS 1.1.8 Client
– MonALISA Client (soon)

Chimera 1.0.3 SDK
– Globus
– ClassAds
– RLS 1.1.8 Client
– NetLogger 2.0.13

Server
– Globus Toolkit 2.2.4
  > GSI
  > Gatekeeper
  > Job managers and GASS Cache
  > MDS
  > GridFTP Server
– MyProxy
– CA signing policies for DOE and EDG
– EDG Certificate Revocation List
– Fault Tolerant Shell
– GLUE Schema
– mkgridmap
– Condor 6.5.1 / DAGMan
– RLS 1.1.8 Server
– MonALISA Server (soon)

Page 6: Virtual Data  Toolkit


Chimera Virtual Data System

Virtual Data Language
– Textual
– XML

Virtual Data Catalog
– MySQL or PostgreSQL based
– File-based version available

Page 7: Virtual Data  Toolkit


Virtual Data Language

TR CMKIN( out a2, in a1 )
{
  argument file = ${a1};
  argument file = ${a2};
}

TR CMSIM( out a2, in a1 )
{
  argument file = ${a1};
  argument file = ${a2};
}

DV x1->CMKIN( a2=@{out:file2}, a1=@{in:file1} );
DV x2->CMSIM( a2=@{out:file3}, a1=@{in:file2} );

Diagram: file1 → x1 (CMKIN) → file2 → x2 (CMSIM) → file3

Picture Taken from Mike Wilde
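To make the chained dependencies explicit, here is a minimal Python sketch (illustration only, not VDT code); the dictionary simply restates the two derivations above, and the loop recovers the file1 → x1 → file2 → x2 → file3 chain from the declared inputs and outputs.

# Illustration only (not VDT code): recover the dependency chain implied by
# the two derivations above from their declared inputs and outputs.
derivations = {
    "x1": {"tr": "CMKIN", "in": ["file1"], "out": ["file2"]},
    "x2": {"tr": "CMSIM", "in": ["file2"], "out": ["file3"]},
}

# Which derivation produces each logical file?
producer = {f: dv for dv, d in derivations.items() for f in d["out"]}

# An edge (a, b) means b consumes a file that a produces, so b must run after a.
edges = [(producer[f], dv)
         for dv, d in derivations.items()
         for f in d["in"] if f in producer]

print(edges)   # [('x1', 'x2')]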

Page 8: Virtual Data  Toolkit


Virtual Data Request Planning

Abstract Planner
– Graph traversal of (virtual) data dependencies
– Generates the graph with maximal data dependencies
– Somewhat analogous to Build style

Concrete (Pegasus) Planner
– Prunes execution steps for which data already exists (RLS lookup; see the sketch below)
– Binds all execution steps in the graph to a site
– Adds "housekeeping" steps
  > Create environment, stage-in data, stage-out data, publish data, clean-up environment, etc.
– Generates a graph with minimal execution steps
– Somewhat analogous to Make style
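A rough Python sketch of the "make-style" pruning step, continuing the CMKIN/CMSIM example; this is an illustration only, not Pegasus code, and the RLS lookup is faked with a plain set of logical file names.

# Illustration only (not Pegasus code): make-style pruning of an abstract
# workflow.  A step is dropped when every file it produces is already
# registered in the replica catalog.
already_registered = {"file2"}             # pretend RLS already knows about file2

steps = {                                  # the two derivations from the VDL example
    "x1": {"in": ["file1"], "out": ["file2"]},
    "x2": {"in": ["file2"], "out": ["file3"]},
}

pruned = {name: s for name, s in steps.items()
          if not all(f in already_registered for f in s["out"])}

print(sorted(pruned))   # ['x2'] -- only CMSIM still has to run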

Page 9: Virtual Data  Toolkit


Chimera Virtual Data System: Mapping Abstract Workflows onto Concrete Environments

Abstract DAGs (virtual workflow)
– Resource locations unspecified
– File names are logical
– Data destinations unspecified
– Build style

Concrete DAGs (stuff for submission)
– Resource locations determined
– Physical file names specified
– Data delivered to and returned from physical locations
– Make style

Diagram: VDL definitions are stored in the VDC and fed to the abstract planner, which emits a DAX (XML); the concrete planner consults RLS and emits a DAG for DAGMan. Logical file names on the abstract side are mapped to physical file names on the concrete side.

In general there is a full range of planning steps between abstract workflows and concrete workflows

Picture Taken from Mike Wilde
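As a rough sketch of the logical-to-physical binding that separates an abstract DAG from a concrete one (illustration only, not planner code; the host name and the replica-catalog and site choices below are hypothetical):

# Illustration only: bind an abstract step to a concrete one.  A real concrete
# planner queries RLS and site catalogs; here both are faked with dictionaries.
replica_catalog = {"file1": "gsiftp://se.example.org/store/file1"}   # hypothetical PFN
site = "gk.example.org/jobmanager-condor"                            # hypothetical site

def concretise(step_name, logical_files):
    """Map logical file names to physical URLs and pin the step to a site."""
    physical = {lfn: replica_catalog.get(lfn, "(produced by an upstream step)")
                for lfn in logical_files}
    return {"step": step_name, "site": site, "files": physical}

print(concretise("x1", ["file1", "file2"]))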

Page 10: Virtual Data  Toolkit


Diagram: a tree of virtual (simulated) data products parameterised by mass, decay, stability, event, and plot. The root node mass=200 branches into decay=WW, decay=ZZ, decay=bb, event=8, and plot=1; the decay=WW node branches further into stability=1, stability=3, event=8, and plot=1; and the stability=1 node into event=8 and plot=1.

A virtual space of simulated data is generated for future use by scientists...

Supercomputing 2002

Page 11: Virtual Data  Toolkit


Diagram: the same parameter tree as the previous slide, now with a newly derived branch mass=200, decay=WW, stability=1, LowPt=20, HighPt=10000 attached under the stability=1 node.

Scientists may add new derived data branches...

Supercomputing 2002
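A minimal Python sketch (plain Python, not VDT code) of one way to model such a virtual parameter space: each derived dataset is identified by its parameter settings, and a new branch is attached under the most specific existing node it extends. The node list restates the diagram above.

# Illustration only: model the virtual parameter space as nodes identified by
# their parameter settings.
nodes = [
    {"mass": 200},
    {"mass": 200, "decay": "WW"},
    {"mass": 200, "decay": "WW", "stability": 1},
    {"mass": 200, "decay": "WW", "stability": 1, "event": 8},
]

def parent_of(node):
    """Parent = the most specific existing node whose parameters the new node extends."""
    subsets = [n for n in nodes if n.items() <= node.items() and n != node]
    return max(subsets, key=len, default=None)

# A scientist derives a new branch (the LowPt/HighPt node from the slide above):
new_branch = {"mass": 200, "decay": "WW", "stability": 1, "LowPt": 20, "HighPt": 10000}
print(parent_of(new_branch))   # {'mass': 200, 'decay': 'WW', 'stability': 1}
nodes.append(new_branch)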

Page 12: Virtual Data  Toolkit


Example CMS Data/Workflow

Diagram: an example CMS data/workflow built around POOL, with a Generator, Simulator, Formatter, Digitiser, and Calib. DB feeding two chains of writeESD, writeAOD, and writeTAG steps whose outputs are consumed by Analysis Scripts.

Page 13: Virtual Data  Toolkit


Diagram: the same CMS data/workflow as the previous slide, annotated with the teams responsible for the different stages: Online Teams, a (Re)processing Team, an MC Production Team, and Physics Groups.

Data/workflow is a collaborative endeavour!

Page 14: Virtual Data  Toolkit


A “Concurrent Analysis Versioning System:”

Complex Data Flow and Data Provenance in HEP

Diagram: Real Data and Simulated Data each flow through Raw → ESD → AOD → TAG → Plots, Tables, Fits, and the two sets of plots, tables, and fits meet in Comparisons.

Family History of a Data Analysis

Collaborative Analysis Development Environment

"Check-point" a Data Analysis

Analysis Development Environment (like CVS)

Audit a Data Analysis

Page 15: Virtual Data  Toolkit


Current Prototype GriPhyN “Architecture” (Picture)

Picture Taken from Mike Wilde

Page 16: Virtual Data  Toolkit


Post-talk: My wandering mind…

Typical VDT Configuration

Single public head-node (gatekeeper)
– VDT server installed

Many private worker-nodes
– Local scheduler software installed
– No grid middleware installed

Shared file system (e.g. NFS)
– User area shared between head-node and worker-nodes
– One or many RAID systems typically shared

Page 17: Virtual Data  Toolkit


Default middleware configuration from the Virtual Data Toolkit

Diagram: on the submit host, Chimera, DAGMan, Condor-G, and gahp_server prepare and submit the work; on the remote host, the gatekeeper hands jobs to the local scheduler (Condor, PBS, etc.), which runs them on the compute machines.

Page 18: Virtual Data  Toolkit


EDG Configuration (for comparison)

CPU separate from Storage
– CE: single gatekeeper for access to the cluster
– SE: single gatekeeper for access to storage

Many public worker-nodes (at least NAT)
– Local scheduler installed (LSF or PBS)
– Each worker-node runs a GridFTP Client

No assumed shared file system
– Data access is accomplished via globus-url-copy to local disk on the worker-node (see the sketch below)
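As a rough sketch of that data-access pattern (illustration only; the URLs and paths are hypothetical, and a real job would take them from its submit description), a worker-node job might stage its input like this:

# Illustration only: stage an input file to local disk with globus-url-copy
# before the job runs.  Source and destination URLs are hypothetical.
import subprocess

src = "gsiftp://se.example.org/store/file2"   # replica on the storage element
dst = "file:///tmp/file2"                     # local scratch on the worker-node

subprocess.run(["globus-url-copy", src, dst], check=True)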

Page 19: Virtual Data  Toolkit


Why Care?

Data analyses would benefit from being fabric-independent!

But… the devil is (still) in the details!
– Assumptions in job descriptions/requirements currently lead to direct fabric-level consequences, and vice versa.

Are existing middleware configurations sufficient for data analysis ("scheduled" and "interactive")?
– We really need input from groups like this one!
– What kind of fabric layer is necessary for "interactive" data analysis using PROOF or JAS?

Does the VDT need multiple configuration flavors?
– Production, batch-oriented (current default)
– Analysis, interactive-oriented