CASJOBS: A WORKFLOW ENVIRONMENT DESIGNED FOR LARGE SCIENTIFIC CATALOGS Nolan Li, Johns Hopkins University

Post on 19-Dec-2015

Page 1: CASJOBS: A WORKFLOW ENVIRONMENT DESIGNED FOR LARGE SCIENTIFIC CATALOGS

Nolan Li, Johns Hopkins University

Page 2: What is CASJobs

Terabytes of scientific data
Web-based system
Data distribution
Server-side analysis
Optimize user work patterns
Server-side user storage and programmability

Page 3: Sloan Digital Sky Survey (SDSS)

Astronomical survey
Images (FITS): 15.7 TB
Other data products (masks, JPEG images, etc.; DAS, FITS format): 26.8 TB
Catalogs (CAS, SQL database): 18 TB
Data is public
Delivery?

Page 4: Database

Bandwidth is expensive!
10 terabytes is big!
So database it (SkyServer)
  Partial delivery
  Move work to data
Scalability
  Traffic++
  Complexity++
  Data++
So…
  Cap execution time
  Cap results
  Build something else

[Figure: Monthly CAS Usage. Log-scale axis from 1.E+04 to 1.E+07; series: Web Hits, SQL Queries.]

Page 5: CASJobs

Catalog Archive Server Jobs
Server-side user storage and programmability
  MyDB
Hardware abstraction and long-term query portability
  Contexts
Complete, automatic query logging
Scalable performance
  Controlled asynchronous query execution
Data sharing
  Groups
http://casjobs.sdss.org/casjobs

Page 6: MyDB

Server-side user database
Intermediate storage
Data import
User programmable

Object IDs listed inline:

    SELECT *
    FROM DR4 a
    WHERE a.objid = 38573498
       OR a.objid = 92837451
       OR a.objid = 20394833
       OR a.objid = 90284723

Object IDs stored in a MyDB table:

    SELECT *
    FROM DR4 a, MyDB.MyTable b
    WHERE a.objid = b.objid
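The two queries on this slide can be compared in a small runnable sketch. It uses Python's built-in `sqlite3` module as a stand-in for the real catalog server, so the table names `dr4` and `mytable`, the `mag` column, and the extra object ID are assumptions made for illustration only. It shows that joining against an uploaded ID table selects the same rows as listing every ID in the WHERE clause.

```python
import sqlite3

# Object IDs taken from the slide; everything else here is made-up filler.
WANTED_IDS = [38573498, 92837451, 20394833, 90284723]


def build_demo_db():
    """Build an in-memory stand-in for the catalog and for a MyDB table."""
    conn = sqlite3.connect(":memory:")
    # "dr4" stands in for a catalog table, "mytable" for MyDB.MyTable.
    conn.execute("CREATE TABLE dr4 (objid INTEGER PRIMARY KEY, mag REAL)")
    conn.execute("CREATE TABLE mytable (objid INTEGER PRIMARY KEY)")
    rows = [(objid, 18.0) for objid in WANTED_IDS] + [(11111111, 20.0)]
    conn.executemany("INSERT INTO dr4 VALUES (?, ?)", rows)
    conn.executemany("INSERT INTO mytable VALUES (?)",
                     [(objid,) for objid in WANTED_IDS])
    return conn


def select_with_or_list(conn):
    """First form: every object ID is written into the WHERE clause."""
    where = " OR ".join(f"a.objid = {objid}" for objid in WANTED_IDS)
    sql = f"SELECT a.objid FROM dr4 a WHERE {where} ORDER BY a.objid"
    return [row[0] for row in conn.execute(sql)]


def select_with_join(conn):
    """Second form: the IDs live in a table and the query joins against it."""
    sql = ("SELECT a.objid FROM dr4 a, mytable b "
           "WHERE a.objid = b.objid ORDER BY a.objid")
    return [row[0] for row in conn.execute(sql)]


if __name__ == "__main__":
    conn = build_demo_db()
    print(select_with_or_list(conn) == select_with_join(conn))
```

The join form keeps the query text the same size no matter how many IDs are uploaded, which is presumably why the slide pairs it with server-side storage.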

Page 7: Logging

Automatically log all user queries
Resubmit old queries
Reconstruct database objects
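The three points above can be illustrated with a minimal sketch. This is not the CASJobs implementation: `LoggedSession`, `resubmit`, and `replay` are hypothetical names, and SQLite stands in for the real database. The idea shown is that if every statement is recorded in order, an old query can be run again by its position in the log, and tables can be rebuilt by replaying the log elsewhere.

```python
import sqlite3


class LoggedSession:
    """Hypothetical wrapper that records every statement it is asked to run."""

    def __init__(self, conn):
        self.conn = conn
        self.log = []  # SQL text in submission order; the index is the job id

    def run(self, sql):
        """Log the statement, run it, and return (job_id, rows)."""
        self.log.append(sql)
        rows = self.conn.execute(sql).fetchall()
        return len(self.log) - 1, rows

    def resubmit(self, job_id):
        """Run a logged statement again; the rerun is logged as a new job."""
        return self.run(self.log[job_id])


def replay(log, conn):
    """Rebuild database objects by re-running a log on another connection."""
    for sql in log:
        conn.execute(sql)
    return conn


if __name__ == "__main__":
    session = LoggedSession(sqlite3.connect(":memory:"))
    session.run("CREATE TABLE bright (objid INTEGER)")
    session.run("INSERT INTO bright VALUES (1), (2)")
    copy = replay(session.log, sqlite3.connect(":memory:"))
    print(copy.execute("SELECT objid FROM bright ORDER BY objid").fetchall())
```

Because the statement is logged before it runs, failed statements are recorded too; a real replay would need to skip or handle them.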

Page 8: Contexts

Databases are identified by their data, not their location
Queries are independent of hardware configuration

    SELECT TOP 10 *
    FROM [server].[catalog].[user].MyTable

    SELECT TOP 10 *
    FROM DR4.MyTable
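One way to picture the mapping between the two queries above is a lookup table from context name to physical location. The sketch below is an assumption about how such a translation could work, not a description of CASJobs internals; `resolve_contexts` and the server and catalog names are hypothetical.

```python
import re


def resolve_contexts(sql, contexts):
    """Replace each Context.Table reference with its physical location.

    `contexts` maps a context name (e.g. "DR4") to a location prefix.
    Dotted names whose first part is not a known context are left alone.
    """
    def swap(match):
        context, table = match.group(1), match.group(2)
        if context not in contexts:
            return match.group(0)  # e.g. a table alias such as a.objid
        return f"{contexts[context]}.{table}"

    return re.sub(r"\b(\w+)\.(\w+)\b", swap, sql)


if __name__ == "__main__":
    query = "SELECT TOP 10 * FROM DR4.MyTable"
    # Hypothetical placement of the DR4 data before and after a hardware move.
    before = {"DR4": "[serverA].[catalogA].[dbo]"}
    after = {"DR4": "[serverB].[catalogA].[dbo]"}
    print(resolve_contexts(query, before))
    print(resolve_contexts(query, after))
```

The user's query text is identical in both calls; only the mapping changes when the data moves, which is the portability property the slide describes.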

Page 9: Quick Jobs

Executes right away
But not for very long
Restricted memory usage
For things like…
  How many objects?
  Table previews
  Preliminary queries
  System queries

Page 10: Long Jobs

Asynchronous
Less restricted execution time
Storage capped by MyDB size
For things like…
  Heavy IO
  Heavy computation
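The contrast between quick and long jobs can be sketched as two execution paths. The code below is an illustration under stated assumptions, not the CASJobs scheduler: `run_quick`, `QuickJobTimeout`, and `LongQueue` are hypothetical names, SQLite stands in for the real database, and the MyDB size cap is not modeled. A quick job runs immediately but is aborted once it exceeds a time budget; a long job is only queued at submission and writes its result into a named table when it eventually runs.

```python
import sqlite3
import time
from collections import deque


class QuickJobTimeout(Exception):
    """Raised when a quick job exceeds its time budget (hypothetical name)."""


def run_quick(conn, sql, time_limit_s=1.0):
    """Run a query right away, aborting it once the time budget is spent."""
    deadline = time.monotonic() + time_limit_s
    # SQLite calls the handler every 1000 VM steps; a true value aborts.
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 1000)
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        if time.monotonic() > deadline:
            raise QuickJobTimeout(sql) from exc
        raise  # some other SQL error, not a timeout
    finally:
        conn.set_progress_handler(None, 1000)  # remove the handler


class LongQueue:
    """FIFO queue of long jobs; results land in a user-named table."""

    def __init__(self, conn):
        self.conn = conn
        self.pending = deque()

    def submit(self, sql, output_table):
        """Queue the job and return at once; no query runs here."""
        self.pending.append((sql, output_table))

    def run_next(self):
        """Run the oldest queued job and store its rows in its output table."""
        sql, output_table = self.pending.popleft()
        # A sketch only: the table name is trusted, not sanitized.
        self.conn.execute(f"CREATE TABLE {output_table} AS {sql}")
        return output_table
```

A caller would try `run_quick` for counts and previews, and fall back to `LongQueue.submit` when the work is heavy or a `QuickJobTimeout` is raised.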

Page 11: Groups

Non-exclusive sets of CASJobs users
Share data
Keep more work at the data

    SELECT *
    FROM myGroup.otherUser.theirTable

Page 12: Hardware

Flexible configuration
1+ machine per context (non-exclusive)
1+ machine for MyDBs

Page 13: Interface

Web site
Web services

Page 14: Usage

More than two million jobs
More than 2200 users
Astro deployments
  Galaxy Evolution Explorer (GALEX)
  Palomar Quest
  Panoramic Survey Telescope and Rapid Response System (Pan-STARRS)[3]
Non-astro deployments
  Ameriflux
  Swiss Institute of Bioinformatics (ISB)

[Figure: Monthly CASJobs. Vertical axis from 0 to 250000; horizontal axis dates from 8/29/2003 to 8/31/2008.]