Data Publication and Quality Control Procedure for CMIP5 / IPCC-AR5 Data




[email protected] WDC Climate / DKRZ: www.wdc-climate.de; www.dkrz.de

CMIP5: cmip-pcmdi.llnl.gov/cmip5; CMIP5 Quality Control: purl.org/org/cmip5/qc

EGU2011-2859

Martina Stockhause, Michael Lautenschlager, Heinke Hoeck, and Frank Toussaint

Poster sections: CMIP5 Quality Control Workflow; Distributed Quality Control Approach; Data Publication Procedure; Future Perspective

CMIP5 Quality Control (QC)

Three Quality Control (QC) levels are defined for CMIP5 data:

Quality Levels for CMIP5

• QC Level 1:
  Metadata: Technical checks on METAFOR questionnaire input data
  Data: CMOR2 and ESG publisher conformance checks

• QC Level 2:
  Metadata: METAFOR questionnaire metadata checked by a scientist
  Data: Technical checks, e.g. on the reliability of variable ranges and on the consistency between data and data requirements

• QC Level 3 / DOI:
  Data approved by the author and published with a DOI
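A QC Level 2 technical check on variable ranges could look roughly like the following minimal sketch. It is not the CMIP5 QC tool: the function name, the variable names and the bounds in `EXPECTED_RANGES` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical plausibility bounds per variable (illustrative values only,
# not taken from the CMIP5 data requirements).
EXPECTED_RANGES = {
    "tas": (150.0, 350.0),  # near-surface air temperature in K (assumed bounds)
    "pr": (0.0, 0.1),       # precipitation flux in kg m-2 s-1 (assumed bounds)
}


def check_range(variable: str, values: Iterable[float]) -> List[Tuple[int, float]]:
    """Return (index, value) pairs that fall outside the expected range."""
    low, high = EXPECTED_RANGES[variable]
    # NaN fails both comparisons, so it is reported as an exception as well.
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]


if __name__ == "__main__":
    # The third value mimics an unmasked fill value slipping into the data.
    print(check_range("tas", [288.1, 290.4, 1.0e20, 275.3]))  # [(2, 1e+20)]
```

Returning the offending indices rather than a bare pass/fail flag keeps enough detail for the kind of exception statistics a QC service might report.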

Data assigned a DOI is formally citable and is granted persistent access.

The final DOI data publication procedure is in agreement with the regulations of the DataCite consortium:

DOI Publication Process

• Scientific Quality Assurance: performed by the data author and documented via a publication service GUI (atarrabi)

• Technical Quality Assurance: cross- and double-checks of data and metadata integrity

• DOI Publication: DataCite DOI metadata and the DOI are sent separately to the registration agency, a member of the DOI Foundation. Data and DOI remain unchanged and persistent.

CMIP5 Organization & Infrastructure Components

For distribution of data connected to the next IPCC report, the Earth System Grid Federation (ESGF) was founded. Its members have different responsibilities within the data infrastructure:

• PCMDI / LLNL: data and security infrastructure (ESG)

• BADC (British Atmospheric Data Centre): metadata infrastructure (METAFOR / CIM)

• WDCC (World Data Center Climate) / DKRZ: quality control, data publication (DataCite DOI)

[Figure: flow of data and metadata (MD) between the data nodes (DN, TDS, QC L2) at PCMDI/BADC/WDCC, the ESG Gateways, the QC Repository, the CIM MD Repository and the CIM Questionnaire, and the WDCC DOI Publication Agency (QC L3 / STD-DOI). Labels: MD on Model / Simulation; MD on Data; MD on Quality; Store, Analyse, and Plot Results; Quality Results; Harvest all MD; Harvest all MD of long-term archive. Within WDCC: TQA; Atarrabi (SQA of Data Author); CERA2 MD; DOI Data Long-term Archive; DOI Catalogue; DOI Target Page; DOI Access. Towards the IDF: Register STD-DOI / URL of DOI Target Page; Parts of Data / MD on Simulation.]

For CMIP5, ca. 3 PB of officially requested data are expected to be archived. About 1 PB of that data will likely be of especially high interest and will be replicated by the three ESGF partners.

Because of the high data volume, the QC checks up to level 2 are performed in a distributed manner among the ESGF. The final QC level 3 checks for the DOI assignment are carried out by WDCC. Afterwards CMIP5 data is formally citable and remains persistent.

[Figures: Quality Control Workflow CMIP5; DOI Publication GUI atarrabi; Actors in DOI Publication Process; Workflow of Distributed QC in CMIP5]

QC checks for data and metadata are performed separately for levels 1 and 2. Their results are reviewed during the cross-checks of QC L3.

Granularity of Quality Control

QC is accomplished on DRS Atomic Dataset level. The QC results are aggregated on DRS experiment level. In the gateways, data discovery is supported down to the level of Ensemble versions (ESG dataset).

The different QC levels are connected with different access rights for registered users:

• Restricted Access (QC Level 1 Data): After ESG publication, access is restricted under control of the modelling centre.

• Scientific Access (QC Level 2): The scientific community is granted access to data of QC L2.

• QC Level 3 / DOI: With the DOI assignment, the data archive is opened for access by non-scientific users.
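The relation between QC level and access rights can be written down as a small lookup table. This is a minimal sketch only: the class and field names are hypothetical, and the citability flag follows the statement that data becomes formally citable once a DOI is assigned at QC L3.

```python
from dataclasses import dataclass
from enum import IntEnum


class QCLevel(IntEnum):
    L1 = 1
    L2 = 2
    L3 = 3


@dataclass(frozen=True)
class AccessPolicy:
    audience: str           # who may access the data at this level
    formally_citable: bool  # True only once a DOI has been assigned


# Hypothetical encoding of the three access classes listed above.
POLICIES = {
    QCLevel.L1: AccessPolicy("restricted, under control of the modelling centre", False),
    QCLevel.L2: AccessPolicy("scientific community", False),
    QCLevel.L3: AccessPolicy("open, including non-scientific users", True),
}


def policy_for(level: int) -> AccessPolicy:
    """Look up the access policy for a QC level (raises ValueError if unknown)."""
    return POLICIES[QCLevel(level)]


if __name__ == "__main__":
    for lvl in QCLevel:
        print(lvl.name, policy_for(lvl))
```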

For high-volume data such as climate data, quality assurance has to be carried out at the data storage centre before opening the repository for data access. Data distributed in a Data Grid with its decentralized data repositories have to be checked at different sites with comparable QC procedures.

The cross- and double-checks of the Technical Quality Assurance make use of the QC results of the preceding levels. Data as well as metadata are reviewed, and data accessibility is checked.

QC L3 / DOI Process in CMIP5

More Information: http://purl.org/org/cmip5/qc
More Information (atarrabi): http://cera-www.dkrz.de/atarrabi

The final author approval step is supported by the GUI atarrabi. Authors check basic metadata and add information about their own quality assurance (Scientific QA).

A DOI is assigned and registered at the International DOI Foundation (IDF: dx.doi.org) via the Registration Agency DataCite.

DOI Construction Rule for CMIP5:

doi: 10.1594/WDCC/CMIP5.<opaque bit>
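The construction rule can be illustrated with a short sketch. The helper names are hypothetical, and so is the rule that the opaque part must be non-empty and free of whitespace; the actual form of the opaque part is not specified here, so the example uses an obviously made-up placeholder.

```python
import re

DOI_PREFIX = "10.1594/WDCC/CMIP5."   # fixed part of the CMIP5 rule above
RESOLVER = "http://dx.doi.org/"      # IDF resolver named in the text


def build_doi(opaque_bit: str) -> str:
    """Append an opaque suffix to the fixed CMIP5 DOI prefix."""
    # Assumption: the opaque part is non-empty and contains no whitespace.
    if not opaque_bit or re.search(r"\s", opaque_bit):
        raise ValueError("opaque part must be non-empty and free of whitespace")
    return DOI_PREFIX + opaque_bit


def is_cmip5_doi(doi: str) -> bool:
    """Check whether a DOI string follows the CMIP5 construction rule."""
    return re.fullmatch(re.escape(DOI_PREFIX) + r"\S+", doi) is not None


def resolver_url(doi: str) -> str:
    """Build the resolver URL that leads to the DOI target page."""
    return RESOLVER + doi


if __name__ == "__main__":
    doi = build_doi("EXAMPLE")  # "EXAMPLE" is a placeholder, not a real DOI
    print(doi, is_cmip5_doi(doi), resolver_url(doi))
```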

Overall CMIP5 QC Workflow

[Flowchart. Recoverable elements:]

• Start: Data Published (> 10 PB); Data and Metadata QC L1, on globally distributed data nodes. Status: data NOT formally citable; modelling group controls access manually.

• Decisions (branches labelled YES / NO; one outcome box reads "Discard data"): Metadata QC L2 passed? Data QC L2 passed? To be replicated among ESGF? QC L3 passed?

• After QC L2: data NOT formally citable; automatic access granted after filling in the ESGF registration page. (Informal citation still requested where formal citation not available.)

• Replicated: copied to PCMDI, BADC, WDCC & elsewhere (~ 1 PB).

• DOI Assigned: data formally citable; data can appear in IPCC-DDC; automatic access granted after filling in the ESGF registration page.
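The progression through the QC levels can be sketched as a small function that derives the level reached from the outcome of the individual checks. This is an illustration under stated assumptions, not the workflow implementation: the names are hypothetical, the replication decision and the "Discard data" branch are left out, and QC L1 is taken as given for any ESG-published dataset.

```python
from dataclasses import dataclass


@dataclass
class QCStatus:
    """Outcome of the individual checks for one dataset (hypothetical fields)."""
    metadata_l2_passed: bool = False
    data_l2_passed: bool = False
    l3_passed: bool = False


def qc_level(status: QCStatus) -> int:
    """Return the highest QC level reached by an ESG-published dataset."""
    level = 1  # ESG publication implies the QC L1 data checks
    if status.metadata_l2_passed and status.data_l2_passed:
        level = 2
        # QC L3 builds on the L2 results, so it only counts after L2.
        if status.l3_passed:
            level = 3
    return level


if __name__ == "__main__":
    print(qc_level(QCStatus(metadata_l2_passed=True, data_l2_passed=True)))
```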

Granularity of QC in CMIP5 context

[Figure: DRS name / hierarchy levels mapped to the CMIP5 processes DOI / QC, Gateway Search, and Data Access. DRS components shown (exact order in the figure not recoverable): <activity> <product> <institute> <model> <experiment> <frequency> <modeling realm> <MIP/CMOR2 table> <variable identifier> <ensemble member> <version> <netcdf datasets>. The Atomic Dataset level is marked.]
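The granularity rule (QC on atomic dataset level, results aggregated on experiment level) can be sketched as a grouping step. The tuple layout and the sample identifiers below are assumptions made for illustration; only the idea of grouping by the components up to `<experiment>` comes from the text.

```python
from collections import defaultdict
from typing import Dict, NamedTuple, Tuple


class AtomicDataset(NamedTuple):
    """DRS components identifying one atomic dataset (assumed field order)."""
    activity: str
    product: str
    institute: str
    model: str
    experiment: str
    frequency: str
    realm: str
    table: str
    ensemble: str
    variable: str


def aggregate_by_experiment(
    results: Dict[AtomicDataset, bool],
) -> Dict[Tuple[str, ...], Dict[str, int]]:
    """Count checked and failed atomic datasets per DRS experiment."""
    summary: Dict[Tuple[str, ...], Dict[str, int]] = defaultdict(
        lambda: {"checked": 0, "failed": 0}
    )
    for dataset, passed in results.items():
        key = tuple(dataset[:5])  # activity .. experiment
        summary[key]["checked"] += 1
        if not passed:
            summary[key]["failed"] += 1
    return dict(summary)


if __name__ == "__main__":
    # Illustrative identifiers only.
    base = ("cmip5", "output1", "INST", "MODEL", "historical", "mon", "atmos", "Amon", "r1i1p1")
    results = {
        AtomicDataset(*base, "tas"): True,
        AtomicDataset(*base, "pr"): False,
    }
    print(aggregate_by_experiment(results))
```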

The current DOI publication procedure is comparable to the publication of grey literature in scientific print media. For the integration of a peer review process, quality procedures accepted and agreed on by the earth system modelling community are necessary.

The distributed quality control approach could be reduced in complexity by the integration of the QC Repository into the CIM Metadata Repository.

Distributed QC Approach

Distributed QC Procedure in CMIP5

CMIP5 data is delivered to one of the three ESGF partners, where it is ESG published and thus QC L1 Data checked. Afterwards QC L2 Data consistency checks are performed, before a data subset is replicated among the ESGF. QC L2 results are stored in a central QC Repository.

During QC L3 / DOI checks the QC results are accessed by the DOI Publication Agency WDCC. Other sources for cross- and double-checks are the CIM Metadata Repository, the Thredds Data Server (TDS), and the metadata stored in the long-term archive at WDCC.

Thus, the effort of the QC L2 Data checks is shared among the ESGF, whereas the QC L3 / DOI checks are performed at one site, making use of the QC L2 results stored in a central QC Repository.

Consequently, the QC procedure and tools have to be developed, maintained and distributed centrally, and agreed upon within the scientific community.
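The division of labour described above (several sites write QC L2 results, one site reads them for QC L3) can be illustrated with an in-memory stand-in for the central QC Repository. The class, its methods and the dataset identifier are hypothetical; the real repository and its interface are not described here.

```python
from typing import Dict, List, Tuple


class QCRepository:
    """In-memory stand-in for a central store of QC L2 results."""

    def __init__(self) -> None:
        self._results: Dict[str, List[Tuple[str, bool]]] = {}

    def store(self, dataset_id: str, site: str, passed: bool) -> None:
        """Called by a site after running its local QC L2 checks."""
        self._results.setdefault(dataset_id, []).append((site, passed))

    def l2_passed(self, dataset_id: str) -> bool:
        """Called by the publication agency before starting QC L3."""
        entries = self._results.get(dataset_id, [])
        # No stored result counts as "not passed".
        return bool(entries) and all(passed for _, passed in entries)


if __name__ == "__main__":
    repo = QCRepository()
    repo.store("example-dataset", "PCMDI", True)  # identifier is made up
    print(repo.l2_passed("example-dataset"))
    print(repo.l2_passed("unknown-dataset"))
```

Treating a missing result as a failure keeps the QC L3 step from proceeding on data that no site has checked.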

Our distributed QC approach consists of different software components:

QC checks:

• QC Run Service: QC tool run and Repository ingests of configuration and results

• QC Services for data analyses and exception statistics

• QC Plotting Service and plot ingest in Repository

• QC L2 assignment

DOI publication:

• Export QC results for DOI publication process

[Figure: multiple sites performing local QC checks (QC Tool, QC Services), a Central Repository (QC Services) with the Project Metadata Repository, and the DOI Publication Agency with long-term archive (QC Tool, QC Services, DOI process).]

[Figure: DOI publication timeline. Actors: Scientist, Publication Agency, Registration Agency. Steps along the TIME axis: Permission: QC L2; Scientific Q. Assurance; Technical Q. Assurance; DOI-Publication.]

Abbreviations: ESG: Earth System Grid, MD: Metadata, DN: Data Node, TDS: Thredds Data Server, TQA: Technical Quality Assurance, SQA: Scientific Quality Assurance.

Organisation of CMIP5 Data Federation

[Figure: ESG Federated Gateways (LLNL PCMDI, BADC, DKRZ WDCC, NASA JPL, NCAR, ANU, NERSC, ORNL) serving Users. Gateway component labels: Community Tools; Metadata Handling; Semantic Data Search & Discovery; Security; Publishing API; User Registration; Harvester; Data Products; Metrics Aggregation; Database. Data nodes: PCMDI, NASA/GISS, GFDL, INM, IPSL, … Data node component labels: Deep Archive; THREDDS Manager; Product; GridFTP; OPeNDAP; Browsing; Publisher; Node Database; TDS Catalogs; Disk Cache; Security Middleware. QC annotations: "QC L1 L2" appears three times and "QC L1 L2 L3" once.]