Data Publication and Quality Control Procedure for CMIP5 / IPCC-AR5 Data
[email protected] WDC Climate / DKRZ: www.wdc-climate.de; www.dkrz.de
CMIP5: cmip-pcmdi.llnl.gov/cmip5; CMIP5 Quality Control: purl.org/org/cmip5/qc
EGU2011-2859Martina Stockhause, Michael Lautenschlager, Heinke Hoeck, and Frank Toussaint
CMIP5 Quality Control Workflow Distributed Quality Control Approach Data Publication Procedure
Future PerspectiveCMIP5 Quality Control (QC)
Three Quality Control (QC) Levelsare defined for CMIP5 data:
Quality Levels for CMIP5
• QC Level 1: Metadata: Technical checks on METAFOR questionnaire input data
Data: CMOR2 and ESG publisher conformance checks• QC Level 2: Metadata: METAFOR questionnaire metadata checked by scientist Data: Technical checks e.g. on the reliability of variable ranges and the consistency checks between data and data requirements• QC Level 3 / DOI: Data approved by author and published as DOI
Data assigned a DOI is formallycitable and is granted persistent access.
The final DOI data publicationprocedure is in agreement with theregulations of the DataCite consortium:
DOI Publication Process
• Scientific Quality Assurance: performed by the data author and documented via a publication service GUI (atarrabi)• Technical Quality Assurance: cross- and double checks of data and metadata integrity• DOI Publication: DataCite DOI metadata and DOI are separately send to the registration agency, a member of the DOI Foundation. Data and DOI remain unchanged and persistent.
• PCMDI / LLNL: data and security infrastructure (ESG)• BADC (British Atmospheric Data Centre): metadata infrastructure (METAFOR / CIM)• WDCC (World Data Center Climate) / DKRZ: quality control, data publication (DataCite DOI)
DNTDS
QCL2
DNTDS
QCL2
ESGGateways
QCRepository
CIMMD Repository
DNTDS
QCL2
PCMDI/BADC/WDCC
CIMQuestionnaire
QC L3 / STD-DOI
WDCC: DOI Publication Agency
MD onModel /
Simulation
MD onData
MD onQuality
Store, Analyse,and Plot Results
DOI DataLong-term
Archive
DOI Catalogue
IDF
MD on Data / Quality /Model / Simulation
Quality Results
TQA
Atarrabi
SQA ofData Author
CERA2MD
RegisterSTD-DOI /
URL of DOI Target Page
Harvestall MD
Harvest all MD oflong-term archive
DOI TargetPage
DOIAccess
Parts of Data /MD on Simulation
CMIP5 Organization & Infrastructure ComponentsFor distribution of data connected to the next IPCC report, the Earth System Grid Federation(ESGF) was founded. Its members have different responsibilities within the data infrastructure:
For CMIP5 ca. 3 PB of officially requested data are expected to be archived. About 1 PB of thatdata will likely be of especially high interest and will be replicated by the three ESGF partners.
Because of the high data volume the QC checks up to level 2 are performed distributed amongthe ESGF. The final QC level 3 checks for the DOI assignment are carried out by WDCC.Afterwards CMIP5 data is formally citable and remains persistent.
.org
The different QC levels are connected with different access rights for registered users:
Quality Control Workflow CMIP5
DOI Publication GUI atarrabi
Actors in DOI Publication ProcessWorkflow of Distributed QC in CMIP5
QC checks for data and metadataare performed, separately, for levels1 and 2. During the cross-checksof QC L3 checks their results arereviewed.
QC is accomplished on DRS Atomic Dataset level. The QC results are aggregated on DRS experiment level. In the gateways data discovery is supported downto the level of Ensemble versions (ESG dataset).
Granularity of Quality Control
• Restricted Access (QC Level 1 Data): After ESG publication the access is restricted under control of the modelling centre.• Scientific Access (QC Level 2): The scientific community is granted access of data of QC L2.• QC Level 3 / DOI: With the DOI assignment the data archive is opened for access by non-scientific users.
For high volume data such as climatedata quality assurance has to becarried out at the data storage centrebefore opening the repository for dataaccess. Data distributed in a Data Gridwith its decentralized data repositorieshave to be checked at different siteswith comparable QC procedures.
The cross- and double checks of theTechnical Quality Assurance make use of the QC result of the precedinglevels. Data as well as metadata isreviewed and data accessibility checked.
QC L3 / DOI Process in CMIP5
More Information: http://purl.org/org/cmip5/qc More Information: http://purl.org/org/cmip5/qc More Information: http://cera-www.dkrz.de/atarrabi
The final author approval step issupported by the GUI atarrabi. Authors check basic metadata and add information about their own quality assurance (Scientific QA).
A DOI is assigned and registered atthe International DOI Foundation (IDF: dx.doi.org) via the Registration Agency DataCite.
DOI Construction Rule for CMIP5:
doi: 10.1594/WDCC/CMIP5.<opaque bit>
Overall CMIP5 QC Workflow
Data NOTformally citableModelling groupcontrol access
manually.
Data NOTformally citable
Automatic accessgranted after filling in
ESGF registration page.
DOI Assigned:Data formally citable
Data can appear in IPCC-DDC------------------------------------------
Automatic accessgranted after filling in
ESGF registration page
Data Published > 10 PBData and Metadata QC L1
--------------------------------On globally distributed
data nodes
MetadataQC L2
passed?
To bereplicated
amongESGF?
DataQC L2
passed?
QC L3passed?
Discarddata
(Informal citation still requested where formal citation not available)
NO NO NO
YES YES YES
YES
Replicated:Copied to
PCMDI, BADC, WDCC& elsewhere
~ 1PB
Granularity of QC in CMIP5 context
<activity> <product> <institute> <model> <experiment>
<frequency>
<modeling realm>
<variable identifier>
<ensemble member>
<netcdf datasets>
CMIP5 Process DRS Name / Hierarchy Level
DOI / QC
Gateway Search
Data Access
<version>
Atomic Dataset
<MIP/CMOR2 table>
The current DOI publication procedure is comparable to the publication of grey literature inscientific print media. For the integration of a peer review process quality procedures acceptedand agreed on by the earth system modelling community are necessary.
The distributed quality control approach could be reduced in complexity by the integration ofthe QC Repository into the CIM Metadata Repository.
Distributed QC Approach
Distributed QC Procedure in CMIP5CMIP5 data is delivered to one of the three ESGFpartners, where it is ESG published and thus QC L1Data checked. Afterwards QC L2 Data consistencychecks are performed, before a data subset isreplicated among the ESGF. QC L2 results arestored in a central QC Repository.
During QC L3 / DOI checks the QC results areaccessed by the DOI Publication Agency WDCC.Other sources for cross- and double-checks arethe CIM Metadata Repository, the Thredds DataServer (TDS), and the metadata stored in the long-term archive at WDCC.
Thus, the effort of the QC L2 Data checks is sharedamong the ESGF. But the QC L3 / DOI checks areperformed at one site making use of the QC L2results stored in a central QC Repository.
Thus a QC procedure/tools have to bedeveloped, maintained and distributedcentrally and agreed upon within thescientific community.
Our distributed QC approach consistsof different software components:
___________________________________________________________________ESG: Earth System Grid, MD: Metadata, DN: Data Node, TDS: Thredds Data Server,TQA: Technical Quality Assurance, SQA: Scientific Quality Assurance.
Reg
istr
atio
nA
gen
cyS
cien
tist
Pu
bli
cati
on
Ag
ency
Permission:QC L2
DOI-Publication
TIME
Scientific Q.Assurance
Technical Q.Assurance
• QC Run Service: QC tool run and Repository ingestsof configuration and results
• QC Services for data analyses and exception statistics
• QC Plotting Service and plot ingest in Repository
• QC L2 assignment
QC checks:
QC Tool
QC Services
DOI Publication Agencywith long-term archive
Multiple Sites performinglocal QC checks
ProjectMetadata
RepositoryQC Tool
QC Services
CentralRepository
QC Tool
QC Services QC Services
DOI process
• Export QC results for DOI publication processDOI publication:
Organisation of CMIP5 Data Federation
CommunityTools
MetadataHandling
Semantic Data Search
& Discovery
Security
PublishingAPI
UserRegistration
Harvester
Data Products
MetricsAggregation
Database
BADC
DKRZWDCC
NASAJPL
NCAR
ANU
NERSC
ORNL
ESG FederatedGateways
UsersUsers
PCMDI Data Node
Deep Archive
THREDDSManager
Product
GridFTP
OPeNDAP
Browsing
Publisher
NodeDatabase
TDSCatalogs
Disk Cache
Secu
rity
Mid
dlew
are
Secu
rity
Mid
dlew
are
NASA/GISSData Node
NASA/GISSData Node
GFDL Data NodeGFDL
Data Node
INMData NodeINM
Data Node
……
IPSL Data NodeIPSL
Data Node
QC L1L2
QC L1L2
QC L1L2L3
LLNLPCMDI
QC L1L2