Standardized Metadata for Human Pathogen/Vector Genomic Sequences

Richard H. Scheuermann, Ph.D., Director of Informatics, J. Craig Venter Institute, on behalf of the GSC-BRC Metadata Working Group


Post on 20-Feb-2016


Page 1: Standardized Metadata for Human Pathogen/Vector Genomic  Sequences

Richard H. Scheuermann, Ph.D.
Director of Informatics
J. Craig Venter Institute

On behalf of the
GSC-BRC Metadata Working Group

Standardized Metadata for Human Pathogen/Vector Genomic Sequences

Page 2: Standardized Metadata for Human Pathogen/Vector Genomic  Sequences

Genome Sequencing Centers for Infectious Disease (GSCID)

Bioinformatics Resource Centers (BRC)

www.viprbrc.org www.fludb.org

Page 3: Standardized Metadata for Human Pathogen/Vector Genomic  Sequences

High Throughput Sequencing

• Enabling technology
  – Epidemiology of outbreaks
  – Pathogen evolution
  – Host range restriction
  – Genetic determinants of virulence and pathogenicity

• Metadata requirements
  – Temporal-spatial information about isolates
  – Selective pressures
  – Host species of specimen source
  – Disease severity and clinical manifestations

Page 4: Standardized Metadata for Human Pathogen/Vector Genomic  Sequences

Metadata Submission Spreadsheets


Page 5: Standardized Metadata for Human Pathogen/Vector Genomic  Sequences

Complex Query Interface

Page 6: Standardized Metadata for Human Pathogen/Vector Genomic  Sequences

Metadata Inconsistencies

• Each project was providing different types of metadata

• No consistent nomenclature being used

• Impossible to perform reliable comparative genomics analysis

• Required extensive custom bioinformatics system development

Page 7: Standardized Metadata for Human Pathogen/Vector Genomic  Sequences

GSC-BRC Metadata Standards Working Group

• NIAID assembled a group of representatives from their three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR) programs

• Develop an approach for capturing standardized metadata for pathogen isolate sequencing projects

• Bottom-up approach to capture data considered to be important by users

• Compatible with data standards and submission requirements

Page 8: Standardized Metadata for Human Pathogen/Vector Genomic  Sequences

Metadata Standardization Process

• Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)

• Identify data fields that appear to be common across projects and samples (core) and data fields that appear to be pathogen or project specific

• For each data field, provide a common set of attributes, including preferred term, definition, synonyms, allowed value sets (preferably using controlled vocabularies), expected syntax, etc.

• Assemble all metadata fields into a semantic network based on the Ontology for Biomedical Investigations (OBI)

• Compare, map, and harmonize to other relevant initiatives, including Genomic Standards Consortium MIxS and NCBI BioProject/BioSample

• Draft data submission spreadsheets

• Beta test version 1.0 standard with new GSCID white paper projects, collecting feedback

• Adopt version 1.1 metadata standard and data submission spreadsheets for all GSCID white paper and BRC-associated projects
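
The per-field attributes listed in the process above (preferred term, definition, synonyms, allowed value set, expected syntax) can be pictured as a small record type. The following is a minimal Python sketch, not the working group's implementation: the class name `FieldSpec`, its attribute names, and the example synonyms, vocabulary and date pattern are all illustrative assumptions.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FieldSpec:
    """Hypothetical description of one metadata field (names are assumptions)."""
    field_id: str                          # e.g. "CS4"
    preferred_term: str                    # e.g. "Specimen Source Gender"
    definition: str = ""
    synonyms: list = field(default_factory=list)
    allowed_values: Optional[set] = None   # controlled vocabulary, if any
    syntax: Optional[str] = None           # regular expression, if any

    def matches_name(self, name: str) -> bool:
        """True if `name` is the preferred term or a known synonym."""
        names = [self.preferred_term] + self.synonyms
        return name.strip().lower() in {n.lower() for n in names}

    def validate(self, value: str) -> bool:
        """Check a value against the vocabulary and syntax, when defined."""
        if self.allowed_values is not None and value not in self.allowed_values:
            return False
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            return False
        return True


# Illustrative instances; the synonyms, vocabulary and pattern are made up.
gender = FieldSpec(
    field_id="CS4",
    preferred_term="Specimen Source Gender",
    synonyms=["host-sex", "sex"],
    allowed_values={"male", "female", "unknown"},
)
collection_date = FieldSpec(
    field_id="CS8",
    preferred_term="Specimen Collection Date",
    synonyms=["collection_date"],
    syntax=r"\d{4}-\d{2}-\d{2}",
)
```

Keeping synonyms next to the preferred term is what lets differently labeled project spreadsheets be matched to the same field before their values are checked.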

Page 9: Standardized Metadata for Human Pathogen/Vector Genomic  Sequences

Core Project

Metadata Field ID | Metadata Field Descriptor | OBO Foundry ID | BioProject/BioSample | MIxS
CP1 | Project Title | http://purl.obolibrary.org/obo/OBI_0001622 | Title | project name
CP2 | Project ID | http://purl.obolibrary.org/obo/OBI_0001628 | |
CP3 | Project Description | http://purl.obolibrary.org/obo/OBI_0001615 | Description |
CP4 | Supporting Grants/Contract ID | http://purl.obolibrary.org/obo/OBI_0001629 | Grant Agency |
CP5 | Publication Citation | http://purl.obolibrary.org/obo/OBI_0001617 | PubMed ID | ref_biomaterial
CP6 | Sample Provider Principal Investigator (PI) Name | | |
CP7 | Sample Provider PI's Institution | | |
CP8 | Sample Provider PI's email | | |
CP9 | Sequencing Facility | | |
CP10 | Sequencing Facility Contact Name | | |
CP11 | Sequencing Facility Contact's Institution | | |
CP12 | Sequencing Facility Contact's email | | |
CP13 | Bioinformatics Resource Center | http://purl.obolibrary.org/obo/OBI_0001626 | |
CP14 | Bioinformatics Resource Center Contact Name | | |
CP15 | Bioinformatics Resource Center Contact's Institution | | |
CP16 | Bioinformatics Resource Center Contact's email | | |
CP17 | Target Material | | Material |
CP18 | Project Method | | Methodology |
CP19 | Project Objectives | | Objective |
CP20 | Sample Scope | | Sample Scope |
CP21 | Target Capture | | Capture |
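
The BioProject column of the Core Project table above amounts to a crosswalk from core project field IDs to BioProject attribute names. A minimal Python sketch of applying it to one record follows. The function name `to_bioproject` and the sample record are hypothetical, and only the fields that have a BioProject equivalent in the table are included.

```python
# Crosswalk copied from the BioProject/BioSample column of the table above.
CP_TO_BIOPROJECT = {
    "CP1": "Title",
    "CP3": "Description",
    "CP4": "Grant Agency",
    "CP5": "PubMed ID",
    "CP17": "Material",
    "CP18": "Methodology",
    "CP19": "Objective",
    "CP20": "Sample Scope",
    "CP21": "Capture",
}


def to_bioproject(record):
    """Rename core project fields to BioProject names.

    Returns (mapped, unmapped): fields with no BioProject equivalent are
    kept under their original IDs so nothing is silently dropped.
    """
    mapped, unmapped = {}, {}
    for field_id, value in record.items():
        if field_id in CP_TO_BIOPROJECT:
            mapped[CP_TO_BIOPROJECT[field_id]] = value
        else:
            unmapped[field_id] = value
    return mapped, unmapped


# Hypothetical record, for illustration only.
example = {
    "CP1": "Example outbreak sequencing project",
    "CP2": "EX-0001",
    "CP18": "Sequencing",
}
mapped, unmapped = to_bioproject(example)
```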

Page 10: Standardized Metadata for Human Pathogen/Vector Genomic  Sequences

Core Sample

Metadata Field ID | Metadata Field Descriptor | OBO Foundry ID | NCBI BioSample | MIxS
CS1 | Specimen Source ID | http://purl.obolibrary.org/obo/OBI_0001141 | host-subject-id | host_subject_id
CS2 | Specimen Source Species | http://purl.obolibrary.org/obo/OBI_0100026 | specific_host | host_taxid
CS3 | Species Source Common Name | | host-common-name | host_common_name
CS4 | Specimen Source Gender | http://purl.obolibrary.org/obo/PATO_0000047 | host-sex | sex
CS5 | Specimen Source Age - Value | http://purl.obolibrary.org/obo/OBI_0001167 | host-age | age
CS6 | Specimen Source Age - Unit | http://purl.obolibrary.org/obo/UO_0000003 | host-age |
CS7 | Specimen Source Health Status | http://purl.obolibrary.org/obo/OGMS_0000022 | host-health-state | disease status
CS8 | Specimen Collection Date | http://purl.obolibrary.org/obo/OBI_0001619 | collection_date | collection date
CS9 | Specimen Collection Location - Latitude | http://purl.obolibrary.org/obo/OBI_0001620 | lat_lon | geographic location (lat and long)
CS10 | Specimen Collection Location - Longitude | http://purl.obolibrary.org/obo/OBI_0001621 | lat_lon | geographic location (lat and long)
CS11 | Specimen Collection Location - Location | http://purl.obolibrary.org/obo/GAZ_00000448 | geo_loc_name |
CS12 | Specimen Collection Location - Country | http://purl.obolibrary.org/obo/OBI_0001627 | geo_loc_name | geographic location (country and/or sea)
CS13 | Specimen ID | http://purl.obolibrary.org/obo/OBI_0001616 | sample name |
CS14 | Specimen Type | http://purl.obolibrary.org/obo/OBI_0001479 | host-tissue-sampled | body habitat, body site, body product
CS15 | Suspected Organism(s) in Specimen - Species | http://purl.obolibrary.org/obo/OBI_0000925 | organism |
CS16 | Suspected Organism(s) in Specimen - Subclass | | strain | subspecific genetic lineage
CS17 | Human Pathogenicity of Suspected Organism(s) in Specimen | http://purl.obolibrary.org/obo/OBI_0000925 | | phenotype
CS18 | Environmental Material | http://purl.obolibrary.org/obo/ENVO_00010483 | isolation-source | environment (material)
CS19 | Organism Detection Method | http://purl.obolibrary.org/obo/OBI_0001624 | | sample collection device or method
CS20 | Specimen Repository | | culture-collection | source material identifiers
CS21 | Specimen Repository Sample ID | | culture-collection | source material identifiers
CS22 | Sample ID - Sequencing Facility | | |
CS23 | Nucleic Acid Extraction Method | http://purl.obolibrary.org/obo/OBI_0666667 | samp_mat_process | sample material processing
CS24 | Nucleic Acid Preparation Method | | samp_mat_process | sample material processing
CS25 | Sequencing Method | http://purl.obolibrary.org/obo/OBI_0600047 | | sequencing method
CS26 | Assembly Algorithm | http://purl.obolibrary.org/obo/OBI_0001522 | | assembly
CS27 | Depth of Coverage - Average | http://purl.obolibrary.org/obo/OBI_0001618 | | finishing strategy
CS28 | Annotation Algorithm | http://purl.obolibrary.org/obo/OBI_0001625 | |
CS29 | GenBank Record ID | http://purl.obolibrary.org/obo/OBI_0001614 | |
CS30 | Comments | http://purl.obolibrary.org/obo/IAO_0000300 | host-description |
CS31 | Specimen Collector Name | | collected-by |
CS32 | Specimen Collector's Institution | | |
CS33 | Specimen Collector's email | | |
CS34 | Sample Category | | attribute_package |
CS35 | Host Disease | | host-disease |
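
Several core sample fields share a single NCBI BioSample attribute in the Core Sample table above: CS5 and CS6 both map to host-age, and CS9 and CS10 both map to lat_lon. The Python sketch below shows one way such pairs could be combined on export. The function names and the output formats (value followed by unit; decimal degrees with N/S and E/W suffixes) are assumptions for illustration, not a statement of what the submission spreadsheets or NCBI require.

```python
def format_host_age(value, unit):
    """Combine CS5 (age value) and CS6 (age unit) into one string."""
    return f"{value} {unit}"


def format_lat_lon(latitude, longitude):
    """Combine CS9 (latitude) and CS10 (longitude), given in signed
    decimal degrees, into a single 'lat N|S lon E|W' string."""
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")
    ns = "N" if latitude >= 0 else "S"
    ew = "E" if longitude >= 0 else "W"
    return f"{abs(latitude):.4f} {ns} {abs(longitude):.4f} {ew}"


def to_biosample(sample):
    """Build BioSample-style attributes from a dict keyed by CS field ID."""
    attrs = {}
    if "CS5" in sample and "CS6" in sample:
        attrs["host-age"] = format_host_age(sample["CS5"], sample["CS6"])
    if "CS9" in sample and "CS10" in sample:
        attrs["lat_lon"] = format_lat_lon(sample["CS9"], sample["CS10"])
    if "CS8" in sample:
        attrs["collection_date"] = sample["CS8"]
    return attrs


# Hypothetical sample values, for illustration only.
attrs = to_biosample({
    "CS5": 34, "CS6": "years",
    "CS8": "2012-03-05",
    "CS9": 39.0840, "CS10": -77.1528,
})
```

Keeping value and unit, or latitude and longitude, as separate core fields makes each one easy to validate on its own; the merge happens only at the point of export.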

Page 11: Standardized Metadata for Human Pathogen/Vector Genomic  Sequences

Metadata Processes

[Figure: semantic network of the metadata processes in a sequencing investigation. Panels: Investigation, Host Characterization, Specimen Isolation, Material Processing, Sequencing Assay, Data Processing. Processes: specimen isolation process, sample processing, sequencing assay, quality assessment assay, assembly, data transformations (image processing, variant detection, serotype marker detection, gene detection), data archiving process. Entities: specimen source (organism or environmental), specimen collector, specimen, microorganism, microorganism genomic NA, enriched NA sample, input sample, reagents, technician, equipment, isolation protocol, primary data, sequence data, genotype/serotype/gene data, sequence data record, GenBank ID, type ID, qualities, temporal-spatial region. Relations: has_input, has_output, has_specification, has_part, has_quality, is_about, denotes, located_in, instance_of.]
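
A semantic network like the one in the Metadata Processes figure can be held as subject-relation-object triples and walked programmatically. The Python sketch below is illustrative only: the relation names (has_input, has_output) and node labels come from the figure, but the specific edges are an assumed reading of it, since the diagram's layout is not preserved in this transcript, and the helper `trace_inputs` is hypothetical.

```python
# Assumed edges: each process has_input / has_output some entity.
TRIPLES = [
    ("specimen isolation process", "has_input", "specimen source"),
    ("specimen isolation process", "has_output", "specimen"),
    ("sample processing", "has_input", "specimen"),
    ("sample processing", "has_output", "enriched NA sample"),
    ("sequencing assay", "has_input", "enriched NA sample"),
    ("sequencing assay", "has_output", "primary data"),
    ("assembly", "has_input", "primary data"),
    ("assembly", "has_output", "sequence data"),
    ("data archiving process", "has_input", "sequence data"),
    ("data archiving process", "has_output", "sequence data record"),
]


def trace_inputs(entity, triples):
    """Walk backwards from an entity to everything it was derived from.

    Returns the upstream entities in order, nearest first.
    """
    # Map each entity to the process that produced it.
    producers = {o: s for s, p, o in triples if p == "has_output"}
    # Map each process to the entities it consumed.
    inputs = {}
    for s, p, o in triples:
        if p == "has_input":
            inputs.setdefault(s, []).append(o)

    upstream, queue, seen = [], [entity], {entity}
    while queue:
        current = queue.pop(0)
        process = producers.get(current)
        if process is None:
            continue  # nothing recorded as producing this entity
        for source in inputs.get(process, []):
            if source not in seen:
                seen.add(source)
                upstream.append(source)
                queue.append(source)
    return upstream


lineage = trace_inputs("sequence data record", TRIPLES)
```

With these assumed edges, tracing a sequence data record leads back through the sequence data, primary data, enriched NA sample and specimen to the specimen source, which is the kind of question a semantic representation is meant to answer.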

Page 12: Standardized Metadata for Human Pathogen/Vector Genomic  Sequences

Specimen Isolation

[Figure: semantic representation of the specimen isolation step. Entities: organism, organism part, environmental material, environment, equipment, person, specimen X, specimen isolation procedure X, isolation protocol, specimen type, specimen isolation procedure type, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location), name, affiliation, ID, hypothesis, IRB/IACUC approval, organism pathogenic disposition. Roles: specimen source role, specimen capture role, specimen collector role. Qualities: gender (CS4), age (CS5/6), health status (CS7). Relations: has_input, has_output, plays, has_specification, has_part, has_quality, has_disposition, has_affiliation, has_authorization, is_about, denotes, located_in, instance_of. Other core sample fields annotated on the diagram: CS1, CS2/3, CS8, CS9/10, CS11/12, CS13, CS14, CS15/16, CS18.]

Page 13: Standardized Metadata for Human Pathogen/Vector Genomic  Sequences

Core Project Semantics

Page 14: Standardized Metadata for Human Pathogen/Vector Genomic  Sequences

Outcome of Metadata Standards WG

• Consistent metadata captured across GSCID

• Bottom-up approach focuses standard on important features

• Support more standardized BRC interface development

• Harmonization with related stakeholders – Genomic Standards Consortium MIxS, OBO Foundry OBI and NCBI BioProject/BioSample

• Represented in the context of an extensible semantic framework

Page 15: Standardized Metadata for Human Pathogen/Vector Genomic  Sequences

Utility of semantic representation

• Identified gaps in data field list (e.g. temporal components)

• Includes logical structure for other, project-specific, data fields - extensible

• Identified gaps in ontology data standards (use case-driven standard development)

• Identified commonalities in data structures (reusable)

• Support for semantic queries and inferential analysis in future

• Ontology-based framework is extensible
  – Sequencing => “omics”

Page 16: Standardized Metadata for Human Pathogen/Vector Genomic  Sequences

Acknowledgements

Bruce Birren (2,b), Lauren Brinkac (1,a), Vincent Bruno (3,c), Elizabeth Caler (1,a), Ishwar Chandramouliswaran (1,a), Sinéad Chapman (2,b), Frank Collins (8,h), Christina Cuomo (2,b), Joana Carneiro Da Silva (3,c), Valentina Di Francesco (4), Vivien Dugan (1,a), Scott Emrich (8,h), Mark Eppinger (3,c), Michael Feldgarden (2,b), Claire Fraser (3,c), W. Florian Fricke (3,c), Maria Giovanni (4), Gloria Giraldo-Calderon (8,h), Omar S. Harb (5,g), Matt Henn (2,b), Erin Hine (3,c), Julie Dunning Hotopp (3,c), Jessica C. Kissinger (6,g), Eun Mi Lee (4), Punam Mathur (4), Garry Myers (3,c), Emmanuel Mongodin (3,c), Cheryl Murphy (2,b), Dan Neafsey (2,b), Karen Nelson (1,a), Ruchi Newman (2,b), William Nierman (1,a), Brett E. Pickett (1,d,e), Julia Puzak (4), David Rasko (3,c), David S. Roos (5,g), Lisa Sadzewica (3,c), Richard H. Scheuermann (1,d,e), Lynn M. Schriml (3,c), Bruno Sobral (7,f), Tim Stockwell (1,a), Chris Stoeckert (5,g), Dan Sullivan (7,f), Luke Tallon (3,c), Herve Tettelin (3,c), Doyle V. Ward (2,b), David Wentworth (1,a), Owen White (3,c), Rebecca Will (7,f), Jennifer Wortman (2,b), Alison Yao (4), Jie Zheng (5,g)

Institutions:
1 J. Craig Venter Institute, Rockville, MD and San Diego, CA
2 Broad Institute, Cambridge, MA
3 Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD
4 National Institute of Allergy and Infectious Diseases, Rockville, MD
5 University of Pennsylvania, Philadelphia, PA
6 University of Georgia, Athens, GA
7 Cyberinfrastructure Division, Virginia Bioinformatics Institute, Blacksburg, VA
8 University of Notre Dame, South Bend, IN

Programs:
a J. Craig Venter Institute Genome Sequencing Center for Infectious Diseases
b Broad Institute Genome Sequencing Center for Infectious Diseases
c Institute for Genome Sciences Genome Sequencing Center for Infectious Diseases
d Influenza Research Database Bioinformatics Resource Center
e Virus Pathogen Resource Bioinformatics Resource Center
f PATRIC Bioinformatics Resource Center
g EuPathDB Bioinformatics Resource Center
h VectorBase Bioinformatics Resource Center

Tanya Barrett – NCBI
Pelin Yilmaz – Genomic Standards Consortium

N01AI2008038 /N01AI40041