the nhgri gwas catalog, a curated resource of snp-trait ... · ‘inflammatory marker...

6
The NHGRI GWAS Catalog, a curated resource of SNP-trait associations Danielle Welter 1 , Jacqueline MacArthur 1 , Joannella Morales 1 , Tony Burdett 1 , Peggy Hall 2 , Heather Junkins 2 , Alan Klemm 3 , Paul Flicek 1 , Teri Manolio 2 , Lucia Hindorff 2, * and Helen Parkinson 1, * 1 European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK, 2 Division of Genomic Medicine, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA and 3 Division of Policy, Communication and Education, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA Received August 15, 2013; Revised November 6, 2013; Accepted November 7, 2013 ABSTRACT The National Human Genome Research Institute (NHGRI) Catalog of Published Genome-Wide Association Studies (GWAS) Catalog provides a publicly available manually curated collection of published GWAS assaying at least 100 000 single- nucleotide polymorphisms (SNPs) and all SNP-trait associations with P <1 Â 10 5 . The Catalog includes 1751 curated publications of 11 912 SNPs. In addition to the SNP-trait association data, the Catalog also publishes a quarterly diagram of all SNP-trait associations mapped to the SNPs’ chromosomal locations. The Catalog can be accessed via a tabular web interface, via a dynamic visualization on the human karyotype, as a downloadable tab-delimited file and as an OWL knowledge base. This article presents a number of recent improvements to the Catalog, including novel ways for users to interact with the Catalog and changes to the curation infrastructure. INTRODUCTION Genome-wide association studies (GWAS) assay at minimum hundreds of thousands of single-nucleotide polymorphisms (SNPs) to identify associations with complex clinical conditions and phenotypic traits. Unlike single gene disorders, such as Huntington’s disease, complex diseases are usually the result of a combination of genetic and environmental factors, each of which in- creases susceptibility to the condition. In the absence of a single causative variant, GWAS aim to identify SNPs in linkage disequilibrium with a gene or other regulatory element that might contain a causal variant contributing to a condition’s aetiology. GWAS are non-candidate gene- driven and use a whole-genome approach in large studies of up to hundreds of thousands of individuals investi- gating traits such as disease, response to drug and anthro- pometry (1). The importance of GWAS in advancing scientific understanding of disease mechanisms and providing starting points for the development of medical treatments is well established. One of the first GWAS identified a polymorphism in complement factor H, part of comple- ment-mediated inflammation, in a study on age-related macular degeneration. This association between the de- generative disease and inflammation pathways has since been verified in other studies and is now showing promising results as a therapeutic target (2). Recent dis- covery of shared loci in diseases previously thought not to have any common aetiology include gene CDKN2A/B in type II diabetes mellitus (3) and myocardial infarction (4) and CDKAL1 in Crohn’s disease (5) and type II diabetes mellitus (6). GWAS data are typically reported in peer-reviewed publications as single studies or meta-analyses, and data deposited in resources such as the European Genome- phenome Archive (http://www.ebi.ac.uk/ega), or dbGaP (7), where users must apply for access to individual-level genotype and phenotype data and comply with a data access agreement. The National Human Genome Research Institute (NHGRI) Catalog of Published GWAS Catalog provides a publicly available manually curated collection of all published GWAS conforming to eligibility criteria including number of SNPs assayed and non-targeted study design. The eligibility criteria are public and available at http://www.genome.gov/ 27529028 (8). This article introduces a number of recent *To whom correspondence should be addressed. Tel: +44 1223 494 672; Email: [email protected] Correspondence may be also addressed to Lucia Hindorff. Tel: +1 301 496 7531; Email: hindorffl@mail.nih.gov Published online 6 December 2013 Nucleic Acids Research, 2014, Vol. 42, Database issue D1001–D1006 doi:10.1093/nar/gkt1229 ß The Author(s) 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] Downloaded from https://academic.oup.com/nar/article-abstract/42/D1/D1001/1062755 by University of Connecticut user on 29 August 2019

Upload: others

Post on 01-Jan-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The NHGRI GWAS Catalog, a curated resource of SNP-trait ... · ‘inflammatory marker measurement’ allows queries for combinations of all inflammatory diseases and all inflammatory

The NHGRI GWAS Catalog a curated resourceof SNP-trait associationsDanielle Welter1 Jacqueline MacArthur1 Joannella Morales1 Tony Burdett1

Peggy Hall2 Heather Junkins2 Alan Klemm3 Paul Flicek1 Teri Manolio2

Lucia Hindorff2 and Helen Parkinson1

1European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory Wellcome TrustGenome Campus Hinxton Cambridge CB10 1SD UK 2Division of Genomic Medicine National HumanGenome Research Institute National Institutes of Health Bethesda MD 20892 USA and 3Division of PolicyCommunication and Education National Human Genome Research Institute National Institutes of HealthBethesda MD 20892 USA

Received August 15 2013 Revised November 6 2013 Accepted November 7 2013

ABSTRACT

The National Human Genome Research Institute(NHGRI) Catalog of Published Genome-WideAssociation Studies (GWAS) Catalog provides apublicly available manually curated collection ofpublished GWAS assaying at least 100 000 single-nucleotide polymorphisms (SNPs) and all SNP-traitassociations with P lt1 105 The Catalog includes1751 curated publications of 11 912 SNPs Inaddition to the SNP-trait association data theCatalog also publishes a quarterly diagram of allSNP-trait associations mapped to the SNPsrsquochromosomal locations The Catalog can beaccessed via a tabular web interface via adynamic visualization on the human karyotype asa downloadable tab-delimited file and as an OWLknowledge base This article presents a number ofrecent improvements to the Catalog including novelways for users to interact with the Catalog andchanges to the curation infrastructure

INTRODUCTION

Genome-wide association studies (GWAS) assay atminimum hundreds of thousands of single-nucleotidepolymorphisms (SNPs) to identify associations withcomplex clinical conditions and phenotypic traits Unlikesingle gene disorders such as Huntingtonrsquos diseasecomplex diseases are usually the result of a combinationof genetic and environmental factors each of which in-creases susceptibility to the condition In the absence ofa single causative variant GWAS aim to identify SNPs inlinkage disequilibrium with a gene or other regulatory

element that might contain a causal variant contributingto a conditionrsquos aetiology GWAS are non-candidate gene-driven and use a whole-genome approach in large studiesof up to hundreds of thousands of individuals investi-gating traits such as disease response to drug and anthro-pometry (1)The importance of GWAS in advancing scientific

understanding of disease mechanisms and providingstarting points for the development of medical treatmentsis well established One of the first GWAS identified apolymorphism in complement factor H part of comple-ment-mediated inflammation in a study on age-relatedmacular degeneration This association between the de-generative disease and inflammation pathways has sincebeen verified in other studies and is now showingpromising results as a therapeutic target (2) Recent dis-covery of shared loci in diseases previously thought not tohave any common aetiology include gene CDKN2AB intype II diabetes mellitus (3) and myocardial infarction (4)and CDKAL1 in Crohnrsquos disease (5) and type II diabetesmellitus (6)GWAS data are typically reported in peer-reviewed

publications as single studies or meta-analyses and datadeposited in resources such as the European Genome-phenome Archive (httpwwwebiacukega) or dbGaP(7) where users must apply for access to individual-levelgenotype and phenotype data and comply with a dataaccess agreement The National Human GenomeResearch Institute (NHGRI) Catalog of PublishedGWAS Catalog provides a publicly available manuallycurated collection of all published GWAS conforming toeligibility criteria including number of SNPs assayed andnon-targeted study design The eligibility criteria arepublic and available at httpwwwgenomegov27529028 (8) This article introduces a number of recent

To whom correspondence should be addressed Tel +44 1223 494 672 Email parkinsonebiacukCorrespondence may be also addressed to Lucia Hindorff Tel +1 301 496 7531 Email hindorfflmailnihgov

Published online 6 December 2013 Nucleic Acids Research 2014 Vol 42 Database issue D1001ndashD1006doi101093nargkt1229

The Author(s) 2013 Published by Oxford University PressThis is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (httpcreativecommonsorglicensesby-nc30) which permits non-commercial re-use distribution and reproduction in any medium provided the original work is properly cited For commercialre-use please contact journalspermissionsoupcom

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

improvements including the automation of SNP extrac-tion from papers increasing the number of SNPsincluded per paper review and re-annotation of traitdesignations using a dedicated GWAS ontology deliveryof a new automated GWAS visualization delivery ofautomated PubMed searching to detect eligible papersand implementation of a curation tracking systemTogether these features support the curatorial staff in de-livering more data for the Catalog improve integrationwith other resources and provide improved visualizationand queries for users

CATALOG DATA

The GWAS Catalog data are extracted from the literatureExtracted information includes publication informationstudy cohort information such as cohort size country ofrecruitment and subject ethnicity and SNP-disease associ-ation information including SNP identifier (ie ReferenceSNP cluster ID (RSID)) P-value gene and risk alleleEach study is also assigned a trait that best representsthe phenotype under investigation When multiple traitsare analysed in the same study either multiple entries arecreated or individual SNPs are annotated with theirspecific traits Traits are used both to query and visualizethe data in the Catalogrsquos web form and diagram-basedquery interfacesData extraction and curation for the GWAS Catalog is

an expert activity each step is performed by scientistssupported by a web-based tracking and data entrysystem which allows multiple curators to searchannotate verify and publish the Catalog data Papersthat qualify for inclusion in the Catalog are identifiedthrough weekly PubMed searches and then undergo twolevels of curation A detailed overview of the curationprocess can be found on the GWAS diagram website(wwwebiacukfgptgwascurationtab)Extraction and verification of data from the literature is

time-consuming GWAS include initial and replicationdata sets and these are often not clearly separated inpapers Accurate description of the phenotype is alsocomplex as disease markers may be measured andimplicit inferences are made by the authors from theseOne of the most challenging areas of curation is the con-sistent extraction of ethnicity data Many GWAS papersdo not provide detailed descriptions of the ethnic back-ground of their study cohorts and even today there is nostandardized way of reporting ethnicity In some casesinferences about the ethnicity of study subjects can bemade based on the country of origin or recruitment usingcensus information provided by the CIA world fact book(9) For example subjects recruited in East Asian countriescan usually be assumed to be of East Asian ancestry orsubjects recruited in Scandinavian countries of Europeanancestry However such inferences cannot necessarily bemade in more ethnically diverse countries such as the USAor the UK which is where a disproportionately largenumber of GWAS are carried outAs the number of eligible published GWAS continues to

rise (Figure 1) it proved essential to automate the

curation process Automation was achieved in 2013 bydeploying an infrastructure to perform automaticsearches of PubMed tracking of papers through thecuratorial pipeline and for eligible papers automaticextraction of citation information and batch extractionof SNP RSIDs traits and P-value information frompapers Ease of extraction of large data sets wasimproved by the development of a spreadsheet-basedbatch SNP extractor to process all SNPs for one papersimultaneously Historically it was necessary for practicalreasons to limit the number of extracted SNPs to 50 perpaper The SNP extractor allows curators to processall eligible SNP associations typically those with aP lt1 105 in a paper and has increased the numberof SNPs curated and improved processing time forcurators as studies typically report larger numbers ofSNPs and traits over time Papers that were previouslysubject to the 50-SNP limit have been recently re-curated to extract the additional SNPs in a smallnumber of papers gt1000 significant SNPs were identifiedthus greatly enriching the Catalogrsquos content

A recent Catalog improvement has been to structure thetrait annotations using an ontology and to use ontologyand semantic web technologies to automate the process ofproviding the karyotype visualization This involveddeveloping an ontology to describe the traits a knowledgebase to contain the data verifying all previous trait anno-tations and mapping ontology terms to them beforedelivering a new query interface for visualizing andquerying the data

ONTOLOGY

The utility of ontologies for providing formal datadescription frameworks in the biological domain is wellestablished (10) This applies particularly to ontologiesdeveloped in an expressive knowledge representationlanguage like OWL (11) whose rich semantic capabilitiesallow a much more natural modelling of complex conceptsthrough axiomatization This in turn allows for muchricher querying than simple string searching Mapping toontologies also greatly facilitates data integration acrossheterogeneous data sources such as data extracted fromthe scientific literature Until recently the traits in theGWAS Catalog were available as an unstructured flatlist allowing only querying through direct stringmatching and therefore limiting the Catalogrsquos potentialboth as a stand-alone resource and as an integral part ofa wider network of genomic resources Producing consist-ent high-quality phenotype mappings represents an essen-tial yet challenging task There is a high level of variationbetween studies in what constitutes a phenotype and howa phenotype is described For example a specific pheno-type or disease may be studied explicitly or a phenotypeconsidered to be a marker for disease eg measurementsof blood glucose concentration and body weight as riskfactors for type II diabetes To allow consistent groupingsof studies and comparisons of results among studies it isessential that all mappings are consistent but withoutmaking potentially incorrect assumptions For example

D1002 Nucleic Acids Research 2014 Vol 42 Database issue

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

a study exploring the effect of SNP level variation onblood glucose levels may use this as a risk indicator fortype II diabetes but unless this is explicitly stated amapping to type II diabetes would be incorrect

We evaluated several domain ontologies for coverage ofthe GWAS data including Medical Subject Headings(MeSH) the Human Phenotype Ontology (12) theDisease Ontology (13) and the Experimental FactorOntology (EFO) (14) No single ontology addresses allthe representational needs of the GWAS CatalogHowever EFO has branches that can be extended toinclude new terms and allows modelling of tests diseasesanatomy and anthropometry imports terms from availabledomain-specific ontologies and uses axiomatization tomaintain its structure Many MeSH terms were approxi-mately mapped and did not provide the precisionrequired by the curators For example the annotationlsquoC-reactive protein levelsrsquo in the Catalog was mapped toMeSH lsquoD002097 C-reactive proteinrsquo But the implicit infor-mation in the Catalog annotation is that the protein levelsare an indication of inflammatory response Whenmodelled in EFO a new term EFO lsquoEFO_0004458C-reactive protein measurementrsquo with a parent classlsquoinflammatory marker measurementrsquo allows queries forcombinations of all inflammatory diseases and allinflammatory markers and provides a more detailedquery result for the user

EFO is an application ontology that uses multiple refer-ence ontologies such as chemical entities of biologicalinterest (15) to produce a single ontology which can beviewed under one hierarchy and can be applied to visualizea data set Current GWAS traits range from simple diseasedescriptions such as lsquobreast cancerrsquo to measurements suchas lsquowaist-hip ratiorsquo or lsquoliver enzyme levelsrsquo to compound

traits like lsquobody mass in chronic obstructive pulmonarydiseasersquo which are hard to represent in an ontology andtrait definitions are often context-dependent EFO not onlycontains disease categories but also phenotypic descrip-tions compound treatments and so forth Traits notpresent in EFO were added either by importing existingterms from reference ontologies including the GeneOntology (GO) (16) and the Human Phenotype Ontology(12) or creating new concepts locally or by requestingthese from external ontologies A new TermGenietemplate was provided by the GO editors to provide therequired lsquoresponse to drugrsquo terms for the Catalog Theprocess of building the ontology improved both the con-sistency of term assignment for existing studies which werereviewed and the query capability for the CatalogThe availability of high-quality mappings between

GWAS traits and ontology terms facilitates the integrationof the GWAS Catalog with other resources The Catalog isused by Ensembl (17) through direct import of the tab-delimited file of all the Catalog content that can be down-loaded from the Catalog website Ensembl also is in theprocess of mapping phenotypes to EFO so adding theexisting GWAS mappings to Ensembl would immediatelyadd an entirely new layer of connectivity between conceptsthat are annotated with the same EFO terms This is par-ticularly useful at the cross-species mapping level wheremapping genetic and variation data from different speciessuch as mouse and human to common phenotypes providesa new way of identifying related genesA schema ontology was designed to model the key

concepts represented in the Catalog such as trait SNPstudy gene and chromosome and relationships that linkthese concepts eg lsquoSNP part of genersquo or lsquostudy describestrait associationrsquo This schema ontology forms the basis of

Figure 1 Studies traits and SNP-trait associations from 2005ndash2013 reveal the growth in eligible studies

Nucleic Acids Research 2014 Vol 42 Database issue D1003

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

an OWL knowledge base containing all the data in theGWAS Catalog in a format that can be processed by anontology reasoner and queried By representing the GWASCatalog as an OWL knowledge base and reasoning overthe asserted axioms it is possible to make inferences aboutthe SNP-trait associations that are not possible in the rela-tional database version of the Catalog This enables moreexpressive queries at different levels of granularity and theability to detect errors and inconsistencies in the dataWhere previously a comparison between gastric and oe-sophageal cancers would have required either all terms tobe enumerated and searched separately or a high-levelsearch for all cancer-related traits possible queries of theknowledge base range from lsquoFind all SNPs that areassociated with gastric cancerrsquo to lsquoFind all SNPs that areassociated with cancers located in the upper digestive tractrsquoto lsquoFind all SNPs that are associated with cancerrsquoSynonyms are both imported from reference ontologiesand added locally to support spelling variations andplurals The knowledge base currently contains the datathat are publicly available in the Catalog To effectivelyrepresent the ethnicity information extracted by curatorsan ethnicity ontology is currently under developmentFurthermore by extending the schema ontology toinclude additional concepts it would be relatively easy toinclude extra information in a knowledge base such aslinkage disequilibrium information or population-specificinformation in the future Equally other ontologies couldbe imported without much difficulty to allow visualizationof the data using alternative terminologies

DATA ACCESS

GWAS Catalog data can be accessed in three ways (i) Viathe web interface hosted at the NHGRI which providesaccess to data via a search form including traits and studypublication information as well as a tab-delimited file isavailable for download (ii) Via a new dynamic queryinterface of traits visualized on the human karyotypelinked to literature SNPs in Ensembl and ontologyterms in the Experimental Factor Ontology was imple-mented (described later) (iii) GWAS Catalog data areavailable through commonly used data portals such asEnsembl (17) UCSC Genome Browser (18) andPheGenI (19) and many other community toolsThe GWAS diagram shown in Figure 2 is a visualiza-

tion of all SNP-trait associations in the GWAS Catalogwith the P lt 5 108 mapped onto the human karyo-type Each dot represents an association between a SNP(or chromosomal region if multiple SNPs are in proxim-ity) and a disease or trait It has been produced quarterlyby the GWAS Catalog team since 2006 Until mid-2012this was done manually with the help of a medical illus-trator and required a considerable number of personhours each month The result was a static imagedistributed in PDF and PowerPoint formats Each traitassociation on the diagram was represented by a circleof a different colour While this generated an iconicdiagram for the Catalog the last published version ofthe hand-drawn diagram contained in excess of 200

colours To maintain visual consistency in the redevelopeddiagram the new diagram retains the fundamental prin-ciples of the manually drawn version by reusing the samechromosomal ideograms and mapping SNP-trait associ-ations to the level of cytogenetic bands The crucialchange from the old diagram is the design of the newcolour scheme based on 17 colours representing largerempirically determined and ontology-based traitcategories The high-level categories shown in thediagram were determined by analysis of the query logsto capture the most popular search terms and assignmentof colours by number of traits across the ontology hier-archy to ensure that colour distribution made the diagramcomprehensible Figure 2 shows that users can visualize allSNPs by a high-level term such as lsquometabolic diseasersquo andlinks are available to Ensembl the Catalog interface andPubMed Data can also be visualized by typing a querysuch as lsquoC-reactive proteinrsquo and using the NCBOautocomplete widget (20) An animation of SNPs andtraits per Catalog release and downloadable images forcommon queries are provided for user download asthese are often used in slide presentations It is possibleto generate a customized diagram pertaining to a categoryof interest eg cancer and identify any zones of particu-larly high density for that trait The nature of thecategories gives an interesting insight into which GWAStraits produce an especially high number of strongSNP-trait associations eg the category lsquoliver enzymemeasurementrsquo has a similar number of traits on thediagram as the much broader category lsquocardiovasculardiseasersquo The visualization and ontology in combinationallows users to access rich genomic information pertainingto a trait by a simple query for trait name eg Alzheimerrsquosdisease using the hierarchy eg all nervous systemdisease and the diagram links to Ensembl for RSIDsallowing access to genomic context and population levelinformation

CONCLUSION

We present the GWAS Catalog supporting infrastructureand new query curatorial and visualization toolsThe mapping of complex phenotypic traits to theExperimental Factor Ontology facilitates cross-studycomparisons within the Catalog as well as the integrationof Catalog data with other resources It also supports thegeneration of a semantic web technology-driven inter-active visualization of the GWAS Catalog data andoffers community access A number of improvements todata holdings extraction processes and data access areplanned based on recruitment of community feedback

A user survey (performed at ASHG in 2012 and byemailing owners of data sets included in the Catalog)provided useful community feedback and allowed priori-tization of future developments including the provisionand visualization of trait information at the SNP levelinclusion of queryable ethnicity information and enhance-ments to the usability of the web interfaces such as de-ployment of the ontology in the form-based web portalTo ensure that the infrastructure and curation processes

D1004 Nucleic Acids Research 2014 Vol 42 Database issue

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

scale in the future we are investigating inclusion of textmining to support curation for example to more easilyidentify ethnicity and population information and use of atriple stores for data storage to improve performance asdata volumes grow

A major challenge for the Catalog is the increasing com-plexity of studies eg traits are more often compoundgene by environment studies are available and in thepast year several gene by gene interaction studies wereincluded in the Catalog Therefore the Catalog

Figure 2 The interactive GWAS diagram is a visualization of all SNP-trait associations with P lt5 108 mapped to the SNPrsquos cytogenetic bandVisualizations of SNPs with metabolic or immune system disease are highlighted Hovering over a SNP displays the trait name clicking on it returnsthe individual SNP-trait associations including P-value and publication as well as links to the relevant entry in the GWAS Catalog Europe PMCand Ensembl

Nucleic Acids Research 2014 Vol 42 Database issue D1005

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

infrastructure must evolve to support these cases improvedata extraction times and address changing technologiesfor GWA studies including the use of next-generationsequencing technology A recent webinar organized bythe Catalog explored these issues with a diverse range ofresearchers and provides background information oncommunity use of and future directions for the Catalog(httptinyurlcomnbxxm2e)Future technical developments include further enhance-

ments to ethnicity extraction including the development ofan ontology to model ethnicity improvements to thediagram to allow richer querying and filtering eg by dateranges publication and a combination of these Thediagram will also support better linking to externalresources (Ensembl dbSNP and PubMed) and moreprecise modelling of trait associations to distinguishbetween the case where a single SNP is associated withmultiple traits (myocardial infarction and high bloodpressure) andthecasewhereaSNPisshowntobeassociatedwith a trait in the context of a condition (eg myocardialinfarction in a cohort of patients with high bloodpressure) We plan to publish widgets that allow views ofthe GWAS diagram to be embedded in other resourcesFinally we would also like to improve the GWAS Catalogsearch interface to support ontology-driven querying andbetter integrate the visualization with the web form-basedinterface by developing a new user interface

ACKNOWLEDGEMENTS

The authors thank Francis Rowland (EMBL-EBI) foradvice on interface development and usability MaryTodd Bergman and Spencer Phillips (EMBL-EBIOutreach) for assistance with figures Fiona Cunningham(EMBL-EBI) and many users of the Catalog for theirfeedback They also thank their colleagues at the NCBIand Ensembl for distribution of data via their resourcesand feedback They thank Darryl Leja for advice on devel-opment of the karyotype visualization and Mike Feolo andMing Xu (NCBI) for informatics support

FUNDING

The National Institutes of Health [3U41-HG006104 toDW and JM-A] and EMBL core funds supporting(to PF HP and TB) EMBL Core Funds and by theHealth Thematic Area of the Cooperation Programme ofthe European Commission within the VII FrameworkProgramme for Research and Technological Developmentsupported (to JM) Grant number [200754] Gen2Phenconducted this work as employees of the NHGRI NIH(by HAJ PH KAH TAM and LAH) Fundingfor open access charge NIH

Conflict of interest statement None declared

REFERENCES

1 ManolioTA (2009) Cohort studies and the genetics of complexdisease Nat Genet 41 5ndash6

2 TroutbeckR Al-QureshiS and GuymerRH (2012) Therapeutictargeting of the complement system in age-related maculardegeneration a review Clin Exp Ophthalmol 40 18ndash26

3 VoightB ScottL SteinthorsdottirV MorrisA DinaCWelchR ZegginiE HuthC AulchenkoY ThorleifssonGet al (2010) Twelve type 2 diabetes susceptibility loci identifiedthrough large-scale association analysis Nat Genet 42 579ndash589

4 SchunkertH KonigIR KathiresanS ReillyMPAssimesTL HolmH PreussM StewartAFR BarbalicMGiegerC et al (2011) Large-scale association analysis identifies13 new susceptibility loci for coronary artery disease CircCardiovasc Genet 43 333ndash338

5 FrankeA McGovernDPB BarrettJC WangK Radford-SmithGL AhmadT LeesCW BalschunT LeeJRobertsR et al (2010) Genome-wide meta-analysis increases to71 the number of confirmed Crohnrsquos disease susceptibility lociNat Genet 42 1118ndash1125

6 SaxenaR VoightBF LyssenkoV BurttNP de BakkerPIWChenH RoixJJ KathiresanS HirschhornJN DalyMJ et al(2007) Genome-wide association analysis identifies loci for type 2diabetes and triglyceride levels Science 316 1331ndash1336

7 MailmanMD FeoloM JinY KimuraM TrykaKBagoutdinovR HaoL KiangA PaschallJ PhanL et al(2007) The NCBI dbGaP database of genotypes and phenotypesNat Genet 39 1181ndash1186

8 HindorffLA SethupathyP JunkinsHA RamosEMMehtaJP CollinsFS and ManolioTA (2009) Potentialetiologic and functional implications of genome-wide associationloci for human diseases and traits Proc Natl Acad Sci USA106 9362ndash9367

9 The World Factbook 2013-14 (2013) Washington DC CentralIntelligence Agency

10 SmithB AshburnerM RosseC BardJ BugW CeustersWGoldbergLJ EilbeckK IrelandA MungallCJ et al (2007)The OBO Foundry coordinated evolution of ontologies tosupport biomedical data integration Nat Biotechnol 251251ndash1255

11 SmithMK WeltyC and McGuinnessDL (2009) Web ontologylanguage In W3C Recommendation

12 RobinsonPN and MundlosS (2010) The human phenotypeontology Clin Genet 77 525ndash534

13 SchrimlLM ArzeC NadendlaS ChangY-WWMazaitisM FelixV FengG and KibbeWA (2012) DiseaseOntology a backbone for disease semantic integration NucleicAcids Res 40 D940ndashD946

14 MaloneJ RaynerTF Zheng BradleyX and ParkinsonH(2008) Developing an application focused experimental factorontology embracing the OBO Community In Proceedingsof the Eleventh Annual Bioontologies Meeting Toronto Canada

15 De MatosP DekkerA EnnisM HastingsJ HaugKTurnerS and SteinbeckC (2010) ChEBI a chemistry ontologyand database J Cheminform 2 P6

16 AshburnerM BallCA BlakeJA BotsteinD ButlerHCherryJM DavisAP DolinskiK DwightSS EppigJTet al (2000) Gene ontology tool for the unification ofbiology The Gene Ontology Consortium Nat Genet 2525ndash29

17 FlicekP AmodeMR BarrellD BealK BrentSCarvalho-SilvaD ClaphamP CoatesG FairleySFitzgeraldS et al (2012) Ensembl 2012 Nucleic Acids Res 40D84ndashD90

18 KarolchikD HinrichsAS and KentWJ (2012) The UCSCGenome Browser Current Protocols in Bioinformatics 40141ndash1433

19 RamosEM HoffmanD JunkinsHA MaglottD PhanLSherryST FeoloM and HindorffLA (2013) Phenotype-Genotype Integrator (PheGenI) synthesizing genome-wideassociation study (GWAS) data with existing genomic resourcesEur J Hum Genet EJHG 22 144ndash147

20 WhetzelPL NoyNF ShahNH AlexanderPR NyulasCTudoracheT and MusenMA (2011) BioPortal enhancedfunctionality via new Web services from the National Center forBiomedical Ontology to access and use ontologies in softwareapplications Nucleic Acids Res 39 W541ndashW545

D1006 Nucleic Acids Research 2014 Vol 42 Database issue

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

Page 2: The NHGRI GWAS Catalog, a curated resource of SNP-trait ... · ‘inflammatory marker measurement’ allows queries for combinations of all inflammatory diseases and all inflammatory

improvements including the automation of SNP extrac-tion from papers increasing the number of SNPsincluded per paper review and re-annotation of traitdesignations using a dedicated GWAS ontology deliveryof a new automated GWAS visualization delivery ofautomated PubMed searching to detect eligible papersand implementation of a curation tracking systemTogether these features support the curatorial staff in de-livering more data for the Catalog improve integrationwith other resources and provide improved visualizationand queries for users

CATALOG DATA

The GWAS Catalog data are extracted from the literatureExtracted information includes publication informationstudy cohort information such as cohort size country ofrecruitment and subject ethnicity and SNP-disease associ-ation information including SNP identifier (ie ReferenceSNP cluster ID (RSID)) P-value gene and risk alleleEach study is also assigned a trait that best representsthe phenotype under investigation When multiple traitsare analysed in the same study either multiple entries arecreated or individual SNPs are annotated with theirspecific traits Traits are used both to query and visualizethe data in the Catalogrsquos web form and diagram-basedquery interfacesData extraction and curation for the GWAS Catalog is

an expert activity each step is performed by scientistssupported by a web-based tracking and data entrysystem which allows multiple curators to searchannotate verify and publish the Catalog data Papersthat qualify for inclusion in the Catalog are identifiedthrough weekly PubMed searches and then undergo twolevels of curation A detailed overview of the curationprocess can be found on the GWAS diagram website(wwwebiacukfgptgwascurationtab)Extraction and verification of data from the literature is

time-consuming GWAS include initial and replicationdata sets and these are often not clearly separated inpapers Accurate description of the phenotype is alsocomplex as disease markers may be measured andimplicit inferences are made by the authors from theseOne of the most challenging areas of curation is the con-sistent extraction of ethnicity data Many GWAS papersdo not provide detailed descriptions of the ethnic back-ground of their study cohorts and even today there is nostandardized way of reporting ethnicity In some casesinferences about the ethnicity of study subjects can bemade based on the country of origin or recruitment usingcensus information provided by the CIA world fact book(9) For example subjects recruited in East Asian countriescan usually be assumed to be of East Asian ancestry orsubjects recruited in Scandinavian countries of Europeanancestry However such inferences cannot necessarily bemade in more ethnically diverse countries such as the USAor the UK which is where a disproportionately largenumber of GWAS are carried outAs the number of eligible published GWAS continues to

rise (Figure 1) it proved essential to automate the

curation process Automation was achieved in 2013 bydeploying an infrastructure to perform automaticsearches of PubMed tracking of papers through thecuratorial pipeline and for eligible papers automaticextraction of citation information and batch extractionof SNP RSIDs traits and P-value information frompapers Ease of extraction of large data sets wasimproved by the development of a spreadsheet-basedbatch SNP extractor to process all SNPs for one papersimultaneously Historically it was necessary for practicalreasons to limit the number of extracted SNPs to 50 perpaper The SNP extractor allows curators to processall eligible SNP associations typically those with aP lt1 105 in a paper and has increased the numberof SNPs curated and improved processing time forcurators as studies typically report larger numbers ofSNPs and traits over time Papers that were previouslysubject to the 50-SNP limit have been recently re-curated to extract the additional SNPs in a smallnumber of papers gt1000 significant SNPs were identifiedthus greatly enriching the Catalogrsquos content

A recent Catalog improvement has been to structure thetrait annotations using an ontology and to use ontologyand semantic web technologies to automate the process ofproviding the karyotype visualization This involveddeveloping an ontology to describe the traits a knowledgebase to contain the data verifying all previous trait anno-tations and mapping ontology terms to them beforedelivering a new query interface for visualizing andquerying the data

ONTOLOGY

The utility of ontologies for providing formal datadescription frameworks in the biological domain is wellestablished (10) This applies particularly to ontologiesdeveloped in an expressive knowledge representationlanguage like OWL (11) whose rich semantic capabilitiesallow a much more natural modelling of complex conceptsthrough axiomatization This in turn allows for muchricher querying than simple string searching Mapping toontologies also greatly facilitates data integration acrossheterogeneous data sources such as data extracted fromthe scientific literature Until recently the traits in theGWAS Catalog were available as an unstructured flatlist allowing only querying through direct stringmatching and therefore limiting the Catalogrsquos potentialboth as a stand-alone resource and as an integral part ofa wider network of genomic resources Producing consist-ent high-quality phenotype mappings represents an essen-tial yet challenging task There is a high level of variationbetween studies in what constitutes a phenotype and howa phenotype is described For example a specific pheno-type or disease may be studied explicitly or a phenotypeconsidered to be a marker for disease eg measurementsof blood glucose concentration and body weight as riskfactors for type II diabetes To allow consistent groupingsof studies and comparisons of results among studies it isessential that all mappings are consistent but withoutmaking potentially incorrect assumptions For example

D1002 Nucleic Acids Research 2014 Vol 42 Database issue

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

a study exploring the effect of SNP level variation onblood glucose levels may use this as a risk indicator fortype II diabetes but unless this is explicitly stated amapping to type II diabetes would be incorrect

We evaluated several domain ontologies for coverage ofthe GWAS data including Medical Subject Headings(MeSH) the Human Phenotype Ontology (12) theDisease Ontology (13) and the Experimental FactorOntology (EFO) (14) No single ontology addresses allthe representational needs of the GWAS CatalogHowever EFO has branches that can be extended toinclude new terms and allows modelling of tests diseasesanatomy and anthropometry imports terms from availabledomain-specific ontologies and uses axiomatization tomaintain its structure Many MeSH terms were approxi-mately mapped and did not provide the precisionrequired by the curators For example the annotationlsquoC-reactive protein levelsrsquo in the Catalog was mapped toMeSH lsquoD002097 C-reactive proteinrsquo But the implicit infor-mation in the Catalog annotation is that the protein levelsare an indication of inflammatory response Whenmodelled in EFO a new term EFO lsquoEFO_0004458C-reactive protein measurementrsquo with a parent classlsquoinflammatory marker measurementrsquo allows queries forcombinations of all inflammatory diseases and allinflammatory markers and provides a more detailedquery result for the user

EFO is an application ontology that uses multiple refer-ence ontologies such as chemical entities of biologicalinterest (15) to produce a single ontology which can beviewed under one hierarchy and can be applied to visualizea data set Current GWAS traits range from simple diseasedescriptions such as lsquobreast cancerrsquo to measurements suchas lsquowaist-hip ratiorsquo or lsquoliver enzyme levelsrsquo to compound

traits like lsquobody mass in chronic obstructive pulmonarydiseasersquo which are hard to represent in an ontology andtrait definitions are often context-dependent EFO not onlycontains disease categories but also phenotypic descrip-tions compound treatments and so forth Traits notpresent in EFO were added either by importing existingterms from reference ontologies including the GeneOntology (GO) (16) and the Human Phenotype Ontology(12) or creating new concepts locally or by requestingthese from external ontologies A new TermGenietemplate was provided by the GO editors to provide therequired lsquoresponse to drugrsquo terms for the Catalog Theprocess of building the ontology improved both the con-sistency of term assignment for existing studies which werereviewed and the query capability for the CatalogThe availability of high-quality mappings between

GWAS traits and ontology terms facilitates the integrationof the GWAS Catalog with other resources The Catalog isused by Ensembl (17) through direct import of the tab-delimited file of all the Catalog content that can be down-loaded from the Catalog website Ensembl also is in theprocess of mapping phenotypes to EFO so adding theexisting GWAS mappings to Ensembl would immediatelyadd an entirely new layer of connectivity between conceptsthat are annotated with the same EFO terms This is par-ticularly useful at the cross-species mapping level wheremapping genetic and variation data from different speciessuch as mouse and human to common phenotypes providesa new way of identifying related genesA schema ontology was designed to model the key

concepts represented in the Catalog such as trait SNPstudy gene and chromosome and relationships that linkthese concepts eg lsquoSNP part of genersquo or lsquostudy describestrait associationrsquo This schema ontology forms the basis of

Figure 1 Studies traits and SNP-trait associations from 2005ndash2013 reveal the growth in eligible studies

Nucleic Acids Research 2014 Vol 42 Database issue D1003

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

an OWL knowledge base containing all the data in theGWAS Catalog in a format that can be processed by anontology reasoner and queried By representing the GWASCatalog as an OWL knowledge base and reasoning overthe asserted axioms it is possible to make inferences aboutthe SNP-trait associations that are not possible in the rela-tional database version of the Catalog This enables moreexpressive queries at different levels of granularity and theability to detect errors and inconsistencies in the dataWhere previously a comparison between gastric and oe-sophageal cancers would have required either all terms tobe enumerated and searched separately or a high-levelsearch for all cancer-related traits possible queries of theknowledge base range from lsquoFind all SNPs that areassociated with gastric cancerrsquo to lsquoFind all SNPs that areassociated with cancers located in the upper digestive tractrsquoto lsquoFind all SNPs that are associated with cancerrsquoSynonyms are both imported from reference ontologiesand added locally to support spelling variations andplurals The knowledge base currently contains the datathat are publicly available in the Catalog To effectivelyrepresent the ethnicity information extracted by curatorsan ethnicity ontology is currently under developmentFurthermore by extending the schema ontology toinclude additional concepts it would be relatively easy toinclude extra information in a knowledge base such aslinkage disequilibrium information or population-specificinformation in the future Equally other ontologies couldbe imported without much difficulty to allow visualizationof the data using alternative terminologies

DATA ACCESS

GWAS Catalog data can be accessed in three ways (i) Viathe web interface hosted at the NHGRI which providesaccess to data via a search form including traits and studypublication information as well as a tab-delimited file isavailable for download (ii) Via a new dynamic queryinterface of traits visualized on the human karyotypelinked to literature SNPs in Ensembl and ontologyterms in the Experimental Factor Ontology was imple-mented (described later) (iii) GWAS Catalog data areavailable through commonly used data portals such asEnsembl (17) UCSC Genome Browser (18) andPheGenI (19) and many other community toolsThe GWAS diagram shown in Figure 2 is a visualiza-

tion of all SNP-trait associations in the GWAS Catalogwith the P lt 5 108 mapped onto the human karyo-type Each dot represents an association between a SNP(or chromosomal region if multiple SNPs are in proxim-ity) and a disease or trait It has been produced quarterlyby the GWAS Catalog team since 2006 Until mid-2012this was done manually with the help of a medical illus-trator and required a considerable number of personhours each month The result was a static imagedistributed in PDF and PowerPoint formats Each traitassociation on the diagram was represented by a circleof a different colour While this generated an iconicdiagram for the Catalog the last published version ofthe hand-drawn diagram contained in excess of 200

colours To maintain visual consistency in the redevelopeddiagram the new diagram retains the fundamental prin-ciples of the manually drawn version by reusing the samechromosomal ideograms and mapping SNP-trait associ-ations to the level of cytogenetic bands The crucialchange from the old diagram is the design of the newcolour scheme based on 17 colours representing largerempirically determined and ontology-based traitcategories The high-level categories shown in thediagram were determined by analysis of the query logsto capture the most popular search terms and assignmentof colours by number of traits across the ontology hier-archy to ensure that colour distribution made the diagramcomprehensible Figure 2 shows that users can visualize allSNPs by a high-level term such as lsquometabolic diseasersquo andlinks are available to Ensembl the Catalog interface andPubMed Data can also be visualized by typing a querysuch as lsquoC-reactive proteinrsquo and using the NCBOautocomplete widget (20) An animation of SNPs andtraits per Catalog release and downloadable images forcommon queries are provided for user download asthese are often used in slide presentations It is possibleto generate a customized diagram pertaining to a categoryof interest eg cancer and identify any zones of particu-larly high density for that trait The nature of thecategories gives an interesting insight into which GWAStraits produce an especially high number of strongSNP-trait associations eg the category lsquoliver enzymemeasurementrsquo has a similar number of traits on thediagram as the much broader category lsquocardiovasculardiseasersquo The visualization and ontology in combinationallows users to access rich genomic information pertainingto a trait by a simple query for trait name eg Alzheimerrsquosdisease using the hierarchy eg all nervous systemdisease and the diagram links to Ensembl for RSIDsallowing access to genomic context and population levelinformation

CONCLUSION

We present the GWAS Catalog supporting infrastructureand new query curatorial and visualization toolsThe mapping of complex phenotypic traits to theExperimental Factor Ontology facilitates cross-studycomparisons within the Catalog as well as the integrationof Catalog data with other resources It also supports thegeneration of a semantic web technology-driven inter-active visualization of the GWAS Catalog data andoffers community access A number of improvements todata holdings extraction processes and data access areplanned based on recruitment of community feedback

A user survey (performed at ASHG in 2012 and byemailing owners of data sets included in the Catalog)provided useful community feedback and allowed priori-tization of future developments including the provisionand visualization of trait information at the SNP levelinclusion of queryable ethnicity information and enhance-ments to the usability of the web interfaces such as de-ployment of the ontology in the form-based web portalTo ensure that the infrastructure and curation processes

D1004 Nucleic Acids Research 2014 Vol 42 Database issue

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

scale in the future we are investigating inclusion of textmining to support curation for example to more easilyidentify ethnicity and population information and use of atriple stores for data storage to improve performance asdata volumes grow

A major challenge for the Catalog is the increasing com-plexity of studies eg traits are more often compoundgene by environment studies are available and in thepast year several gene by gene interaction studies wereincluded in the Catalog Therefore the Catalog

Figure 2 The interactive GWAS diagram is a visualization of all SNP-trait associations with P lt5 108 mapped to the SNPrsquos cytogenetic bandVisualizations of SNPs with metabolic or immune system disease are highlighted Hovering over a SNP displays the trait name clicking on it returnsthe individual SNP-trait associations including P-value and publication as well as links to the relevant entry in the GWAS Catalog Europe PMCand Ensembl

Nucleic Acids Research 2014 Vol 42 Database issue D1005

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

infrastructure must evolve to support these cases improvedata extraction times and address changing technologiesfor GWA studies including the use of next-generationsequencing technology A recent webinar organized bythe Catalog explored these issues with a diverse range ofresearchers and provides background information oncommunity use of and future directions for the Catalog(httptinyurlcomnbxxm2e)Future technical developments include further enhance-

ments to ethnicity extraction including the development ofan ontology to model ethnicity improvements to thediagram to allow richer querying and filtering eg by dateranges publication and a combination of these Thediagram will also support better linking to externalresources (Ensembl dbSNP and PubMed) and moreprecise modelling of trait associations to distinguishbetween the case where a single SNP is associated withmultiple traits (myocardial infarction and high bloodpressure) andthecasewhereaSNPisshowntobeassociatedwith a trait in the context of a condition (eg myocardialinfarction in a cohort of patients with high bloodpressure) We plan to publish widgets that allow views ofthe GWAS diagram to be embedded in other resourcesFinally we would also like to improve the GWAS Catalogsearch interface to support ontology-driven querying andbetter integrate the visualization with the web form-basedinterface by developing a new user interface

ACKNOWLEDGEMENTS

The authors thank Francis Rowland (EMBL-EBI) foradvice on interface development and usability MaryTodd Bergman and Spencer Phillips (EMBL-EBIOutreach) for assistance with figures Fiona Cunningham(EMBL-EBI) and many users of the Catalog for theirfeedback They also thank their colleagues at the NCBIand Ensembl for distribution of data via their resourcesand feedback They thank Darryl Leja for advice on devel-opment of the karyotype visualization and Mike Feolo andMing Xu (NCBI) for informatics support

FUNDING

The National Institutes of Health [3U41-HG006104 toDW and JM-A] and EMBL core funds supporting(to PF HP and TB) EMBL Core Funds and by theHealth Thematic Area of the Cooperation Programme ofthe European Commission within the VII FrameworkProgramme for Research and Technological Developmentsupported (to JM) Grant number [200754] Gen2Phenconducted this work as employees of the NHGRI NIH(by HAJ PH KAH TAM and LAH) Fundingfor open access charge NIH

Conflict of interest statement None declared

REFERENCES

1 ManolioTA (2009) Cohort studies and the genetics of complexdisease Nat Genet 41 5ndash6

2 TroutbeckR Al-QureshiS and GuymerRH (2012) Therapeutictargeting of the complement system in age-related maculardegeneration a review Clin Exp Ophthalmol 40 18ndash26

3 VoightB ScottL SteinthorsdottirV MorrisA DinaCWelchR ZegginiE HuthC AulchenkoY ThorleifssonGet al (2010) Twelve type 2 diabetes susceptibility loci identifiedthrough large-scale association analysis Nat Genet 42 579ndash589

4 SchunkertH KonigIR KathiresanS ReillyMPAssimesTL HolmH PreussM StewartAFR BarbalicMGiegerC et al (2011) Large-scale association analysis identifies13 new susceptibility loci for coronary artery disease CircCardiovasc Genet 43 333ndash338

5 FrankeA McGovernDPB BarrettJC WangK Radford-SmithGL AhmadT LeesCW BalschunT LeeJRobertsR et al (2010) Genome-wide meta-analysis increases to71 the number of confirmed Crohnrsquos disease susceptibility lociNat Genet 42 1118ndash1125

6 SaxenaR VoightBF LyssenkoV BurttNP de BakkerPIWChenH RoixJJ KathiresanS HirschhornJN DalyMJ et al(2007) Genome-wide association analysis identifies loci for type 2diabetes and triglyceride levels Science 316 1331ndash1336

7 MailmanMD FeoloM JinY KimuraM TrykaKBagoutdinovR HaoL KiangA PaschallJ PhanL et al(2007) The NCBI dbGaP database of genotypes and phenotypesNat Genet 39 1181ndash1186

8 HindorffLA SethupathyP JunkinsHA RamosEMMehtaJP CollinsFS and ManolioTA (2009) Potentialetiologic and functional implications of genome-wide associationloci for human diseases and traits Proc Natl Acad Sci USA106 9362ndash9367

9 The World Factbook 2013-14 (2013) Washington DC CentralIntelligence Agency

10 SmithB AshburnerM RosseC BardJ BugW CeustersWGoldbergLJ EilbeckK IrelandA MungallCJ et al (2007)The OBO Foundry coordinated evolution of ontologies tosupport biomedical data integration Nat Biotechnol 251251ndash1255

11 SmithMK WeltyC and McGuinnessDL (2009) Web ontologylanguage In W3C Recommendation

12 RobinsonPN and MundlosS (2010) The human phenotypeontology Clin Genet 77 525ndash534

13 SchrimlLM ArzeC NadendlaS ChangY-WWMazaitisM FelixV FengG and KibbeWA (2012) DiseaseOntology a backbone for disease semantic integration NucleicAcids Res 40 D940ndashD946

14 MaloneJ RaynerTF Zheng BradleyX and ParkinsonH(2008) Developing an application focused experimental factorontology embracing the OBO Community In Proceedingsof the Eleventh Annual Bioontologies Meeting Toronto Canada

15 De MatosP DekkerA EnnisM HastingsJ HaugKTurnerS and SteinbeckC (2010) ChEBI a chemistry ontologyand database J Cheminform 2 P6

16 AshburnerM BallCA BlakeJA BotsteinD ButlerHCherryJM DavisAP DolinskiK DwightSS EppigJTet al (2000) Gene ontology tool for the unification ofbiology The Gene Ontology Consortium Nat Genet 2525ndash29

17 FlicekP AmodeMR BarrellD BealK BrentSCarvalho-SilvaD ClaphamP CoatesG FairleySFitzgeraldS et al (2012) Ensembl 2012 Nucleic Acids Res 40D84ndashD90

18 KarolchikD HinrichsAS and KentWJ (2012) The UCSCGenome Browser Current Protocols in Bioinformatics 40141ndash1433

19 RamosEM HoffmanD JunkinsHA MaglottD PhanLSherryST FeoloM and HindorffLA (2013) Phenotype-Genotype Integrator (PheGenI) synthesizing genome-wideassociation study (GWAS) data with existing genomic resourcesEur J Hum Genet EJHG 22 144ndash147

20 WhetzelPL NoyNF ShahNH AlexanderPR NyulasCTudoracheT and MusenMA (2011) BioPortal enhancedfunctionality via new Web services from the National Center forBiomedical Ontology to access and use ontologies in softwareapplications Nucleic Acids Res 39 W541ndashW545

D1006 Nucleic Acids Research 2014 Vol 42 Database issue

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

Page 3: The NHGRI GWAS Catalog, a curated resource of SNP-trait ... · ‘inflammatory marker measurement’ allows queries for combinations of all inflammatory diseases and all inflammatory

a study exploring the effect of SNP level variation onblood glucose levels may use this as a risk indicator fortype II diabetes but unless this is explicitly stated amapping to type II diabetes would be incorrect

We evaluated several domain ontologies for coverage ofthe GWAS data including Medical Subject Headings(MeSH) the Human Phenotype Ontology (12) theDisease Ontology (13) and the Experimental FactorOntology (EFO) (14) No single ontology addresses allthe representational needs of the GWAS CatalogHowever EFO has branches that can be extended toinclude new terms and allows modelling of tests diseasesanatomy and anthropometry imports terms from availabledomain-specific ontologies and uses axiomatization tomaintain its structure Many MeSH terms were approxi-mately mapped and did not provide the precisionrequired by the curators For example the annotationlsquoC-reactive protein levelsrsquo in the Catalog was mapped toMeSH lsquoD002097 C-reactive proteinrsquo But the implicit infor-mation in the Catalog annotation is that the protein levelsare an indication of inflammatory response Whenmodelled in EFO a new term EFO lsquoEFO_0004458C-reactive protein measurementrsquo with a parent classlsquoinflammatory marker measurementrsquo allows queries forcombinations of all inflammatory diseases and allinflammatory markers and provides a more detailedquery result for the user

EFO is an application ontology that uses multiple refer-ence ontologies such as chemical entities of biologicalinterest (15) to produce a single ontology which can beviewed under one hierarchy and can be applied to visualizea data set Current GWAS traits range from simple diseasedescriptions such as lsquobreast cancerrsquo to measurements suchas lsquowaist-hip ratiorsquo or lsquoliver enzyme levelsrsquo to compound

traits like lsquobody mass in chronic obstructive pulmonarydiseasersquo which are hard to represent in an ontology andtrait definitions are often context-dependent EFO not onlycontains disease categories but also phenotypic descrip-tions compound treatments and so forth Traits notpresent in EFO were added either by importing existingterms from reference ontologies including the GeneOntology (GO) (16) and the Human Phenotype Ontology(12) or creating new concepts locally or by requestingthese from external ontologies A new TermGenietemplate was provided by the GO editors to provide therequired lsquoresponse to drugrsquo terms for the Catalog Theprocess of building the ontology improved both the con-sistency of term assignment for existing studies which werereviewed and the query capability for the CatalogThe availability of high-quality mappings between

GWAS traits and ontology terms facilitates the integrationof the GWAS Catalog with other resources The Catalog isused by Ensembl (17) through direct import of the tab-delimited file of all the Catalog content that can be down-loaded from the Catalog website Ensembl also is in theprocess of mapping phenotypes to EFO so adding theexisting GWAS mappings to Ensembl would immediatelyadd an entirely new layer of connectivity between conceptsthat are annotated with the same EFO terms This is par-ticularly useful at the cross-species mapping level wheremapping genetic and variation data from different speciessuch as mouse and human to common phenotypes providesa new way of identifying related genesA schema ontology was designed to model the key

concepts represented in the Catalog such as trait SNPstudy gene and chromosome and relationships that linkthese concepts eg lsquoSNP part of genersquo or lsquostudy describestrait associationrsquo This schema ontology forms the basis of

Figure 1 Studies traits and SNP-trait associations from 2005ndash2013 reveal the growth in eligible studies

Nucleic Acids Research 2014 Vol 42 Database issue D1003

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

an OWL knowledge base containing all the data in theGWAS Catalog in a format that can be processed by anontology reasoner and queried By representing the GWASCatalog as an OWL knowledge base and reasoning overthe asserted axioms it is possible to make inferences aboutthe SNP-trait associations that are not possible in the rela-tional database version of the Catalog This enables moreexpressive queries at different levels of granularity and theability to detect errors and inconsistencies in the dataWhere previously a comparison between gastric and oe-sophageal cancers would have required either all terms tobe enumerated and searched separately or a high-levelsearch for all cancer-related traits possible queries of theknowledge base range from lsquoFind all SNPs that areassociated with gastric cancerrsquo to lsquoFind all SNPs that areassociated with cancers located in the upper digestive tractrsquoto lsquoFind all SNPs that are associated with cancerrsquoSynonyms are both imported from reference ontologiesand added locally to support spelling variations andplurals The knowledge base currently contains the datathat are publicly available in the Catalog To effectivelyrepresent the ethnicity information extracted by curatorsan ethnicity ontology is currently under developmentFurthermore by extending the schema ontology toinclude additional concepts it would be relatively easy toinclude extra information in a knowledge base such aslinkage disequilibrium information or population-specificinformation in the future Equally other ontologies couldbe imported without much difficulty to allow visualizationof the data using alternative terminologies

DATA ACCESS

GWAS Catalog data can be accessed in three ways (i) Viathe web interface hosted at the NHGRI which providesaccess to data via a search form including traits and studypublication information as well as a tab-delimited file isavailable for download (ii) Via a new dynamic queryinterface of traits visualized on the human karyotypelinked to literature SNPs in Ensembl and ontologyterms in the Experimental Factor Ontology was imple-mented (described later) (iii) GWAS Catalog data areavailable through commonly used data portals such asEnsembl (17) UCSC Genome Browser (18) andPheGenI (19) and many other community toolsThe GWAS diagram shown in Figure 2 is a visualiza-

tion of all SNP-trait associations in the GWAS Catalogwith the P lt 5 108 mapped onto the human karyo-type Each dot represents an association between a SNP(or chromosomal region if multiple SNPs are in proxim-ity) and a disease or trait It has been produced quarterlyby the GWAS Catalog team since 2006 Until mid-2012this was done manually with the help of a medical illus-trator and required a considerable number of personhours each month The result was a static imagedistributed in PDF and PowerPoint formats Each traitassociation on the diagram was represented by a circleof a different colour While this generated an iconicdiagram for the Catalog the last published version ofthe hand-drawn diagram contained in excess of 200

colours To maintain visual consistency in the redevelopeddiagram the new diagram retains the fundamental prin-ciples of the manually drawn version by reusing the samechromosomal ideograms and mapping SNP-trait associ-ations to the level of cytogenetic bands The crucialchange from the old diagram is the design of the newcolour scheme based on 17 colours representing largerempirically determined and ontology-based traitcategories The high-level categories shown in thediagram were determined by analysis of the query logsto capture the most popular search terms and assignmentof colours by number of traits across the ontology hier-archy to ensure that colour distribution made the diagramcomprehensible Figure 2 shows that users can visualize allSNPs by a high-level term such as lsquometabolic diseasersquo andlinks are available to Ensembl the Catalog interface andPubMed Data can also be visualized by typing a querysuch as lsquoC-reactive proteinrsquo and using the NCBOautocomplete widget (20) An animation of SNPs andtraits per Catalog release and downloadable images forcommon queries are provided for user download asthese are often used in slide presentations It is possibleto generate a customized diagram pertaining to a categoryof interest eg cancer and identify any zones of particu-larly high density for that trait The nature of thecategories gives an interesting insight into which GWAStraits produce an especially high number of strongSNP-trait associations eg the category lsquoliver enzymemeasurementrsquo has a similar number of traits on thediagram as the much broader category lsquocardiovasculardiseasersquo The visualization and ontology in combinationallows users to access rich genomic information pertainingto a trait by a simple query for trait name eg Alzheimerrsquosdisease using the hierarchy eg all nervous systemdisease and the diagram links to Ensembl for RSIDsallowing access to genomic context and population levelinformation

CONCLUSION

We present the GWAS Catalog supporting infrastructureand new query curatorial and visualization toolsThe mapping of complex phenotypic traits to theExperimental Factor Ontology facilitates cross-studycomparisons within the Catalog as well as the integrationof Catalog data with other resources It also supports thegeneration of a semantic web technology-driven inter-active visualization of the GWAS Catalog data andoffers community access A number of improvements todata holdings extraction processes and data access areplanned based on recruitment of community feedback

A user survey (performed at ASHG in 2012 and byemailing owners of data sets included in the Catalog)provided useful community feedback and allowed priori-tization of future developments including the provisionand visualization of trait information at the SNP levelinclusion of queryable ethnicity information and enhance-ments to the usability of the web interfaces such as de-ployment of the ontology in the form-based web portalTo ensure that the infrastructure and curation processes

D1004 Nucleic Acids Research 2014 Vol 42 Database issue

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

scale in the future we are investigating inclusion of textmining to support curation for example to more easilyidentify ethnicity and population information and use of atriple stores for data storage to improve performance asdata volumes grow

A major challenge for the Catalog is the increasing com-plexity of studies eg traits are more often compoundgene by environment studies are available and in thepast year several gene by gene interaction studies wereincluded in the Catalog Therefore the Catalog

Figure 2 The interactive GWAS diagram is a visualization of all SNP-trait associations with P lt5 108 mapped to the SNPrsquos cytogenetic bandVisualizations of SNPs with metabolic or immune system disease are highlighted Hovering over a SNP displays the trait name clicking on it returnsthe individual SNP-trait associations including P-value and publication as well as links to the relevant entry in the GWAS Catalog Europe PMCand Ensembl

Nucleic Acids Research 2014 Vol 42 Database issue D1005

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

infrastructure must evolve to support these cases improvedata extraction times and address changing technologiesfor GWA studies including the use of next-generationsequencing technology A recent webinar organized bythe Catalog explored these issues with a diverse range ofresearchers and provides background information oncommunity use of and future directions for the Catalog(httptinyurlcomnbxxm2e)Future technical developments include further enhance-

ments to ethnicity extraction including the development ofan ontology to model ethnicity improvements to thediagram to allow richer querying and filtering eg by dateranges publication and a combination of these Thediagram will also support better linking to externalresources (Ensembl dbSNP and PubMed) and moreprecise modelling of trait associations to distinguishbetween the case where a single SNP is associated withmultiple traits (myocardial infarction and high bloodpressure) andthecasewhereaSNPisshowntobeassociatedwith a trait in the context of a condition (eg myocardialinfarction in a cohort of patients with high bloodpressure) We plan to publish widgets that allow views ofthe GWAS diagram to be embedded in other resourcesFinally we would also like to improve the GWAS Catalogsearch interface to support ontology-driven querying andbetter integrate the visualization with the web form-basedinterface by developing a new user interface

ACKNOWLEDGEMENTS

The authors thank Francis Rowland (EMBL-EBI) foradvice on interface development and usability MaryTodd Bergman and Spencer Phillips (EMBL-EBIOutreach) for assistance with figures Fiona Cunningham(EMBL-EBI) and many users of the Catalog for theirfeedback They also thank their colleagues at the NCBIand Ensembl for distribution of data via their resourcesand feedback They thank Darryl Leja for advice on devel-opment of the karyotype visualization and Mike Feolo andMing Xu (NCBI) for informatics support

FUNDING

The National Institutes of Health [3U41-HG006104 toDW and JM-A] and EMBL core funds supporting(to PF HP and TB) EMBL Core Funds and by theHealth Thematic Area of the Cooperation Programme ofthe European Commission within the VII FrameworkProgramme for Research and Technological Developmentsupported (to JM) Grant number [200754] Gen2Phenconducted this work as employees of the NHGRI NIH(by HAJ PH KAH TAM and LAH) Fundingfor open access charge NIH

Conflict of interest statement None declared

REFERENCES

1 ManolioTA (2009) Cohort studies and the genetics of complexdisease Nat Genet 41 5ndash6

2 TroutbeckR Al-QureshiS and GuymerRH (2012) Therapeutictargeting of the complement system in age-related maculardegeneration a review Clin Exp Ophthalmol 40 18ndash26

3 VoightB ScottL SteinthorsdottirV MorrisA DinaCWelchR ZegginiE HuthC AulchenkoY ThorleifssonGet al (2010) Twelve type 2 diabetes susceptibility loci identifiedthrough large-scale association analysis Nat Genet 42 579ndash589

4 SchunkertH KonigIR KathiresanS ReillyMPAssimesTL HolmH PreussM StewartAFR BarbalicMGiegerC et al (2011) Large-scale association analysis identifies13 new susceptibility loci for coronary artery disease CircCardiovasc Genet 43 333ndash338

5 FrankeA McGovernDPB BarrettJC WangK Radford-SmithGL AhmadT LeesCW BalschunT LeeJRobertsR et al (2010) Genome-wide meta-analysis increases to71 the number of confirmed Crohnrsquos disease susceptibility lociNat Genet 42 1118ndash1125

6 SaxenaR VoightBF LyssenkoV BurttNP de BakkerPIWChenH RoixJJ KathiresanS HirschhornJN DalyMJ et al(2007) Genome-wide association analysis identifies loci for type 2diabetes and triglyceride levels Science 316 1331ndash1336

7 MailmanMD FeoloM JinY KimuraM TrykaKBagoutdinovR HaoL KiangA PaschallJ PhanL et al(2007) The NCBI dbGaP database of genotypes and phenotypesNat Genet 39 1181ndash1186

8 HindorffLA SethupathyP JunkinsHA RamosEMMehtaJP CollinsFS and ManolioTA (2009) Potentialetiologic and functional implications of genome-wide associationloci for human diseases and traits Proc Natl Acad Sci USA106 9362ndash9367

9 The World Factbook 2013-14 (2013) Washington DC CentralIntelligence Agency

10 SmithB AshburnerM RosseC BardJ BugW CeustersWGoldbergLJ EilbeckK IrelandA MungallCJ et al (2007)The OBO Foundry coordinated evolution of ontologies tosupport biomedical data integration Nat Biotechnol 251251ndash1255

11 SmithMK WeltyC and McGuinnessDL (2009) Web ontologylanguage In W3C Recommendation

12 RobinsonPN and MundlosS (2010) The human phenotypeontology Clin Genet 77 525ndash534

13 SchrimlLM ArzeC NadendlaS ChangY-WWMazaitisM FelixV FengG and KibbeWA (2012) DiseaseOntology a backbone for disease semantic integration NucleicAcids Res 40 D940ndashD946

14 MaloneJ RaynerTF Zheng BradleyX and ParkinsonH(2008) Developing an application focused experimental factorontology embracing the OBO Community In Proceedingsof the Eleventh Annual Bioontologies Meeting Toronto Canada

15 De MatosP DekkerA EnnisM HastingsJ HaugKTurnerS and SteinbeckC (2010) ChEBI a chemistry ontologyand database J Cheminform 2 P6

16 AshburnerM BallCA BlakeJA BotsteinD ButlerHCherryJM DavisAP DolinskiK DwightSS EppigJTet al (2000) Gene ontology tool for the unification ofbiology The Gene Ontology Consortium Nat Genet 2525ndash29

17 FlicekP AmodeMR BarrellD BealK BrentSCarvalho-SilvaD ClaphamP CoatesG FairleySFitzgeraldS et al (2012) Ensembl 2012 Nucleic Acids Res 40D84ndashD90

18 KarolchikD HinrichsAS and KentWJ (2012) The UCSCGenome Browser Current Protocols in Bioinformatics 40141ndash1433

19 RamosEM HoffmanD JunkinsHA MaglottD PhanLSherryST FeoloM and HindorffLA (2013) Phenotype-Genotype Integrator (PheGenI) synthesizing genome-wideassociation study (GWAS) data with existing genomic resourcesEur J Hum Genet EJHG 22 144ndash147

20 WhetzelPL NoyNF ShahNH AlexanderPR NyulasCTudoracheT and MusenMA (2011) BioPortal enhancedfunctionality via new Web services from the National Center forBiomedical Ontology to access and use ontologies in softwareapplications Nucleic Acids Res 39 W541ndashW545

D1006 Nucleic Acids Research 2014 Vol 42 Database issue

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

Page 4: The NHGRI GWAS Catalog, a curated resource of SNP-trait ... · ‘inflammatory marker measurement’ allows queries for combinations of all inflammatory diseases and all inflammatory

an OWL knowledge base containing all the data in theGWAS Catalog in a format that can be processed by anontology reasoner and queried By representing the GWASCatalog as an OWL knowledge base and reasoning overthe asserted axioms it is possible to make inferences aboutthe SNP-trait associations that are not possible in the rela-tional database version of the Catalog This enables moreexpressive queries at different levels of granularity and theability to detect errors and inconsistencies in the dataWhere previously a comparison between gastric and oe-sophageal cancers would have required either all terms tobe enumerated and searched separately or a high-levelsearch for all cancer-related traits possible queries of theknowledge base range from lsquoFind all SNPs that areassociated with gastric cancerrsquo to lsquoFind all SNPs that areassociated with cancers located in the upper digestive tractrsquoto lsquoFind all SNPs that are associated with cancerrsquoSynonyms are both imported from reference ontologiesand added locally to support spelling variations andplurals The knowledge base currently contains the datathat are publicly available in the Catalog To effectivelyrepresent the ethnicity information extracted by curatorsan ethnicity ontology is currently under developmentFurthermore by extending the schema ontology toinclude additional concepts it would be relatively easy toinclude extra information in a knowledge base such aslinkage disequilibrium information or population-specificinformation in the future Equally other ontologies couldbe imported without much difficulty to allow visualizationof the data using alternative terminologies

DATA ACCESS

GWAS Catalog data can be accessed in three ways (i) Viathe web interface hosted at the NHGRI which providesaccess to data via a search form including traits and studypublication information as well as a tab-delimited file isavailable for download (ii) Via a new dynamic queryinterface of traits visualized on the human karyotypelinked to literature SNPs in Ensembl and ontologyterms in the Experimental Factor Ontology was imple-mented (described later) (iii) GWAS Catalog data areavailable through commonly used data portals such asEnsembl (17) UCSC Genome Browser (18) andPheGenI (19) and many other community toolsThe GWAS diagram shown in Figure 2 is a visualiza-

tion of all SNP-trait associations in the GWAS Catalogwith the P lt 5 108 mapped onto the human karyo-type Each dot represents an association between a SNP(or chromosomal region if multiple SNPs are in proxim-ity) and a disease or trait It has been produced quarterlyby the GWAS Catalog team since 2006 Until mid-2012this was done manually with the help of a medical illus-trator and required a considerable number of personhours each month The result was a static imagedistributed in PDF and PowerPoint formats Each traitassociation on the diagram was represented by a circleof a different colour While this generated an iconicdiagram for the Catalog the last published version ofthe hand-drawn diagram contained in excess of 200

colours To maintain visual consistency in the redevelopeddiagram the new diagram retains the fundamental prin-ciples of the manually drawn version by reusing the samechromosomal ideograms and mapping SNP-trait associ-ations to the level of cytogenetic bands The crucialchange from the old diagram is the design of the newcolour scheme based on 17 colours representing largerempirically determined and ontology-based traitcategories The high-level categories shown in thediagram were determined by analysis of the query logsto capture the most popular search terms and assignmentof colours by number of traits across the ontology hier-archy to ensure that colour distribution made the diagramcomprehensible Figure 2 shows that users can visualize allSNPs by a high-level term such as lsquometabolic diseasersquo andlinks are available to Ensembl the Catalog interface andPubMed Data can also be visualized by typing a querysuch as lsquoC-reactive proteinrsquo and using the NCBOautocomplete widget (20) An animation of SNPs andtraits per Catalog release and downloadable images forcommon queries are provided for user download asthese are often used in slide presentations It is possibleto generate a customized diagram pertaining to a categoryof interest eg cancer and identify any zones of particu-larly high density for that trait The nature of thecategories gives an interesting insight into which GWAStraits produce an especially high number of strongSNP-trait associations eg the category lsquoliver enzymemeasurementrsquo has a similar number of traits on thediagram as the much broader category lsquocardiovasculardiseasersquo The visualization and ontology in combinationallows users to access rich genomic information pertainingto a trait by a simple query for trait name eg Alzheimerrsquosdisease using the hierarchy eg all nervous systemdisease and the diagram links to Ensembl for RSIDsallowing access to genomic context and population levelinformation

CONCLUSION

We present the GWAS Catalog supporting infrastructureand new query curatorial and visualization toolsThe mapping of complex phenotypic traits to theExperimental Factor Ontology facilitates cross-studycomparisons within the Catalog as well as the integrationof Catalog data with other resources It also supports thegeneration of a semantic web technology-driven inter-active visualization of the GWAS Catalog data andoffers community access A number of improvements todata holdings extraction processes and data access areplanned based on recruitment of community feedback

A user survey (performed at ASHG in 2012 and byemailing owners of data sets included in the Catalog)provided useful community feedback and allowed priori-tization of future developments including the provisionand visualization of trait information at the SNP levelinclusion of queryable ethnicity information and enhance-ments to the usability of the web interfaces such as de-ployment of the ontology in the form-based web portalTo ensure that the infrastructure and curation processes

D1004 Nucleic Acids Research 2014 Vol 42 Database issue

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

scale in the future we are investigating inclusion of textmining to support curation for example to more easilyidentify ethnicity and population information and use of atriple stores for data storage to improve performance asdata volumes grow

A major challenge for the Catalog is the increasing com-plexity of studies eg traits are more often compoundgene by environment studies are available and in thepast year several gene by gene interaction studies wereincluded in the Catalog Therefore the Catalog

Figure 2 The interactive GWAS diagram is a visualization of all SNP-trait associations with P lt5 108 mapped to the SNPrsquos cytogenetic bandVisualizations of SNPs with metabolic or immune system disease are highlighted Hovering over a SNP displays the trait name clicking on it returnsthe individual SNP-trait associations including P-value and publication as well as links to the relevant entry in the GWAS Catalog Europe PMCand Ensembl

Nucleic Acids Research 2014 Vol 42 Database issue D1005

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

infrastructure must evolve to support these cases improvedata extraction times and address changing technologiesfor GWA studies including the use of next-generationsequencing technology A recent webinar organized bythe Catalog explored these issues with a diverse range ofresearchers and provides background information oncommunity use of and future directions for the Catalog(httptinyurlcomnbxxm2e)Future technical developments include further enhance-

ments to ethnicity extraction including the development ofan ontology to model ethnicity improvements to thediagram to allow richer querying and filtering eg by dateranges publication and a combination of these Thediagram will also support better linking to externalresources (Ensembl dbSNP and PubMed) and moreprecise modelling of trait associations to distinguishbetween the case where a single SNP is associated withmultiple traits (myocardial infarction and high bloodpressure) andthecasewhereaSNPisshowntobeassociatedwith a trait in the context of a condition (eg myocardialinfarction in a cohort of patients with high bloodpressure) We plan to publish widgets that allow views ofthe GWAS diagram to be embedded in other resourcesFinally we would also like to improve the GWAS Catalogsearch interface to support ontology-driven querying andbetter integrate the visualization with the web form-basedinterface by developing a new user interface

ACKNOWLEDGEMENTS

The authors thank Francis Rowland (EMBL-EBI) foradvice on interface development and usability MaryTodd Bergman and Spencer Phillips (EMBL-EBIOutreach) for assistance with figures Fiona Cunningham(EMBL-EBI) and many users of the Catalog for theirfeedback They also thank their colleagues at the NCBIand Ensembl for distribution of data via their resourcesand feedback They thank Darryl Leja for advice on devel-opment of the karyotype visualization and Mike Feolo andMing Xu (NCBI) for informatics support

FUNDING

The National Institutes of Health [3U41-HG006104 toDW and JM-A] and EMBL core funds supporting(to PF HP and TB) EMBL Core Funds and by theHealth Thematic Area of the Cooperation Programme ofthe European Commission within the VII FrameworkProgramme for Research and Technological Developmentsupported (to JM) Grant number [200754] Gen2Phenconducted this work as employees of the NHGRI NIH(by HAJ PH KAH TAM and LAH) Fundingfor open access charge NIH

Conflict of interest statement None declared

REFERENCES

1 ManolioTA (2009) Cohort studies and the genetics of complexdisease Nat Genet 41 5ndash6

2 TroutbeckR Al-QureshiS and GuymerRH (2012) Therapeutictargeting of the complement system in age-related maculardegeneration a review Clin Exp Ophthalmol 40 18ndash26

3 VoightB ScottL SteinthorsdottirV MorrisA DinaCWelchR ZegginiE HuthC AulchenkoY ThorleifssonGet al (2010) Twelve type 2 diabetes susceptibility loci identifiedthrough large-scale association analysis Nat Genet 42 579ndash589

4 SchunkertH KonigIR KathiresanS ReillyMPAssimesTL HolmH PreussM StewartAFR BarbalicMGiegerC et al (2011) Large-scale association analysis identifies13 new susceptibility loci for coronary artery disease CircCardiovasc Genet 43 333ndash338

5 FrankeA McGovernDPB BarrettJC WangK Radford-SmithGL AhmadT LeesCW BalschunT LeeJRobertsR et al (2010) Genome-wide meta-analysis increases to71 the number of confirmed Crohnrsquos disease susceptibility lociNat Genet 42 1118ndash1125

6 SaxenaR VoightBF LyssenkoV BurttNP de BakkerPIWChenH RoixJJ KathiresanS HirschhornJN DalyMJ et al(2007) Genome-wide association analysis identifies loci for type 2diabetes and triglyceride levels Science 316 1331ndash1336

7 MailmanMD FeoloM JinY KimuraM TrykaKBagoutdinovR HaoL KiangA PaschallJ PhanL et al(2007) The NCBI dbGaP database of genotypes and phenotypesNat Genet 39 1181ndash1186

8 HindorffLA SethupathyP JunkinsHA RamosEMMehtaJP CollinsFS and ManolioTA (2009) Potentialetiologic and functional implications of genome-wide associationloci for human diseases and traits Proc Natl Acad Sci USA106 9362ndash9367

9 The World Factbook 2013-14 (2013) Washington DC CentralIntelligence Agency

10 SmithB AshburnerM RosseC BardJ BugW CeustersWGoldbergLJ EilbeckK IrelandA MungallCJ et al (2007)The OBO Foundry coordinated evolution of ontologies tosupport biomedical data integration Nat Biotechnol 251251ndash1255

11 SmithMK WeltyC and McGuinnessDL (2009) Web ontologylanguage In W3C Recommendation

12 RobinsonPN and MundlosS (2010) The human phenotypeontology Clin Genet 77 525ndash534

13 SchrimlLM ArzeC NadendlaS ChangY-WWMazaitisM FelixV FengG and KibbeWA (2012) DiseaseOntology a backbone for disease semantic integration NucleicAcids Res 40 D940ndashD946

14 MaloneJ RaynerTF Zheng BradleyX and ParkinsonH(2008) Developing an application focused experimental factorontology embracing the OBO Community In Proceedingsof the Eleventh Annual Bioontologies Meeting Toronto Canada

15 De MatosP DekkerA EnnisM HastingsJ HaugKTurnerS and SteinbeckC (2010) ChEBI a chemistry ontologyand database J Cheminform 2 P6

16 AshburnerM BallCA BlakeJA BotsteinD ButlerHCherryJM DavisAP DolinskiK DwightSS EppigJTet al (2000) Gene ontology tool for the unification ofbiology The Gene Ontology Consortium Nat Genet 2525ndash29

17 FlicekP AmodeMR BarrellD BealK BrentSCarvalho-SilvaD ClaphamP CoatesG FairleySFitzgeraldS et al (2012) Ensembl 2012 Nucleic Acids Res 40D84ndashD90

18 KarolchikD HinrichsAS and KentWJ (2012) The UCSCGenome Browser Current Protocols in Bioinformatics 40141ndash1433

19 RamosEM HoffmanD JunkinsHA MaglottD PhanLSherryST FeoloM and HindorffLA (2013) Phenotype-Genotype Integrator (PheGenI) synthesizing genome-wideassociation study (GWAS) data with existing genomic resourcesEur J Hum Genet EJHG 22 144ndash147

20 WhetzelPL NoyNF ShahNH AlexanderPR NyulasCTudoracheT and MusenMA (2011) BioPortal enhancedfunctionality via new Web services from the National Center forBiomedical Ontology to access and use ontologies in softwareapplications Nucleic Acids Res 39 W541ndashW545

D1006 Nucleic Acids Research 2014 Vol 42 Database issue

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

Page 5: The NHGRI GWAS Catalog, a curated resource of SNP-trait ... · ‘inflammatory marker measurement’ allows queries for combinations of all inflammatory diseases and all inflammatory

scale in the future we are investigating inclusion of textmining to support curation for example to more easilyidentify ethnicity and population information and use of atriple stores for data storage to improve performance asdata volumes grow

A major challenge for the Catalog is the increasing com-plexity of studies eg traits are more often compoundgene by environment studies are available and in thepast year several gene by gene interaction studies wereincluded in the Catalog Therefore the Catalog

Figure 2 The interactive GWAS diagram is a visualization of all SNP-trait associations with P lt5 108 mapped to the SNPrsquos cytogenetic bandVisualizations of SNPs with metabolic or immune system disease are highlighted Hovering over a SNP displays the trait name clicking on it returnsthe individual SNP-trait associations including P-value and publication as well as links to the relevant entry in the GWAS Catalog Europe PMCand Ensembl

Nucleic Acids Research 2014 Vol 42 Database issue D1005

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

infrastructure must evolve to support these cases improvedata extraction times and address changing technologiesfor GWA studies including the use of next-generationsequencing technology A recent webinar organized bythe Catalog explored these issues with a diverse range ofresearchers and provides background information oncommunity use of and future directions for the Catalog(httptinyurlcomnbxxm2e)Future technical developments include further enhance-

ments to ethnicity extraction including the development ofan ontology to model ethnicity improvements to thediagram to allow richer querying and filtering eg by dateranges publication and a combination of these Thediagram will also support better linking to externalresources (Ensembl dbSNP and PubMed) and moreprecise modelling of trait associations to distinguishbetween the case where a single SNP is associated withmultiple traits (myocardial infarction and high bloodpressure) andthecasewhereaSNPisshowntobeassociatedwith a trait in the context of a condition (eg myocardialinfarction in a cohort of patients with high bloodpressure) We plan to publish widgets that allow views ofthe GWAS diagram to be embedded in other resourcesFinally we would also like to improve the GWAS Catalogsearch interface to support ontology-driven querying andbetter integrate the visualization with the web form-basedinterface by developing a new user interface

ACKNOWLEDGEMENTS

The authors thank Francis Rowland (EMBL-EBI) foradvice on interface development and usability MaryTodd Bergman and Spencer Phillips (EMBL-EBIOutreach) for assistance with figures Fiona Cunningham(EMBL-EBI) and many users of the Catalog for theirfeedback They also thank their colleagues at the NCBIand Ensembl for distribution of data via their resourcesand feedback They thank Darryl Leja for advice on devel-opment of the karyotype visualization and Mike Feolo andMing Xu (NCBI) for informatics support

FUNDING

The National Institutes of Health [3U41-HG006104 toDW and JM-A] and EMBL core funds supporting(to PF HP and TB) EMBL Core Funds and by theHealth Thematic Area of the Cooperation Programme ofthe European Commission within the VII FrameworkProgramme for Research and Technological Developmentsupported (to JM) Grant number [200754] Gen2Phenconducted this work as employees of the NHGRI NIH(by HAJ PH KAH TAM and LAH) Fundingfor open access charge NIH

Conflict of interest statement None declared

REFERENCES

1 ManolioTA (2009) Cohort studies and the genetics of complexdisease Nat Genet 41 5ndash6

2 TroutbeckR Al-QureshiS and GuymerRH (2012) Therapeutictargeting of the complement system in age-related maculardegeneration a review Clin Exp Ophthalmol 40 18ndash26

3 VoightB ScottL SteinthorsdottirV MorrisA DinaCWelchR ZegginiE HuthC AulchenkoY ThorleifssonGet al (2010) Twelve type 2 diabetes susceptibility loci identifiedthrough large-scale association analysis Nat Genet 42 579ndash589

4 SchunkertH KonigIR KathiresanS ReillyMPAssimesTL HolmH PreussM StewartAFR BarbalicMGiegerC et al (2011) Large-scale association analysis identifies13 new susceptibility loci for coronary artery disease CircCardiovasc Genet 43 333ndash338

5 FrankeA McGovernDPB BarrettJC WangK Radford-SmithGL AhmadT LeesCW BalschunT LeeJRobertsR et al (2010) Genome-wide meta-analysis increases to71 the number of confirmed Crohnrsquos disease susceptibility lociNat Genet 42 1118ndash1125

6 SaxenaR VoightBF LyssenkoV BurttNP de BakkerPIWChenH RoixJJ KathiresanS HirschhornJN DalyMJ et al(2007) Genome-wide association analysis identifies loci for type 2diabetes and triglyceride levels Science 316 1331ndash1336

7 MailmanMD FeoloM JinY KimuraM TrykaKBagoutdinovR HaoL KiangA PaschallJ PhanL et al(2007) The NCBI dbGaP database of genotypes and phenotypesNat Genet 39 1181ndash1186

8 HindorffLA SethupathyP JunkinsHA RamosEMMehtaJP CollinsFS and ManolioTA (2009) Potentialetiologic and functional implications of genome-wide associationloci for human diseases and traits Proc Natl Acad Sci USA106 9362ndash9367

9 The World Factbook 2013-14 (2013) Washington DC CentralIntelligence Agency

10 SmithB AshburnerM RosseC BardJ BugW CeustersWGoldbergLJ EilbeckK IrelandA MungallCJ et al (2007)The OBO Foundry coordinated evolution of ontologies tosupport biomedical data integration Nat Biotechnol 251251ndash1255

11 SmithMK WeltyC and McGuinnessDL (2009) Web ontologylanguage In W3C Recommendation

12 RobinsonPN and MundlosS (2010) The human phenotypeontology Clin Genet 77 525ndash534

13 SchrimlLM ArzeC NadendlaS ChangY-WWMazaitisM FelixV FengG and KibbeWA (2012) DiseaseOntology a backbone for disease semantic integration NucleicAcids Res 40 D940ndashD946

14 MaloneJ RaynerTF Zheng BradleyX and ParkinsonH(2008) Developing an application focused experimental factorontology embracing the OBO Community In Proceedingsof the Eleventh Annual Bioontologies Meeting Toronto Canada

15 De MatosP DekkerA EnnisM HastingsJ HaugKTurnerS and SteinbeckC (2010) ChEBI a chemistry ontologyand database J Cheminform 2 P6

16 AshburnerM BallCA BlakeJA BotsteinD ButlerHCherryJM DavisAP DolinskiK DwightSS EppigJTet al (2000) Gene ontology tool for the unification ofbiology The Gene Ontology Consortium Nat Genet 2525ndash29

17 FlicekP AmodeMR BarrellD BealK BrentSCarvalho-SilvaD ClaphamP CoatesG FairleySFitzgeraldS et al (2012) Ensembl 2012 Nucleic Acids Res 40D84ndashD90

18 KarolchikD HinrichsAS and KentWJ (2012) The UCSCGenome Browser Current Protocols in Bioinformatics 40141ndash1433

19 RamosEM HoffmanD JunkinsHA MaglottD PhanLSherryST FeoloM and HindorffLA (2013) Phenotype-Genotype Integrator (PheGenI) synthesizing genome-wideassociation study (GWAS) data with existing genomic resourcesEur J Hum Genet EJHG 22 144ndash147

20 WhetzelPL NoyNF ShahNH AlexanderPR NyulasCTudoracheT and MusenMA (2011) BioPortal enhancedfunctionality via new Web services from the National Center forBiomedical Ontology to access and use ontologies in softwareapplications Nucleic Acids Res 39 W541ndashW545

D1006 Nucleic Acids Research 2014 Vol 42 Database issue

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019

Page 6: The NHGRI GWAS Catalog, a curated resource of SNP-trait ... · ‘inflammatory marker measurement’ allows queries for combinations of all inflammatory diseases and all inflammatory

infrastructure must evolve to support these cases improvedata extraction times and address changing technologiesfor GWA studies including the use of next-generationsequencing technology A recent webinar organized bythe Catalog explored these issues with a diverse range ofresearchers and provides background information oncommunity use of and future directions for the Catalog(httptinyurlcomnbxxm2e)Future technical developments include further enhance-

ments to ethnicity extraction including the development ofan ontology to model ethnicity improvements to thediagram to allow richer querying and filtering eg by dateranges publication and a combination of these Thediagram will also support better linking to externalresources (Ensembl dbSNP and PubMed) and moreprecise modelling of trait associations to distinguishbetween the case where a single SNP is associated withmultiple traits (myocardial infarction and high bloodpressure) andthecasewhereaSNPisshowntobeassociatedwith a trait in the context of a condition (eg myocardialinfarction in a cohort of patients with high bloodpressure) We plan to publish widgets that allow views ofthe GWAS diagram to be embedded in other resourcesFinally we would also like to improve the GWAS Catalogsearch interface to support ontology-driven querying andbetter integrate the visualization with the web form-basedinterface by developing a new user interface

ACKNOWLEDGEMENTS

The authors thank Francis Rowland (EMBL-EBI) foradvice on interface development and usability MaryTodd Bergman and Spencer Phillips (EMBL-EBIOutreach) for assistance with figures Fiona Cunningham(EMBL-EBI) and many users of the Catalog for theirfeedback They also thank their colleagues at the NCBIand Ensembl for distribution of data via their resourcesand feedback They thank Darryl Leja for advice on devel-opment of the karyotype visualization and Mike Feolo andMing Xu (NCBI) for informatics support

FUNDING

The National Institutes of Health [3U41-HG006104 toDW and JM-A] and EMBL core funds supporting(to PF HP and TB) EMBL Core Funds and by theHealth Thematic Area of the Cooperation Programme ofthe European Commission within the VII FrameworkProgramme for Research and Technological Developmentsupported (to JM) Grant number [200754] Gen2Phenconducted this work as employees of the NHGRI NIH(by HAJ PH KAH TAM and LAH) Fundingfor open access charge NIH

Conflict of interest statement None declared

REFERENCES

1 ManolioTA (2009) Cohort studies and the genetics of complexdisease Nat Genet 41 5ndash6

2 TroutbeckR Al-QureshiS and GuymerRH (2012) Therapeutictargeting of the complement system in age-related maculardegeneration a review Clin Exp Ophthalmol 40 18ndash26

3 VoightB ScottL SteinthorsdottirV MorrisA DinaCWelchR ZegginiE HuthC AulchenkoY ThorleifssonGet al (2010) Twelve type 2 diabetes susceptibility loci identifiedthrough large-scale association analysis Nat Genet 42 579ndash589

4 SchunkertH KonigIR KathiresanS ReillyMPAssimesTL HolmH PreussM StewartAFR BarbalicMGiegerC et al (2011) Large-scale association analysis identifies13 new susceptibility loci for coronary artery disease CircCardiovasc Genet 43 333ndash338

5 FrankeA McGovernDPB BarrettJC WangK Radford-SmithGL AhmadT LeesCW BalschunT LeeJRobertsR et al (2010) Genome-wide meta-analysis increases to71 the number of confirmed Crohnrsquos disease susceptibility lociNat Genet 42 1118ndash1125

6 SaxenaR VoightBF LyssenkoV BurttNP de BakkerPIWChenH RoixJJ KathiresanS HirschhornJN DalyMJ et al(2007) Genome-wide association analysis identifies loci for type 2diabetes and triglyceride levels Science 316 1331ndash1336

7 MailmanMD FeoloM JinY KimuraM TrykaKBagoutdinovR HaoL KiangA PaschallJ PhanL et al(2007) The NCBI dbGaP database of genotypes and phenotypesNat Genet 39 1181ndash1186

8 HindorffLA SethupathyP JunkinsHA RamosEMMehtaJP CollinsFS and ManolioTA (2009) Potentialetiologic and functional implications of genome-wide associationloci for human diseases and traits Proc Natl Acad Sci USA106 9362ndash9367

9 The World Factbook 2013-14 (2013) Washington DC CentralIntelligence Agency

10 SmithB AshburnerM RosseC BardJ BugW CeustersWGoldbergLJ EilbeckK IrelandA MungallCJ et al (2007)The OBO Foundry coordinated evolution of ontologies tosupport biomedical data integration Nat Biotechnol 251251ndash1255

11 SmithMK WeltyC and McGuinnessDL (2009) Web ontologylanguage In W3C Recommendation

12 RobinsonPN and MundlosS (2010) The human phenotypeontology Clin Genet 77 525ndash534

13 SchrimlLM ArzeC NadendlaS ChangY-WWMazaitisM FelixV FengG and KibbeWA (2012) DiseaseOntology a backbone for disease semantic integration NucleicAcids Res 40 D940ndashD946

14 MaloneJ RaynerTF Zheng BradleyX and ParkinsonH(2008) Developing an application focused experimental factorontology embracing the OBO Community In Proceedingsof the Eleventh Annual Bioontologies Meeting Toronto Canada

15 De MatosP DekkerA EnnisM HastingsJ HaugKTurnerS and SteinbeckC (2010) ChEBI a chemistry ontologyand database J Cheminform 2 P6

16 AshburnerM BallCA BlakeJA BotsteinD ButlerHCherryJM DavisAP DolinskiK DwightSS EppigJTet al (2000) Gene ontology tool for the unification ofbiology The Gene Ontology Consortium Nat Genet 2525ndash29

17 FlicekP AmodeMR BarrellD BealK BrentSCarvalho-SilvaD ClaphamP CoatesG FairleySFitzgeraldS et al (2012) Ensembl 2012 Nucleic Acids Res 40D84ndashD90

18 KarolchikD HinrichsAS and KentWJ (2012) The UCSCGenome Browser Current Protocols in Bioinformatics 40141ndash1433

19 RamosEM HoffmanD JunkinsHA MaglottD PhanLSherryST FeoloM and HindorffLA (2013) Phenotype-Genotype Integrator (PheGenI) synthesizing genome-wideassociation study (GWAS) data with existing genomic resourcesEur J Hum Genet EJHG 22 144ndash147

20 WhetzelPL NoyNF ShahNH AlexanderPR NyulasCTudoracheT and MusenMA (2011) BioPortal enhancedfunctionality via new Web services from the National Center forBiomedical Ontology to access and use ontologies in softwareapplications Nucleic Acids Res 39 W541ndashW545

D1006 Nucleic Acids Research 2014 Vol 42 Database issue

Dow

nloaded from httpsacadem

icoupcomnararticle-abstract42D

1D10011062755 by U

niversity of Connecticut user on 29 August 2019