UniProt to MeSH mapping proteins to disease terminologies Yum L. Yip, Anaïs Mottaz, Patrick Ruch, Anne-Lise Veuthey ISMB/ECCB 2007 Bio-Ontologies Vienna, July 20

TRANSCRIPT

Page 1

UniProt to MeSH

mapping proteins to disease terminologies

Yum L. Yip, Anaïs Mottaz, Patrick Ruch,

Anne-Lise Veuthey

ISMB/ECCB 2007 – Bio-Ontologies – Vienna, July 20

Page 2

The role of bioinformatics in biomedical research and future clinical patient care

Health problem in a patient

Bioinformatics:
- Data storage and representation
- Large-scale data generation
- Large-scale data analysis

Basic research:
- What is the mechanism?
- Epidemiological studies

Basic research results stored in databases

Up-to-date knowledge and large-scale results:
- Research direction
- New hypothesis

Drug development, clinical trials

Clinical patient care: doctor prescribes an individualized treatment plan.

Molecular-level decision-support tools:
- Structured knowledge representations
- ‘Filtered’ information on fundamental biological mechanisms and significant

Treatment outcome

Page 3

Biomedical knowledge: a protein-centric view

Disease: pathology, diagnosis/prognosis, treatment, risk factor

Biological processes: biological pathway/network, protein-protein interaction

Proteins: sequence, function, structure, modifications

Genes: sequence, chromosomal location, regulation, expression

Page 4

Biomedical knowledge: a protein-centric view

Proteins: sequence, function, structure, modifications
- High-quality manual annotation: protein name, sequence, function, domain, features and references
- 16,702 human proteins

Disease: pathology, diagnosis/prognosis, treatment, risk factor
Disease annotation:
- Link to 12,603 OMIM entries
- Link to other specialized databases
- 32,921 variants (or polymorphisms)
- >3,000 associated diseases

Biological processes: biological pathway/network, protein-protein interaction
Biological process/proteomic:
- Pathway annotation
- Protein-protein interaction (DIP, IntAct)
- Protein 2D gel (SWISS-2DPAGE)

References:
- Links to >100 other databases
- Over 82,420 journal references

Genes: sequence, chromosomal location, regulation, expression
Genomic data:
- Genew, GeneCards, GenAtlas
- Expression data (e.g. CleanEx)
- Genome details: Ensembl

Page 5

Objective

Increase the accessibility of molecular biology resources to clinical researchers by indexing UniProtKB/Swiss-Prot with the MeSH terminology.

Page 6

Why UniProtKB/Swiss-Prot?

• Most comprehensive warehouse of protein sequences, with a high level of annotation and highly cross-linked with other biological databases.

• Includes data on more than 30,000 variants, mostly c-SNPs (coding SNPs) or SAPs (Single Amino-acid Polymorphisms).

• More than 3,000 diseases associated with a protein are also described (mostly genetic diseases associated with SAPs).

http://beta.uniprot.org/

Page 7

Disease annotation

UniProtKB/Swiss-Prot entry P35240

Page 8

Why MeSH?

Controlled vocabulary thesaurus structured in a hierarchy of concepts

Each concept includes a set of terms: synonyms and lexical variants

MeSH is part of the UMLS, and, thus, linked to other medical terminologies

MeSH is used to index the biomedical literature
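
As a rough illustration of the points above, the descriptor / concept / term layering can be modelled with a few small classes. This is a minimal sketch, not the official MeSH data model: the class names, the `build_term_index` helper and the sample terms are assumptions made for this example and are not copied from MeSH.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str
    # Synonyms and lexical variants that all denote this concept.
    terms: list[str] = field(default_factory=list)


@dataclass
class Descriptor:
    name: str
    # Positions of the descriptor in the hierarchy (left empty here).
    tree_numbers: list[str] = field(default_factory=list)
    concepts: list[Concept] = field(default_factory=list)

    def all_terms(self) -> list[str]:
        """Every term of every concept grouped under this descriptor."""
        return [t for c in self.concepts for t in c.terms]


def build_term_index(descriptors: list[Descriptor]) -> dict[str, str]:
    """Map each lower-cased term to the name of its descriptor."""
    return {t.lower(): d.name for d in descriptors for t in d.all_terms()}


# Illustrative entry only: the term list is invented for the example.
nf2 = Descriptor(
    name="Neurofibromatosis 2",
    concepts=[Concept("Neurofibromatosis 2",
                      ["Neurofibromatosis 2", "Neurofibromatosis Type 2"])],
)
TERM_INDEX = build_term_index([nf2])
print(TERM_INDEX["neurofibromatosis type 2"])
```

Indexing by term rather than by descriptor is what lets a disease name be matched against any synonym while still resolving to a single descriptor.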

Page 9

The structure of MeSH

Page 10

Mapping procedure

- UniProtKB/Swiss-Prot entry → disease comment line → extracted disease name → exact match / partial match → MeSH
- OMIM → title/alternative titles → exact match / partial match → MeSH

Same descriptor
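
The flow on this slide can be sketched in a few lines of Python. The sketch assumes that an exact match is tried before a partial match and that the first OMIM title that maps is kept; neither detail is stated on the slide. The functions `map_name` and `map_entry`, the toy lookup table and the sample OMIM titles are hypothetical.

```python
from typing import Callable, Iterable, Optional

Matcher = Callable[[str], Optional[str]]


def map_name(name: str, exact: Matcher, partial: Matcher) -> Optional[str]:
    """Return a MeSH descriptor: exact match first, partial match as fallback."""
    return exact(name) or partial(name)


def map_entry(sp_name: Optional[str], omim_titles: Iterable[str],
              exact: Matcher, partial: Matcher) -> dict:
    """Map the Swiss-Prot disease name and the OMIM titles independently."""
    sp = map_name(sp_name, exact, partial) if sp_name else None
    omim = next(
        (d for d in (map_name(t, exact, partial) for t in omim_titles) if d),
        None,
    )
    # "Same descriptor": both routes agree on one MeSH descriptor.
    return {"SP": sp, "OMIM": omim,
            "same_descriptor": sp is not None and sp == omim}


# Toy matchers standing in for the real ones (hypothetical data).
TERM_TO_DESCRIPTOR = {"neurofibromatosis 2": "Neurofibromatosis 2"}


def exact(name: str) -> Optional[str]:
    # Case-insensitive lookup with whitespace normalised.
    return TERM_TO_DESCRIPTOR.get(" ".join(name.lower().split()))


def partial(name: str) -> Optional[str]:
    return None  # placeholder: no partial matching in this sketch


result = map_entry("neurofibromatosis 2",
                   ["NEUROFIBROMATOSIS, TYPE II", "Neurofibromatosis 2"],
                   exact, partial)
print(result)
```

Keeping the two routes separate makes it possible to report results for SP alone, OMIM alone, and for the cases where both agree.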

Page 11

Disease extraction

Extraction using regular expressions: ‘are the cause of’, ‘involved in’, etc.

MeSH: ‘Neurofibromatosis 2’
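
A minimal sketch of this kind of extraction is shown below. Only the two trigger phrases come from the slide; the rest of the pattern (where a disease name ends) is an assumption, and the sample comment lines are written for illustration in the style of a Swiss-Prot disease comment rather than copied from an entry.

```python
from __future__ import annotations

import re

# Trigger phrases quoted on the slide.
TRIGGERS = r"(?:(?:is|are) the cause of|involved in)"

# ASSUMPTION: the disease name runs from the trigger phrase up to the first
# opening parenthesis, "[MIM" cross-reference, semicolon, full stop or end.
DISEASE_RE = re.compile(
    TRIGGERS + r"\s+(?P<name>.+?)(?=\s*\(|\s*\[MIM|[;.]|$)",
    re.IGNORECASE,
)


def extract_disease_names(comment: str) -> list[str]:
    """Return the lower-cased disease names found in one comment line."""
    return [m.group("name").strip().lower() for m in DISEASE_RE.finditer(comment)]


example_1 = ("Defects in NF2 are the cause of neurofibromatosis 2 (NF2) "
             "[MIM:101000]; also known as central neurofibromatosis.")
example_2 = "May be involved in hematopoietic tumors such as B-cell lymphomas."

print(extract_disease_names(example_1))
print(extract_disease_names(example_2))
```

The second example reproduces the over-long name ‘hematopoietic tumors such as b-cell lymphomas’, which shows how a purely pattern-based extraction can pull in more than the disease name itself.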

Page 12

Term matching procedure

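• Exact matches: same length, same word order, case insensitive

• Partial matches: calculation of a similarity score S between terms, based on the inverse document frequency (IDF) used in information retrieval. Recoverable components of the formula (the operators combining them are lost in this transcript):

  $\sum_{cw} \log\left(\frac{1}{\mathrm{freq}(cw)}\right)$, $\sum_{ncw} \log\left(\frac{1}{\mathrm{freq}(ncw)}\right)$, $\mathrm{size}(\mathrm{disease})$

The term with the highest score was chosen.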
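
The sketch below shows one way an IDF-style partial-match score could be computed. It is not the authors' formula, which is not legible in this transcript. It assumes that cw and ncw stand for the common and non-common words of the two terms, that freq(w) is the fraction of MeSH terms containing w, and that the score is the IDF weight of the common words minus that of the non-common words, divided by the number of words in the disease name. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(term):
    """Lower-case a term and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", term.lower())


def word_freq(mesh_terms):
    """freq(w): fraction of the given terms that contain the word w."""
    counts = Counter(w for t in mesh_terms for w in set(tokens(t)))
    return {w: c / len(mesh_terms) for w, c in counts.items()}


def idf(word, freq):
    # log(1 / freq(w)). Words never seen in the vocabulary get the rarest
    # observed frequency (an assumption; the slide does not cover this case).
    return math.log(1.0 / freq.get(word, min(freq.values())))


def score(disease, mesh_term, freq):
    d, m = set(tokens(disease)), set(tokens(mesh_term))
    cw, ncw = d & m, d ^ m  # common words / non-common words
    if not cw:
        return float("-inf")
    # ASSUMED combination of the three quantities named on the slide.
    return (sum(idf(w, freq) for w in cw)
            - sum(idf(w, freq) for w in ncw)) / len(d)


def best_partial_match(disease, mesh_terms, freq, threshold=float("-inf")):
    """Return the highest-scoring term, or None if nothing qualifies."""
    best = max(mesh_terms, key=lambda t: score(disease, t, freq))
    s = score(disease, best, freq)
    return best if s != float("-inf") and s >= threshold else None


# Tiny vocabulary built from descriptor names that appear on the slides.
MESH_TERMS = ["Epidermolysis Bullosa Simplex", "Epidermolysis Bullosa Dystrophica",
              "Neurofibromatosis 2", "Hematologic Neoplasms"]
FREQ = word_freq(MESH_TERMS)

print(best_partial_match("epidermolysis bullosa simplex, Weber-Cockayne type",
                         MESH_TERMS, FREQ))
```

The `threshold` parameter is where a cut-off tuned on a benchmark would be plugged in; with the default value every disease name that shares at least one word with a term gets mapped.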

Page 13

Benchmark

• Used to evaluate the procedure in terms of recall and precision

• Used to set up a score threshold

[Figure: precision (y-axis, 40% to 100%) versus recall (x-axis, 0% to 60%), with one curve for SP and one for OMIM]

92 disease names from 43 Swiss-Prot entries manually mapped to MeSH terms
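
To make the threshold-setting step concrete, here is a small sketch that sweeps candidate thresholds over scored benchmark mappings and keeps the one with the best recall among those that reach a target precision. The selection rule, the function names and the toy scores are assumptions for illustration and are not taken from the benchmark.

```python
def sweep(scored, total, thresholds):
    """For each threshold return (threshold, recall, precision).

    scored: list of (score, is_correct) pairs, one per proposed mapping.
    total:  number of disease names in the benchmark.
    """
    rows = []
    for th in thresholds:
        kept = [ok for s, ok in scored if s >= th]
        correct = sum(kept)
        precision = correct / len(kept) if kept else 1.0
        rows.append((th, correct / total, precision))
    return rows


def pick_threshold(rows, min_precision):
    """Threshold with the best recall among those meeting min_precision."""
    ok = [r for r in rows if r[2] >= min_precision]
    return max(ok, key=lambda r: r[1])[0] if ok else None


# Made-up scores for illustration only.
toy = [(0.9, True), (0.8, True), (0.6, False), (0.5, True), (0.2, False)]
rows = sweep(toy, total=10, thresholds=[0.1, 0.4, 0.7])
chosen = pick_threshold(rows, min_precision=0.7)
print(rows, chosen)
```

Raising the threshold trades recall for precision, which is the trade-off the precision/recall curve on this slide visualises.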

Page 14

Results on the Benchmark

92 disease comment lines (82 with OMIM)

            Exact match                      Partial match                    Total
            Retrieval  Recall     Precision  Retrieval  Recall     Precision  Retrieval  Recall     Precision
SP          16 (17%)   16 (17%)   100%       20 (22%)   16 (17%)   80%        36 (39%)   32 (35%)   89%
OMIM        21 (23%)   21 (23%)   100%       21 (23%)   19 (21%)   90%        42 (46%)   40 (43%)   95%
SP ∩ OMIM   10 (11%)   10 (11%)   100%       8 (9%)     8 (9%)     100%       18 (20%)   18 (20%)   100%
SP ∪ OMIM   27 (29%)   27 (29%)   100%       23 (25%)   19 (21%)   83%        50 (54%)   46 (50%)   92%
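
The three measures in the benchmark results can be recomputed from raw counts. The sketch below assumes, based on the figures themselves, that "Retrieval" is the number of disease names that received a mapping, "Recall" the number of correct mappings, both relative to the 92 benchmark names, and "Precision" the ratio of correct to retrieved. The `evaluate` helper is hypothetical.

```python
def evaluate(retrieved: int, correct: int, total: int) -> dict:
    """Retrieval, recall and precision as fractions."""
    return {
        "retrieval": retrieved / total,
        "recall": correct / total,
        "precision": correct / retrieved if retrieved else 0.0,
    }


# Counts from the SP row, "Total" columns: 36 retrieved, 32 correct, 92 names.
sp_total = evaluate(retrieved=36, correct=32, total=92)
summary = {k: f"{v:.0%}" for k, v in sp_total.items()}
print(summary)  # {'retrieval': '39%', 'recall': '35%', 'precision': '89%'}
```

The printed values agree with the percentages reported for that row.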

Page 15

Analysis of the results (1/3)

• Problems in granularity difference

Disease: ‘muscle liver brain eye nanism’

MeSH terms (manual mapping, automatic mapping): ‘abnormalities, multiple’, ‘muscle-eye-brain disease’

Page 16

Analysis of the results (2/3)

• Problems in disease name extraction

Disease (extracted): ‘hematopoietic tumors such as b-cell lymphomas’

MeSH terms (manual mapping, automatic mapping): ‘b-cell lymphoma’, ‘hematologic neoplasms’

Page 17

Analysis of the results (3/3)

• Problems inherent to the resources

Disease (SP): ‘epidermolysis bullosa simplex, Weber-Cockayne type’

Disease (OMIM alternative title): ‘epidermolysis bullosa dystrophica, Cockayne-Touraine type’

MeSH terms (manual mapping, automatic mapping): ‘epidermolysis bullosa dystrophica’, ‘epidermolysis bullosa simplex’

Page 18

Results on all Swiss-Prot

3197 disease comment lines (2398 with OMIM)

               SP           OMIM         SP ∩ OMIM    SP ∪ OMIM
Exact match    577 (18%)    655 (20%)    354 (11%)    866 (27%)
Partial match  691 (22%)    600 (19%)    317 (10%)    751 (23%)
Total          1268 (40%)   1225 (39%)   844 (26%)    1617 (51%)

Page 19

Discussion

The mapping system was tuned for high precision to provide a fully automated procedure. But we need to improve the recall by:

• Including NLP techniques in the disease extraction and matching procedures;

• Refining the score with other parameters (e.g. information from the hierarchical structure of MeSH);

• Permitting a mapping to several MeSH terms;

• Trying to map to other terminologies such as ICD-10 and SNOMED CT;

• Using information from the literature, which is indexed with MeSH terms.

Page 20

Work in progress

Benchmark extended to 200 diseases

200 disease comment lines (173 with OMIM)

            Exact match                      Partial match                    Total
            Retrieval  Recall     Precision  Retrieval  Recall     Precision  Retrieval  Recall     Precision
SP          35 (18%)   35 (18%)   100%       54 (27%)   47 (24%)   87%        89 (45%)   82 (41%)   92%
OMIM        40 (20%)   38 (19%)   95%        56 (28%)   48 (24%)   86%        96 (48%)   86 (43%)   90%
SP ∩ OMIM   22 (11%)   22 (11%)   100%       28 (14%)   26 (13%)   93%        62 (31%)   60 (30%)   97%
SP ∪ OMIM   52 (26%)   51 (26%)   98%        65 (33%)   56 (28%)   86%        117 (59%)  107 (54%)  91%

Page 21

Work in progress

• Extract MeSH terms using the full text of disease comment lines + references in Swiss-Prot + references in OMIM, then calculate their frequency.

• This frequency is used to refine the score for partial matches.

Preliminary results: the recall was successfully increased to 62% without losing precision.
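
One possible reading of this refinement is sketched below. The slide does not say how the frequency enters the score, so the additive combination, the `weight` parameter, the function names and the toy data are all assumptions.

```python
from collections import Counter


def descriptor_frequencies(indexed_texts):
    """Fraction of texts (comment lines, references) indexed with each descriptor."""
    counts = Counter(d for text in indexed_texts for d in set(text))
    n = len(indexed_texts) or 1
    return {d: c / n for d, c in counts.items()}


def refine(candidates, freqs, weight=1.0):
    """Add a frequency bonus to each candidate's partial-match score.

    ASSUMPTION: a simple weighted sum; other combinations are equally plausible.
    """
    return {d: s + weight * freqs.get(d, 0.0) for d, s in candidates.items()}


# Toy data: MeSH descriptors attached to three texts linked to one entry.
texts = [["Epidermolysis Bullosa Simplex", "Keratins"],
         ["Epidermolysis Bullosa Simplex"],
         ["Skin"]]
candidates = {"Epidermolysis Bullosa Simplex": 0.40,
              "Epidermolysis Bullosa Dystrophica": 0.45}

refined = refine(candidates, descriptor_frequencies(texts))
best = max(refined, key=refined.get)
print(best)
```

In this toy case the descriptor that is frequent in the linked texts overtakes a candidate that had a slightly higher string-based score.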

Page 22

Conclusion

We developed a generic terminology mapping procedure which can be used to link various biomedical resources.

Indexing UniProtKB with medical terms opens new possibilities for searching and mining data relevant to clinical research.

These results will help improve the interoperability between medical informatics and bioinformatics.