
Curating Curated Databases

Peter Buneman, University of Edinburgh & Digital Curation Centre

What is digital curation?

•  “…maintaining and adding value to a trusted body of digital information for current and future use” [DCC mission statement]

•  What is done by digital librarians and archivists 

2 Casimir 


What is a curated database? 

•  One that is maintained with a lot of human effort 

•  Prime concern is quality of data 

Some issues in curated DBs

•  Provenance
–  Where does your data come from? How was it formed? A huge problem now recognized in many areas of CS

•  Annotation
–  We all do it, but is there a principled, generic approach?

•  Archiving*
–  How to preserve something that is evolving in content and structure

•  Database Citation
–  Building stable and informative links to components of DBs

•  Collaborative construction of databases
–  Again, we need some useful technology


By a database I mean anything that has structure and evolves over time: ontologies, XML, stuff in scientific data formats as well as traditional (relational) DBs


The cost of curated data 

10^-7   Big physics (LHC) data*
10^-3   [Movie]
10^-1   Book
1       “Production” code / Curated data
10      “Reliable” code / Curated data

In $/€/£ per byte 

With apologies to [Hey & Trefethen, “The Data Deluge”, Southampton, 2003]

A change for the better?

Storage:
•  Redundant
•  Persistent
•  Distributed
•  Readable by people
Clear standards for citation
Historical record (old data is useful)
Well understood ownership/IP

Storage:
•  Single-source
•  Volatile
•  Centralised
•  Internal DBMS format
No standards for citation
No historical record
Mind-boggling legal issues

20th century libraries did some things better!


Some well-known curated databases

CIA World Factbook
Swiss-Prot

ID   11SB_CUCMA     STANDARD;      PRT;   480 AA.
AC   P13744;
DT   01-JAN-1990 (REL. 13, CREATED)
DT   01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
DT   01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
DE   11S GLOBULIN BETA SUBUNIT PRECURSOR.
OS   CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC   VIOLALES; CUCURBITACEAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=CV. KUROKAWA AMAKURI NANKIN;
RX   MEDLINE; 88166744.
RA   HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
RL   EUR. J. BIOCHEM. 172:627-632(1988).
RN   [2]
RP   SEQUENCE OF 22-30 ND 297-302.
RA   OHMIYA M., HARA I., MASTUBARA H.;
RL   PLANT CELL PHYSIOL. 21:157-167(1980).
CC   -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
CC   -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A
CC       BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
CC       DISULFIDE BOND.
CC   -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
DR   EMBL; M36407; G167492; -.
DR   PIR; S00366; FWPU1B.
DR   PROSITE; PS00305; 11S_SEED_STORAGE; 1.
KW   SEED STORAGE PROTEIN; SIGNAL.
FT   SIGNAL        1     21
FT   CHAIN        22    480       11S GLOBULIN BETA SUBUNIT.
FT   CHAIN        22    296       GAMMA CHAIN (ACIDIC).
FT   CHAIN       297    480       DELTA CHAIN (BASIC).
FT   MOD_RES      22     22       PYRROLIDONE CARBOXYLIC ACID.
FT   DISULFID    124    303       INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
FT   CONFLICT     27     27       S -> E (IN REF. 2).
FT   CONFLICT     30     30       E -> S (IN REF. 2).
SQ   SEQUENCE   480 AA;  54625 MW;  D515DD6E CRC32;
     MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR
     RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
     IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV
     FADTRNVANQ IDPYLRKFYL AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE
     EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE
     TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY
     TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF
     KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE
//


Archiving / Database Preservation

•  How do we preserve something that evolves (both in content and structure)?

•  Keep snapshots?
–  frequent: space consuming
–  infrequent: lose “history”

Most curated databases have a hierarchical structure that we can exploit… 


A Sequence of Versions 


This relies on a deterministic / keyed model – there’s a unique path to every data item.

Pushing time down

[B., Khanna, Tajima, Tan, TODS 27,2 (2004)]
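The merge step behind “pushing time down” can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the algorithm of the cited paper: each version is a nested dict whose keys give the unique path to every item, timestamps are kept as explicit sets of version numbers at every node (a real archive would store a timestamp only where it differs from the parent's), and the names `new_node`, `merge_version` and `snapshot`, like the sample figures, are made up for the example.

```python
def new_node():
    """An archive node: when it existed, its keyed children, its leaf values."""
    return {"t": set(), "children": {}, "values": {}}


def merge_version(archive, version, t):
    """Merge one keyed version into the archive, stamping it with version t."""
    archive["t"].add(t)
    if isinstance(version, dict):
        for key, subtree in version.items():
            # The key identifies the same item across versions.
            child = archive["children"].setdefault(key, new_node())
            merge_version(child, subtree, t)
    else:
        # A leaf: an unchanged value is stored once, with a growing timestamp.
        archive["values"].setdefault(version, set()).add(t)
    return archive


def snapshot(archive, t):
    """Recover version t from the archive (None if the node did not exist)."""
    if t not in archive["t"]:
        return None
    for value, times in archive["values"].items():
        if t in times:
            return value
    return {
        key: snapshot(child, t)
        for key, child in archive["children"].items()
        if t in child["t"]
    }


# Hypothetical data: two versions that differ in a single leaf.
v1 = {"Andorra": {"Population": "50,000"}, "China": {"Population": "1,000"}}
v2 = {"Andorra": {"Population": "50,000"}, "China": {"Population": "1,100"}}

archive = new_node()
merge_version(archive, v1, 1)
merge_version(archive, v2, 2)
```

Every version can be read back with `snapshot(archive, t)`, while the unchanged Andorra value is stored only once. That is why such an archive need not be much larger than its largest version.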

An initial experiment

•  Grabbed the last 20 available versions of Swiss‐Prot

•  XML‐ized all of them

•  Also recorded all OMIM versions for about 14 weeks (100 of them)

•  Combined into archive XML format file by pushing time down.


100 days of OMIM 

[Figure: size (bytes × 10^6) of the archive, the incremental diff repository (inc diff) and the individual versions, together with the compressed forms gzip(inc diff) and XMill(archive).]

Uncompressed

•  Archive size is

–  ≤ 1.01 times diff repository size

–  ≤ 1.04 times size of largest version

Compressed

•  archive size is between 0.94 and 1 times compressed diff repository size

•  gzip - unix compression tool

•  XMill - XML compression tool


~ 5 years of Swiss‐Prot 

[Figure: size (bytes × 10^6) of the archive, the incremental diff repository (inc diff) and the individual versions, together with the compressed inc diff and XMill(archive).]

Uncompressed

•  Archive size is

–  ≤ 1.08 times diff repository size

–  ≤ 1.92 times size of largest version

Compressed

•  archive size is between 0.59 and 1 times compressed diff repository size


Snapshots are immediate.  Longitudinal/temporal queries are also easy 

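[Figure: archive tree with nodes Factbook, Demography, Economy, Andorra, Liechtenstein, China and Population [1990‐2006]; the Population node carries timestamped leaves [1990], [1991], …, [2006], with values including 28,292, 28,476 and 34,247.]

Plot, by year, the population of Liechtenstein since 1990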


•  Implemented by Heiko Müller 

•  For scale, we require external sorting of large XML files

•  Designed and implemented by Ioannis Koltsidas, Heiko Müller and Stratis Viglas

•  Has a simple temporal query language

•  Experimented with recent (HTML) versions of CIA World Factbook
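As a rough illustration of what external sorting involves, here is a generic external merge sort in Python. It is not the XML sorter mentioned above: it assumes the data has already been flattened into one newline-free record per item, with the sort key as a prefix of each record, and the function names and the `max_in_memory` parameter are hypothetical.

```python
import heapq
import tempfile


def _write_run(sorted_records):
    """Spill one sorted chunk to a temporary file and rewind it for reading."""
    run = tempfile.TemporaryFile("w+", encoding="utf-8")
    run.writelines(record + "\n" for record in sorted_records)
    run.seek(0)
    return run


def external_sort(records, max_in_memory=100_000):
    """Yield records in sorted order, holding at most max_in_memory at once."""
    runs, chunk = [], []
    for record in records:
        chunk.append(record)
        if len(chunk) >= max_in_memory:
            runs.append(_write_run(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(_write_run(sorted(chunk)))
    try:
        # Merge the sorted runs lazily; only one line per run is in memory.
        streams = [(line.rstrip("\n") for line in run) for run in runs]
        yield from heapq.merge(*streams)
    finally:
        for run in runs:
            run.close()


# Tiny demonstration with an artificially small memory budget.
print(list(external_sort(["c", "a", "b", "e", "d"], max_in_memory=2)))
```

Sorting each version by key is what lets two large documents be merged in a single sequential pass, without loading either into memory.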


<T t="2002-2007">
  <FACTBOOK>
    <COUNTRY>
      <NAME>Afghanistan</NAME>
      <CATEGORY>
        <NAME>Communications</NAME>
        <PROPERTY>
          <NAME>Internet users</NAME>
          <TEXT>
            <T t="2004-2005">1,000 (2002)</T>
            <T t="2006-2007">30,000 (2005)</T>
            <T t="2002-2003">NA</T>
          </TEXT>
        </PROPERTY>
        <PROPERTY>
          <NAME>Radios</NAME>
          <TEXT>167,000 (1999)</TEXT>
        </PROPERTY>
        <PROPERTY>
          <NAME>Telephones - main lines in use</NAME>
          <TEXT>
            <T t="2006">100,000 (2005)</T>
            <T t="2007">280,000 (2005)</T>
            <T t="2002-2003">29,000 (1998)</T>
            <T t="2004-2005">33,100 (2002)</T>
            …


<T t="2002-2007">
  <FACTBOOK>
    <COUNTRY>
      <CATEGORY>
        <PROPERTY>
          <NAME>Population</NAME>
          <TEXT>
            <T t="2002">1,284,303,705 (July 2002 est.)</T>
            <T t="2003">1,286,975,468 (July 2003 est.)</T>
            <T t="2004">1,298,847,624 (July 2004 est.)</T>
            <T t="2005">1,306,313,812 (July 2005 est.)</T>
            <T t="2006">1,313,973,713 (July 2006 est.)</T>
            <T t="2007">1,321,851,888 (July 2007 est.)</T>
          </TEXT>
        </PROPERTY>
      </CATEGORY>
    </COUNTRY>
  </FACTBOOK>
</T>
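A longitudinal query over such an archive amounts to expanding the `t` stamps into years. The sketch below uses Python's standard `xml.etree.ElementTree` on the Population fragment shown above. It is not the system's own temporal query language; the helper names `years` and `yearly_series` are invented for the illustration, and it assumes that an unstamped `TEXT` inherits the stamp of the root element.

```python
import xml.etree.ElementTree as ET

ARCHIVE = """
<T t="2002-2007"><FACTBOOK><COUNTRY><CATEGORY><PROPERTY>
  <NAME>Population</NAME>
  <TEXT>
    <T t="2002">1,284,303,705 (July 2002 est.)</T>
    <T t="2003">1,286,975,468 (July 2003 est.)</T>
    <T t="2004">1,298,847,624 (July 2004 est.)</T>
    <T t="2005">1,306,313,812 (July 2005 est.)</T>
    <T t="2006">1,313,973,713 (July 2006 est.)</T>
    <T t="2007">1,321,851,888 (July 2007 est.)</T>
  </TEXT>
</PROPERTY></CATEGORY></COUNTRY></FACTBOOK></T>
"""


def years(stamp):
    """Expand a stamp such as "2004-2005" or "2006" into a range of years."""
    first, _, last = stamp.partition("-")
    return range(int(first), int(last or first) + 1)


def yearly_series(root, property_name):
    """Map each year to the text recorded for the named property."""
    series = {}
    for prop in root.iter("PROPERTY"):
        if prop.findtext("NAME") != property_name:
            continue
        text = prop.find("TEXT")
        if text is None:
            continue
        stamped = text.findall("T")
        if stamped:
            for node in stamped:
                for year in years(node.get("t")):
                    series[year] = node.text
        else:
            # Assumption: with no local stamps the value holds for the
            # whole interval given on the root element.
            for year in years(root.get("t")):
                series[year] = text.text
    return dict(sorted(series.items()))


root = ET.fromstring(ARCHIVE.strip())
for year, value in yearly_series(root, "Population").items():
    print(year, value)
```

The loop prints one line per year from 2002 to 2007, which is the series one would hand to a plotting tool.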


How did land area of countries change in 2002‐2007?

<T t="2002-2007">
  <FACTBOOK KEY="">
    …
    <COUNTRY KEY="NAME Austria">
      <CATEGORY KEY="NAME Geography">
        <PROPERTY KEY="NAME Area">
          <SUBPROP>
            <NAME>land</NAME>
            <TEXT>
              <T t="2004-2007">82,444 sq km</T>
              <T t="2002-2003">82,738 sq km</T>
            </TEXT>
          </SUBPROP>
        </PROPERTY>
      </CATEGORY>
    </COUNTRY>
    …
    <COUNTRY KEY="NAME France">
      <CATEGORY KEY="NAME Geography">
        <PROPERTY KEY="NAME Area">
          <SUBPROP>
            <NAME>land</NAME>
            <TEXT>
              <T t="2002-2006">545,630 sq km</T>
              <T t="2007">640,053 sq km; 545,630 sq km (metropolitan France)</T>
            </TEXT>
            …
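In the archive, a value that changed is simply a `TEXT` element with more than one timestamped child, so the land-area question reduces to a filter. The following Python sketch shows one way to express it with `xml.etree.ElementTree`. The fragment is an abridged, well-formed copy of the Austria and France entries from the slide, and `LAND_PATH` and `land_area_changes` are names invented here, not part of the system described.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """
<T t="2002-2007">
<FACTBOOK KEY="">
  <COUNTRY KEY="NAME Austria">
    <CATEGORY KEY="NAME Geography">
      <PROPERTY KEY="NAME Area">
        <SUBPROP>
          <NAME>land</NAME>
          <TEXT>
            <T t="2004-2007">82,444 sq km</T>
            <T t="2002-2003">82,738 sq km</T>
          </TEXT>
        </SUBPROP>
      </PROPERTY>
    </CATEGORY>
  </COUNTRY>
  <COUNTRY KEY="NAME France">
    <CATEGORY KEY="NAME Geography">
      <PROPERTY KEY="NAME Area">
        <SUBPROP>
          <NAME>land</NAME>
          <TEXT>
            <T t="2002-2006">545,630 sq km</T>
            <T t="2007">640,053 sq km; 545,630 sq km (metropolitan France)</T>
          </TEXT>
        </SUBPROP>
      </PROPERTY>
    </CATEGORY>
  </COUNTRY>
</FACTBOOK>
</T>
"""

# The KEY attributes give a unique path from each country to its Area property.
LAND_PATH = "CATEGORY[@KEY='NAME Geography']/PROPERTY[@KEY='NAME Area']/SUBPROP"


def land_area_changes(root):
    """Return {country: [(stamp, value), ...]} for land areas that changed."""
    changes = {}
    for country in root.iter("COUNTRY"):
        name = country.get("KEY", "")
        if name.startswith("NAME "):
            name = name[len("NAME "):]
        for subprop in country.findall(LAND_PATH):
            if subprop.findtext("NAME") != "land":
                continue
            stamped = subprop.findall("TEXT/T")
            if len(stamped) > 1:  # more than one stamp means the value changed
                changes[name] = sorted(
                    ((node.get("t"), node.text) for node in stamped),
                    key=lambda item: int(item[0].split("-")[0]),
                )
    return changes


for country, history in land_area_changes(ET.fromstring(FRAGMENT.strip())).items():
    print(country, history)
```

Countries whose land area has a single value for the whole period produce no entry, so the result lists only the changes.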

<T t="21/08/2007-10/09/2007">
  <CIAWFB KEY="">
    <COUNTRY KEY="NAME Afghanistan">
      <CATEGORY KEY="NAME Communications">
        <PROPERTY KEY="NAME Internet users">
          <T t="21/08/2007">
            <TEXT>30,000 (2005)</TEXT>
          </T>
          <T t="10/09/2007">
            <TEXT>535,000 (2006)</TEXT>
          </T>
        </PROPERTY>
        <PROPERTY KEY="NAME Telephones - mobile cellular">
          <T t="21/08/2007">
            <TEXT>1.4 million (2005)</TEXT>
          </T>
          <T t="10/09/2007">
            <TEXT>2.52 million (2006)</TEXT>
          </T>
          …


A case study: IUPHAR database, curated by Tony Harmar and team

•  “Standard” curated database

•  Labour‐intensive (hundreds of contributors)

•  Valuable (supported by drug companies) 


•  Simple, clean structure – as seen by users 

•  Payroll: 1‐2 people for data checking, entry and software maintenance.


We wanted to use IUPHAR as a guinea‐pig 

•  Our first task was to convert the database into a hierarchical structure (following the web presentation) so that we could archive it.

•  We used the Prata XML publishing software (Fan et al.)

•  This had some unexpected benefits… 


•  We can preserve all versions of the data (as intended)

•  Tony can trace the history of entries

•  We can generate static web pages (less software, more efficient)

•  We can make the database citable

•  We make the database exportable

•  We have a “community model” for data exchange

•  The data got cleaned up in the process

•  The representation information (required by archivists) is greatly simplified

•  Tony can generate an old‐fashioned book (yes, he wants to do this!)


The IUPHAR Receptor Database

Tony Harmar and Ed Rosser

© Draft date August 24, 2006

ii CONTENTS

5.1.8 BENZODIAZEPINE INSENSITIVITY . . . 19
5.1.9 BENZODIAZEPINE INSENSITIVITY . . . 20
5.1.10 THE ρ-CONTAINING RECEPTORS . . . 20
5.1.11 OTHER MODULATORY SITES . . . 20
5.1.12 OTHER MODULATORY SITES . . . 21
5.1.13 CONCLUDING REMARKS . . . 21
5.2 References . . . 21

6 Prokineticin receptors . . . 25
6.1 Introduction . . . 25
6.1.1 GENERAL . . . 25
6.2 References . . . 25

7 Prolactin-releasing peptide receptor . . . 27
7.1 References . . . 27

8 Acetylcholine receptors (nicotinic) . . . 31
8.1 Introduction . . . 31
8.1.1 GENERAL . . . 31
8.1.2 FUNCTIONAL ROLES . . . 31
8.1.3 PHARMACOLOGICAL CHARACTERISTICS . . . 32
8.1.4 RECEPTOR SUBUNIT ASSEMBLY . . . 32
8.1.5 CONCLUDING REMARKS . . . 32
8.2 2* . . . 32
8.3 6* . . . 32
8.4 9* . . . 33
8.5 1* . . . 33
8.6 3* . . . 33
8.7 4* . . . 33
8.8 7* . . . 33
8.9 References . . . 33

9 P2X receptors . . . 37
9.1 Introduction . . . 37
9.1.1 GENERAL . . . 37
9.1.2 OVERALL STRUCTURE OF THE P2X RECEPTOR FAMILY . . . 37
9.1.3 THE PORE . . . 37
9.1.4 STOICHIOMETRY . . . 38
9.1.5 HETEROPOLYMERISATION OF P2X SUBUNITS . . . 38
9.1.6 OPERATIONAL CHARACTERISTICS . . . 39
9.2 P2X2 . . . 39
9.3 P2X4 . . . 39
9.4 P2X5 . . . 39
9.5 P2X6 . . . 39
9.6 P2X7 . . . 40
9.7 References . . . 40

Selected pages from the book – generated by a 100‐line style sheet 


208 CHAPTER 27. ENDOTHELIN RECEPTORS

Structural Information

Species  TM  AA   Accession Number  Chromosomal Location  Gene Name  References
Human    7   427  P25101            4q31.22               EDNRA      1,5,57,58
Rat      7   426  P26684            19q11                 Ednra      4
Mouse    7   427  Q61614            8?                    Ednra      59

Functional Assays

Isolated ring preparations of human coronary artery
Species: Human
Tissue: Vasoconstriction
Response measured: Coronary artery
References: 60

Isolated ring preparations of rat thoracic aorta
Species: Rat
Tissue: Vasoconstriction
Response measured: Aorta
References: 77-79

Antagonist Ligands

R’tive  Endog.  Alt.  Species  Name             Affinity  Action           Units  References
YES     NO      YES   Human    [125I]PD164333   9.8-9.6   Antagonist       pKd    47
YES     NO      YES   Human    [125I]PD151242   9.1-9     Antagonist       pKd    23
YES     NO      YES   Rat      [3H]BQ123        8.5       Antagonist       pKd    21
NO      NO      YES   Human    A127742          10.5      Antagonist       pIC50  29
NO      NO      YES   Human    PD156707         9.2-8.7   Antagonist       pIC50  70,71
NO      NO      YES   Human    SB234551         9         Antagonist       pIC50  26
NO      NO      YES   Human    FR139317         7.9-7.3   Inverse Agonist  pIC50  60
NO      NO      YES   Human    BQ123            7.4-6.4   Antagonist       pIC50  60

Agonist Ligands

R’tive  Endog.  Alt.  Species  Name                    Affinity  Action        Units  References
YES     NO      NO    Human    [125I]ET-1              10.5-9.1  Full Agonist  pKd    62-64
YES     NO      YES   Human    [18F]ET-1               8.2       Full Agonist  pKd    67
YES     NO      YES   Human    [125I]ET-2              9.1-8.9   Full Agonist  pKd    68
YES     NO      YES   Human    [125I]sarafotoxin S6B   9.8-9.6   Full Agonist  pKd    68
NO      YES     YES   Human    ET-2                    8.2       Full Agonist  pIC50  60
NO      NO      YES   Human    sarafotoxin S6b         8.1-7.5   Full Agonist  pIC50  60
NO      YES     YES   Human    ET-1                    8.5-7.8   Full Agonist  pIC50  60

27.4 References

1 S. Kimura, M. Yanagisawa, T. Masaki, K. Goto, H. Kurihara, Y. Tomobe, M. Kobayashi, Y. Mitsui and Y. Yazaki, Nature, 332, 411-415.

2 S. Kimura, M. Yanagisawa, Y. Kasuya, K. Goto, T. Masaki, A. Inoue and T. Miyauchi, Proc. Natl. Acad. Sci. U.S.A., 86, 2863-2867.

Chapter 27

Endothelin receptors

Contributors: Anthony P. Davenport, Théophile Godfraind, Eliot H. Ohlstein, Robert R. Ruffolo, Pedro D’Orléans-Juste and Janet J. Maguire

27.1 Introduction

27.1.1 GENERAL

In mammals, the endothelin (ET) family comprises three endogenous isoforms, ET-1, ET-2 and ET-3 (refs. 1,2), and the receptors that mediate their effects have been classified as the endothelin ETA and ETB receptors.

27.1.2 RECEPTOR STRUCTURE

The two endothelin receptors have been isolated and cloned from mammalian tissues (refs. 1-9). The structures of the mature receptors have been deduced from the nucleotide sequences of the cDNAs. The encoded proteins contain seven stretches of 20-27 hydrophobic amino acid (aa) residues in both receptors. This structure is consistent with a seven-transmembrane domain (7TM), G protein-coupled receptor belonging to the rhodopsin-type receptor superfamily. Both receptors have an N-terminal signal sequence, which is rare among heptahelical receptors, with a relatively long extracellular N-terminal portion preceding the first transmembrane domain. There are two separate ligand-interaction sub-domains on each endothelin receptor. The extracellular loops, particularly between TM4-TM6, determine selectivity.

27.1.3 RECEPTOR SIGNALLING

Endothelin is able to activate a number of signal transduction processes including phospholipase (PL) A2, PLC and PLD, as well as cytosolic protein kinase activation. The receptors are able to couple to various types of G protein. Both ETA and ETB receptors expressed in COS7 cells were shown to couple to Gq, G11, Gs and Gi2, suggesting that endothelin receptors may simultaneously stimulate multiple effectors via several types of G protein (ref. 10). ETA receptors expressed in CHO cells couple to Gq and Gs but not Gi. ETB receptors couple to Gq and Gi. Coupling to Gs occurs through the second and third intracellular loops of the receptor. In order to couple with Gi through the third intracellular loop, palmitoylation of the C-terminal cysteine residues and C-terminus are necessary, whereas to couple with Gq only palmitoylation of the C-terminal domain is important (refs. 11,12).

Our library would “host” the book, but not the database! 


Centralized vs. distributed publishing

20th century libraries provided robust, distributed dissemination and preservation of reference material

Valuable information was lost in earlier “data centers”. Is this still happening?

Replication and distribution have always been the best guarantee of preservation. We should do the same for curated databases – a database LOCKSS?


Many of the issues are non‐technical 

•  A good economic model for sustainability
–  Open access works for journal papers
–  Can it work for curated DBs? They require long‐term support. And people who write reference manuals sometimes expect to make money out of them.

•  Intellectual property in some curated databases is a nightmare
–  legislation still largely based on the notion of copying.

•  We can still help by providing good models of the processes in curating and publishing databases


Further inflammatory thoughts 

•  The grand unified database is probably a bad idea
–  large complex DBs collapse under their own weight
–  federation, cross‐linking, and copying (with provenance) is good

•  DB replication and distribution is good
–  but make sure that it is distributed in a usable form

•  Is a “self‐curating” and “self‐preserving” database possible?

•  We need to unify the two notions of curation


Notes 

•  Paul Schofield. Slide that showed how little was spent on information infrastructure

•  Tom Weaver (Harwell/MRC). Mention data saving requirements of MRC funding?

•  Janan – Excellent examples of curation costs. In line with my own coarse estimates.

•  Helen Parkinson & Paul. Mention the idea of putting everything in a “magically sustainable database”. Could such a thing work? Large complex databases collapse under their own weight

•  Alan Bridge – what’s the relationship between Swiss‐Prot and what is done at EBI?
