
F.Magni - Siena 2014

Proteomics and Mass Spectrometry: Biochemical and Clinical Applications

Fulvio Magni

Department of Experimental Medicine

Faculty of Medicine and Surgery

Università degli Studi di Milano-Bicocca

Genome → 1 Gene → Single Protein → Disease

Is it true that all the information lies in the genome?

Differences in the genome: ≈1%

Differences in the proteome: >1%

Identical genome, different proteome


Only 2% of human disease results from a single gene defect

DNA → mRNA (alternative splicing) → Protein: 5,000-10,000 activated genes, 15,000-30,000 proteins

Final form of proteins (3D and function) cannot be predicted with certainty from the linear codes of genes

Most proteins are modified after they are synthesised

Why do we analyse proteins?

Proteins are the molecules that carry out correct cellular function

Single gene defect => which protein is altered?

Genome Proteome

Proteome indicates the PROTEins expressed by a genOME or tissue

PROTEOME

Proteomics is the large-scale study of gene expression at the protein level.

PROTEOMICS

Why do we analyse proteins?

DNA → mRNA (alternative splicing): 5,000-10,000 activated genes

Protein (PTMs): 15,000-30,000 proteins

The proteome, unlike the genome, is not a fixed feature of an organism

A single genome can give rise to an essentially infinite number of qualitatively and quantitatively different proteomes depending on: …

The simultaneous study of the whole range of proteins expressed in a cell at any given time

PROTEOME

Aim

Which components of the proteomic profile:

- are relevant for human disease → Diagnosis

- are excellent therapeutic targets → Therapy


Expression Proteomics

Normal sample Disease sample

1st: Proteins are separated and visualised by 2D-gel electrophoresis

2nd: Gel images are compared with dedicated software

Normal sample Disease sample

Not present in normal sample: Prostate cancer

Low levels in normal sample: Alzheimer

High levels in normal sample: Parkinson


Proteomics has now branched into two specific disciplines:

Expression proteomics (classical):

- qualitative: display on 2D-PAGE or an alternative technique, identification by mass spectrometry

- quantitative: evaluation of differential expression

Functional proteomics: localization or identification studies of proteins with specific biological activities, and interaction studies.

General strategy for protein study (step: methodology):

1- Purification – Isolation: 2D-PAGE

2- Identification – Characterization: Mass spectrometry

3- Database searching: Bioinformatics

Expression Proteomics: 3rd: Proteins that differ in abundance between the gels are identified by MS

2D-PAGE

Protein Band

Cut out

1- Enzymatic digestion (e.g. trypsin)

2- Peptide extraction

3- Peptide mass fingerprint by MS analysis

4- Database search

Evaluation of the Mr of each tryptic peptide

Identification of the protein by database searching
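The database side of a peptide mass fingerprint can be sketched as an in-silico tryptic digest followed by a mass calculation for each peptide. This is a minimal illustration, not the algorithm of any specific search engine: the function names and the toy sequence are hypothetical, the cleavage rule (after Lys or Arg, not before Pro) is the commonly used one, and missed cleavages and modifications are ignored.

```python
import re

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (free N- and C-terminus)
PROTON = 1.00728   # singly protonated ion, as typically seen in MALDI


def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def mh_plus(peptide):
    """Monoisotopic m/z of the singly charged peptide ion [M+H]+."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


if __name__ == "__main__":
    toy_protein = "MKWVTFISLLLLFSSAYSRGVFRR"  # hypothetical example sequence
    for pep in tryptic_digest(toy_protein):
        print(f"{pep:20s} {mh_plus(pep):10.4f}")
```

The list of theoretical masses produced this way is what a search program compares with the measured peak list.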

ANALYTICAL PART: Trypsin digestion protocol

1. Excision of the protein bands (spots) from the gel

Gel wash: to remove stain and SDS

2. Reduction and alkylation

Addition of dithiothreitol (DTT): to reduce S-S bonds

Addition of iodoacetamide: to alkylate S-H groups

Gel wash: to remove excess reagents

3. In-gel digestion

Addition of trypsin

Incubation at 37°C overnight

4. Extraction

Proteomics: MALDI-TOF? Peptide mass fingerprint:

A set of peptide molecular weights from an enzymatic digest of a protein is measured by mass spectrometry.

MALDI-TOF:

Analyze high masses >100kDa

Measure entire mass range

Compatible with many buffers

Applicable to variety of compound types

High sensitivity (low fmole)

Molecular weight and structure info (PSD, TOF/TOF)

EASY TO DO


MALDI-TOF: How Does It Work?

1. Sample is mixed with matrix and dried on the sample plate.

2. Target is introduced into the high vacuum of the MS.

3. Sample spot is irradiated with a pulsed laser, desorbing ions into the gas phase and starting the clock that measures the time of flight.

4. Ions are accelerated by an electric field (high voltage, 20-30 kV) to the same kinetic energy, and they drift (or fly) down a field-free flight tube where they are separated in space.

5. Ions strike the detector at different times, depending on the mass-to-charge ratio of the ion.

6. A data system controls all instrument parameters, acquires the signal vs. time, and permits data processing.
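The link between flight time and m/z follows from energy conservation: an ion of charge z accelerated through a potential V gains kinetic energy zeV = ½mv², so its drift time over a tube of length L is t = L·sqrt(m / (2zeV)). The sketch below assumes an ideal linear instrument (no initial velocity spread, no delayed extraction, no reflectron); the 1.0 m tube length is a hypothetical value, while 20 kV is taken from the range quoted on the slide.

```python
import math

DALTON_KG = 1.66053906660e-27        # kg per Da
ELEMENTARY_CHARGE = 1.602176634e-19  # C


def flight_time(mz, voltage=20_000.0, tube_length=1.0):
    """Drift time in seconds for an ion of given m/z (Da per unit charge).

    voltage: accelerating potential in volts.
    tube_length: field-free drift length in metres (assumed value).
    """
    velocity = math.sqrt(2 * ELEMENTARY_CHARGE * voltage / (mz * DALTON_KG))
    return tube_length / velocity


if __name__ == "__main__":
    for mz in (1000.0, 2000.0, 4000.0):
        print(f"m/z {mz:7.1f}: {flight_time(mz) * 1e6:6.2f} microseconds")
```

Because the time scales with the square root of m/z, an ion four times heavier arrives twice as late, which is why the detector signal versus time can be converted into a mass spectrum.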

Trypsin: Lys, Arg

m/z        Intens.
824.478    2122.39
842.504    1293.31
940.328    26750.83
947.487    1837.60
1293.649   1323.95
1376.528   877.15
1448.721   2550.85
1544.626   1001.62
1790.921   580.69
1877.937   418.26
1907.895   742.16
2211.220   630.93
2225.172   465.53
2355.091   835.77
2528.241   3874.78

Peptide Mass Fingerprint (PMF): protein identification by database search

Protein identification

How can we arrive at the CERTAIN identity of a protein?

1- Experimentally determine ALL the information about the protein(s): COMPLETE APPROACH

2- Experimentally determine PART of the information about the protein(s) and use it to infer the missing information: APPROXIMATION APPROACH

Protein identification

Experimental information (complete or partial)

Archived information: databases

Search programs: bioinformatics

Identity

Protein identification

1 - Databases

2 - Protein identification and characterization: analytical methods

3 - Protein identification and characterization: software

Molecular & Cellular Proteomics 2009 Vol 8 : 2827 - 2842


Bell, A. W., et al., Nature Methods 6, 423–430 (1 June 2009).

What do we learn from these articles?

1. It is important to separate the proteins

2. It is important to obtain good/excellent mass spectrometry (MS) data.

3. All the effort spent on points 1 and 2 is WASTED if we cannot use the data correctly for lack of knowledge of:

3.1 The identification algorithms

3.2 The databases

3.3 All the possibilities offered by MS and its advances

Problems:

Data-processing algorithms

Databases

Type of instrument: low- or high-resolution analyzer

Post-translational modifications

Databases

CONTINUOUSLY UPDATED

Examples:

http://www.roseindia.net/bioinformatics/biologicaldatabases.shtml

http://molbio.info.nih.gov/molbio/Index.htm


Swiss-Prot: Release 50.9 of 17-Oct-06

Release 2011_12 of 14-Dec-11 of UniProtKB/Swiss-Prot contains 533,657 sequence entries.

Release 2013_01 of 09-Jan-13 of UniProtKB/Swiss-Prot contains 538,849 sequence entries, comprising 191,337,357 amino acids abstracted from 215,706 references.

14-Dec-2011: Swiss-Prot contains 533,657 sequence entries.

14-Dec-2011: TrEMBL contains 18,510,272 sequence entries.

Release 2013_01 of 09-Jan-2013 of TrEMBL

Release 2013_01 of 09-Jan-13 of Swiss-Prot

Factors relevant to the utility of a database

1. Number of entries

2. Frequency of errors

3. Redundancy of the entries

4. Presence of ancillary information

5. Frequency at which the database is updated

Description Elimination

• gi|2947219|gb|AAC39645.1| UDP-galactose 4' epimerase [Homo sapiens]

• gi|1119217|gb|AAB86498.1| UDP-galactose-4-epimerase [Homo sapiens]

• gi|14277913|pdb|1HZJ|B Chain B, Human Udp-Galactose 4-Epimerase: Accommodation Of Udp-N-Acetylglucosamine Within The Active Site

• gi|14277912|pdb|1HZJ|A Chain A, Human Udp-Galactose 4-Epimerase: Accommodation Of Udp-N-Acetylglucosamine Within The Active Site

• gi|2494659|sp|Q14376|GALE_HUMAN UDP-glucose 4-epimerase (Galactowaldenase) (UDP-galactose 4-epimerase)

• gi|1585500|prf||2201313A UDP galactose 4'-epimerase

B- The comprehensive protein sequence databases, derived by translation of all of the entries in the nucleotide sequence databases (NSDs)

GenPept: translated from the GenBank database (NCBI)

Protein DataBank: translated from the DNA DataBank of Japan

TrEMBL: translated from the EMBL Nucleotide Sequence Database.

C- The curated protein sequence databases

NCBInr

UniProtKB (www.expasy.ch)

Protein knowledgebase, consisting of two sections:

• Swiss-Prot, which is manually annotated and reviewed.

• TrEMBL, which is automatically annotated and is not reviewed.

Swiss-Prot

SWISS-PROT is a curated protein sequence database which strives to provide:

1. a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.),

2. a minimal level of redundancy,

3. a high level of integration with other databases.

It was established in 1986 and has been maintained collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library (now the EMBL Outstation of the European Bioinformatics Institute, EBI).

Swiss-Prot

SwissProt is a high quality, curated protein database. On this server, the database has been expanded using the Swissknife VARSPLIC utility. This parses the annotation text and creates new entries for any splice variants, sequence variants, or sequence conflicts. Original entries have a standard Swiss-Prot accession string, such as P13813. New entries, created by varsplic, have accession numbers in the form P13813-00-00-01. The title line describes the nature of the differences between the new entry and the parent entry.

Swiss-Prot VarSplic Output

P13746-00-01-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ

P13746-01-01-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ

P13746-00-00-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ

P13746-00-03-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ

P13746-01-03-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ

P13746-00-04-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ

P13746-01-04-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ

P13746-00-05-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ

P13746-01-05-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ

P13746-01-00-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ

P13746-00-02-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYSQAASSDSAQ

P13746-01-02-00 SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYSQAASSDSAQ

************************************* *******:*********


TrEMBL

EMBL: The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications and directly submitted by researchers and sequencing groups. Data collection is done in collaboration with GenBank (USA) and the DNA Databank of Japan (DDBJ).

TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT. TrEMBL_New files are identical in format and contain very recent, unannotated sequences. TrEMBL is developed by the SWISS-PROT groups at SIB and EBI.

NCBInr

NCBI (National Center for Biotechnology Information) maintains composite, non-identical protein and nucleic acid databases for its search tools BLAST and Entrez.

The nr database is compiled by the NCBI as a protein database for BLAST searches. It contains non-identical sequences from GenBank CDS translations, PDB, Swiss-Prot, PIR, and PRF. One of the main advantages of nr is that it is updated very frequently. NCBI has made strong efforts to cross-reference the sequences in these databases in order to avoid duplication.

Mixed databases

IPI

IPI (International Protein Index) is compiled by the EBI (European Bioinformatics Institute) to provide a top-level guide to the main databases that describe the human and mouse proteomes: SWISS-PROT, TrEMBL, NCBI RefSeq and Ensembl. The aim is to:

1. effectively maintain a database of cross references between the primary data sources

2. provide a minimally redundant yet maximally complete set of proteins (one sequence per transcript)

3. maintain stable identifiers (with incremental versioning) to allow the tracking of sequences between IPI releases.

IPI is updated monthly in accordance with the latest data released by the primary data sources. There are currently two IPI databases, Human and Mouse.

dbEST

The EST database represents a final type of sequence database:

- dbEST is composed of a large number of entries

- each entry is a short piece of nucleotide sequence, typically about 300 bases in length.

- this type of nucleotide sequence is produced by highly automated sequencing of randomly selected portions of the expressed DNA of a given tissue.

- the advantage of this approach to genomic sequencing is that a large amount of sequence data is produced at a relatively low cost.

dbEST

This is a nucleic acid database which is translated by Mascot in all six reading frames. This generates a very large database, so that dbEST searches take far longer than a search of one of the non-redundant protein databases.

You should only search dbEST if a search of a protein database has failed to find a match.
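To see why a six-frame translation inflates the search space, the idea can be sketched in a few lines: each nucleotide entry yields three forward and three reverse-complement protein translations. This is only an illustration with the standard genetic code; it is not Mascot's implementation, and the function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the 3 forward and 3 reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[f:]) for strand in (dna, reverse) for f in range(3)]


if __name__ == "__main__":
    for i, protein in enumerate(six_frames("ATGGCCAAGTGA"), start=1):
        print(i, protein)
```

Every EST therefore contributes six candidate protein sequences, most of them not biologically meaningful, which is consistent with the advice to search dbEST only after a protein database search has failed.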

Decoy database

For large-scale experiments: "provide the results of any additional statistical analyses that indicate or establish a measure of identification certainty, or allow a determination of the false-positive rate, e.g., the results of randomized database searches or other computational approaches."

This is a recommendation to repeat the search, using identical search parameters, against a database in which the sequences have been reversed or randomised.

You do not expect to get any true matches from the "decoy" database. So, the number of matches that are found is an excellent estimate of the number of false positives that are present in the results from the real or "target" database.

Elias, J. E., et al., Nature Methods 2 667-675 (2005).
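The decoy strategy described above can be sketched as follows, assuming separate target and decoy searches run with identical parameters. The helper names, the "REV_" accession prefix and the score values are hypothetical; real search engines report scores in their own formats.

```python
def reversed_decoy(proteins):
    """Build a decoy database by reversing every sequence.

    proteins: dict mapping accession -> amino acid sequence.
    The 'REV_' prefix is an illustrative naming convention.
    """
    return {"REV_" + acc: seq[::-1] for acc, seq in proteins.items()}


def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate the false discovery rate at a given score threshold.

    Decoy matches above the threshold approximate the number of false
    positives among the target matches above the same threshold.
    """
    target_hits = sum(score >= threshold for score in target_scores)
    decoy_hits = sum(score >= threshold for score in decoy_scores)
    return decoy_hits / target_hits if target_hits else 0.0


if __name__ == "__main__":
    target = [90.0, 80.0, 70.0, 40.0]  # hypothetical scores, target search
    decoy = [66.0, 30.0]               # hypothetical scores, decoy search
    print(f"Estimated FDR at score 65: {estimate_fdr(target, decoy, 65.0):.2f}")
```

Raising the threshold until the estimated rate falls below an accepted level (often around 1%, though the choice is a matter of convention) is the usual way such an estimate is used.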

Database search

MASCOT: http://www.matrixscience.com

PROTEIN PROSPECTOR: http://prospector.ucsf.edu/

PEPTIDE SEARCH: http://www.mann.embl-heidelberg.de/

MOWSE: http://www.hgmp.mrc.ac.uk/Bioinformatics/

ProFound: http://prowl.rockefeller.edu/cgi-bin/ProFound

SEQUEST: http://thompson.mbt.washington.edu/sequest/

Peptide Mass Fingerprint (PMF): protein identification by database search

Proteome: strategy 2

Monoisotopic mass = the mass of the molecular ion calculated using the exact mass of the most abundant isotope of each element (e.g. H = 1.007825, 12C = 12.000000)
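The definition can be made concrete by computing a mass from an elemental formula twice: once with the exact mass of the most abundant isotope of each element (monoisotopic) and once with standard atomic weights (average). The sketch below is illustrative; the parser handles only simple formulas without brackets, and the function name is hypothetical.

```python
import re

# Exact masses of the most abundant isotopes (Da).
MONOISOTOPIC = {"C": 12.000000, "H": 1.007825, "N": 14.003074,
                "O": 15.994915, "S": 31.972071}
# Standard atomic weights (isotope-averaged, rounded).
AVERAGE = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}


def formula_mass(formula, table):
    """Sum element masses for a flat formula such as 'C2H5NO2'."""
    return sum(
        table[element] * int(count or 1)
        for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    )


if __name__ == "__main__":
    glycine = "C2H5NO2"
    print("monoisotopic:", round(formula_mass(glycine, MONOISOTOPIC), 6))
    print("average:     ", round(formula_mass(glycine, AVERAGE), 3))
```

For peptides the gap between the two values grows with size, which is why a search must be told whether the peak list contains monoisotopic or average masses.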

Isotopic peaks (m/z): 948.5, 949.5, 950.5, 951.5

Adenylate kinase (Mr 23634), tryptic digest ==> 17 peptides ==> MALDI-TOF ==> Database search

MASS TOLERANCE in ppm No of peptide matched

1000 700 400 200 100 75 50 30

5 429 136 51 39 29 19 3 1 6 163 54 9 10 7 7 82 16 6 6 8 36 2 9 9 1 10 8 1 1 1
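The effect of mass tolerance on the number of candidate matches can be illustrated with a small ppm calculation. This is a sketch with hypothetical candidate masses, not a reproduction of the adenylate kinase search; the function names are illustrative.

```python
def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def count_matches(observed, candidates, tolerance_ppm):
    """Count theoretical masses within +/- tolerance_ppm of the observed mass."""
    return sum(abs(ppm_error(observed, m)) <= tolerance_ppm for m in candidates)


if __name__ == "__main__":
    observed = 1000.0
    candidates = [1000.05, 1000.5, 1001.5]  # hypothetical database masses
    for tol in (1000, 100, 30):
        print(f"{tol:5d} ppm: {count_matches(observed, candidates, tol)} match(es)")
```

Tightening the tolerance removes chance matches, so fewer matched peptides are needed before a single protein stands out.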

MASCOT: Fingerprint Search Results

[Figure: MALDI-TOF mass spectrum; x-axis m/z (1300-3800), y-axis a.i.]

Search result

Accession code and name of the top-scoring protein

Identification details

Protein list

New search with the unused data

IDENTIFICATION PARAMETERS

• Score > 65 (www.matrixscience.com)

• MS match for at least 4-5 peptides

• Mass error below 150 ppm (external calibration)

• Mass error below 50 ppm (internal calibration)

• Sequence coverage: at least 20%

• Mr and pI should match the estimates or published values
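These acceptance criteria can be written as a simple filter. The sketch below is only an illustration: the class and field names are hypothetical, "at least 4-5 peptides" is read as a minimum of 4, and the Mr/pI comparison is left out because it depends on gel estimates that are not modelled here.

```python
from dataclasses import dataclass


@dataclass
class PmfHit:
    """One candidate protein from a peptide mass fingerprint search."""
    score: float
    matched_peptides: int
    mass_error_ppm: float
    sequence_coverage: float          # percent
    internal_calibration: bool = False


def accept(hit):
    """Apply the score, peptide count, mass error and coverage criteria."""
    max_ppm = 50.0 if hit.internal_calibration else 150.0
    return (
        hit.score > 65
        and hit.matched_peptides >= 4      # assumption: lower bound of "4-5"
        and abs(hit.mass_error_ppm) < max_ppm
        and hit.sequence_coverage >= 20
    )


if __name__ == "__main__":
    print(accept(PmfHit(score=80, matched_peptides=6,
                        mass_error_ppm=40, sequence_coverage=30)))
```

A hit that passes such a filter would still need its Mr and pI checked against the gel position, as the last criterion on the slide requires.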

Ion trap: MS/MS scan mode

1. Inject

2. Isolate

3. Fragment

4. Detect

Trypsin: Lys, Arg

Product Ion Spectrum from T7 of Human Growth Hormone

[Figure: product ion spectrum in three m/z panels (100-500, 600-900, 1000-1800); y-axis Rel. Int. (%)]

Peptide sequence: Ile-Ser-Leu-Leu-Leu-Ile-Gln-Ser-Trp-Leu-Glu-Pro-Val-Gln-Phe-Leu-Arg

B ions (B1 to B16): 114, 201, 314, 427, 540, 653, 781, 868, 1054, 1168, 1297, 1394, 1493, 1621, 1768, 1881

Y ions (Y16 to Y1): 1942, 1855, 1742, 1629, 1516, 1403, 1274, 1187, 1001, 888, 759, 662, 563, 435, 288, 175
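The b and y ion ladders for this peptide can be recomputed from residue masses: a b ion is the sum of the N-terminal residues plus a proton, and a y ion is the sum of the C-terminal residues plus water and a proton. The sketch below uses monoisotopic masses for singly charged ions, so the values should come out close to the nominal masses listed above (for example b1 near 114 and y1 near 175). The function name is hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[0] is b1; y_ions[0] is y1. Both lists have len(peptide) - 1 items.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = PROTON
    for m in masses[:-1]:            # N-terminal fragments
        running += m
        b_ions.append(running)
    running = WATER + PROTON
    for m in reversed(masses[1:]):   # C-terminal fragments
        running += m
        y_ions.append(running)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("ISLLLIQSWLEPVQFLR")
    for i, (bi, yi) in enumerate(zip(b, y), start=1):
        print(f"b{i:<2d} {bi:9.3f}    y{i:<2d} {yi:9.3f}")
```

Mass differences between consecutive ions of the same series correspond to single residues, which is how a sequence is read from the spectrum; Leu and Ile have the same mass and cannot be told apart this way.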

Protein ID by Sequence Tags: 1 tag uses 5 components

– peptide molecular weight

– molecular wt before partial sequence (region 1)

– partial sequence (region 2)

– molecular wt after partial sequence (region 3)

Peptide measured molecular wt = 1927.2

region 1: 1108.13; region 2: partial sequence -A-V-I/L-T-, 1546.11 Da; region 3: 381.1

Mass accuracy in database searching (2 AA sequence tag)

1488.754

1000ppm = 202

100 ppm = 29

10 ppm = 1 hit

1237.661

1000ppm = 39

100ppm = 9

10ppm = 1 hit

1925.837

1000ppm = 738

100ppm = 15

10ppm = 1 hit

1981.035

1000ppm = 412

100ppm = 38

10ppm = 2

5ppm = 1 hit

1171.591

1000ppm = 573

100ppm = 71

10 ppm = 5 hits

1213.207

1000ppm = 314

100ppm = 65

10ppm = 1 hit

Three ways to use mass spectrometry data for protein ID:

1. Peptide Mass Fingerprint (MALDI-TOF)

A set of peptide molecular weights from an enzyme digest of a protein

2. Sequence Query

Mass values combined with amino acid sequence or composition data

3. MS/MS Ions Search (HPLC-ESI-MS/MS)

MS/MS data from a single peptide or from a complete LC-MS/MS run: complete or partial amino acid sequence

Proteomics: clinical studies.

Identification of disease-specific proteins for dilated cardiomyopathy. Electrophoresis 1999

Anti-endothelial cell antibodies as a potential predictive test for chronic heart transplantation rejection. Hum. Immunol. 1999

Identification of several disease-specific proteins for cell carcinoma of bladder. Cancer Res. 1999

Potential marker for prostate and ovarian cancer. Mol. Med. Today 1999

Identification of 18 proteins with abnormal expression in schizophrenics. Mol. Psychiatry 2000

Defining urinary proteome… Proteomics 2001

Proteome of human cerebrospinal fluid. Proteomics 2001

Clinical proteomics for cancer biomarker discovery and therapeutic targeting. Technol. Cancer Res Treat. 2002

The human plasma proteome: history, character, and diagnostic prospects. Mol Cell Proteomics. 2003

Healthy population Breast cancer population

PROTEOMICS IN CLINICAL LABORATORY

BIOCHEMICAL AND CLINICAL APPLICATIONS: SELDI and ClinProt techniques

Fulvio Magni

Department of Experimental Medicine (DIMS)

Università degli Studi di Milano-Bicocca

ATB 2003

BIOMARKERS

Most diseases originate from changes in protein metabolism, so proteins can be identified that act as markers of the disease.

Most diseases are polygenic, so a single antigen is insufficient for reliable detection of the disease (CA 125, PSA).

Changes in the proteome of an organ can give rise to a characteristic protein pattern in biological fluids.

There is early evidence that multiple markers can be identified, consisting of a set of proteins over- or under-expressed in diseased subjects compared with healthy subjects.

MASS SPECTROMETRY BASED PROTEOMICS: CURRENT STATUS AND POTENTIAL USE IN CLINICAL CHEMISTRY

P-A Binz, DF Hochstrasser and RD Appel, Clin. Chem. Lab. Med, 2003, 41, 1540

Classical proteomics

Molecular scanner

Multidimensional identification (MudPIT)

ICAT labelling

SELDI - ClinProt

STRATEGY

build the protein profiles of a "normal" sample and of a "pathological" one and find the differences

provide an "image" that is simple to interpret (gel view)

distinguish the normal protein profile from an altered one by means of dedicated algorithms

identify the differentially expressed proteins by tandem MS

PROTEIN-CHIP / SELDI (Ciphergen Biosystems, Inc., California)

IT IS BASED ON THE COMBINATION OF TWO TECHNIQUES:

chromatographic separation on different phases

mass spectrometry

IT USES:

"ProteinChip Array" strips, which separate groups of proteins with similar characteristics

the "ProteinChip Reader", which uses SELDI-TOF mass spectrometry

PROTEIN-CHIP ARRAY

Presented by B. Reed, Ciphergen Biosystems. Meeting/Conference: Swiss Proteomics, 2001. http://www.ciphergen.com/pub/showPubInfo.asp?id=117

CHEMICAL INTERACTION: Reverse phase, Cation Exchange, Anion Exchange, Metal Ions, Normal phase

chemical surfaces (ionic, hydrophobic, hydrophilic, etc.) that retain classes of proteins

BIOLOGICAL INTERACTION: PS-1 or PS-2, Antibody-Antigen, Receptor-Ligand, DNA-Protein

biochemical surfaces (antibodies, receptors, DNA, etc.) that retain a single protein

Search for protein markers

CHOICE OF THE PROTEIN-CHIP

http://www.ciphergen.com/techapps/pc/tech/arrays.asp

ION-EXCHANGE PROTEIN-CHIP

Washing with different buffers

Analysis by SELDI

Different protein profiles

Molecular mass / charge

CLINPROT™ MAGNETIC BEADS (Bruker Daltonics)

Workflow: MIXTURE OF PEPTIDES OR PROTEINS → Bind (SPECIFIC BINDING) → Wash (MAGNETIC SEPARATION) → Elute (ELUTION AND TARGET PREPARATION) → Profile (MALDI-TOF MS)

CLINPROT™ components:

ClinProt magnetic beads

Automation

AnchorChip™ target

autoflex MALDI-TOF, ultraflex TOF/TOF

ClinProTools: protein profiles, data analysis, clustering and classification

[Figure: cluster analysis separating Disease (diseased) from Normal (healthy) samples]

RESULTS WITH PROTEIN-CHIP

Bruker Daltonics

Green: diseased. Red: healthy.

OVEREXPRESSED PROTEINS

[Figure: box-and-whisker plots, controls vs patients]

UNDEREXPRESSED PROTEINS

[Figure: box-and-whisker plots, controls vs patients]

BREAST CANCER RESULTS

Jinong Li et al. Clinical Chemistry, 48, 1296-1304 (2002)

[Figure: spectra and gel views of samples 1-3, DISEASED vs HEALTHY]

OVARIAN CANCER: IDENTIFICATION OF MARKERS (Petricoin, E.F. et al., The Lancet, 359, 572-577 (2002))

Phase I: pattern discovery

Samples from unaffected subjects; samples from cancer patients

Generate protein mass spectra (15200 m/z values)

Genetic algorithm + self-organising cluster analysis

Discriminatory pattern: plot of relative abundance of 5-20 key proteins (m/z values) that best distinguish cancer from non-cancer

Phase II: pattern matching

Obtain mass spectrum from masked serum test sample

Generate signature pattern from test sample: plot relative abundance of 5-20 specific key discriminatory proteins identified in phase I

Pattern matching: compare unknown test sample signature pattern for likeness to previously found discriminatory pattern

Outcome: Unaffected, Cancer, or New cluster (no match)

VALIDATION OF THE BIOMARKER SET BY CLASSIFICATION OF SERA ANALYSED BLIND

Possible classifications: CANCER, NO CANCER, NEW CLUSTER

WOMEN WITH OVARIAN CANCER (classified as cancer / classified otherwise)

STAGE I: 18/18, 0/18

STAGE II, III, IV: 32/32, 0/32

WOMEN WITHOUT OVARIAN CANCER (classified as no cancer / classified otherwise)

NO OVARIAN CYST: 22/24, 2/24

BENIGN OVARIAN CYST <2cm: 18/19, 1/19

BENIGN OVARIAN CYST >2cm: 6/6, 0/6

BENIGN GYNAECOLOGICAL DISEASE: 9/10, 1/10

NO GYNAECOLOGICAL DISORDER: 7/7, 0/7

Step 1: Discovery

Training data set (Disease, Normal)

Pattern discovery: Profile 1, Profile 2

Use biomarker pattern for step 2.

Step 2: Evaluation

Test data set (Disease, Normal)

Cluster analysis against Profile 1 and Profile 2

Determination of:

• Sensitivity

• Specificity

• Positive predictive value

• Negative predictive value
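The four quantities determined in the evaluation step follow directly from a 2x2 confusion matrix. The sketch below uses the standard definitions; the counts in the example are hypothetical and are not taken from any of the studies cited in these slides.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard diagnostic performance measures.

    tp: diseased samples classified as diseased
    fp: healthy samples classified as diseased
    tn: healthy samples classified as healthy
    fn: diseased samples classified as healthy
    """
    return {
        "sensitivity": tp / (tp + fn),  # fraction of diseased detected
        "specificity": tn / (tn + fp),  # fraction of healthy recognised
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


if __name__ == "__main__":
    # Hypothetical test set: 50 diseased and 100 healthy samples.
    for name, value in diagnostic_metrics(tp=45, fp=10, tn=90, fn=5).items():
        print(f"{name:12s} {value:.3f}")
```

Unlike sensitivity and specificity, the predictive values depend on the proportion of diseased samples in the test set, so they do not transfer directly to a screening population with a much lower disease prevalence.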

Step 3: Class prediction

Unknown data set

Cluster analysis against Profile 1 and Profile 2

Disease or Normal

The problem of statistical analysis: