PROMPT Protein Mapping and Comparison Tool By Thorsten Schmidt and Dmitrij Frishman Free for academic. Website http://webclu.bio.wzw.tum.de/prompt/ (Binary + Source)

Post on 18-Dec-2015

PROMPT PROMPT PROMPT PROMPT PROMPT PROMPT PROMPT PROMPT PROMPT PROMPT PROMPT PROMPT PROMPT PROMPT PROMPT PROMPT PROMPT

Motivation

Past:

Sparse data available

single pairwise comparison

Present + Future:

High-throughput technologies

weighing large protein datasets against each other

Differences between individuals

Differences between populations

Hundreds of questions:

• Do Germans drive faster than Americans?

• Is one gene group significantly enriched in certain functional categories?

• Do GroEL-dependent proteins prefer certain structural folds?

Input

FASTA   x x  

GenBank   x x x

EMBL   x x  

Swiss-Prot x x x x

UniProt XML x x x x

Generic XML   x x  

Generic XML input allows any numeric or nominal data to be imported

Folder with multiple files

File with single (protein) entry

File with multiple (protein) entries

List of identifiers

Analyse annotations

Additionally

[Figure: PROMPT architecture]

Input: Protein set A and Protein set B (SwissProt, EMBL, GenBank, PEDANT, SIMAP, FASTA, XML)

Input layer: user input, parsing, retrieval, caching; yields Dataset A and Dataset B

Processing layer: mapping, comparison, statistical testing

Presentation layer: results, figure plotting, view within PROMPT, export, spreadsheet, import

Statistical tests

Help is available for each test and its parameters.

Although you can apply any test manually, in most cases appropriate tests are performed automatically.

Built-in help

Case study: SCOP fold comparison of GroEL-dependent substrates vs. lysate

Background: Around 200 proteins in E. coli depend on the GroEL chaperone for folding.

Question: What distinguishes the GroEL-dependent proteins?

Data: PEDANT genome from clu1.gsf.de, E. coli K12 (updated version)

Assignment threshold 1E-4 for SCOP folds

Symbolic Frequency Comparison (Symbolic), (Symbolic)

Fraction relative to the number of proteins with annotations in each set

P-value: * < 0.05, ** < 0.001, *** < 0.0001

Note (user): FunCat categories: 16 Protein with binding function; 30 Cellular communication; 01 Metabolism; 20 Cellular transport; 32 Cell rescue..; 12 Protein synthesis
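
A minimal sketch of how one category could be compared between two sets and marked with the star notation from this slide. The star thresholds come from the slide; the choice of a 2x2 chi-square test (1 degree of freedom, no continuity correction) and all function names are assumptions, since the slides do not state which test PROMPT applies here. Totals are the numbers of proteins with annotations in each set.

```python
import math


def chi_square_2x2(a_with, a_total, b_with, b_total):
    """Chi-square test for one category in two protein sets.

    a_with / b_with: annotated proteins carrying the category.
    a_total / b_total: all annotated proteins in each set.
    Assumed test choice; not taken from PROMPT itself.
    """
    table = [
        [a_with, a_total - a_with],
        [b_with, b_total - b_with],
    ]
    n = a_total + b_total
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    if 0 in row_sums or 0 in col_sums:
        # Degenerate table: no evidence of a difference.
        return 0.0, 1.0
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def stars(p_value):
    """Map a p-value to the star notation used on the slide."""
    if p_value < 0.0001:
        return "***"
    if p_value < 0.001:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


chi2, p = chi_square_2x2(30, 100, 10, 100)
print(round(chi2, 2), stars(p))
```

With 30 of 100 annotated proteins in set A against 10 of 100 in set B, the statistic is 12.5 and the p-value is about 0.0004, which maps to `**`.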

Case study: Comparison of pI distributions

Question: Do the proteins of E. coli and H. pylori differ with respect to their isoelectric points?

Data: Protein sequences of H. pylori and E. coli

The pI is calculated by PROMPT automatically (as are many other sequence-based properties)
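
A minimal sketch of a sequence-based pI calculation: the net charge is summed over all ionisable groups and the pH at which it crosses zero is found by bisection. The pK values below are illustrative and the function names are hypothetical; the slides do not say which pK table or algorithm PROMPT uses.

```python
# Illustrative pK values (assumed; not taken from PROMPT).
PK_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PK_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PK_N_TERM = 8.6
PK_C_TERM = 3.6


def net_charge(sequence, ph):
    """Net charge of a protein sequence at a given pH."""
    # Free N-terminus contributes positive charge, C-terminus negative.
    charge = 1.0 / (1.0 + 10 ** (ph - PK_N_TERM))
    charge -= 1.0 / (1.0 + 10 ** (PK_C_TERM - ph))
    for residue in sequence.upper():
        if residue in PK_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PK_POSITIVE[residue]))
        elif residue in PK_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PK_NEGATIVE[residue] - ph))
    return charge


def isoelectric_point(sequence, tolerance=0.001):
    """Find the pH with zero net charge by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positive: pI lies at higher pH
        else:
            high = mid  # negative or zero: pI lies at lower pH
    return round((low + high) / 2.0, 2)


print(isoelectric_point("MKKLLDEAR"))
```

Bisection works here because the net charge decreases monotonically with pH, so there is exactly one zero crossing.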

Numeric Distribution Comparison (Numeric), (Numeric)

Statistical tests:

• Kolmogorov-Smirnov test

• Mann-Whitney test

• Chi-square test
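
A minimal sketch of the first test in this list, the two-sample Kolmogorov-Smirnov test, applied to two numeric samples such as pI values. The statistic is the largest gap between the two empirical distribution functions; the p-value uses the standard asymptotic series, which is only approximate for small samples. The function name is hypothetical and this is not PROMPT's implementation.

```python
import math


def ks_two_sample(sample_a, sample_b):
    """Return (D, approximate p-value) for two numeric samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    d = 0.0
    # Walk through the pooled values and track the ECDF difference.
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    lam = d * math.sqrt(n * m / (n + m))
    if lam == 0.0:
        return d, 1.0
    # Asymptotic Kolmogorov series: 2 * sum (-1)^(k-1) exp(-2 k^2 lam^2).
    total = 0.0
    for k in range(1, 100001):
        term = math.exp(-2.0 * k * k * lam * lam)
        total += term if k % 2 == 1 else -term
        if term < 1e-12:
            break
    p_value = min(1.0, max(0.0, 2.0 * total))
    return d, p_value


d, p = ks_two_sample([5.1, 6.3, 9.8, 10.2], [4.9, 5.5, 6.0, 6.4])
print(round(d, 2), round(p, 3))
```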

Case study: Protein length and hydrophobicity

Question: Is there any relationship between protein length and hydrophobicity in membrane proteins?

Data: 2 multi-FASTA files with amino acid sequences

membrane.fasta contains all membrane* proteins of E. coli

fullgenome.fasta contains all proteins of E. coli

*) all proteins with more than 6 membrane-spanning regions predicted by TMHMM 2.0

The GRAVY (grand average of hydropathy) value and many other computable properties are calculated from the sequence by PROMPT automatically
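
A minimal sketch of how length and GRAVY could be derived from a multi-FASTA file, assuming the standard Kyte-Doolittle hydropathy scale. The parser and function names are hypothetical and do not reflect PROMPT's own API.

```python
# Kyte-Doolittle hydropathy values per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def read_fasta(lines):
    """Yield (identifier, sequence) pairs from multi-FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gravy(sequence):
    """Mean hydropathy over all standard residues of the sequence."""
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


example = ">p1 hypothetical protein\nMKTLLVA\nILA\n>p2\nDEKR\n"
for identifier, seq in read_fasta(example.splitlines()):
    print(identifier, len(seq), round(gravy(seq), 2))
```

Positive GRAVY values indicate a hydrophobic protein on average, negative values a hydrophilic one.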

Numeric Correlation (Numeric x Numeric)

[Figure: two scatter plots of average hydrophobicity (GRAVY value) against protein length. A. All E. coli proteins; B. Membrane proteins only. X-axes: protein length. Y-axes: hydrophobicity (GRAVY value).]

New research result: the longer membrane proteins are, the less hydrophobic they are

[ Pearson coefficient -0.69; p-value 2.8E-54 ]
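
A minimal sketch of a numeric correlation between two properties of the same proteins, such as length and GRAVY. The Pearson coefficient is computed directly; the p-value uses the Fisher z-transformation with a normal approximation, which is an assumption made here for a dependency-free example and not necessarily how PROMPT derives its p-value. The function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Return (r, approximate two-sided p-value) for paired values."""
    n = len(xs)
    if n != len(ys) or n < 4:
        raise ValueError("need at least 4 paired values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    # Clamp to guard against rounding slightly outside [-1, 1].
    r = max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))
    if abs(r) == 1.0:
        return r, 0.0
    # Fisher z-transformation: atanh(r) * sqrt(n - 3) is roughly N(0, 1).
    z = math.atanh(r) * math.sqrt(n - 3)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return r, p_value


lengths = [120, 250, 400, 610, 880]
gravy_values = [0.9, 0.7, 0.75, 0.4, 0.3]
r, p = pearson(lengths, gravy_values)
print(round(r, 2), p)
```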

[Figure: data import and mapping workflow]

User input: Protein set A and Protein set B (SwissProt, EMBL, GenBank, PEDANT, SIMAP, FASTA, XML), each given either as IDs + sequences or as IDs only

PROMPT: for ID-only input, sequences are retrieved automatically (web-services, DB query); A and B, now both IDs + sequences, are compared by BLAST to find equivalent sequences

Results: mapped identifiers

Set A | Set B
ID1   | ID3
ID2   | No equivalent
ID3   | ID5

Data Import and Mapping

BLAST parameter dialog

View Mapping Results

Mapping filtering

Choose correct assignments in 2 ways:

• Manually, e.g. using expert knowledge

• Automatic filter with user-specified parameters, e.g.

Select SUBJECT_ID where IDENTITY>99 and MISMATCHES<5

Manual further processing, e.g. save GIs to a text file

Generic XML file: a symbolic property holds the mapping information

VFDB1 <-> GI_1234

VFDB3 <-> GI_3456

…
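
A minimal sketch of what the automatic filter above expresses, written as a plain filter over BLAST hit records. The field names SUBJECT_ID, IDENTITY and MISMATCHES come from the slide; QUERY_ID, the function name, the VFDB2 row and all numeric values are made up for illustration.

```python
# Hypothetical BLAST hits; the values are invented for this example.
hits = [
    {"QUERY_ID": "VFDB1", "SUBJECT_ID": "GI_1234", "IDENTITY": 99.8, "MISMATCHES": 1},
    {"QUERY_ID": "VFDB2", "SUBJECT_ID": "GI_9999", "IDENTITY": 87.0, "MISMATCHES": 40},
    {"QUERY_ID": "VFDB3", "SUBJECT_ID": "GI_3456", "IDENTITY": 100.0, "MISMATCHES": 0},
]


def filter_hits(records, min_identity=99.0, max_mismatches=5):
    """Keep hits with IDENTITY > min_identity and MISMATCHES < max_mismatches."""
    return [
        hit for hit in records
        if hit["IDENTITY"] > min_identity and hit["MISMATCHES"] < max_mismatches
    ]


accepted = filter_hits(hits)
subject_ids = [hit["SUBJECT_ID"] for hit in accepted]
mapping = {hit["QUERY_ID"]: hit["SUBJECT_ID"] for hit in accepted}

for query_id, subject_id in mapping.items():
    print(query_id, "<->", subject_id)
```

The resulting identifier pairs are the kind of mapping information that the generic XML file stores as a symbolic property.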

Case studies summary

Example | Type of data used | PROMPT method
FunCat distribution in human (*) | (Symbolic) | Symbolic feature frequencies
SCOP fold enrichment of GroEL-dependent substrates | (Symbolic), (Symbolic) | Symbolic feature comparison of two sets
Fold bias of virulence factor proteins (*) | (Symbolic) subset of (Symbolic) | Symbolic feature enrichment in subset vs. set
pI comparison of H. pylori and E. coli | (Numeric), (Numeric) | Numeric feature comparison
Protein length and hydrophobicity | (Numeric x Numeric) | Numeric feature correlation
Essentiality and protein abundance (*) | (Symbolic x Numeric) | Numeric distribution within categories

Note: x means corresponding data pairs, e.g. here describing two values of the same protein

(*) not shown in this talk

As the generic XML input allows the processing of any kind of nominal or numeric data, PROMPT can be applied to nearly any problem domain

Scripting

Scripting ways:

• Interactive console

• Stream (e.g. from pipeline)

• File

Scripting commands:

• BeanShell = simplified Java

• Or full Java code

Advantages:

• Run Java code directly

• No compilation necessary

• All PROMPT classes are available from the scripts

• "Classpath hell" was yesterday

Just call: ./prompt.sh Filename.java

Conclusions

PROMPT can map, compare and analyse protein sets

• Easy to use interactively

• Large-scale batch processing

• Automatic or manual testing for significance

• Helps to avoid reinventing the wheel

• Graphical visualisations that highlight results

• Generic application, even beyond bioinformatics

Dig our data gold mine efficiently

Acknowledgements

Dmitrij Frishman

Hans-Werner Mewes

All MIPSies and members of the chair (Lehrstuhl) for valuable discussions

http://webclu.bio.wzw.tum.de/prompt/