
Protein sequence databases. Petri Törönen. Shamelessly copied from material done by Eija Korpelainen. This also includes old material from my thesis toronen/Gradu_verkkoon.zip


Post on 18-Jan-2018



TRANSCRIPT

Protein sequence databases
Petri Törönen
Shamelessly copied from material done by Eija Korpelainen. This also includes old material from my thesis and from CSC bio-opas.

Why protein sequences?
- Most (laboratory) analysis is done with nucleotide sequences, so analysis at the nucleotide level is natural.
- But there are drawbacks:
  - divergence in codons => same protein, different nucleotide sequence!
  - similarity between different amino acids
- Therefore not all of the similarity is visible at the nucleotide level!

more...
- Protein databases often also include more detailed information.
- The protein (not the RNA) is often the actual functional unit that has a biological function.
  - Note the exceptions, such as structural RNAs.

Protein databases
- SwissProt
- TrEMBL
- PIR-PSD
- SwissProt and TrEMBL (Translated EMBL) have been unified to UniProt. THIS INFO IN PART ERRONEOUS! SwissProt is still also available as a separate entity.

Differences between databases
- Some include all the available information (more or less reliable):
  - large coverage, everything is stored in the database
  - low reliability, information has not been confirmed
  - computer annotation => updating is fast
- Some cover only the reliable information:
  - small coverage
  - information is reliable
  - expert curation => updating is slow
- SwissProt, TrEMBL, RemTrEMBL

Why is SwissProt nice?
- Sequences are manually annotated and checked.
- No multiple entries for the same sequence.
- Annotations include protein function, post-translational modifications, active sites etc.
- Linked to many other databases.

So how to search protein sequences from available databases?
- Search with a protein name
- Search with a protein's function / descriptive words
- Search with a protein/RNA sequence
The next slides handle the first two options.

Ways to access SwissProt/UniProt
- ExPASy server for UniProt
- Note that the page includes links to full text search and to advanced search.
- Power Search to the UniProt database
- One of the SRS servers available in WWW

SRS: Sequence Retrieval System
- Allows search from several databases, not limited to SwissProt!
- AND, OR, BUTNOT type boolean operations can be used in the search (useful with keywords) => works with a sequence name and with complex keyword queries.
- Obtained results can be further processed:
  - linking to a new set of databases
  - includes sequence analysis, sequence alignment

SRS search, step by step
- Select "start a temporary project".
- Select database(s). Here I select SwissProt.
- Note that other databases can also be searched with SRS! Available databases vary between the different SRS servers.
- Insert the query for looking up the sequence. Here I search with the sequence name (csk_mouse).
- The search goes through all the text fields (AllText) in the SwissProt files.
- These are the available fields that can be searched with the search term.
- Obtained result: available information on the sequence. More information from here.

The obtained result demonstrates the detailed information available from SwissProt
- Note that the stored information includes information on:
  - the organism
  - gene name, gene description
  - links to the articles discussing the seq.
- The comments part has a detailed description of:
  - function
  - tissue localization
- The features part has a detailed description of:
  - domains
  - various functional components

SRS search with boolean operators (AND, OR, BUTNOT)
- Queries can be combined with & (= AND), | (= OR), ! (= BUTNOT).
- Different rows are also combined (by default) with AND.
- The example looks for proteins whose organism name is either mouse OR rat. In addition, the description field must include the words receptor AND kinase BUTNOT tyrosine.
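The logic of the example query above can be sketched in a few lines of Python. This is only an illustration of how the boolean operators combine; it is not SRS syntax and does not call any SRS server. The record list, the field names (`organism`, `description`) and the `TOY` identifiers are hypothetical, made up for the sketch.

```python
def words(text):
    """Split a free-text field into a set of lowercase words."""
    return set(text.lower().split())


def matches(record):
    """(organism is mouse OR rat) AND (receptor AND kinase BUTNOT tyrosine)."""
    organism_ok = record["organism"].lower() in {"mouse", "rat"}  # OR
    desc = words(record["description"])
    has_both = {"receptor", "kinase"} <= desc                      # AND
    no_tyrosine = "tyrosine" not in desc                           # BUTNOT
    # Separate query rows are combined with AND by default.
    return organism_ok and has_both and no_tyrosine


# Hypothetical toy records, not real database entries.
RECORDS = [
    {"id": "TOY1", "organism": "mouse", "description": "serine threonine receptor kinase"},
    {"id": "TOY2", "organism": "rat", "description": "receptor tyrosine kinase"},
    {"id": "TOY3", "organism": "human", "description": "receptor kinase"},
    {"id": "TOY4", "organism": "rat", "description": "kinase inhibitor"},
]

hits = [r["id"] for r in RECORDS if matches(r)]
print(hits)
```

Only the first toy record passes: the second is excluded by BUTNOT tyrosine, the third by the organism row, and the fourth because it lacks the word receptor.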
Further linking to other databases
- We can link the obtained results with other databases by going further from this link.
- Go to the results of the previous search.

Selection of sequences that have a known 3D structure
1. The subfolder with protein databases is opened by selecting "protein function, structure and interactions databases".
2. The box next to the PDB database is selected with the mouse.
3. Here we select filtering of the obtained results to the ones that have a link to a 3D structure.

Summary
- Protein databases show detailed information on protein sequences.
- UniProt/SwissProt is the recommended protein database:
  - manually curated
  - non-overlapping
- SRS is a method for searching information from selected databases with search terms.
- Word of warning: sometimes SRS does not work as nicely as hoped!

Search of the protein databases with sequences
- So what can be done if we have a sequence that we know nothing about?
- We can look for similar known proteins in databases.
- This can be done directly with protein sequences.
- (Database searching is probably handled in more detail later. Sorry for the wrong order!)

Nucleotide to amino acids
If you have produced a nucleotide seq. in the laboratory, you might still want to compare it to protein sequences for the previous reasons (slide n. 3). You'll have two options:
1. Use tools (like BLASTX, FastX) that automatically compare the nucleotide seq. to amino acid databases.
   - These can search sequence similarities going from one reading frame to another.
   - => Simple. You don't have to worry about translating the sequence (see below).
   - BLASTX and FastX are explained in more detail later.
2. Translate the seq. using available tools (for example) http://www.ebi.ac.uk/emboss/transeq/
   - required with tools that accept only protein sequence
   - remember that you do not know the reading frame! The correct reading frame can move from one frame to another (sequencing errors like addition or deletion of nucleotides)!!

Automatic tools comparing nucl. seq. with protein database
BLASTX
- looks for the most similar protein sequences for your nucleotide sequence by comparing all possible reading frames
- member of the BLAST program family

BLASTX search, step by step
- For nucleotide sequences BLASTX can be obtained here.
- If you do a query with a protein sequence then use this.
- SEQUENCE: >embl|AB029485|AB Mus musculus ARIP1 mRNA for activin receptor interacting protein
- The protein database (SwissProt) can be selected here.
- You can find the seq. from Google with AB029485.
- The next window is opened here.
- This web page is shown while waiting for the results.
- The colour figure presents where the match to the database was in our query sequence. The colour presents the goodness of the score.
- The E-value tells how many hits with a score at least this good can be expected by chance.
- The alignment can be viewed from this link. The alignment enables manual evaluation of the result.
- This is the link to the database that we searched, giving the full information on the sequence.

Changing the nucleotides to amino acids
- Transeq requires you to paste the nucleotide sequence, to select the reading frame (1, 2 or 3) and to select forward or reverse direction.
- An example sequence obtained with randomly typed g,a,c,t: DQLTCQSTVSAGLAWLAG MA
- The obtained sequences from different reading frames can be used to search protein databases...

Motif databases
- Motifs are conserved areas in functionally similar proteins.
- These are crucial parts for protein function: the protein cannot change them without changing the function.
- Analysis of sequences with motifs can be more efficient when no close sequence relatives are found.
  - recommended when a normal sequence search gives no results

What is a motif?
- modified from Terri Attwood, 2002; modified from Eija Korpelainen...
- Areas with strong conservation between aligned sequences

Motif databases
- BLOCKS
- PROSITE
- more...
- The subgroup "Pattern and profile searches" shows the list of protein motif analysis tools.

InterPro
- Combines many motif databases in one search.
- Can take a DNA or protein sequence.
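To make the idea of a motif search concrete, here is a minimal sketch of scanning a protein sequence with a PROSITE-style pattern. It is not how PROSITE or InterPro are implemented; it only converts the basic pattern syntax (`x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` for repeats) into a Python regular expression. The function names and the toy sequence are assumptions made for the sketch. The pattern used is the well-known N-glycosylation site pattern N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern to a regular expression.

    Simplified: anchors written inside square brackets are not handled.
    """
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("x", ".")                       # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")    # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")     # x(2,4) = 2 to 4 residues
    regex = regex.replace("<", "^").replace(">", "$")     # sequence start / end
    return regex


def find_motif(sequence, pattern):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    regex = prosite_to_regex(pattern)
    # A lookahead lets overlapping occurrences be reported as well.
    return [(m.start() + 1, m.group(1))
            for m in re.finditer(f"(?=({regex}))", sequence.upper())]


# Hypothetical toy sequence, typed by hand for illustration only.
toy_sequence = "MNGSAANPTA"
print(find_motif(toy_sequence, "N-{P}-[ST]-{P}"))
```

In the toy sequence the first N is followed by G, S, A and therefore matches, while the second N is followed by a proline, which the pattern forbids.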
Fragment of the BLASTX test sequence
- Kinase associated motifs
- PDZ domains: important for protein interactions
- WW domains: important for binding proteins
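For tools that accept only protein sequences, a nucleotide fragment like the test sequence has to be translated first, and the reading frame is not known in advance. The following is a minimal sketch of translating a sequence in all six reading frames with the standard genetic code. It is not the Transeq implementation; the function names and the short input sequences are assumptions made for the example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate from offset `frame` (0, 1 or 2); unknown codons become X."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Translate three forward (+1..+3) and three reverse (-1..-3) frames."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(rc, offset)
    return frames


# Two different nucleotide sequences, same protein (divergence in codons).
print(translate("ATGGCC"), translate("ATGGCA"))
for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

The first print line also illustrates the point from the beginning of the slides: GCC and GCA both code for alanine, so the two nucleotide sequences differ while the protein is identical.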