note - gep community...
GK/NYLMU
General Annotation Protocol Noncoding exons
1.) Open Mozilla Firefox → open the following in tabs:
http://goose.wustl.edu/
http://flybase.org/
http://www.ensembl.org/index.html
http://www.ncbi.nlm.nih.gov/blast/
http://www.dnalc.org/bioinformatics/ (select “translator”)
Optional: http://align.genome.jp/ (used for Clustal analysis)
2.) Open the UCSC browser at http://goose.wustl.edu/
a.) In goose, click on “Genome Browser”.
b.) Choose D. erecta from the genome drop-down list.
c.) Type the number of the desired contig into the box labeled “position”, ex.) contig7:150,000
d.) Click on one of the brown boxes in the Genscan gene line; this brings up the contig numbers next to the boxes, ex.) contig7.1
e.) Click on the desired contig (ex. contig7.1).
f.) Click “predicted protein” → copy this sequence.
3.) Go to http://www.flybase.org/ and select BLAST under Tools → paste this sequence into the box. → Remember to change the Blast Database (to Annotated proteins) and the Program (to blastp AA>AA). Click BLAST.
4.) Choose the best result, the one with the lowest E value (the top match).
Here the top match is the first in the list, with an E value of 7e-80 → copy the gene name, ex.) FBpp0088281
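If the hit list is copied into a script rather than read off the page, picking the best result comes down to taking the minimum E value. The sketch below assumes a hand-typed table; only FBpp0088281 and its E value of 7e-80 come from this protocol, and the other rows and the function name best_hit are made up for illustration.

```python
# Hypothetical BLAST hit table: (hit ID, E value).
# Only FBpp0088281 / 7e-80 is taken from the protocol; the rest is invented.
hits = [
    ("FBpp0088281", 7e-80),
    ("hypothetical_hit_2", 3e-41),
    ("hypothetical_hit_3", 2e-05),
]


def best_hit(hits):
    """Return the (ID, E value) pair with the lowest E value."""
    return min(hits, key=lambda hit: hit[1])


print(best_hit(hits)[0])
```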
6.) Go to http://www.ensembl.org/ → choose the Drosophila melanogaster database. Then paste the gene name into the search box. Click Go.
→ Your result should look similar to the figure below:
→ Click on the link FlyBase protein_coding Gene: CG18069 (FlyBaseName gene: CaMKII)
8.) Either one or multiple isoforms (splice variants) will appear.
→ Click on [Peptide info] to get the D. melanogaster coding region sequence, the exons. Note: there are 6 splice variants for this gene.
The exons are distinguished by alternating black and blue text (shown below). Red amino acids indicate a split codon:
Drosophila melanogaster ortholog
MAAPAACTRFSDNYDIKEELGKGAFSIVKRCVQKSTGFEFAAKIINTKKLTARDFQKLER
EARICRKLHHPNIVRLHDSIQEENYHYLVFDLVTGGELFEDIVAREFYSEADASHCIQQI
LESVNHCHQNGVVHRDLKPENLLLASKAKGAAVKLADFGLAIEVQGDHQAWFGFAGTPGY
LSPEVLKKEPYGKSVDIWACGVILYILLVGYPPFWDEDQHRLYSQIKAGAYDYPSPEWDT
VTPEAKNLINQMLTVNPNKRITAAEALKHPWICQRERVASVVHRQETVDCLKKFNARRKL
KGAILTTMLATRNFSSRSMITKKGEGSQVKESTDSSSTTLEDDDIKAARRQEIIKITEQL
IEAINSGDFDGYTKICDPHLTAFEPEALGNLVEGIDFHKFYFENVLGKNCKAINTTILNP
HVHLLGEEAACIAYVRLTQYIDKQGHAHTHQSEETRVWHKRDNKWQNVHFHRSASAKISG
ATTFDFIPQK
→ Copy the first exon: MAAPAACTRFSDNYDIKEELGK
Note: include the split codons in each exon:
exon1 MAAPAACTRFSDNYDIKEELGK
exon2 KGAFSIVKRCVQKSTGFEFAAKIINTKKLTARD
exon3 DFQKLEREARICRKLHHPNIV
NB: Some Mac OS MS Word systems drop the font color; paste each exon separately to keep the exons distinct.
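Because each exon is copied with its split-codon residue included, the residue at a junction appears twice: once at the end of one exon and once at the start of the next. A quick way to confirm the exons were copied correctly is to join them, keeping each junction residue once, and compare the result with the start of the full protein. This is only a sketch; the function name join_exon_peptides is made up here, and it assumes every junction is a split codon, which holds for the three exons listed above.

```python
def join_exon_peptides(exons):
    """Join exon peptides that each include their split-codon residue.

    Assumes every junction is a split codon, so the last residue of one
    exon repeats as the first residue of the next and is kept only once.
    """
    joined = exons[0]
    for previous, current in zip(exons, exons[1:]):
        if previous[-1] != current[0]:
            raise ValueError("junction residue does not match")
        joined += current[1:]
    return joined


exons = [
    "MAAPAACTRFSDNYDIKEELGK",
    "KGAFSIVKRCVQKSTGFEFAAKIINTKKLTARD",
    "DFQKLEREARICRKLHHPNIV",
]
# First 74 residues of the D. melanogaster protein shown above.
protein_start = (
    "MAAPAACTRFSDNYDIKEELGKGAFSIVKRCVQKSTGFEFAAKIINTKKLTARDFQKLER"
    "EARICRKLHHPNIV"
)
print(join_exon_peptides(exons) == protein_start)
```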
9.) Go to http://www.ncbi.nlm.nih.gov/BLAST/. Choose Align 2 sequences (bl2seq) under the Special header.
10.) Paste your exon into the first box. Select the tblastn program, because you are aligning an amino acid sequence against a nucleotide sequence. Uncheck the filter box. Now get the entire D. erecta fosmid nucleotide sequence by going back to goose: click the back button until you are at the screen with the D. erecta map.
Make sure you have the correct fosmid with the entire DNA sequence in the request box, e.g. “contig7:150,000”, and hit return to “load sequence”. Select DNA in the top blue bar (make sure the position box begins with 1 and that the reverse complement box is not checked). Select all and copy this entire sequence.
Paste it into the second box of the bl2seq form.
[Figure: bl2seq form. Callouts: make sure the filter is not checked; Program: set to tblastn; first box: Dm orthologue exon; second box: De entire contig sequence; Expected value (see below).]
→ Click Align. → The quality of the resulting alignment will vary. (Note: if no significant similarity is found, try increasing the expected value from 10 to 100, 1000 or higher as needed to get a similar sequence.) → This will give you the predicted start and stop sites of your first exon as well as the frame, either + or -. Copy the result by highlighting the text and paste it into your documentation.
Ex.) Exon1
Score = 49.3 bits (116), Expect = 2e-04
Identities = 22/22 (100%), Positives = 22/22 (100%), Gaps = 0/22 (0%)
Frame = +3
Query 1     MAAPAACTRFSDNYDIKEELGK 22
            MAAPAACTRFSDNYDIKEELGK
Sbjct 5310  MAAPAACTRFSDNYDIKEELGK 5375
→ From this, note the frame, the identity, and whether the entire query was found.
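The frame and the aligned length can be cross-checked against the subject coordinates. The sketch below assumes forward-strand hits with 1-based coordinates counted from the first base of the contig; the function names are made up for illustration. For the hits in this protocol it reproduces the reported frames (5310 gives +3, 5434 gives +1, 19650 gives +3).

```python
def tblastn_frame(sbjct_start, sbjct_end):
    """Forward reading frame (1, 2 or 3) implied by 1-based subject coordinates.

    Reverse-strand hits (start > end) are rejected here because their
    frame also depends on the length of the subject sequence.
    """
    if sbjct_start > sbjct_end:
        raise ValueError("reverse-strand hit: subject length is needed")
    return (sbjct_start - 1) % 3 + 1


def aligned_residues(sbjct_start, sbjct_end):
    """Number of amino acids covered by a forward-strand subject range."""
    length = sbjct_end - sbjct_start + 1
    if length % 3 != 0:
        raise ValueError("subject range is not a whole number of codons")
    return length // 3


# Exon 1 hit from the example above: Sbjct 5310 to 5375, query length 22.
print(tblastn_frame(5310, 5375), aligned_residues(5310, 5375) == 22)
```

If aligned_residues is smaller than the query length, part of the exon was not found and the alignment ends need to be extended by hand in the browser.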
11.) Go back to goose, where you can see your entire contig. Change base position to full and predicted splice sites to dense.
→ Click refresh to apply these changes.
12.) Enter the coordinates from BLAST above and click jump: ex. 5310-5375
→ (Note: if you have an exon in a negative frame, make sure to change the direction of the reading frame by clicking on the arrow under “Base Position” and changing it from > to <.)
Base position: full
Predicted splice sites: dense
Click on the top area where the numbers are located to zoom in on your predicted start site, which almost always starts with a methionine.
Ex.) Zoomed in region
This is exon1 and here we see the start codon methionine at 5310 in frame +3 followed by AAP (it’s always good to verify that you are in frame before proceeding) which is also found in our BLAST search above.
13.) Look for donor sites (GT) at the end of exons. Note that GC may be used as a splice site, but this is very rare (<1%).
→ At the end of exon 1 there is a GT donor site at 5376, but the exon coding sequence ends at 5374. This is considered to be a split codon. It is in phase 2 because the exon is in reading frame 3, ending with LGK. The GT is not included in the coding region, which leaves 2 nucleotides of the last codon in exon 1 and the 3rd in exon 2. Note: phase is the number of nucleotides of a split codon left inside the coding exon, excluding GT or AG. The phases on either side of an intron must always add up to 3 (or both be 0 when no codon is split),
ex.) the end of exon one is 2 and the beginning of exon two is 1, which adds up to three. It is very important to track the phase to ensure that your gene model has in-frame exons.
The start and stop coordinates for exon 1 are: 5310-5374(2); the number in parentheses indicates the phase.
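Phase bookkeeping is easy to get wrong by hand, so it can help to recompute it from the exon coordinates. The sketch below uses the definition given in this protocol (phase is the number of split-codon nucleotides left inside the exon) and assumes plus-strand exons in order, with the first exon beginning at the first base of the start codon. The function name track_phases is made up for illustration.

```python
def track_phases(exons):
    """Return a (start_phase, end_phase) pair for each coding exon.

    `exons` holds 1-based inclusive (start, end) coordinates in order on
    the + strand; the first exon must begin at the start codon.
    """
    phases = []
    start_phase = 0
    for start, end in exons:
        length = end - start + 1
        # Bases left over after the carried-in bases and whole codons.
        end_phase = (length - start_phase) % 3
        phases.append((start_phase, end_phase))
        # The next exon supplies the rest of the split codon, if any.
        start_phase = (3 - end_phase) % 3
    return phases


# Exon 1 and exon 2 coordinates from this protocol.
print(track_phases([(5310, 5374), (5436, 5530)]))  # [(0, 2), (1, 1)]
```

The result agrees with the coordinates recorded in this protocol: 5310-5374(2) for exon 1, and 5436(1) to 5530(1) for exon 2.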
14.) Now look for exon 2 by repeating steps 8 and 9.
Exon2 used for search in BLAST: KGAFSIVKRCVQKSTGFEFAAKIINTKKLTARD
Score = 64.3 bits (155), Expect = 8e-09
Identities = 31/32 (96%), Positives = 32/32 (100%), Gaps = 0/32 (0%)
Frame = +1
Query 1     KGAFSIVKRCVQKSTGFEFAAKIINTKKLTAR 32
            +GAFSIVKRCVQKSTGFEFAAKIINTKKLTAR
Sbjct 5434  RGAFSIVKRCVQKSTGFEFAAKIINTKKLTAR 5529
→ This is the beginning of exon 2; look for an AG acceptor site at the beginning of exons. Note that AC may be used as a splice site, but this is rare.
Ex2 AG acceptor at 5434, exon 2 starts at 5436(1).
Ex2 GT donor at 5532, exon 2 ends at 5530(1).
15.) Repeat this for each exon.
Last coding exon
16.) End of coding region is exon 13 in 7.1 variant A.
Exon13
Score = 107 bits (266), Expect = 1e-21
Identities = 47/48 (97%), Positives = 48/48 (100%), Gaps = 0/48 (0%)
Frame = +3
Query 1      KQGHAHTHQSEETRVWHKRDNKWQNVHFHRSASAKISGATTFDFIPQK 48
             +QGHAHTHQSEETRVWHKRDNKWQNVHFHRSASAKISGATTFDFIPQK
Sbjct 19650  RQGHAHTHQSEETRVWHKRDNKWQNVHFHRSASAKISGATTFDFIPQK 19793
Ex13 AG acceptor at 19650(1) → note that the end coordinate for exon 12 is 19594(2).
Exon13 ends at 19793. The stop codon is not included in the coordinates.
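Since the stop codon is left out of the coordinates, the three bases right after the last coding base should read TAA, TAG or TGA. A minimal check, assuming the contig DNA is available as a plus-strand string and using a toy sequence rather than real data (the function name is made up):

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_codon_follows(seq, last_exon_end):
    """True if the 3 bases after the 1-based last_exon_end form a stop codon."""
    return seq[last_exon_end:last_exon_end + 3].upper() in STOP_CODONS


# Toy sequence (not real data): ATG CCA CAG AAA | TAA | GGC
toy_cds = "ATGCCACAGAAATAAGGC"
print(stop_codon_follows(toy_cds, 12))
```

For the gene model in this protocol, the equivalent check would be applied to the contig sequence with last_exon_end set to 19793.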
Please direct any questions or concerns to: Dr. Gary Kuleck (gkuleck@lmu.edu), 310-338-7496, or Nicole Yu ([email protected])