Annotating nc-RNAs with Rfam

Luca Cozzuto @ Bioinformatics Core http://rfam.sanger.ac.uk/

Upload: luca-cozzuto

Post on 11-May-2015

DESCRIPTION

Rfam is an open access database (hosted at the Wellcome Trust Sanger Institute) containing information on RNA families and annotations for millions of RNA genes. Designed to work in a similar way to the Pfam database of protein families, Rfam uses a similar model for annotation and display and is built on the same principle of open access to the data. Each entry in the Rfam database includes multiple sequence alignments, a secondary structure and probabilistic models known as covariance models (CMs), which can simultaneously handle an RNA sequence and its structure. In conjunction with the Infernal software package, Rfam CMs can be used to search genomes or other DNA sequence databases for homologs of known structural RNA families. You can find out more about Rfam at http://rfam.janelia.org/

TRANSCRIPT

Page 1: Annotating nc-RNAs with Rfam

Luca Cozzuto @ Bioinformatics Core

http://rfam.sanger.ac.uk/

Page 2: Annotating nc-RNAs with Rfam

Non-coding RNA genes code for a functional RNA product rather than for a protein.

Page 3: Annotating nc-RNAs with Rfam

Non-coding genes code for a functional RNA product rather than for a protein.

Families of functional RNAs:

Biological function: RNA family

Involved in protein synthesis: tRNA, rRNA, SRP RNA, tmRNA

Post-transcriptional modification or DNA replication: snRNA, snoRNA, SmY, scaRNA, gRNA, RNase P, RNase MRP, Y RNA, telomerase RNA

Regulatory RNAs: aRNA, NAT, crRNA, long ncRNA, miRNA, piRNA, siRNA, tasiRNA, rasiRNA, 7SK

Parasitic RNAs: retrotransposon, viroid, satellite RNA

Page 4: Annotating nc-RNAs with Rfam

The majority of functional RNAs fold into stable structures that are essential for their biological activity.

[Figure: secondary structures of a micro-RNA precursor, a tRNA, the U2 spliceosomal RNA and part of a riboswitch]

Page 5: Annotating nc-RNAs with Rfam

Unlike protein-coding genes, functional RNAs often show no significant sequence similarity but preserve a base-paired secondary structure.

This makes it very difficult to find those genes by looking only for sequence similarity (e.g. by using BLAST or FASTA).

ncRNA_1 AAAAAAGGGGTTTTTT

ncRNA_2 AAATAAGGGGTTATTT

Struct  ((((((....))))))
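The toy alignment above can be checked programmatically. The following is a minimal sketch, not part of the original slides: it reads the dot-bracket line, lists the paired positions, and reports pairs where both bases changed while pairing was kept (a compensatory change). The function names and the set of allowed pairs (Watson-Crick plus the G-T wobble, with T standing in for U) are assumptions made for illustration.

```python
# Toy alignment copied from the slide (DNA alphabet, so T stands in for U).
SEQ_1 = "AAAAAAGGGGTTTTTT"
SEQ_2 = "AAATAAGGGGTTATTT"
STRUCT = "((((((....))))))"

# Assumption: Watson-Crick pairs plus the G-T (G-U) wobble count as valid pairs.
VALID_PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}


def base_pairs(structure):
    """Return sorted 0-based (i, j) positions joined by matching parentheses."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.append((stack.pop(), pos))
    return sorted(pairs)


def identity(seq_a, seq_b):
    """Fraction of aligned columns with identical residues."""
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def compensatory_pairs(seq_a, seq_b, structure):
    """Pairs where both positions differ between the sequences but still pair in each."""
    found = []
    for i, j in base_pairs(structure):
        both_changed = seq_a[i] != seq_b[i] and seq_a[j] != seq_b[j]
        still_paired = (seq_a[i], seq_a[j]) in VALID_PAIRS and (seq_b[i], seq_b[j]) in VALID_PAIRS
        if both_changed and still_paired:
            found.append((i, j))
    return found


if __name__ == "__main__":
    print(identity(SEQ_1, SEQ_2))
    print(compensatory_pairs(SEQ_1, SEQ_2, STRUCT))
```

For the slide's alignment the only compensatory change is at 0-based positions 3 and 12, where the A-T pair of ncRNA_1 becomes T-A in ncRNA_2: the sequence changes at both positions, but the structure is preserved.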

Page 6: Annotating nc-RNAs with Rfam

In the Rfam database, a functional RNA family is represented by a multiple sequence alignment and a covariance model.

The model takes into account both sequence and structure and can be used to scan a genomic sequence to detect new members of the same family.

Page 7: Annotating nc-RNAs with Rfam

The Rfam Seed alignment for the U12 minor spliceosomal RNA family.

Page 8: Annotating nc-RNAs with Rfam

Only one sequence, up to 10 kb

Search methodology

The query sequence is scanned against a library of Rfam sequences using WU-BLAST, with an E-value threshold of 1.0. Any matches to this are then scanned against the corresponding covariance model using the hand-curated threshold for that family.

Page 9: Annotating nc-RNAs with Rfam

Results

Positive hits are reported together with their score, E-value and alignment to the family CM.

Page 10: Annotating nc-RNAs with Rfam

Bit score: how well the sequence matches the model. A positive score means the sequence matches the profile model better than the null model of nonhomologous sequences; a negative score means it matches the null model better.

E-value: the expected number of false positives with bit scores at least as high as that of your hit. The value depends on the size of the database used for the search.
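The two quantities can be illustrated with a small numeric sketch. It assumes the usual log-odds definition of a bit score (base-2 logarithm of the ratio between the probability of the sequence under the model and under the null model) and treats the E-value as proportional to the size of the searched database. The function names and the example numbers are hypothetical, and this is not how Infernal computes its statistics internally.

```python
import math


def bit_score(p_model, p_null):
    """Log-odds score in bits: positive if the model explains the sequence better."""
    return math.log2(p_model / p_null)


def rescale_evalue(evalue, old_db_size, new_db_size):
    """Assumption: the E-value grows linearly with the number of residues searched."""
    return evalue * new_db_size / old_db_size


if __name__ == "__main__":
    # Hypothetical probabilities: the model is 2**10 times more likely than the null.
    print(bit_score(2 ** -20, 2 ** -30))
    # The same hit searched against a database 100 times larger.
    print(rescale_evalue(0.01, 1_000_000, 100_000_000))
```

Under these assumptions the same bit score becomes less significant as the database grows, which is why an E-value is only meaningful together with the size of the database that was searched.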

Page 11: Annotating nc-RNAs with Rfam

I Predicted secondary structure:

"<>", "[]", "{}": base pairs

"_": hairpin loop

"-": interior loop and bulge

",": single-stranded residue in a multifurcation loop

":": external single-stranded residue

".": insertion relative to the consensus

II Consensus of the query model

III Alignment to the model and scoring system:

Capital letter: maximum score

":" and "+": score >= 0 for base pairs and single-stranded residues, respectively

" " (blank): negative score

IV Target sequence
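As a reading aid, the structure-line symbols listed above can be turned into a lookup table. The sketch below counts how many columns of a structure line fall into each class. The example string is made up for illustration and is not taken from a real Rfam hit; the function and dictionary names are assumptions.

```python
from collections import Counter

# Symbol classes as listed on the slide; anything else is reported as "unknown".
SYMBOL_CLASSES = {
    "<": "base pair", ">": "base pair",
    "[": "base pair", "]": "base pair",
    "{": "base pair", "}": "base pair",
    "_": "hairpin loop",
    "-": "interior loop or bulge",
    ",": "multifurcation loop (single stranded)",
    ":": "external single-stranded residue",
    ".": "insertion relative to the consensus",
}


def summarize_structure(line):
    """Count the columns of a structure line by symbol class."""
    return Counter(SYMBOL_CLASSES.get(symbol, "unknown") for symbol in line)


if __name__ == "__main__":
    # Hypothetical structure line: one hairpin flanked by unpaired residues.
    for name, count in summarize_structure(":<<<____>>>:").items():
        print(f"{name}: {count}")
```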

Page 12: Annotating nc-RNAs with Rfam

Going to the family information

A Wikipedia summary of the family is shown together with the information stored in the database.

Page 13: Annotating nc-RNAs with Rfam

Going to the family information

The sequences belonging to the family can be viewed (if there are not too many).

Page 14: Annotating nc-RNAs with Rfam

Going to the family information

Both seed and full alignments of members can be displayed.

Page 15: Annotating nc-RNAs with Rfam

Going to the family information

Both seed and full alignments of members can be displayed.

Page 16: Annotating nc-RNAs with Rfam

Going to the family information

The secondary structure can be viewed.

Page 17: Annotating nc-RNAs with Rfam

Going to the family information

The secondary structure can be viewed.

Page 18: Annotating nc-RNAs with Rfam

Going to the family information

The tree of genomes containing members of the family can also be browsed.

Page 19: Annotating nc-RNAs with Rfam

Going to the family information

If a PDB entry is available, the three-dimensional structure can also be viewed.

Page 20: Annotating nc-RNAs with Rfam

Going to the family information

If a PDB entry is available, the three-dimensional structure can also be viewed.

Page 21: Annotating nc-RNAs with Rfam

Going to the family information

Publications on the family are linked from its page.

Page 22: Annotating nc-RNAs with Rfam

Problems in searching sequences

-  To speed up the search, a filtering step based on a BLAST search is necessary. This decreases the sensitivity in finding true homologues of the functional RNA family.

-  The genomes of higher eukaryotes contain many ncRNA-derived pseudogenes and repeats that look like structured functional RNAs.

Gardner PP, et al. Bateman A. Rfam: updates to the RNA families database. Nucleic Acids Res. 2009

Page 23: Annotating nc-RNAs with Rfam

Batch search

You can upload a file containing several sequences in FASTA format. Generally a job takes 48 hours.

Files must have fewer than 100,000 lines and fewer than 1000 sequences with a size shorter than 200,000 nucleotides
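Before uploading, the file can be checked locally against these limits. The sketch below is a hedged illustration: the function name is hypothetical, and it assumes the 200,000-nucleotide limit applies to each individual sequence, which is one possible reading of the slide.

```python
MAX_LINES = 100_000
MAX_SEQUENCES = 1000
MAX_SEQ_LENGTH = 200_000  # assumption: the limit is per sequence


def check_batch_fasta(text):
    """Return a list of problems found in FASTA text; an empty list means it passes."""
    problems = []
    lines = text.splitlines()
    if len(lines) >= MAX_LINES:
        problems.append(f"{len(lines)} lines (must be fewer than {MAX_LINES})")

    records = []  # one [name, length] entry per FASTA header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:].strip() or "unnamed", 0])
        elif records:
            records[-1][1] += len(line)

    if len(records) >= MAX_SEQUENCES:
        problems.append(f"{len(records)} sequences (must be fewer than {MAX_SEQUENCES})")
    for name, length in records:
        if length >= MAX_SEQ_LENGTH:
            problems.append(f"{name}: {length} nt (must be shorter than {MAX_SEQ_LENGTH})")
    return problems


if __name__ == "__main__":
    print(check_batch_fasta(">seq1\nACGTACGT\n>seq2\nGGGCCC\n"))
```

A file that passes returns an empty list; otherwise each entry describes one limit that was exceeded.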

Page 24: Annotating nc-RNAs with Rfam

Browsing for genome

Genomes scanned for the presence of Rfam families are listed in the Browse tab.

Page 25: Annotating nc-RNAs with Rfam

Browsing for genome

For each genome, the species, kingdom, number of Rfam families and number of members found within the species (Regions) are reported.

Page 26: Annotating nc-RNAs with Rfam

Browsing for genome

Page 27: Annotating nc-RNAs with Rfam

Browsing for genome

Page 28: Annotating nc-RNAs with Rfam

You can install the Infernal program locally; it is available at http://infernal.janelia.org/.

To speed up the search you can also install the rfam_scan.pl script, available at ftp://ftp.sanger.ac.uk/pub/databases/Rfam/tools/, which relies on the BLAST program.

Running a complete search for a whole genome.

Page 29: Annotating nc-RNAs with Rfam

Typical usage of Infernal:

cmsearch -o output.aln --tabfile output.tab Rfam.cm infile.fna

Typical usage of rfam_scan.pl:

perl rfam_scan.pl -blastdb Rfam.fasta -o outfile.out Rfam.cm infile.fna

Running a complete search for a whole genome.
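When the search is part of a larger pipeline, the cmsearch call can be wrapped in a script. The following is a minimal sketch, assuming the `-o` and `--tabfile` options shown on the slide and Infernal's documented argument order (CM file first, then the sequence file). `--tabfile` belongs to the Infernal 1.0 series, so check `cmsearch -h` for your installed version. The function names are hypothetical.

```python
import shutil
import subprocess


def cmsearch_command(cm_file, seq_file, alignment_out="output.aln", table_out="output.tab"):
    """Build the argument list for cmsearch: options, then CM file, then sequences."""
    return ["cmsearch", "-o", alignment_out, "--tabfile", table_out, cm_file, seq_file]


def run_cmsearch(cm_file, seq_file):
    """Run cmsearch if it is installed; raise a clear error otherwise."""
    if shutil.which("cmsearch") is None:
        raise FileNotFoundError("cmsearch not found on PATH; install Infernal first")
    return subprocess.run(cmsearch_command(cm_file, seq_file), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_cmsearch() to actually start the search.
    print(" ".join(cmsearch_command("Rfam.cm", "infile.fna")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.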

Page 30: Annotating nc-RNAs with Rfam

Thanks!