ANALYSIS AND VISUALIZATION OF SINGLE COPY ORTHOLOGS IN ARABIDOPSIS, LETTUCE, SUNFLOWER AND OTHER PLANT SPECIES.
Alexander Kozik and Richard W. Michelmore
University of California, Davis, Dept. of Vegetable Crops, Davis, CA 95616, USA
Approximately 3,700 of the genes in the Arabidopsis Col-0 genome are single copy. These genes were used to identify conserved orthologs in several other plant species. Using computational approaches, we identified 1104 lettuce, 686 sunflower, 1704 tomato, 2016 soybean, 1701 maize and 1290 rice ESTs that are conserved orthologs of these Arabidopsis genes. Each EST sequence in these sets has a single, unambiguous strong BLAST hit to the Arabidopsis genome. Reciprocal BLAST searches (Arabidopsis single copy genes versus EST assemblies) showed that more than 80% of the searches produced only a single strong hit. This indicates that the majority of these conserved orthologs are represented by single genes in multiple plant species. The total number of Arabidopsis genes that have similarity (BLAST expectation value of 1e-20 or better) to at least one of these selected ESTs is 2205, which is about 60% of the 3,714 single copy genes in Arabidopsis. Only 248 sequences were common to the EST collections from all species and the Arabidopsis single copy genes. This can be partially explained by the incomplete representation of genes within each EST collection. Analysis and visualization of single copy genes along the Arabidopsis chromosomes (http://cgpdb.ucdavis.edu/COS_Arabidopsis/arabidopsis_single_copy_genes_2003.html) revealed that these genes are distributed throughout the genome regardless of large scale chromosomal duplications. This indicates that deduction of the order of genes in common ancestors is required for informative analyses of synteny.
SINGLE COPY ORTHOLOGS SUMMARY
source                                  number of single copy orthologs
lettuce                                 1104
sunflower                               686
tomato                                  1704
soybean                                 2016
maize                                   1701
rice                                    1290
common between all                      248
common between lettuce and sunflower    431
Arabidopsis (total)                     2205 (out of 3,714 single copy genes)
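As a rough illustration of how summary counts of this kind could be derived, the sketch below assumes that each species contributes a set of Arabidopsis single copy gene identifiers hit by its ESTs, that the "total" row is the union of those sets, and that the "common between all" row is their intersection. The function name and the gene identifiers are hypothetical and are not taken from the poster's pipeline.

```python
def summarize_ortholog_sets(hits_by_species):
    """Return (total, common) counts of Arabidopsis genes across species.

    hits_by_species maps a species name to an iterable of Arabidopsis
    gene identifiers that had at least one selected EST hit.
    Assumption: total = union of all sets, common = intersection.
    """
    sets = [set(genes) for genes in hits_by_species.values()]
    total = set().union(*sets)
    common = set.intersection(*sets) if sets else set()
    return len(total), len(common)


# Toy data with made-up gene identifiers (illustration only).
example_hits = {
    "lettuce": {"At1g01010", "At2g02020"},
    "sunflower": {"At1g01010"},
    "rice": {"At1g01010", "At5g05050"},
}
total_count, common_count = summarize_ortholog_sets(example_hits)
print(total_count, common_count)
```

With real data the same call would take one set per EST collection; the fraction of single copy genes covered is then the total count divided by 3,714.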
Graphical representation of BLAST search of lettuce, sunflower, tomato, soybean, maize and rice ESTs against Arabidopsis genome. The picture displays potential conserved orthologs (single copy genes in Arabidopsis).
Each box (element) is a single copy Arabidopsis gene having homology to selected sets of plant ESTs. Genes are plotted along five Arabidopsis chromosomes according to their physical positions.
Patterns of segmental duplications in Arabidopsis genome (generated by GenomePixelizer, http://www.atgc.org/). Regions selected by white boxes are shown in large scale above.
Segmental duplication between Arabidopsis chromosomes 4 and 5 (panel labels: CHRM 5, CHRM 4)
Color Scheme:
Black - single copy genes
Purple - kinases
Green - cytochrome
Red - resistance genes
Yellow - ribosomal proteins
Gray lines connect genes with sequence identity 40% or greater
Note: Single copy genes are distributed evenly through both segments of the duplicated region. Image was generated by GenomePixelizer using the “locus zoomer” function. Additional information is available at: http://www.atgc.org/GP_Ref/presentation/
Credits: This work was funded by the USDA IFAFS Plant Genome Program to the Compositae Genome Project.
Questions and comments to Alexander Kozik, email: [email protected]
Raw data and a detailed description of the sequence extraction pipeline are available at: http://cgpdb.ucdavis.edu/COS_Arabidopsis/
PIPELINE TO IDENTIFY SINGLE COPY ORTHOLOGS
PIPELINE TO EXTRACT ALIGNMENTS AT NUCLEOTIDE LEVEL
http://cgpdb.ucdavis.edu/COS_Arabidopsis/Codon_Usage_Pipeline.html
MULTIPLE ALIGNMENT VISUALIZED WITH TkLife ( http://www.atgc.org/TkLife/ )
Sequences shown: Arabidopsis, lettuce, sunflower
alignment summary
Legend:
codon match (and amino acid match)
codon mismatch and amino acid match (synonymous substitutions)
codon mismatch and amino acid mismatch (non-synonymous substitutions)
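The three categories in the legend can be illustrated with a minimal sketch that compares two in-frame, gap-free aligned coding sequences codon by codon. It assumes the standard genetic code and hypothetical input strings; it is not the TkLife implementation, and the function name is invented for illustration.

```python
from itertools import product

# Standard genetic code, built from the compact TCAG-ordered layout.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def classify_codon_pairs(seq_a, seq_b):
    """Count codon matches, synonymous and non-synonymous differences.

    Both sequences are assumed to be aligned, in frame and of DNA
    alphabet; codons with gaps or ambiguous bases are skipped.
    """
    counts = {"match": 0, "synonymous": 0, "non_synonymous": 0}
    usable = min(len(seq_a), len(seq_b)) // 3 * 3
    for i in range(0, usable, 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous base: not classified
        if codon_a == codon_b:
            counts["match"] += 1
        elif CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            counts["synonymous"] += 1
        else:
            counts["non_synonymous"] += 1
    return counts


# ATG/ATG match, GCT/GCC both Ala (synonymous), AAA Lys vs AGA Arg.
print(classify_codon_pairs("ATGGCTAAA", "ATGGCCAGA"))
```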
Putative scenario of gene loss after segmental duplication
Because of extensive gene loss after duplication, deduction of gene order in ancestral genomes is required for informative synteny analysis between different genomes.
Flowchart of the pipeline to extract alignments at nucleotide level:
[step 1] GenBank files of Arabidopsis genome (DNA sequences of entire chromosomes and corresponding annotation) -> GenBank Parser -> spliced DNA sequences corresponding to ORFs
[step 2] translation -> translated (protein) sequences [subject]
[step 3] ESTs (unigene) set [query] -> BLASTX search [ESTs vs proteins] -> BLAST output (alignment)
[step 4] BLAST parser (Tcl/Tk script) -> tab-delimited file with info about BLAST alignments (start points and end points for each sequence in BLAST report)
[step 5] final step of the pipeline: SeqsExtractorFromBlastX (Python script): extraction of DNA sequences corresponding to BLAST alignments from “spliced DNA” (subject) and EST (query) files. Script automatically counts codon usage. Output: spreadsheet with info about codon usage
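The codon usage count mentioned for the final step can be sketched as follows. This is a minimal illustration, not the SeqsExtractorFromBlastX script itself: the function name is hypothetical, the input is assumed to be an in-frame DNA string already extracted from an alignment, and the tab-delimited output only stands in for the spreadsheet described above.

```python
from collections import Counter


def codon_usage(dna):
    """Count codons in an in-frame DNA sequence.

    Trailing bases that do not fill a complete codon are ignored.
    """
    dna = dna.upper()
    usable = len(dna) // 3 * 3
    return Counter(dna[i:i + 3] for i in range(0, usable, 3))


def usage_as_tsv(counts):
    """Format codon counts as tab-delimited text, one codon per line."""
    lines = ["codon\tcount"]
    for codon in sorted(counts):
        lines.append(f"{codon}\t{counts[codon]}")
    return "\n".join(lines)


# Hypothetical extracted fragment (illustration only).
usage = codon_usage("ATGAAAAAATAA")
print(usage_as_tsv(usage))
```

Counting on the extracted subject and query fragments separately would give one table per species, which is the kind of comparison the alignment legend (match, synonymous, non-synonymous) summarizes.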
Flowchart of the pipeline to identify single copy orthologs:
[step 1] Arabidopsis predicted proteins (27,169 seqs) -> BLAST search of Arabidopsis proteins against themselves and selection of Arabidopsis single copy genes -> Arabidopsis single copy genes (3,714 seqs)
[step 2] BLAST search of Arabidopsis single copy genes versus full sets of ESTs; selection of ESTs with BLAST hits to Arabidopsis single copy subset
    EST sets:
    lettuce ESTs (68,197 seqs)
    sunflower ESTs (67,180 seqs)
    tomato ESTs (113,932 seqs)
    maize ESTs (362,510 seqs)
    soybean ESTs (341,564 seqs)
    rice ESTs (107,329 seqs)
[step 3] BLAST search of selected ESTs versus all Arabidopsis predicted proteins and selection of ESTs with a single strong hit to Arabidopsis genome (Exp cutoff 1e-20)
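The selection of ESTs with a single strong hit could look roughly like the sketch below. It rests on assumptions that the poster does not state: BLAST results are available in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, E-value in column 11), and "single strong hit" is taken to mean exactly one distinct subject at or below the 1e-20 cutoff. The function name and the identifiers in the example are hypothetical.

```python
from collections import defaultdict


def single_strong_hit_queries(blast_lines, evalue_cutoff=1e-20):
    """Return {query: subject} for queries with exactly one strong hit.

    blast_lines: iterable of tab-delimited BLAST tabular lines.
    A hit counts as strong when its E-value is <= evalue_cutoff
    (assumed criterion; the original selection rule may differ).
    """
    strong_subjects = defaultdict(set)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= evalue_cutoff:
            strong_subjects[query].add(subject)
    return {
        query: next(iter(subjects))
        for query, subjects in strong_subjects.items()
        if len(subjects) == 1
    }


# Made-up records: est1 has one strong hit, est2 has two.
example_lines = [
    "est1\tAt1g01010\t90.0\t100\t5\t0\t1\t300\t1\t100\t1e-50\t200",
    "est1\tAt3g03030\t40.0\t80\t40\t2\t1\t240\t1\t80\t1e-5\t50",
    "est2\tAt2g02020\t85.0\t100\t10\t0\t1\t300\t1\t100\t1e-40\t180",
    "est2\tAt4g04040\t80.0\t100\t15\t0\t1\t300\t1\t100\t1e-30\t150",
]
print(single_strong_hit_queries(example_lines))
```

The same filter, applied to a search of Arabidopsis proteins against themselves with self hits excluded, is one plausible way to approximate the single copy selection in step 1, though the exact criteria used there are not given here.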