Chris Penkett
Wellcome Trust Sanger Institute
Overview:
• Web-based software for orthologs and primers
• In-house microarray processing
• Stationary phase experiments with fission yeast
• Analysis of introns and expression data
YOGY: a web-based database for protein orthologs and associated GO terms
Can be used to search most of the major eukaryotic model organisms using a variety of gene IDs. Data are stored in a MySQL database and results are shown on the web.
Includes data from various ortholog prediction resources, including KOGs, Inparanoid, OrthoMCL and HomoloGene.
Also allows Gene Ontology (GO) terms to be retrieved for each ortholog along with evidence codes giving an overview of the protein function.
It is now being used by the GO Reference Genome Consortium to aid with assigning GO terms between the model organisms.
Overview of output for search with cdc22
OrthoMCL results: example for
cdc22 ortholog prediction output
Part of the GO output for cdc22
PPPP: a web-based primer design program for gene tagging/deletion
Scripts to design primers for N- and C-terminal tagging/deletion of genes using the method of homologous recombination.
Primers are integrated into a kanamycin-resistant plasmid using PCR, and then transformed into fission yeast cells.
In addition to gene deletion, a gene can be tagged with an inducible promoter, a tag that is recognised by antibodies, or with a fluorescently labelled protein (GFP).
Primers can also be designed that allow checking of correct integration of the plasmid into the chromosomal location using PCR.
Primers for homologous recombination
Primers for checking integration
Data flow for in-house arrays
The group has two PCR-based spotted arrays: one for ORFs (and non-coding RNAs) and one for intergenic regions. The ORF array was originally produced in 2000 and is still used today.
The advantage is that data from a wide range of experiments (environmental stress, cell cycle, mating, sporulation, translation data, RNA half-lives, etc.) have been generated under nearly identical conditions.
Needed to produce a robust, easily maintainable pipeline to get the data from these arrays in a Windows-based environment where ~1000 arrays are used per year in the lab. Also needed to design new primers to obtain a nearly complete set of sequences.
Recently, the biggest problem has been the amount of data in GeneSpring, which made it necessary to upgrade to the Oracle-based version.
Spotted array data flow (diagram labels):
• GeneDB: sequences, annotation
• Primer design scripts: ORF/tiling arrays
• Primers: 96-well plate format
• GAL file: microarray layout
• Images/GPR files
• Normalised all/spot/gene files
• GeneSpring / R (BioConductor)
• SPGE data viewer
• ArrayExpress
• Hyb Info DB: experiment info
• Primers: 384-well plate format
• 96-well to 384-well conversion program
• TAS software
• GenePix image analysis software
• Local normalisation script
• Tab2Mage (appears twice in the diagram)
• SPGE loaders
• Microarray primer DB
Initiated as a pipeline to check that we had a complete set of valid primers for all ORFs and intergenic regions on the in-house S. pombe arrays.
Stored the data from this pipeline in a MySQL database, which is managed and viewed on the web with Perl scripts using CGI/DBI modules.
Contains plate information (96-well format) together with primer information: sequences (for the primers and the final amplicon), mapping information, melting temperature, % GC content, PCR result, etc.
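Two of the per-primer fields listed above, % GC content and melting temperature, can be sketched in a few lines. This is only an illustration: the talk does not say how the database computes melting temperature, so the Wallace rule used here (2 °C per A/T, 4 °C per G/C, a rough estimate suited to short oligos) is an assumption, and the function names and the example sequence are made up.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rough melting temperature in deg C by the Wallace rule: 2*(A+T) + 4*(G+C).

    Assumption: the source does not state which Tm formula is used.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


primer = "ATGCGTACCGTTAGC"  # made-up sequence, for illustration only
print(f"{gc_percent(primer):.1f}% GC, Tm ~ {wallace_tm(primer)} C")
```

For real primer lengths a nearest-neighbour Tm model would usually be preferred; the simple rule is used here only to keep the sketch self-contained.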
Microarray primer database
ORF array Intergenic array
• Java program that works for both ORF and intergenic arrays.
• Two conversion patterns used by array makers.
• Can also add any number of bacterial plates anywhere on array.
96-well to 384-well conversion program
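A minimal sketch of one possible 96-well to 384-well mapping, written in Python rather than the Java of the original program. The talk mentions two conversion patterns but does not describe them, so the quadrant-interleaved layout below is an assumption, as are the function name and the 0-3 plate numbering.

```python
import string

ROWS_96, COLS_96 = 8, 12


def well_96_to_384(plate: int, well: str) -> str:
    """Map a well on one of four 96-well plates (plate 0-3) to a 384-well position.

    Assumed pattern: quadrant interleaving, so plate 0 fills A1, A3, ...,
    plate 1 fills A2, A4, ..., plate 2 fills B1, B3, ..., plate 3 fills B2, B4, ...
    """
    if plate not in range(4):
        raise ValueError("plate must be 0, 1, 2 or 3")
    row = string.ascii_uppercase.index(well[0].upper())
    col = int(well[1:]) - 1
    if not (0 <= row < ROWS_96 and 0 <= col < COLS_96):
        raise ValueError(f"not a 96-well position: {well}")
    row_384 = 2 * row + plate // 2   # plates 2 and 3 go to the odd rows
    col_384 = 2 * col + plate % 2    # plates 1 and 3 go to the odd columns
    return f"{string.ascii_uppercase[row_384]}{col_384 + 1}"


print(well_96_to_384(0, "A1"), well_96_to_384(3, "H12"))
```

With this layout, four full 96-well plates fill every position of the 384-well plate exactly once. Extra bacterial plates, as mentioned in the bullets above, are not handled in this sketch.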
• Perl/Tk script that works on both arrays.
• Uses a sliding window around each spot for normalisation.
• Works with bacterial spikes using various algorithms.
Local normalisation script
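The sliding-window idea can be sketched as follows, in Python rather than the Perl/Tk of the original script. The talk does not give the window size or the statistic used, so the square window, the median, and the use of log2 ratios are all assumptions made for illustration; the bacterial-spike handling is not covered.

```python
from statistics import median


def local_normalise(log_ratios, half_width=1):
    """Subtract from each spot the median log-ratio of its neighbourhood.

    log_ratios: 2-D list (rows x columns) of log2 ratio values, laid out
    as on the array. The window is a square of side 2*half_width + 1,
    clipped at the array edges (assumed design, not from the source).
    """
    n_rows, n_cols = len(log_ratios), len(log_ratios[0])
    out = []
    for r in range(n_rows):
        new_row = []
        for c in range(n_cols):
            window = [
                log_ratios[i][j]
                for i in range(max(0, r - half_width), min(n_rows, r + half_width + 1))
                for j in range(max(0, c - half_width), min(n_cols, c + half_width + 1))
            ]
            new_row.append(log_ratios[r][c] - median(window))
        out.append(new_row)
    return out


# Made-up 3 x 3 block: a uniform background with one strongly changed spot.
block = [[0.5, 0.5, 0.5], [0.5, 2.5, 0.5], [0.5, 0.5, 0.5]]
print(local_normalise(block))
```

Using the median means a single strongly regulated spot barely shifts the local baseline, while a regional bias (the 0.5 offset in the toy block) is removed.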
Hyb Info DB for MIAME experiment annotation
Starvation/stationary phase study
- Rationale: most cells in our body have stopped growing, and yeast in the wild (on a grape, for example) are also no longer growing.
- Grow WT cells from mid-exponential phase (OD ~ 0.3) to stationary phase (OD ~ 3) in minimal medium at 32 °C.
- Experimental issues:
• Different numbers of cells at different time points.
• Less total RNA per cell in stationary phase.
• Normalise to cell numbers (by counting cells), RNA amounts, and relative mRNA levels (by using bacterial spikes).
• Need to extract consistent amounts of RNA to normalise using RNA yield.
• pH and other factors change during experiment.
Fission yeast - life cycle (partial)
[Figure labels: mitotic cell cycle; stationary phase; meiosis/sporulation; dormant ascospores; re-supply nutrients; nutrient (nitrogen) deprivation; conjugation of h+/h- cells; zygote formation; zygotic ascus; environmental factors (stress); nutrient (glucose) deprivation]
[Figure: OD and cells (1*10^6) against time (hours)]
Stationary phase expression profile: data up to 11 days
[Figure: glucose (mgs per ml) and pH against time (hours)]
Overall normalisation
[Figure: RNA per cell, normalised bacterial controls and overall normalisation at each time point]
Large scale expression and cell morphology changes
Time points normalised using bacterial controls (difficult to measure accurately) and cell counts.
Induced CESR (common) stress genes
Ribosomal genes
Pathways related to sugar metabolism
Citric acid cycle
Starch and sugar metabolism
Glycolysis pathway
Pombe seems to use non-glucose sources for energy in stat. phase and to store sugar and starch.
tRNA coupling genes
Genes that are up-regulated 2-fold in low glucose medium
Genes associated with RNA pol II
More pathways of interest
Mitochondrial electron transport
Budding yeast findings
>1800 genes increase 5-10 mins after refeeding.
Mitochondrial function is important for stat. phase entry.
2 out of 3 stat. phase genes have human orthologs.
>2500 genes up-regulated.
ChIP-chip reveals RNA pol II present at intergenic sites upstream of genes that are induced upon stat. phase exit.
Transcription factors that come up early in stat. phase
rsv1: cell viability in low glucose
C1105.14: selected for deletion studies
pcr1: meiosis
rst2: meiosis
res2: DNA synthesis/meiotic division - goes down later
atf1: meiosis/stat. phase/stress response
pap1: stress response
hsf: binds to heat shock elements
mbx2: cell wall synthesis
php2: respiration/mitochondrial electron transport
jmj2: chromatin remodelling
C320.03
C2H10.01
C25B8.19c
C19C7.10
Gene deletion of C1105.14
Exponential phase – 5 repeats Stationary phase – 2 repeats
Green is for down-regulated genes in the WT time course.
Red is for up-regulated genes.
50 most repressed genes in C1105.14 mutant in WT stationary phase time course
Genes regulated in stationary phase
Up-regulated:
• Stress MAPK pathway and stress response genes.
• Citric acid cycle and mitochondrial transport genes.
• Starch and sugar metabolism genes.
• Genes that are 2-fold up-regulated in low glucose medium (including sugar transporter genes).
• Genes involved with RNA polymerase II.
• Transcription factors known to be involved in starvation, stress, meiosis.
• Some unknown TFs that are now being investigated further in the lab.
Down-regulated:
• Ribosomal proteins.
• Glycolysis pathway.
• Fatty acid synthesis genes.
• tRNA coupling genes.
• Amino acid and nucleotide metabolism genes.
Effect of introns in up-regulated stress-response (CESR) genes in stationary phase time course
[Figure panels: genes without introns; genes with introns]
Genes with and without introns in different oxidative stress conditions
Seems to be general in pombe stress experiments.
Comparing data sets from different organisms
It seems that in pombe the transcriptional response to stress conditions is governed by a need to produce functional mRNAs quickly (without the need for splicing). Is this common to other organisms?
As studies in different organisms use various time points, need a way to compare data both within and between time courses using a standard common metric – expression change within unit time.
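For expression levels E1, E2, E3 measured at times t1, t2, t3:
$$R_{2-1} = \frac{E_2 - E_1}{t_2 - t_1}, \qquad R_{3-2} = \frac{E_3 - E_2}{t_3 - t_2}$$
$$R_{max} = \operatorname{abs}\{\max(R_{2-1},\, R_{3-2})\}$$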
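The metric can be sketched directly from the formula as written on the slide. The function names are hypothetical and the example values are made up; the code generalises to any number of time points, which is an assumption beyond the three-point case shown.

```python
def rates(times, expression):
    """Expression change per unit time between consecutive time points."""
    if len(times) != len(expression) or len(times) < 2:
        raise ValueError("need matching lists with at least two time points")
    return [
        (e2 - e1) / (t2 - t1)
        for (t1, e1), (t2, e2) in zip(
            zip(times, expression), zip(times[1:], expression[1:])
        )
    ]


def r_max(times, expression):
    """Rmax following the slide literally: abs{max(R2-1, R3-2, ...)}."""
    return abs(max(rates(times, expression)))


# Made-up log-ratio values at t = 0, 15, 60 mins, for illustration only.
print(r_max([0, 15, 60], [0.0, 3.0, 1.5]))
```

Taking the absolute value after the maximum follows the slide's notation. If the largest change in either direction were wanted, the maximum of the absolute rates would be the alternative; the source does not say which was intended.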
Rmax for stress data against number of introns in pombe
Data is from Chen et al. (2003), for 5 different stresses with t = 0, 15, 60 mins.
Data is from 2-colour microarrays, so is relative expression levels, compared to t = 0.
Median of Rmax for all data
Correlation of max value against intron number using Spearman’s rank: P = 7.6 x 10^-6
Correlation of all values against intron number using Spearman’s rank: P < 2.2 x 10^-16
Compare genes without introns against genes with introns
Compare two data sets using Wilcoxon (Mann-Whitney) non-parametric rank test: P < 2.2 x 10^-16
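For readers unfamiliar with the two tests, here is a minimal pure-Python sketch of the statistics behind them: Spearman's rank correlation and the Mann-Whitney U statistic. It computes the statistics only, not P-values, and the toy values are made up rather than taken from the talk; the software actually used for the analysis is not stated in the source.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values given the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: its rank sum minus n_a * (n_a + 1) / 2."""
    ranks = average_ranks(list(sample_a) + list(sample_b))
    n_a = len(sample_a)
    return sum(ranks[:n_a]) - n_a * (n_a + 1) / 2


# Made-up toy values, for illustration only (not data from the talk).
introns = [0, 0, 1, 2, 3, 5]
rmax = [0.30, 0.25, 0.20, 0.12, 0.10, 0.05]
rho = spearman_rho(introns, rmax)
u = mann_whitney_u(
    [r for n, r in zip(introns, rmax) if n == 0],
    [r for n, r in zip(introns, rmax) if n > 0],
)
print(rho, u)
```

In the toy data Rmax falls as intron number rises, so rho is strongly negative and U for the intron-less group takes its largest possible value. A statistics package would normally be used to turn these statistics into P-values.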
Compare with cell cycle data
Data from 3 elutriations of wt cells over 2 cell cycles (Rustici et al., 2004).
W (0 vs >0): <2.2 x 10^-16
S (all data): <2.2 x 10^-16
W (0 vs >0): 3.9 x 10^-5
S (all data): 4.4 x 10^-5
Compare with Arabidopsis stress data
Data from various stresses in Arabidopsis for both roots and shoots including drought, UV-B, cold, heat, genotoxic, salt, wounding and osmotic from ? et al.
Time points include 30, 60, 180, 360, 720, 1440 mins (plus 15 and 240 for some).
W (0 vs >0): <2.2 x 10^-16
S (all data): <2.2 x 10^-16
W (0 vs >0): <2.2 x 10^-16
S (all data): <2.2 x 10^-16
Considerations with Arabidopsis data
Data collected using Affymetrix chips, so get absolute expression levels.
Hence can use data that is from the absolute values or a ratio to t = 0 (to compare with the 2-colour pombe data).
Also they did a time course with control untreated plants, so can also compare stress data using ratios to the control time points.
W (0 vs >0): <2.2 x 10^-16
S (all data): <2.2 x 10^-16
W (0 vs >0): <2.2 x 10^-16
S (all data): <2.2 x 10^-16
W (0 vs >0): <2.2 x 10^-16
S (all data): <2.2 x 10^-16
Rmax for Arabidopsis data using different methods
Results from absolute values and from ratios to t = 0 look virtually identical, because logs of the expression values were taken.
2 Affymetrix data sets in GEO: GDS1015 (fetal bovine serum factor; Philippar et al., 2004) and GDS683 (oxidative stress; Madsen et al., 2004). GDS1015 has better time points: t = 0, 10, 30, 50, 180 mins. GDS683: t = 0, 15, 60, 480, 1008 mins.
Some genes have >100 introns, so put into 22 equi-spaced bins (0-4, 5-9, 10-14, etc. introns).
Mouse stress data
W (0 introns vs >0 introns): - (mean/median less for 0 introns)
S (all data): - (positive gradient)
Similar poor stats for GDS683.
Use new metric for mouse data
As the transcripts are generally very long in mouse, the time taken to transcribe the pre-mRNA is also going to be a factor, as well as the time taken to splice out introns.
Additionally, number of introns correlates with transcript length for Arabidopsis and mouse (only get a small correlation in pombe).
Can use an alternative metric called intron density, which correlates positively with intron number and inversely with transcript length:
$$\text{Intron density} = \frac{\text{Number of introns}}{\text{Genomic length of transcript}}$$
Mouse stress data using intron densities
W (0 introns vs >0 introns): - (mean/median less)
S (all data): - (positive gradient)
W (<1/10th max vs >1/10th max): 4.0 x 10^-10
S (all data): 1.1 x 10^-6
As intron density is a continuous variable, put into 10 equi-spaced bins.
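A short sketch of the intron density calculation and the 10-bin grouping. The function names are hypothetical, and the choice to bin on the range from 0 to the maximum density is an assumption; the talk only says that 10 equi-spaced bins were used.

```python
def intron_density(n_introns: int, genomic_length_bp: int) -> float:
    """Number of introns divided by the genomic length of the transcript."""
    if genomic_length_bp <= 0:
        raise ValueError("genomic length must be positive")
    return n_introns / genomic_length_bp


def equal_width_bin(value: float, max_value: float, n_bins: int = 10) -> int:
    """Index (0 .. n_bins-1) of the equal-width bin on [0, max_value] holding value."""
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    index = int(value / max_value * n_bins)
    return min(max(index, 0), n_bins - 1)  # the maximum itself goes in the top bin


# Made-up genes as (intron count, genomic length in bp), for illustration only.
genes = [(0, 1200), (1, 1500), (4, 2000), (9, 30000)]
densities = [intron_density(n, length) for n, length in genes]
top = max(densities)
print([equal_width_bin(d, top) for d in densities])
```

Under this binning, the "<1/10th max vs >1/10th max" comparison quoted above would correspond to bin 0 against all other bins, though the source does not spell this out.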
Check Arabidopsis stress data for trend with intron densities
W (0 vs >0): <2.2 x 10^-16
S (all data): <2.2 x 10^-16
W (<1/10 vs >1/10): <2.2 x 10^-16
S (all data): <2.2 x 10^-16
Intron density still significant for Arabidopsis and pombe.
Pombe stats: W (<1/10 vs >1/10): <2.2 x 10^-16; S (all data): <2.2 x 10^-16
Transcription proceeds at 1200-1500 bp/minute (Izban & Luse, 1992).
Pombe: mean gene length ~1,500 bp – time to transcribe ~1 min, mean intron number ~1 per gene.
Arabidopsis: mean gene length ~1,900 bp – time ~1.5 mins, mean intron number ~4.4 per gene.
Mouse: mean gene length ~33,000 bp – time ~20 mins, mean intron number ~9 per gene.
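The transcription times above are simply gene length divided by elongation rate. A small worked sketch, using only the rates and mean lengths quoted in the text (the function and variable names are hypothetical):

```python
RATE_BP_PER_MIN = (1200, 1500)  # range quoted from Izban & Luse (1992)

MEAN_GENE_LENGTH_BP = {  # approximate means quoted in the text
    "pombe": 1_500,
    "Arabidopsis": 1_900,
    "mouse": 33_000,
}


def transcription_minutes(length_bp, rates=RATE_BP_PER_MIN):
    """(quickest, slowest) time in minutes to transcribe length_bp."""
    slow, fast = min(rates), max(rates)
    return length_bp / fast, length_bp / slow


for organism, length in MEAN_GENE_LENGTH_BP.items():
    quickest, slowest = transcription_minutes(length)
    print(f"{organism}: {quickest:.2f} to {slowest:.2f} min")
```

At these rates pombe comes out at about 1 to 1.25 min and Arabidopsis at about 1.3 to 1.6 min. Mouse comes out at 22 to 27.5 min, the same order as the ~20 mins quoted.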
Half-lives for splicing reactions are considerably longer, under a minute for the first intron, but of the order of 2-8 mins for the second and third introns (Audibert et al., 2002).
Intron splicing may be the rate limiting factor, since new spliceosomal ‘speckles’ form ~15-20 mins after gene activation in mammalian cells, and speckle morphology changes on the order of 5-7 mins (Misteli et al., 1997).
With time scales on this order, it appears that the assembly, splicing and release of the spliceosome may be limiting for rapid changes in gene expression.
Transcription and splicing kinetics
Acknowledgements
Jürg Bähler – Supervisor
Valerie Wood – Suggested adding GO into YOGY
Daniel Jeffares – Intron data
Gavin Burns – Laboratory help for pombe and arrays
Luis López – Stationary phase mutant data
Matloob Qureshi – 96 to 384 well program
Juan Mata – Normalisation script
Zoë Birtles, James Morris – Summer students