chris penkett wellcome trust sanger institute overview: web-based software for orthologs and primers...

Chris Penkett

Wellcome Trust Sanger Institute

Overview:

• Web-based software for orthologs and primers

• In-house microarray processing

• Stationary phase experiments with fission yeast

• Analysis of introns and expression data

YOGY: a web-based database for protein orthologs and associated GO terms

Can be used to search most of the major eukaryotic model organisms using a variety of gene IDs. Data are stored in a MySQL database and results are shown on the web.

Includes data from various ortholog prediction resources, including KOGs, Inparanoid, OrthoMCL and HomoloGene.

Also allows Gene Ontology (GO) terms to be retrieved for each ortholog along with evidence codes giving an overview of the protein function.

It is now being used by the GO Reference Genome Consortium to aid with assigning GO terms between the model organisms.

Overview of output for search with cdc22

OrthoMCL results: example for cdc22 ortholog prediction output

Part of the GO output for cdc22

PPPP: a web-based primer design program for gene tagging/deletion

Scripts to design primers for N- and C-terminal tagging/deletion of genes using the method of homologous recombination.

Primers are integrated into a kanamycin-resistant plasmid using PCR, and then transformed into fission yeast cells.

In addition to gene deletion, a gene can be tagged with an inducible promoter, a tag that is recognised by antibodies, or with a fluorescently labelled protein (GFP).

Primers can also be designed that allow checking of correct integration of the plasmid into the chromosomal location using PCR.

Primers for homologous recombination

Primers for checking integration
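As a rough illustration of the kind of calculation such scripts perform, the sketch below builds a pair of deletion primers from the sequence flanking an ORF. The 80-nt homology length, the function names and the two plasmid-annealing sequences are assumptions made for this example; they are placeholders and are not taken from PPPP.

```python
# Hypothetical sketch of deletion-primer design for homologous recombination.
# HOMOLOGY_LEN and both annealing sequences are illustrative assumptions.

HOMOLOGY_LEN = 80
FWD_ANNEAL = "ACGTACGTACGTACGTACGT"  # placeholder, not a real plasmid sequence
REV_ANNEAL = "TGCATGCATGCATGCATGCA"  # placeholder, not a real plasmid sequence

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def deletion_primers(chromosome, orf_start, orf_end, homology_len=HOMOLOGY_LEN):
    """Build forward and reverse primers flanking an ORF.

    orf_start is inclusive and orf_end exclusive (0-based, forward strand).
    """
    if orf_start < homology_len or orf_end + homology_len > len(chromosome):
        raise ValueError("not enough flanking sequence for the homology arms")
    upstream = chromosome[orf_start - homology_len:orf_start]
    downstream = chromosome[orf_end:orf_end + homology_len]
    # Homology arm first (5' end), plasmid-annealing part last (3' end).
    forward = upstream + FWD_ANNEAL
    reverse = reverse_complement(downstream) + REV_ANNEAL
    return forward, reverse
```

For tagging rather than deletion, the same idea would apply with the homology arms placed either side of the start or stop codon instead of either side of the whole ORF.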

Data flow for in-house arrays

The group has two PCR-based, spotted arrays: one for ORFs (and non-coding RNAs) and one for intergenic regions. The ORF array was originally produced in 2000 and is still used today.

The advantage is that data from a wide range of experiments (environmental stress, cell cycle, mating, sporulation, translation data, RNA half-lives, etc.) have been collected under nearly identical conditions.

Needed to produce a robust, easily maintainable pipeline to get the data from these arrays in a Windows-based environment where ~1000 arrays are used per year in the lab. Also needed to design new primers to obtain a nearly complete set of sequences.

Recently, the biggest problem was the amount of data in GeneSpring, so it was necessary to upgrade to the Oracle-based version.

Spotted array data flow

[Flow diagram. Data and files: GeneDB (sequences, annotation); primers in 96-well plate format; GAL file (microarray layout); images/GPR files; normalised all/spot/gene files; GeneSpring / R (BioConductor); SPGE data viewer; ArrayExpress; Hyb Info DB (experiment info); microarray primer DB. Programs: primer design scripts (ORF/tiling arrays); 96-well to 384-well conversion program; TAS software; GenePix image analysis software; local normalisation script; Tab2Mage; SPGE loaders.]

Microarray primer database

Initiated as a pipeline to check that we had a complete set of valid primers for all ORFs and intergenic regions on the in-house S. pombe arrays.

The data from this pipeline are stored in a MySQL database, which is managed and viewed on the web with Perl scripts using the CGI/DBI modules.

Contains 96-well plate information together with primer information: sequences (of the primers and the final amplicon), mapping information, melting temperature, % GC content, PCR result, etc.

[Panels: ORF array; intergenic array]
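Two of the stored primer properties, % GC content and melting temperature, are simple to compute. The sketch below is only an illustration: the slides do not say which melting-temperature method the pipeline uses, so the Wallace rule (2 °C per A/T, 4 °C per G/C, a rough estimate intended for short oligos) is an assumption, as are the function names.

```python
# Illustrative primer statistics; the Tm method is an assumption, not the
# method used by the primer database described above.


def gc_content(seq):
    """Percentage of G and C bases in a primer sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Rough melting temperature in degrees C using the Wallace rule."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc
```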






96-well to 384-well conversion program

• Java program that works for both ORF and intergenic arrays.

• Two conversion patterns used by array makers.

• Can also add any number of bacterial plates anywhere on array.
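To make the idea of a conversion pattern concrete, here is a minimal sketch of one widely used layout, in which four 96-well plates are interleaved into the quadrants of each 2 x 2 block of a 384-well plate. The slides do not specify the two patterns the Java program supports, so this mapping and the function name are assumptions for illustration only.

```python
import string

ROWS_96, COLS_96 = 8, 12


def to_384(plate, well):
    """Map a well on one of four 96-well plates to a 384-well position.

    plate is 0-3; well is a name such as 'A1' or 'H12'.
    Assumed interleaved pattern: plate 0 fills odd rows/odd columns,
    plate 1 odd rows/even columns, plate 2 even rows/odd columns,
    plate 3 even rows/even columns (1-based numbering).
    """
    if plate not in range(4):
        raise ValueError("plate must be 0, 1, 2 or 3")
    row = string.ascii_uppercase.index(well[0].upper())
    col = int(well[1:]) - 1
    if not (0 <= row < ROWS_96 and 0 <= col < COLS_96):
        raise ValueError(f"not a 96-well position: {well}")
    row_384 = 2 * row + plate // 2
    col_384 = 2 * col + plate % 2
    return f"{string.ascii_uppercase[row_384]}{col_384 + 1}"
```

Every source well lands on a distinct 384-well position, so the mapping can be inverted to trace a spot back to its original plate and well.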






Local normalisation script

• Perl/Tk script that works on both arrays.

• Uses a sliding window around each spot for normalisation.

• Works with bacterial spikes using various algorithms.
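The sliding-window idea can be sketched as follows. This is not the Perl/Tk script itself: the use of the median, the square window, the default window size and the function name are all assumptions, and the handling of bacterial spikes is left out.

```python
from statistics import median


def local_normalise(log_ratios, half_width=1):
    """Subtract the local median log ratio from each spot.

    log_ratios is a 2D list (rows x columns of the array) of log ratios,
    with None for missing spots. The window is a square of side
    2 * half_width + 1 centred on the spot and clipped at the array edges.
    """
    n_rows = len(log_ratios)
    n_cols = len(log_ratios[0]) if n_rows else 0
    result = []
    for r in range(n_rows):
        out_row = []
        for c in range(n_cols):
            value = log_ratios[r][c]
            if value is None:
                out_row.append(None)
                continue
            window = [
                log_ratios[i][j]
                for i in range(max(0, r - half_width), min(n_rows, r + half_width + 1))
                for j in range(max(0, c - half_width), min(n_cols, c + half_width + 1))
                if log_ratios[i][j] is not None
            ]
            out_row.append(value - median(window))
        result.append(out_row)
    return result
```

Working on log ratios means a local intensity bias shows up as an additive offset, which the median of the neighbouring spots estimates without being pulled far by a few strongly regulated genes.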






Hyb Info DB for MIAME experiment annotation

Starvation/stationary phase study

- Rationale: most cells in our body have stopped growing, and yeast in the wild (on a grape, for example) are also no longer growing.

- Grow WT cells from mid-exponential phase (OD ~ 0.3) to stationary phase (OD ~ 3) in minimal medium at 32 °C.

- Experimental issues:

• Different numbers of cells at different time points.

• Less total RNA per cell in stationary phase.

• Normalise to cell numbers (by counting cells), RNA amounts, and relative mRNA levels (by using bacterial spikes).

• Need to extract consistent amounts of RNA to normalise using RNA yield.

• pH and other factors change during experiment.
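One way the three normalisation inputs listed above could be combined is sketched below. How the study actually combined them is not stated in the slides, so multiplying the relative RNA per cell by a spike-derived factor is an assumption, as are all the names and the example units.

```python
# Hypothetical combination of cell counts, RNA yield and bacterial spikes
# into a single scaling factor per time point (an assumption, see lead-in).


def rna_per_cell(total_rna_ug, cell_count):
    """Total RNA yield divided by the number of cells extracted."""
    if cell_count <= 0:
        raise ValueError("cell_count must be positive")
    return total_rna_ug / cell_count


def overall_factor(total_rna_ug, cell_count,
                   ref_total_rna_ug, ref_cell_count, spike_factor=1.0):
    """Scaling factor for one time point relative to a reference time point.

    spike_factor is the correction derived from the bacterial spike
    controls for this array (1.0 means no correction).
    """
    relative_rna = (rna_per_cell(total_rna_ug, cell_count)
                    / rna_per_cell(ref_total_rna_ug, ref_cell_count))
    return relative_rna * spike_factor
```

For example, half the RNA yield from twice as many cells gives a factor of 0.25 relative to the reference, before any spike correction.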

Fission yeast - life cycle (partial)

[Diagram labels: mitotic cell cycle; stationary phase; nutrient (glucose) deprivation; nutrient (nitrogen) deprivation; environmental factors (stress); conjugation of h+/h- cells; zygote formation; zygotic ascus; meiosis/sporulation; dormant ascospores; re-supply nutrients.]

[Plot: OD and cell number (x 10^6) against time (hours).]

Stationary phase expression profile: data up to 11 days

[Plot: glucose (mg per ml) and pH against time (hours).]

Overall normalisation

[Plot: RNA per cell, normalised bacterial controls and overall normalisation across time points.]

Large scale expression and cell morphology changes

Time points normalised using bacterial controls (difficult to get accurate) and cell counts.

Induced CESR (common) stress genes

Ribosomal genes

Pathways related to sugar metabolism

Citric acid cycle

Starch and sugar metabolism

Glycolysis pathway

Pombe seems to use non-glucose sources for energy in stat. phase and to store sugar and starch.

tRNA coupling genes

Genes that are up-regulated 2-fold in low glucose medium

Genes associated with RNA pol II

More pathways of interest

Mitochondrial electron transport

Budding yeast findings

>1800 genes increase 5-10 mins after refeeding.

Mitochondrial function is important for stat. phase entry.

2 out of 3 stat. phase genes have human orthologs.

>2500 genes up-regulated.

ChIP-chip reveals RNA pol II present at intergenic sites upstream of genes that are induced upon stat. phase exit.

Transcription factors that come up early in stat. phase

rsv1: cell viability in low glucose

C1105.14: selected for deletion studies

pcr1: meiosis

rst2: meiosis

res2: DNA synthesis/meiotic division - goes down later

atf1: meiosis/stat. phase/stress response

pap1: stress response

hsf: binds to heat shock elements

mbx2: cell wall synthesis

php2: respiration/mitochondrial electron transport

jmj2: chromatin remodelling

C320.03

C2H10.01

C25B8.19c

C19C7.10

Gene deletion of C1105.14

Exponential phase: 5 repeats; stationary phase: 2 repeats

Green is for down-regulated genes in the WT time course.

Red is for up-regulated genes.

50 most repressed genes in C1105.14 mutant in WT stationary phase time course

Genes regulated in stationary phase

Up-regulated:

• Stress MAPK pathway and stress response genes.

• Citric acid cycle and mitochondrial transport genes.

• Starch and sugar metabolism genes.

• Genes that are 2-fold up-regulated in low glucose medium (including sugar transporter genes).

• Genes involved with RNA polymerase II.

• Transcription factors known to be involved in starvation, stress, meiosis.

• Some unknown TFs that are now being investigated further in the lab.

Down-regulated:

• Ribosomal proteins.

• Glycolysis pathway.

• Fatty acid synthesis genes.

• tRNA coupling genes.

• Amino acid and nucleotide metabolism genes.

Effect of introns in up-regulated stress-response (CESR) genes in stationary phase time course

[Panels: genes without introns; genes with introns]

Genes with and without introns in different oxidative stress conditions

Seems to be general in pombe stress experiments.

Comparing data sets from different organisms

It seems that in pombe the transcriptional response to stress conditions is governed by a need to produce functional mRNAs quickly (without the need for splicing). Is this common to other organisms?

As studies in different organisms use various time points, we need a way to compare data both within and between time courses using a standard common metric: expression change per unit time.

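[Diagram: expression values E1, E2, E3 measured at times t1, t2, t3.]

$$R_{2-1} = \frac{E_2 - E_1}{t_2 - t_1}, \qquad R_{3-2} = \frac{E_3 - E_2}{t_3 - t_2}$$

$$R_{max} = \left| \max(R_{2-1}, R_{3-2}) \right|$$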
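The metric can be computed for a time course of any length with a few lines of Python. This sketch follows the formula literally as written on the slide (the absolute value of the largest rate); taking the largest absolute rate instead would be an alternative reading for strongly repressed genes. Function names are chosen for this example.

```python
def rates(expression, times):
    """Expression change per unit time between consecutive time points."""
    if len(expression) != len(times) or len(times) < 2:
        raise ValueError("need matching lists with at least two time points")
    pairs = list(zip(expression, times))
    return [(e2 - e1) / (t2 - t1) for (e1, t1), (e2, t2) in zip(pairs, pairs[1:])]


def r_max(expression, times):
    """Rmax as defined on the slide: abs(max(R2-1, R3-2, ...))."""
    return abs(max(rates(expression, times)))


# Example with log expression values at t = 0, 15 and 60 mins.
example = r_max([0.0, 3.0, 1.5], [0, 15, 60])
```

Because each rate is divided by its own time interval, time courses sampled at different intervals give values in the same units (expression change per minute here).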

Rmax for stress data against number of introns in pombe

Data is from Chen et al. (2003), for 5 different stresses with t = 0, 15, 60 mins.

Data is from 2-colour microarrays, so is relative expression levels, compared to t = 0.

Median of Rmax for all data

Correlation of max value against intron number using Spearman’s rank: P = 7.6 x 10^-6

Correlation of all values against intron number using Spearman’s rank: P < 2.2 x 10^-16
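For reference, Spearman's rank correlation is the Pearson correlation of the ranks, with tied values given their average rank. The sketch below computes only the coefficient, not the P values quoted above, and the function names are chosen for this example.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation coefficient between two lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

With intron number as x and Rmax as y, a negative coefficient would correspond to genes with more introns changing expression more slowly.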

Compare genes without introns against genes with introns

Compare two data sets using Wilcoxon (Mann-Whitney) non-parametric rank test: P < 2.2 x 10^-16
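The statistic behind this test can be sketched directly from its definition: count, over all pairs, how often a value from the first group exceeds a value from the second, with ties counting one half. The sketch computes only the U statistic, not the P value, and the function name is chosen for this example.

```python
def mann_whitney_u(sample_a, sample_b):
    """Mann-Whitney U statistic for sample_a against sample_b.

    Simple pairwise count, so run time grows with len(a) * len(b).
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u
```

Here sample_a could hold the Rmax values of genes without introns and sample_b those of genes with introns; a U close to len(sample_a) * len(sample_b) means the first group is almost always the larger.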

Compare with cell cycle data

Data from 3 elutriations of wt cells over 2 cell cycles (Rustici et al., 2004).

W (0 vs >0): <2.2 x 10^-16

S (all data): <2.2 x 10^-16

W (0 vs >0): 3.9 x 10^-5

S (all data): 4.4 x 10^-5

Compare with Arabidopsis stress data

Data from various stresses in Arabidopsis for both roots and shoots including drought, UV-B, cold, heat, genotoxic, salt, wounding and osmotic from ? et al.

Time points include 30, 60, 180, 360, 720, 1440 mins (plus 15 and 240 for some).

W (0 vs >0): <2.2 x 10^-16

S (all data): <2.2 x 10^-16

W (0 vs >0): <2.2 x 10^-16

S (all data): <2.2 x 10^-16

Considerations with Arabidopsis data

Data collected using Affymetrix chips, so get absolute expression levels.

Hence can use data that is from the absolute values or a ratio to t = 0 (to compare with the 2-colour pombe data).

Also they did a time course with control untreated plants, so can also compare stress data using ratios to the control time points.

W (0 vs >0): <2.2 x 10^-16

S (all data): <2.2 x 10^-16

W (0 vs >0): <2.2 x 10^-16

S (all data): <2.2 x 10^-16

W (0 vs >0): <2.2 x 10^-16

S (all data): <2.2 x 10^-16

Rmax for Arabidopsis data using different methods

Results from absolute values and from ratios to t = 0 look virtually identical, because logs of the expression values were taken.
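The reason can be written out in one line. The symbols $A_i$ (absolute expression level at time $t_i$, with $A_0$ the level at t = 0) and $E'_i$ (log ratio to t = 0) are introduced here only for illustration; $E_i = \log A_i$ is the log expression value used in the rate formula.

$$E'_i = \log\frac{A_i}{A_0} = E_i - E_0 \quad\Rightarrow\quad \frac{E'_2 - E'_1}{t_2 - t_1} = \frac{(E_2 - E_0) - (E_1 - E_0)}{t_2 - t_1} = \frac{E_2 - E_1}{t_2 - t_1} = R_{2-1}$$

The t = 0 term cancels in every difference, so the rates, and hence Rmax, should be the same whichever form of the data is used.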

Mouse stress data

2 Affymetrix data sets in GEO: GDS1015 (fetal bovine serum factor; Philippar et al., 2004) and GDS683 (oxidative stress; Madsen et al., 2004). GDS1015 has better time points: t = 0, 10, 30, 50, 180 mins. GDS683: t = 0, 15, 60, 480, 1008 mins.

Some genes have >100 introns, so genes are put into 22 equi-spaced bins (0-4, 5-9, 10-14, etc. introns).

W (0 introns vs >0 introns): - (mean/median less for 0 introns)

S (all data): - (positive gradient)

Similar poor stats for GDS683.

Use new metric for mouse data

As transcripts are generally very long in mouse, the time taken to transcribe the pre-mRNA is also going to be a factor, as well as the time taken to splice out introns.

Additionally, the number of introns correlates with transcript length for Arabidopsis and mouse (only get a small correlation in pombe).

Can use an alternative metric called intron density, which correlates positively with intron number and inversely with transcript length:

$$\text{Intron density} = \frac{\text{Number of introns}}{\text{Genomic length of transcript}}$$

Mouse stress data using intron densities

W (0 introns vs >0 introns): - (mean/median less)

S (all data): - (positive gradient)

W (<1/10th max vs >1/10th max): 4.0 x 10^-10

S (all data): 1.1 x 10^-6

As intron density is a continuous variable, put into 10 equi-spaced bins.
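A minimal sketch of the density calculation and the binning step is given below. It assumes the bins span 0 to the largest observed density, which appears to match the "<1/10th max vs >1/10th max" comparison (first bin against the rest); the function names are chosen for this example.

```python
def intron_density(n_introns, genomic_length_bp):
    """Number of introns per base pair of genomic transcript length."""
    if genomic_length_bp <= 0:
        raise ValueError("genomic_length_bp must be positive")
    return n_introns / genomic_length_bp


def equal_width_bin(value, max_value, n_bins=10):
    """0-based index of the equal-width bin between 0 and max_value.

    A value equal to max_value is placed in the last bin.
    """
    if max_value <= 0 or not 0 <= value <= max_value:
        raise ValueError("value must lie between 0 and a positive max_value")
    return min(int(value / max_value * n_bins), n_bins - 1)
```

Genes in bin 0 then form the low-density group and genes in bins 1 to 9 the high-density group for the two-group comparison.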

Check Arabidopsis stress data for trend with intron densities

W (0 vs >0): <2.2 x 10^-16

S (all data): <2.2 x 10^-16

W (<1/10 vs >1/10): <2.2 x 10^-16

S (all data): <2.2 x 10^-16

Intron density still significant for Arabidopsis and pombe.

Pombe stats: W (<1/10 vs >1/10): <2.2 x 10^-16; S (all data): <2.2 x 10^-16

Transcription and splicing kinetics

Transcription proceeds at 1200-1500 bp/minute (Izban & Luse, 1992).

Pombe: mean gene length ~1,500 bp, time to transcribe ~1 min, mean intron number ~1 per gene.

Arabidopsis: mean gene length ~1,900 bp, time ~1.5 mins, mean intron number ~4.4 per gene.

Mouse: mean gene length ~33,000 bp, time ~20 mins, mean intron number ~9 per gene.

Half-lives for splicing reactions can be considerably longer: under a minute for the first intron, but of the order of 2-8 mins for the second and third introns (Audibert et al., 2002).

Intron splicing may be the rate-limiting factor, since new spliceosomal ‘speckles’ form ~15-20 mins after gene activation in mammalian cells, and speckle morphology changes on the order of 5-7 mins (Misteli et al., 1997).

With time scales of this order, it appears that the assembly, splicing and release of the spliceosome may be limiting for rapid changes in gene expression.
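The transcription times follow from dividing the mean gene length by the elongation rate. The short sketch below reproduces that arithmetic for both ends of the quoted rate range; the names are chosen for this example. For mouse it gives roughly 22 to 28 minutes, the same order as the rounded ~20 mins quoted.

```python
# Elongation rate range and mean gene lengths as quoted in the slides.
RATE_BP_PER_MIN = (1200, 1500)
MEAN_GENE_LENGTH_BP = {"pombe": 1500, "Arabidopsis": 1900, "mouse": 33000}


def transcription_time_range(length_bp, rate_range=RATE_BP_PER_MIN):
    """Return (fastest, slowest) transcription time in minutes."""
    slow_rate, fast_rate = rate_range
    return length_bp / fast_rate, length_bp / slow_rate


for organism, length in MEAN_GENE_LENGTH_BP.items():
    low, high = transcription_time_range(length)
    print(f"{organism}: {low:.1f}-{high:.1f} min")
```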

Acknowledgements

Jürg Bähler – Supervisor

Valerie Wood – Suggested adding GO into YOGY

Daniel Jeffares – Intron data

Gavin Burns – Laboratory help for pombe and arrays

Luis López – Stationary phase mutant data

Matloob Qureshi – 96 to 384 well program

Juan Mata – Normalisation script

Zoë Birtles, James Morris – Summer students
