statistical bioinformatics

Statistical Bioinformatics • QTL mapping • Analysis of DNA sequence alignments • Postgenomic data integration • Systems biology

Upload: sheng

Post on 15-Jan-2016


DESCRIPTION

Statistical Bioinformatics: QTL mapping, analysis of DNA sequence alignments, postgenomic data integration, systems biology. Mixed models for QTL by environment analysis.

TRANSCRIPT

Page 1: Statistical Bioinformatics

Statistical Bioinformatics

• QTL mapping
• Analysis of DNA sequence alignments
• Postgenomic data integration
• Systems biology

Page 2: Statistical Bioinformatics

Statistical Bioinformatics

• QTL mapping
• Analysis of DNA sequence alignments
• Postgenomic data integration
• Systems biology

[Figure: profiles for rust ratio, U2453, U7845 and U6615 along the chromosome 2H marker map (markers from 2_0563 at 14.0 to 2_0561 at 233.8); vertical axis 0 to 100]

Page 3: Statistical Bioinformatics

Mixed models for QTL by environment analysis

Mixed models represent correlations across sites and model differences in environmental variance, which allows tests for QTL by environment interactions.

E.g. marker Rub2a1 on LG3 shows a consistent effect on raspberry total anthocyanins over 7 environments.

[Figure: total anthocyanin (vertical axis, -0.8 to 0.4) against Rub2a1 genotype (ac, ad, bc, bd) with average s.e.d., for seven environments: P2007Mean, PT2007Mean, F2006Mean, F2007Mean, Antho_poly_08, Antho_PT_08, Antho_M24_08]
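One way to write down a mixed model of the kind described on this slide is sketched below. The notation is an assumption made for illustration and is not taken from the slides: y_{ij} is the trait value of line i in environment j, g(i) is the marker genotype class of line i (for Rub2a1: ac, ad, bc or bd), E_j is the environment main effect, and \Sigma is the covariance matrix of the residuals across the J environments.

```latex
\begin{align*}
y_{ij} &= \mu + E_j + \alpha_{g(i)} + (\alpha E)_{g(i)\,j} + \varepsilon_{ij}, \\
(\varepsilon_{i1}, \dots, \varepsilon_{iJ})^{\top} &\sim N(\mathbf{0}, \Sigma),
\qquad \Sigma_{jj} = \sigma_j^2,
\qquad \Sigma_{jj'} = \rho_{jj'}\,\sigma_j\,\sigma_{j'} \quad (j \neq j'), \\
H_0 &: (\alpha E)_{g j} = 0 \ \text{for all } g, j
\quad \text{(no QTL by environment interaction).}
\end{align*}
```

In this sketch the environment-specific variances \sigma_j^2 capture differences in environmental variance and the correlations \rho_{jj'} capture correlations over sites. A marker with a consistent effect across environments, as reported for Rub2a1, would correspond to non-zero \alpha_g with H_0 not rejected.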

Page 4: Statistical Bioinformatics

Genetical genomics: QTL eQTL

Jansen & Nap

Page 5: Statistical Bioinformatics

eQTL analysis using pairs of barley DHs on a two-colour microarray

[Figure: profiles for rust ratio, U2453, U7845 and U6615 along the chromosome 2H marker map (markers from 2_0563 at 14.0 to 2_0561 at 233.8); vertical axis 0 to 100]

A distant pair design gives more informative pairs than a random design (horizontal line)

Significant (p < 0.001) QTLs were detected for 9557 out of 15208 genes

Most significant QTL for rust resistance mapped to 2H: 23 genes with highly correlated expression also mapped to the same region
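The idea behind a distant pair design can be illustrated with a small sketch. It assumes that a pair of lines on a two-colour array is "informative" at a marker when the two lines carry different parental alleles there. The greedy pairing heuristic, the function names and the toy genotypes are all hypothetical and are not the procedure used in the study.

```python
from itertools import combinations


def hamming(a, b):
    """Number of markers at which two lines carry different alleles."""
    return sum(x != y for x, y in zip(a, b))


def greedy_distant_pairs(genotypes):
    """Pair lines greedily, most dissimilar pair first (illustrative heuristic only)."""
    candidates = sorted(
        combinations(sorted(genotypes), 2),
        key=lambda pair: -hamming(genotypes[pair[0]], genotypes[pair[1]]),
    )
    used, pairs = set(), []
    for a, b in candidates:
        if a not in used and b not in used:
            pairs.append((a, b))
            used.update((a, b))
    return pairs


def informative_pairs_per_marker(genotypes, pairs):
    """For each marker, count the pairs whose two lines differ at that marker."""
    n_markers = len(next(iter(genotypes.values())))
    return [
        sum(genotypes[a][m] != genotypes[b][m] for a, b in pairs)
        for m in range(n_markers)
    ]


# Hypothetical toy data: 4 DH lines scored at 4 markers (0/1 = parental allele).
toy = {
    "DH1": [0, 0, 0, 0],
    "DH2": [0, 0, 0, 1],
    "DH3": [1, 1, 1, 1],
    "DH4": [1, 1, 1, 0],
}

distant = greedy_distant_pairs(toy)
similar = [("DH1", "DH2"), ("DH3", "DH4")]  # a poor pairing, for comparison

print(distant, informative_pairs_per_marker(toy, distant))
print(similar, informative_pairs_per_marker(toy, similar))
```

In the toy data the dissimilar pairing is informative at every marker, while the pairing of similar lines is informative only at the last one.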

Page 6: Statistical Bioinformatics

Taking QTL analysis further

• Analysis of more complex populations – moving from a single biparental cross through multiple related crosses to general association mapping populations.

• Analysis of high-dimensional phenotypic trait data (expression data, metabolomic data etc.), including network-based approaches

• QTL analysis of processes (raspberry ripening, water use? Process of biofuel production?)

• Linkage analysis: review statistical methods, especially clustering, behind some marker technologies. Analysis of blackcurrant (454 sequencing) and sugarcane (DArT) shows that more information can be obtained by working directly on the continuous underlying data (intensities).

Page 7: Statistical Bioinformatics

Statistical Bioinformatics

• QTL mapping
• Analysis of DNA sequence alignments
• Postgenomic data integration
• Systems biology

Page 8: Statistical Bioinformatics

Molecular Sequence Analysis

• Intragenic recombination detection - method: various methods developed at BioSS (DSS, PDM, HMM)

• TOPALi - software: user-friendly access to statistical phylogenetic methods

• Molecular sequence alignment - analysis automation: phylogenetic tree/model selection

• Positive (diversifying) selection - methods applied: use of state-of-the-art methodology for detection of functionally significant amino acid sites in proteins

• Comparative genomics analysis – growth area: phylogenetic tree estimation using many loci

• Population genetic structure analysis – growth area

• Optimal use of Next Generation Sequence data - development

Page 9: Statistical Bioinformatics

Statistical Bioinformatics

• QTL mapping
• Analysis of DNA sequence alignments
• Postgenomic data integration
• Systems biology

Page 10: Statistical Bioinformatics

Example: Human nutrigenomics study

10 volunteers observed over 10 time points

Various body fluids (blood, urine, saliva) collected

Samples analyzed by various ‘omics’ techniques

Page 11: Statistical Bioinformatics

Co-inertia analysis from metabolomic profiles of two samples: urine and plasma

Page 12: Statistical Bioinformatics

Statistical Bioinformatics

• QTL mapping
• Analysis of DNA sequence alignments
• Postgenomic data integration
• Systems biology

Page 13: Statistical Bioinformatics

Can we learn the signalling pathway from data?

From Sachs et al., Science 2005

[Figure legend: cell membrane; receptor molecules; inhibition; activation; interaction in signalling pathway; phosphorylated protein]

Page 14: Statistical Bioinformatics

Mechanistic models versus machine learning

[Figure panels: Bayesian network; differential equation model]

Page 15: Statistical Bioinformatics

Circadian rhythms in Arabidopsis thaliana

Collaboration with the Institute of Molecular Plant Sciences at Edinburgh University (Andrew Millar’s group)

Page 16: Statistical Bioinformatics

[Figure panels: T28 and T20]

Focus on 9 circadian genes: LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5 and PRR3

Two gene expression time series measured with Affymetrix arrays under constant light conditions at 13 time points: 0h, 2h, …, 24h, 26h

Plants entrained to different light:dark cycles: 10h:10h (T20) and 14h:14h (T28)

Page 17: Statistical Bioinformatics

Cogs of the Plant Clockwork

Morning genes

Evening genes

Page 18: Statistical Bioinformatics

Circadian genes in Arabidopsis thaliana, network learned from two time series over 13 time points

[Figure: learned network over the genes CCA1, LHY, PRR9, GI, ELF3, TOC1, ELF4, PRR5 and PRR3, with “false positives” and “false negatives” marked]

Page 19: Statistical Bioinformatics

Overview of the plant clock model

[Figure: clock model diagram with components LHY/CCA1, TOC1, X, Y (GI) and PRR9/PRR7, labelled Morning and Evening]

Locke et al. Mol. Syst. Biol. 2006

Sensitivity = TP/[TP+FN] = 62%

Specificity = TN/[TN+FP] = 81%

Page 20: Statistical Bioinformatics

Overview of the plant clock model

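[Figure: clock model diagram with components LHY/CCA1, TOC1, X, Y (GI) and PRR9/PRR7, labelled Morning and Evening; four interactions annotated “Yes” under the heading “Correct sign”]

Locke et al. Mol. Syst. Biol. 2006

Sensitivity = TP/[TP+FN] = 62%

Specificity = TN/[TN+FP] = 81%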

Page 21: Statistical Bioinformatics

Future work

• Integration of mechanistic and machine learning models

• Latent variable models for post-translational modifications

• Network inferences from eQTL type data

• Allowing for heterogeneity and non-stationarity

Page 22: Statistical Bioinformatics

Latent variable model for post-translational modifications

Page 23: Statistical Bioinformatics

Can we learn the protein signalling pathway from protein concentrations?

Raf pathway

Flow cytometry data from 100 cells

Sachs et al., Science 2005

Page 24: Statistical Bioinformatics

Predicted network

11 nodes, 20 edges, 90 non-edges

20 top-scoring edges: 15/20 correct (75%), 5/90 false (94% of non-edges correctly rejected)
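The percentages on this slide can be reproduced from the stated counts using the sensitivity and specificity formulas given earlier in the talk. The sketch below assumes that the 20 edges and 90 non-edges refer to the reference network, and that the 110 candidate edges are the ordered pairs of 11 distinct nodes. The variable names are illustrative.

```python
def sensitivity(tp, fn):
    """Fraction of true edges that were recovered: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-edges that were correctly left out: TN / (TN + FP)."""
    return tn / (tn + fp)


n_nodes = 11
true_edges = 20
n_candidates = n_nodes * (n_nodes - 1)  # assumption: ordered pairs, no self-loops
non_edges = n_candidates - true_edges   # 90, matching the slide

tp = 15                  # top-scoring edges that are in the reference network
fp = 5                   # top-scoring edges that are not
fn = true_edges - tp     # reference edges that were missed
tn = non_edges - fp      # non-edges correctly not predicted

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}")
```

Because the number of predicted edges equals the number of reference edges here, the fraction of predictions that are correct (15/20) coincides with the sensitivity.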