Paper presentation @DILS'07
DESCRIPTION
Accelerating Disease Gene Identification Through Integrated SNP Data Analysis
TRANSCRIPT
![Page 1: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/1.jpg)
Accelerating Disease Gene Identification Through Integrated
SNP Data Analysis
Paolo Missier, S. Embury, C. Hedeler, M. Greenwood
School of Computer Science, University of Manchester, UK
J. Pennock, A. Brass
School of Biological Sciences, University of Manchester, UK
DILS ’07, Philadelphia, USA
![Page 2: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/2.jpg)
Combining the strengths of UMIST and The Victoria University of Manchester
Overall goal
– Add value to existing public SNP databases
– Support multiple experimental added-value SNP analysis packages
Core application: improving candidate gene selection in quantitative trait analysis
– Analysis of genetic factors in observed quantitative phenotypes
• resistance / susceptibility to a certain disease
• life span, weight, …
Build a flexible data infrastructure to support current biology research involving gene polymorphism (SNP)
![Page 3: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/3.jpg)
Example: study on a parasite worm
Trichuris trichiura Trichuris muris
• Same life cycle
• Natural parasite of mice
Genetic Component to Susceptibility to Trichuris trichiura: Evidence from Two Asian Populations. S. Williams-Blangero et al., Genetic Epidemiol. 2002, 22(5):254
"… 28% of the variation in Trichuris trichiura loads was attributable to genetic factors in both populations."
Figure: infection intensity (epg × 1000, scale 0 to 30) by age (4 to 15 years)
![Page 4: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/4.jpg)
Finding candidate genes
• Candidate gene determination is an area of active research
(A. Chakravarti. Population genetics – making sense out of sequence. Nature Genetics, 21(Suppl. 1), January 1999)
• Current methodology involves QTL mapping
– Experimental method to correlate quantitative phenotype with genotype
– Associates a region on the chromosome with a specific phenotype through complex inbreeding schemes
Figure labels: resistant, susceptible, mixed responders
![Page 5: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/5.jpg)
Figure: K statistic (scale 0 to 18) along chromosome 12 (cM) at markers D12Mit63, D12Mit136, D12Mit285, D12Mit201, D12Mit156, D12Mit52, D12Mit144
The challenge
• A QTL may contain hundreds or thousands of genes
• Quantitative phenotypes are often polygenic
• Determination of candidate genes is a difficult and slow process
Example QTL (chr 12)
Automation is needed to narrow the scope of the search to a manageable size
![Page 6: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/6.jpg)
SNPs and their role in QT analysis
SNP: Single Nucleotide Polymorphism
– single-base change in a strain relative to a reference strain (Mus musculus)
• Strategy
– Identify areas of greatest difference between resistant / susceptible strains
– Prioritize candidate gene search using the density of highly differentiated gene regions
Figure: number of individual SNPs (scale 0 to 900) per 1 Mbp division of chromosome 12, from 31-32 to 54-55 Mbp, with the priority region marked
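The density-based prioritization can be sketched as a simple binning step. This is a minimal illustration under stated assumptions, not the SNPit implementation: the function names, the `(position, score)` pair layout, and the 0.8 cut-off for a "high-score" SNP are all hypothetical.

```python
from collections import Counter

MBP = 1_000_000  # bin width: 1 Mbp divisions, as in the chromosome 12 plot


def high_score_density(snps, threshold=0.8, bin_size=MBP):
    """Count high-scoring SNPs per bin.

    snps: iterable of (position_bp, score) pairs (hypothetical layout).
    threshold: assumed minimum score for a SNP to count as "high-score".
    Returns a dict mapping bin index (the bin start in Mbp when bin_size
    is 1 Mbp) to the number of high-score SNPs in that bin.
    """
    counts = Counter()
    for position, score in snps:
        if score >= threshold:
            counts[position // bin_size] += 1
    return dict(counts)


def priority_bins(density, top=3):
    """Return the `top` bins with the most high-score SNPs, densest first."""
    return sorted(density, key=lambda b: (-density[b], b))[:top]


# Made-up positions and scores, for illustration only.
snps = [(31_200_000, 0.9), (31_700_000, 1.0), (32_100_000, 0.4), (45_500_000, 0.85)]
density = high_score_density(snps)
print(density)
print(priority_bins(density, top=1))
```

Bins with the highest counts are the natural candidates for a priority region; the bin width and cut-off would be tuned per study.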
![Page 7: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/7.jpg)
SNP informativeness
• Rank SNPs according to strain differences
• Strain allele: nucleotide base replacement for a SNP observed in a single strain
Strain group 1 (resistant)
Strain group 2 (susceptible)
Group score model:
• Compare susceptible strains vs resistant strains
Perfect score:
• Disjoint sets of alleles
• No missing alleles
![Page 8: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/8.jpg)
Group strain score model
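For each SNP:
Strains: $S_1 = \{s_1, \ldots, s_n\}$, $S_2 = \{s'_1, \ldots, s'_m\}$
Corresponding alleles: $A_1 = \{a_1, \ldots, a_n\}$, $A_2 = \{a'_1, \ldots, a'_m\}$
Common, distinct non-null alleles: $m = |A_1 \cap A_2|$
Distinct non-null alleles in $A_1$, $A_2$: $n$
Base score: $gs_0(A_1, A_2) = 1 - \frac{m}{n}$
Penalties for null alleles ($N$): $p_1$ over the $a_j \in A_1$, relative to $|A_1|$; $p_2$ over the $a_j \in A_2$, relative to $|A_2|$
Adjusted score: $gs_1(A_1, A_2)$, obtained from $gs_0(A_1, A_2)$ and the penalties $p_1$, $p_2$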
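A minimal Python sketch of the base group score, reading m as the number of distinct non-null alleles common to both groups and n as the number of distinct non-null alleles across both groups. The null marker `"N"`, the function names, and the reading of each penalty as the fraction of null alleles in a group are assumptions; how the penalties enter the adjusted score is deliberately left out.

```python
NULL = "N"  # assumed marker for a missing (null) strain allele


def base_group_score(alleles_1, alleles_2):
    """gs0 = 1 - m/n over distinct non-null alleles.

    m: distinct alleles common to both groups.
    n: distinct alleles across both groups.
    Returns 0.0 when neither group has a known allele (a choice made here).
    """
    known_1 = set(alleles_1) - {NULL}
    known_2 = set(alleles_2) - {NULL}
    n = len(known_1 | known_2)
    if n == 0:
        return 0.0
    m = len(known_1 & known_2)
    return 1 - m / n


def null_fraction(alleles):
    """Fraction of strains in a group whose allele is missing (assumed penalty)."""
    if not alleles:
        return 0.0
    return sum(1 for a in alleles if a == NULL) / len(alleles)


resistant = ["A", "A", "N"]    # made-up alleles, one per strain
susceptible = ["A", "G", "G"]
print(base_group_score(resistant, susceptible))   # m = 1, n = 2
print(null_fraction(resistant), null_fraction(susceptible))
```

Two groups with disjoint allele sets and no null alleles reach the perfect base score of 1.0, for example `["A", "A"]` against `["G", "G"]`.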
![Page 9: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/9.jpg)
Example
$m = 1$, $n = 2$
$gs_1(A_1, A_2)$: $1 - \frac{m}{n} = \frac{1}{2}$, adjusted by the penalties $p_1$, $p_2$
![Page 10: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/10.jpg)
Score model performance
• No standard test dataset
• Criteria: evaluate ranking of polymorphic genes
– Based on known candidate genes for HDL (cholesterol) QTL regions
• From SNP scores to gene scores:
High-score SNP density / total SNP density
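The step from SNP scores to gene scores can be sketched as below, assuming both densities are measured over the same gene, so that the gene length cancels and the ratio reduces to a count ratio. The 0.8 cut-off, the function names, and the gene names in the usage lines are hypothetical.

```python
def gene_score(snp_scores, threshold=0.8):
    """Ratio of high-score SNP density to total SNP density within one gene.

    Both densities are taken over the same gene length, so the ratio equals
    (number of high-score SNPs) / (number of SNPs).
    threshold is an assumed cut-off for "high-score".
    """
    if not snp_scores:
        return 0.0
    high = sum(1 for s in snp_scores if s >= threshold)
    return high / len(snp_scores)


def rank_genes(scores_by_gene, threshold=0.8):
    """Return gene names sorted by descending gene score."""
    return sorted(
        scores_by_gene,
        key=lambda g: gene_score(scores_by_gene[g], threshold),
        reverse=True,
    )


# Made-up scores for two hypothetical genes.
scores_by_gene = {"geneA": [0.1, 0.2], "geneB": [0.9, 1.0, 0.3]}
print(rank_genes(scores_by_gene))
```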
![Page 11: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/11.jpg)
Score selectivity
HDL data on Perlegen
{ CAST/EiJ, C57BL/6J } vs { C3H/HeJ, FVB/NJ }
7,090 / 101,896 = 6.9%
Translates to < 20 candidate genes
![Page 12: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/12.jpg)
The SNPit project
• A "lightweight" SNP database designed to support genetic research
– gene identification in QTLs is one application
– Hope to answer a broader array of research questions beyond QTL analysis
• SNPit is a secondary DB
– Primary sources: Ensembl (EBI), dbSNP (NCBI), Perlegen
• Others available, not considered in this study
– MGD – see Nucl. Acids Res., 35(Database issue), 2007
– UCSC – see Nucl. Acids Res., 35(Database issue), 2007
– Wellcome-CTC Mouse Strain SNP Genotype Set
– MPD – Mouse Phenome Database
![Page 13: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/13.jpg)
SNPit application challenges
Supports interactive exploratory analysis over large regions
• Over 50 Mb – 200K SNPs/source/session
Typical flow:
1. Region selection (or gene set)
2. Source selection (multiple)
3. Strain group selection – per-session basis
4. Compute score for each SNP in the region – on the fly
5. (filter by gene polymorphism)
6. Rank SNPs by score, gene polymorphism – in-memory sorting
7. Plot density of high-score SNPs over the selected region
• Change parameters and repeat…
Response times are typically within 30 s on a Tomcat deployment running on a high-end server with a co-located DBMS (MySQL)
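Steps 1 to 6 of the typical flow can be sketched as a single in-memory pass. This is only an illustration under assumptions: the `Snp` record, the `in_gene` flag standing in for the gene polymorphism filter, the `"N"` null marker, and the stand-in `disjoint_score` function are hypothetical, and source selection is represented simply by which SNP records are passed in.

```python
from dataclasses import dataclass

NULL = "N"  # assumed marker for a strain with no recorded allele


@dataclass(frozen=True)
class Snp:
    snp_id: str
    position: int   # base-pair coordinate
    in_gene: bool   # hypothetical stand-in for the gene polymorphism filter
    alleles: dict   # strain name -> allele


def disjoint_score(alleles_1, alleles_2):
    """Stand-in score: 1.0 if the groups share no known allele, else 0.0."""
    known_1 = set(alleles_1) - {NULL}
    known_2 = set(alleles_2) - {NULL}
    return 0.0 if known_1 & known_2 else 1.0


def rank_region(snps, start, end, group_1, group_2, score_fn, genes_only=False):
    """Select a region, score each SNP on the fly, filter, and sort in memory."""
    ranked = []
    for snp in snps:
        if not (start <= snp.position < end):
            continue  # step 1: region selection
        if genes_only and not snp.in_gene:
            continue  # step 5: optional filter
        a1 = [snp.alleles.get(s, NULL) for s in group_1]  # step 3: strain groups
        a2 = [snp.alleles.get(s, NULL) for s in group_2]
        ranked.append((score_fn(a1, a2), snp.snp_id))      # step 4: score
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))      # step 6: rank
    return ranked


# Made-up records, for illustration only.
snps = [
    Snp("rs1", 100, True, {"s1": "A", "s2": "G"}),
    Snp("rs2", 200, False, {"s1": "A", "s2": "A"}),
    Snp("rs3", 900, True, {"s1": "C", "s2": "T"}),
]
print(rank_region(snps, 0, 500, ["s1"], ["s2"], disjoint_score))
```

Because the strain groups are chosen per session, the scores cannot be precomputed, which is why the scoring function is passed in and applied at query time.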
![Page 14: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/14.jpg)
Why multiple SNP DBs
• SNP databases differ
– Partially overlap in structure and content
– Different update policy and frequency
• Biologists like to choose their sources
– Based on experience, prior usage, confidence
• The SNPit application offers an explicit choice
• It exploits complementary features and content of the DBs
![Page 15: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/15.jpg)
Data architecture
Diagram: Ensembl SNP, dbSNP, and Perlegen are each loaded into the SNPit DB (periodic updates), with records linked through rsId and ssId. Core data processing computes the scores (Score 1, Score 2, …). The SNPit DB serves the SNPit web app and the SNPit Web Service.
Interdependent materialized views
• no single global schema
• queries against the views
• some queries can be directed to more than one DB
• End-user web app
• Web Service accessible as a workflow processor (Taverna)
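The view-based access pattern can be illustrated with Python's built-in `sqlite3` module. This is a sketch under assumptions, not the SNPit schema: all table, view, and column names are hypothetical, the rows are made up, and plain SQLite views stand in for materialized views, which SQLite does not provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table per source, each keeping its own columns (no global schema).
    CREATE TABLE ensembl_snp  (rs_id TEXT, position INTEGER, location TEXT);
    CREATE TABLE perlegen_snp (ss_id TEXT, position INTEGER);

    -- Views expose only what the application queries need.
    CREATE VIEW ensembl_region_view  AS
        SELECT rs_id AS snp_id, position FROM ensembl_snp;
    CREATE VIEW perlegen_region_view AS
        SELECT ss_id AS snp_id, position FROM perlegen_snp;
""")
conn.executemany(
    "INSERT INTO ensembl_snp VALUES (?, ?, ?)",
    [("rs1", 100, "exonic"), ("rs2", 900, "intronic")],
)
conn.executemany("INSERT INTO perlegen_snp VALUES (?, ?)", [("ss7", 150)])

# Fixed mapping from source name to view, so the view name is never user input.
VIEWS = {"ensembl": "ensembl_region_view", "perlegen": "perlegen_region_view"}


def snps_in_region(conn, sources, start, end):
    """Run the same region query against the view of every selected source."""
    rows = []
    for source in sources:
        query = (
            f"SELECT snp_id, position FROM {VIEWS[source]} "
            "WHERE position BETWEEN ? AND ? ORDER BY position"
        )
        rows += [(source, snp_id, pos) for snp_id, pos in conn.execute(query, (start, end))]
    return rows


print(snps_in_region(conn, ["ensembl", "perlegen"], 0, 500))
```

Keeping one view per source lets the user's explicit source choice decide which views a query touches, without forcing the sources into one merged schema.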
![Page 16: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/16.jpg)
SNPit access from a workflow
![Page 17: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/17.jpg)
SNP DB dependencies
(all figures relative to chromosome 12)
Diagram: primary sources (Sanger Institute; public submissions from multiple sources) send updates to Ensembl Mouse (407,000), NCBI dbSNP, and Perlegen. The three sources are joined on rsId and on ssId and loaded into SNPit, which holds SNPs, strain alleles, and SNP provenance. Counts shown: totals of 407,000 and 420,000; 147,000; 146,000; 14,000; 133,000; 132,000.
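The joins on rsId and ssId can be sketched with plain dictionaries. The sketch assumes that dbSNP supplies the (rsId, ssId) pairs that bridge Ensembl records (keyed by rsId) and Perlegen records (keyed by ssId); that bridging role, the function name, and the toy records are assumptions made for illustration.

```python
def link_sources(ensembl, dbsnp_pairs, perlegen):
    """Link Ensembl and Perlegen records through dbSNP identifiers.

    ensembl:     dict rsId -> Ensembl record
    dbsnp_pairs: iterable of (rsId, ssId) pairs
    perlegen:    dict ssId -> Perlegen record
    Returns a list of (rsId, ssId, ensembl_record, perlegen_record) for
    every pair present in both sources.
    """
    linked = []
    for rs_id, ss_id in dbsnp_pairs:
        if rs_id in ensembl and ss_id in perlegen:
            linked.append((rs_id, ss_id, ensembl[rs_id], perlegen[ss_id]))
    return linked


# Made-up identifiers and records.
ensembl = {"rs1": {"location": "exonic"}, "rs2": {"location": "intronic"}}
perlegen = {"ss7": {"strains": 16}}
dbsnp_pairs = [("rs1", "ss7"), ("rs2", "ss9"), ("rs3", "ss7")]
print(link_sources(ensembl, dbsnp_pairs, perlegen))
```

Pairs that resolve in only one source are dropped here; a real loader would more likely keep them and record their provenance.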
![Page 18: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/18.jpg)
Qualitative differences

| Source | Strengths | Weaknesses | Strain info |
| --- | --- | --- | --- |
| Ensembl | Curated SNPs; evolving; SNP location info (exonic, intronic); multiple reputable sources; controlled submission | Low timeliness | About 60 strains; not very complete |
| dbSNP | Submitter info; update history (provenance); multiple sources; timely | Low quality control on public submission | Not used |
| Perlegen | Good quality control; high reputation | No SNP location; not evolving | 16 strains (ref + 15); fairly complete |
![Page 19: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/19.jpg)
Missing strains – chr 17
Figure: % SNPs with available strain allele, by strain (Perlegen). Strains: C57BL/6J, KK/HlJ, BTBR T+ tf/J, NZW/LacJ, AKR/J, A/J, WSB/EiJ, DBA/2J, BALB/cByJ, C3H/HeJ, 129S1/SvImJ, NOD/LtJ, FVB/NJ, CAST/EiJ, MOLF/EiJ, PWD/PhJ. Y-axis: % available allele.
Figure: % SNPs with available strain allele, by strain (Ensembl). Strains: A/J, C57BL/6J, DBA/2J, 129X1/SvJ, 129S1/SvImJ, AKR/J, C3H/HeJ, KK/HlJ, NZW/LacJ, BALB/cByJ, BTBR T+ tf/J, WSB/EiJ, NOD/LtJ, MOLF/EiJ, FVB/NJ, PWD/PhJ, CAST/EiJ, MSM/Ms, NOD/DIL, CZECHII/Ei, SPRET/Ei. Y-axis: available allele info.
![Page 20: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/20.jpg)
Effect of source selection
Figure panels: Ensembl, Perlegen
![Page 21: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/21.jpg)
Summary
• SNPit complements current methodologies for candidate gene discovery in QTL regions
– Helps focus on promising genes
– Automates SNP analysis over large regions
• View-based, loose integration of three prominent DBs
• Original score models
– More study needed to exploit other features
• SNP location, submitter info, revision frequency…
• Can be invoked from workflows
– As part of larger in silico experiments
• Plan to release SNPit as a public Web Service
![Page 22: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/22.jpg)
![Page 23: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/23.jpg)
SNPs and their role in QT analysis
• SNP: Single Nucleotide Polymorphism
– single-base change in a strain relative to a reference strain (Mus musculus)
• Inbred strains are genetically similar
• The arrangement of SNPs across the mouse genome falls into blocks which are common among strains (haplotypes)
• e.g. the C57 strain (susceptible) differs from the A/J and BALB/c strains (resistant)
![Page 24: Paper presentation @DILS'07](https://reader036.vdocuments.net/reader036/viewer/2022062418/55506699b4c905c0448b547d/html5/thumbnails/24.jpg)
DB overlaps
Venn diagram (chromosome 17): Perlegen 291,718; Ensembl 253,862; dbSNP. Other counts shown: 50,564; 122,938; 105,265; 243,702.