CANDID: A candidate gene identification tool, Part 2
Janna Hutz
March 26, 2007
Review
• Literature
– Well-characterized genes
• Protein domains
– All genes
• Cross-species conservation
– All genes
Today’s agenda
• Expression levels
• Linkage data
• Association data
• CANDID performance measures
Candidate lists vs. single candidates
• Candidate lists
– Complex trait or disease
– Disease with known heterogeneity
• Single candidates
– Mendelian trait
– New disease
– Disease with clear, well-defined pathology
Candidate lists vs. single candidates
• Microarray
• SNP typing
• Sequencing
• Immunocytochemistry
• Knockout model
ACT[A/G]GGA
Example 4
• Goiter - thyroid gland problem
• Iodine deficiency
• Genetic causes
Example 4
• Iodine is not supplied
• Iodine is present, but is not added to the molecule
• Which gene is mutated?
Expression data
• We know what tissue our gene is expressed in (thyroid).
• How can we use this knowledge to help identify the candidate?
• Wouldn’t it be nice if we had an expression database?
Expression databases
• Our ideal expression database would have:
– Expression data for the same genes across many different tissues
– As many tissues as possible
– As many genes as possible
– Good documentation
• Gene Atlas
Gene Atlas
• Genomics Institute of the Novartis Research Foundation
• 79 human tissues (160 samples)
• 2 arrays
– Affymetrix HG-U133A
– GNF1H (custom)
• 17,809 genes
Measure of gene expression
• Our thyroid gene:
– Gene that is brightest on the thyroid array?
– Gene that is brightest on the thyroid array, compared to all the other arrays.
[Figure labels: heart, brain, thyroid, lung]
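The difference between the two options can be sketched in a few lines. This is only an illustration: the intensity values and gene dictionaries are made up, and the ratio-to-maximum score is an assumed stand-in, since these slides do not give CANDID's actual expression formula.

```python
def relative_expression(intensities, tissue):
    """Return the tissue's intensity divided by the gene's maximum across all tissues.

    Assumed scoring rule for illustration only, not CANDID's documented formula.
    """
    peak = max(intensities.values())
    if peak <= 0:
        return 0.0
    return intensities[tissue] / peak


# Hypothetical array intensities for two made-up genes.
gene_a = {"heart": 120.0, "brain": 95.0, "thyroid": 2400.0, "lung": 150.0}
gene_b = {"heart": 3000.0, "brain": 2800.0, "thyroid": 2900.0, "lung": 3100.0}

score_a = relative_expression(gene_a, "thyroid")
score_b = relative_expression(gene_b, "thyroid")
```

Gene B is brighter than gene A on the thyroid array in absolute terms, but it is bright everywhere, so its relative score is lower. Gene A peaks in the thyroid and gets the top relative score.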
Measures of gene expression
• Run CANDID, specifying that we’re interested in the thyroid.
http://dsgweb.wustl.edu/llfs/secure_html/hutz/index.html
User name: workshop
Password: perl031907
• (We’ll need a tissue code for that.)
Example 4 - Results
• Our favorite genes:
• TP53 - rank is…
– 16314th
• KRAS - rank is…
– 5229th
• What genes are ranked most highly?
Example 4 - Results
• 192 genes with expression score of 1
• The TOP gene is actually responsible for the phenotype described earlier
– Its expression score = 1
Prior evidence
• I’m not interested in examining all of the genes in the genome - just some of them.
• Linkage and association
Linkage
• CANDID can:
– Weight regions with higher LOD scores
– Limit analysis to certain regions
– How does it do this?
Linkage scoring
Linkage score = (gene’s LOD score) / (maximum genome-wide LOD score)
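As a rough sketch, the ratio of a gene's LOD score to the maximum genome-wide LOD score can be computed as follows. The gene names and LOD values are hypothetical, and clamping negative LOD scores to 0 is an assumption of this sketch, not something stated on the slides.

```python
def linkage_scores(gene_lod):
    """Scale each gene's LOD score by the largest LOD score in the genome."""
    max_lod = max(gene_lod.values())
    if max_lod <= 0:
        # No positive linkage evidence anywhere: every gene scores 0.
        return {gene: 0.0 for gene in gene_lod}
    # Assumption: negative LOD scores are treated as 0 rather than penalized.
    return {gene: max(lod, 0.0) / max_lod for gene, lod in gene_lod.items()}


# Hypothetical per-gene LOD scores.
lods = {"GENE_A": 3.0, "GENE_B": 1.5, "GENE_C": 0.0}
scores = linkage_scores(lods)
```

The gene under the highest peak scores 1, and every other gene scores in proportion to its own LOD.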
Linkage files
• How does CANDID get this linkage information?
• CANDID takes two kinds of files
– Unformatted output from GENEHUNTER and MERLIN
– Custom linkage files
Custom linkage files
• Simple format
• Line 1 of the file must contain the word “custom” somewhere
• Subsequent lines:
Chromosome (tab) cM (tab) LOD score
• But how do I get cM positions?
Mapmaker
• Takes an input file in the format:
Chromosome (tab) basepair (tab) LOD score
• Outputs a new file in the format:
Chromosome (tab) cM (tab) LOD score
• Will be available on the CANDID website soon
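The slides do not say how Mapmaker converts basepairs to cM. One common approach, shown here purely as an assumption, is linear interpolation between flanking markers whose physical and genetic positions are both known. The marker map below is a toy example, not real data.

```python
from bisect import bisect_right


def bp_to_cm(bp, marker_map):
    """Interpolate a cM position for a basepair position.

    marker_map is a list of (basepair, cM) pairs sorted by basepair.
    Positions outside the map are clamped to the nearest end marker.
    """
    positions = [marker_bp for marker_bp, _ in marker_map]
    if bp <= positions[0]:
        return marker_map[0][1]
    if bp >= positions[-1]:
        return marker_map[-1][1]
    i = bisect_right(positions, bp)
    (bp0, cm0), (bp1, cm1) = marker_map[i - 1], marker_map[i]
    return cm0 + (bp - bp0) * (cm1 - cm0) / (bp1 - bp0)


# Toy marker map for one chromosome: (basepair, cM).
toy_map = [(1_000_000, 0.0), (3_000_000, 4.0), (5_000_000, 5.0)]
cm_mid = bp_to_cm(2_000_000, toy_map)
cm_late = bp_to_cm(4_000_000, toy_map)
```

Each input line's basepair column would be passed through a function like this, with the chromosome and LOD columns copied unchanged.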
Example 5
• Deletion on chromosome 13 between 23.65 cM and 25.08 cM.
• Phenotype: pancreatic cancer
Creating a custom linkage file
• Example:
custom
13 23.64 0
13 23.65 3
13 25.08 3
13 25.08 0
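A file like this can also be generated and checked with a short script. The sketch below assumes only the format described earlier (a first line containing "custom", then tab-separated chromosome, cM, and LOD score). The function names are hypothetical and are not part of CANDID.

```python
import io


def write_custom_linkage(rows, handle):
    """Write a header line containing 'custom', then one tab-separated row per point."""
    handle.write("custom\n")
    for chrom, cm, lod in rows:
        handle.write(f"{chrom}\t{cm}\t{lod}\n")


def read_custom_linkage(handle):
    """Parse a custom linkage file back into (chromosome, cM, LOD) tuples."""
    header = handle.readline()
    if "custom" not in header:
        raise ValueError("line 1 must contain the word 'custom'")
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        chrom, cm, lod = line.split("\t")
        rows.append((chrom, float(cm), float(lod)))
    return rows


# The Example 5 region: LOD 3 inside the deletion, 0 at its edges.
rows = [("13", 23.64, 0), ("13", 23.65, 3), ("13", 25.08, 3), ("13", 25.08, 0)]
buffer = io.StringIO()
write_custom_linkage(rows, buffer)
text = buffer.getvalue()
parsed = read_custom_linkage(io.StringIO(text))
```

Writing to a real file instead of `io.StringIO` only requires passing an open file handle.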
Running CANDID
1. Try running CANDID using only the linkage criterion.
2. Now, run CANDID with the linkage criterion and literature criterion (your choice of keywords)
• Linkage weight = 1000
• Literature weight = 1
Results
• From OMIM:
“Individuals with mutations in the BRCA2 gene, which predisposes to breast and ovarian carcinoma, have an increased risk of pancreatic cancer; germline mutations in BRCA2 are the most common inherited alteration identified in familial pancreatic cancer.”
But linkage is so last season…
Association
• Increasing numbers of association studies
• Increasing numbers of SNPs in each study
• Can CANDID use this information, too?
Association
• Database
– dbSNP - 11.8 million human SNPs
– Includes HapMap SNPs
– Most comprehensive
– Each SNP has a number prefixed with “rs”
Association
• How does CANDID accept association data?
• Custom file format - each line is:
rs# (tab) p-value
Association scoring
• For each gene, take the best p-value for that gene’s SNPs
• Subtract that p-value from 1
• Unless you test SNPs in every gene, this can be kind of unfair…
Association scoring
• Tested 10 genes
• Gene 9 has a best p-value of 0.8 (bad)
• Gene X was not tested
• Should Gene 9 get a higher overall score than Gene X?
p-value threshold
• User defines a p-value threshold
• Let’s say it’s 0.1.
• Any SNPs with p-values above 0.1 are not considered.
• Now Gene 9 and Gene X have the same score (0).
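The scoring rule and threshold can be sketched as below. The SNP identifiers and the SNP-to-gene mapping are hypothetical placeholders. CANDID takes that mapping from its own databases, which this sketch does not model.

```python
def association_scores(snp_pvalues, snp_to_gene, genes, threshold=0.1):
    """Score each gene as 1 minus its best (smallest) p-value at or below the threshold.

    Genes with no tested SNP, or no SNP passing the threshold, score 0.
    """
    best = {}
    for snp, p in snp_pvalues.items():
        if p > threshold:
            continue  # SNPs above the threshold are not considered
        gene = snp_to_gene.get(snp)
        if gene is None:
            continue  # SNP does not map to any gene in this toy mapping
        if gene not in best or p < best[gene]:
            best[gene] = p
    return {gene: (1.0 - best[gene] if gene in best else 0.0) for gene in genes}


# Hypothetical SNPs: Gene 9 has a poor best p-value, Gene X was never tested.
pvalues = {"rs_hypothetical_1": 0.8, "rs_hypothetical_2": 0.03}
snp_to_gene = {"rs_hypothetical_1": "Gene 9", "rs_hypothetical_2": "Gene 3"}
genes = ["Gene 3", "Gene 9", "Gene X"]

with_threshold = association_scores(pvalues, snp_to_gene, genes, threshold=0.1)
without_threshold = association_scores(pvalues, snp_to_gene, genes, threshold=1.0)
```

With the threshold at 0.1, Gene 9 and Gene X both score 0. With no effective threshold, Gene 9 would outscore the untested Gene X despite its poor p-value.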
Example 6
• Age-related Eye Disease Study
• Macular degeneration
Example 6
• Make custom association file
rs3753396 0.0444
rs543879 0.0494
rs7724788 0.75
• Run CANDID with this association file
Results
rs3753396 0.0444
rs543879 0.0494
rs7724788 0.75
} CFH
} SLC25A46
So just how well does this work anyway?
Preliminary evidence
• Online Mendelian Inheritance in Man
• 154 diseases linked to chromosome 1
• Literature, domains - chose keywords
• Conservation
• Expression - chose tissue codes
Ideal weights
• Tested all combinations of weights in those 4 categories
– Possible weights: (0, 0.1, … , 0.9, 1)
• Which weight combination was the best, across all 154 diseases?
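An exhaustive search of this kind might look like the sketch below: 11 possible weights in 4 categories gives 11^4 = 14,641 combinations. The weighted-sum scoring, the pessimistic tie handling, and the toy gene scores are all assumptions made for illustration, since the slides do not describe the actual evaluation code.

```python
from itertools import product

CATEGORIES = ("literature", "domains", "conservation", "expression")
WEIGHTS = [i / 10 for i in range(11)]  # 0, 0.1, ..., 1


def rank_of(target, gene_scores, weights):
    """Rank of the target gene by weighted sum; ties count against the target."""
    totals = {
        gene: sum(w * scores[c] for c, w in zip(CATEGORIES, weights))
        for gene, scores in gene_scores.items()
    }
    return sum(1 for total in totals.values() if total >= totals[target])


def best_weights(diseases):
    """Return (mean rank, weights) for the first combination with the lowest mean rank."""
    best = None
    for weights in product(WEIGHTS, repeat=len(CATEGORIES)):
        mean_rank = sum(rank_of(t, gs, weights) for t, gs in diseases) / len(diseases)
        if best is None or mean_rank < best[0]:
            best = (mean_rank, weights)
    return best


def scores(lit, dom, cons, expr):
    return dict(zip(CATEGORIES, (lit, dom, cons, expr)))


# Two toy "diseases", each a (known disease gene, per-gene category scores) pair.
diseases = [
    ("G1", {"G1": scores(0.9, 0.2, 0.5, 0.1),
            "G2": scores(0.1, 0.8, 0.5, 0.9),
            "G3": scores(0.2, 0.3, 0.6, 0.4)}),
    ("H1", {"H1": scores(0.7, 0.1, 0.2, 0.3),
            "H2": scores(0.3, 0.9, 0.4, 0.5)}),
]
n_combinations = len(WEIGHTS) ** len(CATEGORIES)
best = best_weights(diseases)
```

The toy data is contrived so that the known gene only stands out on literature, so the search settles on a literature-only combination. Real data would need the per-gene scores CANDID computes.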
Top 10 weight combinations
1. Literature = 1, everything else = 0
2. Literature = 0.9, everything else = 0
3. Literature = 0.8, everything else = 0
4. Literature = 0.7, everything else = 0
5. …
10. Literature = 0.1, everything else = 0
11. Literature = 1, domains = 0.1
More specifics
• Literature only: average ranking = 425
– 425/38697 = 98.9th percentile
– 44/154 genes ranked #1 for at least one set of weights
• Chromosome 1: average ranking = 22
– 22/2280 = 99th percentile
– 84/154 genes ranked #1 for at least one set of weights
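The percentile figures follow from the average rank and the number of genes considered. A quick check of the arithmetic, where the function name is just for illustration:

```python
def rank_percentile(rank, total):
    """Percentage of genes ranked below the given rank."""
    return 100.0 * (1.0 - rank / total)


genome_wide = rank_percentile(425, 38697)   # average rank among all genes
chromosome_1 = rank_percentile(22, 2280)    # average rank among chromosome 1 genes
```

Rounded to one decimal place, these give 98.9 and 99.0, matching the slide.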
Analysis of results
• They make a lot of sense.
• Genes in OMIM are, by definition, well-characterized.
• Many diseases are rare, with particular names or keywords that would only appear in papers about the disease genes.
Next steps
• Separate OMIM analysis into simple and complex traits
– Get new ideal weights
• See how well these ideal weights do in ranking candidates from chromosome 2.
Next steps
• CANDID’s databases were last compiled in November 2006.
• Find publications that have come out since then.
• How well does CANDID do in ranking those genes?
Next steps
• Many new whole-genome studies and microarray studies implicate lists of candidates.
• If CANDID analyzes those phenotypes, how significant is the overlap of CANDID’s top genes and those papers’ top genes?
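One standard way to put a number on such an overlap is a hypergeometric tail probability: the chance of seeing at least that many shared genes if both lists were drawn at random. This is an assumed choice of test, not one named in the slides, and the counts below are toy values.

```python
from math import comb


def overlap_pvalue(population, list_a, list_b, overlap):
    """P(at least `overlap` shared genes) when list_b is drawn at random.

    population: total number of genes considered
    list_a, list_b: sizes of the two gene lists
    """
    upper = min(list_a, list_b)
    total = comb(population, list_b)
    tail = sum(
        comb(list_a, k) * comb(population - list_a, list_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 20 genes in total, two top-5 lists.
p_all_shared = overlap_pvalue(20, 5, 5, 5)
p_any = overlap_pvalue(20, 5, 5, 0)
```

A small p-value would suggest that CANDID's top genes and a paper's top genes overlap more than chance alone explains.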
Next steps
• Any other suggestions?
• Any interesting data you have?
Any questions?
Acknowledgments
• Mike Province
• Howard McLeod
• Aldi Kraja
• Ingrid Borecki
• Qunyuan Zhang
• Ryan Christensen
• John Martin