
CANDID: A candidate gene identification tool

Part 2

Janna Hutz


March 26, 2007

Review

• Literature
  – Well-characterized genes

• Protein domains
  – All genes

• Cross-species conservation
  – All genes

Today’s agenda

• Expression levels

• Linkage data

• Association data

• CANDID performance measures

Candidate lists vs. single candidates

• Candidate lists
  – Complex trait or disease
  – Disease with known heterogeneity

• Single candidates
  – Mendelian trait
  – New disease
  – Disease with clear, well-defined pathology

Candidate lists vs. single candidates

• Microarray

• SNP typing

• Sequencing

• Immunocytochemistry

• Knockout model

ACT[A/G]GGA

Example 4

• Goiter - thyroid gland problem

• Iodine deficiency

• Genetic causes

Example 4

• Iodine is not supplied

• Iodine is present, but is not added to the molecule

• Which gene is mutated?

Expression data

• We know what tissue our gene is expressed in (thyroid).

• How can we use this knowledge to help identify the candidate?

• Wouldn’t it be nice if we had an expression database?

Expression databases

• Our ideal expression database would have:
  – Expression data for the same genes across many different tissues
  – As many tissues as possible
  – As many genes as possible
  – Good documentation

• Gene Atlas

Gene Atlas

• Genomics Institute of the Novartis Research Foundation

• 79 human tissues (160 samples)

• 2 arrays
  – Affymetrix HG-U133A
  – GNF1H (custom)

• 17,809 genes

Measure of gene expression

• Our thyroid gene:
  – Gene that is brightest on the thyroid array?
  – Gene that is brightest on the thyroid array, compared to all the other arrays.

[Figure: arrays for heart, brain, thyroid, lung]
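The difference between "brightest on the thyroid array" and "brightest on the thyroid array compared to all the other arrays" can be sketched in a few lines of Python. The formula here (expression in the chosen tissue divided by the gene's highest expression in any tissue) is an assumption for illustration, not necessarily the measure CANDID uses, and the expression values are made up.

```python
def expression_score(levels, tissue):
    """Relative expression: level in `tissue` divided by the gene's peak level.

    ASSUMPTION: this ratio is an illustrative measure, not CANDID's documented one.
    `levels` maps tissue name -> expression level for a single gene.
    """
    peak = max(levels.values())
    if peak <= 0:
        return 0.0
    return max(levels[tissue], 0.0) / peak


# Made-up expression levels for two hypothetical genes.
gene_a = {"heart": 120.0, "brain": 90.0, "thyroid": 2400.0, "lung": 150.0}
gene_b = {"heart": 3000.0, "brain": 2800.0, "thyroid": 2500.0, "lung": 2900.0}

# gene_b is brighter on the thyroid array in absolute terms (2500 > 2400),
# but gene_a is the one that is specific to thyroid.
print(expression_score(gene_a, "thyroid"))
print(expression_score(gene_b, "thyroid"))
```

Under this assumed measure, a gene scores 1 whenever the tissue of interest is the tissue where it is most highly expressed, regardless of absolute brightness.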

Measures of gene expression

• Run CANDID, specifying that we’re interested in the thyroid.

http://dsgweb.wustl.edu/llfs/secure_html/hutz/index.html

User name: workshop
Password: perl031907

• (We’ll need a tissue code for that.)

http://www.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html

Example 4 - Results

• Our favorite genes:

• TP53 - rank is…
  – 16314th

• KRAS - rank is…
  – 5229th

• What genes are ranked most highly?

Example 4 - Results

• 192 genes with expression score of 1

• The TOP gene is actually responsible for the phenotype described earlier
  – Its expression score = 1

Prior evidence

• I’m not interested in examining all of the genes in the genome - just some of them.

• Linkage and association

Linkage

• CANDID can:

– Weight regions with higher LOD scores

– Limit analysis to certain regions

– How does it do this?

Linkage scoring

Linkage score = (gene’s LOD score) / (maximum genome-wide LOD score)
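As a minimal sketch of that ratio in Python: each gene's LOD score is divided by the largest LOD score seen anywhere in the genome, so the best-linked region scores 1. The gene names and LOD values are hypothetical, and treating negative LOD scores as 0 is an assumption made here, not something the slides state.

```python
def linkage_score(gene_lod, max_genome_lod):
    """Gene's LOD score divided by the maximum genome-wide LOD score.

    ASSUMPTION: negative LOD scores are clamped to 0 so the result stays in [0, 1].
    """
    if max_genome_lod <= 0:
        return 0.0
    return max(gene_lod, 0.0) / max_genome_lod


# Hypothetical genes and LOD scores, for illustration only.
lods = {"GENE_A": 3.0, "GENE_B": 1.5, "GENE_C": -0.4}
max_lod = max(lods.values())
scores = {gene: linkage_score(lod, max_lod) for gene, lod in lods.items()}
print(scores)
```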

Linkage files

• How does CANDID get this linkage information?

• CANDID takes two kinds of files
  – Unformatted output from GENEHUNTER and MERLIN
  – Custom linkage files

Custom linkage files

• Simple format

• Line 1 of the file must contain the word “custom” somewhere

• Subsequent lines:

Chromosome (tab) cM (tab) LOD score

• But how do I get cM positions?
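The custom format is simple enough to check before handing a file to CANDID. The reader below is a sketch under stated assumptions: the function name is hypothetical, the "custom" check is case-insensitive by choice, and blank lines are skipped. It is not CANDID's own parser.

```python
def parse_custom_linkage(text):
    """Parse a custom linkage file into (chromosome, cM, LOD) tuples.

    ASSUMPTIONS: case-insensitive match on "custom"; blank lines ignored.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or "custom" not in lines[0].lower():
        raise ValueError('line 1 must contain the word "custom"')
    points = []
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"data line {number}: expected 3 tab-separated fields")
        chrom, cm, lod = fields
        points.append((chrom.strip(), float(cm), float(lod)))
    return points


# Made-up rows for illustration.
sample = "custom\n1\t10.5\t0.2\n1\t11.0\t2.7\n"
print(parse_custom_linkage(sample))
```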

Mapmaker

• Takes an input file in the format:
Chromosome (tab) basepair (tab) LOD score

• Outputs a new file in the format:
Chromosome (tab) cM (tab) LOD score

• Will be available on the CANDID website soon
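One common way to turn a basepair position into a cM position is linear interpolation between markers whose physical and genetic positions are both known. The sketch below shows that idea only. It is not the Mapmaker utility described above, and the marker map in it is made up.

```python
from bisect import bisect_right


def bp_to_cm(bp, genetic_map):
    """Interpolate a cM position for `bp` on one chromosome.

    `genetic_map` is a list of (basepair, cM) pairs sorted by strictly
    increasing basepair. Positions outside the map are clamped to its ends.
    ASSUMPTION: linear interpolation between flanking markers.
    """
    positions = [pos for pos, _ in genetic_map]
    if bp <= positions[0]:
        return genetic_map[0][1]
    if bp >= positions[-1]:
        return genetic_map[-1][1]
    i = bisect_right(positions, bp)  # first marker strictly to the right of bp
    (bp0, cm0), (bp1, cm1) = genetic_map[i - 1], genetic_map[i]
    return cm0 + (cm1 - cm0) * (bp - bp0) / (bp1 - bp0)


# Hypothetical marker map: (basepair, cM).
toy_map = [(1_000_000, 0.0), (2_000_000, 2.0), (4_000_000, 3.0)]
print(bp_to_cm(1_500_000, toy_map))
print(bp_to_cm(3_000_000, toy_map))
```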

Example 5

• Pancreatic cancer

• Deletion on chromosome 13 between 23.65 cM and 25.08 cM.

Creating a custom linkage file

• Example:

custom

13 23.64 0

13 23.65 3

13 25.08 3

13 25.08 0
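A file like this can also be generated from a region and a LOD value. The helper below is a hypothetical convenience, not part of CANDID: it writes a LOD of 0 just before the region, the chosen LOD at both ends, and 0 again at the end position, mirroring the rows of the example. The 0.01 cM step is taken from the example.

```python
def region_to_custom_linkage(chrom, start_cm, end_cm, lod, step=0.01):
    """Build custom linkage file text that flags one region with a flat LOD score.

    ASSUMPTION: the row layout copies the slide example; `step` is the gap
    between the leading zero row and the start of the region.
    """
    rows = [
        (start_cm - step, 0),  # just before the region: no linkage
        (start_cm, lod),       # region start
        (end_cm, lod),         # region end
        (end_cm, 0),           # drop back to zero, as in the example
    ]
    lines = ["custom"]
    for cm, score in rows:
        lines.append(f"{chrom}\t{cm:.2f}\t{score}")
    return "\n".join(lines) + "\n"


linkage_text = region_to_custom_linkage(13, 23.65, 25.08, 3)
print(linkage_text, end="")
```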

Running CANDID

1. Try running CANDID using only the linkage criterion.

2. Now, run CANDID with the linkage criterion and literature criterion (your choice of keywords)

• Linkage weight = 1000

• Literature weight = 1
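To see why a linkage weight of 1000 against a literature weight of 1 effectively restricts the ranking to the linked region, consider a weighted average of per-criterion scores. How CANDID actually combines criteria is not stated in these slides, so the formula below is an assumption, and the scores are made up.

```python
def combined_score(scores, weights):
    """Weighted average of per-criterion scores (each score in [0, 1]).

    ASSUMPTION: a weighted average is used for illustration; CANDID's real
    combination rule may differ.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    return sum(w * scores.get(name, 0.0) for name, w in weights.items()) / total_weight


weights = {"linkage": 1000, "literature": 1}

# Hypothetical genes: one inside the linked region with few papers,
# one outside the region with many papers.
inside_region = {"linkage": 1.0, "literature": 0.2}
outside_region = {"linkage": 0.0, "literature": 1.0}

print(combined_score(inside_region, weights))
print(combined_score(outside_region, weights))
```

With these weights, every gene in the region outranks every gene outside it, and the literature score only orders genes within the region.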

Results

• From OMIM:

“Individuals with mutations in the BRCA2 gene, which predisposes to breast and ovarian carcinoma, have an increased risk of pancreatic cancer; germline mutations in BRCA2 are the most common inherited alteration identified in familial pancreatic cancer.”

But linkage is so last season…

Association

• Increasing numbers of association studies

• Increasing numbers of SNPs in each study

• Can CANDID use this information, too?

Association

• Database
  – dbSNP - 11.8 million human SNPs
  – Includes HapMap SNPs
  – Most comprehensive
  – Each SNP has a number prefixed with “rs”

Association

• How does CANDID accept association data?

• Custom file format - each line is:

rs# (tab) p-value
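A small reader for this format might look like the following sketch. The function name and the validation rules (ID must start with "rs", p-value must lie in [0, 1]) are assumptions made for illustration, and the sample rs numbers are placeholders rather than real SNPs.

```python
def parse_association_file(text):
    """Parse lines of 'rs# <tab> p-value' into a dict {rs_id: p_value}.

    ASSUMPTIONS: blank lines are skipped; IDs must start with "rs";
    p-values must be between 0 and 1.
    """
    pvalues = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected rs# <tab> p-value")
        rs_id, p = fields[0].strip(), float(fields[1])
        if not rs_id.startswith("rs"):
            raise ValueError(f"line {number}: SNP ID must start with 'rs'")
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"line {number}: p-value must be in [0, 1]")
        pvalues[rs_id] = p
    return pvalues


# Placeholder SNP IDs and p-values.
sample_assoc = "rs1\t0.03\nrs2\t0.8\n"
print(parse_association_file(sample_assoc))
```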

Association scoring

• For each gene, take the best p-value for that gene’s SNPs

• Subtract that p-value from 1

• Unless you test SNPs in every gene, this can be kind of unfair…

Association scoring

• Tested 10 genes

• Gene 9 has a best p-value of 0.8 (bad)

• Gene X was not tested

• Should Gene 9 get a higher overall score than Gene X?

p-value threshold

• User defines a p-value threshold

• Let’s say it’s 0.1.

• Any SNPs with p-values above 0.1 are not considered.

• Now Gene 9 and Gene X have the same score (0).
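The scoring rule with a threshold can be sketched as follows: drop SNPs whose p-value is above the threshold, take the best remaining p-value per gene, and subtract it from 1. Genes with no remaining SNPs, tested or not, score 0. The SNP-to-gene mapping and the SNP names are hypothetical; in practice that mapping would come from a database such as dbSNP.

```python
def association_scores(snp_pvalues, snp_to_gene, genes, threshold=0.1):
    """Score each gene as 1 - (best p-value at or below `threshold`), else 0.

    ASSUMPTION: `snp_to_gene` is a ready-made lookup; SNPs missing from it
    are ignored.
    """
    best = {}
    for rs_id, p in snp_pvalues.items():
        gene = snp_to_gene.get(rs_id)
        if gene is None or p > threshold:
            continue  # unmapped SNP, or p-value above the threshold
        best[gene] = min(p, best.get(gene, 1.0))
    return {gene: (1.0 - best[gene] if gene in best else 0.0) for gene in genes}


# Placeholder SNPs: Gene 1 has two tested SNPs, Gene 9 has one poor p-value,
# Gene X was not tested at all.
snp_pvalues = {"rs_a": 0.03, "rs_b": 0.8, "rs_c": 0.09}
snp_to_gene = {"rs_a": "Gene 1", "rs_c": "Gene 1", "rs_b": "Gene 9"}
assoc = association_scores(snp_pvalues, snp_to_gene, ["Gene 1", "Gene 9", "Gene X"])
print(assoc)
```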

Example 6

• Age-related Eye Disease Study

• Macular degeneration

Example 6

• Make custom association file

rs3753396 0.0444

rs543879 0.0494

rs7724788 0.75

• Run CANDID with this association file

Results

rs3753396 0.0444

rs543879 0.0494

rs7724788 0.75

} CFH

} SLC25A46

So just how well does this work anyway?

Preliminary evidence

• Online Mendelian Inheritance in Man

• 154 diseases linked to chromosome 1

• Literature, domains - chose keywords

• Conservation

• Expression - chose tissue codes

Ideal weights

• Tested all combinations of weights in those 4 categories
  – Possible weights: (0, 0.1, … , 0.9, 1)

• Which weight combination was the best, across all 154 diseases?
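For a sense of scale, the grid of weight combinations can be enumerated directly: 11 possible weights in each of 4 categories gives 11^4 combinations. The sketch below only builds the grid; the category names follow the slides, and the step of evaluating each combination against the 154 diseases is not shown.

```python
from itertools import product

categories = ["literature", "domains", "conservation", "expression"]
possible_weights = [i / 10 for i in range(11)]  # 0, 0.1, ..., 0.9, 1

# Every assignment of one weight to each of the four categories.
grid = [dict(zip(categories, combo))
        for combo in product(possible_weights, repeat=len(categories))]

# The all-zero combination cannot rank anything, so it may be worth skipping.
informative = [w for w in grid if any(w.values())]

print(len(grid), len(informative))
```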

Top 10 weight combinations

1. Literature = 1, everything else = 0

2. Literature = 0.9, everything else = 0



5. …


11. Literature = 1, domains = 0.1

More specifics

• Literature only: average ranking = 425
  – Rank 425 out of 38697 = 98.9th percentile
  – 44/154 genes ranked #1 for at least one set of weights

• Chromosome 1: average ranking = 22
  – Rank 22 out of 2280 = 99th percentile
  – 84/154 genes ranked #1 for at least one set of weights
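The percentile figures follow from the average rank and the size of the list being ranked: a rank of r out of n puts a gene ahead of roughly (1 - r/n) of the list. A quick check of the arithmetic, using only the numbers quoted above:

```python
def rank_percentile(rank, total):
    """Percentile implied by a rank (1 = best) within a list of `total` items."""
    return 100.0 * (1.0 - rank / total)


print(round(rank_percentile(425, 38697), 1))  # literature only
print(round(rank_percentile(22, 2280), 1))    # chromosome 1

# Share of the 154 disease genes ranked #1 for at least one set of weights.
print(round(100.0 * 44 / 154, 1))
print(round(100.0 * 84 / 154, 1))
```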

Analysis of results

• They make a lot of sense.

• Genes in OMIM are, by definition, well-characterized.

• Many diseases are rare, with particular names or keywords that would only appear in papers about the disease genes.

Next steps

• Separate OMIM analysis into simple and complex traits
  – Get new ideal weights

• See how well these ideal weights do in ranking candidates from chromosome 2.

Next steps

• CANDID’s databases were last compiled in November 2006.

• Find publications that have come out since then.

• How well does CANDID do in ranking those genes?

Next steps

• Many new whole-genome studies and microarray studies implicate lists of candidates.

• If CANDID analyzes those phenotypes, how significant is the overlap of CANDID’s top genes and those papers’ top genes?

Next steps

• Any other suggestions?

• Any interesting data you have?

Any questions?

Acknowledgments

• Mike Province

• Howard McLeod

• Aldi Kraja

• Ingrid Borecki

• Qunyuan Zhang

• Ryan Christensen

• John Martin
