CANDID: A candidate gene identification tool, Part 2. Janna Hutz, [email protected], March 26, 2007


Page 1: CANDID: A candidate gene identification tool, Part 2

CANDID: A candidate gene identification tool

Part 2

Janna Hutz

[email protected]

March 26, 2007

Page 2

Review

• Literature
  – Well-characterized genes

• Protein domains
  – All genes

• Cross-species conservation
  – All genes

Page 3

Today’s agenda

• Expression levels

• Linkage data

• Association data

• CANDID performance measures

Page 4

Candidate lists vs. single candidates

• Candidate lists
  – Complex trait or disease
  – Disease with known heterogeneity

• Single candidates
  – Mendelian trait
  – New disease
  – Disease with clear, well-defined pathology

Page 5

Candidate lists vs. single candidates

• Microarray

• SNP typing

• Sequencing

• Immunocytochemistry

• Knockout model

ACT[A/G]GGA

Page 6

Example 4

• Goiter - thyroid gland problem

• Iodine deficiency

• Genetic causes

Page 7

Example 4

• Iodine is not supplied

• Iodine is present, but is not added to the molecule

• Which gene is mutated?

Page 8

Expression data

• We know what tissue our gene is expressed in (thyroid).

• How can we use this knowledge to help identify the candidate?

• Wouldn’t it be nice if we had an expression database?

Page 9

Expression databases

• Our ideal expression database would have:
  – Expression data for the same genes across many different tissues
  – As many tissues as possible
  – As many genes as possible
  – Good documentation

• Gene Atlas

Page 10

Gene Atlas

• Genomics Institute of the Novartis Research Foundation

• 79 human tissues (160 samples)

• 2 arrays
  – Affymetrix HG-U133A
  – GNF1H (custom)

• 17,809 genes

Page 11

Measure of gene expression

• Our thyroid gene:
  – Gene that is brightest on the thyroid array?
  – Gene that is brightest on the thyroid array, compared to all the other arrays.

[Figure: arrays for heart, brain, thyroid, lung]
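The difference between the two questions on this slide can be seen on toy numbers. This is a minimal sketch, assuming a relative measure defined as one tissue's share of a gene's total signal; the slides do not give CANDID's actual expression formula, and the gene names and intensities below are hypothetical.

```python
# Hypothetical array intensities (not taken from Gene Atlas).
expression = {
    "GENE_A": {"heart": 900.0, "brain": 950.0, "thyroid": 1000.0, "lung": 920.0},
    "GENE_B": {"heart": 10.0, "brain": 12.0, "thyroid": 400.0, "lung": 8.0},
}


def brightest_absolute(expression, tissue):
    """Gene with the highest raw intensity on one tissue's array."""
    return max(expression, key=lambda gene: expression[gene][tissue])


def relative_expression(profile, tissue):
    """Assumed measure: share of a gene's total signal found in one tissue."""
    total = sum(profile.values())
    return profile[tissue] / total if total > 0 else 0.0


def brightest_relative(expression, tissue):
    """Gene whose signal is most concentrated in the given tissue."""
    return max(
        expression,
        key=lambda gene: relative_expression(expression[gene], tissue),
    )


print(brightest_absolute(expression, "thyroid"))
print(brightest_relative(expression, "thyroid"))
```

GENE_A is bright everywhere, so it wins on raw intensity, while GENE_B is dim overall but almost exclusively expressed in thyroid, so it wins once the other arrays are taken into account.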

Page 12

Measures of gene expression

• Run CANDID, specifying that we’re interested in the thyroid.

http://dsgweb.wustl.edu/llfs/secure_html/hutz/index.html

User name: workshop
Password: perl031907

• (We’ll need a tissue code for that.)

Page 13

Example 4 - Results

• Our favorite genes:

• TP53 - rank is…
  – 16314th

• KRAS - rank is…
  – 5229th

• What genes are ranked most highly?

Page 14

Example 4 - Results

• 192 genes with expression score of 1

• The TOP gene is actually responsible for the phenotype described earlier
  – Its expression score = 1

Page 15

Prior evidence

• I’m not interested in examining all of the genes in the genome - just some of them.

• Linkage and association

Page 16

Linkage

• CANDID can:
  – Weight regions with higher LOD scores
  – Limit analysis to certain regions

• How does it do this?

Page 17

Linkage scoring

linkage score = (gene’s LOD score) / (maximum genome-wide LOD score)
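The ratio on this slide can be written as a short function. This is a sketch only: the function name and gene labels are hypothetical, and treating negative LOD scores as zero is an assumption the slides do not address.

```python
def linkage_scores(gene_lods):
    """Scale each gene's LOD score by the maximum genome-wide LOD score.

    gene_lods maps a gene name to the LOD score at that gene's position.
    Negative LOD scores are clipped to 0 (an assumption, not from the slides).
    """
    max_lod = max(gene_lods.values(), default=0.0)
    if max_lod <= 0:
        # No positive linkage evidence anywhere: every gene scores 0.
        return {gene: 0.0 for gene in gene_lods}
    return {gene: max(lod, 0.0) / max_lod for gene, lod in gene_lods.items()}


scores = linkage_scores({"GENE_A": 3.0, "GENE_B": 1.5, "GENE_C": 0.0})
print(scores)
```

With this scaling, the gene under the highest peak always scores 1 and every other gene scores in proportion to its own LOD.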

Page 18

Linkage files

• How does CANDID get this linkage information?

• CANDID takes two kinds of files
  – Unformatted output from GENEHUNTER and MERLIN
  – Custom linkage files

Page 19

Custom linkage files

• Simple format

• Line 1 of the file must contain the word “custom” somewhere

• Subsequent lines:

Chromosome (tab) cM (tab) LOD score

• But how do I get cM positions?
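The custom format described above is simple enough to read in a few lines. This is a minimal sketch of a reader, not CANDID's own parser; the function name is hypothetical, and the case-sensitive check for "custom" is an assumption.

```python
import io


def parse_custom_linkage(handle):
    """Read a custom linkage file into (chromosome, cM, LOD) tuples."""
    lines = iter(handle)
    header = next(lines, "")
    if "custom" not in header:
        raise ValueError('line 1 must contain the word "custom"')
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        chrom, cm, lod = line.split("\t")
        # Chromosome stays a string so that names such as "X" also work.
        records.append((chrom, float(cm), float(lod)))
    return records


example = io.StringIO("custom\n13\t23.65\t3\n13\t25.08\t3\n")
records = parse_custom_linkage(example)
print(records)
```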

Page 20

Mapmaker

• Inputs file as:

Chromosome (tab) basepair (tab) LOD score

• Outputs new file in the format:

Chromosome (tab) cM (tab) LOD score

• Will be available on the CANDID website soon
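The slides do not say how Mapmaker converts basepair positions to cM. One common approach, shown here purely as an illustration, is linear interpolation between markers whose physical and genetic positions are both known. The anchor points, function names, and clamping at the map ends are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical (basepair, cM) anchor markers per chromosome, sorted by basepair.
GENETIC_MAP = {
    "13": [(20_000_000, 10.0), (30_000_000, 30.0), (40_000_000, 40.0)],
}


def bp_to_cm(chrom, bp):
    """Linearly interpolate a cM position from flanking anchor markers."""
    anchors = GENETIC_MAP[chrom]
    positions = [pos for pos, _ in anchors]
    if bp <= positions[0]:
        return anchors[0][1]  # clamp below the first marker
    if bp >= positions[-1]:
        return anchors[-1][1]  # clamp above the last marker
    i = bisect_right(positions, bp)
    (bp0, cm0), (bp1, cm1) = anchors[i - 1], anchors[i]
    return cm0 + (cm1 - cm0) * (bp - bp0) / (bp1 - bp0)


def convert_line(line):
    """Turn 'chrom<TAB>basepair<TAB>LOD' into 'chrom<TAB>cM<TAB>LOD'."""
    chrom, bp, lod = line.rstrip("\n").split("\t")
    return f"{chrom}\t{bp_to_cm(chrom, int(bp)):.2f}\t{lod}"


print(convert_line("13\t25000000\t3"))
```

A real conversion would need a published genetic map for the genome build in use; the three anchors above exist only to make the sketch runnable.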

Page 21

Example 5

• Deletion on chromosome 13 between 23.65 cM and 25.08 cM.

pancreatic cancer

Page 22

Creating a custom linkage file

• Example:

custom

13 23.64 0

13 23.65 3

13 25.08 3

13 25.08 0

[Figure: chromosome 13 region from 23.65 cM to 25.08 cM]
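The example file marks a single region with a flat LOD score and zeros on either side. A small helper can generate the same lines for any region; this is a sketch with hypothetical names, and the 0.01 cM offset before the region simply mirrors the example above.

```python
def custom_linkage_lines(chrom, start_cm, end_cm, lod, step=0.01):
    """Build custom linkage file lines that flag one region with a flat LOD."""
    return [
        "custom",
        f"{chrom}\t{start_cm - step:.2f}\t0",  # just before the region
        f"{chrom}\t{start_cm:.2f}\t{lod}",     # region start
        f"{chrom}\t{end_cm:.2f}\t{lod}",       # region end
        f"{chrom}\t{end_cm:.2f}\t0",           # drop back to 0, as in the example
    ]


lines = custom_linkage_lines("13", 23.65, 25.08, 3)
print("\n".join(lines))
```

Writing `"\n".join(lines)` to a text file gives a tab-delimited file in the format described for custom linkage files.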

Page 23

Running CANDID

1. Try running CANDID using only the linkage criterion.

2. Now, run CANDID with the linkage criterion and literature criterion (your choice of keywords).

• Linkage weight = 1000

• Literature weight = 1
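Why weight linkage 1000 times more heavily than literature? The slides do not spell out how CANDID combines criteria, so the sketch below assumes a plain weighted sum of per-criterion scores between 0 and 1. Under that assumption, the large linkage weight keeps every gene inside the region above every gene outside it, and literature only breaks ties within the region. Gene labels and scores are hypothetical.

```python
def combined_score(scores, weights):
    """Assumed combination rule: weighted sum of per-criterion scores."""
    return sum(weights.get(name, 0.0) * value for name, value in scores.items())


weights = {"linkage": 1000, "literature": 1}

# Hypothetical genes with per-criterion scores in [0, 1].
genes = {
    "IN_REGION_FEW_PAPERS": {"linkage": 1.0, "literature": 0.1},
    "IN_REGION_MANY_PAPERS": {"linkage": 1.0, "literature": 0.9},
    "OUTSIDE_REGION": {"linkage": 0.0, "literature": 1.0},
}

ranked = sorted(
    genes,
    key=lambda gene: combined_score(genes[gene], weights),
    reverse=True,
)
print(ranked)
```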

Page 24

Results

• From OMIM:

“Individuals with mutations in the BRCA2 gene, which predisposes to breast and ovarian carcinoma, have an increased risk of pancreatic cancer; germline mutations in BRCA2 are the most common inherited alteration identified in familial pancreatic cancer.”

Page 25

But linkage is so last season…

Page 26

Association

• Increasing numbers of association studies

• Increasing numbers of SNPs in each study

• Can CANDID use this information, too?

Page 27

Association

• Database
  – dbSNP - 11.8 million human SNPs
  – Includes HapMap SNPs
  – Most comprehensive
  – Each SNP has a number prefixed with “rs”

Page 28

Association

• How does CANDID accept association data?

• Custom file format - each line is:

rs# (tab) p-value
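A file in this format can be read and sanity-checked with a few lines of code. This is a minimal sketch, not CANDID's own reader; the function name and the validation rules (rs prefix, p-value between 0 and 1) are assumptions.

```python
import io


def parse_association(handle):
    """Read 'rs#<TAB>p-value' lines into a dict of rs number -> p-value."""
    pvalues = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rs_id, p_text = line.split("\t")
        if not rs_id.startswith("rs"):
            raise ValueError(f"expected an rs number, got {rs_id!r}")
        p = float(p_text)
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"p-value out of range for {rs_id}: {p}")
        pvalues[rs_id] = p
    return pvalues


example = io.StringIO("rs3753396\t0.0444\nrs543879\t0.0494\nrs7724788\t0.75\n")
pvalues = parse_association(example)
print(pvalues)
```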

Page 29

Association scoring

• For each gene, take the best p-value for that gene’s SNPs

• Subtract that p-value from 1

• Unless you test SNPs in every gene, this can be kind of unfair…

Page 30

Association scoring

• Tested 10 genes

• Gene 9 has a best p-value of 0.8 (bad)

• Gene X was not tested

• Should Gene 9 get a higher overall score than Gene X?

Page 31

p-value threshold

• User defines a p-value threshold

• Let’s say it’s 0.1.

• Any SNPs with p-values above 0.1 are not considered.

• Now Gene 9 and Gene X have the same score (0).
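The scoring rule and the threshold fit in one small function. This is a sketch of the rule as the slides describe it (best p-value subtracted from 1, SNPs above the threshold ignored); the function name and the default threshold of 1.0, meaning no filtering, are assumptions.

```python
def association_score(snp_pvalues, threshold=1.0):
    """Score one gene: 1 minus its best p-value at or below the threshold.

    snp_pvalues lists the p-values of the SNPs tested in the gene.
    A gene with no SNPs under consideration scores 0.
    """
    considered = [p for p in snp_pvalues if p <= threshold]
    if not considered:
        return 0.0
    return 1.0 - min(considered)


gene_9 = [0.8]  # tested, but weak evidence
gene_x = []     # not tested at all

print(association_score(gene_9), association_score(gene_x))
print(association_score(gene_9, threshold=0.1), association_score(gene_x, threshold=0.1))
```

Without a threshold, Gene 9 scores about 0.2 and outranks the untested Gene X; with a threshold of 0.1, both score 0.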

Page 32

Example 6

• Age-related Eye Disease Study

• Macular degeneration

Page 33

Example 6

• Make custom association file

rs3753396 0.0444

rs543879 0.0494

rs7724788 0.75

• Run CANDID with this association file

Page 34

Results

SNP         p-value   Gene
rs3753396   0.0444    CFH
rs543879    0.0494    CFH
rs7724788   0.75      SLC25A46

Page 35

So just how well does this work anyway?

Page 36

Preliminary evidence

• Online Mendelian Inheritance in Man

• 154 diseases linked to chromosome 1

• Literature, domains - chose keywords

• Conservation

• Expression - chose tissue codes

Page 37

Ideal weights

• Tested all combinations of weights in those 4 categories
  – Possible weights: (0, 0.1, … , 0.9, 1)

• Which weight combination was the best, across all 154 diseases?
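With 11 possible weights in each of 4 categories, the search covers 11^4 = 14,641 combinations. The sketch below enumerates that grid and scores one combination by the average rank of the known disease gene. Both the weighted-sum scoring and the use of average rank as the measure of "best" are assumptions, and the data layout and names are hypothetical.

```python
from itertools import product

CRITERIA = ("literature", "domains", "conservation", "expression")
WEIGHT_VALUES = [i / 10 for i in range(11)]  # 0, 0.1, ..., 0.9, 1


def weight_grid():
    """Yield every combination of weights across the four criteria."""
    for combo in product(WEIGHT_VALUES, repeat=len(CRITERIA)):
        yield dict(zip(CRITERIA, combo))


def mean_rank(weights, diseases):
    """Average rank of each disease's known gene under one weight set.

    diseases is a list of (known_gene, {gene: {criterion: score}}) pairs.
    Ties keep input order because sorted() is stable.
    """
    ranks = []
    for known_gene, gene_scores in diseases:
        totals = {
            gene: sum(weights[c] * scores[c] for c in CRITERIA)
            for gene, scores in gene_scores.items()
        }
        ordered = sorted(totals, key=totals.get, reverse=True)
        ranks.append(ordered.index(known_gene) + 1)
    return sum(ranks) / len(ranks)


n_combinations = sum(1 for _ in weight_grid())
print(n_combinations)
```

The best combination would then be something like `min(weight_grid(), key=lambda w: mean_rank(w, diseases))`, given real per-disease scores.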

Page 38

Top 10 weight combinations

1. Literature = 1, everything else = 0

2. Literature = 0.9, everything else = 0

3. Literature = 0.8, everything else = 0

4. Literature = 0.7, everything else = 0

5. …

10. Literature = 0.1, everything else = 0

11. Literature = 1, domains = 0.1

Page 39

More specifics

• Literature only: average ranking = 425
  – 425/38697 = 98.9th percentile
  – 44/154 genes ranked #1 for at least one set of weights

• Chromosome 1: average ranking = 22
  – 22/2280 = 99th percentile
  – 84/154 genes ranked #1 for at least one set of weights
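The percentiles on this slide follow from the fraction of genes ranked below the average position. A quick check of the arithmetic, assuming the percentile is computed as 100 × (1 − rank / total); the function name is hypothetical.

```python
def rank_percentile(rank, total):
    """Percentage of the ranked list that falls below the given rank."""
    return 100.0 * (1.0 - rank / total)


genome_wide = round(rank_percentile(425, 38697), 1)
chromosome_1 = round(rank_percentile(22, 2280), 1)
print(genome_wide, chromosome_1)  # 98.9 99.0
```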

Page 40

Analysis of results

• They make a lot of sense.

• Genes in OMIM are, by definition, well-characterized.

• Many diseases are rare, with particular names or keywords that would only appear in papers about the disease genes.

Page 41

Next steps

• Separate OMIM analysis into simple and complex traits
  – Get new ideal weights

• See how well these ideal weights do in ranking candidates from chromosome 2.

Page 42

Next steps

• CANDID’s databases were last compiled in November 2006.

• Find publications that have come out since then.

• How well does CANDID do in ranking those genes?

Page 43

Next steps

• Many new whole-genome studies and microarray studies implicate lists of candidates.

• If CANDID analyzes those phenotypes, how significant is the overlap of CANDID’s top genes and those papers’ top genes?
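One standard way to put a number on the overlap between two gene lists is a hypergeometric test: if both lists were drawn at random from the same pool of genes, how likely is an overlap at least this large? The slides do not say which test is planned, so this is only one possible approach, with hypothetical names, using the Python standard library.

```python
from math import comb


def overlap_p_value(total_genes, list_a_size, list_b_size, overlap):
    """P(overlap >= observed) when list B is drawn at random from all genes.

    Hypergeometric upper tail: list A marks list_a_size of total_genes genes,
    and list B is a random draw of list_b_size genes without replacement.
    """
    upper = min(list_a_size, list_b_size)
    denominator = comb(total_genes, list_b_size)
    tail = sum(
        comb(list_a_size, k) * comb(total_genes - list_a_size, list_b_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / denominator


# Toy case: 10 genes in total, two lists of 3 genes that overlap completely.
p = overlap_p_value(total_genes=10, list_a_size=3, list_b_size=3, overlap=3)
print(p)
```

For a real comparison, `total_genes` would be the number of genes CANDID ranks, and the two list sizes would be CANDID's top genes and the published candidate list.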

Page 44

Next steps

• Any other suggestions?

• Any interesting data you have?

Page 45

Any questions?

Page 46

Acknowledgments

• Mike Province

• Howard McLeod

• Aldi Kraja

• Ingrid Borecki

• Qunyuan Zhang

• Ryan Christensen

• John Martin
