pgpop: pharmacogenomic discovery and replication in very large patient populations pgpop: summary...

1
RA rs6457620 Intergenic C hr.6 75 138 MS rs3135388 D R B 1*1501 108 61 RA rs6679677 RSBN1 238 134 RA rs2476601 PTPN22 238 134 AF rs2200733 C hr.4q25 292 147 CD rs11805303 IL23R 493 107 T2D rs4506565 TC F7L2 503 532 CD rs17234657 C hr.5 513 106 CD rs1000113 C hr.5 626 107 T2D rs12255372 TC F7L2 745 510 T2D rs12243326 TC F7L2 746 520 CD rs17221417 NOD2 866 107 AF rs10033464 C hr.4q25 1046 143 CD rs2542151 PTPN22 1104 107 MS rs2104286 IL2R A 2133 61 MS rs6897932 IL7R A 2263 61 T2D rs10811661 C D K N 2B 2406 534 T2D rs8050136 FTO 2569 533 T2D rs5219 K C N J11 2792 533 T2D rs5215 K C N J11 2908 527 T2D rs4402960 IG F2B P2 3111 527 gene /region m arker num ber needed num ber identified disease O dds ratio 0.5 1.0 2.0 5.0 0.1 1 10 PGPop: PharmacoGenomic discovery and replication in very large patient POPulations PGPop: SUMMARY PGPop was conceived as a network resource to provide to PGRN an opportunity to identify large groups of real world patients with known drug exposures and outcomes for pharmacogenomic study in a clinical setting. Each PGPop node includes a very large collection of patient data, drug exposures, and outcomes, and they share the general characteristic that they include “all comers” rather than more narrowly defined clinical trial populations. Some consortium nodes include large DNA collections in place, while others cover millions of lives and have committed to an infrastructure to collect DNA from patients with identified phenotypes. The participating systems include •BioVU, the Vanderbilt DNA databank that currently links 90,000 de-identified electronic health records (EHR) records with DNA obtained from discarded blood samples •The Marshfield Clinic Personalized Medicine Research Project (PMRP) that includes DNA from almost 20,000 individuals coupled to an EHR that extends back to the 1960s •Informatics for Integrating Biology and the Bedside (i2b2), an informatics capability at Harvard supported by the National Center for Biomedical Computing. The i2b2 group will not only contribute informatics excellence, but has also developed the Crimson Project that can provide DNA linked to de-identified medical records to Harvard Partners investigators from over 800,000 patient visits annually. •BioBank Japan, a resource that includes DNA and other biospecimens in >300,000 subjects. Clinical data are collected by medical coordinators at each of the 66 participating hospitals that cover 2% of all Japanese hospital beds (~25,000). •The integrated pharmacoepidemiology program of 13 health plans participating in the HMO Research Network Center for Education and Research in Therapeutics (CERT); these plans together cover 11,000,000 lives. •The Pharmacy benefits company Medco, that currently provides services to >60 million patients and has an active program in pharmacogenomics Vanderbilt BioVU – design and current status Leadership at PGPop nodes Top: The BioVU model. BioVU uses DNA extracted from blood samples that were obtained in the course of clinical care and that are about to be discarded. Using discarded biologic material as a research resource requires that the associated clinical information be de-identified. Accordingly, the first step (top left) in creation of the BioVU resource was creation of an image, termed the Synthetic Derivative, of the Vanderbilt EMR in which identifiers have been scrubbed and the medical record number has been hashed. The medical record number in eligible blood samples is labeled with the same hashed number, and DNA extracted. Bottom: Sample access procedures. After signing a data use agreement, investigators gain access to the Synthetic Derivative. The Data Use Agreement includes further stipulations against attempts at re- identification, and mandates that genotype data be redeposited into the resource. Tools to conduct simple automated searches are in place, but investigator curation is generally required to more precisely identify cases and controls for subsequent studies. Sam ple retrieval B699tre563msd.. scrub be d F5rt783m bncds… scrub bed B699tre563msd.. scrub bed F5rt783m bncds… scru bbed B699tre563msd.. scrub be d F5rt783m bncds… scrub be d B699tre563msd.. scrub bed F5rt783m bncds… scrub bed B699tre563msd.. scrub bed F5rt783m bncds… scru bbed B699tre563msd.. scrub be d F5rt783m bncds… scrub be d B699tre563msd.. scrub bed F5rt783m bncds… scru bbed B699tre563msd.. scrub bed F5rt783m bncds… scru bbed B699tre563msd.. sc ru bbed F5rt783m bncds… sc r u bbed B699tre563msd.. scrub be d F5rt783m bncds… scrubbed B699tre563msd.. scrub bed F5rt783m bncds… scru bbed B699tre563msd.. scr u bbed F5rt783m bncds… scrub bed B699tre563msd.. scrub be d F5rt783m bncds… scrubbed B699tre563msd.. scru bbed F5rt783m bncds… scru bbed B699tre563msd.. scru bbed F5rt783m bncds… scrub be d B699tre563msd.. scrub be d F5rt783m bncds… scrubbed B699tre563msd.. scru bbed F5rt783m bncds… scru bbed B699tre563msd.. scru bbed F5rt783m bncds… scrub be d B699tre563msd.. scrub be d F5rt783m bncds… scrubbed B699tre563msd.. scru bbed F5rt783m bncds… scrub bed B699tre563msd.. scru bbed F5rt783m bncds… scrub be d B699tre563msd.. scrubbed F5rt783m bncds… scrubbed B 699tre563m sd.. scrubbed F5rt783m bncds… scrub be d B699tre563msd.. scru bbed F5rt783m bncds… scrub be d F5rt783m bncds… . B699tre563m sd… . F5rt783m bncds… . B699tre563m sd… . F5rt783m bncds… . B699tre563m sd… . F5rt783m bncds… . B699tre563m sd… . F5rt783m bncds… . B 699tre563m sd… . F5rt783m bncds… . B699tre563m sd… . F5rt783m bncds… . B 699tre563m sd… . F5rt783m bncds… . B699tre563m sd… . F5rt783m bncds… . B699tre563m sd… . G enotyping, genotype- phenotype relations cases controls + O ne w ay hash Investigator query cases controls + D ata use agreem ent BioVU (Vanderbilt University Medical Center) Marshfield Clinic Personalized Medicine Research Project (PMRP) Crimson Project (i2b2 at Harvard) HMO Research Network Center for Education and Research in Therapeutics Biobank Japan Medco (Pharmacy benefits) R esource C urrent size EM R DNA in hand Ethnicity (% ) C aucasian African American Asian Hispanic BioVU 90,000 Y Y 85 12 1 1 PM RP 20,000 Y Y 98 0.5 1 Crim son 800,000 Y 60 10 15 15 Biobank Japan 300,000 Y 100 HMORN CERT 11,000,000 Y varies 1-33 1-9 1-39 M edco 65,000,000 Hua Xu Josh Denny Yusuke Nakamura Zak Kohane Cathy McCarty Bob Davis Felix Frueh Dan Roden, PI The BioVU “demonstration project”. The first 10,000 subjects accrued were all genotyped at multiple SNP sites previously associated with disease susceptibility, and then natural language processing methods were used to identify cases and controls in the entire set. The experiment thus mimics a situation in which genotypic information is available in many subjects, and sets are then selected for genotype-phenotype analysis. The results are ordered by the number of cases estimated for replication (“number needed” column), calculated from previously-reported odds ratios, indicated by a red square. The number of cases actually identified is also shown (“number identified”). The blue diamonds indicate the point estimate of the allelic odds ratio derived from analysis of cases and controls identified. The confidence intervals for these estimates are also provided. This analysis used only cases in which European ancestry had been assigned. AF: atrial fibrillation; CD: Crohn’s Disease; MS: multiple sclerosis; RA: rheumatoid arthritis; T2D: type 2 diabetes. eligible John D oe O ne w ay hash A7C C F99D E5732… . A7C C F99D E65732… . Extract DNA A7C C F99D E65732… . John D oe The “synthetic derivative” (SD ) Searches conducted in B ioVU (April-M ay 2009,in preparation forthe PG Pop subm ission) Phenotype Location in EM R searched R equesting investigator/ site Num ber % w om en % A frican- American BioVU (M ay 21,2009) 56,907 58.1 9.9 w arfarin medications PAT 4,482 48.3 9.5 5 m ostcomm only prescribed statins medications Krauss/PA R C 10,216 46.0 10.9 clopidogrel medications Shuldiner/PAPI Limdi/UAB 4,407 42.4 10.1 prednisone or dexam ethasone medications Relling /PA AR 4KID S 10,584 58.7 12.1 m etform in + Type 2 diabetes + H gA1c Com plex N LP-based search Giacomini/PM T 1,794 55.7 21.3 rheum atoid arthritis Com plex N LP-based search Plenge/M GH 1,777 77.1 9.5 asthma ICD 9 code W eiss/PH AT 3,916 70.8 17.5 hypertension ICD 9 code Johnson/P EA R 21,102 52.0 14.3 Zyban,Wellbutrin,bupropion, C hantix,V arenicline in m edications O R “nicotine replacem ent”in the history and physical,problem listor discharge sum m ary. Tyndale/P N AT 3,855 70.2 8.3 N LP:N atural Language processing;EM R :Electronic M edical R ecord PGPop goals PGPop will be managed by a Steering Committee that will include representation from the participating nodes. Our initial task will be (1) organization of the resource and (2) execution of a demonstration project that will establish mechanisms for access to samples from multiple resource nodes. We anticipate that mechanisms to access PGPop will be similar to those being established for access to other PGRN resources. This will likely involve an application process to be reviewed by components of the PGRN and by PGPop. There will be costs associated with accessing the samples, which remain to be determined. PGPop goals are 1. Establish the infrastructure to enable rapid access to well-phenotyped samples across nodes Catalog resource components • Facilitate access to cases and controls, and ultimately samples • Coordination of methods to define phenotypes across nodes. 2. Undertake a demonstration project across nodes in Year 01 3. Deploy the resource for pharmacogenomic studies proposed by PGRN sites • The Steering Committee and PGRN will receive applications and decide on scientific merit. The Steering Committee will establish which PGPop node(s) can and wish to collaborate on a given project. Any single PGRN center could interact individually with any participating node. We anticipate that PGPop would support 1-2 projects/year. 4. Evaluate best practices and models for using large resources for pharmacogenomic science

Upload: austen-wade

Post on 12-Jan-2016

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: PGPop: PharmacoGenomic discovery and replication in very large patient POPulations PGPop: SUMMARY PGPop was conceived as a network resource to provide

RA rs6457620 Intergenic Chr. 6 75 138

MS rs3135388 DRB1*1501 108 61

RA rs6679677 RSBN1 238 134

RA rs2476601 PTPN22 238 134

AF rs2200733 Chr. 4q25 292 147

CD rs11805303 IL23R 493 107

T2D rs4506565 TCF7L2 503 532

CD rs17234657 Chr. 5 513 106

CD rs1000113 Chr. 5 626 107

T2D rs12255372 TCF7L2 745 510

T2D rs12243326 TCF7L2 746 520

CD rs17221417 NOD2 866 107

AF rs10033464 Chr. 4q25 1046 143

CD rs2542151 PTPN22 1104 107

MS rs2104286 IL2RA 2133 61

MS rs6897932 IL7RA 2263 61

T2D rs10811661 CDKN2B 2406 534

T2D rs8050136 FTO 2569 533

T2D rs5219 KCNJ11 2792 533

T2D rs5215 KCNJ11 2908 527

T2D rs4402960 IGF2BP2 3111 527

gene / regionmarkernumber needed

number identified

disease

Odds ratio

0.5 1.0 2.0 5.0

0.1 1 10

PGPop: PharmacoGenomic discovery and replication in very large patient POPulations

PGPop: SUMMARYPGPop was conceived as a network resource to provide to PGRN an opportunity to identify large groups of real world patients with known drug exposures and outcomes for pharmacogenomic study in a clinical setting.

Each PGPop node includes a very large collection of patient data, drug exposures, and outcomes, and they share the general characteristic that they include “all comers” rather than more narrowly defined clinical trial populations. Some consortium nodes include large DNA collections in place, while others cover millions of lives and have committed to an infrastructure to collect DNA from patients with identified phenotypes. The participating systems include •BioVU, the Vanderbilt DNA databank that currently links 90,000 de-identified electronic health records (EHR) records with DNA obtained from discarded blood samples•The Marshfield Clinic Personalized Medicine Research Project (PMRP) that includes DNA from almost 20,000 individuals coupled to an EHR that extends back to the 1960s•Informatics for Integrating Biology and the Bedside (i2b2), an informatics capability at Harvard supported by the National Center for Biomedical Computing. The i2b2 group will not only contribute informatics excellence, but has also developed the Crimson Project that can provide DNA linked to de-identified medical records to Harvard Partners investigators from over 800,000 patient visits annually.•BioBank Japan, a resource that includes DNA and other biospecimens in >300,000 subjects. Clinical data are collected by medical coordinators at each of the 66 participating hospitals that cover 2% of all Japanese hospital beds (~25,000). •The integrated pharmacoepidemiology program of 13 health plans participating in the HMO Research Network Center for Education and Research in Therapeutics (CERT); these plans together cover 11,000,000 lives. •The Pharmacy benefits company Medco, that currently provides services to >60 million patients and has an active program in pharmacogenomics

Vanderbilt BioVU – design and current status

Leadership at PGPop nodes

Top: The BioVU model. BioVU uses DNA extracted from blood samples that were obtained in the course of clinical care and that are about to be discarded. Using discarded biologic material as a research resource requires that the associated clinical information be de-identified. Accordingly, the first step (top left) in creation of the BioVU resource was creation of an image, termed the Synthetic Derivative, of the Vanderbilt EMR in which identifiers have been scrubbed and the medical record number has been hashed. The medical record number in eligible blood samples is labeled with the same hashed number, and DNA extracted. Bottom: Sample access procedures. After signing a data use agreement, investigators gain access to the Synthetic Derivative. The Data Use Agreement includes further stipulations against attempts at re-identification, and mandates that genotype data be redeposited into the resource. Tools to conduct simple automated searches are in place, but investigator curation is generally required to more precisely identify cases and controls for subsequent studies. Samples are retrieved for genotyping after review of a genotyping plan. Planning for BioVU began in 2004 and the first samples were acquired in 2007. The resource currently accrues 500-1000 samples/week, and now holds ~90,000 samples. Samples from the Vanderbilt Children’s Hospital were included in spring 2010.

Sample retrieval

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

B6

99

tre

563

msd

..

scru

bbed

F5

rt7

83

mb

nc

ds…

scru

bbed

F5

rt7

83

mb

nc

ds…

.B

699

tre

563

msd

….

F5

rt7

83

mb

nc

ds…

.B

699

tre

563

msd

….

F5

rt7

83

mb

nc

ds…

.B

699

tre

563

msd

….

F5

rt7

83

mb

nc

ds…

.B

699

tre

563

msd

….

F5

rt7

83

mb

nc

ds…

.B

699

tre

563

msd

….

F5

rt7

83

mb

nc

ds…

.B

699

tre

563

msd

….

F5

rt7

83

mb

nc

ds…

.B

699

tre

563

msd

….

F5

rt7

83

mb

nc

ds…

.B

699

tre

563

msd

….

F5

rt7

83

mb

nc

ds…

.B

699

tre

563

msd

….

Genotyping, genotype-phenotype relations

cases

controls

+

On

e w

ay

hash

Investigator query

cases

controls

+

Data use agreement

BioVU (Vanderbilt University Medical Center)Marshfield Clinic Personalized Medicine Research Project (PMRP)

Crimson Project (i2b2 at Harvard)

HMO Research Network Center for Education and Research in Therapeutics

Biobank Japan

Medco (Pharmacy benefits)

Table 1: PGPop nodes

Resource Current size

EMR DNA

in hand

Ethnicity (%)

Caucasian African

American Asian Hispanic

BioVU 90,000 Y Y 85 12 1 1 PMRP 20,000 Y Y 98 0.5 1 Crimson 800,000 Y 60 10 15 15 Biobank Japan

300,000 Y 100

HMORN CERT

11,000,000 Y varies 1-33 1-9 1-39

Medco 65,000,000

Hua Xu Josh Denny

Yusuke Nakamura

Zak Kohane

Cathy McCarty

Bob Davis

Felix Frueh

Dan Roden, PI

The BioVU “demonstration project”. The first 10,000 subjects accrued were all genotyped at multiple SNP sites previously associated with disease susceptibility, and then natural language processing methods were used to identify cases and controls in the entire set. The experiment thus mimics a situation in which genotypic information is available in many subjects, and sets are then selected for genotype-phenotype analysis. The results are ordered by the number of cases estimated for replication (“number needed” column), calculated from previously-reported odds ratios, indicated by a red square. The number of cases actually identified is also shown (“number identified”). The blue diamonds indicate the point estimate of the allelic odds ratio derived from analysis of cases and controls identified. The confidence intervals for these estimates are also provided. This analysis used only cases in which European ancestry had been assigned. AF: atrial fibrillation; CD: Crohn’s Disease; MS: multiple sclerosis; RA: rheumatoid arthritis; T2D: type 2 diabetes.

eligibleJoh

n D

oe O

ne w

ay h

ash

A7C

CF

99D

E57

32…

.

A7C

CF

99D

E65

732…

.

Extract DNA

A7C

CF

99D

E65

732…

.

Joh

n D

oe

The “synthetic derivative”(SD)

Searches conducted in BioVU (April-May 2009, in preparation for the PGPop submission)

Phenotype Location in

EMR searched

Requesting investigator /

site Number

% women

% African-American

BioVU (May 21, 2009) 56,907 58.1 9.9 warfarin medications PAT 4,482 48.3 9.5 5 most commonly prescribed statins

medications Krauss/PARC 10,216 46.0 10.9

clopidogrel medications Shuldiner/PAPI Limdi/UAB

4,407 42.4 10.1

prednisone or dexamethasone

medications Relling /PAAR4KIDS

10,584 58.7 12.1

metformin + Type 2 diabetes + HgA1c

Complex NLP-based search

Giacomini/PMT 1,794 55.7 21.3

rheumatoid arthritis Complex NLP-based search

Plenge/MGH 1,777 77.1 9.5

asthma ICD9 code Weiss/PHAT 3,916 70.8 17.5 hypertension ICD9 code Johnson/PEAR 21,102 52.0 14.3 Zyban, Wellbutrin, bupropion, Chantix, Varenicline in medications OR “nicotine replacement” in the history and physical, problem list or discharge summary.

Tyndale/PNAT 3,855 70.2 8.3

NLP: Natural Language processing; EMR: Electronic Medical Record

PGPop goalsPGPop will be managed by a Steering Committee that will

include representation from the participating nodes. Our initial task will be (1) organization of the resource and (2) execution of a demonstration project that will establish mechanisms for access to samples from multiple resource nodes.

We anticipate that mechanisms to access PGPop will be similar to those being established for access to other PGRN resources. This will likely involve an application process to be reviewed by components of the PGRN and by PGPop. There will be costs associated with accessing the samples, which remain to be determined.

PGPop goals are 1. Establish the infrastructure to enable rapid access to well-

phenotyped samples across nodes• Catalog resource components• Facilitate access to cases and controls, and ultimately

samples• Coordination of methods to define phenotypes across

nodes. 2. Undertake a demonstration project across nodes in Year 013. Deploy the resource for pharmacogenomic studies proposed by

PGRN sites• The Steering Committee and PGRN will receive

applications and decide on scientific merit. The Steering Committee will establish which PGPop node(s) can and wish to collaborate on a given project. Any single PGRN center could interact individually with any participating node. We anticipate that PGPop would support 1-2 projects/year.

4. Evaluate best practices and models for using large resources for pharmacogenomic science