pgpop: pharmacogenomic discovery and replication in very large patient populations pgpop: summary...
Disease | Marker | Gene / region | Number needed | Number identified
RA | rs6457620 | Intergenic Chr. 6 | 75 | 138
MS | rs3135388 | DRB1*1501 | 108 | 61
RA | rs6679677 | RSBN1 | 238 | 134
RA | rs2476601 | PTPN22 | 238 | 134
AF | rs2200733 | Chr. 4q25 | 292 | 147
CD | rs11805303 | IL23R | 493 | 107
T2D | rs4506565 | TCF7L2 | 503 | 532
CD | rs17234657 | Chr. 5 | 513 | 106
CD | rs1000113 | Chr. 5 | 626 | 107
T2D | rs12255372 | TCF7L2 | 745 | 510
T2D | rs12243326 | TCF7L2 | 746 | 520
CD | rs17221417 | NOD2 | 866 | 107
AF | rs10033464 | Chr. 4q25 | 1046 | 143
CD | rs2542151 | PTPN22 | 1104 | 107
MS | rs2104286 | IL2RA | 2133 | 61
MS | rs6897932 | IL7RA | 2263 | 61
T2D | rs10811661 | CDKN2B | 2406 | 534
T2D | rs8050136 | FTO | 2569 | 533
T2D | rs5219 | KCNJ11 | 2792 | 533
T2D | rs5215 | KCNJ11 | 2908 | 527
T2D | rs4402960 | IGF2BP2 | 3111 | 527
[Figure: forest plot of odds ratios on logarithmic axes (tick labels 0.5, 1.0, 2.0, 5.0 and 0.1, 1, 10)]
PGPop: PharmacoGenomic discovery and replication in very large patient POPulations
PGPop: SUMMARY
PGPop was conceived as a network resource that gives PGRN an opportunity to identify large groups of real-world patients with known drug exposures and outcomes for pharmacogenomic study in a clinical setting.
Each PGPop node includes a very large collection of patient data, drug exposures, and outcomes. The nodes share the general characteristic that they include “all comers” rather than more narrowly defined clinical trial populations. Some consortium nodes have large DNA collections in place, while others cover millions of lives and have committed to an infrastructure to collect DNA from patients with identified phenotypes. The participating systems include:
• BioVU, the Vanderbilt DNA databank, which currently links 90,000 de-identified electronic health records (EHRs) with DNA obtained from discarded blood samples.
• The Marshfield Clinic Personalized Medicine Research Project (PMRP), which includes DNA from almost 20,000 individuals coupled to an EHR that extends back to the 1960s.
• Informatics for Integrating Biology and the Bedside (i2b2), an informatics capability at Harvard supported by the National Center for Biomedical Computing. The i2b2 group will not only contribute informatics excellence, but has also developed the Crimson Project, which can provide DNA linked to de-identified medical records to Harvard Partners investigators from over 800,000 patient visits annually.
• BioBank Japan, a resource that includes DNA and other biospecimens from >300,000 subjects. Clinical data are collected by medical coordinators at each of the 66 participating hospitals, which cover 2% of all Japanese hospital beds (~25,000).
• The integrated pharmacoepidemiology program of 13 health plans participating in the HMO Research Network Center for Education and Research in Therapeutics (CERT); these plans together cover 11,000,000 lives.
• The pharmacy benefits company Medco, which currently provides services to >60 million patients and has an active program in pharmacogenomics.
Vanderbilt BioVU – design and current status
Leadership at PGPop nodes
Top: The BioVU model. BioVU uses DNA extracted from blood samples that were obtained in the course of clinical care and are about to be discarded. Using discarded biologic material as a research resource requires that the associated clinical information be de-identified. Accordingly, the first step (top left) in building the BioVU resource was to create an image of the Vanderbilt EMR, termed the Synthetic Derivative, in which identifiers have been scrubbed and the medical record number has been hashed. Each eligible blood sample is labeled with the same hashed medical record number, and its DNA is extracted.
Bottom: Sample access procedures. After signing a Data Use Agreement, investigators gain access to the Synthetic Derivative. The Data Use Agreement includes further stipulations against attempts at re-identification, and mandates that genotype data be redeposited into the resource. Tools to conduct simple automated searches are in place, but investigator curation is generally required to identify cases and controls more precisely for subsequent studies. Samples are retrieved for genotyping after review of a genotyping plan. Planning for BioVU began in 2004 and the first samples were acquired in 2007. The resource currently accrues 500-1000 samples/week, and now holds ~90,000 samples. Samples from the Vanderbilt Children’s Hospital were included in spring 2010.
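The linkage step described above, in which the scrubbed record and the blood sample end up carrying the same hashed identifier, can be sketched in a few lines. This is only an illustration: the poster does not name the hash algorithm or the record layout, so the keyed SHA-512 hash, the `SECRET_KEY` value, the field names, and the helper functions below are all assumptions made for the example. Scrubbing identifiers out of free-text notes is a much harder problem and is not shown.

```python
import hashlib
import hmac

# Hypothetical site secret (an assumption for this sketch, not from the poster).
SECRET_KEY = b"example-site-secret"


def hash_mrn(mrn: str, key: bytes = SECRET_KEY) -> str:
    """Return a one-way keyed hash (HMAC-SHA-512) of a medical record number."""
    return hmac.new(key, mrn.encode("utf-8"), hashlib.sha512).hexdigest()


def deidentify_record(record: dict) -> dict:
    """Drop direct identifiers and attach the hashed MRN as a research ID."""
    identifiers = {"mrn", "name"}  # hypothetical identifier fields
    scrubbed = {k: v for k, v in record.items() if k not in identifiers}
    scrubbed["research_id"] = hash_mrn(record["mrn"])
    return scrubbed


def label_sample(sample_mrn: str) -> str:
    """Label an eligible blood sample with the same hashed MRN."""
    return hash_mrn(sample_mrn)


# Made-up example data.
record = {"mrn": "0012345", "name": "John Doe", "diagnosis": "atrial fibrillation"}
sd_record = deidentify_record(record)
tube_label = label_sample("0012345")

# The de-identified record and the DNA sample now share one identifier,
# so they can be linked without exposing the original MRN.
linked = sd_record["research_id"] == tube_label
```

Because the same input always yields the same hash, the record and the sample can be matched later, while the hash itself cannot be reversed to recover the medical record number.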
[Figure: BioVU sample access procedures. Panel labels: Data use agreement; Investigator query; cases + controls; Sample retrieval; Genotyping, genotype-phenotype relations; One way hash; scrubbed records]
BioVU (Vanderbilt University Medical Center)
Marshfield Clinic Personalized Medicine Research Project (PMRP)
Crimson Project (i2b2 at Harvard)
HMO Research Network Center for Education and Research in Therapeutics
Biobank Japan
Medco (Pharmacy benefits)
Table 1: PGPop nodes
Columns: Resource; Current size; EMR; DNA in hand; Ethnicity (%): Caucasian, African American, Asian, Hispanic
BioVU 90,000 Y Y 85 12 1 1
PMRP 20,000 Y Y 98 0.5 1
Crimson 800,000 Y 60 10 15 15
Biobank Japan 300,000 Y 100
HMORN CERT 11,000,000 Y varies 1-33 1-9 1-39
Medco 65,000,000
Hua Xu
Josh Denny
Yusuke Nakamura
Zak Kohane
Cathy McCarty
Bob Davis
Felix Frueh
Dan Roden, PI
The BioVU “demonstration project”. The first 10,000 subjects accrued were all genotyped at multiple SNP sites previously associated with disease susceptibility, and natural language processing methods were then used to identify cases and controls in the entire set. The experiment thus mimics a situation in which genotypic information is available for many subjects, and sets are then selected for genotype-phenotype analysis. The results are ordered by the number of cases estimated to be needed for replication (“number needed” column), calculated from the previously reported odds ratios, which are indicated by red squares. The number of cases actually identified is also shown (“number identified”). The blue diamonds indicate the point estimate of the allelic odds ratio derived from analysis of the cases and controls identified. The confidence intervals for these estimates are also provided. This analysis used only cases in which European ancestry had been assigned. AF: atrial fibrillation; CD: Crohn’s Disease; MS: multiple sclerosis; RA: rheumatoid arthritis; T2D: type 2 diabetes.
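For readers unfamiliar with the quantity plotted, an allelic odds ratio and its confidence interval can be computed from a 2x2 table of allele counts in cases and controls. The sketch below uses the standard log odds ratio (Woolf) interval; the poster does not say which interval method was used, so that choice is an assumption, as are the function name and the made-up counts.

```python
import math


def allelic_odds_ratio(case_risk, case_other, control_risk, control_other, z=1.96):
    """Allelic odds ratio with a Woolf (log-scale) confidence interval.

    Arguments are allele counts, not subject counts: each genotyped
    subject contributes two alleles. z=1.96 gives a ~95% interval.
    """
    counts = (case_risk, case_other, control_risk, control_other)
    if min(counts) <= 0:
        raise ValueError("all four allele counts must be positive")
    odds_ratio = (case_risk * control_other) / (case_other * control_risk)
    # Standard error of log(OR) is the root of the summed reciprocal counts.
    se_log_or = math.sqrt(sum(1.0 / c for c in counts))
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Made-up allele counts, for illustration only.
or_estimate, ci_low, ci_high = allelic_odds_ratio(
    case_risk=120, case_other=80, control_risk=100, control_other=100
)
print(f"OR = {or_estimate:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

An interval that spans the previously reported odds ratio is consistent with replication; one that also spans 1.0 means the association cannot be distinguished from no effect at the available sample size.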
[Figure: The BioVU model. Panel labels: John Doe; One way hash; A7CCF99DE65732….; eligible; Extract DNA; The “synthetic derivative” (SD)]
Searches conducted in BioVU (April-May 2009, in preparation for the PGPop submission)
Phenotype | Location in EMR searched | Requesting investigator / site | Number | % women | % African-American
BioVU (May 21, 2009) | | | 56,907 | 58.1 | 9.9
warfarin | medications | PAT | 4,482 | 48.3 | 9.5
5 most commonly prescribed statins | medications | Krauss/PARC | 10,216 | 46.0 | 10.9
clopidogrel | medications | Shuldiner/PAPI, Limdi/UAB | 4,407 | 42.4 | 10.1
prednisone or dexamethasone | medications | Relling/PAAR4KIDS | 10,584 | 58.7 | 12.1
metformin + Type 2 diabetes + HgA1c | Complex NLP-based search | Giacomini/PMT | 1,794 | 55.7 | 21.3
rheumatoid arthritis | Complex NLP-based search | Plenge/MGH | 1,777 | 77.1 | 9.5
asthma | ICD9 code | Weiss/PHAT | 3,916 | 70.8 | 17.5
hypertension | ICD9 code | Johnson/PEAR | 21,102 | 52.0 | 14.3
Zyban, Wellbutrin, bupropion, Chantix, Varenicline in medications OR “nicotine replacement” in the history and physical, problem list or discharge summary | | Tyndale/PNAT | 3,855 | 70.2 | 8.3
NLP: Natural Language Processing; EMR: Electronic Medical Record
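The last search in the table is a Boolean query: any of five drug names in the medication list, OR a phrase in one of three note sections. A minimal sketch of that kind of simple automated search is shown below. The record layout (`medications`, `history_and_physical`, `problem_list`, `discharge_summary`) and the function name are hypothetical and do not reflect the actual Synthetic Derivative schema or search tools.

```python
# Search terms taken from the table row; everything else is an assumption.
MEDICATION_TERMS = ("zyban", "wellbutrin", "bupropion", "chantix", "varenicline")
NOTE_SECTIONS = ("history_and_physical", "problem_list", "discharge_summary")
NOTE_PHRASE = "nicotine replacement"


def matches_query(record: dict) -> bool:
    """True if any drug term appears in medications OR the phrase appears in a note."""
    meds = [m.lower() for m in record.get("medications", [])]
    med_hit = any(term in med for med in meds for term in MEDICATION_TERMS)
    note_hit = any(
        NOTE_PHRASE in record.get(section, "").lower() for section in NOTE_SECTIONS
    )
    return med_hit or note_hit


# Made-up de-identified records, for illustration only.
records = [
    {"id": "A", "medications": ["Bupropion 150 mg"], "problem_list": ""},
    {"id": "B", "medications": ["metformin"],
     "discharge_summary": "Started nicotine replacement."},
    {"id": "C", "medications": ["lisinopril"], "problem_list": "hypertension"},
]
hits = [r["id"] for r in records if matches_query(r)]
```

Plain substring matching of this kind ignores negation and context (for example, "declined nicotine replacement" would still match), which is one reason investigator curation is generally needed before cases and controls are finalized.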
PGPop goals
PGPop will be managed by a Steering Committee that will include representation from the participating nodes. Our initial tasks will be (1) organization of the resource and (2) execution of a demonstration project that will establish mechanisms for access to samples from multiple resource nodes.
We anticipate that mechanisms to access PGPop will be similar to those being established for access to other PGRN resources. This will likely involve an application process to be reviewed by components of the PGRN and by PGPop. There will be costs associated with accessing the samples, which remain to be determined.
PGPop goals are:
1. Establish the infrastructure to enable rapid access to well-phenotyped samples across nodes
• Catalog resource components
• Facilitate access to cases and controls, and ultimately samples
• Coordinate methods to define phenotypes across nodes
2. Undertake a demonstration project across nodes in Year 01
3. Deploy the resource for pharmacogenomic studies proposed by PGRN sites
• The Steering Committee and PGRN will receive applications and decide on scientific merit. The Steering Committee will establish which PGPop node(s) can and wish to collaborate on a given project. Any single PGRN center could interact individually with any participating node. We anticipate that PGPop would support 1-2 projects/year.
4. Evaluate best practices and models for using large resources for pharmacogenomic science