Whole Genome Regression using Bayesian Lasso

Upload: jinseob-kim

Post on 28-Nov-2014

DESCRIPTION

Whole Genome Regression : Lasso, Ridge, Bayesian Lasso

TRANSCRIPT

Page 1: Whole Genome Regression using Bayesian Lasso

Limitation of GWAS or Linkage analysis / Introduction of WGP

Lasso estimation / Bayesian inference of Lasso

Whole Genome Prediction Using Penalized Regression: Bayesian Lasso

Jinseob Kim, MD, MPH

GSPH, SNU

February 27, 2014

Jinseob Kim, MD, MPH Whole Genome Prediction Using Penalized Regression

Page 2: Whole Genome Regression using Bayesian Lasso

Contents

1 Limitation of GWAS or Linkage analysis

2 Introduction of WGP: Goals

3 Lasso estimation

4 Bayesian inference of Lasso

Page 3: Whole Genome Regression using Bayesian Lasso

Limitation

1 GWAS & linkage analysis: no consistent results → poor prediction.

2 Complex traits: driven by the overall effect of many loci (e.g., cardiovascular disease, cancer).

Page 4: Whole Genome Regression using Bayesian Lasso

Example of GWAS

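In GWAS it is not possible to include many pieces of genetic information in a single model.

1 One trait vs. one locus → a test statistic is computed for every SNP, so there are as many statistics as SNPs.

2 However, because all SNPs have to be considered, a multiple-comparison p-value threshold is used (e.g., a p-value cutoff of $5 \times 10^{-8}$).

3 A model is then built from the significant SNPs only, or their information is combined via a Genetic Risk Score.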
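
The three steps above can be sketched in a few lines of Python. This is only an illustration: the function names, p-values and genotypes are made up, the score is a plain unweighted sum of allele counts, and the reading of the $5 \times 10^{-8}$ cutoff as a Bonferroni correction of 0.05 over about one million tests is an assumption, not something the slides state.

```python
# Hypothetical illustration: all names and numbers are made up,
# except the 5e-8 cutoff quoted on the slide.

def bonferroni_cutoff(alpha, n_tests):
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def genetic_risk_score(allele_counts, weights=None):
    """Sum of allele counts (0/1/2), optionally weighted per SNP."""
    if weights is None:
        weights = [1.0] * len(allele_counts)
    return sum(w * g for w, g in zip(weights, allele_counts))

# Assumption: 0.05 spread over one million tests gives the 5e-8 cutoff.
cutoff = bonferroni_cutoff(0.05, 1_000_000)

# Made-up per-SNP p-values and one person's allele counts.
p_values = [3e-9, 0.02, 4e-8, 0.6]
genotype = [2, 1, 0, 1]

# Keep only the SNPs that pass the cutoff, then score the person on those SNPs.
significant = [i for i, p in enumerate(p_values) if p < cutoff]
score = genetic_risk_score([genotype[i] for i in significant])
```

Only two of the four SNPs survive the cutoff, and the score discards everything the other SNPs might contribute.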

Page 5: Whole Genome Regression using Bayesian Lasso

Problem

1 Multiple comparison → loss of power.

2 Each SNP is analyzed against the trait one at a time → LD information is ignored.

3 What is a Genetic Risk Score, really? It is an imprecise index.

Page 6: Whole Genome Regression using Bayesian Lasso

Why this problem?

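Why not simply put every SNP into one regression?

1 Multicollinearity issue → LD: SNPs carry similar allele information.

2 The n < p issue: when there are more variables (SNPs) than subjects, the regression coefficients cannot be estimated.

The variance of the regression coefficients becomes too large, so estimation is not possible.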
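
A tiny numerical sketch of the n < p issue, assuming two subjects and three made-up SNPs: the matrix $X^T X$ that ordinary least squares has to invert is singular. Adding a positive constant to its diagonal, which is what the ridge penalty on the later slides amounts to, makes it invertible again. The helper names and data are hypothetical.

```python
def gram(X):
    """Return X^T X for a row-major list-of-lists matrix X."""
    p = len(X[0])
    return [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]

def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

# Two subjects (rows), three SNPs coded as allele counts (columns): n = 2 < p = 3.
X = [[0, 1, 2],
     [1, 1, 0]]

G = gram(X)
det_ols = det3(G)  # zero determinant: no unique least-squares solution

# Add lam to the diagonal (the ridge-type fix) and recompute the determinant.
lam = 0.5
G_ridge = [[G[a][b] + (lam if a == b else 0.0) for b in range(3)] for a in range(3)]
det_ridge = det3(G_ridge)
```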

Page 7: Whole Genome Regression using Bayesian Lasso

Why?

Because we refused to give up the unbiasedness of the β estimator. Variance-bias tradeoff!

Figure: Summary of variance-bias tradeoff, panels (a) and (b)

Page 8: Whole Genome Regression using Bayesian Lasso

Variance-bias tradeoff

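Let $Y = f(x) + \varepsilon$ with $\varepsilon \sim N(0, \sigma_e^2)$, and let $\hat{f}$ be an estimate of $f$. Then

$$\mathrm{Err}(x) = E\left[(Y - \hat{f}(x))^2\right] \tag{1}$$

$$\mathrm{Err}(x) = \left(E[\hat{f}(x) - f(x)]\right)^2 + E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right] + \sigma_e^2 \tag{2}$$

$$\mathrm{Err}(x) = \mathrm{Bias}^2 + \mathrm{Variance} + \text{Irreducible error} \tag{3}$$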
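
As a worked instance of the decomposition, consider estimating a mean by a shrunken sample mean, c times the average of n draws. This toy setting is not from the slides, and the irreducible term is left out because a parameter is being estimated rather than a new Y predicted. The bias and variance have closed forms, so no simulation is needed; all numbers are made up.

```python
def decompose(c, mu, sigma2, n):
    """Bias^2 and variance of the estimator c * (sample mean of n draws).

    The draws have true mean mu and variance sigma2.
    """
    bias2 = ((c - 1.0) * mu) ** 2
    variance = c ** 2 * sigma2 / n
    return bias2, variance

mu, sigma2, n = 1.0, 4.0, 4

bias2_u, var_u = decompose(1.0, mu, sigma2, n)  # unbiased sample mean
bias2_s, var_s = decompose(0.5, mu, sigma2, n)  # shrunken toward zero

err_unbiased = bias2_u + var_u
err_shrunken = bias2_s + var_s
```

With these numbers the shrunken estimator accepts a squared bias of 0.25 but cuts the variance from 1.0 to 0.25, so its total error is half that of the unbiased one.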

Page 9: Whole Genome Regression using Bayesian Lasso

Goals

Core context of WGP

Give up the requirement that the estimator of β be unbiased!

1 WGP can use all available markers to regress phenotype onto genomic information.

Ridge regression

Lasso (Least absolute shrinkage and selection operator)

Page 10: Whole Genome Regression using Bayesian Lasso

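Goals

1 It is enough to understand the core principles and ideas of the Lasso and Bayesian inference.

2 Be able to prepare data for use with a Lasso package.

3 Automate the analysis as much as possible → produce tables and figures directly.

4 Input the data and the phenotype → get tables and figures ready for a paper.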

Page 11: Whole Genome Regression using Bayesian Lasso

Ridge VS Lasso

Ridge regression

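$$\text{minimize } (y - X\beta)^T (y - X\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} \beta_j^2 \le t$$

$$\leftrightarrow \quad \text{minimize } (y - X\beta)^T (y - X\beta) + \lambda \sum_{j=1}^{p} \beta_j^2 \tag{4}$$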

Lasso

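$$\text{minimize } (y - X\beta)^T (y - X\beta) \quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \le t$$

$$\leftrightarrow \quad \text{minimize } (y - X\beta)^T (y - X\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \tag{5}$$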

Page 12: Whole Genome Regression using Bayesian Lasso

Ridge VS Lasso(2)

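1 Both methods shrink the β values toward 0; ridge makes them small, while the Lasso sets many of them exactly to 0. This resolves multicollinearity and reflects LD information.

2 Square ($\beta^2$) vs. Abs ($|\beta|$)

3 0.04 vs. 0.2: for a small coefficient such as β = 0.2, the squared penalty (0.04) is much smaller than the absolute-value penalty (0.2), so the absolute value pushes small β values to 0 more readily.

4 The absolute value is the stronger constraint, i.e., it sends more β values to 0.

5 The Lasso sends more β values to 0 than ridge does.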
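
The difference can be made concrete in the simplest possible case, assuming a single predictor scaled so that $x^T x = 1$. The residual sum of squares is then $(b - \beta)^2$ plus a constant, where $b = x^T y$ is the least-squares estimate, and both penalized problems have closed-form minimizers. The function names and the values of b below are made up for illustration.

```python
def ridge_1d(b, lam):
    """Minimizer of (b - beta)^2 + lam * beta^2: proportional shrinkage."""
    return b / (1.0 + lam)

def lasso_1d(b, lam):
    """Minimizer of (b - beta)^2 + lam * |beta|: soft-thresholding at lam / 2."""
    if b > lam / 2.0:
        return b - lam / 2.0
    if b < -lam / 2.0:
        return b + lam / 2.0
    return 0.0

lam = 0.5
ols = [0.2, -0.1, 2.0]  # made-up least-squares estimates
ridge = [ridge_1d(b, lam) for b in ols]
lasso = [lasso_1d(b, lam) for b in ols]
```

Ridge scales every coefficient by the same factor and never reaches 0, whereas the Lasso sets the two small coefficients to exactly 0 and keeps the large one at 1.75.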

Page 13: Whole Genome Regression using Bayesian Lasso

Ridge VS Lasso(3)

Page 14: Whole Genome Regression using Bayesian Lasso

Choosing λ

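K-fold cross validation: hold out k samples from the data, fit the model on the remaining n − k, apply it to the k held-out samples to compute the error, and average these errors over all folds; this average is called the CV error. Choose the λ that minimizes the mean CV error.

$$(\text{CV error})^{(\lambda)} = E\left((\text{CV error})^{(\lambda)}_k\right) \tag{6}$$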
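
A minimal sketch of this procedure, assuming a one-predictor ridge fit without intercept so that the whole thing stays in plain Python. The data, the λ grid and the fold assignment (every fifth sample) are all made up; a real analysis would rely on a package's built-in cross-validation.

```python
def fit_ridge_1d(xs, ys, lam):
    """One-predictor ridge without intercept: beta = sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def cv_error(xs, ys, lam, n_folds):
    """Mean over folds of the held-out mean squared error."""
    n = len(xs)
    fold_errors = []
    for k in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == k]
        train_idx = [i for i in range(n) if i % n_folds != k]
        beta = fit_ridge_1d([xs[i] for i in train_idx],
                            [ys[i] for i in train_idx], lam)
        mse = sum((ys[i] - beta * xs[i]) ** 2 for i in test_idx) / len(test_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / n_folds

# Made-up data, roughly y = 2x plus noise.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [-4.3, -2.6, -2.4, -0.7, 0.3, 0.8, 2.5, 2.7, 4.2, 4.6]

# Evaluate the CV error on a small grid and keep the lambda with the smallest value.
grid = [0.0, 0.1, 1.0, 10.0]
errors = {lam: cv_error(xs, ys, lam, n_folds=5) for lam in grid}
best_lam = min(grid, key=lambda lam: errors[lam])
```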

Page 15: Whole Genome Regression using Bayesian Lasso

10 fold CV

Figure : 10 fold CV

Page 16: Whole Genome Regression using Bayesian Lasso

Bayesian inference

Introduction → see the ThinkBayes lecture notes.

Page 17: Whole Genome Regression using Bayesian Lasso

Lasso → Bayesian Lasso

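$\beta_i \mid \sigma^2 \sim \frac{\lambda}{2\sigma} e^{-\lambda|\beta_i|/\sigma}$ : Laplace prior

1 The Laplace prior assigns more weight to regions near zero than the normal prior.

2 Interpreted as a mixture of hierarchical priors (Normal + exponential):

$$\frac{a}{2} e^{-a|z|} = \int_0^\infty \frac{1}{\sqrt{2\pi s}}\, e^{-z^2/(2s)}\, \frac{a^2}{2}\, e^{-a^2 s/2}\, ds, \quad a > 0$$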
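
The mixture identity can be checked numerically. The sketch below assumes a = 1 and z = 1 and integrates the right-hand side with a simple midpoint rule; the function names, the upper limit of 60 and the number of steps are arbitrary choices for this illustration.

```python
import math

def laplace_density(z, a):
    """Left-hand side: (a / 2) * exp(-a * |z|)."""
    return 0.5 * a * math.exp(-a * abs(z))

def mixture_integrand(s, z, a):
    """N(z | 0, s) times the exponential mixing density (a^2 / 2) * exp(-a^2 * s / 2)."""
    normal = math.exp(-z * z / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)
    mixing = 0.5 * a * a * math.exp(-0.5 * a * a * s)
    return normal * mixing

def mixture_density(z, a, upper=60.0, steps=200_000):
    """Right-hand side by the midpoint rule on (0, upper).

    For a = 1 the part of the integral beyond s = 60 is negligible.
    """
    h = upper / steps
    return h * sum(mixture_integrand((i + 0.5) * h, z, a) for i in range(steps))

lhs = laplace_density(1.0, 1.0)
rhs = mixture_density(1.0, 1.0)
```

Both sides agree to several decimal places, which is what allows the Laplace prior to be written as a normal prior whose variance has an exponential prior.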

Page 18: Whole Genome Regression using Bayesian Lasso

Laplace prior

Figure : Normal VS Laplace prior

Page 19: Whole Genome Regression using Bayesian Lasso

Example: Continuous case

Whole model

$$\mu_i = \mu + \sum_{j=1}^{J} x_{ij}\gamma_j + \sum_{l=1}^{L} z_{il}\beta_l \tag{7}$$

Likelihood

$$p(y_i \mid \mu_i, \sigma^2) = (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left\{-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right\} \tag{8}$$

Page 20: Whole Genome Regression using Bayesian Lasso

Likelihood

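$$p(y \mid \mu, \gamma, \beta, \sigma^2) = \prod_{i} N\left(y_i \,\middle|\, \mu + \sum_{j=1}^{J} x_{ij}\gamma_j + \sum_{l=1}^{L} z_{il}\beta_l,\ \sigma^2\right) \tag{9}$$

$y = \{y_i\}$, $\gamma = \{\gamma_j\}$, $\beta = \{\beta_l\}$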
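
A short sketch of how the linear predictor in (7) and the likelihood in (9) translate into code, working on the log scale. The function names and the toy data (two subjects, one covariate, two markers) are hypothetical.

```python
import math

def linear_predictor(mu, x_row, gamma, z_row, beta):
    """mu_i = mu + sum_j x_ij * gamma_j + sum_l z_il * beta_l, as in (7)."""
    return (mu
            + sum(x * g for x, g in zip(x_row, gamma))
            + sum(z * b for z, b in zip(z_row, beta)))

def log_likelihood(y, X, Z, mu, gamma, beta, sigma2):
    """Log of the product of normal densities in (9)."""
    total = 0.0
    for y_i, x_row, z_row in zip(y, X, Z):
        mu_i = linear_predictor(mu, x_row, gamma, z_row, beta)
        total += (-0.5 * math.log(2.0 * math.pi * sigma2)
                  - (y_i - mu_i) ** 2 / (2.0 * sigma2))
    return total

# Made-up toy data: two subjects, one covariate, two markers.
y = [1.0, 2.0]
X = [[1.0], [0.0]]
Z = [[1, 0], [0, 2]]
ll = log_likelihood(y, X, Z, mu=0.5, gamma=[0.5], beta=[0.0, 0.75], sigma2=1.0)
```

Here both fitted means equal the observed values, so only the normalizing constants remain in the log-likelihood.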

Page 21: Whole Genome Regression using Bayesian Lasso

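Prior construction: hierarchical model

1 Intercept ($\mu$) & sex, smoking, BMI ($\gamma$): vague (non-informative) prior

2 Residual variance, the standard assumption of Bayesian regression: scaled-inverse Chi-square density $\chi^{-2}(\sigma^2 \mid df, S)$

3 Marker effects, Bayesian Lasso:

$$p(\beta, \tau^2, \lambda^2 \mid H, \sigma^2) = p(\beta \mid \tau^2, \sigma^2)\, p(\tau^2 \mid \lambda^2)\, p(\lambda^2 \mid r, s) = \left\{\prod_{l=1}^{L} N(\beta_l \mid 0, \tau_l^2 \sigma^2)\, \mathrm{Exp}(\tau_l^2 \mid \lambda^2)\right\} G(\lambda^2 \mid r, s)$$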
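
The hierarchy for one marker effect can also be read as a sampling recipe: draw $\tau_l^2$ from the exponential, then $\beta_l$ from the normal. The sketch below holds λ and σ fixed at made-up values and assumes the exponential has rate $\lambda^2/2$, matching the mixing density in the Laplace identity on the earlier slide; the slides write $\mathrm{Exp}(\tau_l^2 \mid \lambda^2)$ without stating the parameterization.

```python
import random

def sample_beta(lam, sigma, rng):
    """Draw tau^2 ~ Exp(rate = lam^2 / 2), then beta ~ N(0, tau^2 * sigma^2)."""
    tau2 = rng.expovariate(lam * lam / 2.0)
    return rng.gauss(0.0, sigma * tau2 ** 0.5)

rng = random.Random(2014)   # fixed seed so the run is reproducible
lam, sigma = 2.0, 1.5       # made-up values, held fixed for the illustration
draws = [sample_beta(lam, sigma, rng) for _ in range(100_000)]

# The Laplace density (lam / (2 sigma)) * exp(-lam |beta| / sigma) has E|beta| = sigma / lam.
mean_abs = sum(abs(b) for b in draws) / len(draws)
expected = sigma / lam
```

The sample mean of $|\beta|$ should land close to $\sigma/\lambda$, as expected if the two-stage draw is marginally Laplace.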

Page 22: Whole Genome Regression using Bayesian Lasso

Prior

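$$p(\mu, \gamma, \sigma^2, \beta, \tau^2, \lambda^2 \mid H) \propto \chi^{-2}(\sigma^2 \mid df, S) \left\{\prod_{l=1}^{L} N(\beta_l \mid 0, \tau_l^2 \sigma^2)\, \mathrm{Exp}(\tau_l^2 \mid \lambda^2)\right\} G(\lambda^2 \mid r, s) \tag{10}$$

$H = \{df = 5,\ S = 170,\ \delta = 1 \times 10^{4},\ s = 2\}$: hyperparameter values chosen so that the priors have little influence on predictions.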

Page 23: Whole Genome Regression using Bayesian Lasso

Posterior

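$$p(\mu, \gamma, \sigma^2, \beta, \tau^2, \lambda^2 \mid y) \propto \prod_{i} N\left(y_i \,\middle|\, \mu + \sum_{j=1}^{J} x_{ij}\gamma_j + \sum_{l=1}^{L} z_{il}\beta_l,\ \sigma^2\right) \times \chi^{-2}(\sigma^2 \mid df, S) \left\{\prod_{l=1}^{L} N(\beta_l \mid 0, \tau_l^2 \sigma^2)\, \mathrm{Exp}(\tau_l^2 \mid \lambda^2)\right\} G(\lambda^2 \mid r, s) \tag{11}$$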

Page 24: Whole Genome Regression using Bayesian Lasso

Implementation

1 BLR (Bayesian Linear Regression) package in R

2 bayesm, splines and SuppDists for the sampler

→ BGLR (Bayesian Generalized Linear Regression) package in R

Page 25: Whole Genome Regression using Bayesian Lasso

Goodness of fit, DIC

Page 26: Whole Genome Regression using Bayesian Lasso

Hands-on practice

BGLR package practice: continuous trait (TG) & binomial trait (hyperTG)

Page 27: Whole Genome Regression using Bayesian Lasso

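Cautions

1 Specify each variable's type (continuous vs. categorical) in advance.

2 Separate the variables that go into the Lasso (genotypes) from the ordinary covariates (e.g., age).

3 In theory, all x variables entering the Lasso should be standardized, so that the β values are measured on an equal footing. However, allele counts are always 0, 1, or 2, so this does not matter here.

4 There must be no missing values. GWAS software drops missing values and computes automatically, but BGLR does not. Moreover, because this is a prediction model, it is all the more important that the x values have no missing entries: use imputation or the mean allele count.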
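
The mean-allele-count option in item 4 could look like the sketch below. It assumes genotypes are stored as rows of allele counts with `None` marking a missing call, and that every SNP has at least one observed value; the function name and the data are made up.

```python
def impute_mean_allele_count(genotypes):
    """Replace None in each SNP column by that column's mean observed allele count."""
    n_snps = len(genotypes[0])
    means = []
    for j in range(n_snps):
        observed = [row[j] for row in genotypes if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if row[j] is None else row[j] for j in range(n_snps)]
            for row in genotypes]

# Three subjects (rows), two SNPs (columns); None marks a missing genotype.
raw = [[0, 1],
       [None, 2],
       [2, None]]
complete = impute_mean_allele_count(raw)
```

Imputed entries are generally not integers (1.5 for the second SNP here), which is fine for a regression on allele counts.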

Page 28: Whole Genome Regression using Bayesian Lasso

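Cautions (2)

If you plan to validate the model:

1 Build the prediction model using only the SNPs common to both sets.

2 The reference allele used for the allele counts must be the same in both sets.

3 The trait must be available in both sets.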
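
Items 1 and 2 can be sketched as a small harmonization step. The data layout (dictionaries keyed by SNP id), the function name and the SNP ids are hypothetical. The sketch also assumes biallelic SNPs where a differing reference simply means the other allele was counted, so a count g becomes 2 − g; strand problems are not handled.

```python
def harmonize(train, test, ref_train, ref_test):
    """Keep SNPs present in both sets; flip test counts where the reference allele differs.

    train, test: dict mapping SNP id -> list of allele counts (0/1/2) per subject.
    ref_train, ref_test: dict mapping SNP id -> allele that was counted.
    """
    common = sorted(set(train) & set(test))
    train_out, test_out = {}, {}
    for snp in common:
        train_out[snp] = list(train[snp])
        if ref_train[snp] == ref_test[snp]:
            test_out[snp] = list(test[snp])
        else:
            # The other allele was counted in the test set: g becomes 2 - g.
            test_out[snp] = [2 - g for g in test[snp]]
    return train_out, test_out

# Made-up SNP ids, counts and reference alleles.
train = {"rs1": [0, 1, 2], "rs2": [1, 1, 0], "rs3": [2, 2, 1]}
test = {"rs1": [2, 0], "rs3": [0, 1], "rs4": [1, 1]}
ref_train = {"rs1": "A", "rs2": "C", "rs3": "G"}
ref_test = {"rs1": "A", "rs3": "T", "rs4": "C"}

train_h, test_h = harmonize(train, test, ref_train, ref_test)
```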

Page 29: Whole Genome Regression using Bayesian Lasso

HP: 010-9192-5385

E-mail: [email protected]
