
Post on 12-Jan-2016


Page 1: Data Standards and Statistical Issues for Immunogenetic Data Richard M. Single Associate Professor of Statistics Department of Mathematics & Statistics

Data Standards and Statistical Issues for Immunogenetic Data

Richard M. Single

Associate Professor of Statistics

Department of Mathematics & Statistics

University of Vermont

Page 2

Outline

• HLA nomenclature: why it matters for analysis and interpretation
  – Challenges for combining HLA data from different sources
• Data standardization to facilitate meta-analyses and reproducibility
  – Developing a community standard for HLA & KIR data reporting
• Overview of HLA data curation & ambiguity resolution
  – Example, ImmPort, next steps: GL Strings & QR codes
• HLA (chrom 6) and KIR (chrom 19) interactions
  – A brief overview
• HLA and KIR: population-level evidence of co-evolution
  – Population-genetic evidence of co-evolution
  – Randomization tests and genomic controls

Page 3

HLA Nomenclature and why it matters

MHC

Page 4

HLA Nomenclature and why it matters

• Challenges for HLA data management and analysis
  – The HLA genes are very polymorphic;
  – HLA nomenclature is complicated;
  – There are multiple ways to generate HLA data;
  – All common typing systems generate ambiguous data;
  – There are multiple ways to report alleles and ambiguities.

These issues make meta-analyses of HLA data from different sources very difficult.

Page 6

Structure of HLA molecules

[Figure: HLA class I and HLA class II molecules, each presenting a peptide fragment to a TCR. TCR = T-cell receptor; β2-m = β2-microglobulin.]

• HLA molecules are cell-surface proteins that present peptide fragments to T-cells
• They bind specific sets of peptides based on structure

Page 7

HLA-C binding pocket

[Figure: ribbon drawing of the HLA-C binding pocket with residue positions 7, 73, 77, 80, and 90 labelled. Ribbon drawing from Hedrick et al. PNAS, 88, 5897-5901.]

Page 8

HLA classical loci and polymorphism

[Figure: map of the HLA classical loci. Class II loci: DP, DQ, DR (genes B1/A1, B1/A1, B1/A); class I loci: B, C, A. Distances shown on the map: 50 kb, 850 kb, 100 kb, 1270 kb; 400 kb, 250 kb.]

Protein-level allele numbers: 16122211 1280 2980 31216 19153

IMGT/HLA Database Release 3.12.0, April 17, 2013

Page 9

HLA Allele Nomenclature

Example: HLA-A*24:02:01:02L

• Locus: HLA-A
• Field 1 ("2-digit"): serological level (where possible)
• Field 2 ("4-digit"): peptide level (amino acid difference)
• Field 3 ("6-digit"): nucleotide level (silent, i.e., synonymous substitutions)
• Field 4 ("8-digit"): intron level (3' or 5' polymorphism)
• Expression suffix: N = null, L = low, S = soluble, …

• For most analyses, we want to distinguish among unique peptide sequences, i.e., the 2-field ("4-digit") level
• This level of resolution treats alleles with the same peptide sequence for exons 2 & 3 (class I) or exon 2 (class II) as equivalent ("binning" alleles)
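As a rough illustration of the field structure, the sketch below splits a colon-delimited allele name into locus, fields, and expression suffix, and truncates it to the 2-field level. The function names and the regular expression are assumptions made for this example; dropping the suffix on truncation is a simplification, not an official rule.

```python
import re

# Assumed pattern for names such as "HLA-A*24:02:01:02L"; the "HLA-" prefix is optional.
ALLELE_RE = re.compile(
    r"^(?:HLA-)?(?P<locus>[A-Z0-9]+)\*(?P<fields>\d+(?::\d+){0,3})(?P<suffix>[A-Z]?)$"
)

def parse_allele(name):
    """Split an allele name into (locus, list of fields, expression suffix)."""
    match = ALLELE_RE.match(name.strip())
    if match is None:
        raise ValueError(f"not a colon-delimited HLA allele name: {name!r}")
    return match["locus"], match["fields"].split(":"), match["suffix"]

def to_two_fields(name):
    """Truncate an allele name to the peptide ("4-digit") level.

    Simplification: any expression suffix (N, L, S, ...) is dropped.
    """
    locus, fields, _suffix = parse_allele(name)
    return f"{locus}*{':'.join(fields[:2])}"

print(parse_allele("HLA-A*24:02:01:02L"))
print(to_two_fields("HLA-A*24:02:01:02L"))
```

Alleles that share the same truncated name would then be treated as equivalent in a peptide-level analysis.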

Page 10

HLA Nomenclature & Polymorphism

• HLA alleles are defined by a "patchwork" of sequence-level polymorphisms.
• Most typing systems do not interrogate the same set of polymorphisms
  – e.g., DRB1*14:01:01 vs. *14:54 differ only in exon 3
• There is currently no simple way to identify which alleles could (or could not) have been detected by a given typing system.

Page 11

Distinctive Geographical Distribution of subtypes of HLA-DRB1*08


Page 13

Data Standardization to facilitate Meta-analyses

Data standardization methods …
• allow re-evaluation of raw data in future contexts
• allow information/results to be combined across datasets more easily

Recommended practices:
• Document the typing method (SSOP, SSP, SBT, …), version, exons interrogated, and the set of detectable alleles
• Perform data validation by checking against IMGT & IPD-KIR allele lists
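The validation step can be pictured as a simple membership check against a reference allele list. This is a minimal sketch: `validate_alleles` and the four-entry `KNOWN` set are hypothetical stand-ins, and a real check would load the full allele list for a stated database release.

```python
def validate_alleles(reported, known_alleles):
    """Partition reported allele names into those found in the reference list and those not."""
    valid, unknown = [], []
    for allele in reported:
        (valid if allele in known_alleles else unknown).append(allele)
    return valid, unknown

# Hypothetical excerpt of a reference allele list, for illustration only.
KNOWN = {"A*02:01", "A*02:09", "B*27:05", "B*44:02"}

valid, unknown = validate_alleles(["A*02:01", "B*27:05", "B*99:99"], KNOWN)
print("valid:", valid)
print("needs review:", unknown)
```

Recording which database release the list came from alongside the data is what makes later re-evaluation possible.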

Page 14

Extending STREGA to Immunogenomic Studies

• The STrengthening the REporting of Genetic Association studies (STREGA) statement provides community-based data reporting and analysis standards for genomic disease association studies

• The IDAWG (immunogenomics.org) has proposed an extension of STREGA: STrengthening the REporting of Immunogenomic Studies (STREIS)

Page 15

From STREGA to STREIS

Extensions to the STREGA guidelines for immunogenomic data include:

• Describing the system(s) used to store, manage, and validate genotype and allele data
• Documenting all methods applied to resolve ambiguity
• Defining any codes used to represent ambiguities
• Describing any binning or combining of alleles into common categories
• Avoiding the use of subjective terms (e.g., "high-resolution typing") that may change over time


Page 17

Allele-level Ambiguity

Ambiguous alleles result from polymorphisms outside of assessed regions:
• outside of exons 2 & 3, or
• in sections of those exons that were not interrogated.

Ambiguous allele set:
A*0201/0209/0243N/0266/0275/0283N/0289

Group codes ("g"-codes) for alleles identical in exons 2 & 3 for class I, or exon 2 for class II:
A*0201/0209/0243N/0266/0275/0283N/0289 = "A020101g"

NMDP ambiguity codes for 4-digit non-null alleles:
A*0201/0209 = A*02AF
A*0201/0209/0266 = A*02AJEY
A*0201/0209/0266/0275/0289 = A*02BSFJ
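The NMDP codes above behave like a lookup table from a letter code to a list of allele suffixes. The sketch below expands a coded typing into its allele list. `NMDP_CODES` contains only the three codes shown on the slide (the real table is far larger), and `expand_nmdp` is a name chosen for this example.

```python
# Only the three codes shown on the slide; this is not the full NMDP code table.
NMDP_CODES = {
    "AF": ["01", "09"],
    "AJEY": ["01", "09", "66"],
    "BSFJ": ["01", "09", "66", "75", "89"],
}

def expand_nmdp(typing):
    """Expand e.g. 'A*02AF' into ['A*0201', 'A*0209'] (older digit-only format)."""
    locus, rest = typing.split("*")
    family, code = rest[:2], rest[2:]  # first two digits are the allele family
    return [f"{locus}*{family}{suffix}" for suffix in NMDP_CODES[code]]

print(expand_nmdp("A*02AF"))
print(expand_nmdp("A*02BSFJ"))
```

Because the code only makes sense together with the table that defines it, any report using such codes needs to state which table version was used.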

Page 18

Genotype-level Ambiguity

Ambiguous genotypes result from an inability to establish the phase of individual polymorphisms or entire exons. Different combinations of alleles can lead to the same typing result.

Example: a typing result for one individual that could be explained by any of four possible genotypes at HLA-B.

              Allele 1   Allele 2
Genotype 1    2705       4402
Genotype 2    2705       4411
Genotype 3    2709       4402
Genotype 4    2709       4411

B*2705 + B*4402 or B*2705 + B*4411 or B*2709 + B*4402 or B*2709 + B*4411

Most analytical methods require a single genotype call for each individual sample.
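In this example the four candidate genotypes are every pairing of the two possible alleles on each chromosome, which can be enumerated as a Cartesian product. This is a sketch for this specific case only: real typing results do not always expand to a full cross product, and `possible_genotypes` is a hypothetical helper.

```python
from itertools import product

def possible_genotypes(allele1_options, allele2_options):
    """List every pairing of one allele from each option list as a sorted tuple."""
    return [tuple(sorted(pair)) for pair in product(allele1_options, allele2_options)]

genotypes = possible_genotypes(["B*2705", "B*2709"], ["B*4402", "B*4411"])
for number, genotype in enumerate(genotypes, start=1):
    print(f"Genotype {number}: {' + '.join(genotype)}")
```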

Page 19

Standardized Ambiguity Reduction

Sample #001

Genotype 1
  HLA-B allele 1: 2703, 270502, 270503, 270504, 270505, 270506, 270508, 2710, 2713, 2717
  HLA-B allele 2: 44020101, 44020102S, 440203, 4419N, 4423N, 4424, 4427, 4433
Genotype 2
  HLA-B allele 1: 2703, 270502, 270503, 270504, 270505, 270506, 270508, 2710, 2713, 2717
  HLA-B allele 2: 440202, 4411
Genotype 3
  HLA-B allele 1: 2709
  HLA-B allele 2: 44020101, 44020102S, 440203, 4419N, 4423N, 4424, 4427, 4433
Genotype 4
  HLA-B allele 1: 2709
  HLA-B allele 2: 440202, 4411

Reduction steps: peptide-level filtering, removing non-CWD alleles, binning alleles identical over exons 2 & 3

Result after these steps:
  HLA-B allele 1: 2703, 2705
  HLA-B allele 2: 4402

Final step: regional population-level frequency data, giving unambiguous data
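The first reduction steps can be sketched as set operations on the allele lists for Sample #001. Everything here is an assumption made for illustration: the `CWD` set is a hypothetical three-allele subset chosen so that the output matches the slide, the rule for keeping or dropping suffixes is a simplification, and the final frequency-based step is not shown.

```python
def to_four_digit(allele):
    """Truncate an older digit-string name (e.g. '44020101', '4419N') to 4 digits.

    Assumption: an expression suffix is kept only if the name was already 4-digit.
    """
    digits = allele.rstrip("NLSCAQ")
    suffix = allele[len(digits):]
    return digits[:4] + (suffix if len(digits) == 4 else "")

def reduce_ambiguity(alleles, cwd):
    """Bin to the peptide level, drop null alleles, and keep only CWD alleles."""
    binned = {to_four_digit(a) for a in alleles}
    return sorted(a for a in binned if not a.endswith("N") and a in cwd)

# Hypothetical CWD (common and well-documented) subset, chosen to reproduce the slide.
CWD = {"2703", "2705", "4402"}

allele1 = ["2703", "270502", "270503", "270504", "270505", "270506",
           "270508", "2710", "2713", "2717", "2709"]
allele2 = ["44020101", "44020102S", "440203", "4419N", "4423N",
           "4424", "4427", "4433", "440202", "4411"]

print(reduce_ambiguity(allele1, CWD))  # ['2703', '2705']
print(reduce_ambiguity(allele2, CWD))  # ['4402']
```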

Page 20

[Figure: the reduced result for the sample (2703, 2705 + 4402) resolved to 2705 + 4402.]

immunogenomics.org

Page 22

Genotype List (GL) Strings

• Use a hierarchical set of operators to describe the relationships between
  – alleles, lists of possible alleles, phased alleles, genotypes, lists of possible genotypes, and multilocus unphased genotypes,
  – without losing typing information or increasing typing ambiguity.
• Are proposed to replace NMDP codes

Milius et al. (2013) Tissue Antigens

Page 23

Genotype List (GL) Strings

• Example GL string for the genotype:

A*02:69 + A*23:30 or A*02:302 + A*23:26 or A*02:302 + A*23:39
and
B*44:02 + B*49:08
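The genotype above can be written with the GL String delimiters described by Milius et al.: "^" separates loci, "|" separates possible genotypes, "+" joins the two alleles of a genotype, "~" joins phased alleles, and "/" separates possible alleles. The sketch below encodes the example and expands it back into explicit genotypes per locus. The string itself is this example's encoding of the slide's genotype, `expand_locus` is a hypothetical helper, and the "~" operator is not handled.

```python
from itertools import product

def expand_locus(locus_gl):
    """Expand one locus of a GL String into a list of explicit allele pairs."""
    genotypes = []
    for genotype in locus_gl.split("|"):              # possible genotypes
        allele_lists = [chrom.split("/")               # possible alleles per chromosome
                        for chrom in genotype.split("+")]
        genotypes.extend(product(*allele_lists))
    return genotypes

gl_string = "A*02:69+A*23:30|A*02:302+A*23:26/A*23:39^B*44:02+B*49:08"
per_locus = [expand_locus(locus) for locus in gl_string.split("^")]

for genotypes in per_locus:
    print(" or ".join(" + ".join(pair) for pair in genotypes))
```

Splitting from the outermost delimiter inward mirrors the operator hierarchy, so no information about which alleles go together is lost in the round trip.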

Page 24

Resources for HLA Data Validation & Analysis

• Immunology Database and Analysis Portal (www.ImmPort.org), developed under the Bioinformatics Integration Support Contract (BISC) for NIH, NIAID, & DAIT (Division of Allergy, Immunology, and Transplantation)
  – Data validation pipeline
  – Analysis tools
  – Standardized ambiguity reduction tools
  – Data from a large number of immunogenomic studies
• ImmunoGenomics Data Analysis Working Group (www.immunogenomics.org, www.IgDAWG.org), an international collaborative group working to …
  – facilitate the sharing of immunogenomic data (HLA, KIR, etc.) and
  – foster consistent analysis and interpretation of immunogenomic data


Page 28

Killer cell Immunoglobulin-like Receptor (KIR)

• The KIR gene complex is located on Chromosome 19 (19q13.4)
• KIR are expressed on natural killer (NK) cells and a subset of T cells
• Certain HLA alleles serve as ligands for KIR

KIR Gene   Function     Ligand
2DL1       Inhibitory   HLA-C group 2
2DS1       Activating   HLA-C group 2
2DL2/3     Inhibitory   HLA-C group 1
2DS2       Activating   HLA-C group 1
3DL1       Inhibitory   HLA-Bw4
3DS1       Activating   HLA-Bw4

Page 29

KIR regulate NK cell activity

[Figure: two panels showing an NK cell, with an inhibitory KIR (iKIR) and an activating receptor (Act. rec.), facing a target cell.
 Dominant inhibition: normal cell; iKIR binds HLA and the activating receptor binds its ligand; no lysis (protection).
 Missing-self recognition: HIV+ targets; the activating receptor binds its ligand; lysis and cytokines.]

Page 30

HLA-C alleles can be divided into two groups based on the amino acid at position 80 (& 77), which determines KIR recognition

• HLA-C1 (Ser77, Asn80): Cw1, Cw3, Cw7, Cw8, Cw12, Cw13, Cw14; recognized by KIR2DL3/2DL2
• HLA-C2 (Asn77, Lys80): Cw2, Cw4, Cw5, Cw6, Cw15, Cw17; recognized by KIR2DL1

Both interactions lead to NK cell inhibition.
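For analysis, the grouping on this slide amounts to a lookup from HLA-C allotype to ligand group, and from ligand group to the inhibitory KIR that recognizes it. The sketch below copies the allotype lists from the slide; the names `C1_ALLOTYPES`, `C2_ALLOTYPES`, `INHIBITORY_KIR`, and `kir_ligand_group` are chosen for this example.

```python
# Allotype lists as given on the slide.
C1_ALLOTYPES = {"Cw1", "Cw3", "Cw7", "Cw8", "Cw12", "Cw13", "Cw14"}
C2_ALLOTYPES = {"Cw2", "Cw4", "Cw5", "Cw6", "Cw15", "Cw17"}

# Inhibitory KIR that recognize each ligand group, per the slide.
INHIBITORY_KIR = {"C1": "KIR2DL2/2DL3", "C2": "KIR2DL1"}

def kir_ligand_group(allotype):
    """Return 'C1' or 'C2' for an HLA-C allotype listed on the slide."""
    if allotype in C1_ALLOTYPES:
        return "C1"
    if allotype in C2_ALLOTYPES:
        return "C2"
    raise KeyError(f"allotype not listed: {allotype}")

for allotype in ("Cw4", "Cw7"):
    group = kir_ligand_group(allotype)
    print(allotype, group, INHIBITORY_KIR[group])
```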

Page 31

Bifurcation of HLA-B allotypes

HLA-B
• Bw4 (40%): KIR3DL1 ligands (KIR3DS1); subsets 80I and 80T
• Bw6 (60%): not a ligand for KIR


Page 33

KIR & HLA in 30 Global Populations

Page 34

Population-level evidence for Co-evolution & Natural Selection for KIR and HLA

• Several studies hypothesized selection for KIR that suit the locale-specific HLA repertoire.
• Disease association studies point to HLA-Bw4 alleles with isoleucine at position 80 ("Bw4-80I") as the strongest ligand for KIR3DS1

Page 35

Correlations between frequencies for KIR and HLA Ligands

Inhibitory KIR

• KIR2DL3 vs. HLA-C group 1: r = 0.184
• KIR3DL1 vs. HLA-Bw4: r = 0.426
• KIR2DL1 vs. HLA-C group 2: r = 0.046

Page 36

Correlations between frequencies for KIR and HLA Ligands

Activating KIR

• KIR3DS1 vs. HLA-Bw4: r = -0.632
• KIR2DS1 vs. HLA-C group 2: r = -0.478
• KIR2DS2 vs. HLA-C group 1: r = -0.371

Page 37

Correlations between frequencies for KIR and HLA Ligands

Activating KIR3DS1: subsets of Bw4 alleles based on amino acid position 80

• KIR3DS1 vs. HLA-Bw4: r = -0.632
• KIR3DS1 vs. HLA-Bw4-80I: r = -0.657
• KIR3DS1 vs. HLA-Bw4-80T: r = -0.190

Single et al., Nature Genetics

Page 38

Statistical Issues

• Challenges for these and other population studies
  – Demographic history shapes patterns of variation & can mimic the effects of selection.
  – Gene frequencies are not statistically independent among populations, due to shared demographic history.
• Ordinary Pearson correlation p-values assume independence among the observations.
• We constructed a randomization test to account for the demographic histories of the populations and focus on the genetic effect.

Page 39

Hypothesis Test for a Correlation Coefficient

Assessing the significance of ρ = cor(X, Y)

• Null Hypothesis: H0: ρ = 0
• Statistic: Pearson's correlation coefficient

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$

X     Y
4.1   4.9
8.6   5.4
2.3   4.2
5.4   7.4
9.2   8.8
7.7   6.7
6.4   8.8
4.3   5.1
7.6   9.4
3.4   5.3

$r_{\text{observed}} = .674$
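The observed value can be checked directly from the ten (X, Y) pairs on the slide. This is a plain-Python sketch of the Pearson formula; `pearson_r` is a name chosen for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [4.1, 8.6, 2.3, 5.4, 9.2, 7.7, 6.4, 4.3, 7.6, 3.4]
y = [4.9, 5.4, 4.2, 7.4, 8.8, 6.7, 8.8, 5.1, 9.4, 5.3]

print(round(pearson_r(x, y), 3))  # 0.674
```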

Page 40

Randomization Test

Population   HLA-B (1)   HLA-B (2)   B-grp (1)   B-grp (2)   HLA-C (1)   HLA-C (2)   C-grp (1)   C-grp (2)
Biaka        0702        1503        Bw6         Bw6         0202        0702        C2          C1
Biaka        0702        4403        Bw6         Bw4         0401        0702        C2          C1
Biaka        1302        3701        Bw4         Bw4         0202        0602        C2          C2
Biaka        4901        5301        Bw4         Bw4         0401        0701        C2          C1
Biaka        3701        3910        Bw4         Bw6         0202        1203        C2          C1
…            …           …           …           …           …           …           …           …

• Bw4 alleles: 1301, 1302, 1516, 1517, 2702, 2703, 2704, 2705, 3701, 3801, 3802, 4402, 4403, 4404, 4405, ...
• Bw6 alleles: 0702, 0705, 0799, 0801, 1401, 1402, 1403, 1501, 1502, 1503, 1504, 1506, 1507, 1508, 1510, ...

• Reassign Bw4/Bw6 status to simulate the null hypothesis
• Compute correlation of frequencies for KIR-3DS1 & reassigned HLA
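One way to read the two steps above is sketched below: shuffle the Bw4/Bw6 labels across alleles, recompute each population's total "Bw4" frequency, and compare the resulting correlations with the observed one. This is a sketch under stated assumptions, not the authors' implementation: the population frequencies are made up, the two-sided comparison and the (count + 1) / (n + 1) p-value are choices made here, and all function names are hypothetical.

```python
import random
from math import sqrt

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # A constant vector has no defined correlation; treat it as 0 in this sketch.
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def group_frequency(pop_allele_freqs, in_group):
    """Per population, total frequency of alleles labelled as ligands."""
    return [sum(f for a, f in pop.items() if in_group[a]) for pop in pop_allele_freqs]

def randomization_p_value(kir_freqs, pop_allele_freqs, in_group, n_perm=2000, seed=1):
    rng = random.Random(seed)
    alleles = sorted(in_group)
    labels = [in_group[a] for a in alleles]
    r_obs = pearson_r(kir_freqs, group_frequency(pop_allele_freqs, in_group))
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # reassign Bw4/Bw6 status across alleles
        shuffled = dict(zip(alleles, labels))
        r_perm = pearson_r(kir_freqs, group_frequency(pop_allele_freqs, shuffled))
        extreme += abs(r_perm) >= abs(r_obs)
    return r_obs, (extreme + 1) / (n_perm + 1)

# Made-up data: 5 populations, 3 Bw4 alleles (True) and 3 Bw6 alleles (False).
is_bw4 = {"1302": True, "2705": True, "4402": True,
          "0702": False, "0801": False, "1501": False}
pops = [
    {"1302": 0.05, "2705": 0.05, "4402": 0.10, "0702": 0.30, "0801": 0.25, "1501": 0.25},
    {"1302": 0.10, "2705": 0.10, "4402": 0.10, "0702": 0.30, "0801": 0.20, "1501": 0.20},
    {"1302": 0.10, "2705": 0.15, "4402": 0.15, "0702": 0.20, "0801": 0.20, "1501": 0.20},
    {"1302": 0.20, "2705": 0.10, "4402": 0.20, "0702": 0.20, "0801": 0.15, "1501": 0.15},
    {"1302": 0.20, "2705": 0.20, "4402": 0.20, "0702": 0.10, "0801": 0.10, "1501": 0.20},
]
kir3ds1 = [0.45, 0.40, 0.30, 0.25, 0.10]

r_obs, p_perm = randomization_p_value(kir3ds1, pops, is_bw4)
print(r_obs, p_perm)
```

Because the allele frequencies within each population stay fixed and only the ligand labels move, the population structure is preserved in every permuted data set.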

Page 41

Permutation Distribution

[Figure: density of the permutation distribution (x-axis: correlation; y-axis: density), with the observed value marked.]

KIR3DS1 – HLA-Bw4 correlation: r = -0.632, permutation p-value = 0.012

Page 42

Genomic Controls

• Empirical comparisons based on genomic data or other methods that incorporate information about the demographic histories of populations (Pritchard and Donnelly, 2001).
  – Our study used data from the ALFRED database to assess statistical significance: http://alfred.med.yale.edu
  – We selected 538 neutral sites from 202 genes typed in the same individuals

Page 43

Genomic Data

Page 44

Genomic Data for Empirical Tests

• Randomly select two SNP sites from different chromosomes
• Find the frequencies in each population and compute the correlation
• Repeat

[Figure: scatterplot of population frequencies at SNP site 1 (x-axis) vs. SNP site 2 (y-axis).]
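The three steps above build an empirical null distribution of correlations between unlinked sites, against which an observed correlation can be ranked. The sketch below uses randomly generated frequencies in place of real SNP data, so it has none of the shared demographic history that real population data carry. The two-sided comparison, the data layout, and the function names are all assumptions of this example.

```python
import random
from math import sqrt

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def empirical_p_value(r_obs, snp_freqs, snp_chrom, n_pairs=5000, seed=1):
    """snp_freqs: SNP id -> list of frequencies (one per population);
    snp_chrom: SNP id -> chromosome label."""
    if len(set(snp_chrom.values())) < 2:
        raise ValueError("need SNPs on at least two chromosomes")
    rng = random.Random(seed)
    snps = sorted(snp_freqs)
    null_rs = []
    while len(null_rs) < n_pairs:
        a, b = rng.sample(snps, 2)
        if snp_chrom[a] == snp_chrom[b]:
            continue  # keep only pairs on different chromosomes
        null_rs.append(pearson_r(snp_freqs[a], snp_freqs[b]))
    extreme = sum(abs(r) >= abs(r_obs) for r in null_rs)
    return extreme / len(null_rs)

# Synthetic stand-in data: 40 SNPs on 4 chromosomes, 30 populations.
data_rng = random.Random(0)
snp_freqs = {f"snp{i}": [data_rng.random() for _ in range(30)] for i in range(40)}
snp_chrom = {f"snp{i}": i % 4 + 1 for i in range(40)}

p_emp = empirical_p_value(-0.632, snp_freqs, snp_chrom)
print(p_emp)
```

With real data, shared demographic history widens this null distribution, which is why the empirical p-value can be much larger than the ordinary Pearson p-value.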

Page 45

Genomic Data – Empirical Distribution

[Figure: empirical distribution for correlations among unlinked SNPs (x-axis: correlation; y-axis: density), with the observed value marked.]

KIR3DS1 – HLA-Bw4 correlation: r = -0.632, empirical p-value = 0.041

Page 46

Significance of Correlations *

locus pair       Correlation   p-value (1)     p-value (2)     p-value (3)
                               (correlation)   (permutation)   (empirical)
3DS1 - Bw4       -0.632        0.000           0.012           0.041
3DS1 - Bw4-80I   -0.657        0.000           0.009           0.038
3DS1 - Bw4-80T   -0.190        0.316           0.532           0.534
3DL1 - Bw4        0.426        0.019           0.106           0.218
3DL1 - Bw4-80I    0.416        0.022           0.115           0.191
3DL1 - Bw4-80T    0.171        0.367           0.540           0.758
2DS1 - C2        -0.478        0.008           0.243           0.149
2DL1 - C2         0.046        0.810           0.891           0.924
2DL2 - C1        -0.366        0.047           0.193           0.542
2DL3 - C1         0.184        0.331           0.458           0.328
2DS2 - C1        -0.371        0.044           0.170           0.479

* Ordinary Pearson p-values (in red on the slide) overestimate the significance of trends

(1) P-correlation is the ordinary Pearson product-moment correlation p-value.
(2) P-permutation is based on the permutation distribution under the null hypothesis.
(3) P-empirical is based on the empirical distribution for unlinked SNPs from ALFRED.


Page 48

Acknowledgements

NCI: Mary Carrington, Pat Martin, Gao Xiaojiang
USP: Diogo Meyer, Rodrigo dos Santos Francisco
Yale University: Ken and Judy Kidd
Children's Hospital Oakland Research Inst.: Steven J. Mack, Jill A. Hollenbach
Harvard Medical School: Alex Lancaster
UC San Francisco: Owen Solberg
Roche Molecular Systems: Henry A. Erlich
Anthony Nolan Research Inst.: Steven G.E. Marsh
NCBI/NIH: Mike Feolo
NGIT: Jeff Wiser, Patrick Dunn, Tom Smith

Page 49

If time allows …

Page 50

Linkage Disequilibrium (LD) Measures

The two most common measures of the strength of LD are:

(1) the normalized measure of the individual LD values, namely $D'_{ij} = D_{ij} / D_{max}$ (Lewontin 1964); and

(2) the correlation coefficient r for bi-allelic data, which is most often reported as $r^2 = D^2 / (p_{A1}\, p_{A2}\, p_{B1}\, p_{B2})$.

r = 1 only when the allelic variation at the two loci shows 100% correlation.

Their multi-allelic extensions, for loci with I and J alleles and allele frequencies $p_i$ and $q_j$, are:

$$D' = \sum_{i=1}^{I} \sum_{j=1}^{J} p_i \, q_j \, |D'_{ij}|$$

$$W_n = \left[ \frac{\sum_{i=1}^{I} \sum_{j=1}^{J} D_{ij}^2 / (p_i \, q_j)}{\min(I-1,\, J-1)} \right]^{1/2} = \left[ \frac{X_{LD}^2 / (2N)}{\min(I-1,\, J-1)} \right]^{1/2}$$
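Both multi-allelic measures can be computed from a table of haplotype frequencies. The sketch below assumes `hap[i][j]` holds the frequency of the haplotype carrying allele i at the first locus and allele j at the second, and uses Lewontin's usual definition of D_max. The function name and the toy frequencies are made up for illustration.

```python
from math import sqrt

def ld_measures(hap):
    """Return (overall D', Wn) for a table of haplotype frequencies hap[i][j]."""
    p = [sum(row) for row in hap]          # allele frequencies at the first locus
    q = [sum(col) for col in zip(*hap)]    # allele frequencies at the second locus
    d_prime, chi = 0.0, 0.0
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            if pi == 0 or qj == 0:
                continue
            d = hap[i][j] - pi * qj
            # Lewontin's D_max depends on the sign of D_ij.
            if d > 0:
                d_max = min(pi * (1 - qj), (1 - pi) * qj)
            else:
                d_max = min(pi * qj, (1 - pi) * (1 - qj))
            if d_max > 0:
                d_prime += pi * qj * abs(d) / d_max
            chi += d * d / (pi * qj)
    wn = sqrt(chi / min(len(p) - 1, len(q) - 1))
    return d_prime, wn

# Toy bi-allelic example: D = 0.15, so r = 0.6 and both measures equal 0.6.
d_prime, wn = ld_measures([[0.4, 0.1], [0.1, 0.4]])
print(round(d_prime, 3), round(wn, 3))  # 0.6 0.6
```

In the bi-allelic case W_n reduces to |r|, which the toy example confirms.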

Page 51

Standard LD measures D’ and Wn

Standard LD measures (overall D’ & Wn) assume/force symmetry, even though with >2 alleles per locus that is not the case

Data Source: Immport Study#SDY26: Identifying polymorphisms associated with risk for the development of myopericarditis following smallpox vaccine

Page 52

Asymmetric Linkage Disequilibrium (ALD)

[Figure: ALD matrix; each cell shows the ALD for the row gene conditional on the column gene.]

Interpretation:

• ALD for HLA-DRB1 conditioning on HLA-DQA1: $W_{DRB1/DQA1} = .58$
  – The overall variation for DRB1 is relatively high given specific DQA1 alleles.
• ALD for HLA-DQA1 conditioning on HLA-DRB1: $W_{DQA1/DRB1} = .95$
  – The overall variation for DQA1 is relatively low given specific DRB1 alleles.

Thomson and Single, 2014 Genetics

Page 53

Balancing Selection Operates at Most HLA Loci

• Balancing selection can result from:
  – Overdominance/heterozygote advantage
  – Frequency-dependent selection
  – Selective regimes that change over time/space
• For HLA, the common factor in these models is rare allele advantage, which is consistent with a pathogen-directed frequency-dependent selection model.
• At the amino acid (AA) level we see
  – High AA variability at antigen recognition sites (ARS)
  – Relatively even AA frequencies at ARS
  – Higher rates of non-synonymous vs. synonymous changes at ARS

Page 54

Homozygosity (F) and the Normalized Deviate (Fnd)

$$F = \sum_{i=1}^{k} p_i^2$$

$$F_{nd} = (F_{OBS} - F_{EQ}) / SD(F_{EQ})$$

[Figure: three allele-frequency distributions (x-axis: allele; y-axis: allele frequency), one per case below.]

• Neutrality: $F_{OBS} \approx F_{EQ}$, $F_{nd} \approx 0$
• Directional Selection: $F_{OBS} > F_{EQ}$, $F_{nd} > 0$
• Balancing Selection: $F_{OBS} < F_{EQ}$, $F_{nd} < 0$
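The two formulas can be sketched in a few lines. The expected homozygosity F_EQ and its standard deviation have to come from a neutral model matched to the sample (same sample size and number of alleles); the values 0.40 and 0.10 below are hypothetical placeholders, as are the allele frequencies and function names.

```python
def homozygosity(freqs):
    """Expected homozygosity F: the sum of squared allele frequencies."""
    return sum(p * p for p in freqs)

def fnd(f_obs, f_eq, sd_f_eq):
    """Normalized deviate of the observed homozygosity."""
    return (f_obs - f_eq) / sd_f_eq

# Hypothetical neutral expectation for a sample with k = 4 alleles.
F_EQ, SD_F_EQ = 0.40, 0.10

even = homozygosity([0.25, 0.25, 0.25, 0.25])   # even frequencies, low F
skewed = homozygosity([0.7, 0.1, 0.1, 0.1])     # one common allele, high F

print(round(fnd(even, F_EQ, SD_F_EQ), 2))    # -1.5, in the direction of balancing selection
print(round(fnd(skewed, F_EQ, SD_F_EQ), 2))  # 1.2, in the direction of directional selection
```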

Page 55

Fnd for DRB1 AA sites in a EUR population

• Fnd << 0 gives evidence of possible balancing selection.
• Fnd >> 0 gives evidence of possible directional selection.

Page 56

LD for DRB1 AAs

[Figure: two panels. Wn (symmetric); Asymmetric LD (ALD), row gene conditional on column gene.]

Page 57

Fnd for DRB1 AA sites (Meta-Analysis)

Fnd for all polymorphic sites in a meta-analysis of 57 populations

• Fnd << 0 gives evidence of possible balancing selection.
• Fnd >> 0 gives evidence of possible directional selection.


Page 59

Asymmetric Linkage Disequilibrium (ALD)

Table 1. Linkage disequilibrium and genetic diversity measures

Description and definition of measures:

1. Single locus homozygosity (F)

$$F_A = \sum_i p_{Ai}^2$$

2. Haplotype specific homozygosity (HSF)

$$F_{A/B_j} = \sum_i (f_{ij} / p_{Bj})^2$$

3. Overall weighted HSF values $F_{A/B}$ (and $F_{B/A}$)

$$F_{A/B} = \sum_j F_{A/B_j} \, p_{Bj} = F_A + \sum_i \sum_j D_{ij}^2 / p_{Bj}$$

4. Multi-allelic ALD squared $W_{A/B}^2$ (and $W_{B/A}^2$)

$$W_{A/B}^2 = (F_{A/B} - F_A) / (1 - F_A)$$

If both loci are bi-allelic:

$$W_{A/B}^2 = \frac{\sum_i \sum_j D_{ij}^2 / p_{Bj}}{1 - F_A} = \frac{D^2}{p_{A1}\, p_{A2}\, p_{B1}\, p_{B2}} = r^2, \quad \text{since } D_{11} = -D_{12} = -D_{21} = D_{22} = D$$

Thomson and Single (2014) Genetics
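The four definitions in the table translate directly into code. The sketch below assumes `hap[i][j]` is the frequency of haplotype A_i B_j and that both loci are polymorphic (so 1 − F is non-zero). The function name and the toy frequency tables are made up for illustration.

```python
from math import sqrt

def ald(hap):
    """Return (W_A/B, W_B/A) from a table of haplotype frequencies hap[i][j]."""
    p_a = [sum(row) for row in hap]
    p_b = [sum(col) for col in zip(*hap)]
    f_a = sum(p * p for p in p_a)   # single locus homozygosity, locus A
    f_b = sum(q * q for q in p_b)   # single locus homozygosity, locus B
    rows, cols = range(len(p_a)), range(len(p_b))
    # Overall weighted haplotype-specific homozygosity: sum of f_ij^2 / p_Bj (or / p_Ai).
    f_a_given_b = sum(hap[i][j] ** 2 / p_b[j] for i in rows for j in cols if p_b[j] > 0)
    f_b_given_a = sum(hap[i][j] ** 2 / p_a[i] for i in rows for j in cols if p_a[i] > 0)
    # max(0, ...) guards against tiny negative values from floating-point rounding.
    w_a_given_b = sqrt(max(0.0, (f_a_given_b - f_a) / (1 - f_a)))
    w_b_given_a = sqrt(max(0.0, (f_b_given_a - f_b) / (1 - f_b)))
    return w_a_given_b, w_b_given_a

# Bi-allelic toy case: D = 0.15 and r = 0.6, so both ALD values equal 0.6.
print(ald([[0.4, 0.1], [0.1, 0.4]]))

# Asymmetric toy case: each A allele determines the B allele, but not the reverse.
w_ab, w_ba = ald([[0.3, 0.0], [0.3, 0.0], [0.0, 0.4]])
print(round(w_ab, 3), round(w_ba, 3))  # 0.739 1.0
```

The second example shows the asymmetry a single symmetric measure would hide: knowing the A allele fixes the B allele (W_B/A = 1), while knowing B leaves variation at A.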