The International HapMap Project

Academic year 2009/2010

Dr. Laura Rita Duro

Most common diseases, such as diabetes, cancer, stroke, heart disease, depression, and asthma, are affected by combinations of multiple genetic and environmental factors.

Genetic and environmental contributions to monogenic and complex disorders

(A) Monogenic disease. A variant in a single gene is the primary determinant of a monogenic disease or trait, responsible for most of the disease risk or trait variation (dark blue sector), with possible minor contributions of modifier genes (yellow sectors) or environment (light blue sector).

(B) Complex disease. Many variants of small effect (yellow sectors) contribute to disease risk or trait variation, along with many environmental factors (blue sector).

More than a thousand genes for rare, highly heritable ‘mendelian’ disorders have been identified, in which variation in a single gene is both necessary and sufficient to cause disease.

Complex diseases, in contrast, have proven much more challenging to study, as they are thought to be due to the combined effect of many different susceptibility DNA variants interacting with environmental factors.

Discovering these genetic factors will provide fundamental new insights into the pathogenesis, diagnosis and treatment of human disease.

Although any two unrelated people are the same at about 99.9% of their DNA sequences,

the remaining 0.1% is important because it contains the genetic variants that influence how people differ in their risk of disease or

their response to drugs.

Discovering the DNA sequence variants that contribute to common disease risk offers one

of the best opportunities for understanding the complex causes of disease in humans.

Human Genetic Variations

Primarily two types of genetic mutation events create all forms of variations:

Single base mutation which substitutes one nucleotide for another

-Single Nucleotide Polymorphisms (SNP)

Insertion or deletion of one or more nucleotide(s)

-Tandem Repeat Polymorphisms

-Insertion/Deletion Polymorphisms

Tandem Repeat Polymorphisms

Tandem repeats, or variable number of tandem repeats (VNTR), are a very common class of polymorphism, consisting of sequence motifs of variable length that are repeated in tandem in a variable copy number.

VNTRs are subdivided into two subgroups based on the size of the tandem repeat units.

Microsatellites or Short Tandem Repeats (STR): repeat unit of 1-6 bp (e.g. dinucleotide repeat: CACACACACACA)

Minisatellites: repeat unit of 10-100 bp

SNPs

Sites in the genome where the DNA sequences of many

individuals differ by a single base are called single

nucleotide polymorphisms (SNPs)

For example, some people may have a chromosome with an A at a particular site where

others have a chromosome with a G

Each form is called an allele

Variation Or Mutation ?

Terminology for variation at a single nucleotide position is defined by allele frequency:

Polymorphism: a sequence variation that occurs with a frequency of at least 1 percent (>= 1%). About 90% of variations are SNPs.

Mutation: a variation that is present with a frequency of less than 1 percent (< 1%).

Transitions and Transversions

SNPs include single base substitutions such as:

Transitions: change of a purine (A, G) for a purine, or of a pyrimidine (C, T) for a pyrimidine

A→G, G→A, C→T, T→C

Transversions: change of a purine (A, G) for a pyrimidine (C, T), or vice versa

A→C, A→T, G→C, G→T, C→A, C→G, T→A, T→G
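The purine/pyrimidine rule above can be expressed as a small check. The following is a minimal Python sketch; the function name `classify_substitution` is hypothetical and chosen only for illustration.

```python
# Purines and pyrimidines as defined on the slide.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt are identical: not a substitution")
    # A transition stays within one chemical class; a transversion crosses classes.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"
```

Enumerating all twelve possible substitutions with this function gives four transitions and eight transversions, matching the two lists above.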

In principle, SNPs could be bi-, tri-, or tetra-allelic polymorphisms.

However, in humans, tri-allelic and tetra-allelic SNPs are rare almost to the point of non-existence, and so SNPs are sometimes simply referred to as bi-allelic markers.

Non-coding SNPs:

5’ and 3’ UTRs

Introns

Intergenic Spaces

Synonymous Coding SNPs:

when single base substitutions do not cause a change in the resultant amino acid

Non-synonymous Coding SNPs:

when single base substitutions cause a change in the resultant amino acid
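The distinction between synonymous and non-synonymous coding SNPs can be checked by translating the codon before and after the substitution. The sketch below assumes the standard genetic code and DNA codons on the coding strand; the names `CODON_TABLE` and `classify_coding_snp` are hypothetical and not part of the original slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon: str, position: int, alt_base: str) -> str:
    """Classify a substitution at `position` (0, 1 or 2) within `codon`."""
    codon, alt_base = codon.upper(), alt_base.upper()
    if codon not in CODON_TABLE or alt_base not in BASES or position not in (0, 1, 2):
        raise ValueError("expected a DNA codon, a position 0-2 and a base A/C/G/T")
    mutated = codon[:position] + alt_base + codon[position + 1:]
    if mutated == codon:
        raise ValueError("alt_base equals the reference base")
    # A change to or from a stop codon is also reported as non-synonymous here.
    same = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "synonymous" if same else "non-synonymous"
```

For example, GAA to GAG leaves glutamate unchanged (synonymous), whereas GAG to GTG replaces glutamate with valine (non-synonymous).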

Non-coding SNPs

Example: Regulatory SNPs (rSNPs)

Two allelic variants of the same gene are transcribed in different amounts as a consequence of an adjacent polymorphism. In this example, allele G, located upstream of the gene, has a higher transcript level than does allele T.

Coding SNPs

Example: Synonymous, the mutation does not change the amino acid.

Example: Non-synonymous, the mutation changes the amino acid.

SNPs

It has been estimated that, in the world’s human population, about 10 million sites (that is, one variant per 300 bases on average) vary such that both alleles are observed at a frequency of > 1%, and that these 10 million common SNPs constitute 90% of the variation in the population.

The remaining 10% is due to a vast array of variants that are each rare in the population.

The presence of particular SNP alleles in an individual is determined by testing (‘genotyping’) a genomic DNA sample.

NATURE |VOL 426 | 18/25 DECEMBER 2003

A particular combination of alleles along a chromosome is termed a haplotype

A haplotype is a set of SNPs on a single chromatid that are statistically associated.

The coinheritance of SNP alleles on these haplotypes leads to associations between these alleles in the population (known as linkage disequilibrium, LD).

Linkage disequilibrium

• Situation in which some combinations of alleles or genetic markers occur more or less frequently in a population than would be expected from a random formation of haplotypes from alleles based on their frequencies.

• Non-random associations between polymorphisms at different loci are measured by the degree of linkage disequilibrium (LD).

The LD between many neighboring SNPs generally persists because meiotic recombination does not occur at random, but is concentrated in recombination hot spots.

Adjacent SNPs that lack a hot spot between them are likely to be in strong LD.

r² = 1: two SNPs are perfectly correlated (allele A of SNP1 is always observed with allele C of SNP2, and vice versa)

r² = 0: the two SNPs are uncorrelated (allele A of SNP1 provides no information at all about which allele of SNP4 is present)
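The r² measure can be computed from a sample of phased haplotypes using the usual definition r² = D² / (pA(1 − pA) pB(1 − pB)), with D = pAB − pA·pB. The sketch below assumes biallelic SNPs and haplotypes given as equal-length strings; the function name `r_squared` and the input format are assumptions made for illustration.

```python
def r_squared(haplotypes, i, j):
    """r^2 between the biallelic SNPs at string positions i and j.

    `haplotypes` is a list of equal-length strings, one per chromosome,
    e.g. ["AC", "AC", "GT", "GT"].
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes given")
    # Pick the alleles seen on the first chromosome as the reference pair;
    # r^2 does not depend on which allele is chosen at each SNP.
    a, b = haplotypes[0][i], haplotypes[0][j]
    p_a = sum(h[i] == a for h in haplotypes) / n
    p_b = sum(h[j] == b for h in haplotypes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("at least one SNP is monomorphic in this sample")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom
```

With haplotypes AC, AC, GT, GT the two SNPs always travel together and the function returns 1; with AC, AT, GC, GT every combination is equally frequent and it returns 0.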

Complete independence of these 6 SNPs would predict the possibility of 64 different haplotypes (because n biallelic SNPs could generate 2^n haplotypes), but in reality just 4 haplotypes comprise 90% of observed chromosomes, indicating that LD is present.
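The comparison between the 2^n possible haplotypes and the few that are actually observed can be sketched as a simple count. The example data and the function name `haplotype_summary` below are made up for illustration and are not HapMap data.

```python
from collections import Counter


def haplotype_summary(haplotypes, coverage=0.9):
    """Return (possible, observed, needed) for a list of haplotype strings.

    possible: 2**n for n biallelic SNPs
    observed: number of distinct haplotypes in the sample
    needed:   number of most frequent haplotypes required to reach `coverage`
    """
    n_snps = len(haplotypes[0])
    possible = 2 ** n_snps
    counts = Counter(haplotypes)
    covered = 0
    needed = 0
    for _, count in counts.most_common():
        covered += count
        needed += 1
        if covered >= coverage * len(haplotypes):
            break
    return possible, len(counts), needed


# Hypothetical sample of 20 chromosomes typed at 6 SNPs.
sample = ["AAAGGG"] * 8 + ["AAATTT"] * 6 + ["CCCGGG"] * 5 + ["CCCTTT"] * 1
print(haplotype_summary(sample))
```

A large gap between `possible` and `observed` is the signature of LD described above.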

Because of the strong associations among the SNPs in most chromosomal regions, only a few carefully chosen SNPs (known as tag SNPs) need to be typed to predict the likely variants at the rest of the SNPs in each region.

SNP1, SNP2, and SNP3 are strongly correlated, and SNP4, SNP5, and SNP6 are strongly correlated, so that any of SNP1–SNP3 (or SNP4–SNP6) could serve as tags for the other 2 SNPs in each group.
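One way to picture tag SNP selection is a greedy pass over pairwise r² values. The sketch below is only an illustration: the greedy strategy, the r² >= 0.8 threshold and the function names are assumptions, not the procedure described in the slides or used by the HapMap Consortium.

```python
def r_squared(haplotypes, i, j):
    """r^2 between biallelic SNPs i and j; 0.0 if either is monomorphic."""
    n = len(haplotypes)
    a, b = haplotypes[0][i], haplotypes[0][j]
    p_a = sum(h[i] == a for h in haplotypes) / n
    p_b = sum(h[j] == b for h in haplotypes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0
    return (p_ab - p_a * p_b) ** 2 / denom


def pick_tag_snps(haplotypes, threshold=0.8):
    """Greedily choose SNP indices so every SNP is captured by some tag."""
    n_snps = len(haplotypes[0])
    # captures[i]: SNPs whose alleles SNP i predicts at r^2 >= threshold.
    captures = {
        i: {j for j in range(n_snps)
            if i == j or r_squared(haplotypes, i, j) >= threshold}
        for i in range(n_snps)
    }
    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # Take the untagged SNP that captures the most still-untagged SNPs
        # (ties go to the lowest index).
        best = max(sorted(untagged), key=lambda i: len(captures[i] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


# Six SNPs in two perfectly correlated groups, as in the figure.
haps = ["AAAGGG", "AAATTT", "CCCGGG", "CCCTTT"]
print(pick_tag_snps(haps))  # one tag per correlated group
```

For the two-group example, one SNP from SNP1–SNP3 and one from SNP4–SNP6 are enough, so two tags stand in for all six SNPs.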

Many empirical studies have shown highly significant levels of LD, and often strong associations between nearby SNPs, in the human genome.

Because the likelihood of recombination between two SNPs increases with the distance between them, on average such associations between

SNPs decline with distance.

B.A. Salisbury et al. Mutation Research 2003

Average linkage disequilibrium, |D|, vs. distance between SNPs for 2597 genes in which accurate distances were available. Lower values indicate a stronger effect of recombination and recurrent mutation. LD decreases with distance.

The strong associations between SNPs in a region have a practical value.

Genotyping only a few, carefully chosen SNPs in the region will provide enough information to predict much of the information about the remainder of the common SNPs in that region. As a result, only a few of these ‘tag’ SNPs are required to identify each of the common haplotypes in a region.

On the basis of empirical studies, it has been estimated that most of the information about genetic variation represented by the 10 million common SNPs in the population could be provided by genotyping 200,000 to 1,000,000 tag SNPs across the genome.

These observations are the conceptual and empirical foundation for developing a haplotype map of the human genome, the ‘HapMap’.

The International HapMap Project is a partnership of scientists and funding agencies from Canada, China, Japan, Nigeria, the United Kingdom and the United States to develop a public resource that will help researchers find genes associated with human disease and response to pharmaceuticals.

An initial meeting to discuss the scientific and ethical issues associated with developing a human haplotype map was held in Washington in 2001.

The International HapMap Project was then formally initiated in 2002.

The goal of the International HapMap Project is to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation.


The HapMap is expected to be a key resource for researchers to use to find genes affecting health, disease, and responses

to drugs and environmental factors.

The information produced by the Project is freely available (www.hapmap.org)

The HapMap was designed to determine the frequencies and patterns of association among roughly 3 million common SNPs in four populations, for use in genetic association studies.

The HapMap project focuses only on common SNPs, those where each allele occurs in at least 1% of the population.

The project studied a total of 270 DNA samples:

• 90 samples from a US Utah population with Northern and Western European ancestry (samples collected in 1980 by the Centre d’Etude du Polymorphisme Humain (CEPH) and used for other human genetic maps)

• new samples collected from 90 Yoruba people in Ibadan, Nigeria

• 45 unrelated Japanese in Tokyo, Japan

• 45 unrelated Han Chinese in Beijing, China

The International HapMap Consortium decided to include several

populations from different ancestral geographic locations to ensure that the

HapMap would include most of the common variation and some of the less

common variation in different populations.


Human Genome Project vs International HapMap Project

In its scope and potential consequences, the International HapMap Project has much in common with the Human Genome Project, which sequenced the human genome.

Both projects have been scientifically ambitious and technologically demanding, have involved intense international collaboration, have been dedicated to the rapid release of data into the public domain, and promise to have profound implications for our understanding of human biology and human health.

Whereas the sequencing project covered the entire genome, including the 99.9% of the genome where we are all the same, the HapMap will characterize the common patterns within the 0.1% where we differ from each other.

The project had become practical by the confluence of the following:

• the availability of the human genome sequence;

• databases of common SNPs (subsequently enriched by this project) from which genotyping assays could be designed;

• insights into human LD;

• development of inexpensive, accurate technologies for high-throughput SNP genotyping;

• web-based tools for storing and sharing data.

The International HapMap Consortium NATURE October 2005

The HapMap Project comprises two phases.

The complete data obtained in Phase I were published in October 2005.

The analysis of the Phase II dataset was published in October 2007.

The Phase I HapMap

Phase I of the HapMap Project set as a goal genotyping at least one common SNP every 5 kb across the genome in each of 269 DNA samples.

For the sake of practicality, and motivated by the allele frequency distribution of variants in the human genome, a minor allele frequency (MAF) of 0.05 or greater was targeted for study.

Minor Allele Frequency (MAF) : The frequency at which the less abundant (or minor) allele of a SNP is present in a population. The MAF for a SNP to be considered common is usually above 1%.
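As an illustration of the definition above, the MAF of a biallelic SNP can be computed directly from genotype calls. This is a minimal sketch: the input format (one two-letter genotype string per diploid individual) and the function name `minor_allele_frequency` are assumptions made for the example.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """MAF of a biallelic SNP from genotypes such as ["AA", "AG", "GG"]."""
    # Every diploid individual contributes two alleles.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) > 2:
        raise ValueError("more than two alleles: not a biallelic SNP")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotypes given")
    if len(counts) < 2:
        return 0.0  # monomorphic in this sample
    return min(counts.values()) / total


# Hypothetical sample: 8 AA and 2 AG individuals give 2 G alleles out of 20.
maf = minor_allele_frequency(["AA"] * 8 + ["AG"] * 2)
print(maf, maf >= 0.05)  # compare against the Phase I target of 0.05
```

The same value can be compared against the 1% threshold that is usually taken to define a common SNP.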

The project required a dense map of SNPs, ideally containing information about validation and frequency of each candidate SNP.

When the project started, the public SNP database (dbSNP) contained 2.6 million candidate SNPs, few of which were annotated with the required information.

The HapMap Project contributed about 6 million new SNPs to dbSNP. As of October 2005, dbSNP contained 9.2 million candidate human SNPs.

To study patterns of genetic variation, ten 500-kb regions were selected from the ENCODE (Encyclopedia of DNA Elements) Project.

These ten regions were chosen to approximate the genome-wide average for G+C content, recombination rate, percentage of sequence conserved relative to mouse sequence, and gene density.

Each 500-kb region was sequenced in 48 individuals, and all SNPs in these regions (discovered or in dbSNP) were genotyped in the complete set of 269 DNA samples.

Using the data provided by HapMap, a team of scientists at Harvard Medical

School and the Broad Institute has discovered a new genetic variant

associated with age-related macular degeneration (AMD), the leading cause

of blindness in people over 60 years of age, as well as confirming previously

reported variants

Nature Genetics - 38, 1055 - 1059 (2006)

They estimate that genotypes related to just five variants in three different genes can explain 50% of the risk of developing AMD

The new common genetic variant identified was found in a non-coding region of the Complement Factor H (CFH) gene, other variants of which were recently shown to be associated with the risk of developing AMD.

In addition to CFH on chromosome 1, associated variants were found in:

• the complement factor B (BF) gene on chromosome 6

• the complement component 2 (C2) gene on chromosome 6

• the hypothetical gene LOC387715 on chromosome 10 (a common variant, A69S)

Interestingly, these three genes do not appear to interact directly, but instead contribute to the risk of AMD independently.

Phase II HapMap characterizes over 3.1 million human

SNPs genotyped in 270 individuals from four

geographically diverse populations

• Genotyping in phase II was attempted for about 4.4 million distinct SNPs, of which roughly 1.3 million either could not be typed, were not polymorphic in any of the populations, or did not pass genotyping quality control filters.

• Certain regions of the genome were recognized as being challenging to study, such as centromeres, telomeres, gaps in genome sequence, and segmental duplications, regions declared to be not HapMapable.

The resulting HapMap has an SNP density of approximately one per kilobase and is estimated to contain approximately 25–35% of all the 9–10 million common SNPs in the assembled human genome.

Variation in SNP density within the Phase II HapMap

Phase I

Phase II

Example of the fine-scale structure of SNP density for a 100-kb region on chromosome 17 showing polymorphic Phase I SNPs in the consensus data set (red triangles) and

polymorphic Phase II SNPs in the consensus data set (blue triangles)

The Phase II HapMap differs from the Phase I HapMap also in minor allele frequency (MAF) distribution. SNPs added in Phase II have lower MAF. Phase II HapMap

includes a better representation of rare variation than the Phase I HapMap

Advances in technology for high-throughput SNP genotyping

Advances in genotyping technology have vastly increased the number

of variants that can be typed and decreased the per-sample costs

These advances have made possible the

dense genotyping needed to capture the

majority of SNP variation within an individual

at a sufficiently low cost to allow the large

sample sizes needed for comparison of

individuals with and without disease

Studies in additional populations have shown that the tag SNPs chosen using the HapMap are generally transferable across other populations, but there are some limitations.

So additional samples from the populations used to develop the HapMap, as well as from seven more populations, have recently been genotyped across the genome:

• Luhya from Webuye, Kenya

• Maasai from Kenya

• Tuscans from Italy

• Indian-Americans (Gujarati) from Houston, TX

• Han Chinese from Denver

• Mexican-Americans from Los Angeles

• Americans of African Descent from the SW USA

It is now clear that the HapMap can be a useful resource for the design and analysis of disease association studies in populations across the world.

APPLICATION OF THE HAPMAP TO COMMON DISEASE

The technological advances directly stimulated or indirectly facilitated by the HapMap have had a profound impact on the study of the genetics of common diseases.

The history of high-density genome-wide association (GWA) scanning to date has demonstrated the striking success of this approach in finding genetic variants associated with disease.

Variants or regions associated with nearly 40 complex diseases have been identified in diverse population samples.

Major Autism Gene Found with Help of HapMap

Using data from the HapMap, along with DNA samples collected from many families who have affected children, researchers have discovered a genetic variation linked to autism, one of the most heritable mental health conditions.

They found a variation in the sequence of a gene - the “MET receptor tyrosine kinase gene” - that is associated with autism. This gene is involved in brain development, immune function, and digestive system repair.

The MET promoter variant rs1858830 allele "C" is strongly associated with ASD and results in reduced gene transcription. MET protein levels were significantly decreased in ASD cases compared with control subjects. People who have the variation are more than twice as likely as others to have “autism spectrum disorders”.

Campbell DB et al, Ann Neurol. 2007

A genome-wide association study identifies novel risk loci for type 2 diabetes

Type 2 diabetes mellitus results from the interaction of environmental factors with a combination of genetic variants. A systematic search for these variants was recently made possible by the development of high-density arrays that permit the genotyping of hundreds of thousands of polymorphisms.

Researchers tested 392,935 SNPs in a French case–control cohort. Markers with the most significant difference in genotype frequencies between cases of type 2 diabetes and controls were fast-tracked for testing in a second cohort. This identified four loci containing variants that confer type 2 diabetes risk, in addition to confirming the known association with the TCF7L2 gene. These loci include a non-synonymous polymorphism in the zinc transporter SLC30A8, which is expressed exclusively in insulin-producing β-cells, and two linkage disequilibrium blocks that contain genes potentially involved in β-cell development or function (IDE–KIF11–HHEX and EXT2–ALX4).

Sladek R et al. Nature 445, 881-885 (2007)
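To give a feel for how a case–control comparison at a single SNP can be scored, the sketch below computes a 2×2 chi-square statistic on allele counts. It is a generic textbook test shown under stated assumptions, not the statistical procedure used by Sladek et al.; the function name `allelic_chi_square` and the example counts are hypothetical.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    case_counts / control_counts: (count of allele 1, count of allele 2).
    Returns (chi2, p_value).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical allele counts: 60/40 in cases versus 40/60 in controls.
print(allelic_chi_square((60, 40), (40, 60)))
```

In a genome-wide scan a test of this kind is repeated for every SNP, so the p-value threshold has to account for the very large number of comparisons.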

Future of the HapMap Project

Currently, additional samples from the populations used to develop the initial HapMap, as well as samples from seven additional populations, will be sequenced and genotyped extensively to extend the HapMap, providing information on rarer variants and helping to enable genome-wide association studies in additional populations.

There are also ongoing efforts by many groups to characterize additional forms of genetic variation, such as structural variation, and molecular phenotypes in the HapMap samples. Finally, in the future, whole-genome sequencing will provide a natural convergence of technologies to type both SNP and structural variation.

Nevertheless, until that point the HapMap Project data will provide an invaluable resource for understanding the structure of human genetic variation and its link to phenotype.

Beyond SNPs: Copy Number Variants and Other Structural Variation

Current generation high-throughput genotyping platforms are extraordinarily efficient at genotyping SNPs, but they are less effective at genotyping structural variants, such as insertions, deletions, inversions, and copy number variants.

Although not as common as SNPs, these variants also occur commonly in the human genome

The distribution of copy number variation in the human genome among 270 HapMap samples

A copy number variant (CNV) is a segment of DNA in which copy-number differences have been found by comparison of two or more genomes.

CNVs, in which stretches of genomic sequence of roughly 1 kb to 3 Mb in size are deleted or duplicated in varying numbers, have gained increasing attention because of their apparent ubiquity and potential dosage effect on gene expression.

In 2004, the interrogation of genomic variability by array hybridization methods clearly demonstrated the existence of copy number variants.

Intense analysis of this type of genomic variability followed, and

the current conservative estimate from studies in a few hundred

individuals is that at least 10% of the genome is subject to copy

number variation

Although a typical SNP affects only one single nucleotide

pair, their genomic abundance (over 10 million) makes

them the most frequent source of polymorphic changes

By contrast, CNVs are far less numerous but can affect

from one kilobase to several megabases of DNA per

event, adding up to a significant fraction of the genome

It is now recognized that the genomes of any two individuals in the human population differ more at the structural level than at the nucleotide sequence level

NATURE GENETICS SUPPLEMENT | VOLUME 39 | JULY 2007

• Much of what was previously known about the role of CNVs in disease comes from a rich literature on ‘genomic disorders’.

• Genomic disorders are defined as a diverse group of genetic diseases that are each caused by an alteration in DNA copy number.

• These mutations can be relatively large, microscopically visible imbalances, such as in Prader-Willi syndrome, or they may be much smaller, requiring higher resolution detection methods, such as in Williams Syndrome.

• Genomic disorders are typically sporadic in nature because the CNV in most cases is a de novo mutation with nearly complete penetrance, and because the affected individuals have severe developmental problems and are unlikely to have offspring.

• However, there are notable examples of mendelian disease traits associated with CNVs. For example, duplications of the gene for peripheral myelin protein 22 (PMP22) cause the dominant neuropathy Charcot-Marie-Tooth disease type 1A, and deletions of the α-globin gene cluster cause the recessive anemia α-thalassemia.

References

• The International HapMap Consortium. The International HapMap Project. Nature. 426: 18/25, December 2003.

• Deloukas P, Bentley D. The HapMap project and its application to genetic studies of drug response. The Pharmacogenomics Journal. 4, 88–90 (2004).

• The International HapMap Consortium. A haplotype map of the human genome. Nature. 437: 27, October 2005.

• Manolio TA, Brooks LD, Collins FS. A HapMap harvest of insights into the genetics of common disease. The Journal of Clinical Investigation. 118: 5, May 2008.

• The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 449: 18, October 2007.

• Maller J, George S, Purcell S, Fagerness J, Altshuler D, Daly MJ, Seddon JM. Common variation in three genes, including a noncoding variant in CFH, strongly influences risk of age-related macular degeneration. Nat Genet. 38:9 (1055-9), Sep 2006.
