The International HapMap Project
Academic year 2009/2010, Dr. Laura Rita Duro
Most common diseases, such as diabetes, cancer, stroke, heart disease, depression, and asthma, are affected by combinations of multiple genetic and environmental factors.
Genetic and environmental contributions to monogenic and complex disorders
(A) Monogenic disease. A variant in a single gene is the primary determinant of a monogenic disease or trait, responsible for most of the disease risk or trait variation (dark blue sector), with possible minor contributions of modifier genes (yellow sectors) or environment (light blue sector).
(B) Complex disease. Many variants of small effect (yellow sectors) contribute to disease risk or trait variation, along with many environmental factors (blue sector).
More than a thousand genes for rare, highly heritable ‘mendelian’ disorders have been identified, in which variation in a single gene is both necessary and sufficient to cause disease.
Complex diseases, in contrast, have proven much more challenging to study, as they are thought to be due to the combined effect of many different susceptibility DNA variants interacting with environmental factors.
Discovering these genetic factors will provide fundamental new insights into the pathogenesis, diagnosis and treatment of human disease.
Although any two unrelated people are the same at about 99.9% of their DNA sequences,
the remaining 0.1% is important because it contains the genetic variants that influence how people differ in their risk of disease or
their response to drugs.
Discovering the DNA sequence variants that contribute to common disease risk offers one
of the best opportunities for understanding the complex causes of disease in humans.
Human Genetic Variations
Primarily two types of genetic mutation events create all forms of variation:
Single base mutation which substitutes one nucleotide for another
-Single Nucleotide Polymorphisms (SNP)
Insertion or deletion of one or more nucleotide(s)
-Tandem Repeat Polymorphisms
-Insertion/Deletion Polymorphisms
Tandem Repeat Polymorphisms
Tandem repeats or variable number of tandem repeats (VNTR) are a very common class of polymorphism, consisting of variable lengths of sequence motifs that are repeated in tandem in a variable copy number.
VNTRs are subdivided into two subgroups based on the size of the tandem repeat units.
Microsatellites or Short Tandem Repeats (STR): repeat unit of 1-6 bp (dinucleotide repeat: CACACACACACA)
Minisatellites: repeat unit of 10-100 bp
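As a rough illustration of "variable copy number", the sketch below counts how many times a repeat unit occurs back to back in a sequence. The function name and the two toy alleles are invented for this example; real STR genotyping works from fragment lengths or sequencing reads, not from a simple string search.

```python
import re


def longest_tandem_repeat(sequence: str, unit: str) -> int:
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    # One or more consecutive copies of the repeat unit.
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(match.group(0)) // len(unit) for match in pattern.finditer(sequence)]
    return max(runs, default=0)


# Two hypothetical alleles of the same microsatellite, differing only in copy number.
allele_1 = "GGT" + "CA" * 6 + "TTG"
allele_2 = "GGT" + "CA" * 9 + "TTG"

print(longest_tandem_repeat(allele_1, "CA"))
print(longest_tandem_repeat(allele_2, "CA"))
```

The two alleles share the same flanking sequence and differ only in the number of CA units, which is what makes them distinguishable as a polymorphism.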
SNPs
Sites in the genome where the DNA sequences of many
individuals differ by a single base are called single
nucleotide polymorphisms (SNPs)
For example, some people may have a chromosome with an A at a particular site where
others have a chromosome with a G
Each form is called an allele
Variation or Mutation?
Terminology for variation at a single nucleotide position is defined by allele frequency
Polymorphism: a sequence variation that occurs at least 1 percent of the time (>= 1%). 90% of variations are SNPs.
Mutation: a variation that is present less than 1 percent of the time (< 1%)
Transitions and Transversions
SNPs include single base substitutions such as:
Transitions: change of one purine (A, G) for a purine, or a pyrimidine (C, T) for a pyrimidine
A→G, G→A, C→T, T→C
Transversions: change of a purine (A, G) for a pyrimidine (C, T), or vice versa
A→C, A→T, G→C, G→T, C→A, C→G, T→A, T→G
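The purine/pyrimidine rule above can be written as a few lines of code. This is only a sketch; the function name is chosen for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError("expected two different bases from A, C, G, T")
    # Same chemical class (purine<->purine or pyrimidine<->pyrimidine) is a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Enumerate all 12 possible substitutions.
for ref in "ACGT":
    for alt in "ACGT":
        if ref != alt:
            print(f"{ref}->{alt}: {substitution_type(ref, alt)}")
```

Of the 12 possible substitutions, 4 are transitions and 8 are transversions.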
In principle, SNPs could be bi-, tri-, or tetra-allelic polymorphisms.
However, in humans, tri-allelic and tetra-allelic SNPs are rare almost to the point of non-existence, and so SNPs are sometimes simply referred to as bi-allelic markers.
Non-coding SNPs:
-5’ and 3’ UTRs
-Introns
-Intergenic spaces
Synonymous Coding SNPs:
when single base substitutions do not cause a change in the resultant amino acid
Non-synonymous Coding SNPs:
when single base substitutions cause a change in the resultant amino acid
Non-coding SNPs
Example: Regulatory SNPs (rSNPs)
Two allelic variants of the same gene are transcribed in different amounts as a consequence of an adjacent polymorphism. In this example, allele G, located upstream of the gene, has a higher transcript level than does allele T.
Coding SNPs
Example: Synonymous, the mutation does not change the amino acid.
Example: Non-synonymous, the mutation changes the amino acid.
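The distinction can be checked by translating the codon before and after the substitution. The sketch below builds the standard genetic code and compares the two amino acids; the function name and the example codons are chosen for illustration, and a change to a stop codon is simply reported as non-synonymous here.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon: str, position: int, new_base: str) -> str:
    """Classify a substitution at `position` (0, 1 or 2) of a DNA codon."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "non-synonymous"


# GAA -> GAG: both codons encode glutamate (E).
print(classify_coding_snp("GAA", 2, "G"))
# GAG -> GTG: glutamate (E) becomes valine (V).
print(classify_coding_snp("GAG", 1, "T"))
```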
SNPs
It has been estimated that, in the world’s human population, about 10 million sites (that is, one variant per 300 bases on average) vary such that both alleles are observed at a frequency of > 1%, and that these 10 million common SNPs constitute 90% of the variation in the population.
The remaining 10% is due to a vast array of variants that are each rare in the population.
The presence of particular SNP alleles in an individual is determined by testing (‘genotyping’) a genomic DNA sample.
NATURE |VOL 426 | 18/25 DECEMBER 2003
A particular combination of alleles along a chromosome is termed a haplotype.
A haplotype is a set of SNPs on a single chromatid that are statistically associated.
The coinheritance of SNP alleles on these haplotypes leads to associations between these alleles in the population (known as linkage disequilibrium, LD).
Linkage disequilibrium
- Situation in which some combinations of alleles or genetic markers occur more or less frequently in a population than would be expected from a random formation of haplotypes from alleles based on their frequencies.
- Non-random associations between polymorphisms at different loci are measured by the degree of linkage disequilibrium (LD).
The LD between many neighboring SNPs generally persists because meiotic recombination does not occur at random, but is concentrated in recombination hot spots.
Adjacent SNPs that lack a hot spot between them are likely to be in strong LD.
r^2 = 1: two SNPs that are perfectly correlated (allele A of SNP1 is always observed with allele C of SNP2, and vice versa)
r^2 = 0: allele A of SNP1 provides no information at all about which allele of SNP4 is present.
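One common way to compute this correlation from phased two-SNP haplotypes is sketched below: D is the difference between the observed haplotype frequency and the product of the allele frequencies, and r^2 scales D^2 by the allele-frequency variances. The function name and the toy haplotype lists are invented for illustration.

```python
def ld_r_squared(haplotypes: list[str]) -> float:
    """Compute r^2 between two biallelic SNPs from two-character haplotypes."""
    n = len(haplotypes)
    a, b = haplotypes[0]  # use the first haplotype's alleles as the reference alleles
    p_a = sum(h[0] == a for h in haplotypes) / n
    p_b = sum(h[1] == b for h in haplotypes) / n
    p_ab = sum(h == a + b for h in haplotypes) / n
    d = p_ab - p_a * p_b  # deviation from random association
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denominator


# Toy data: allele A always travels with C, and G with T.
perfect = ["AC"] * 6 + ["GT"] * 4
# Toy data: all four combinations equally frequent.
independent = ["AC", "AT", "GC", "GT"]

print(ld_r_squared(perfect))
print(ld_r_squared(independent))
```

The first sample gives a value of 1 (up to floating-point rounding) and the second gives 0.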
Complete independence of these 6 SNPs would predict the possibility of 64 different haplotypes (because n biallelic SNPs could generate 2^n haplotypes), but in reality just 4 haplotypes comprise 90% of observed chromosomes, indicating that LD is present.
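The comparison between the theoretical maximum and what is actually observed can be sketched as follows. The haplotype counts are made-up toy numbers, chosen only so that four haplotypes cover about 90% of the chromosomes as in the figure; they are not HapMap data.

```python
from collections import Counter


def haplotypes_needed(haplotypes: list[str], coverage: float = 0.9) -> int:
    """Number of most-frequent haplotypes needed to reach the given coverage."""
    counts = Counter(haplotypes)
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / len(haplotypes) >= coverage:
            return rank
    return len(counts)


n_snps = 6
print(2 ** n_snps)  # upper bound on distinct haplotypes for 6 biallelic SNPs

# Hypothetical sample of 100 chromosomes typed at 6 SNPs.
sample = (
    ["ACGTAC"] * 41 + ["GTACGT"] * 25 + ["ACGCGT"] * 15 + ["GTATAC"] * 10
    + ["ACGTGT"] * 4 + ["GTACAC"] * 3 + ["ATGTAC"] * 2
)
print(haplotypes_needed(sample))
```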
Because of the strong associations among the SNPs in most chromosomal regions, only a few carefully chosen SNPs (known as tag SNPs) need to be typed to predict the likely variants at the rest of the SNPs in each region.
SNP1, SNP2, and SNP3 are strongly correlated, and SNP4, SNP5, and SNP6 are strongly correlated, so that any of SNP1–SNP3 (or SNP4–SNP6) could serve as tags for the other 2 SNPs in each group.
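A simple way to pick tags is a greedy search over pairwise r^2 values: repeatedly choose the SNP that covers the most still-untagged SNPs. The sketch below is one such heuristic, not the specific algorithm used by the HapMap Project; the r^2 threshold of 0.8, the function names, and the toy haplotypes are assumptions made for illustration.

```python
def r_squared(haplotypes: list[str], i: int, j: int) -> float:
    """Pairwise r^2 between SNP columns i and j of phased haplotype strings."""
    n = len(haplotypes)
    a, b = haplotypes[0][i], haplotypes[0][j]
    p_a = sum(h[i] == a for h in haplotypes) / n
    p_b = sum(h[j] == b for h in haplotypes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denominator == 0 else (p_ab - p_a * p_b) ** 2 / denominator


def greedy_tag_snps(haplotypes: list[str], threshold: float = 0.8) -> list[int]:
    """Pick SNP indices so every SNP is a tag or has r^2 >= threshold with one."""
    n_snps = len(haplotypes[0])
    covers = {
        i: {j for j in range(n_snps)
            if i == j or r_squared(haplotypes, i, j) >= threshold}
        for i in range(n_snps)
    }
    untagged, tags = set(range(n_snps)), []
    while untagged:
        # Choose the SNP covering the most untagged SNPs (lowest index on ties).
        best = max(range(n_snps), key=lambda i: len(covers[i] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


# Toy data: SNP1-SNP3 move together, SNP4-SNP6 move together.
sample = ["ACGTAC"] * 4 + ["ACGCGT"] * 2 + ["GTATAC"] * 2 + ["GTACGT"] * 2
print(greedy_tag_snps(sample))
```

With these toy haplotypes the search returns one tag per correlated group (zero-based indices 0 and 3, that is, SNP1 and SNP4).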
Many empirical studies have shown highly significant levels of LD, and often strong associations between nearby SNPs, in the human genome.
Because the likelihood of recombination between two SNPs increases with the distance between them, on average such associations between
SNPs decline with distance.
B.A. Salisbury et al. Mutation Research 2003
Average linkage disequilibrium, |D|, vs. distance between SNPs for 2597 genes in which accurate distances were available. Lower values indicate a stronger effect of recombination and recurrent mutation. LD decreases with distance.
The strong associations between SNPs in a region have a practical value.
Genotyping only a few, carefully chosen SNPs in the region will provide enough information to predict much of the information about the remainder of the common SNPs in that region. As a result, only a few of these ‘tag’ SNPs are required to identify each of the common haplotypes in a region.
On the basis of empirical studies, it has been estimated that most of the information about genetic variation represented by the 10 million common SNPs in the population could be provided by genotyping 200,000 to 1,000,000 tag SNPs across the genome.
These observations are the conceptual and empirical foundation for developing a haplotype map of the human genome, the ‘HapMap’.
The International HapMap Project is a partnership of scientists and funding agencies from Canada, China, Japan, Nigeria, the United Kingdom and the United States to develop a public resource that will help researchers find genes associated with human disease and response to pharmaceuticals.
An initial meeting to discuss the scientific and ethical issues associated with developing a human haplotype map was held in Washington in 2001.
The International HapMap Project was then formally initiated in 2002.
The goal of the International HapMap Project is to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation.
NATURE |VOL 426 | 18/25 DECEMBER 2003
The HapMap is expected to be a key resource for researchers to use to find genes affecting health, disease, and responses
to drugs and environmental factors.
The information produced by the Project is freely available (www.hapmap.org)
The HapMap was designed to determine the frequencies and patterns of association among roughly 3 million common SNPs in four populations, for use in genetic association studies.
The HapMap project focuses only on common SNPs, those where each allele occurs in at least 1% of the population.
The project studied a total of 270 DNA samples:
- 90 samples from a US Utah population with Northern and Western European ancestry (samples collected in 1980 by the Centre d’Etude du Polymorphisme Humain (CEPH) and used for other human genetic maps)
- new samples collected from 90 Yoruba people in Ibadan, Nigeria
- 45 unrelated Japanese in Tokyo, Japan
- 45 unrelated Han Chinese in Beijing, China
The International HapMap Consortium decided to include several
populations from different ancestral geographic locations to ensure that the
HapMap would include most of the common variation and some of the less
common variation in different populations.
NATURE |VOL 426 | 18/25 DECEMBER 2003
Human Genome Project vs International HapMap Project
In its scope and potential consequences, the International HapMap Project has much in common with the Human Genome Project, which sequenced the human genome.
Both projects have been scientifically ambitious and technologically demanding, have involved intense international collaboration, have been dedicated to the rapid release of data into the public domain, and promise to have profound implications for our understanding of human biology and human health.
Whereas the sequencing project covered the entire genome, including the 99.9% of the genome where we are all the same, the HapMap will characterize the common patterns within the 0.1% where we differ from each other.
The project had become practical by the confluence of the following:
- the availability of the human genome sequence;
- databases of common SNPs (subsequently enriched by this project) from which genotyping assays could be designed;
- insights into human LD;
- development of inexpensive, accurate technologies for high-throughput SNP genotyping;
- web-based tools for storing and sharing data.
The International HapMap Consortium NATURE October 2005
The HapMap Project comprises two phases
The complete data obtained in Phase I were published in October 2005.
The analysis of the Phase II dataset was published in October 2007.
The Phase I HapMap
Phase I of the HapMap Project set as a goal genotyping at least one common SNP every 5 kb across the genome in each of 269 DNA samples.
For the sake of practicality, and motivated by the allele frequency distribution of variants in the human genome, a minor allele frequency (MAF) of 0.05 or greater was targeted for study.
Minor Allele Frequency (MAF) : The frequency at which the less abundant (or minor) allele of a SNP is present in a population. The MAF for a SNP to be considered common is usually above 1%.
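As a small worked example of the definition, the sketch below counts alleles across diploid genotypes and returns the frequency of the rarer one. The function name and the ten toy genotypes are made up for illustration.

```python
from collections import Counter


def minor_allele_frequency(genotypes: list[str]) -> float:
    """MAF of a biallelic SNP from diploid genotypes such as 'AA', 'AG', 'GG'."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) > 2:
        raise ValueError("expected a biallelic SNP")
    if len(counts) < 2:
        return 0.0  # only one allele observed: the site is monomorphic in this sample
    return min(counts.values()) / sum(counts.values())


# Hypothetical sample of 10 individuals (20 chromosomes).
genotypes = ["AA"] * 6 + ["AG"] * 3 + ["GG"] * 1
maf = minor_allele_frequency(genotypes)
print(maf)
print(maf >= 0.05)  # would this SNP meet the Phase I target of MAF >= 0.05?
```

Here allele G appears on 5 of 20 chromosomes, so the MAF is 0.25.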
The project required a dense map of SNPs, ideally containing information about validation and frequency of each candidate SNP.
When the project started, the public SNP database (dbSNP) contained 2.6 million candidate SNPs, few of which were annotated with the required information.
The HapMap Project contributed about 6 million new SNPs to dbSNP. As of October 2005, dbSNP contained 9.2 million candidate human SNPs.
To study patterns of genetic variation, ten 500-kb regions were selected from the ENCODE (Encyclopedia of DNA Elements) Project.
These ten regions were chosen to approximate the genome-wide average for G+C content, recombination rate, percentage of sequence conserved relative to mouse sequence, and gene density.
Each 500-kb region was sequenced in 48 individuals, and all SNPs in these regions (discovered or in dbSNP) were genotyped in the complete set of 269 DNA samples.
Using the data provided by HapMap, a team of scientists at Harvard Medical
School and the Broad Institute has discovered a new genetic variant
associated with age-related macular degeneration (AMD), the leading cause
of blindness in people over 60 years of age, as well as confirming previously
reported variants
Nature Genetics - 38, 1055 - 1059 (2006)
The new common genetic variant identified was found in a non-coding region of the Complement Factor H (CFH) gene, other variants of which were recently shown to be associated with the risk of developing AMD.
They estimate that genotypes related to just five variants in three different genes can explain 50% of the risk of developing AMD.
In addition to CFH on chromosome 1, the variants lie in:
- the complement factor B (BF) gene on chromosome 6
- the complement component 2 (C2) gene on chromosome 6
- hypothetical gene LOC387715 on chromosome 10, which carries a common variant (A69S)
Interestingly, these three genes do not appear to interact directly, but instead contribute to the risk of AMD independently.
Phase II HapMap characterizes over 3.1 million human SNPs genotyped in 270 individuals from four geographically diverse populations
- Genotyping in phase II was attempted for about 4.4 million distinct SNPs, of which roughly 1.3 million either could not be typed, were not polymorphic in any of the populations, or did not pass genotyping quality control filters.
- Certain regions of the genome were recognized as being challenging to study, such as centromeres, telomeres, gaps in genome sequence, and segmental duplications, regions declared to be not HapMapable.
The resulting HapMap has an SNP density of approximately one per kilobase and is estimated to contain approximately 25–35% of all the 9–10 million common SNPs in the assembled human genome.
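The Phase II figures quoted above can be cross-checked with simple arithmetic. The genome length used here (about 3.1 billion bases) is an approximate assumption that does not appear on the slides.

```python
# Figures quoted on the slides.
attempted_snps = 4_400_000
excluded_snps = 1_300_000
retained_snps = attempted_snps - excluded_snps  # SNPs in the Phase II HapMap

# Assumption (not from the slides): approximate length of the human genome in bases.
ASSUMED_GENOME_BP = 3_100_000_000
mean_spacing_bp = ASSUMED_GENOME_BP / retained_snps

# 25-35% of 9-10 million common SNPs.
lowest_expected = round(0.25 * 9_000_000)
highest_expected = round(0.35 * 10_000_000)

print(retained_snps)
print(mean_spacing_bp)
print(lowest_expected <= retained_snps <= highest_expected)
```

Under that assumption the retained SNPs work out to roughly one per kilobase, and their number falls inside the 25–35% range quoted for common SNPs.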
Variation in SNP density within the Phase II HapMap
Phase I
Phase II
Example of the fine-scale structure of SNP density for a 100-kb region on chromosome 17 showing polymorphic Phase I SNPs in the consensus data set (red triangles) and
polymorphic Phase II SNPs in the consensus data set (blue triangles)
The Phase II HapMap also differs from the Phase I HapMap in minor allele frequency (MAF) distribution. SNPs added in Phase II have lower MAF, so the Phase II HapMap includes a better representation of rare variation than the Phase I HapMap.
Advances in technology for high-throughput SNP genotyping
Advances in genotyping technology have vastly increased the number
of variants that can be typed and decreased the per-sample costs
These advances have made possible the
dense genotyping needed to capture the
majority of SNP variation within an individual
at a sufficiently low cost to allow the large
sample sizes needed for comparison of
individuals with and without disease
Studies in additional populations have shown that the tag SNPs chosen using the HapMap are generally transferable across other populations, but there are some limitations.
So additional samples from the populations used to develop the HapMap, as well as from seven more populations, have recently been genotyped across the genome:
- Luhya from Webuye, Kenya
- Maasai from Kenya
- Tuscans from Italy
- Indian-Americans (Gujarati) from Houston, TX
- Han Chinese from Denver
- Mexican-Americans from Los Angeles
- Americans of African Descent from the SW USA
It is now clear that the HapMap can be a useful resource for the design and analysis of disease association studies in populations across the world.
APPLICATION OF THE HAPMAP TO COMMON DISEASE
The technological advances directly stimulated or indirectly facilitated by the HapMap have had a profound impact on the study of the genetics of common diseases.
The history of high-density GWA scanning to date has demonstrated the striking success of this approach in finding genetic variants associated with disease.
Variants or regions associated with nearly 40 complex diseases have been identified in diverse population samples.
Major Autism Gene Found with Help of HapMap
Using data from the HapMap, along with DNA samples collected from many families who have affected children, researchers have discovered a genetic variation linked to autism, one of the most heritable mental health conditions.
They found a variation in the sequence of a gene - the “MET receptor tyrosine kinase gene” - that is associated with autism. This gene is involved in brain development, immune function, and digestive system repair.
The MET promoter variant rs1858830 allele "C" is strongly associated with ASD and results in reduced gene transcription. MET protein levels were significantly decreased in ASD cases compared with control subjects. People who have the variation are more than twice as likely as others to have “autism spectrum disorders”.
Campbell DB et al, Ann Neurol. 2007
A genome-wide association study identifies novel risk loci for type 2 diabetes
Type 2 diabetes mellitus results from the interaction of environmental factors with a combination of genetic variants.
A systematic search for these variants was recently made possible by the development of high-density arrays that permit the genotyping of hundreds of thousands of polymorphisms.
Researchers tested 392,935 SNPs in a French case–control cohort.
Markers with the most significant difference in genotype frequencies between cases of type 2 diabetes and controls were fast-tracked for testing in a second cohort.
This identified four loci containing variants that confer type 2 diabetes risk, in addition to confirming the known association with the TCF7L2 gene.
These loci include a non-synonymous polymorphism in the zinc transporter SLC30A8, which is expressed exclusively in insulin-producing β-cells, and two linkage disequilibrium blocks that contain genes potentially involved in β-cell development or function (IDE–KIF11–HHEX and EXT2–ALX4).
Sladek R et al. Nature 445, 881-885 (2007)
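To give a feel for how a single marker is compared between cases and controls, the sketch below runs a basic allelic chi-square test (1 degree of freedom) on a 2x2 table of allele counts. It is a generic textbook test, not the analysis used in the study cited above; the allele counts are invented, and the Bonferroni threshold is only one simple way to account for testing 392,935 SNPs.

```python
import math


def allelic_association(case_minor: int, case_major: int,
                        control_minor: int, control_major: int):
    """Chi-square statistic, p-value (1 d.f.) and odds ratio for a 2x2 allele table."""
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio


# Hypothetical allele counts for one SNP (500 cases and 500 controls, 2 alleles each).
chi2, p_value, odds_ratio = allelic_association(300, 700, 200, 800)

# Illustrative multiple-testing threshold for the number of SNPs tested.
bonferroni_threshold = 0.05 / 392_935

print(f"chi2={chi2:.2f} p={p_value:.2e} OR={odds_ratio:.2f}")
print(f"Bonferroni threshold={bonferroni_threshold:.2e}")
```

Because hundreds of thousands of markers are tested at once, a p-value that looks very small for a single test may still fall short of a genome-wide threshold, which is one reason top markers are re-tested in a second cohort.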
Future of the HapMap Project
Currently, additional samples from the populations used to develop the initial HapMap, as well as samples from seven additional populations, will be sequenced and genotyped extensively to extend the HapMap, providing information on rarer variants and helping to enable genome-wide association studies in additional populations.
There are also ongoing efforts by many groups to characterize additional forms of genetic variation, such as structural variation, and molecular phenotypes in the HapMap samples. Finally, in the future, whole-genome sequencing will provide a natural convergence of technologies to type both SNP and structural variation.
Nevertheless, until that point the HapMap Project data will provide an invaluable resource for understanding the structure of human genetic variation and its link to phenotype.
Beyond SNPs: Copy Number Variants and Other Structural Variation
Current generation high-throughput genotyping platforms are extraordinarily efficient at genotyping SNPs, but they are less effective at genotyping structural variants, such as insertions, deletions, inversions, and copy number variants.
Although not as common as SNPs, these variants also occur commonly in the human genome
The distribution of copy number variation in the human genome among 270 HapMap samples
A copy number variant (CNV) is a segment of DNA in which copy-number differences have been found by comparison of two or more genomes.
CNVs, in which stretches of genomic sequence of roughly 1 kb to 3 Mb in size are deleted or duplicated in varying numbers, have gained increasing attention because of their apparent ubiquity and potential dosage effect on gene expression.
In 2004, the interrogation of genomic variability by array hybridization methods clearly demonstrated the existence of copy number variants.
Intense analysis of this type of genomic variability followed, and
the current conservative estimate from studies in a few hundred
individuals is that at least 10% of the genome is subject to copy
number variation
Although a typical SNP affects only one single nucleotide
pair, their genomic abundance (over 10 million) makes
them the most frequent source of polymorphic changes
By contrast, CNVs are far less numerous but can affect
from one kilobase to several megabases of DNA per
event, adding up to a significant fraction of the genome
It is now recognized that the genomes of any two individuals in the human population differ more at the structural level than at the nucleotide sequence level
NATURE GENETICS SUPPLEMENT | VOLUME 39 | JULY 2007
- Much of what was previously known about the role of CNVs in disease comes from a rich literature on ‘genomic disorders’.
- Genomic disorders are defined as a diverse group of genetic diseases that are each caused by an alteration in DNA copy number.
- These mutations can be relatively large, microscopically visible imbalances, such as in Prader-Willi syndrome, or they may be much smaller, requiring higher resolution detection methods, such as in Williams Syndrome.
- Genomic disorders are typically sporadic in nature because the CNV in most cases is a de novo mutation with nearly complete penetrance, and because the affected individuals have severe developmental problems and are unlikely to have offspring.
- However, there are notable examples of mendelian disease traits associated with CNVs. For example, duplications of the gene for peripheral myelin protein 22 (PMP22) cause the dominant neuropathy Charcot-Marie-Tooth disease type 1A, and deletions of the α-globin gene cluster cause the recessive anemia α-thalassemia.
Bibliography
- The International HapMap Consortium. The International HapMap Project. NATURE. 426: 18/25, December 2003.
- Deloukas P, Bentley D. The HapMap project and its application to genetic studies of drug response. The Pharmacogenomics Journal. 4, 88–90 (2004).
- The International HapMap Consortium. A haplotype map of the human genome. NATURE. 437: 27, October 2005.
- Manolio TA, Brooks LD, Collins FS. A HapMap harvest of insights into the genetics of common disease. The Journal of Clinical Investigation. 118: 5, May 2008.
- The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. NATURE. 449: 18, October 2007.
- Maller J, George S, Purcell S, Fagerness J, Altshuler D, Daly MJ, Seddon JM. Common variation in three genes, including a noncoding variant in CFH, strongly influences risk of age-related macular degeneration. Nat Genet. 38:9 (1055-9), Sep 2006.