
Post on 17-Aug-2020


Sequence Variations

Baxevanis and Ouellette, Chapter 7 - Sequence Polymorphisms

NCBI SNP Primer: http://www.ncbi.nlm.nih.gov/About/primer/snps.html

Overview

• Mutation and Alleles
  – Linkage
  – Genetic variation in populations

• SNPs as genetic markers
  – “Classical” genetic diseases
  – Multi-factorial diseases and risk factors
  – Genome scans (genotyping)

A review of some basic genetics

Alleles
• An allele is a particular DNA sequence for a gene.
• Some gene alleles are responsible for ordinary phenotypes like blue/brown eyes.
• Others lead to classic genetic diseases like cystic fibrosis or Huntington’s disease.

Changes occur in DNA sequences = mutations

Many Causes of Mutations

• Somatic vs. reproductive cells
• Radiation and/or chemical damage to DNA
• Random errors of the replication machinery
• Normal biological processes - methylation

• Mutations occur randomly throughout DNA.

• Most have no phenotypic effect (non-coding regions, equivalent codons, similar AAs).

• Some damage the function of a protein or regulatory element.

• A very few provide an evolutionary advantage.

Mutations Create Alleles

Population Genetics
• Chromosome pairs segregate and recombine in every generation.

• Every allele of every gene has its own independent evolutionary history (and future).

• Frequencies of various alleles differ in different sub-populations of people.
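The last point can be made concrete with a small calculation. The sketch below estimates allele frequencies from genotype counts, using the standard counting rule that each homozygote carries two copies of an allele and each heterozygote carries one. The genotype counts and the function name are hypothetical and serve only as an illustration.

```python
def allele_frequencies(n_AA, n_Aa, n_aa):
    """Return (freq of A, freq of a) from diploid genotype counts."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)   # two alleles per person
    freq_A = (2 * n_AA + n_Aa) / total_alleles
    return freq_A, 1.0 - freq_A

# Hypothetical genotype counts for two sub-populations of 100 people each.
pop1 = allele_frequencies(n_AA=50, n_Aa=40, n_aa=10)
pop2 = allele_frequencies(n_AA=10, n_Aa=40, n_aa=50)

print("sub-population 1:", pop1)
print("sub-population 2:", pop2)
```

With these made-up counts the same allele is common in one sub-population and uncommon in the other, which is the situation the slide describes.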

Human Alleles
• The OMIM (Online Mendelian Inheritance in Man) database at the NCBI tracks all human mutations with known phenotypes.

• It contains a total of about 2,000 genetic diseases [and another ~11,000 genetic loci with known phenotypes - but not necessarily known gene sequences]

• It is designed for use by physicians:
  – can search by disease name
  – contains summaries from clinical studies

OMIM Morbid Map: Cytogenetic map location of disease genes.

Variation Makes Life Interesting

• The Human Genome has been sequenced; what’s next?

• Much of what makes us unique individuals is represented by the differences in our DNA sequence from other people.

• There are rare and common forms (alleles) of every gene.

• Probably only 3-4 alleles are present in 95% of the population for most genes, but lots of rare mutations.

SNPs are Mutations

SNPs
• A mutation that causes a single base change is known as a Single Nucleotide Polymorphism (SNP).

• Other kinds of mutations include insertions and deletions.

• Large breaks and rearrangements of chromosomes also occur (translocations).

GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG
          ^
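As a minimal sketch, the single-base difference marked above can be found by comparing the two sequences position by position. The function name is illustrative, and the code assumes the sequences are already aligned and of equal length (it does not handle insertions or deletions).

```python
def single_base_differences(seq_a, seq_b):
    """Return (0-based index, base in seq_a, base in seq_b) for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length (indels are not handled)")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

# The two sequences from the slide.
allele_1 = "GATTTAGATCGCGATAGAG"
allele_2 = "GATTTAGATCTCGATAGAG"

diffs = single_base_differences(allele_1, allele_2)
print(diffs)  # [(10, 'G', 'T')]
```

The single mismatch is at 0-based index 10, the 11th base, where G is replaced by T.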

SNPs are Very Common
• SNPs are very common in the human population.
• Between any two people, there is an average of one SNP every ~1250 bases.
• Most of these have no phenotypic effect.

  – Fewer than 1% of all human SNPs affect protein function (most fall in non-coding regions).

  – Selection acts against missense mutations (think about what would happen to dominant lethal mutations).

• Some are alleles of genes.

Genome Sequencing finds SNPs
• The Human Genome Project involves sequencing DNA cloned from a number of different people. [The Celera sequence comes from 5 people.]

• Even within one person’s DNA, the homologous chromosomes have SNPs.

• This inevitably leads to the discovery of SNPs: any single-base sequence difference.

• These SNPs can be valuable as the basis for diagnostic tests.

We describe a map of 1.42 million single nucleotide polymorphisms (SNPs) distributed throughout the human genome, providing an average density on available sequence of one SNP every 1.9 kilobases. These SNPs were primarily discovered by two projects: The SNP Consortium and the analysis of clone overlaps by the International Human Genome Sequencing Consortium. The map integrates all publicly available SNPs with described genes and other genomic features. We estimate that 60,000 SNPs fall within exon (coding and untranslated regions), and 85% of exons are within 5 kb of the nearest SNP. Nucleotide diversity varies greatly across the genome, in a manner broadly consistent with a standard population genetic model of human history. This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.
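The density figures quoted so far can be checked with simple arithmetic. The sketch below uses the 1.42 million SNPs and 1.9 kb spacing from the abstract and the ~1250-base spacing between two people from the earlier slide. The genome length of about 3.2 billion bases is an assumption added here; it is not stated in the slides.

```python
# Figures taken from the abstract above.
snps_in_map = 1.42e6        # SNPs in the map
map_spacing_bp = 1.9e3      # one SNP every 1.9 kilobases of available sequence

# Amount of sequence implied by those two numbers.
implied_sequence_bp = snps_in_map * map_spacing_bp

# Figure taken from the "SNPs are Very Common" slide.
pairwise_spacing_bp = 1250  # one difference every ~1250 bases between two people

# Assumption (not from the slides): haploid human genome of ~3.2 billion bases.
genome_bp = 3.2e9
expected_pairwise_differences = genome_bp / pairwise_spacing_bp

print(f"sequence covered by the map: ~{implied_sequence_bp / 1e9:.1f} Gb")
print(f"expected differences between two genomes: ~{expected_pairwise_differences / 1e6:.2f} million")
```

Under these assumptions the map covers roughly 2.7 Gb of sequence, and two genomes are expected to differ at about 2.6 million single-base positions.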

http://www.ncbi.nlm.nih.gov/snp

SNP Discovery: dbSNP database

Search dbSNP with BLAST

“As of June, 2008, dbSNP has 12.8 million SNPs in the human genome”

• It is possible to search dbSNP by BLAST comparisons to a target sequence

>gnl|dbSNP|rs1042574_allelePos=51 total len = 101 |taxid = 9606|snpClass = 1
          Length = 101

 Score = 149 bits (75), Expect = 3e-33
 Identities = 79/81 (97%)
 Strand = Plus / Plus

Query: 1489 ccctcttccctgacctcccaactctaaagccaagcactttatatttttctcttagatatt 1548
            ||||||||||||||||||||||||||||||||||||||||||||||| || |||||||||
Sbjct: 1    ccctcttccctgacctcccaactctaaagccaagcactttatattttcctyttagatatt 60

Query: 1549 cactaaggacttaaaataaaa 1569
            |||||||||||||||||||||
Sbjct: 61   cactaaggacttaaaataaaa 81

If a matching SNP is found, then it can be directly located on the Genome map.
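In the hit above, the dbSNP sequence marks the polymorphic base with an IUPAC ambiguity code (`y` means C or T) at subject position 51, which matches `allelePos=51` in the header. The following sketch locates such codes in an aligned query/subject pair and reports the query coordinate and whether the query carries one of the listed alleles. The function name and return format are illustrative; this is not part of BLAST or dbSNP.

```python
# Standard IUPAC nucleotide codes (lower case, as in the BLAST output).
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "ag", "y": "ct", "s": "cg", "w": "at", "k": "gt", "m": "ac",
    "b": "cgt", "d": "agt", "h": "act", "v": "acg", "n": "acgt",
}

def snp_sites(query, subject, query_start=1):
    """List (query coordinate, query base, possible alleles, query matches an allele)
    for every ambiguity code in the subject of an ungapped alignment."""
    sites = []
    for offset, (q, s) in enumerate(zip(query, subject)):
        alleles = IUPAC[s]
        if len(alleles) > 1:  # ambiguity code marks the polymorphic base
            sites.append((query_start + offset, q, alleles, q in alleles))
    return sites

# First alignment block from the BLAST hit above.
query = "ccctcttccctgacctcccaactctaaagccaagcactttatatttttctcttagatatt"
subject = "ccctcttccctgacctcccaactctaaagccaagcactttatattttcctyttagatatt"

hits = snp_sites(query, subject, query_start=1489)
print(hits)  # [(1539, 'c', 'ct', True)]
```

The SNP falls at query coordinate 1539, and the query carries the C allele. The other mismatch in the block (t versus c at query coordinate 1536) is not an ambiguity code, so it is not reported as the annotated SNP.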

Uses for SNPs
• Diagnostic tests for disease alleles
• Markers to aid in cloning of interesting genes (disease genes)
• Pharmacogenomics - genetics of response to drugs (effectiveness and side effects)

DNA Diagnostic Testing
• Hereditary diseases - potential parents, pre-natal, late onset diseases.
• Genes that predispose to disease (risk factors).
• Genotyping of infectious agents (bacterial & viral).
• Forensics - using DNA testing to establish identity.

Clinical Manifestations of Genetic Variation

(All disease has a genetic component)
• Susceptibility vs. resistance
• Variations in disease severity or symptoms
• Reaction to drugs (pharmacogenetics)
• Variable disease course and prognosis

SNPs can be found that are linked to all of these traits.

Finding Disease Genes
• Virtually all diseases have a genetic component.
• Start with DNA samples from families that show inheritance of the disease.
• Use STS markers to map the gene or genes involved (linkage analysis).
• Find SNPs in the genetic region(s) that are likely candidates for involvement in that disease.
• Get the gene from genomic sub-clone.

Some Diseases Involve Many Genes

• There are a number of classic “genetic diseases” caused by mutations of a single gene.
  – Huntington’s, Cystic Fibrosis, Tay-Sachs, PKU, etc.

• There are also many diseases that are the result of the interactions of many genes:
  – asthma, heart disease, cancer

• Each of these genes may be considered to be a risk factor for the disease.

• Groups of genetic markers (SNPs) may be associated with a disease without determining a mechanism.
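One common way to test whether a SNP allele is associated with a disease, without knowing any mechanism, is to compare allele counts in cases and controls. The sketch below computes the Pearson chi-square statistic and the odds ratio for a 2×2 table. The allele counts are hypothetical, and the slides do not prescribe this particular test.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom, no continuity
    correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def odds_ratio(a, b, c, d):
    """Odds of carrying the first allele in row 1 relative to row 2."""
    return (a * d) / (b * c)

# Hypothetical allele counts: rows are cases and controls,
# columns are allele T and allele C at one SNP.
cases_T, cases_C = 60, 40
controls_T, controls_C = 40, 60

statistic = chi_square_2x2(cases_T, cases_C, controls_T, controls_C)
ratio = odds_ratio(cases_T, cases_C, controls_T, controls_C)

print("chi-square:", statistic)   # compare with 3.841, the 5% critical value for 1 df
print("odds ratio:", ratio)
```

A statistic above 3.841 indicates association at the 5% level for a single test. A genome scan tests many SNPs, so a much stricter threshold would be needed in practice.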

Multiple Causes

• Some diseases may actually be caused by any of a group of different genes (multiple causes), but all show the same symptoms.

• SNP linkage analysis can identify these sub-populations more efficiently than classical molecular genetic approaches.

• Machine learning, genetic algorithms, SVMs

“The study of the distribution of genetic variants, including SNPs, lies within the domain of population genetics, and the study of the relationship between SNPs and phenotypic variation lies in the domain of quantitative genetics.”

Gibson & Muse

Quantitative Trait Locus Mapping

[Figure: QTL mapping cross. Parent 1 × Parent 2 and Parent 3 × Parent 4 (a b c / a b c × A B C / A B C) each give an F1 of genotype A B C / a b c. F1 × F1 gives progeny carrying recombinant chromosomes such as A B c, a B C, A b c, a b C and a B c. Inset plot: HEIGHT against GENOTYPE (BB, Bb, bb) for the B/b marker.]

Knott et al. (1997) TAG 84:810-820
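The inset plot of height against marker genotype is the core of a single-marker QTL analysis: group the progeny by genotype at a marker and compare the mean trait value of each group. A minimal sketch follows; the height values are made up for illustration and are not data from the cited study.

```python
from statistics import mean

# Hypothetical (marker genotype, height) observations for F2 progeny.
progeny = [
    ("BB", 12.0), ("BB", 11.0), ("BB", 13.0),
    ("Bb", 10.0), ("Bb", 9.0), ("Bb", 11.0),
    ("bb", 8.0), ("bb", 7.0), ("bb", 9.0),
]

def mean_by_genotype(observations):
    """Return {genotype: mean trait value}."""
    groups = {}
    for genotype, value in observations:
        groups.setdefault(genotype, []).append(value)
    return {genotype: mean(values) for genotype, values in groups.items()}

means = mean_by_genotype(progeny)

# Additive effect: half the difference between the two homozygote means.
additive_effect = (means["BB"] - means["bb"]) / 2
# Dominance deviation: heterozygote mean minus the midpoint of the homozygotes.
dominance_deviation = means["Bb"] - (means["BB"] + means["bb"]) / 2

print(means, additive_effect, dominance_deviation)
```

A marker whose genotype classes differ clearly in mean trait value is likely to be linked to a QTL. A real analysis would add a significance test and scan every marker along the chromosome.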

Association Mapping

[Figure: ancestral chromosomes, one of which carries a mutation (*) on a T G marker background; recombination through evolutionary history; present-day chromosomes in a natural population (* T G, * T A, C G, C A).]

SNP Discovery Methods

•  Pairwise Sequence Comparison from databases, eSNP

•  Deep Resequencing

SNP Analysis Agenda

Sequence-Based SNP Identification

Common Bioinformatic Solutions: Phred, Phrap, Consed, Polyphred, and Polybayes

High-Throughput SNP Identification Solution

•  Overlapping PCR Amplicons across entire gene
•  Make no assumptions about sequence function

•  Sequence diversity and genetic structure for each gene is different
•  Proper association studies can only be designed in this context
•  Complete resequencing facilitates population genetics methods

Sequence-based SNP Identification

[Figure: workflow]
•  Amplify DNA (5’ to 3’) and sequence each end of the fragment.
•  Phred: base-calling, quality determination
•  Phrap: contig assembly, final quality determination
•  Consed: sequence viewing, polymorphism tagging
•  PolyPhred/Polybayes: polymorphism detection
•  Analysis: polymorphism reporting, individual genotyping, phylogenetic analysis

Example sequences ATAGACG and ATACACG, shown as homozygotes and as a heterozygote.

Phred, Phrap, Consed, Polyphred, Polybayes

•  phred: Base calling and quality assignments

•  phrap: Contig formation and new quality assignments

•  consed: Visual X-Windows graphic interface, to view and edit alignments and contigs, and to view the original traces

•  polyphred: find polymorphisms in phrap contigs, quality calls, add data to phrap files to permit consed finding and visualization of polymorphisms.

•  polybayes: Fully probabilistic SNP detection algorithm that calculates the probability (SNP score) that discrepancies at a given location of a multiple alignment represent true sequence variations as opposed to sequencing errors.
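The quality values that phred assigns follow the standard Phred definition, Q = -10 · log10(P), where P is the estimated probability that the base call is wrong. The slides do not state this formula, so it is added here as background. A short sketch of the conversion in both directions (function names are illustrative):

```python
import math

def phred_to_error(q):
    """Estimated probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred quality corresponding to error probability p."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(f"Q{q}: about 1 error in {round(1 / phred_to_error(q))} base calls")
```

Each step of 10 in quality corresponds to a tenfold drop in error probability: Q20 is about one error in 100 calls, Q30 about one in 1000.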

Figure 1. Application of the POLYBAYES procedure to EST data. a, Regions of known human repeats in a genomic sequence are masked. b, Matching human ESTs are retrieved from dbEST and traces are re-called. c, Paralogous ESTs are identified and discarded. d, Alignments of native EST reads are screened for candidate variable sites. e, An STS is designed for the verification of a candidate SNP. f, The uniqueness of the genomic location is determined by sequencing the STS in CHM1 (homozygous DNA). g, The presence of a SNP is analysed by sequencing the STS from pooled DNA samples.

Nature Genetics 23, 452 - 456 (1999)

A general approach to single-nucleotide polymorphism discovery

Gabor T. Marth, Ian Korf, Mark D. Yandell, Raymond T. Yeh, Zhijie Gu, Hamideh Zakeri, Nathan O. Stitziel, LaDeana Hillier, Pui-Yan Kwok & Warren R. Gish

PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing

Deborah A. Nickerson, Vincent O. Tobe and Scott L. Taylor

Nucleic Acids Research 1997; 25:2745

SNP calling

[Figure: four example trace panels labeled Correct call, False positive, False positive, False positive.]

Trace File
•  High quality region – no ambiguities
•  Medium quality region – some ambiguities
•  Poor quality region – low confidence

Using PolyPhred to Visualize SNPs

• Compares sequences across traces obtained from different individuals to identify sites for SNPs.
• Will occasionally miscall genotypes; the frequency of such mistakes depends on the sequencing chemistry used to generate the trace.
• To reduce the number of miscalled sites, it ignores poor-quality regions and the ends of reads.

Polyphred
–  Reads the ACE file to obtain the consensus sequence and the names of the trace (chromat) files used in the assembly.

–  Reads the PHD files associated with each trace.

–  During the SNP search phase, PolyPhred combines information from all of the sequence traces to derive a genotype and a score for each sequence.

–  The score indicates how well the trace at the site matches the expected pattern for a SNP.

–  Updates the ACE and PHD files by adding tags that mark the positions of the sites. The tagged sites can then be examined using Consed.

Polybayes
Bayesian statistical model takes into account:
- depth of coverage
- base quality values of the sequences

Polybayes calculations are aided by information on major/minor allele frequencies as well as polymorphism rates within the species under investigation.

**Can also integrate into the poly files for viewing with Consed
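To illustrate how depth of coverage, base quality, and a prior polymorphism rate can combine in a Bayesian calculation, here is a deliberately simplified toy model. It is not the Polybayes algorithm. The assumptions are: reads are independent, a miscalled base is equally likely to be any of the other three bases, a polymorphic site has exactly two alleles sampled with equal probability, and the prior polymorphism rate of 0.001 is an arbitrary illustrative value.

```python
from itertools import combinations

BASES = "ACGT"

def phred_to_error(q):
    return 10 ** (-q / 10)

def base_likelihood(observed, true_base, error):
    # Assumption: a miscall is equally likely to be any of the other three bases.
    return 1 - error if observed == true_base else error / 3

def snp_posterior(reads, prior_snp=0.001):
    """reads: list of (base, phred quality) in one alignment column.
    Returns P(site is polymorphic | reads) under the toy model."""
    # Monomorphic hypotheses: all reads derive from one true base.
    mono = 0.0
    for true_base in BASES:
        likelihood = 1.0
        for base, q in reads:
            likelihood *= base_likelihood(base, true_base, phred_to_error(q))
        mono += (1 - prior_snp) / len(BASES) * likelihood
    # Polymorphic hypotheses: two alleles, each read drawn from either with p = 1/2.
    pairs = list(combinations(BASES, 2))
    poly = 0.0
    for a, b in pairs:
        likelihood = 1.0
        for base, q in reads:
            e = phred_to_error(q)
            likelihood *= 0.5 * base_likelihood(base, a, e) + 0.5 * base_likelihood(base, b, e)
        poly += prior_snp / len(pairs) * likelihood
    return poly / (mono + poly)

balanced = snp_posterior([("A", 40)] * 4 + [("G", 40)] * 4)   # 4 A and 4 G, all high quality
uniform = snp_posterior([("A", 40)] * 8)                      # no discrepancy
one_good = snp_posterior([("A", 40)] * 7 + [("G", 40)])       # one high-quality discrepancy
one_poor = snp_posterior([("A", 40)] * 7 + [("G", 10)])       # one low-quality discrepancy

print(balanced, uniform, one_good, one_poor)
```

In this toy model a balanced, high-quality split gives a posterior close to 1, a column with no discrepancies gives a posterior close to 0, and a single discrepant base counts for far more when its quality is high than when it is low.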

•  Alignment: critical in the automation of base calls
  –  The commonly used Phrap (from PhredPhrap) is an assembler and is NOT ideal for alignments
  –  Many commonly used aligners work best with protein sequences or with a reference sequence
  –  Preservation of quality scores for input into SNP identification programs
  –  Speed for high-throughput programs

•  Automated SNP Calls
  -  Reference sequence required
  -  Traditional approaches without a reference sequence include “eSNPs” (human, maize, and pine)
     - Very little redundancy outside of abundant genes
     - Overall high number of false positives (single-pass reads)
  -  Not specific to frequencies observed in different organisms
  -  High number of false positives in currently accepted methods (Polybayes & Polyphred)

Alignment and SNP Calling Pipeline
Challenges in High-Throughput SNP Identification

[Figure: gene map of 4-Coumarate CoA Ligase (4CL) on a 0 to 2500 bp scale, showing the 5’ UTR, exons, introns and 3’ UTR, with marked coordinates 1, 994, 1410, 1609, 1697, 1845, 1934, 2004, 2385 and 2589, and overlapping PCR amplicons defined by forward (F) and reverse (R) primers. Annotated features: 743-781 bound_moiety="AMP"; 2396-2417 proposed active site. SNP sites are labeled s1 to s9 and s11.]

[Figure: 4CL haplotypes across the ten SNP sites, each repeated in proportion to its frequency. Distinct haplotypes shown:
G T A G T C G G G C (most frequent)
A C T A C T G A A T
A C T A C T G G A T
A C T A C C G G A T
A C T A C C G G A C
A C T A C C A G A C
A C T G T C G G G C
G C A G C C G G G C]

4CL haplotype frequencies
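Haplotype frequencies of this kind are obtained by counting how often each haplotype occurs among the sampled chromosomes. A minimal sketch follows. The haplotype strings are taken from the figure, but the sample composition is hypothetical and does not reproduce the counts in the original figure.

```python
from collections import Counter

# Hypothetical sample of 10 chromosomes (haplotypes written without spaces).
sampled = ["GTAGTCGGGC"] * 6 + ["ACTACCAGAC"] * 3 + ["GCAGCCGGGC"]

counts = Counter(sampled)
frequencies = {hap: n / len(sampled) for hap, n in counts.items()}

# Report haplotypes from most to least common.
for hap, n in counts.most_common():
    print(hap, n, frequencies[hap])
```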
