predicting the effects of amino acid substitutions on...

22
Predicting the Effects of Amino Acid Substitutions on Protein Function Pauline C. Ng and Steven Henikoff Fred Hutchinson Cancer Research Center, Seattle, Washington 98109; email: [email protected], [email protected] Annu. Rev. Genomics Hum. Genet. 2006. 7:61–80 The Annual Review of Genomics and Human Genetics is online at genom.annualreviews.org This article’s doi: 10.1146/annurev.genom.7.080505.115630 Copyright c 2006 by Annual Reviews. All rights reserved 1527-8204/06/0922-0061$20.00 Key Words nsSNP, nonsynonymous, single nucleotide polymorphism, coding SNP, missense Abstract Nonsynonymous single nucleotide polymorphisms (nsSNPs) are coding variants that introduce amino acid changes in their corre- sponding proteins. Because nsSNPs can affect protein function, they are believed to have the largest impact on human health compared with SNPs in other regions of the genome. Therefore, it is impor- tant to distinguish those nsSNPs that affect protein function from those that are functionally neutral. Here we provide an overview of amino acid substitution (AAS) prediction methods, which use se- quence and/or structure to predict the effect of an AAS on protein function. Most methods predict approximately 25–30% of human nsSNPs to negatively affect protein function, and such nsSNPs tend to be rare in the population. We discuss the utility of AAS prediction methods for Mendelian and complex diseases as well as their broader applications for understanding protein function. 61 Annu. Rev. Genom. Human Genet. 2006.7:61-80. Downloaded from arjournals.annualreviews.org by INSTITUTE FOR GENOMIC RESEARCH on 09/14/06. For personal use only.

Upload: lyminh

Post on 17-Apr-2018

222 views

Category:

Documents


4 download

TRANSCRIPT

ANRV285-GG07-03 ARI 8 August 2006 1:43

Predicting the Effects ofAmino Acid Substitutionson Protein FunctionPauline C. Ng and Steven HenikoffFred Hutchinson Cancer Research Center, Seattle, Washington 98109;email: [email protected], [email protected]

Annu. Rev. Genomics Hum. Genet. 2006.7:61–80

The Annual Review of Genomics and HumanGenetics is online atgenom.annualreviews.org

This article’s doi:10.1146/annurev.genom.7.080505.115630

Copyright c© 2006 by Annual Reviews.All rights reserved

1527-8204/06/0922-0061$20.00

Key Words

nsSNP, nonsynonymous, single nucleotide polymorphism, codingSNP, missense

AbstractNonsynonymous single nucleotide polymorphisms (nsSNPs) arecoding variants that introduce amino acid changes in their corre-sponding proteins. Because nsSNPs can affect protein function, theyare believed to have the largest impact on human health comparedwith SNPs in other regions of the genome. Therefore, it is impor-tant to distinguish those nsSNPs that affect protein function fromthose that are functionally neutral. Here we provide an overview ofamino acid substitution (AAS) prediction methods, which use se-quence and/or structure to predict the effect of an AAS on proteinfunction. Most methods predict approximately 25–30% of humannsSNPs to negatively affect protein function, and such nsSNPs tendto be rare in the population. We discuss the utility of AAS predictionmethods for Mendelian and complex diseases as well as their broaderapplications for understanding protein function.

61

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

Online MendelianInheritance in Man(OMIM): databasecontaining mutationsdiscovered inpatients and found ingenes known to beinvolved in disease

Human GeneMutation Database(HGMD): databasecontaining mutationsin genes known to beinvolved in disease

NonsynonymousSNP (nsSNP): asingle nucleotidepolymorphismlocated in a codingregion that causes anamino acidsubstitution in thecorrespondingprotein

Amino acidsubstitution (AAS)prediction method:bioinformatics toolthat evaluateswhether an AASaffects proteinfunction

INTRODUCTION

Most genetic variation is considered neutralbut single base changes in and around a genecan affect its expression or the function of itsprotein products (11, 56). A nonsynonymousor missense variant is a single base changein a coding region that causes an amino acidchange in the corresponding protein. If a non-synonymous variant alters protein function,the change can have drastic phenotypic conse-quences. Most alterations are deleterious andso are eventually eliminated through purify-ing selection. However, beneficial mutationscan sweep through the population and be-come fixed, thus contributing to species dif-ferentiation.

The importance of nonsynonymous sub-stitutions in humans is illustrated by twodatabases containing disease-causing vari-ants, Online Mendelian Inheritance inMan (OMIM) and Human Gene Muta-tion Database (HGMD) (24, 62). In bothdatabases, nonsynonymous changes accountfor approximately half of the genetic changesknown to cause disease. Although thesedatabases contain information primarily con-cerning disorders caused by single Mendelianlesions, it is likely that nonsynonymouschanges will play a similarly important role incomplex diseases because of their potentiallylarge impact.

The human population is estimated tohave 67,000–200,000 common nonsynony-mous SNPs (nsSNPs) (8, 23, 35) and eachperson is thought to be heterozygous for24,000–40,000 nsSNPs (8). It would be time-consuming, difficult, and expensive to ex-perimentally characterize the impact of eachnsSNP on protein function. But because anamino acid change can have a large impacton fitness, a computational method that couldpredict whether an amino acid substitution(AAS) affects protein function would help re-searchers prioritize AASs for additional study.The observation that disease-causing muta-tions are more likely to occur at positions thatare conserved throughout evolution, as com-

pared with positions that are not conserved,suggested that prediction could be based onsequence homology (39). It was also observedthat disease-causing AASs had common struc-tural features that distinguished them fromneutral substitutions, suggesting that struc-ture could also be used for prediction (68,77). Since these studies were performed, aplethora of AAS prediction methods basedon sequence and/or structure have becomeavailable (7, 9, 14, 16–19, 27, 31, 33, 40–42, 45–47, 53, 58, 59, 64, 65, 68, 69, 72,73).

In this review we first survey existing AASprediction methods and summarize the his-tory of the field. We also offer practical advicefor researchers who would like to use AAS pre-diction methods. Next, we look at the useful-ness of AAS prediction methods in identify-ing candidate mutations responsible for bothMendelian and complex diseases. Third, wediscuss other applications of AAS predictionmethods. Finally, we discuss likely future im-provements of such methods.

METHODOLOGY OF AMINOACID SUBSTITUTIONPREDICTION

Basic Methodology

AAS prediction methods use sequenceand/or structural information for prediction.Prediction is feasible because mutationsthat affect protein function tend to occur atevolutionarily conserved sites (Figure 1a)and/or are buried in protein structure(Figure 1b). These observations came fromseveral early studies that used AASs foundin disease genes in affected individuals (39,68, 77). These studies assumed that thesesubstitutions affected protein function,thereby causing disease. These studies alsoassumed that a majority of nsSNPs in hu-mans or the substitutions observed betweenhumans and closely related species arefunctionally neutral. When Wang & Moult(77) modeled disease-causing mutations

62 Ng · Henikoff

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

onto their corresponding wild-type proteinstructures, they found that 83% of disease-causing mutations affected protein stability.By applying the stability criterion andseveral other structural criteria, they coulddetect 90% of disease-causing mutations. Incontrast, only 30% of neutral nsSNPs weredetected with the same set of rules, whichsuggests that their rules could be used todistinguish disease-causing mutations fromneutral nsSNPs. Using both structure andsequence, Sunyaev et al. (68) could detect70% of disease-causing mutations and only17% of neutral substitutions. Based on anal-ysis of protein sequences, Miller & Kumar(39) showed that disease-causing AASs areoverabundant at conserved sites.

Based on these observations, AAS predic-tion methods using either sequence and/orstructural information were introduced, someof which were also implemented as Webservers [Table 1; Supplemental Table 1 (fol-low the Supplemental Material link from theAnnual Reviews home page at http://www.annualreviews.org.)] (7, 9, 14, 16–19, 27, 31,33, 40–42, 45–47, 53, 58, 59, 64, 65, 68, 69,72, 73). The typical procedure used by AASprediction methods is shown in Figure 2. Tomake a prediction, AAS prediction methodscan use sequence, structure, and/or annota-tion. Sequence-based AAS prediction meth-ods (14, 18, 19, 42, 45, 58, 69, 73) acceptan input sequence and search it against asequence database to find homologous se-quences. A multiple sequence alignment ofthe homologous sequences reveals what posi-tions have been conserved throughout evolu-tionary time, and these positions are inferredto be important for function. The AAS pre-diction method then scores the AAS basedon the amino acids appearing in the multi-ple alignment and the severity of the aminoacid change. An amino acid that is not presentat the substitution site in the multiple align-ment can still be predicted to be toleratedif there are amino acids with similar physio-chemical properties present in the alignment.For example, if a protein sequence alignment

0.25 0.50 0.75 1.00 1.25

a

Conserved sitesVariable sitesRel

ativ

e m

uta

tio

n p

rob

abili

ties

Surface sitesBuried sites

b

Solvent accessibility of the mutation site (%)

Non-synonymous benign Synonymous benign Disease

Rel

ativ

e m

uta

tio

n p

rob

abili

ties

0

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.40

0.45

0.50

0.05

0.10

0.15

0.20

0.25

0.30

0.35

0.40

0.45

Conservation of the mutation site(relative entropy)

60-10030-6010-300-10

Figure 1(a) The probability that a mutation will cause a disease increasesmonotonically with an increase in the degree of site conservation.Nonsynonymous SNPs found in normal individuals are assumed to bebenign and their probabilities show the opposite trend where theiroccurrence decreases with the degree of site conservation. The benignsynonymous SNPs do not change amino acids and should be predominantlyneutral. As a result, their probability is uniform across sites, regardless ofwhether or not the site is conserved. Figure from Reference 75; licenseeBioMed Central Ltd. (This is an Open Access Article: Verbatim copyingand redistribution of this article are permitted in all media for any purpose.)(b) The solvent accessibility of an amino acid residue in a protein reflects thedegree of the residue’s exposure to the surrounding solvent in the proteinstructure. The relative probability of disease-causing mutations is highest inthe protein interior and lowest on the protein surface. The benign SNPsshow the reverse trend, as their relative probability is highest on the surfaceand lowest in the protein interior. Figure from Reference 75; licenseeBioMed Central Ltd. (This is an Open Access Article: Verbatim copyingand redistribution of this article are permitted in all media for any purpose.)

www.annualreviews.org • Amino Acid Substitutions on Protein Function 63

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

Table 1 Amino acid substitution (AAS) prediction methods available on the Internet∗

Method and Web site Interface Performance AlgorithmSIFThttp://blocks.fhcrc.org/sift/SIFT.html (45–47)

Input: Protein sequence and AAS,protein sequence alignment and AAS,dbSNP id, or protein id

Output: Score ranges from 0 to 1,where 0 is damaging and 1 is neutral

FN error: 31%FP error: 20%dbSNP: 25% predicted

to be damagingCoverage: 60%

Using sequencehomology, scores arecalculated usingposition-specific scoringmatrices with Dirichletpriors

PolyPhenhttp://www.bork.embl-heidelberg.de/PolyPhen(64, 65)

Input: Protein sequence and AAS,dbSNP id, HGVbASE id, or protein id

Output: Score ranges from 0 to apositive number, where 0 is neutral,and a high positive number isdamaging

FN error: 31%FP error: 9%dbSNP: 32% predicted

damagingCoverage: 81%

Uses sequenceconservation, structureto model position ofamino acid substitution,and SWISS-PROTannotation

SNPs3Dhttp://www.snps3d.org/(82, 83)

Input: dbSNP id, protein id, literaturesearch, or gene ontology

Output: Scores from structure-basedSVM and sequence-based SVMreported separately. Score <0 isdamaging. Mutation on proteinstructure can be visualized

Structure-based SVMFN error: 26%FP error: 15%Coverage: 14%Sequence-based SVMFN error: 20%FP error: 10%Coverage: 71%Predicted damaging in

dbSNP: 25%

Structure-based supportvector machine uses 15structural factors

Sequence-conservationsupport vector machineuses five features thatcapture sequenceconservation

PANTHER PSEChttps://panther.appliedbiosystems.com/methods/csnpScoreForm.jsp (73)

Input: Protein sequence and AASOutput: A negative score is damaging,

zero is neutral, and positive isgain-of-function

FN error: 59%FP error: N/ACoverage: 40%dbSNP: 9% predicted

damaging

Uses sequence homology;scores are calculatedusing PANTHERHidden Markov Modelfamilies

PMUThttp://mmb2.pcb.ub.es:8080/PMut/ (16–18)

Input: Protein id, protein sequence, ormultiple sequence alignment

Output: Score ranges from 0 to 1,where 0 is neutral and high scores arepredicted to be damaging. Mutationon protein structure is shown

FN error: 21%FP error: 17%When structure is

included:FN error: 12%FP error: 10%

Prediction provided byone of two neuralnetworks. Neuralnetwork uses internaldatabases, secondarystructure prediction, andsequence conservation

TopoSNPhttp://gila.bioengr.uic.edu/snp/toposnp (64, 65)

Input: Protein id or protein sequenceOutput: Can view position of mutation.

Location of substitution on protein(surface, internal, or pocket) andconservation reported separately

Results are stored so an input proteinsequence not in the database will notbe processed

FN error: 12%FP error: N/AdbSNP: 68% predicted

to be damaging

Classifies substitution asburied, on the surface, orin a pocket of theprotein’s structure. Alsoprovides conservationscore based on Pfamprotein alignments

∗False negative (FN) error rate is the percentage of substitutions predicted to be functionally neutral on a set of AASs that are known to affectprotein function. These substitutions come from a mutagenesis set or those suspected to be involved in disease. False positive (FP) error rate is thepercentage of substitutions predicted to be damaging on substitutions known to be functionally neutral. For some methods, error rates were notreported and are marked as N/A in the table. The coverage shown in this table is for dbSNP (60) and not for mutations in disease genes. Diseasegenes tend to have higher coverage because they are well studied. Coverage depends on databases, so methods published later tend to have morecoverage than those published earlier. For a more complete list of AAS prediction methods, see Supplemental Table 1. (Follow the SupplementalMaterial link from the Annual Reviews home page at http://www.annualreviews.org.)

64 Ng · Henikoff

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

Input protein sequence andamino acid substitution

Apply scoring rules for prediction

Structure Sequence Annotation

Search a protein structure database.

Model input sequence on to structure of top hit

Assess features at the site of the amino acid substitution, such as solvent accessibility,

B-factor, etc.

Search a protein sequence database and

obtain homologous sequences

Score amino acid substitution according to amino acids appearing in

the multiple sequence alignment, taking into

account conservation and the physiochemical

properties of amino acids present

Take annotation from SWISS-PROT database

or use prediction programs, such as those that predict

transmembrane regions or secondary structure, to annotate the input

sequence

Output AAS prediction

Figure 2Flowchart for aminoacid substitution(AAS) prediction.Input typicallyconsists of theprotein sequence andAASs. The methodcan use sequenceand/or structuralfeatures forprediction. Somemethods also useannotation to aid inprediction.

shows tyrosines and tryptophans at a particu-lar site, one would expect that the other aro-matic amino acid, phenylalanine, would alsobe tolerated at that site. The probability ofobserving a particular AAS at a site can be es-timated from an appropriate model of proteinconservation, and substitutions that are un-likely to be observed at a site are expected toreduce protein stability or function.

Structure-based AAS prediction methods(9, 27, 65, 69, 72) take an input sequence andfind the best match against a protein structuredatabase. Because most structure-based AASprediction methods use general structural fea-tures surrounding the site of substitution anddo not require detailed information at the

atomic level, they can model the substitutiononto the structure of a homologous proteinrather than require the exact structure of theinput sequence. AAS prediction methods thenexamine the position of the AAS and can takeinto account several structure factors suchas solvent accessibility, carbon-beta density,crystallographic B-factor, and the differencein free energy between the new and the oldamino acid. Based on these structural features,structure-based AAS prediction methods fol-low rules to arrive at a prediction.

AAS prediction methods can also incor-porate annotations to refine prediction. TheSwiss-Prot database annotates the positions ofa protein that are located in the active site,

www.annualreviews.org • Amino Acid Substitutions on Protein Function 65

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

are involved in ligand binding, are part ofa disulfide bridge, or are involved in otherprotein-protein interactions (1). AAS predic-tion methods can use this information to guideprediction (17, 18, 68, 69, 77). For example,if the position of the AAS is annotated asinvolved in ligand binding, then the AAS ispredicted to affect the protein. Also, one canuse sequence-based predictions of secondarystructure and solvent accessibility and incor-porate this annotation into the scoring scheme(18, 31).

Direct comparisons between AAS predic-tion methods are difficult because they weretrained and tested on different data sets usingdifferent versions of sequence and structuraldatabases as resources. The performance of anAAS prediction method depends on the datasets the method is tested on. AAS predictionmethods are typically tested on two types ofdata sets: a nonneutral set, which contains sub-stitutions assumed to affect protein function,and a neutral set, which contains substitutionsassumed to have no effect. An AAS predictionmethod should predict the substitutions inthe nonneutral set to be damaging to proteinfunction. The percentage of non-neutral sub-stitutions incorrectly predicted to be toleratedis an approximation of the false negative er-ror rate. The AAS prediction method shouldalso predict the majority of the substitutionsin a neutral set as having no effect on proteinfunction. The percentage of neutral substitu-tions incorrectly predicted to affect proteinfunction approximates the false positive errorrate. The best AAS prediction methods mini-mize both false negative and false positive er-ror rates.

Popular nonneutral sets include data fromlaboratory mutagenesis experiments (36, 49,54, 80), human disease proteins where manymutations have been characterized (3, 21,22, 25, 32, 50, 55, 70), and human diseasedatabases such as OMIM, HGMD, or Swiss-Prot (1, 24, 62), which contain AASs that havebeen found in patients. Currently, the muta-genesis data sets represent only a few pro-teins, so caution should be used in extrapo-

lating results. For the data sets that containmutations found in disease genes, it is as-sumed that the AAS found in the patient isthe causative variant. However, the databaseentry is not necessarily the etiological variant.Instead, it could be in linkage disequilibriumwith the causative variant. Alternatively, thecausative mutation could be in an unscreenedgene.

Popular neutral sets include substitutionsthat cause no phenotypic effect in mutagenesisexperiments (36, 49, 54, 80) and nonsynony-mous single-nucleotide mutations that havebeen fixed during divergence between humanand a closely related species. The false posi-tive error rate based on the first set tends to behigher than the second set because mutationsdeemed neutral in laboratory experiments arethose that do not give a detectable pheno-type. However, such mutations could have toosmall an impact on protein function or re-quire alternative environmental conditions toreveal phenotypic effects (46). In the secondset, substitutions between human and anotherspecies have undergone millions of years ofevolutionary selective pressure, whereas thosewith negligible selection coefficients and verysmall effects on protein function have beeneliminated.

There are many AAS prediction meth-ods available and it is beyond the scope ofthis review to critique and compare everymethod. We have tried to summarize the ma-jor points of each method (SupplementalTable 1). It is encouraging that progress isbeing made in that overall prediction per-formance has improved over early meth-ods, and many methods are now availableto the research community as Web servers(Table 1).

Caveats for Using Structure inPrediction

AAS methods that use only protein struc-ture provide fewer predictions than meth-ods that use sequence because there are farfewer protein structures than sequences for

66 Ng · Henikoff

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

which homology can be found. The percent-age of amino acid substitutions that can bepredicted by an AAS method is defined as themethod’s coverage. Coverage for AAS meth-ods that rely on protein structure only is ap-proximately 14% (83), whereas coverage forAAS methods that use sequence can be ashigh as 81% (53). Notably, most methodsnow have a sequence-based score and anal-ysis of structure has become an option of-fered by the methods. Sequence-based cov-erage should continue to improve, becauselarge-scale sequencing projects are deposit-ing predicted protein sequences into publicdatabases at an increasing rate, whereas de-termination of three-dimensional (3D) struc-tures remains a challenging endeavor.

Even when a structure is available, predic-tion based solely on protein structure can bemisleading because the protein’s structure isoften determined in the isolated context ofa crystal and cannot take into considerationsupramolecular interactions. Structure-basedAAS prediction methods tend to predictpositions on the surface of the protein asneutral. Although substitutions at sites buriedin protein structure are more likely to bedamaging than surface residues, substitutionsat sites that appear to be solvent-accessible inthe crystal structure may also be importantfor function. These sites might be involvedin intermolecular interactions with proteinsthat are absent from the 3D structure of thesingle protein. For example, the β-globinE6V substitution causes sickle-cell anemia.The substitution occurs on the surface of theprotein and leads to formation of hemoglobinaggregates that underlie the sickling phe-notype. E6V is incorrectly predicted to bebenign by a structure-/sequence-based AASprediction method, Polymorphism Pheno-typing (PolyPhen) (71), whereas it is correctlypredicted by a sequence homology-basedAAS prediction method, Sorting IntolerantFrom Tolerant (SIFT). One possible wayto compensate for such misprediction is toidentify surface pockets or depressed regionsin the protein’s structure, which can be

PolymorphismPhenotyping(PolyPhen): apopular structure-/sequence-basedamino acidsubstitutionprediction methodavailable on theInternet(http://www.bork.embl-heidelberg.de/PolyPhen/)

Sorting IntolerantFrom Tolerant(SIFT): a popularsequence-basedamino acidsubstitutionprediction methodavailable on theInternet(http://blocks.fhcrc.org/sift/SIFT.html)

inferred to be potential functional bindingregions, as done in the AAS predictionmethod topoSNP (64, 65).

Some AAS methods use structural andfunctional annotation from the Swiss-Protdatabase in addition to structure and sequencemodeling (17, 18, 68, 69, 77). The functionalannotation is used to identify the residues thatare part of a binding site, active site, or disul-fide bond. It is presumed that changes at thesetypes of sites would have major effects onprotein function. Hence, substitutions occur-ring at these positions are predicted to affectfunction. One would think that use of suchannotation would improve prediction, but arecent study shows that use of Swiss-Protfunctional annotation decreases the overallprediction accuracy (83). Although use ofSwiss-Prot annotation reduces the false neg-ative rate by 1.6%, it increases the false posi-tive rate by 2.1%. Therefore, caution is war-ranted when interpreting results, based solelyon functional annotation, in which an AAS ispredicted to be damaging.

When the structure of the query pro-tein is not available, AAS prediction meth-ods model the query protein’s structure basedon a homolog. If the homolog is too distantlyrelated, prediction accuracy can suffer. Chas-man & Adams (9) obtained their best predic-tion results when the query protein was at least60% identical to its homolog. Yue & Moult(82) found that prediction is best when thesequence identity between the query proteinand the homologous protein is greater than40%. The false positive rate for their AASprediction method increased by 11% whenstructures with less than 40% sequence iden-tity were used (82).

Caveats for Using Sequencein Prediction

The first step for all sequence-based AASprediction methods is to choose homologoussequences, whether manually or automati-cally. Because the amino acids appearing inthe aligned sequences form the basis of the

www.annualreviews.org • Amino Acid Substitutions on Protein Function 67

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

Figure 3For some amino acid substitution (AAS) prediction methods, researcherscan submit a multiple sequence alignment for their protein of interest;using orthologs in the multiple sequence alignment instead of paralogsgives better performance. Blue points represent prediction accuracy basedon an alignment of the input protein and five sequences randomly chosenfrom a group of orthologs and paralogs. The red point is the predictionaccuracy based on a multiple sequence alignment of the input proteinsequence and five orthologs. Figure from Reference 67. Published withpermission from Genome Research, Volume 15, copyright by Cold SpringHarbor Laboratory Press.

scoring and the prediction, the sequences andtheir alignment are extremely important, andusers can take an active role in this predictionstep, which determines the quality of theirpredictions. The optimal set of sequences isdistantly related orthologs. Using orthologsinstead of paralogs can improve performanceby 8%, even when there are few sequences(67) (Figure 3).

Users should be cautious even with pro-teins that are judged to be orthologous basedon phylogeny. Orthologous genes in differ-ent species are derived from a common an-cestor, but they may not necessarily have thesame function. If function has changed, thenamino acids that are important for the func-tion of one protein may not necessarily beimportant for the function of the ortholog,and hence may have changed without any se-lection pressure. For example, 2% of disease-causing mutations in human genes are identi-cal to the sequences of their respective mouseorthologs, suggesting that even though thesepositions have huge phenotypic effects onhuman health, they have different roles orare no longer important in mice. If the or-thologs in alignment have slightly different

functions, then the positions that differentiatefunction among orthologs may be incorrectlypredicted.

Although functional differences betweenorthologs can result in misprediction by AASmethods, prior knowledge that orthologs havedifferent functions can be used to identifywhich amino acid(s) caused the functionalchange. For example, AlcR is a transcrip-tional regulator whose activation depends onthe presence of an inducer in one Bordetellaspecies but not in another species (5). Al-though the AlcR genes are clearly orthologsin the genus, they behave differently in dif-ferent species. To find the amino acid changeunderlying this behavioral difference, the au-thors of this study used the AAS predictionmethod, SIFT, which predicted that a seem-ingly conservative change (S103T) should notbe tolerated. The authors deduced that thisresidue was likely responsible for inducer de-pendence, which they subsequently provedby showing that S103T eliminated inducerdependence.

Most AAS prediction methods do not takeDNA sequence into account. As a result,they can miss changes that alter splice sitesor changes in regions under positive selec-tion. By ignoring DNA sequence differences,these methods might incorrectly predict sub-stitutions at sites under positive selection tobe neutral because of the many amino acidspresent at that position. However, the AASprediction method of Fleming et al. (19) usesthe DNA sequences of homologous genes inaddition to protein sequences to find sites un-der positive selection. The nonsynonymous-to-synonymous substitution rate ratio is al-lowed to vary at different amino acid sites andpositions with high ratios are posited to be un-der positive selection. It is assumed that aminoacid changes at these positions affect pro-tein function. Thus, knowledge of positivelyselected sites can lower the false negativeerror.

It is often assumed that the proteinsequence derived from the reference se-quence genome is functional and the “disease”

68 Ng · Henikoff

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

mutation reduces protein function. But insome instances, the amino acid correspond-ing to the more common allele reduces pro-tein function and a mutation causes gain-of-function. Thomas et al. (73) noticed this phe-nomenon when they applied their AAS pre-diction method PANTHER PSEC to mu-tations likely to cause disease. They foundthat the reference amino acid was predictedto damage protein function whereas the dis-ease mutation was predicted to be function-ally neutral for a small percentage of thecases (0.1%). This type of prediction re-veals two things. First, it suggests that thehuman protein sequence has reduced func-tion compared with orthologous proteins inother organisms. Second, it predicts thatthe “disease” mutation returns protein func-tion to a level similar to that in ortholo-gous proteins. Thus the mutation can bethought to have a gain-of-function effectin humans. Characterizing this mutation inmodel organisms may be inappropriate be-cause the common human allele has reducedprotein activity compared with that in modelorganisms.

Both structure and sequence are useful forprediction. Having a structure appears to pro-vide prediction performance that is equiva-lent to having four homologous sequences,with information from structure and sequencecomplementing each other (59). Finally, it isimportant to understand that these are pre-dictions only. They are meant to guide futureexperiments and not to be used directly in aclinical setting (71).

USEFULNESS OF AMINO ACIDSUBSTITUTION PREDICTIONMETHODS IN HUMANVARIATION AND DISEASE

Nonsynonymous SNPs

There are an estimated 67,000–200,000 com-mon nsSNPs in the human population (8, 23,35), with each person expected to be heterozy-gous for 24,000–40,000 nsSNPs (8). Given

this large number of nsSNPs and the obser-vation that single AASs can have a large effecton an organism or species, it is of great in-terest to identify which nsSNPs affect proteinfunction and, consequently, may affect humanhealth.

nsSNPs in the human population are ob-served less frequently than expected from theoverall mutation rate, which is evidence thatthey are under strong purifying selection (8,23, 63). Specifically, if a random mutationwere to occur in a coding region, it shouldlead to an amino acid change two thirds of thetime, but nsSNPs comprise only half of theobserved coding SNPs in the human genome(8). Furthermore, nsSNPs that cause a non-conservative amino acid change in the corre-sponding protein (for example, hydrophobicamino acid to a charged amino acid) surviveat approximately half the rate of conservativensSNPs (for example, a hydrophobic aminoacid changed to another hydrophobic aminoacid). These data strongly support the notionthat AASs play an important role in humanhealth. By providing information about whichsubstitutions are selected against, AAS predic-tion methods can help identify which nsSNPsmay be involved in disease.

Putative nsSNPs are catalogued in thedbSNP database maintained by NCBI, whichcurrently contains >50,000 nsSNPs (dbSNPbuild 124) (60). Of these nsSNPs, 25–30% arepredicted to reduce protein function by mostAAS prediction methods (46, 69, 83). SuchnsSNPs have a lower minor allele frequencydistribution than those that are predicted tobe functionally neutral (33–35, 69, 78). Thissuggests that damaging nsSNPs are being ac-tively selected against and confirms that AASprediction methods can successfully identifyputatively damaging nsSNPs that play animportant role in health. In one study, theAAS prediction method SIFT was applied tonsSNPs found in genes implicated in DNArepair, cell cycle arrest, apoptosis, and detox-ification (78). nsSNPs with low minor allelefrequency (<6%) were predicted by SIFT tobe damaging to the protein twice as often as

www.annualreviews.org • Amino Acid Substitutions on Protein Function 69

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

common nsSNPs (>10%). In a second study,less than 2% of common nsSNPs in en-vironmental response genes were predictedto be damaging by AAS prediction meth-ods PolyPhen and SIFT (34, 35). In a thirdstudy, SIFT was applied to nsSNPs foundin membrane transporter genes (34). Of thensSNPs with minor allele frequency <0.01,between 0.01 and 0.10, and >0.1, 40%, 13%,and 5% of nsSNPs were predicted to bedamaging, respectively. The authors foundthat general-purpose AAS scoring matrices,such as BLOSUM62, could not distinguishnsSNPs by minor allele frequency. Thus, ap-plication of AAS prediction methods to largensSNP data sets has confirmed that putativelydamaging nsSNPs are selected against and arelikely to have an impact on their respectiveproteins.

Experimental studies of individual pro-teins have also confirmed the accuracy of AASprediction methods. In one study, Brooks-Wilson et al. (6) studied mutations in E-cadherin that cause hereditary diffuse gastriccancer. They used SIFT to predict that threemissense mutations found in families with dif-fuse gastric cancer would be damaging to E-cadherin function, and all three were con-firmed to be damaging using in vitro assays.In a second study, Zhang et al. (84) examinedthe variants in PEPT1, a protein involved intransporting drugs across the cell membrane.nsSNPs found in the gene for PEPT1 weretested in vitro. The single SNP that reducedtransport capacity was predicted by SIFT toaffect protein function. This polymorphismmay be important in drug delivery for phar-macogenetics. In a third study, SIFT was usedto predict categories of cancer risk. Mutationsin the gene encoding melanocyte stimulatinghormone receptor (MSHR) increase the riskof skin cancer. Kanetsky et al. (30) identifiedrisk variants in the gene for MSHR from ei-ther published literature or using SIFT pre-dictions. The ability to assign an individual toa risk category was found to be similar whenusing either published literature or SIFTprediction.

In summary, AAS prediction methodshave proven useful for identifying damagingnsSNPs involved in human disease. Experi-mentally characterizing an AAS can be expen-sive and time-consuming, and AAS predictionmethods provide a valuable resource to sub-stantially reduce the effort.

Application to Mendelian Disease

AAS prediction methods have succeeded indistinguishing nonsynonymous changes thatcause simple Mendelian diseases from neutralnsSNPs that do not. AAS prediction meth-ods are trained and tested on AASs that wereidentified in disease genes in afflicted indi-viduals from databases such as OMIM (24),HGMD (62), and Swiss-Prot (1). It is assumedthat these substitutions found in patients af-fect protein function and cause disease as aresult. Currently, most of the diseases repre-sented by the genes in these databases segre-gate in a Mendelian manner, which suggeststhat they are caused by single deleterious le-sions. Most AAS prediction methods predictthat 70–90% of the AASs catalogued in thesedisease databases are damaging, whereas only10–20% of variants in neutral data sets arepredicted to be damaging (Table 1). Thisdemonstrates that AAS prediction methodscan distinguish between AASs that causeMendelian disease and neutral AASs. Inthis way, AAS prediction methods can helpnarrow down candidate nsSNPs to iden-tify the causative lesion within a large ge-nomic region implicated in disease by linkagestudies.

Although most simple Mendelian diseasesremain rare because of purifying selection,some become relatively common in popu-lations because of overdominant selection.Overdominant selection occurs when the het-eroyzgote carrier has higher fitness thanboth the mutant and normal homozygotes.The E6V substitution in β-globin mentionedabove is common in certain populations be-cause heterozygous carriers are more resis-tant to malaria than normal homozygotes,

70 Ng · Henikoff

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

whereas individuals homozygous for the rareallele have sickle-cell anemia. In this example,the AAS affects protein function but is main-tained in the population because reduced ac-tivity correlates with an advantageous effect.Another well-known example is overdomi-nance associated with methylenetetrahydro-folate reductase (MTHFR) alleles. Variantsthat reduce MTHFR activity can cause mentalretardation and cardiovascular disease in car-riers. Reduced MTHFR activity is thought tohave been beneficial to an individual’s over-all fitness during recent human evolution,so MTHFR variants damaging to proteinfunction have became common in humanpopulations. Because overdominant nsSNPscan severely affect protein function, they maybe detected by AAS prediction methods, andboth E6V in β-globin and overdominantMTHFR missense alleles are predicted to bedamaging by SIFT (Reference 46 and data notshown).

Application to Complex Diseases

Although AAS prediction methods can iden-tify nsSNPs involved in Mendelian diseases,their usefulness in studying complex diseases,such as hypertension, diabetes, and heart dis-ease, is still being explored. Complex geneticdiseases are those that cannot be mapped tosingle loci, which might indicate an interac-tion between multiple loci or an interactionwith the environment, or both. Two generalmodels have been proposed to explain the na-ture of genetic variation underlying complexdisease (4, 51, 52, 85). The common disease-common variant model predicts that the vari-ants of a particular locus that contribute todisease are few but common in the popula-tion and that complex disease results frominteractions between variants of many differ-ent genes. The common disease-rare variantmodel predicts that there are many etiologicalvariants at a locus, and each variant is presentat a low allele frequency in the human popu-lation. In this section, we discuss how usefulAAS prediction methods might be for detect-

ing variants involved in complex disease ac-cording to the two models.

Common disease-common variant. Ac-cording to the common disease-common vari-ant hypothesis, variants that are common inthe population, with minor allele frequencygreater than 5% or 1%, contribute to disease.If nsSNPs involved in complex disease can becategorized according to the rules of struc-ture and sequence conservation that AAS pre-diction methods have been trained on, thencurrent AAS prediction methods will be use-ful in identifying these common etiologicalvariants. A list of common nsSNPs that havebeen associated with complex disease has beencompiled (28, 37). When the AAS predictionmethod PANTHER PSEC was applied tothis set of disease-causing nsSNPs, the scoredistribution was different from the distribu-tion for Mendelian disease mutations (74).More importantly, the score distribution ofthese disease-associated nsSNPs is indistin-guishable from the distribution for normalhuman variation (Figure 4). The authors con-clude that nsSNPs involved in complex dis-eases caused by common variants do not occurat highly conserved positions, and thus cannotbe detected by current AAS prediction meth-ods. The data set of nsSNPs known to be in-volved in common disease is relatively smallcompared with the set of mutations knownto be involved in Mendelian disease. As morecausative common variants implicated in com-plex disease are discovered, it will be interest-ing to see if future studies yield similar results.If so, new rules will need to be formulated topredict the common polymorphisms involvedin complex disease.

AAS prediction methods can still be usefulin identifying the etiological variants in hap-lotypes. Haplotypes are particular combina-tions of alleles that are observed in a popu-lation. When a haplotype is associated withaffected individuals, the variants belonging tothe haplotype are all candidates for being thecausative mutation. If a haplotype containsa set of missense alleles, one can use AAS

www.annualreviews.org • Amino Acid Substitutions on Protein Function 71

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

Cu

mu

lati

ve f

ract

ion

0.6

0.8

1.0

0.2

Mendelian disease (HGMD)

0.4

0

SubPSEC score

–8–10 –6 –4 –2 0

Neutral model

Normal variation (dbSNP)

Complex disease

Figure 4Amino acid substitution (AAS) prediction methods may not be able toidentify disease variants in the common disease/common variant model.Cumulative distributions of scores from the AAS prediction methodPANTHER subPSEC are shown. A subPSEC score of 0 is predicted to befunctionally neutral and very negative scores are predicted to be damagingto protein function. Distributions are shown for mutations involved inMendelian disease (red), common variants associated with complex disease(blue squares), neutral and “normal” human variation (yellow and green,respectively). Figure from Reference 74, Volumes 90–102. Copyright c©1993–2005. Natl. Acad. Sci. USA. All rights reserved.

prediction methods to prioritize whichnsSNP may be the etiological variant (84).

Common disease-rare variant. Accordingto the common disease/rare variant hypothe-sis, low-frequency variants with strong effectson a locus can contribute to disease (52). Thatis, some “complex” diseases could actuallybe simple Mendelian diseases, but are causedby different allelic variants in different indi-viduals. Identifying these rare causative vari-ants requires sequencing genes in many indi-viduals. This process would uncover a largenumber of missense variants but only a fewmay contribute to disease. Below we discusstwo studies that used AAS prediction methods

to distinguish the causative missense variantsfrom neutral variants.

A study on the plasma levels of HDLcholesterol (HDL-C) demonstrates the use ofAAS prediction methods to detect rare dele-terious alleles that contribute to common dis-ease (10). Low levels of HDL-C are a ma-jor risk factor for coronary atherosclerosis.Three candidate genes involved in Mendelianforms of low HDL-C levels were sequencedin apparently normal individuals and a largernumber of rare nsSNPs were found in peo-ple with low levels of HDL-C compared withthose with high levels of HDL-C. The au-thors then applied the AAS prediction methodPolyPhen on the nsSNPs. The fraction ofnsSNPs predicted to be damaging in individ-uals with low levels of HDL-C is highly sig-nificant for two of the genes: p = 9 × 10−13 forABCA1 and p = 0.0003 for LCAT (Table 2).In the third gene, APOA1, only one nsSNPwas detected, and although it was predicted tobe damaging, this is not statistically significant(p = 0.09). If technological advances permit usto inexpensively sequence the coding regionsof many individuals, one could possibly iden-tify the genes and variants involved in diseaseby the increased number of nsSNPs and thesignificantly high proportion of nsSNPs pre-dicted to be damaging in the affected pop-ulation. Even with a Bonferroni correctiontaking into account the approximately 30,000genes in the human genome, the result ofABCA1 is still significant at p = 10−8. The re-sult of LCAT is no longer significant at thegenome-wide level. This could be compen-sated for by detecting more variants, which re-quires sequencing, or improved performanceof AAS prediction methods to reduce the falsepositive error.

Another study on the role of mitochon-drial mutations in Parkinson’s disease againshows that applying AAS prediction meth-ods to rare variants can identify genes of in-terest (61). Seven genes in the mitochondrialgenomes from normal individuals and patientswith Parkinson’s disease were sequenced.The authors’ AAS prediction method could

72 Ng · Henikoff

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

Table 2 A significantly high fraction of rare variants in individuals at risk for coronary atherosclerosis arepredicted to be damaging by the amino acid substitution (AAS) prediction method PolyPhen (10)∗

Gene Number of nsSNPs detectedNumber predicted to be possibly or

probably damaging by PolyPhenCumlative p-value, assuming

0.09 false positive rateLow levels of HDL-C, at risk for coronary atherosclerosisABCA1 25 17 9∗10−13

APOA1 1 1 0.09LCAT 5 4 0.0003High levels of HDL-CABCA1 4 2 0.04APOA1 0 0 NALCAT 1 0 0.91

∗The list of nsSNPs and their predictions from Cohen et al. (10) was compiled. For clarity, the two populations studied (256 Dallas Countyresidents and 263 Canadians) were combined and duplicated SNPs were counted only once. The p-value was calculated from a binomialdistribution, with the probability that a nonsynonymous (nsSNP) would be predicted to be damaging set to 0.09, Polyphen’s false positive rateon neutral substitutions (69).

distinguish patients from controls with 100%accuracy and two genes were identified tohave more putative damaging missense vari-ants compared with controls. In this case, allgenes had a similar proportion of missensevariants. Only by applying their AAS predic-tion method were the authors able to identifytwo genes that may be involved in Parkinson’sdisease. Because mutations in mitochondrialDNA become more abundant with increasingage, the role of rare missense variants may playan important role in late-onset diseases suchas Parkinson’s disease and Alzheimer’s disease.Because there already exists technology forresequencing mitochondrial genomes (38),one may be able to study the common dis-ease/rare variant model in diseases suspectedto involve the mitochondria.

If common diseases cannot be explained bycommon variants, then there will be a strongincentive to sequence coding regions to testthe common disease/rare variant hypothesis.Preliminary studies show that AAS predictionmethods play an important role in predict-ing which rare missense variants are delete-rious. Moreover, as technologies for discov-ering rare variants advance, AAS predictionmethods will become increasingly importantfor identifying disease genes in genome-widestudies.

OTHER APPLICATIONS OF AASPREDICTION METHODS

Interspecies Comparisons

AAS prediction methods are often applied topolymorphisms within a species, but they canalso be used on the fixed substitutions acrossspecies. Because AAS prediction methods pre-dict which amino acids are damaging to pro-tein function, they can identify which genesare under relaxed or neutral selection in a par-ticular species.

Interspecies comparisons show that do-mesticated species have a higher numberof putative damaging SNPs than the wildspecies. The Chicken Polymorphism Con-sortium sequenced three domestic breeds ofchicken and compared these data to thoseobtained from a wild line (79). For the non-synonymous substitutions fixed between thedomestic and wild lines, the alleles foundin domestic breeds were more than twice aslikely to be predicted as damaging by theAAS prediction method SIFT compared withthose in the wild line. This result suggests thatdomestic breeds are under relaxed selection,perhaps because they live in a less harsh en-vironment, which has allowed for damag-ing substitutions to become fixed in the do-mesticated species. In another interspecies

www.annualreviews.org • Amino Acid Substitutions on Protein Function 73

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

comparison, the missense substitutions be-tween two species of domesticated rice werestudied and a higher density of predicted dam-aging substitutions occurred in regions un-der positive selection (G.S.K. Wong, per-sonal communication). Because gene loss maysometimes be adaptive (48), the result sug-gested reduced function of some genes in ricecould be an adaptive response to domestica-tion.

These studies suggest that AAS pre-diction methods can provide insights intophenotypic differences observed betweenspecies. Because of false positive error, itwould be difficult to study this on an in-dividual gene basis. However, by groupinggenes within protein families or pathways,it may be possible to identify pathways thathave undergone relaxed selection in certainspecies.

Large-Scale Mutagenesis

AAS prediction methods can be applied tolarge-scale, reverse genetics projects, in whichmutations are introduced randomly in thegenome of an experimental organism, and torandom mutagenesis projects, when a geneof interest is saturated with mutations (4, 22,26, 36, 49, 51, 52, 54, 80, 85). When thereare many missense mutations in the gene(s)of interest, assaying all missense mutationscan be expensive and time-consuming. AASprediction methods can be used to priori-tize missense mutations that are most likelyto affect protein function and alter pheno-type. TILLING is one example of a large-scale reverse genetic strategy that uses an AASprediction method (SIFT) to prioritize whichmissense mutations are likely to reveal a phe-notype (26). TILLING has been applied toa wide range of organisms: Arabidopsis, ze-brafish, maize, Drosophila, and Lotus. BecauseAAS prediction methods are automated andgeneral, they can be widely applied to helpresearchers prioritize which AAS to charac-terize in genes of interest.

CONCLUSIONS

The presence of many AAS prediction meth-ods and their broad use underscores their im-portance. Prediction accuracy has graduallyimproved, but few head-to-head comparisonsexist (29, 35, 71, 81). Moreover, as the numberof servers providing AAS prediction increases,it will become increasingly difficult for inves-tigators to interpret the predictions.

These problems are similar to those facedby the protein structure prediction commu-nity 10 years ago. Critical Assessment ofTechniques for Protein Structure Prediction(CASP) (43) was motivated by the need tofairly assess structure prediction programs inorder to advance structure prediction meth-ods; we propose a similar solution for AASprediction methods.

Every two years, CASP releases thesequences of proteins for which structure isknown, but not yet available to the public.Researchers return predictions which arethen compared with known structure, andthe efficacy of each prediction method isassessed. Because investigators in the CASPprogram work on the same proteins during aspecified period of time, the CASP programoffers a valuable way to summarize whichmethods improve structure prediction andthis ultimately advances the field. As auto-mated structure prediction methods havebecome increasingly successful, there is now aserver competition that runs continuously asstructures are released, with results assessedautomatically (57).

A similar server competition could beimplemented for AAS prediction. Once adisease-associated variant is mapped, the re-sponsible nsSNP and benign control nsSNPscould be sent to participating servers andthe results immediately evaluated. Obtain-ing suitable data sets is difficult because eachdata set has its own advantages and disadvan-tages. For example, some of the AASs ob-tained from patients in the OMIM databasemay not necessarily be the causative vari-ant, which will artifactually inflate the false

74 Ng · Henikoff

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

negative error. But if one method has a higheraccuracy on both neutral and damaging datasets, then, despite the inaccuracies of the testsets, the prediction method can be assessedto be better. Moreover, as genome-wide mu-tagenesis projects generate phenotypic data(26), more accurate data can be obtained andtested.

In addition to accuracy, AAS predictionmethods can be evaluated based on coverage.Coverage depends on the source and versionof structure and sequence databases used. Thenumber of sequences and structures increaseevery year so coverage is expected to increasesimply because more information is available.

Most methods today offer confidencescores or an estimate of the degree to whicha substitution is damaging. One could alsotest how well an AAS prediction method’sscore correlates with phenotypic severity; theMAPP method and the method by Mooneyet al. (42, 67) have already demonstratedpromising results as their scores correlate withphenotypic severity. AAS prediction meth-ods could also be assessed according to theirability to predict how the substitution affectsthe protein: whether it increases or decreases

function, affects structure, or which domainor function is affected for multifunctionalenzymes.

Finally, the main benefit of such a com-petition is that investigators would have ac-cess to unified prediction. Currently, CASPmanages a metaserver that collects predictionsfrom other servers to obtain a consensus struc-ture. A metaserver for AAS prediction coulddo the same, so that researchers studying pro-tein variation could easily obtain a consensusprediction, with the best accuracy available.

The progress that has been made over thepast few years with AAS prediction meth-ods is promising: methodology has improvedand applications have proliferated. AAS pre-diction methods have proven successful forMendelian traits and may eventually play animportant role in identifying complex diseasevariants. Because AASs are a source of funda-mental changes between and within species,AAS prediction methods will continue to be ofmajor importance in the future. These meth-ods, in conjunction with those that predictgene regulatory and splicing variants (12, 13,15, 20, 66, 76), will guide us to a better under-standing of functional diversity in genomes.

SUMMARY POINTS

1. Approximately half of the known disease-causing mutations result from amino acidsubstitutions (AASs).

2. AAS prediction methods can successfully distinguish between AASs that causeMendelian disease and functionally neutral AASs.

3. Automated AAS prediction methods have been applied on a genome-wide scale.

4. Nonsynonymous SNPs (nsSNPs) predicted to be damaging tend to have low minorallele frequencies, which indicates purifying selection and validates AAS predictionmethods.

5. AAS prediction methods have been applied to individual genes and predictions havebeen experimentally confirmed.

6. Currently, AAS prediction methods cannot distinguish between common variantsinvolved in common disease and normal variation.

7. AAS prediction methods can identify rare variants involved in common disease. Whenputative damaging rare variants are classified by gene, disease genes can be identified.

www.annualreviews.org • Amino Acid Substitutions on Protein Function 75

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

8. AAS prediction methods can be used on a wide range of problems, such as analysis ofinterspecies differences and large-scale mutagenesis projects.

FUTURE DIRECTIONS

1. Because there are many new and improved amino acid substitution (AAS) predic-tion methods with complementary strengths, better accuracy should be possible bycombining prediction methods. A competition similar to the Critical Assessment ofTechniques for Protein Structure Prediction (CASP) would help to evaluate progressand identify strengths and weaknesses of prediction algorithms.

2. While the genetics community is actively engaged in discovering genetic variantsinvolved in complex disease, AAS prediction methods will need to be continuallyreassessed and possibly redesigned for optimal prediction of complex disease-causingvariants.

3. Although AASs currently account for a large proportion of the genetic variation con-tributing to human disease, new bioinformatics methods that evaluate gene regulatoryand splicing variants will broaden our understanding of functional variation.

ACKNOWLEDGMENTS

We thank Jorja Henikoff and Sarah Shaw Murray for manuscript comments.

LITERATURE CITED

1. Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, et al. 2005. The UniversalProtein Resource (UniProt). Nucleic Acids Res. 33:D154–59

2. Deleted in proof3. Beutler E, Vulliamy TJ. 2002. Hematologically important mutations: glucose-6-phosphate

dehydrogenase. Blood Cells Mol. Dis. 28:93–1034. Botstein D, Risch N. 2003. Discovering genotypes underlying human phenotypes: past

successes for Mendelian disease, future approaches for complex disease. Nat. Genet.33(Suppl.):228–37

5. Brickman TJ, Armstrong SK. 2002. Bordetella interspecies allelic variation in AlcR inducerrequirements: identification of a critical determinant of AlcR inducer responsiveness andconstruction of an alcR(Con) mutant allele. J. Bacteriol. 184:1530–39

6. Brooks-Wilson AR, Kaurah P, Suriano G, Leach S, Senz J, et al. 2004. Germline E-cadherin mutations in hereditary diffuse gastric cancer: assessment of 42 new families andreview of genetic screening criteria. J. Med. Genet. 41:508–17

7. Cai Z, Tsung EF, Marinescu VD, Ramoni MF, Riva A, Kohane IS. 2004. Bayesian approachto discovering pathogenic SNPs in conserved protein domains. Hum. Mutat. 24:178–84

8. Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K, et al. 1999. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22:231–38

9. One of the firstpapers to establishstructure-/sequence-basedrules to predictamino substitutionsthat affect proteinfunction.

9. Chasman D, Adams RM. 2001. Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment ofamino acid variation. J. Mol. Biol. 307:683–706

76 Ng · Henikoff

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

10. An AASprediction methodpredicts asignificant numberof the rarenonsynonymousvariants found inindividuals at riskfor coronaryatherosclerosis tobe damaging. Thispaper shows thatAAS predictionmethods candistinguish the rarevariants likely tocause commondisease fromneutral variants.

10. Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH. 2004.Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science

305:869–7211. Collins FS, Guyer MS, Charkravarti A. 1997. Variations on a theme: cataloging human

DNA sequence variation. Science 278:1580–8112. Conde L, Vaquerizas JM, Ferrer-Costa C, de la Cruz X, Orozco M, Dopazo J. 2005.

PupasView: a visual tool for selecting suitable SNPs, with putative pathological effect ingenes, for genotyping purposes. Nucleic Acids Res. 33:W501–5

13. Conde L, Vaquerizas JM, Santoyo J, Al-Shahrour F, Ruiz-Llorente S, et al. 2004. PupaSNPFinder: a web tool for finding SNPs with putative effect at transcriptional level. NucleicAcids Res. 32:W242–48

14. del Sol Mesa A, Pazos F, Valencia A. 2003. Automatic methods for predicting functionallyimportant residues. J. Mol. Biol. 326:1289–302

15. Fairbrother WG, Holste D, Burge CB, Sharp PA. 2004. Single nucleotide polymorphism-based validation of exonic splicing enhancers. PLoS Biol. 2:E268

16. Ferrer-Costa C, Gelpi JL, Zamakola L, Parraga I, de la Cruz X, Orozco M. 2005. PMUT:a web-based tool for the annotation of pathological mutations on proteins. Bioinformatics21:3176–78

17. Ferrer-Costa C, Orozco M, de la Cruz X. 2002. Characterization of disease-associatedsingle amino acid polymorphisms in terms of sequence and structure properties. J. Mol.Biol. 315:771–86

18. Ferrer-Costa C, Orozco M, de la Cruz X. 2004. Sequence-based prediction of pathologicalmutations. Proteins 57:811–19

19. Fleming MA, Potter JD, Ramirez CJ, Ostrander GK, Ostrander EA. 2003. Understandingmissense mutations in the BRCA1 gene: an evolutionary approach. Proc. Natl. Acad. Sci.USA 100:1151–56

20. Freimuth RR, Stormo GD, McLeod HL. 2005. PolyMAPr: programs for polymorphismdatabase mining, annotation, and functional analysis. Hum. Mutat. 25:110–17

21. Gottlieb B, Lehvaslaiho H, Beitel LK, Lumbroso R, Pinsky L, Trifiro M. 1998. TheAndrogen Receptor Gene Mutations Database. Nucleic Acids Res. 26:234–38

22. Hainaut P, Hernandez T, Robinson A, Rodriguez-Tome P, Flores T, et al. 1998. IARCDatabase of p53 gene mutations in human tumors and cell lines: updated compilation,revised formats and new visualisation tools. Nucleic Acids Res. 26:205–13

23. Halushka MK, Fan JB, Bentley K, Hsie L, Shen N, et al. 1999. Patterns of single-nucleotidepolymorphisms in candidate genes for blood-pressure homeostasis. Nat. Genet. 22:239–47

24. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. 2005. Online MendelianInheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders.Nucleic Acids Res. 33:D514–17

25. Hardison RC, Chui DH, Giardine B, Riemer C, Patrinos GP, et al. 2002. HbVar: arelational database of human hemoglobin variants and thalassemia mutations at the globingene server. Hum. Mutat. 19:225–33

26. Henikoff S, Comai L. 2003. Single-nucleotide mutations for plant functional genomics.Annu. Rev. Plant Biol. 54:375–401

27. Herrgard S, Cammer SA, Hoffman BT, Knutson S, Gallina M, et al. 2003. Prediction ofdeleterious functional effects of amino acid mutations using a library of structure-basedfunction descriptors. Proteins 53:806–16

28. Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K. 2002. A comprehensive review ofgenetic association studies. Genet. Med. 4:45–61

www.annualreviews.org • Amino Acid Substitutions on Protein Function 77

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

29. Johnson MM, Houck J, Chen C. 2005. Screening for deleterious nonsynonymous single-nucleotide polymorphisms in genes involved in steroid hormone metabolism and response.Cancer Epidemiol. Biomarkers Prev. 14:1326–29

30. Kanetsky PA, Ge F, Najarian D, Swoyer J, Panossian S, et al. 2004. Assessment of poly-morphic variants in the melanocortin-1 receptor gene with cutaneous pigmentation usingan evolutionary approach. Cancer Epidemiol. Biomarkers Prev. 13:808–19

31. Krishnan VG, Westhead DR. 2003. A comparative study of machine-learning methods topredict the effects of single nucleotide polymorphisms on protein function. Bioinformatics19:2199–209

32. Kwok CJ, Martin AC, Au SW, Lam VM. 2002. G6PDdb, an integrated database of glucose-6-phosphate dehydrogenase (G6PD) mutations. Hum. Mutat. 19:217–24

33. Lau AY, Chasman DI. 2004. Functional classification of proteins and protein variants. Proc.Natl. Acad. Sci. USA 101:6576–81

34. Leabman MK, Huang CC, DeYoung J, Carlson EJ, Taylor TR, et al. 2003. Natural varia-tion in human membrane transporter genes reveals evolutionary and functional constraints.Proc. Natl. Acad. Sci. 100:5896–901

35. Livingston RJ, von Niederhausern A, Jegga AG, Crawford DC, Carlson CS, et al. 2004.Pattern of sequence variation across 213 environmental response genes. Genome Res.14:1821–31

36. Loeb DD, Swanstrom R, Everitt L, Manchester M, Stamper SE, Hutchison CA 3rd. 1989.Complete mutagenesis of the HIV-1 protease. Nature 340:397–400

37. Lohmueller KE, Pearce CL, Pike M, Lander ES, Hirschhorn JN. 2003. Meta-analysis ofgenetic association studies supports a contribution of common variants to susceptibility tocommon disease. Nat. Genet. 33:177–82

38. Maitra A, Cohen Y, Gillespie SE, Mambo E, Fukushima N, et al. 2004. The Human Mi-toChip: a high-throughput sequencing microarray for mitochondrial mutation detection.Genome Res. 14:812–19

39. An early paperthat analyzed sevendisease genes andrecognized thatdisease mutationsappear at positionsconservedthroughoutevolution,indicating thatsequence-basedprediction waspossible.

39. Miller MP, Kumar S. 2001. Understanding human disease mutations through theuse of interspecific genetic variation. Hum. Mol. Genet. 10:2319–28

40. Mooney S. 2005. Bioinformatics approaches and resources for single nucleotide polymor-phism functional analysis. Brief. Bioinform. 6:44–56

41. Mooney SD, Altman RB. 2003. MutDB: annotating human variation with functionallyrelevant data. Bioinformatics 19:1858–60

42. Mooney SD, Klein TE, Altman RB, Trifiro MA, Gottlieb B. 2003. A functional analysisof disease-associated mutations in the androgen receptor gene. Nucl. Acids Res. 31:e42

43. Moult J. 2005. A decade of CASP: progress, bottlenecks and prognosis in protein structureprediction. Curr. Opin. Struct. Biol. 15:285–89

44. Mouse Genome Sequencing Consortium. 2002. Initial sequencing and comparative anal-ysis of the mouse genome. Nature 420:520–62

45. The first AASprediction methodimplemented on aWeb server.

45. Ng PC, Henikoff S. 2001. Predicting deleterious amino acid substitutions. Genome

Res. 11:863–74

46. Asequence-basedAAS predictionmethod thatsuccessfullypredicteddisease-causingmutations. Becausethis method doesnot require proteinstructure,predictions wereobtained for amajority of thensSNPs in dbSNP.

46. Ng PC, Henikoff S. 2002. Accounting for human polymorphisms predicted to affectprotein function. Genome Res. 12:436–46

47. Ng PC, Henikoff S. 2003. SIFT: predicting amino acid changes that affect protein function.Nucleic Acids Res. 31:3812–14

48. Olson MV. 1999. When less is more: gene loss as an engine of evolutionary change. Am.J. Hum. Genet. 64:18–23

49. Pace HC, Kercher MA, Lu P, Markiewicz P, Miller JH, et al. 1997. Lac repressor geneticmap in real space. Trends Biochem. Sci. 22:334–39

78 Ng · Henikoff

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

50. Patrinos GP, Giardine B, Riemer C, Miller W, Chui DH, et al. 2004. Improvements in theHbVar database of human hemoglobin variants and thalassemia mutations for populationand sequence variation studies. Nucleic Acids Res. 32:D537–41

51. Ponting CP, Goodstadt L. 2005. Statistical genetics: usual suspects in complex disease.Eur. J. Hum. Genet. 13:269–70

52. Pritchard JK. 2001. Are rare variants responsible for susceptibility to complex diseases?Am. J. Hum. Genet. 69:124–37

53. Ramensky V, Bork P, Sunyaev S. 2002. Human non-synonymous SNPs: server and survey.Nucleic Acids Res. 30:3894–900

54. Rennell D, Bouvier SE, Hardy LW, Poteete AR. 1991. Systematic mutation of bacterio-phage T4 lysozyme. J. Mol. Biol. 222:67–88

55. Rhee SY, Gonzales MJ, Kantor R, Betts BJ, Ravela J, Shafer RW. 2003. Human immun-odeficiency virus reverse transcriptase and protease sequence database. Nucleic Acids Res.31:298–303

56. Risch N, Merikangas K. 1996. The future of genetic studies of complex human diseases.Science 273:1516–17

57. Rychlewski L, Fischer D. 2005. LiveBench-8: the large-scale, continuous assessment ofautomated protein structure prediction. Protein Sci. 14:240–45

58. Santibanez Koref MF, Gangeswaran R, Santibanez Koref IP, Shanahan N, Hancock JM.2003. A phylogenetic approach to assessing the significance of missense mutations in diseasegenes. Hum. Mutat. 22:51–58

59. Carefullyexamines thecontribution ofstructure andsequence toprediction.

59. Saunders CT, Baker D. 2002. Evaluation of structural and evolutionary contribu-tions to deleterious mutation prediction. J. Mol. Biol. 322:891–901

60. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, et al. 2001. dbSNP: the NCBIdatabase of genetic variation. Nucleic Acids Res. 29:308–11

61. Shows that AASprediction methodscan predict the rarevariants involved incommon diseaseand theaggregation ofthese results can beused to identify thegenes involved indisease.

61. Smigrodzki R, Parks J, Parker WD. 2004. High frequency of mitochondrial complexI mutations in Parkinson’s disease and aging. Neurobiol. Aging 25:1273–81

62. Stenson PD, Ball EV, Mort M, Phillips AD, Shiel JA, et al. 2003. Human Gene MutationDatabase (HGMD): 2003 update. Hum. Mutat. 21:577–81

63. Stephens JC, Schneider JA, Tanguay DA, Choi J, Acharya T, et al. 2001. Haplotype vari-ation and linkage disequilibrium in 313 human genes. Science 293:489–93

64. Stitziel NO, Binkowski TA, Tseng YY, Kasif S, Liang J. 2004. topoSNP: a topographicdatabase of non-synonymous single nucleotide polymorphisms with and without knowndisease association. Nucleic Acids Res. 32:D520–22

65. Stitziel NO, Tseng YY, Pervouchine D, Goddeau D, Kasif S, Liang J. 2003. Structurallocation of disease-associated single-nucleotide polymorphisms. J. Mol. Biol. 327:1021–30

66. Stone EA, Cooper GM, Sidow A. 2005. Trade-offs in detecting evolutionarily constrainedsequence by comparative genomics. Annu. Rev. Genomics Hum. Genet. 6:143–64

67. Stone EA, Sidow A. 2005. Physicochemical constraint violation by missense substitutionsmediates impairment of protein function and disease severity. Genome Res. 15:978–86

68. One of the firstpapers to establishstructure-/sequence-basedrules to distinguishdisease mutationsfrom neutral aminoacid substitutions.

68. Sunyaev S, Ramensky V, Bork P. 2000. Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet. 16:198–200

69. Sunyaev S, Ramensky V, Koch I, Lathe W 3rd, Kondrashov AS, Bork P. 2001. Predictionof deleterious human alleles. Hum. Mol. Genet. 10:591–97

70. Szabo C, Masiello A, Ryan JF, Brody LC. 2000. The breast cancer information core:database design, structure, and scope. Hum. Mutat. 16:123–31

71. Tchernitchko D, Goossens M, Wajcman H. 2004. In silico prediction of the deleteriouseffect of a mutation: proceed with caution in clinical genetics. Clin. Chem. 50:1974–78

www.annualreviews.org • Amino Acid Substitutions on Protein Function 79

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

ANRV285-GG07-03 ARI 8 August 2006 1:43

72. Terp BN, Cooper DN, Christensen IT, Jorgensen FS, Bross P, et al. 2002. Assessing therelative importance of the biophysical properties of amino acid substitutions associatedwith human genetic disease. Hum. Mutat. 20:98–109

73. Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, et al. 2003. PANTHER: a libraryof protein families and subfamilies indexed by function. Genome Res. 13:2129–41

74. An AASprediction methodis applied tocommon variantsimplicated incommon diseaseand the predictionscores are nodifferent from thatof neutral variation.The authorsconclude thatcurrent AASprediction methodsmay not be usefulfor findingcommon variantsinvolved incommon disease.

74. Thomas PD, Kejariwal A. 2004. Coding single-nucleotide polymorphisms associ-ated with complex vs. Mendelian disease: evolutionary evidence for differences inmolecular effects. Proc. Natl. Acad. Sci. USA 101:15398–403

75. Vitkup D, Sander C, Church GM. 2003. The amino-acid mutational spectrum of humangenetic disease. Genome Biol. 4:R72

76. Wang X, Tomso DJ, Liu X, Bell DA. 2005. Single nucleotide polymorphism in transcrip-tional regulatory regions and expression of environmentally responsive genes. Toxicol. Appl.Pharmacol. 207:84–90

77. One of the firstpapers to observethat a majority ofdisease mutationsaffect proteinstability andestablishstructure-basedrules thatdistinguisheddisease mutationsfromnonsynonymousSNPs.

77. Wang Z, Moult J. 2001. SNPs, protein structure, and disease. Hum. Mutat. 17:263–70

78. Wong GK-S, Yang Z, Passey DA, Kibukawa M, Paddock M, et al. 2003. A populationthreshold for functional polymorphisms. Genome Res. 13:1873–79

79. Wong GK, Liu B, Wang J, Zhang Y, Yang X, et al. 2004. A genetic variation map forchicken with 2.8 million single-nucleotide polymorphisms. Nature 432:717–22

80. Wrobel JA, Chao SF, Conrad MJ, Merker JD, Swanstrom R, et al. 1998. A genetic approachfor identifying critical residues in the fingers and palm subdomains of HIV-1 reversetranscriptase. Proc. Natl. Acad. Sci. USA 95:638–45

81. Xi T, Jones IM, Mohrenweiser HW. 2004. Many amino acid substitution variants identifiedin DNA repair genes during human population screenings are predicted to impact proteinfunction. Genomics 83:970–79

82. Yue P, Li Z, Moult J. 2005. Loss of protein structure stability as a major causative factorin monogenic disease. J. Mol. Biol. 353:459–73

83. Yue P, Moult J. 2005. Identification and analysis of deleterious human SNPs. J. Mol. Biol.356:1263–74

84. Zhang EY, Fu D-J, Pak YA, Stewart T, Mukhopadhyay N, et al. 2004. Genetic polymor-phisms in human proton-dependent dipeptide transporter PEPT1: implications for thefunctional role of Pro586. J. Pharmacol. Exp. Ther. 310:437–45

85. Zwick ME, Cutler DJ, Chakravarti A. 2000. Patterns of genetic variation in Mendelianand complex traits. Annu. Rev. Genomics Hum. Genet. 1:387–407

80 Ng · Henikoff

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

Contents ARI 8 August 2006 1:10

Annual Review ofGenomics andHuman Genetics

Volume 7, 2006

Contents

A 60-Year Tale of Spots, Maps, and GenesVictor A. McKusick � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 1

Transcriptional Regulatory Elements in the Human GenomeGlenn A. Maston, Sara K. Evans, and Michael R. Green � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �29

Predicting the Effects of Amino Acid Substitutions on ProteinFunctionPauline C. Ng and Steven Henikoff � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �61

Genome-Wide Analysis of Protein-DNA InteractionsTae Hoon Kim and Bing Ren � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �81

Protein Misfolding and Human DiseaseNiels Gregersen, Peter Bross, Søren Vang, and Jane H. Christensen � � � � � � � � � � � � � � � � � � � � � 103

The Ciliopathies: An Emerging Class of Human Genetic DisordersJose L. Badano, Norimasa Mitsuma, Phil L. Beales, and Nicholas Katsanis � � � � � � � � � � � � � 125

The Evolutionary Dynamics of Human Endogenous RetroviralFamiliesNorbert Bannert and Reinhard Kurth � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 149

Genetic Disorders of Adipose Tissue Development, Differentiation,and DeathAnil K. Agarwal and Abhimanyu Garg � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 175

Preimplantation Genetic Diagnosis: An Overview of Socio-Ethicaland Legal ConsiderationsBartha M. Knoppers, Sylvie Bordet, and Rosario M. Isasi � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 201

Pharmacogenetics and Pharmacogenomics: Development, Science,and TranslationRichard M. Weinshilboum and Liewei Wang � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 223

Mouse Chromosome Engineering for Modeling Human DiseaseLouise van der Weyden and Allan Bradley � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 247

v

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.

Contents ARI 8 August 2006 1:10

The Killer Immunoglobulin-Like Receptor Gene Cluster: Tuning theGenome for DefenseArman A. Bashirova, Maureen P. Martin, Daniel W. McVicar,and Mary Carrington � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 277

Structural and Functional Dynamics of Human CentromericChromatinMary G. Schueler and Beth A. Sullivan � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 301

Prediction of Genomic Functional ElementsSteven J.M. Jones � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 315

Of Flies and Man: Drosophila as a Model for Human Complex TraitsTrudy F.C. Mackay and Robert R.H. Anholt � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 339

The Laminopathies: The Functional Architecture of the Nucleus andIts Contribution to DiseaseBrian Burke and Colin L. Stewart � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 369

Structural Variation of the Human GenomeAndrew J. Sharp, Ze Cheng, and Evan E. Eichler � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 407

Resources for Genetic Variation StudiesDavid Serre and Thomas J. Hudson � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 443

Indexes

Subject Index � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 459

Cumulative Index of Contributing Authors, Volumes 1–7 � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 477

Cumulative Index of Chapter Titles, Volumes 1–7 � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 480

Errata

An online log of corrections to Annual Review of Genomics and Human Genetics chaptersmay be found at http://genom.annualreviews.org/

vi Contents

Ann

u. R

ev. G

enom

. Hum

an G

enet

. 200

6.7:

61-8

0. D

ownl

oade

d fr

om a

rjou

rnal

s.an

nual

revi

ews.

org

by I

NST

ITU

TE

FO

R G

EN

OM

IC R

ESE

AR

CH

on

09/1

4/06

. For

per

sona

l use

onl

y.