86, march the quarterly review of biologydhoule/publications/houle@@11measurement.pdfof biology...

33
Measurement and Meaning in Biology Author(s): David Houle, Christophe Pélabon, Günter P. Wagner, Thomas F. Hansen Source: The Quarterly Review of Biology, Vol. 86, No. 1 (March 2011), pp. 3-34 Published by: The University of Chicago Press Stable URL: http://www.jstor.org/stable/10.1086/658408 . Accessed: 18/03/2011 17:28 Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available at . http://www.jstor.org/page/info/about/policies/terms.jsp. JSTOR's Terms and Conditions of Use provides, in part, that unless you have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and you may use content in the JSTOR archive only for your personal, non-commercial use. Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at . http://www.jstor.org/action/showPublisher?publisherCode=ucpress. . Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission. JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected]. The University of Chicago Press is collaborating with JSTOR to digitize, preserve and extend access to The Quarterly Review of Biology. http://www.jstor.org

Upload: others

Post on 26-Dec-2019

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

Measurement and Meaning in BiologyAuthor(s): David Houle, Christophe Pélabon, Günter P. Wagner, Thomas F. HansenSource: The Quarterly Review of Biology, Vol. 86, No. 1 (March 2011), pp. 3-34Published by: The University of Chicago PressStable URL: http://www.jstor.org/stable/10.1086/658408 .Accessed: 18/03/2011 17:28

Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available at .http://www.jstor.org/page/info/about/policies/terms.jsp. JSTOR's Terms and Conditions of Use provides, in part, that unlessyou have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and youmay use content in the JSTOR archive only for your personal, non-commercial use.

Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at .http://www.jstor.org/action/showPublisher?publisherCode=ucpress. .

Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printedpage of such transmission.

JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range ofcontent in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new formsof scholarship. For more information about JSTOR, please contact [email protected].

The University of Chicago Press is collaborating with JSTOR to digitize, preserve and extend access to TheQuarterly Review of Biology.

http://www.jstor.org

Page 2: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

THE QUARTERLY REVIEW

of Biology

MEASUREMENT AND MEANING IN BIOLOGY

David Houle*Centre for Ecological & Evolutionary Synthesis, University of Oslo, 0316 Oslo, Norway and Department of

Biological Science, Florida State UniversityTallahassee, Florida 32306-4295 USA

e-mail: [email protected]

Christophe PelabonDepartment of Biology, Centre for Conservation Biology, NTNU

N-7491 Trondheim, Norway

Gunter P. WagnerDepartment of Ecology & Evolutionary Biology, Yale University

New Haven, Connecticut 06405 USA

Thomas F. HansenCentre for Ecological & Evolutionary Synthesis, Department of Biology, University of Oslo

0316 Oslo, Norway

keywordsmeasurement theory, meaningfulness, dimensional analysis, scale type,

scaling, significance testing, modeling

abstractMeasurement—the assignment of numbers to attributes of the natural world—is central to all

scientific inference. Measurement theory concerns the relationship between measurements and reality;

*Authorship is on a nominal scale.

The Quarterly Review of Biology, March 2011, Vol. 86, No. 1

Copyright © 2011 by The University of Chicago Press. All rights reserved.

0033-5770/2011/8601-0001$15.00

Volume 86, No. 1 March 2011

3

Page 3: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

its goal is ensuring that inferences about measurements reflect the underlying reality we intend torepresent. The key principle of measurement theory is that theoretical context, the rationale for collectingmeasurements, is essential to defining appropriate measurements and interpreting their values.Theoretical context determines the scale type of measurements and which transformations of thosemeasurements can be made without compromising their meaningfulness. Despite this central role,measurement theory is almost unknown in biology, and its principles are frequently violated. In thisreview, we present the basic ideas of measurement theory and show how it applies to theoretical as wellas empirical work. We then consider examples of empirical and theoretical evolutionary studies whosemeaningfulness have been compromised by violations of measurement-theoretic principles. Commonerrors include not paying attention to theoretical context, inappropriate transformations of data, andinadequate reporting of units, effect sizes, or estimation error. The frequency of such violations revealsthe importance of raising awareness of measurement theory among biologists.

Introduction

PROGRESS IN science often involvesquantification. Ideas progress from

loose verbal accounts to become rigorousmathematical models. At the same time,concepts and entities progress from in-complete verbal definitions to becomevariables and parameters that derive theirmeaning from a precise theoretical con-text. For example, ecology developed froman unconnected set of verbal ideas in 1920to a unified science with a rigorous foun-dation in mathematical population ecol-ogy by about 1970 (Kingsland 1985), andsimilar changes took place during themodern synthesis in evolutionary biology.Arguably, empirical progress results mainlyfrom better measurement. Measurementsimprove either because of more rigoroustheory, which defines what is importantto measure, or better instruments, whichallow more accurate measurements of fa-miliar quantities or make the previouslyunknown measurable. The high status ofquantification in science is therefore nosurprise, nor is the desire of researchers tosupport their work with quantitative mea-surements.

For measurements to be meaningful,however, they must retain their connectionto the theoretical and instrumental con-text from which they were derived. Mea-surement theory concerns the relationshipbetween measurements and reality; its goalis ensuring that inferences about mea-surements reflect the underlying realitywe intend to represent. Unfortunately, inbiology, the connection between conceptsand measurements is often lost during the

measurement process. Quantitative mea-surements flourish, but they are often usedand manipulated in ways that render theconclusions drawn from them meaningless.This conclusion is familiar to any participantin journal clubs or discussion groups.

Consider some brief examples: Dunn etal. (1999) devised an index of concern forthe conservation of Canadian land birds byaveraging ordinal indices for abundance,breadth of range, and evidence of popula-tion decline. Wolman (2006) pointed outthat these averages are meaningless be-cause ordinal indices do not reflect themagnitudes of the attributes they measure.Post and Forchhammer (2002) reported anegative correlation between climate andvariation in abundance of caribou andmusk ox on Greenland. Vik et al. (2004)showed that this result is meaningless be-cause changing the units used to measureabundance could reverse this correlation.Diaz and Rutzler (2001) reviewed studiesof the importance of organisms on coralreefs, all of which used the area covered tomeasure abundance. Wulff (2001) pointedout that the most relevant attribute is bio-mass, and measuring area dramatically un-derestimates the biomass of sponges rela-tive to encrusting organisms because spongebiomass scales with volume. Harvey andClutton-Brock (1985) compiled a widelyused source of primate body-size data.They presented species means withoutstandard errors, sample sizes, or referencesto the source of the data. Smith andJungers (1997) traced the source of thesedata and found so many errors and inac-curacies that they concluded that “the data

4 Volume 86THE QUARTERLY REVIEW OF BIOLOGY

Page 4: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

table . . . should never have been acceptableas a source” (p. 549). We will make the casethat such errors are not isolated mistakesor differences of opinion, but systemic er-rors in the scientific process that resultfrom the absence of a theory of measure-ment and meaning in biology.

Scientific representation of biological re-ality also extends to the use of mathemati-cal models to capture biological relations.Mathematical models are representationsof hypotheses about reality, and their formand manipulation must be consistent withthe reality they are meant to represent.Measurement in the usual sense assignsnumbers to aspects of reality, whereasmathematical modeling assigns functionsto aspects of reality. This representationalaspect of modeling is often forgotten whenspecific models are used to represent gen-eral hypotheses. A typical example is May-nard Smith’s (1976) claim, based on ahighly specific population-genetic model,that Zahavi’s handicap principle, in whichindividuals display costly traits to enforcehonest signals of quality, cannot work. Thisclaim was not a mathematical error but arepresentational error, because the spe-cific functional forms assumed by MaynardSmith were not a complete representationof the universe of cases inherent in thehandicap hypothesis. Pomiankowski (1987,1988) and Grafen (1990a,b) showed thathandicaps can evolve under reasonableconditions on the basis of models withmore general functional forms.

Our purpose here is to bring a broadmeasurement theory framework for under-standing the relationship between mean-ing and measurement to the attention ofbiologists. We believe that awareness ofmeasurement theory helps us to do betterscience by providing tools to ensure themeaningfulness of our work. Our startingpoint is the field of representationalmeasurement theory, a well-establishedmathematical approach that enables us todetermine whether the relationshipsamong numerical measurements are con-sistent with the relationships among theattributes they are meant to represent.Representational measurement theory is

virtually unknown in biology, so our edu-cation in the theory has come mostly fromnonbiological sources (Stevens 1946, 1968;Krantz et al. 1971; Suppes et al. 1989; Luceet al. 1990; Hand 1996, 2004; Sarle 1997;Michell 1999). In a few exceptional cases,biologists have taken an explicitly measure-ment-theoretic approach (e.g., Stahl 1962;Rosen 1978b; Wolman 2006), but these seemto have had little general impact.

We also argue that measurement theorycan be used more broadly as the basis of atheory of scientific meaning that extendsbeyond the domain of the mathematicaltheory of representational measurement toinclude explicit consideration of what tomeasure and what conclusions can bedrawn from those measurements. We sug-gest the term “conceptual measurementtheory” for this broader approach to therelationship between measurements andmeaning. This broader aspect of extractingmeaning is more familiar to biologists, aswe are relatively well-versed in how to thinkabout hypotheses and their testing. Ourclaim is that thinking about these as part ofthe measurement process is tremendouslyuseful. We have found that this broaderconception of measurement has followednaturally as we have begun to incorporate ex-plicit representational measurement theoryinto our own work (Wagner et al. 1998;Hansen and Wagner 2001a,b; Wagner andLaubichler 2001; Hansen and Houle 2008;Wagner 2010). Measurement theory pro-vides a language for discussion of a part ofscientific practice that is both critical togood science and frequently done poorly.Consequently, we will use the term mea-surement theory to refer to all aspects ofextracting meaning from observations, ex-periments, and models.

In this review, we first summarize thebasic themes of formal, representationalmeasurement theory. We then lay outbroader conceptual measurement princi-ples. We drive home the importance ofthese points by making the case that viola-tions of these basic measurement princi-ples are widespread in biology, as in theexamples briefly touched on above. Fi-nally, we discuss what can be done to in-

March 2011 5MEASUREMENT AND MEANING IN BIOLOGY

Page 5: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

corporate measurement theory into theeducation and day-to-day thinking of biol-ogists.

MeasurementWe naturally identify entities that exist in

the world (with varying degrees of distinct-ness) and conceive of attributes of thoseentities that seem important to us. Theidea that some attributes are more deserv-ing of attention than others immediatelymakes clear that a priori ideas about whatis important and interesting underlie allmeasurement. For example, if our entitiesare individual organisms, we might be in-terested in the attribute “size,” whichmight predict which individuals will leavemore offspring. Implicit in this simplestatement are at least five complex con-cepts. First is the theoretical context thatleads us to care about the number of off-spring individuals produce, which mightbe, among many other possibilities, an evo-lutionary, an ecological, or an economiccontext. Second is at least a hypothesis andpossibly prior evidence that size might bean important predictor of number of off-spring. Third is an implicit definition ofsize. Fourth is a notion of what constitutesa countable offspring. Finally, we must cir-cumscribe the set of organisms to whichour hypothesis is applied.

Measurement consists of an assignmentof numbers to attributes of entities so thatthe relations between the numbers cancapture empirical relations among the at-tributes (Krantz et al. 1971; Luce et al.1990; Hand 1996, 2004). When what is trueabout the relations of the numbers is trueabout the relations of the attributes, the con-clusions we draw from the numbers aremeaningful conclusions about nature. Manynumerical relations could help to captureempirical realities, including order, differ-ences, ratios, and equivalence. Which ofthese is appropriate depends on our hy-pothesis about what might actually matter,the actual empirical relations, and howmeasurement is done.

A simple example of the measurementprocess is shown in Figure 1. The theoret-ical context of sexual selection has led to a

hypothesis that the length of a male guppypredicts his attractiveness to potentialmates. The empirical relationship of liningup two fish side-by-side and recordingwhich is longer is captured instead by mea-surement of length in standard units. Theconclusions about length that can bedrawn from the pair-wise empirical com-parisons can also be drawn from numbersproperly assigned to lengths. Drawing con-clusions about the relationship betweenmale lengths and the larger theoreticalcontext of male attractiveness to femalesthen proceeds by the usual scientific pro-cess, incorporating additional measure-ments and experiments. Note that whetherthe original hypothesis is true or not is notimportant to the measurement process.What is important is that each of the stepsfrom hypothesis to identification of entitiesand attributes to measurements and con-clusions be well motivated and consistent.False hypotheses will tend to be rejectedwhen all these connections hold.

representational measurementtheory

Representational measurement theory isa mathematical system for determiningwhen the relations among the numericalmeasurements assigned to attributes re-flect the corresponding empirical reality(Krantz et al. 1971; Suppes et al. 1989;Luce et al. 1990; Hand 1996, 2004; Sarle1997). The clearest introduction to repre-sentational measurement theory is Chap-ter 1 of Krantz et al. (1971), and we adopttheir terminology here. The relationshipof attributes in the world is an empiricalrelational structure and consists of a set ofentities (e.g., organisms, genotypes, genes,or proteins), the empirical operations thatcan be made with those entities (e.g., com-paring, combining, mutating, or placing inan environment), the attributes that arisefrom those operations (e.g., contest out-come, trait value, or fitness), and the infer-ences that can be made from comparisonof those attributes. The empirical rela-tional structure is then represented by anumerical relational structure in which at-

6 Volume 86THE QUARTERLY REVIEW OF BIOLOGY

Page 6: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

tributes are mapped to numbers and em-pirical operations and comparisons aremapped to mathematical relat ions(greater or less than) and operators (ad-dition or multiplication). A numerical re-lational structure has the property of

meaningfulness when inferences aboutnumbers can be translated into inferencesabout the original entities (Weitzenhoffer1951; Luce et al. 1990). Which kinds of nu-merical relationships are meaningful in thissense is a natural way to define scale type. Thus,

Figure 1. The Measurement ProcessWe imagine a study where the theoretical context is sexual selection. Within this context, we focus on

attractiveness of males and then on the hypothesis that size influences attractiveness in the guppy. Size can bemeasured in many ways, so the concept of “size” was referred to the attribute “overall length of the fish” underthe hypothesis that females prefer large males and treat the tail as part of the body. The middle third of thefigure represents the empirical relations that are at the root of measurement. For empirical comparison oflength, fish could be aligned with their noses against a flat surface, and the identity of the fish that extendsfarther noted. For a pair of fish A and B, the possible results are that A extends farther than B, which we canrepresent as A � B; that we cannot decide whether A extends farther than B, A � B; and that A extends less farthan B, A ≺ B. Representational measurement theory proves that the conclusions that can be drawn aboutlength on the basis of the pairwise empirical comparisons can also be drawn from a numerical system consistingof numbers (a and b) assigned to lengths of A and B plus a mapping of the empirical relations �, �, ≺ amongfish to the relations �, �, and � among the numbers.

March 2011 7MEASUREMENT AND MEANING IN BIOLOGY

Page 7: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

scale type encapsulates what properties of a setof measurements could be used to draw em-pirically meaningful conclusions.

Assignment of numbers to attributes—the act of measurement—usually dependson just a few empirical relations (reviewedin Krantz et al. 1971, Chapter 1; Hand2004). Primary among these is a procedureby which the order of attributes can beestablished—that an elephant is biggerthan a mouse, that 17 May 1814 is laterthan 4 July 1776, or that animal A is sociallydominant to animal B. For example, inFigure 1, the empirical procedure is toplace fish next to each other in a standardorientation, with their snouts against a flatobject, then see which fish’s tail extendsfarther. This simple operation gives usboth the order of lengths and operationalequivalence, when we judge that we cannottell which of the two extends farther. Asecond empirical operation critical tomany measurements is combining entitiesso that their attributes are also combinedin a meaningful way. This is termed concat-enation. For the fish in Figure 1, we couldcut rods equivalent to the length of eachfish, then lay the rods end to end to askquestions such as: “Is twice the length offish A greater or less than the length of fishB?” When concatenation operations arepossible, we can also construct a standardsequence of concatenated entities (a ruler,for example) that permits counting of at-tributes in standard units of measurement.A second example is measurement of mass.The empirical operation of placing an ele-phant and a mouse on a balance will tell usthat the elephant is heavier because thescale tips toward the elephant. We canthen “concatenate” mice by placing morethan one mouse (or other more conve-nient objects that are each equivalent inweight to a mouse) on the scale, to deter-mine how many mouse equivalents arenecessary to make something as heavy asan elephant, that is to measure mass of theelephant in mouse units. The empiricalrelational structure includes the animalswe wish to compare and the operations ofconcatenation (e.g., placing more thanone mouse on a scale) and comparison

(does the scale tip toward the elephant orthe mice?).

Measurement then proceeds by assign-ment of numbers to attributes and mathe-matical operations (such as addition) toempirical operations (such as placing ob-jects on a scale); together the numbers andthe operations define a numerical rela-tional structure. When both order determi-nation and concatenation are empiricallyrelevant, these operations can be mappedto a numerical relational structure thatuses positive real numbers to represent theattribute, addition to represent concatena-tion, and the � operator for comparisons.Measurement in these cases is termed exten-sive measurement; both lengths and weightsare therefore extensively measurable attri-butes.

A second important type of empiricalrelational structure arises when measure-ment depends on paired comparisons be-cause no natural concatenation operationis possible; this is termed intensive measure-ment. This approach was developed in psy-chology for measurement of attributessuch as attractiveness or utility of objects totest subjects (see, e.g., Luce 1959). Subjectsare confronted with pairs of objects, suchas faces of humans, and asked to rankthem. With repeated trials and under cer-tain well-defined conditions, an overalljudgment of the quality of the individualcan be made, for example, the “intrinsicattractiveness” of a face. Recently, a similarapproach has proven useful for defining ameasure of fitness based on pairwise com-petition experiments, for example, amonga set of bacterial strains (Wagner 2010 andsee below).

Representational measurement theoryproves which numerical relational struc-tures can correspond precisely to the em-pirical results one would obtain using theactual entities, such as fish or mice andelephants. Narens (1981, 1985) and others(Luce et al. 1990, Chapter 20) have shownthat only a small number of such numer-ical relational structures can preservemeaningfulness given these types of sim-ple empirical relations. More familiarly,these structures define the scale types first

8 Volume 86THE QUARTERLY REVIEW OF BIOLOGY

Page 8: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

identified by Stevens (1946, 1959, 1968).The scale types relevant to biological sys-tems are listed in Table 1 with examples ofeach. Figure 2 shows examples of scaletypes resulting from measurements on aset of fish. We emphasize that scale typescan be characterized in several differentways: most fundamentally, they are deter-mined by the empirical operations that thescientist wishes to represent using numeri-cal operations. It is very important to notethat the theoretical context is an irreduc-ible part of this formulation. The samedata (for example, lengths) may reside ona different scale type depending on thequestion asked. We give specific examplesof this relationship between scale type andhypothesis below. A second convenient char-acterization of scale type refers to the kinds ofmathematical relationships among the mea-surements that are potentially meaningful,and this has led to the names of the scaletypes: nominal, ordinal, interval, log-interval,difference, ratio, signed ratio, and absolute.Third, the scale types can be formally char-acterized by three related properties: thepermissible transformations that preservethe relevant relationships among measure-ments, the number of arbitrary parametersthat must be adopted to establish the nu-merical system, and the domain of num-bers (or symbols) to which they apply.

An understanding of scale types is mosteasily developed from specific examples.For the examples of length and weight de-veloped above, note that the empirical re-lations can be used to establish the orderof attributes (fish A is longer than fish B)and their differences (elephant A isheavier than elephant B by 20 mouseunits) and ratios (the concatenation of twofishes the length of A is equivalent to thelength of fish B). Any numerical relationalsystem that reflects all of those relation-ships is by definition on a ratio scale. Pos-itive real numbers are the domain of themeasurements, as a negative weight orlength has no physical equivalent. To mapattributes to numbers, we had to specifyone arbitrary parameter in these cases—aunit of length or mass. Permissible transfor-mations of the numerical data are definedas those transformations that preserve thecorrespondence of the numerical relation-ships to all the empirical relationships thatcould be measured. For ratio scales theonly permissible transformation is multipli-cation by a constant. To see this, assumethat we have four entities and that the em-pirical attributes, say lengths, are repre-sented by the letters A, B, C, and D. Wethen measure the four lengths and mapthe nonnumerical attribute A to a numbera, the attribute B to b, and so forth. If we

TABLE 1Classification of scale types (after Stevens 1946, 1959, 1968; Luce et al. 1990:113)

Scale typePermissible

transformations DomainArbitrary

parametersMeaningfulcomparisons Biological examples

Nominal Any one-to-onemapping

Any set of symbols Countable Equivalence Species, genes

Ordinal Any monotonicallyincreasingfunction

Ordered symbols Countable Order Social dominance

Interval x -� ax�b Real numbers 2 Order, differences Dates, Malthusian fitnessLog-interval x -� axb, a, b�0 Positive real numbers 2 Order, ratios Body sizeDifference x -� x�a Real numbers 1 Order, differences Log-transformed ratio-

scale variablesRatio x -� ax Positive real numbers 1 Order, ratios,

differencesLength, mass, duration

Signed ratio* x -� ax Real numbers 1 Order, ratios,differences

Signed asymmetry, intrin-sic growth rate (r)

Absolute None Defined 0 Any Probability

* Luce et al. (1990) defined this ratio scale but did not discuss or name it. Stevens did not consider this scale.

March 2011 9MEASUREMENT AND MEANING IN BIOLOGY

Page 9: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

can show, with an empirical operation, thatthe number of As that must be concate-nated to be equivalent to B is less than the

number of Cs that must be concatenated tobe equivalent to D, we want our numericalmeasures to have the properties a/b � c/d

Figure 2. Representational Measurement TheoryA set of entities, individual fish, can be associated with different empirical relational structures depending on what

attributes we focus on and the context we are interested in. Each attribute induces more or less strict order andequivalence relations among the fish. Representational measurement consists of mapping these relations to anumerical system such that the relations among the attributes are preserved in the relations among the numbersassigned to them. The scale types refer to mappings that preserve different relations. The attribute of mass illustratesa ratio scale type. In many conceptual contexts, for example, studies of metabolic rate, a compelling logical reasonfavors adoption of mass as the appropriate attribute to measure. In this case, we require a mapping that preservesthe order of all fish according to mass (as illustrated) and also the order of differences and ratios between theirmasses. We can use a standard unit (say, the gram) that we could add up to describe the mass of any of the fishes.The only operation we could apply to our measurements that would preserve these properties would be multiplyingby a constant (for example, changing the unit from gram to pound). If we were interested in size, but had no reasonto choose mass as a measure of size over other possible measures such as length or area, then we would have to admitraising to a power to the permissible transformations, yielding the log-interval scale. On this scale type, the order ofdifferences in fish size is meaningless because the order of differences can change depending on the choice ofexponent, but the order of their ratios is meaningful because the order of ratios does not change with the choiceof exponent. Note that whether each of these measures of body size is on a ratio or log-interval scale depends onthe conceptual context, not on the attribute itself. The residuals of a regression of tail length on body lengthmeasure relative tail size. This example illustrates an interval scale type, where differences between relative tail sizesare meaningful (i.e., a natural unit exists), but their ratios are meaningless. The differences themselves reside on asigned ratio scale, so we can say that the relative tail area of the left fish differs from that of the second fish (0.39)by 11% more than that of the last fish differs from that of the third fish (0.35). If we seek only to represent order,as can be illustrated by contrast in the context of male attractiveness judged by females, we have an ordinal scale type.In that case, statements such as “the difference between fish A and fish B is larger than the difference between fishB and fish C” are meaningless. Finally, a nominal scale type, as illustrated by imaginary throat morphs, preserves onlyequivalence relations; order is meaningless.

10 Volume 86THE QUARTERLY REVIEW OF BIOLOGY

Page 10: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

and a�b � c�d. Multiplying all of the nu-merical measurements by the same con-stant obviously preserves the truth of thesestatements; any other type of transforma-tion, such as exponentiation or adding aconstant to each number, can readily leadto violation of one or both statements.

Log-transformation of ratio-scale vari-ables places the data on a new scale type,the difference scale, with the domain ofthe real numbers. On this scale type, dif-ferences are equivalent to ratios of the cor-responding exponentially transformed val-ues. The only permissible transformation isaddition of a constant.

Body size is fundamental to much of bi-ology, and a considerable literature ad-dresses allometry and scaling relationshipsbetween size and a wide variety of otherphenotypes (e.g., Schmidt-Nielsen 1984).Inspection of this literature, however, revealstwo features. First, in most cases, measures ofsize are log-transformed before analysis. Sec-ond, measures of size can be measured on alinear, area, or volume scale, and these aretreated interchangeably (Solow and Wang2008). For example, to investigate Cope’sRule that body size tends to increase withinlineages, Alroy (1998) studied the log bodymass of fossil mammals predicted from lin-ear tooth dimensions, whereas Hone et al.(2005) used log bone length. The inter-changeability of these scales in practice sug-gests that biologists consider raising mea-sures of size to a power to be a permissibletransformation. Choice of (for example) lin-ear over volume measures is seen as a matterof convenience or convention. A key impli-cation of this practice is that ratios are mean-ingful—the order of ratios, say a/b � c/d, ispreserved when one raises all the measure-ments to a power—but that the order ofdifferences is not. For example, say that a �5, b � 1, c � 10, and d � 7. On this scale,a/b � c/d (5 � 1.43) and a-b � c-d (4 � 3).When the measurements are raised to thepower of 2, the ratio a2/b2 is still greaterthan c2/d2 (25 � 2.04), but the order of thedifferences is reversed: a2 � b2 � c2 � d2

(24 � 51). The scale type with these permis-sible transformations is the log-interval—sonamed because once the data are log trans-

formed the ratios that are meaningful on theoriginal scale become differences residingon an interval scale type.

A key example of the context depen-dence of scale type now presents itself:above, we claimed that body lengths andbody mass are on a ratio scale, and now wesay that they are on a log-interval scale. Thedifference is that different theoretical con-texts are applied in the two cases. When ameasure of length is treated as a measureof length it is on a ratio scale. For example,if, as in Figure 1, we hypothesize (or haveevidence) that female guppies really choosetheir mates on the basis of body length, ratherthan volume, then the choice is dictated bythis hypothesis. In an allometric context,we choose to regard powers of body size aspotentially relevant to the questions. Thisdistinction is a key part of our argumentthat measurement theory must incorpo-rate theoretical context and is thereforerelevant to the wider issues of scientificreasoning.

If our data consisted of dates of occur-rences, the interval scale type would beappropriate. The key difference betweeninterval and ratio scales is that the empiri-cal relation of concatenation does not ap-ply: one cannot directly concatenate dates.How would you decide how many 4 July1776s add up to give you a 17 May 1814?Without this operation, the magnitudes ofdates cannot be directly judged; the ratioof dates does not correspond to an empir-ical operation and is therefore not mean-ingful. We can, however, use the differencebetween two dates, say January 1 and Jan-uary 2 as our concatenatable unit—howlong we wait until the date changes. Thenthe difference between 4 July 1776 and 17May 1814 can be expressed as the numberof days on top of 4 July 1776 that areneeded to reach 17 May 1814. The permis-sible transformations therefore include ad-dition of a constant, for example, adoptingthe Judaic or the Islamic calendar insteadof the Gregorian calendar, as well as mul-tiplication by a constant (adopting the Ju-lian year, rather than Gregorian one). Toarrive at this system, we must specify twoarbitrary parameters, the starting and stop-

March 2011 11MEASUREMENT AND MEANING IN BIOLOGY

Page 11: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

ping dates we use as a standard unit oftime. The use of concatenation in measur-ing the difference between two dates ortemperatures means that the differencesbetween dates are extensively measurableand therefore on a ratio scale.

The case of an interval scale allows us toillustrate the important distinction be-tween scale and scale type. Other familiarexamples of interval scale type are the Cel-sius and Fahrenheit temperature scales. Alltemperature scales based on arbitrary nu-merical assignments to two states, such asfreezing and boiling points of water, are onthe interval scale type regardless of the unitof temperature used, but once a unit oftemperature has been chosen, we have ascale on which units are essential to theinterpretation of the numbers. When wesay, on the basis of scale type, that differ-ences can be meaningfully compared,what we mean is that the relationships ofthe differences are preserved under thepermissible transformations and not thatthe numerical values of the differences arepreserved. For example, if temperaturechanges in one hour from 8 to 10 in de-grees Celsius, and then in the second hourto 11 degrees C, we can conclude that thetemperature rose twice as fast in the firsthour as in the second hour. If we convertto the Fahrenheit scale, the temperatures(46.4 to 50 to 51.8 degrees F) and thedifferences (3.6 and 1.8 degrees F) change,but temperature can still be inferred tohave risen twice as fast in hour one as inhour two.

Now consider the case of social domi-nance, in which animals can be placed ona simple linear hierarchy. The single rele-vant empirical operation is to place twoanimals together and observe which one isdominant. The outcome of such a contestdoes not directly tell us anything about themagnitude of the difference between ani-mals. No empirical operation correspondsto concatenation, either of the animalsthemselves or of their differences. Even ifwe could duplicate animal A and put twoAs with one B, the result would no longertell us anything about pairwise dominance.Similarly, knowing that both A and B are

dominant to C does not necessarily tell usanything about the relation between A andB. Therefore, any numerical system thatpreserves the order is permissible, and thescale type is ordinal. The assignment ofnumbers to submissive and dominant indi-viduals could as well be 1 and 2 as 1 and1000 or 999 and 1000.

The absolute scale type differs from theothers in that no transformation is permis-sible. For example, the axiomatic defini-tion of a probability constrains its domainto the interval 0 and 1 and specifies theprecise meaning of any value on thatrange. An arcsine square-root transformedprobability is not a probability.

In Table 1, we also consider two exam-ples of a relatively unknown scale type, thesigned ratio scale. This type had previouslybeen identified on mathematical grounds(Luce et al. 1990), but to our knowledgeno real examples have been discussed. Thisscale differs from the ratio scale in that thedomain includes all real numbers, so ratioshave a sign that can be negative or positive.For example, for a measurement of signedasymmetry (length of structure on the leftside of the body minus length of thesame structure on the right side), boththe ratio of the asymmetry and whetherthe asymmetry is in the same direction aremeaningful. Similarly, the intrinsic rate of pop-ulation increase has both a magnitude and asign.

pragmatic measurement theoryIn many situations, the relationship be-

tween attributes and the numerical valueswe use to represent them is not entirelyclear. At one extreme, these uncertaintiesstem from quantifiable factors such as mea-surement error that need not challengeour understanding of the empirical rela-tional system. At the other extreme, we maybe studying an attribute about which we can-not be sure what measurements can actuallyrepresent it or even whether a hypothesizedattribute actually exists. Numerous examplesof such attributes exist in behavioral re-search, and these have been much discussedin the application of measurement to psy-chological research (see, e.g., Michell 1999).

12 Volume 86THE QUARTERLY REVIEW OF BIOLOGY

Page 12: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

For example, it is still controversial whetherthe perception of sensory stimuli leads to aquantitative attribute called sensation, orwhether intellectual ability is really a quanti-tative trait. These uncertainties arise particu-larly when the goal of measurement is topredict some future outcome, without neces-sarily having any underlying model of therelationship between the attributes that aremeasured and the outcome (Breiman 2000).

These considerations have occasionedconsiderable debate about the usefulnessof representational measurement theory.In the extreme, some claim that represen-tation is an illusion and that measurementscapture only what the measurement proce-dure measures. This view leads to the phil-osophical position of operationalism, inwhich scientific concepts and variables aredefined solely in terms of their measure-ment procedures (Bridgman 1927). In thisview, intelligence is nothing more thanwhat is measured by intelligence tests (Bor-ing 1945). Sneath and Sokal (1973), forexample, used operationalism to argue forquantification and algorithmic decision-making in biological classification becauseit would be repeatable and objective, con-sciously giving up the goal that such classi-fications would reflect evolutionary history.Such positions contrast sharply with repre-sentational measurement theory, which as-sumes that the point of measurement is toensure meaningful statements about anempirical relational structure that may ex-ist whether we are there to measure it ornot.

Today, pure operationalism is largelydiscounted by most scientists, but many au-thors acknowledge that many measure-ments are not purely representational,even when representation of an empiricalsystem is the goal (Hand 1996, 2004).Hand (2004) called these nonrepresenta-tional aspects of measurement pragmatic.The more pragmatic the measurementsare, the greater the uncertainty aboutwhether the measurements actually repre-sent the attribute we want to characterizeand, therefore, about the potential mean-ing of the results. Pragmatic considerationsoften arise when researchers must choose

between several alternative measurementsof the same underlying entity. For exam-ple, we might wish to measure the attribute“sexual attractiveness to females” of a sam-ple of males in a population, but unless wefully understand what goes on in the brainand nervous system of those females, wecannot know how best to assess attractive-ness.

McGhee et al. (2007) measured two as-pects of male-female mating interactions inthe bluefin killifish (Lucania goodei): thetime a female spent associating with eachof two confined males and which of twomales was successful when two males andone female were placed in a single tankwith no barriers. Association and matingsuccess were poorly correlated, and whichof them is a better measure of attractive-ness is unclear. One interpretation ofMcGhee et al.’s finding is that female in-terest in the confined males measures attrac-tiveness, which then gets overwhelmed bymale-male interactions. A second is thatmale-male interactions furnish additionalinformation on mate suitability to females,changing their preferences (Wiley and Pos-ton 1996). A measurement-theoretical re-sponse would suggest additional investiga-tion of the link between the measurementand the underlying entity (e.g., Michell1999), but this is often a very difficulttask. Most studies make the pragmaticchoice of one measure of an attribute suchas attractiveness, without verifying that it isthe best such measure. This pragmatic ap-proach may be the best way forward becausea standard assay may at least be comparableacross different experiments. The possibleshortcomings of one’s measure should beborne in mind and reevaluated when theopportunity arises. Humility and com-mon sense are necessary complements torepresentational measurement theory.

Discussion of these issues can be facilitatedby the concept of validity, which “de-scribes how well the measured variablerepresents the attribute being measured”(Hand 2004:129). Validity is widely dis-cussed in the social and behavioral sci-ences, and to some extent in medicine,particularly with respect to the represen-

March 2011 13MEASUREMENT AND MEANING IN BIOLOGY

Page 13: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

tation of mental states. It can be usefullyapplied in evolutionary biology as well.

taking a broad view of measurementtheory

Although the formal results of represen-tational measurement theory are not famil-iar to most biologists, the fundamental is-sues that they address are the concern ofevery working scientist: how we can bestunderstand reality. All measurement stemsfrom a theoretical or conceptual context;good science depends on preserving theconnection of measurement, data han-dling, analysis, and interpretation to thatconceptual context. Our provisional ideasabout the study entities specify the relevantattribute and, thus, the hypothesized em-pirical relational structure we should study.Once this step is taken, representationalmeasurement theory ensures that our num-bers reflect the necessary empirical relations.Specifying an empirical relation system tostudy and drawing conclusions from theobserved empirical relations back to theconcepts—i.e., “conceptual measurementtheory”—is part of the measurement pro-cess.

The importance of concepts and hypoth-eses for measurement can be illustratedwith the question: Why are giraffes (Giraffacamelopardalis) so tall? The obvious expla-nation proposed by many (e.g., Darwin1871, Chapter 7) is that giraffes are tall sothat they can browse on tall trees. Whenthis hypothesis is investigated, height isclearly the relevant attribute. Alternatively,Simmons and Scheepers (1996) proposedthat giraffe necks are an adaptation formale-male competition. Male giraffes com-pete for dominance by “necking,” combatin which males swing their armored headsat one another. This hypothesis suggeststhat height is not the most informative as-pect of form to measure; instead measure-ments might focus on the amount of forcethat can be delivered by the swinging head,which must be a nonlinear function of themass of the head and length of the neck.The cause of the elongated shape of gi-raffes has not been resolved (Cameron anddu Toit 2007), but different hypotheses

clearly assign different meanings to thesame phenomenon.

The importance of multiple hypothesesin this example suggests a measurement-theoretic explanation for the effectivenessof “strong inference” (Platt 1964) and the“method of multiple working hypotheses”(Chamberlin 1890, 1965). The commonelement to these approaches is that theresearcher should seek to consider all rea-sonable hypotheses simultaneously ratherthan to focus on only one. In Chamberlin’sevocative terms, “the investigator thus becomesthe parent of a family of hypotheses: and, by hisparental relation to all, he is forbidden to fas-ten his affections unduly upon any one”(Chamberlin 1890, 1965:756). In the measure-ment context, the investigator considers thatthe same phenomenon may reflect a varietyof underlying empirical relational systemsand, thus, that a variety of measures may bedesirable. The same data may have a dif-ferent meaning under each hypothesis.

Measurement Theory as an Aid toModeling and Statistics

The premise of measurement theory isthat we make measurements to learn aboutan empirical relational structure. Hypoth-esized empirical relational structures arealso the basis for theoretical models, so animportant but underappreciated role formeasurement theory is as a guide to thedefinition and identification of parametersin models. A good definition must identifyforms and parameters that are operationalin that they represent the relevant empiri-cal relations and can also be related toobtainable data. This principle is linked toLewontin’s (1974) conceptions of a dynam-ically sufficient system as one that wouldallow prediction if its parameters wereknown and an empirically sufficient systemas a dynamically sufficient system whoseparameters are estimable with sufficientaccuracy to allow prediction. Dynamicallysufficient parameters are meaningful froma measurement-theory perspective. Empir-ically sufficient ones lead to actual mea-surements that are meaningful.

The quantitative genetic concepts of ad-ditive effects and additive genetic variance

14 Volume 86THE QUARTERLY REVIEW OF BIOLOGY

Page 14: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

are good examples of variables that are bothdynamically meaningful and empirically op-erational. Fisher (1918) developed an ab-stract model of the genotype-phenotype mapand proposed population-level measures ofthe effect of allele substitutions, the averageexcess, and the additive effect (Fisher 1941).The additive genetic variance is the popula-tion variance of these effects (summed overloci) and, because the response to selectiondepends on variance (Fisher 1930; Price1970), the additive variance emerges as a keyparameter for prediction of the short-termevolvability—capacity of a population toevolve (Houle 1992; Wagner and Altenberg1996). Fisher’s measures are approximatelydynamically sufficient in that they capturethat part of the effect of alleles on the phe-notype that determines how their frequen-cies change under an episode of selection,although they are not dynamically sufficientfor long-term evolution (Frank 1995). Inaddition, Fisher also showed that theseparameters are all estimable from data onthe phenotypes of relatives and thus empiri-cally sufficient for the task of predictingshort-term response to selection.

dimensional analysisDimensional analysis is perhaps the most

important practical application of mea-surement theory in theoretical physics andshould be integral to all model building. Inits simplest form, it requires that mathe-matical models be dimensionally consis-tent (i.e., that the units on the left- and theright-hand sides of an equation be equal;apples cannot equal oranges), and it pro-vides a technique for reducing the modelto a minimum number of dimensionless es-sential parameters. Despite sporadic applica-tions (e.g., by Stahl 1961, 1962; Rosen 1962,1978a,b; Gunther 1975; Heusner 1982, 1983,1984; McMahon and Bonner 1983; Prothero1986, 2002; Stephens and Dunbar 1993;Gunther and Morgado 2003; Frank 2009),more formal dimensional analysis has playedlittle role in biological theory or modeling.This is reflected in the arbitrary assumptionsabout functional forms that are commonlyfound in biological models.

The key result in formal dimensional anal-

ysis is Buckingham’s � theorem (Bridgman1922, Chapter 4; Krantz et al. 1971, Theorem10.4). This theorem considers a putative lawor model that relates a set of ratio-scale vari-ables, xi, as f(x1,...xn) � 0, where f is an arbi-trary function. The units of these variablescan be used to identify a set of fundamentalvariables (and scales) that span the n x vari-ables (for the technical meaning of “span”see Krantz et al. 1971). If exactly m funda-mental scales exist, the � theorem guaranteesthat the law can be reformulated as a functionof n - m new variables that all are products ofpowers of the original variables as g(�1, . . ,�n-m) � 0, where �i � x1

a1 � x2a2 � . . . � xn

an, whereeach �i has a different combination of ex-ponents, aj.

If it is known what variables enter a prob-lem, this theorem can be surprisingly pow-erful in simplifying and solving models. Asan example, it can be used to derive the basicexponential growth equation of populationecology from minimal assumptions. Assumethat all we know is that population growthinvolves the three ratio-scale variables popu-lation number at time t, N(t); populationnumber at time zero, N(0); time, t; and asigned ratio-scale variable, the rate param-eter, r. Our model or “law” of populationgrowth then takes the form f(N(t), N(0), r,t) � 0, for an unspecified function f, wherewe also assume that N(t) can be uniquelyexpressed by the rest of the variables. Theunits of N(t) and N(0) are individuals (orindividuals per area), the unit of time is atime interval (e.g., seconds or genera-tions), and the unit of r is the inverse of thetime interval. We therefore have four vari-ables and two fundamental scales, whichare individuals and time. The � theoremimplies we can write the model in terms oftwo dimensionless variables that will haveto involve powers of N(t)/N(0) and powersof rt. The result is g(N(t)/N(0), rt) � 0 forsome function g. By the assumption thatN(t) is a function of the other variables, thisfunction can then be rewritten as N(t) �N(0)h(rt) for some function h. Because hdoes not depend on any parameters beyond rt,this implies that N(t) � N(s)h(r(t�s)) for anytime s � t. Hence, we must have N(0)h(rt) �N(s)h(r(t�s)), and that implies

March 2011 15MEASUREMENT AND MEANING IN BIOLOGY

Page 15: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

h(rt) � (N(s)/N(0))h(rt � rs)� h(rs)h(rt � rs),

which gives us the functional equationh(rt) � h(rs)h(rt � rs). Let k(x) � ln(h(x))and write the equation as k(rt) � k(rs) �k(rt � rs). This shows that k(x) � ax forsome constant, a, and therefore thath(x) � eax. By subsuming the constant ainto r, we have derived the exponentialgrowth equation:

N�t� � N(0)ert

for r � 0 (a similar argument based on r �0 completes the derivation). Note that wedid this without specifying any model ofpopulation growth. The law of exponentialgrowth follows entirely from knowing therelevant variables and their scales. Suchresults are, at first, very surprising, but theysignal the general power of dimensionalanalysis. Knowing which variables are rele-vant to a problem conveys a huge amountof information. Of course, the law of expo-nential growth may not hold if other vari-ables or parameters were involved. If, forexample, we add a carrying capacity, K,with the same units as N, the � theoremwould stipulate three dimensionless vari-ables in g, and additional assumptionswould be needed to produce a specific so-lution. The approach will give misleadingresults if some relevant variable or param-eter is omitted.

Among earlier applications of dimen-sional analysis in biology, we particularlynote Robert Rosen’s program for theoret-ical biology, in which dimensional analysisplayed a central role (Rosen 1978b). Forexample, Rosen (1962) derived D’ArcyThompson’s (1917) theory of transforma-tions of biological form from dimensionalarguments based on fitness optimization.His development was, however, highly ab-stract and, as far as we know, it has not ledto empirical research. Dimensional analy-sis has also played a minor role in discus-sion of physiological scaling relationships(Stahl 1961, 1962; Heusner 1982, 1983,1984; but see Butler et al. 1987 for adevastating critique of Heusner’s use ofdimensional arguments), and such argu-

ments are also central to Charnov’s (1993)theory of life-history invariants, althoughCharnov does not make the connection tomeasurement theory explicit. Stephens andDunbar (1993) developed applications of di-mensional analysis in behavioral ecologywith admirable clarity. They showed how themarginal-value theorem (Charnov 1976) canbe partially derived and illuminated by for-mal dimensional analysis, and they showedthat certain models of optimal territory sizefrom the literature are in fact dimensionallyinconsistent.

We believe that dimensional analysis hasbeen underused in biology. It has the poten-tial to clarify the necessary and sufficient as-sumptions for fundamental models and con-cepts in biology along the lines we havebriefly illustrated for the exponential growthlaw.

meaningful modelingThe formal derivation of the exponen-

tial growth law from dimensional consider-ations shows that it has a status similar tothe familiar “laws” of physics, such as New-ton’s law of gravitation. To see the analogyone must appreciate that Newton’s law isalso an abstraction for the most symmetri-cal case. The gravitational law applies onlyto a rotationally symmetrical gravitationalfield, which does not exist in the solar sys-tem because of variations in the density ofmatter in the planets and the presence ofother bodies. A similar argument can bemade for the general validity of Wright’s se-lection equation, which can also be directlyderived from measurement-theoretical con-siderations (Wagner 2010). Biology includesrather few such laws, however, and we mustaccept that most models in biology are rep-resentations of qualitative relationships thatare too complex to capture in simple law-likerelations. The complexity and high dimen-sionality of the relevant empirical relationalstructures require that some aspects of rep-resentation must necessarily be given up.

Levins (1966) described three modelingstrategies based on which aspects of realityare left out. In type I models, generality issacrificed in favor of precision and realism,as in highly complex fisheries models in

16 Volume 86THE QUARTERLY REVIEW OF BIOLOGY

Page 16: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

which as many specific aspects as possibleof the stock in question are included. Intype II models, realism is sacrificed to gen-erality and precision, as in the Lotka-Volterra models of population ecology. Intype III models, precision is sacrificed togenerality and realism, as in Levins’s ownmodels of selection in heterogeneous en-vironments based on general qualitative as-sumptions about the fitness function (e.g.Levins 1962).

Levins himself favored a research strat-egy in which type III models or multipletype II models are used to reach robustqualitative insights and predictions. Thisapproach contrasts sharply with that takenin most modeling papers in the fields withwhich we are familiar. Typically, a questionis investigated by analysis of a single spe-cific model of type I or II. In many cases,results are obtained solely by simulationsbased on algorithms including numerousauxiliary specifications, so the results nec-essarily have limited applicability. Type Imodels can only make predictions forhighly specific circumstances, and type IImodels can only demonstrate the possibil-ity of a phenomenon, not its plausibility.

These limitations are often forgotten intheoretical biology, where general conclu-sions are routinely drawn from specific mod-els. In the introduction, we mentioned May-nard Smith’s (1976) claim, based on a typeII model, that the handicap principlecould not work. Maynard Smith’s claim waslater shown to be incorrect (Pomiankowski1987, 1988; Grafen, 1990a,b) and the hand-icap principle is now a cornerstone of manysexual-selection models. Using measurementtheory to diagnose the problem, we can seethat arbitrary specific model assumptions,such as linearity, put constraints on the em-pirical relational structure that prevent itfrom capturing the full range of possibilitiesin the idea being modeled.

This dynamic of overly broad claimsfrom narrow models being overturned bytype III models is quite common. For ex-ample, Broom et al. (2005) claimed on thebasis of specific families of cost and benefitfunctions that automimicry, a situation inwhich some defenseless (e.g., nonpoison-

ous) individuals exist within a generallyaposematic species, cannot be maintainedas a stable genetic polymorphism. Sven-nungsen and Holen (2007) considered awider variety of functions and showed thatautomimicry can indeed be evolutionarilystable under a number of realistic condi-tions. A second example is the large lit-erature on whether plasticity, learning,and variation increase or decrease a re-sponse to selection, known as the Bald-win effect (Baldwin 1896). Multiple mod-els have been published showing thateither the Baldwin effect of accelerationwas predicted (e.g., Hinton and Nowlan1987) or favoring the opposite result ofslower evolution (e.g., Borenstein et al.2006). Paenke et al. (2007) demonstratedthat the conclusions of these models weredictated by the change in the slope of therelationship between fitness and the focaltrait. When plasticity increases this slope,the variance in fitness increases accelerat-ing evolution; the converse occurs whenthe slope decreases. The general modelfocuses our attention on a key biologicalaspect of plastic systems that was not pre-viously considered explicitly.

meaningful statisticsMeasurement theory requires that num-

bers only be manipulated in ways that re-tain their representation of the empiricalrelational structure of interest. Statisticalmodels, however, only make assumptionsabout the distributional properties of thedata and can be applied to any set of num-bers that fulfill the distributional criteria.Consequently, statistical education and prac-tice often recommend transformations ofthe data that make the numbers fit thestatistical model. This practice often bringsmeasurement principles and statistical practiceinto conflict. Measurement theory puts se-vere constraints on the statistical manipu-lations that can be done without loss ofsome or all of the meaning present in thedata.

This fundamental truth has not had animportant place in the biostatistical litera-ture, but has been the subject of consider-able debate in psychology (reviewed by

March 2011 17MEASUREMENT AND MEANING IN BIOLOGY

Page 17: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

Hand 1996, 2004; Michell 1999). Somehold that measurement theory is irrelevantto statistics because the “numbers do notremember where they came from” (Lord1953:751), and much statistical practice inbiology seems based on this viewpoint. Forexample, the discussion of transformationsin the context of ANOVA in the widelyused biostatistics textbook by Sokal andRohlf (1995) focuses on convincing thereader that “the scale of measurement isarbitrary, you simply have to look at thedistributions of transformed variates to de-cide which transformation most closely sat-isfies the assumptions of the analysis ofvariance” (p. 412), a sentiment echoed byothers (Dytham 2003; Logan 2010). If thatis statistics, we want no part of it, as scienceis about nature, not numbers. We followAdams et al. (1965) and define a meaning-ful statistical statement as one whose truth isinvariant to permissible scale transforma-tions (in the measurement-theoretic sense)of the underlying data. The sad results ofignoring scale during statistical analysis arelegion. For example, if we have ordinal-scale variables, such as ranks of items, thestatement that the arithmetic mean of onesample is larger than that of another isnot meaningful, because rankings reflectperceived order only, whereas the act ofaveraging assumes that the values reflectmagnitude. Permissible transformationsof ordinal data will usually exist that alterthe order of the means. For example, ifwe have two entities in category B, andthree in category A, and the Bs areranked 2 and 3 out of five, then meanrank of Bs is 2.5 while the mean rank ofthe As is 3.3. Now assume that more en-tities were included in the ranking, sothat the As are now ranked as 1, 9, and10, and the Bs as 7 and 8, yielding a lowermean score for category A (6.7) than forcategory B (7.5). Despite the reversal ofthe order of the means, both rankings areequally valid. A mean is a meaningless sta-tistic for ordinal variables. Wolman (2006)pointed out that conservation biologistsmake precisely this error when making rec-ommendations based on the average of ex-

pert rankings of, for example, species vul-nerabilities or conservation values.

Once an impermissible transformationis used during the analysis of the data, theaspect of reality to which the measure-ments and the associated statistical resultsapply is, by definition, changed. Similarly, non-parametric approaches usually test a hypothe-sis about the sample median rather than themeans. This point is often ignored in theinterpretation of the results of statisticaltests, potentially resulting in meaninglessconclusions. Sometimes transformations ofdata lead to understandable changes inmeaning, as when a log transformationmaps ratio relations into difference rela-tions and changes a log-interval scale typeinto an interval-scale type, but transformationscommonly lead to fundamental changes inthe meaning of the numbers. More oftenthan not, such changes are not communi-cated by the authors of the study. We pre-sent two such examples later in this paper.Fortunately, the generalization of statisticalmodels to encompass distributions otherthan the normal and the rise of numericalmethods of analysis usually obviate the needfor unprincipled transformations of data. Inmany cases, statistical tests are surprisinglyrobust to violations of assumptions (see e.g.,Whitlock and Schluter 2009:403). Neverthe-less, difficult cases will remain in which theassumptions of the available statistical analy-ses are not met, but the transformations thatwould correct the problem would divorcethe measurements from the empirical rela-tions. Collaborations between biologists andstatisticians may be necessary in such cases.

The divorce of statistics from meaning isperhaps most apparent when only thequalitative results of statistical tests aregiven or interpreted, ignoring the numer-ical values of the estimated parameters. Insuch cases, a nominal or ordinal conclu-sion is drawn from estimates that are on amuch stronger scale type. For example, theconclusion of Takahashi et al.’s (2008) studyof sexual selection in peacocks, which issummed up in their paper’s title Peahens DoNot Prefer Peacocks with More ElaborateTrains, was based on a failure to reject thenull hypothesis of no selection. Their best

18 Volume 86THE QUARTERLY REVIEW OF BIOLOGY

Page 18: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

estimate of the strength of selection wasthat a peacock gained 3% more matingsfor every additional eye spot in his tail. Thisis extremely strong selection, three timesthe strength of selection on fitness itself(Hereford et al. 2004). The range in thenumber of eyespots among males was 42,suggesting that the peacock with the mosteyespots had an expected mating success1.0341 3.4 times that of the one with thefewest. Consideration of the estimate ofthe strength of selection, rather than itslack of statistical significance would prop-erly lead to the conclusion that this studylacks the power to detect even extremelystrong female preference rather than theauthors’ misleading conclusion that pea-hens are indifferent to the plumage of pea-cocks.

The confusion of biological with statisti-cal significance is exceedingly common inecology and evolution (see, e.g., Yoccoz1991; Anderson et al. 2000). The serious-ness of the problem is underscored by themore than 300 criticisms of this practicethat Anderson et al. found in the scientificliterature through the year 2000; manymore have surely accumulated by now. Theprevalence of such errors in the face of somany efforts to prevent them indicates asystemic problem; we believe the problemis the lack of awareness of what measure-ment theory tells us about the meaning ofnumbers. Biologists accept P-values as mea-sures of effects for the same reason thatthey commonly neglect to report and con-sider units and transformations. The skillof interpreting numbers is neither taughtnor practiced: too often quantification iswindow dressing for qualitative arguments.

Measurement theory is a description ofwhat meaningful quantification entails. Itis precisely the medicine needed to restoremeaning to statistical practice. Fulfillingthe assumptions of the statistical model isimportant, but it is by itself useless if thestatistical model does not respect the em-pirical content of the data.

Measurement in Biological PracticeThe principles we have outlined above

are so near to platitudes that it may seem

odd that we think them worth emphasiz-ing. There are, however, numerous exam-ples of errors arising from violations ofthese principles—sometimes for represen-tational errors, but just as frequently forsimple conceptual errors that lead to irrel-evant measurements. Even more troublingis that the perceived best practice in a par-ticular field may incorporate a measure-ment error that renders an entire class ofstudies meaningless. We present a sam-pling of such errors here.

example 1: remember contextThe actual theoretical context and the

measurements performed can sometimes bemismatched. The most spectacular errors ofthis type occur when the investigator losestrack of the theoretical context that is theostensible purpose of a study. A striking ex-ample of this can be found in much recentwork on the evolution of allometry. JulianHuxley (1924, 1932) introduced allometry asa simple scaling relationship between a trait,Y, and overall body size, X, as a power law,Y � aXb, which is usually and more conve-niently expressed as the linear relationshiplog(Y) � log(a) � b log(X). Empirically,such linear log-log relations are commonboth within species (static allometry) and be-tween species (evolutionary allometry). Hux-ley showed that static allometries betweentwo traits will result when they are undercommon growth regulation, and the ideathat evolution may be constrained to followstatic allometries has received considerableattention (e.g., from Gould 1977). On thebasis of his model, Huxley (1932:5–6) feltthat the intercept log(a) was “of no particu-lar biological significance,” but the expo-nent, b, “has an important meaning.”Indeed, in physiology, biomechanics, andlife-history theory much theoretical andempirical work has gone into establishingthe exact value of the allometric exponentfor different traits (e.g., Schmidt-Nielsen1984; Charnov 1993). When the overall re-lationship between size and another traitvaries, the intercept is usually far more vari-able than the slope (Teissier 1936; Whiteand Gould 1965; Jerison 1969; Greenewalt1975; Dudley 2000).

March 2011 19MEASUREMENT AND MEANING IN BIOLOGY

Page 19: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

Although many authors follow Huxley’sconception of allometry as “the slope ofa . . . log-log regression of the size of astructure on body size” (Eberhard 2009:48), some have adopted a “broad” sense ofallometry as “[c]hanges in shape which ac-company change in size” (Mosimann 1970:930). In this usage, all kinds of nonlinearand even sigmoid and discontinuous(threshold) relationships are referred to asallometric (Frankino et al. 2010). Thus,broad-sense allometry is essentially synony-mous with shape and, therefore, withoutprecise connection to any of the theoreti-cal models that have motivated the interestin the allometric slope as an importantevolutionary constraint, whether Huxley’shypothesis of common growth regulationor more sophisticated models from physi-ology, biomechanics, or life-history theory.

Several investigators over the last 20years have claimed to alter allometry byartificial selection on trait indices (Weber1990, 1992; Wilkinson 1993; Frankino etal. 2005, 2007). Unfortunately, they makeonly passing reference to the theoreticalconcepts that have defined the interest ofallometry to biology. None, for example,log-transformed their data. The study byFrankino et al. (2005) is illustrative. Thetitle of the study, the abstract, and the textformulate the problem as explaining theconservatism of “allometry” or “scaling re-lationships,” but include no discussion ofspecific allometric models. The implica-tion, for those familiar with the usual con-ception of allometry as slope on a log-logscale, is that the study will be about theevolution of this slope. In their experi-ment, Frankino et al. (2005) sought to al-ter wing loading, the relationship betweenwing area and body mass, in the butterflyBicyclus anynana. To do so, they selectedfor increased or decreased individual devi-ations from the major axis of variation be-tween forewing area and body mass. Theyachieved statistically significant responsesin this index.

What do these results mean for allome-try, the stated subject of the study? Theanswer is entirely unclear, because the typeof selection used by Frankino et al. will

favor changes in both intercept (log(a))and allometric coefficient b. The only win-dow through which the reader can assessthe nature of the responses is in a figure(Frankino et al. 2005, Figure 2A) that pres-ents the distributions at the end of theexperiment on an arithmetic scale. Noanalyses are made on the log scale, how-ever, and no attempt to determine whetherthe differences are in the slope or the in-tercept. The theoretical context of allome-try is not incorporated into the design oranalysis of this study, despite the author’sinvocation of the concept of allometry.

All of these experiments (Weber 1990,1992; Wilkinson 1993; Frankino et al. 2005,2007) alter something about the relationshipbetween traits and clearly demonstrate thatbody shape shows genetic variation, avaluable conclusion. All of them invokethe concept of allometry, but how to in-terpret their results in terms of allometryis perfectly ambiguous. One extreme in-terpretation is that all of the response isin the parameter a, suggesting that, con-trary to the conclusions of these papers,the allometric slope is constrained by alack of genetic variation. A second inter-pretation is that some part of the response is inthe exponent b and that natural selection canindeed reshape allometric relationships easily.

Meaning is lost in this class of studiesbecause a theoretical context is invokedbut then ignored. The design of these ex-periments precluded testing the claim thatallometry can readily be altered, becausethe measurements and analyses did not re-flect hypotheses about allometry. This is anerror in specifying the attribute of interestas any aspect of the relationship betweenpairs of traits rather than as the slope oftheir relationship measured from log-transformed data.

example 2: do not use a rubber rulerConsider an animal foraging in an envi-

ronment containing two equally commontypes of resource patches, where one typeof patch has twice the payoff of the other.The animal chooses one patch at randomand finds it has payoff x, but it cannot tellwhether it is a good or a bad patch. Should

20 Volume 86THE QUARTERLY REVIEW OF BIOLOGY

Page 20: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

it then switch patches? If it switches, it hasa 50% chance of encountering the samepayoff x but also a 50% chance of encoun-tering a different payoff. In the case inwhich the payoff is different, the chancesare equal that the new patch will have twicethe value (2x) or half the value (x/2) of theoriginal patch. From optimal foraging the-ory, we might then argue that the animalshould switch patches, because the ex-pected payoff for the new patch would be2x�1/2�x/2�1/2 � 5x/4 � x. Notice thatthis conclusion would be the same whetherthe animal chose the good patch or thepoor patch on the first attempt. The grassis always greener on the other side!

This conclusion is clearly absurd, butidentifying the error is not trivial. Thecause is failure to adopt a common scale.To see the problem, first define the payoffin the poor patch as z and that in the betterpatch as 2z. In terms of this fixed scale, weeasily see that the payoff of the first patch isx1 � 2z�1/2 � z�1/2 � 3z/2, whereas thepayoff of the second is x2 � z�1/2 � 2z�1/2 � 3z/2, giving the obviously correct an-swer that x1 � x2. With this definition inmind, we can return to the first equation,and diagnose the problem; x is used toequal both z and 2z in the same equation.

This example, derived from the enve-lope paradox in probability theory (Hand2004), illustrates the dangers of a lack ofawareness of scale. In fact, scales that de-pend on the entity to be measured arecommon in evolutionary biology. The useof heritability to measure evolutionarypotential is one important example. Wecan define short-term evolvability as theexpected response to a given selection gra-dient (Houle 1992; Hansen et al. 2003;Hansen and Houle 2008), and under cer-tain assumptions it equals the additive ge-netic variance (Lande 1979). To compareadditive variances across traits and popula-tions, we need a common scale. The stan-dard practice has been to use the trait’spopulation variance as the scaling unit. Di-viding the additive variance by the popula-tion variance yields the heritability, h2,which is a dimensionless number. Note,however, that the population variance con-

tains the additive variance as a componentand, furthermore, that its other compo-nents, such as epistatic and environmentalvariances, tend to be correlated with theadditive genetic variance (Houle 1992).Therefore, when heritabilities are used forcomparison of evolvability across traits andpopulations, we use a scale that dependsstrongly on the entity to be measured. Thenaive use of h2 as a measure of evolutionarypotential has lead to some highly dubiousgeneralizations, such as the idea that life-history traits and fitness components areless evolvable than morphological traits(see, e.g., Roff and Mousseau 1987). Houle(1992, 1998; Houle et al. 1996) proposedusing a mean-standardized scale (such as acoefficient of variation or its square IA) be-cause it is not necessarily a function of theattribute to be measured; on this scale, thelower h2 of life-history traits is clearly due tohigh levels of environmental variance. Theuse of the variance-standardized scale givesa misleading conclusion for reasons strik-ingly similar to the one that generates thepatch paradox. The naıve expectation thatevolvability can be measured equally wellby either h2 or mean-standardized additivevariance is wrong; in fact they are almostuncorrelated (Houle 1992).

In this case, the measurement error is inselecting an attribute to measure, h2, that isunsuited to the theoretical context. Heri-tability is unsuitable because it standard-izes a quantity of interest with somethingto which it is autocorrelated. This exampleshows that meaning is drastically altered byseemingly innocent choices of scale. Thechoice of scale is a fundamental and non-trivial part of both model building and sta-tistical estimation. A lack of awareness ofsuch issues is a major source of misreport-ing and misinterpretation of biological re-sults.

example 3: interpret your numbersYet another way of robbing a study of

meaning is to fail to specify a theoreticalcontext. A good example is Kingsolver etal.’s (2001) paper, The Strength of Pheno-typic Selection in Natural Populations.One of their conclusions was that “direc-

March 2011 21MEASUREMENT AND MEANING IN BIOLOGY

Page 21: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

tional selection on most traits and in mostsystems is quite weak” (p. 253), yet King-solver et al. never specified a theoreticalcontext that would help to determine thestrength of selection. As pointed out byConner (2001), this omission leaves read-ers wondering whether selection is reallystrong or weak.

Kingsolver et al. reviewed a large sampleof the available estimates of linear selec-tion gradients obtained in wild popula-tions (Lande and Arnold 1983; Endler1986). The linear selection gradient mea-sures the change in relative fitness for aunit change in the value of a trait. Theyfollowed Lande and Arnold (1983) inadopting the unit of trait standard devia-tion as the basis for comparison across dif-ferent traits and populations. Despite theclaim about weak selection in the abstract,the text of the paper contained just twobrief statements about selection strength,on pages 250–251 and 253. The medianstrength of selection of 0.16 is termed“rather modest,” and values greater than0.5 are “very strong,” but no discussion isincluded of criteria for deciding what con-stitutes strong or weak selection. Their ar-gument for the rarity of strong selection isinstead based the approximately exponen-tial distribution of the absolute values ofthe gradients with a mode at 0. Thischange of emphasis from that implied bythe title and abstract is made explicit in theMethods section, where Kingsolver et al.state “the overall ‘average’ strength of se-lection is unlikely to be very informative.We focus our analyses on the distributionsof selection strengths” (p. 248). The con-clusion drawn from this analysis would bemore accurately stated as “most selectionestimates are much smaller than the ex-treme estimates.”

With respect to the strength of selection,therefore, Kingsolver et al. (2001) implythat they have a theoretical context, butthen decline to apply any theory or con-cepts related to the units of measurement.Symptomatic of this problem is that theunits of measurement, although clearlystated in the Methods section, are neverattached to any of the actual measure-

ments in the paper. If the units arestated, Kingsolver et al. (2001) state-ments that the median strength of selec-tion, a 16% change in relative fitness witha one-standard-deviation change in thetrait, is modest, whereas a change of 50%with a one-standard-deviation change isvery strong selection, sound rather con-tradictory.

Clearly, for comparison of the strengthsof selection on different traits in differentorganisms, some standard scale must beadopted, but which scale? Hereford et al.(2004) proposed two criteria for judgingthe strength of selection and showed thateach is naturally addressed on a differentscale. Under frequency-independent selec-tion, the strength of selection is naturallyviewed as a function of the adaptive land-scape, which is external to the populationsubject to selection. Keeping this theoreti-cal context in mind makes clear that stan-dardizing the selection gradient by thetrait standard deviation has problems sim-ilar to those of basing a measure of evolv-ability on heritability. To get at this con-cept of the steepness of the adaptivelandscape in the neighborhood of the pop-ulation, the variance-standardized measureimmediately requires a second measure ofthe variation in the trait and in fitness itselfto reveal whether a large value of the stan-dardized gradient is due to a steep land-scape in a population with small variationor to a population with a large amount ofvariation on a relatively flat landscape.

Hereford et al. (2004) instead standard-ized the selection gradient by the trait mean,which gives the change in relative fitness fora proportional change in trait. On a mean-standardized scale, the strength of selectionon fitness itself is one, providing a bench-mark for strong selection (Hansen et al.2003). Hereford et al. (2004) showed thatthe median mean-standardized selection gra-dient was 0.31, or 31% of the strength ofselection on fitness, and on this basis con-cluded that the results were “notable forthe extremely strong selection observed”(p. 2141).

As noted above, however, an alternativeidea of strength of selection could be based

22 Volume 86THE QUARTERLY REVIEW OF BIOLOGY

Page 22: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

on the change of fitness within the range ofthe population—that is, strong selection oc-curs when the expected fitnesses of individ-uals within the population are typically verydifferent (Hereford et al. 2004). In this the-oretical context, a variance-standardizedruler is no longer rubber, but tells us justwhat we need to know. For example, in apopulation with a range of phenotypes offour standard deviations (readily foundwithin a normally distributed populationof modest population size), the medianvariance-standardized selection gradientfrom Kingsolver et al. (2001) of 0.16 pre-dicts that a typical least-fit individual twostandard deviations below the mean has afitness only 52% that of the typical most-fitindividual, two standard deviations abovethe mean. This again sounds like strongselection, contrary to the conclusions ofKingsolver et al. (2001), but for completelydifferent reasons than in the mean-standardized case. Which scale is used re-ally does matter: if the fitness landscapeswere left unchanged, but the traits studiedhad much smaller coefficients of variation,the differences in fitness on a variance-standardized scale would also have beenmuch smaller.

example 4: respect scale typeFitness is one of the most fundamental

quantitative concepts in biology. It predictsthe evolutionary outcome of competitionamong genotypes or phenotypes for repre-sentation in the next generation. Twotypes of fitness measures are commonlyused in biology, Wrightian fitness w andMalthusian fitness m. Wrightian fitness nat-urally arises in models where changes ingene or genotype frequencies are pre-dicted by the equation

p �pww

,

where p is the relative frequency of a geno-type before selection, p the frequency afterselection, and w the mean fitness of the pop-ulation. The biological meaning of Wright-ian fitness is contained in the ratio of fitnessmeasures, as reflected in the use of “relative

fitness” in population genetic theory. Wag-ner (2010) showed that Wrightian fitness,w, is a ratio-scale measure under the as-sumption that the time scale is fixed. Werelax this assumption here because conclu-sions about fitness can be drawn on anytime scale. Fitness, therefore, becomes alog-interval-scale variable, as the conclu-sions are invariant to power transforma-tions of fitness, so we are equally entitled todraw conclusions about two or more epi-sodes of selection. In this case, the expo-nent is determined by the time scale. Theresult of selection is therefore invariant toa transformation: w3 awb, a, b � 0, but notto the transformation w 3 w�c.

The Malthusian parameter m measures fit-ness in continuous time. If the age structureof the population is stable, the instantaneousrate of change in genotype frequency isgiven by the Crow-Kimura differential equa-tion:

p � p�m � m�.

The Malthusian fitness measure m is ap-proximately related to Wrightian fitness(under weak selection) by m � ln w. In thiscase, the biological meaning of the fitnessmeasures is contained in the differences ofthe fitness values, rather than the ratios,and thus m is an interval-scale variable. Thepermissible scale transformations are there-fore of the form m 3 am � b, where adetermines the change in time scale. Thisconclusion is of course unsurprising as m isthe log of w and w is a log-interval-scalevariable. A somewhat counterintuitive con-sequence of this mathematical fact is thatonly the differences between ms have evo-lutionary meaning; neither the sign northe absolute magnitude is meaningful forevolution, although they have ecologicalmeaning as growth rates.

The difference in scale types of fitnessmeasures has important consequences inexperimental practice that have not al-ways been respected. An example of theproblem arises in Remold and Lenski’s(2001) study of the causes of genotype-by-environment interaction in Escherichiacoli. Remold and Lenski studied the ef-fects of single mutations on fitness when

March 2011 23MEASUREMENT AND MEANING IN BIOLOGY

Page 23: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

bacteria were grown in different environ-ments. They used a fitness assay in which amutant strain, M, competed against a stan-dard strain, S, starting at low density andequal frequencies of the two genotypes(Lenski 1988). They estimated the initialand final frequencies by spreading sam-ples on agar plates and recording thenumber of colonies of each type. Thesecounts are used to estimate the Wright-ian fitness of genotype x as Nx/Nx � wx

where Nx is the initial count and Nx is thefinal count. To estimate the Malthusianfitness, m, Remold and Lenski treated thechanges in bacterial density as the resultof an exponential growth process

Nx�t� � Nx�0�emt

(Lenski 1988). One can therefore use thecounts of bacterial cultures Nx and Nx toestimate m for each genotype as

ln�NM

NM� � ln�wM� � mM

.

ln�NS

NS� � ln�wS� � mS .

Recall that the biological meaning is inthe magnitude of the difference, mM � mS,rather than in their ratios. Remold andLenski (Lenski et al. 1991; Remold andLenski 2001), however, took their ratio todefine a “relative fitness”:

fM,S �mM

mS.

The measure fM,S is still a measure of rela-tive fitness in the sense that, if genotype Mhas higher fitness than S, then fM,S�1, andif a third genotype R has higher fitnessthan M, then fR,S�fM,S. This measure pre-serves the order of fitness values betweengenotypes and can therefore be used as anordinal-scale variable, but the magnitudeof these fM,S values has no biological mean-ing. This problem is largely innocuous iffM,S values are used only to test for theexistence of differences in fitness in a sin-gle environment, say between an ancestraland a derived genotype, but fM,S does not

measure the magnitude of the fitness dif-ference and is therefore meaningless whenwe compare fitness in different environ-ments, as Remold and Lenski (2001) did.

To make the problem concrete, imagineexperiments where genotypic fitnesses aremeasured in only two environments andthe fitnesses of the two genotypes in theenvironments are different, mM�1 mM�2

and mS�1 mS�2. First assume that the differ-ences between the fitness values in the twoenvironments are the same, mM�1 � mS�1 �mM�2 � mS�2, so that the allele frequenciesfollow exactly the same trajectory in bothenvironments. If we compare the alternativerelative fitness measures based on ratios ofms, we find that fM,S�1 fM,S�2, erroneouslysuggesting the presence of genotype-by-environment interaction. For example, if weassume that the Malthusian fitnesses aremM�1 � 0.02, mS�1 � 0.03, mM�2 � 0.01, andmS�2 � 0.02, the fitness differences withineach environment is 0.01, and genotype andenvironment do not interact because selec-tion proceeds exactly the same in each envi-ronment. Using these numbers, however,fM,S�1 � 2/3 and fM,S�2 � 1/2, leads us to con-clude that a genotype-by-environment inter-action is present. If, on the other hand,mM�1 � 0.015, mS�1 � 0.03, mM�2 � 0.01, andmS�2 � 0.02, the ratios are constant, leadingone to infer an absence of interactions, whenthe differences show a true genotype-by-environment interaction. Using this approach,Remold and Lenski (2001) concluded that noevidence supported genotype-by-environmentinteractions involving temperature, but that in-teractions involving the type of carbon re-source offered were abundant. Neither ofthese conclusions is justified.

Remold and Lenski (2001) came to mean-ingless conclusions because they used ratiosto interpret data on an interval scale type.Respect for scale type is essential for draw-ing inferences about an empirical systemfrom the measurements obtained by exper-iment. In this case, the actual measure-ments were appropriate to the question,and the measurement error came from amismatch between the summary statisticchosen and the scale type.

24 Volume 86THE QUARTERLY REVIEW OF BIOLOGY

Page 24: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

example 5: do not let statisticsoverrule meaning

An example of the conflict between sta-tistics and meaning is a study of the rela-tionship between morphological diver-gence and divergence in genetic variancematrices performed by Podolsky et al.(1997). The theoretical context for thisstudy is the prediction of longer-term evo-lutionary potential from estimates of thewithin-population genetic variation. Thatgenetic variation is summarized in a ma-trix, G, which contains the additive geneticvariances for each traits and the covari-ances between them. The Lande model(Lande 1979) predicts how variation af-fects divergence over many generations ifG remains sufficiently stable over the rele-vant time scale. The question of whetherstability is expected or observed is stilllargely unresolved (Steppan et al. 2002;Arnold et al. 2008), as G matrices are com-plex objects that are difficult to estimateprecisely. The Podolsky et al. study was arelatively early attempt to address this ques-tion in a maximum-likelihood frameworkand had a positive impact by highlightingthe problems of statistical power in suchstudies.

Podolsky et al. (1997) compared popu-lation means and G matrices for lengths ofnine structures in 11 populations of theplant Clarkia dudleyana. They judged dis-parity between matrices using the averagesquared difference between their ele-ments. Divergence between populationswas measured as Mahalanobis distance, D2,which is the distance in variance units inthe direction between the two samples inmultivariate space. The problematic aspectof their analysis is that they used a widevariety of transformations of the underly-ing data before estimating the G matrices;five of the nine lengths were untransformed,three were square-root transformed, and onewas transformed as y � ln(ln(x) � 1). Podol-sky et al.’s (1997) stated goal in choosingtransformations was to transform “to nor-mality as feasible” (p. 1787). Departuresfrom normality can cause serious estima-tion problems for maximum-likelihood

algorithms based on the Gaussian likeli-hood, and the transformations thereforehave a good statistical justification giventhat the method of analysis is tailored tonormally distributed data.

On the other hand, transformations alsoalter the relationships between means andvariances, which were the subject of Podol-sky et al.’s (1997) analysis. Equalization ofvariances among groups in an ANOVA isan alternative reason for choosing particu-lar transformations and one that most stat-isticians regard as more important thantransformation to normality. For example,size, such as the lengths of body parts usedin this study, are often approximately log-normally distributed, so variance and cova-riances scale with the mean on the linearscale, but would be independent on thelog scale. By choosing a variety of transfor-mations of the variables with regard to nor-mality of residuals, Podolsky et al. alteredthe mean-variance relationships that theywished to study in different ways for differ-ent traits. These transformations affectedboth the distance used to judge divergenceof means and the estimates of G whosedisparity was predicted. Calculated fromthe transformed data, D2 combined datafrom many incompatible scales, renderingany test of mean-variance relationshipsmeaningless; if different transformationshad been applied, the results could verywell have been different (Adams et al.1965).

We do not mean to suggest that no trans-formations should ever be used. A com-mon and principled type of transformationwould be to log transform all of the data,converting to a different but equally validscale type (interval or difference), wheredifferences play the role of ratios on theoriginal scale. In other theoretical con-texts, transformations to still other scaleswould be appropriate. For example, if theinvestigator had valid reasons to believethat natural selection had a simple rela-tionship with the square root of length,square-root transformation would be justi-fied in a study of natural selection. Rarely,however, are such empirically based argu-

March 2011 25MEASUREMENT AND MEANING IN BIOLOGY

Page 25: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

ments offered for the use of anything otherthan a logarithmic transformation.

Transformations very frequently changethe scale type and, therefore, the meaningof measurements. In many cases, the resultis meaningless tests of hypotheses. In thiscase, the error was a general one of com-paring quantities that were incommensu-rate because transformation placed themon different scales.

example 6: treat measurements asmeasurements

An exceedingly common measurementerror in biology is the neglect of units,which changes the carefully gathered mea-surements into a mere set of numbers. Thiserror is obvious to the prepared mindwhen a table of naked numbers is shame-lessly paraded in front of us. Sometimesthe units of each measurement can be di-vined, by laborious shuffling from methodsto appendices to tables to results, but this isa chore no reader should be subjected to.A deeper symptom of the insidious neglectof units is evident from the curious case ofquadratic selection gradients.

Lande and Arnold (1983) showed thatthe relationship between fitness and traitvalues could be approximated from linearand quadratic regression coefficients. If z isthe deviation of an individual trait fromthe population mean, then the Lande-Arnold model rewrites relative fitness as

w � � � �z �12

�z2 �

where � is mean fitness and is residualvariation. Extension to the multivariatecase is straightforward. The linear gradi-ent, �, is necessary for prediction of theshort-term response of the mean to naturalselection, whereas the more complete de-scription of the selection surface including� is important for understanding how vari-ance is changed by selection (Lande 1980),for visualizing the selective surface (Phil-lips and Arnold 1989), and for understand-ing whether the one-generation predictionof the response to selection is likely to holdover longer time scales. The quadratic ver-

sion of the Lande-Arnold model has beenwidely used. Kingsolver et al. (2001) found573 estimates of � in a sample of 63 studies,and Stinchcombe et al.’s (2008) muchmore limited survey of one journal from2002 through 2007 turned up 32 addi-tional studies with 673 estimates of �.

Surprisingly, Stinchcombe et al. (2008)discovered that the majority of published es-timates of univariate � were off by a factor of2. The source of this error is that two differ-ent models are used in the Lande-Arnoldapproach: conceptually, Lande and Arnold(1983) preferred to think of the average cur-vature of the selective surface around thepopulation mean and so defined the param-eter � as the second derivative of the fitnesslandscape; fitness is a function of (1⁄2)�. Onthe practical side, Lande and Arnold showedthat the multiple regression of trait values onrelative fitness can be used to estimate theparameters � and � in the above model.Regression models are parameterized

w � � � bz � qz2 � ,

so that, although b��, q�1/2�. Stinch-combe et al. (2008) estimated the pro-portion of studies that reported q as � byasking the authors of a sample of papershow they calculated �. A stunning 78% ofthe authors reported using q as �.

How is it that hundreds of papers con-taining thousands of estimates can be pub-lished over 27 years before anyone noticesthat the majority of the estimates arewrong? The answer seems to be a systemiclack of respect for measurement and mod-els in biology. Despite the wide use of theLande-Arnold method, the interest of mostusers has been in determining whether lin-ear or quadratic selection can be detectedby a statistical test for selection, not in theactual strength or shape of selection. Ex-ceedingly few authors have used theLande-Arnold parameters � or � to makepredictions about evolution or variation,and even fewer have tried to test thosepredictions (for exceptions see Postma etal. 2007). The numerical estimates haveserved a mainly decorative purpose.

Biology is in general poised between be-ing a descriptive, qualitative science and a

26 Volume 86THE QUARTERLY REVIEW OF BIOLOGY

Page 26: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

quantitative one. Many questions have onlyqualitative answers: Is this organism knownto science or undescribed? Which piece ofland should be purchased to further con-servation goals? Other attributes of naturemay productively be treated as qualitative,even though underlain by quantitative re-ality. For example, knowing whether wecan reject the idea that a particular bit ofDNA is evolving at the neutral rate is use-ful. Consequently, we sometimes forgetthat there are cases where quantity matters,such as quadratic selection gradients. Thelack of attention to the definition of � hasdamaged the meaning of 20 years of pub-lished estimates. This problem matters.Stinchcombe et al. demonstrated that thisnumerical error can suggest that multivar-iate fitness landscapes have a shape quali-tatively different from their real one. Manyhypotheses about the maintenance of ge-netic variation depend on the actual valueof � (see, e.g., Turelli 1984).

example 7: know what yourparameters mean

Our final two examples are cases whereparameters of models are assigned a mean-ing that they do not actually have. A goodexample is the idea that negative geneticcorrelations were the necessary conse-quence of an evolutionary theory of limitson life history, which somehow becamewidespread during the 1980s (see, e.g., Belland Koufopanou 1986). This misconcep-tion occasioned considerable debate andadditional experiments when the data didnot conform to this expectation (see, e.g.,Rose 1984; Reznick 1985). The entire con-troversy rests on a misinterpretation ofwhat a genetic correlation, and the geneticcovariance it standardizes, means. Geneticcovariance quantifies the degree of depen-dence of one trait on another. No discontin-uous change in the response to selectionaccompanies a change in sign of a geneticcorrelation (Via and Lande 1985; Charles-worth 1990; Houle 1991; Fry 1993). Whenthe covariance goes from negative to posi-tive, the correlated effects of selection simplygo from slightly negative to slightly positive.Instead, the conclusion that tradeoffs are not

perfect (because correlations are ��1)should have been apparent from the be-ginning. Half a generation of biologistsworking on the genetics of life historiesspent too much of their time worrying thattheir estimates of correlations were notnegative enough because some referred areasonable hypothesis (that tradeoffs areimportant) to an irrelevant attribute (thesign of the genetic correlation). The hy-pothesis of constraint directly predicts aboundary beyond which fitness cannotevolve. It does not predict genetic correla-tions without subsidiary assumptions, suchas a lack of deleterious mutations.

example 8: make meaningful measuresOne of our roads to discovering mea-

surement theory came through a desireto represent the evolutionary effects ofepistasis (Wagner et al. 1998; Hansenand Wagner 2001b). Fisher’s definitionof additive effects has proven extremelyfruitful because it was defined to quantifythe importance of parent-offspring re-semblance in the response to selection.In contrast, the statistical measures ofgene interaction (Fisher 1918; Cockerham1954; Kempthorne 1954)—dominance vari-ance, additive-by-additive variance, andadditive-by-dominance variance, among oth-ers—have not proven useful for understand-ing of the dynamical role of gene interaction inevolution because they represent a statisticalmeasure of departures from simpler models inan ANOVA framework rather than a measureof the dynamical consequences of epistasis. Inspite of this, the use of statistical measures ofepistasis has persisted over many years becausethese were for a long time the only measures ofepistasis that had been suggested. As a result,when biologists wished to study the interestingphenomenon of epistasis, for example, in-spired by Wright’s shifting balance theory, theynaturally turned to the statistical measures,even in the absence of explicit theory to justifytheir relevance. As a result, the statistical mea-sures of epistasis have been assigned variousmeanings that they do not possess, with effectssimilar to the case of the sign of genetic corre-lations. For example, the “statistical” conceptu-alization excludes systematic directional effects

March 2011 27MEASUREMENT AND MEANING IN BIOLOGY

Page 27: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

by definition and has lead to the widespreadbut incorrect notion that the influence of epis-tasis on the dynamics of quantitative charactersunder selection can be ignored (see Carter etal. 2005; Hansen et al. 2006; Pavlicev et al.2010). The only explicit dynamical role sug-gested for statistical epistatic variance was that itcould be converted to additive variance by ge-netic drift during population bottlenecks andthus boost evolvability during founder events(see, e.g., Goodnight 1987; Cheverud andRoutman 1996). This interpretation is mislead-ing because, although epistasis causes additiveeffects to change when allele frequencieschange, the increase in additive variance is notproportional to any decrease in epistatic vari-ance components (Barton and Turelli 2004).The change in additive variance under geneticdrift depends strongly on the pattern of epista-sis and the additive variance may even decreasein expectation under some kinds of epistasis(Hansen and Wagner 2001b).

Seeing this, however, required definingnew measures of epistasis that capture dy-namically important properties. A non-statistical representation of epistasis hadbeen proposed by Cheverud and Routman(1995), and Wagner et al. (1998) followedthis proposition by defining epistasis interms of changes in the effect of allelesubstitutions with changes in the geneticbackground (i.e., substitutions at otherloci). Hansen and Wagner (2001b) devel-oped the multilinear representation of thegenotype-phenotype map along these linesand showed how the new representation ofepistasis can be related to the statisticalrepresentation used in quantitative genet-ics (for further development see Alvarez-Castro and Carlborg 2007). Using thismodel, we could show how different patternsof epistasis affect evolutionary dynamics(Hansen and Wagner 2001a,b; Hermissonet al. 2003; Carter et al. 2005; Hansen et al.2006; Fierst and Hansen 2010). An impor-tant result is that epistasis can accelerate aresponse to selection in the presence of asystematic pattern of positive interactionsbetween the effects of substitutions at dif-ferent loci. Conversely, a systematic pat-tern of negative interactions leads to ca-nalization. Carter et al. (2005) identified

a “directional” epistatic parameter thatmeasures these effects. This directionalparameter is dynamically meaningful in away similar to that of Fisher’s additiveeffect and additive variance. It capturesthe second-order effect of evolutionarychanges in the additive effects, as shownin Figure 3.

What Can We Do?The problems that we have described

are widespread in biology. We have cho-sen examples mostly from the evolution-ary biology literature because that is ourown area of expertise, and those areasare the ones in which we can more con-fidently diagnose the issues raised byeach example. We strongly suspect thatecologists, for example, will find similar

Figure 3. Effect of Pairwise Epistasis onresponse to Directional Selection

Each trajectory is the mean of 100 replicateindividual-based simulations; the bars show � 1 S.E.The error bars are smaller than the symbols for mostof the cases. Each simulation models populations of1000 individuals with 20 loci influencing a single traitwith identical allele frequencies and allelic effectsthat furnish an initial additive genetic variance of 1.0.Mutation is absent. Populations were subjected to 100generations of positive directional selection ofstrength 1% change in relative fitness per trait unit.Directional epistasis is defined as in Carter et al.(2005), where is the mean strength of directionalepistasis, and � is the standard deviation of the di-rectional epistasis between pairs of loci. The fourcases shown are: No epistasis � 0, � � 0; Nondi-rectional epistasis � 0, � � 1.41; Positive epistasis � 1, � � 1; and Negative epistasis � �1, � � 1.

28 Volume 86THE QUARTERLY REVIEW OF BIOLOGY

Page 28: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

problems with the measurements meantto instantiate such concepts as popula-tion size, density dependence, interac-tion strength, and productivity (see, e.g.,Wulff 2001; Vik et al. 2004).

Some readers may feel that our exam-ples are simply unfortunate isolated errorsby individual researchers. Such errors arecommon, but ultimately inconsequentialbecause of the self-correcting nature of sci-ence. We emphasize that the examples wediscuss are symptomatic of systemic errors thatreflect widely held beliefs and attitudes. Manyof our examples involve highly competent, in-deed leading, researchers, and the problemsare repeated in many other studies. For exam-ple, the much-criticized errors in Harvey andClutton-Brock’s (1985) compilation of primatebody-size data might be treated as their respon-sibility alone, but Smith and Jungers (1997)traced the way these data were transferredfrom source to source and often lost theirmeaning and context along the way. Eventu-ally, in their final destination, the result wasabsurdities such as assigning six “means” tomales and females of three species of Ateleson the basis of what were originally mea-surements from a total of three individuals.The shortcomings and errors by Harveyand Clutton-Brock (1985) arose becausethe “means” were repeatedly presentedwithout standard errors, sample sizes, orreferences to original sources, and the er-rors were thus allowed to propagate (Smithand Jungers 1997). This case is a scandalbecause of the long chain of authors whosuccumbed to the temptation to use datawithout checking original sources, andall of the reviewers and editors who didnot object. Similarly, the ideas that thesign of genetic correlations was an impor-tant attribute and that heritability predictsevolvability became problematic not whenthe first author made this claim, but as thenumber of those who accepted the incorrectpremise grew; these are systemic problems.

A related objection is that our extensionof measurement theory to incorporate thetheoretical context and hypothesis genera-tion is not helpful because we are discuss-ing “good science, clear thinking, or anappreciation for the subtlety of an argu-

ment”—things that good scientists alreadyknow and value. We agree that this is pre-cisely what we are doing, but disagree thattaking a measurement-theory stance is nothelpful. We have known for years aboutmany of the examples we discuss above,but we find that the terminology and mind-set of measurement-theoretic thinking al-low us to diagnose and discuss commonfeatures of these cases that combine tomake them examples of flawed or uselessscience. For example, we find clarifying theability to declare that using the sign of agenetic correlation to indicate whether atradeoff exists is an attribute error. Ourclaim is not that conceptual measurementtheory is revolutionary, but that it aidsclear thinking about good science.

The most practical counterweight to themany possible pitfalls that accompany thetask of measurement is awareness. If ourexperience is any indication, as soon asone becomes explicitly aware of measure-ment as a discrete topic, much unsatisfac-tory science is readily diagnosed in thoseterms. In this way, measurement theorycan become a routine background to read-ing and reviewing the literature, discussingit with colleagues, and designing researchstrategies. Once a culture of thinkingabout meaning and measurement is inplace, many of the types of errors that wehave documented will be more readilyavoided during the design and analysis ofexperiments, caught by reviewers beforethey are published, or quickly noticedwhen they are.

A second way to address the measure-ment problem is through education. Mostgraduate programs include a requiredbackground of at least one course in statis-tics, but we have never come across acourse in meaningfulness or even measure-ment theory—it is curious that so little ef-fort is spent learning what our conclusionsmight mean, but substantial effort is spenton how to quantify those conclusions. Wehope that someday the theory of measure-ment and meaning will have its place inscientific education alongside, and as apartial counterweight to, statistics. Lookingfarther down the road, if explicit measure-

March 2011 29MEASUREMENT AND MEANING IN BIOLOGY

Page 29: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

ment theory takes root in biology, we canlook forward to a time when a backlog ofwell-studied biological examples will beavailable as a familiar entree to the study ofmeasurement and meaning.

In the meantime, we offer a modest listof measurement principles as an aid tothinking about both good and bad mea-surement:

1. Keep theoretical context in mind.2. Honor your family of hypotheses.3. Make meaningful definitions.4. Know what the numbers mean.5. Remember where the numbers come

from.6. Respect scale type.7. Know the limits of your model.8. Never substitute a test for an estimate.

9. Clothe estimates in the modest raimentof uncertainty.

10. Never separate a number from its unit.

acknowledgments

Our understanding of the ideas in this paper was devel-oped in discussion with members of the Hansen andHoule laboratories and with numerous visitors to theCentre for Ecological and Evolutionary Synthesis(CEES) at the University of Oslo, Norway, and the Na-tional Evolutionary Synthesis Center (NESCENT) atDuke University, U.S. We thank Stevan J. Arnold, Don-ald Griffin, Øistein Holen, Mihaela Pavlicev, Jura Pintar,Claus Rueffler, Tore Schweder, Patrick Phillips, MichaelWade, and two anonymous reviewers for their com-ments on the manuscript. Arnaud LeRouzic performedthe simulations in Figure 3. David Houle was supportedin this research by a visiting professorship at CEES anda visiting scientist appointment at NESCENT.

REFERENCES

Adams E. W., Fagot R. F., Robinson R. E. 1965. Atheory of appropriate statistics. Psychometrika 30:99–127.

Alroy J. 1998. Cope’s rule and the dynamics of bodymass evolution in North American fossil mam-mals. Science 280:731–734.

Alvarez-Castro J. M., Carlborg Ö. 2007. A unifiedmodel for functional and statistical epistasis andits application in quantitative trait loci analysis.Genetics 176:1151–1167.

Anderson D. R., Burnham K. P., Thompson W. L.2000. Null hypothesis testing: problems, preva-lence, and an alternative. Journal of Wildlife Man-agement 64:912–923.

Arnold S. J., Burger R., Hohenlohe P. A., Ajie B. C.,Jones A. G. 2008. Understanding the evolutionand stability of the G-matrix. Evolution 62:2451–2461.

Baldwin J. M. 1896. A new factor in evolution. Ameri-can Naturalist 30:441–451, 536–553.

Barton N. H., Turelli M. 2004. Effects of genetic drifton variance components under a general modelof epistasis. Evolution 58:2111–2132.

Bell G., Koufopanou V. 1986. The cost of reproduc-tion. Pages 83–131 in Oxford Surveys in EvolutionaryBiology, Volume 3, edited by R. Dawkins and M.Ridley. Oxford (UK): Oxford University Press.

Borenstein E., Meilijson I., Ruppin E. 2006. The effectof phenotypic plasticity on evolution in multi-peaked fitness landscapes. Journal of EvolutionaryBiology 19:1555–1570.

Boring E. G. 1945. The use of operational definitionsin science. Psychological Review 52:243–245.

Breiman L. 2000. Statistical modeling: the two cul-tures. Statistical Science 16:199–215.

Bridgman P. W. 1922. Dimensional Analysis. New Ha-ven (CT): Yale University Press.

Bridgman P. W. 1927. The Logic of Modern Physics. NewYork: MacMillan Company.

Broom M., Speed M. P., Ruxton G. D. 2005. Evolu-tionarily stable investment in secondary defences.Functional Ecology 19:836–843.

Butler J. P., Feldman H. A., Fredberg J. J. 1987. Dimen-sional analysis does not determine a mass exponentfor metabolic scaling. American Journal of Physiology-Regulatory, Integrative and Comparative Physiology 253:R195–R199.

Cameron E. Z., du Toit J. T. 2007. Winning by a neck:tall giraffes avoid competing with shorter brows-ers. American Naturalist 169:130–135.

Carter A. J. R., Hermisson J., Hansen T. F. 2005. Therole of epistatic gene interactions in the responseto selection and the evolution of evolvability. The-oretical Population Biology 68:179–196.

Chamberlin T. C. 1890. The method of multipleworking hypotheses. Science 15:92–96.

Chamberlin T. C. 1965. The method of multipleworking hypotheses: with this method the dangersof parental affection for a favorite theory can becircumvented. Science 148:754–759.

Charlesworth B. 1990. Optimization models, quantita-tive genetics, and mutation. Evolution 44:520–538.

Charnov E. L. 1976. Optimal foraging, the marginalvalue theorem. Theoretical Population Biology 9:129–136.

Charnov E. L. 1993. Life History Invariants: Some Explo-

30 Volume 86THE QUARTERLY REVIEW OF BIOLOGY

Page 30: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

rations of Symmetry in Evolutionary Ecology. Oxford(UK): Oxford University Press.

Cheverud J. M., Routman E. J. 1995. Epistasis and itscontribution to genetic variance components. Ge-netics 139:1455–1461.

Cheverud J. M., Routman E. J. 1996. Epistasis as asource of increased additive genetic variance atpopulation bottlenecks. Evolution 50:1042–1051.

Cockerham C. C. 1954. An extension of the conceptof partitioning hereditary variance for analysis ofcovariances among relatives when epistasis is pres-ent. Genetics 39:859–882.

Conner J. K. 2001. How strong is natural selection?Trends in Ecology and Evolution 16:215–217.

Darwin C. 1871. The Descent of Man, and Selection inRelation to Sex. London: John Murray and Sons.

Diaz M. C., Rutzler K. 2001. Sponges: an essentialcomponent of Caribbean coral reefs. Bulletin ofMarine Science 69:535–546.

Dudley R. 2000. The Biomechanics of Insect Flight: Form,Function, Evolution. Princeton (NJ): Princeton Uni-versity Press.

Dunn E. H., Hussell D. J. T., Welsh D. 1999. Priority-setting tool applied to Canada’s landbirds basedon concern and responsibility for species. Conser-vation Biology 13:1404–1415.

Dytham C. 2003. Choosing and Using Statistics: A Biolo-gist’s Guide. Second Edition. Malden (MA): Black-well Publishing.

Eberhard W. G. 2009. Static allometry and animalgenitalia. Evolution 63:48–66.

Endler J. A. 1986. Natural Selection in the Wild. Prince-ton (NJ): Princeton University Press.

Fierst J. L., Hansen T. F. 2010. Genetic architectureand postzygotic reproductive isolation: evolutionof Bateson-Dobzhansky-Muller incompatibilitiesin a polygenic model. Evolution 64:675–693.

Fisher R. A. 1918. The correlation between relativeson the supposition of Mendelian inheritance.Transactions of the Royal Society of Edinburgh 52:399–433.

Fisher R. A. 1930. The Genetical Theory of Natural Selec-tion. Oxford (UK): Clarendon Press.

Fisher R. A. 1941. Average excess and average effectof a gene substitution. Annals of Eugenics 11:53–63.

Frank S. A. 1995. George Price’s contributions toevolutionary genetics. Journal of Theoretical Biology175:373–388.

Frank S. A. 2009. The common patterns of nature.Journal of Evolutionary Biology 22:1563–1585.

Frankino W. A., Emlen D. J., Shingleton A. W. 2010.Experimental approaches to studying the evolu-tion of animal form: the shape of things to come.Pages 419–478 in Experimental Evolution: Concepts,Methods, and Applications of Selection Experiments, ed-ited by T. Garland, Jr. and M. R. Rose. Berkeley(CA): University of California Press.

Frankino W. A., Zwaan B. J., Stern D. L., BrakefieldP. M. 2005. Natural selection and developmentalconstraints in the evolution of allometries. Science307:718–720.

Frankino W. A., Zwaan B. J., Stern D. L., BrakefieldP. M. 2007. Internal and external constraints inthe evolution of morphological allometries in abutterfly. Evolution 61:2958–2970.

Fry J. D. 1993. The “general vigor” problem: canantagonistic pleiotropy be detected when geneticcovariances are positive? Evolution 47:327–333.

Goodnight C. J. 1987. On the effect of founder eventson epistatic genetic variance. Evolution 41:80–91.

Gould S. J. 1977. Ontogeny and Phylogeny. Cambridge(MA): Belknap Press.

Grafen A. 1990a. Biological signals as handicaps. Jour-nal of Theoretical Biology 144:517–546.

Grafen A. 1990b. Sexual selection unhandicapped bythe Fisher process. Journal of Theoretical Biology 144:473–516.

Greenewalt C. H. 1975. The flight of birds: the signif-icant dimensions, their departure from the re-quirements for dimensional similarity, and the ef-fect on flight aerodynamics of that departure.Transactions of the American Philosophical Society 65:1–67.

Gunther B. 1975. Dimensional analysis and theory ofbiological similarity. Physiological Reviews 55:659–699.

Gunther B., Morgado E. 2003. Dimensional analysisrevisited. Biological Research 36:405–410.

Hand D. J. 1996. Statistics and the theory of measure-ment. Journal of the Royal Statistical Society: Series A(Statistics in Society) 159:445–473.

Hand D. J. 2004. Measurement Theory and Practice: TheWorld Through Quantification. London (UK): Arnold.

Hansen T. F., Alvarez-Castro J. M., Carter A. J. R.,Hermisson J., Wagner G. P. 2006. Evolution ofgenetic architecture under directional selection.Evolution 60:1523–1536.

Hansen T. F., Houle D. 2008. Measuring and compar-ing evolvability and constraint in multivariatecharacters. Journal of Evolutionary Biology 21:1201–1219.

Hansen T. F., Pelabon C., Armbruster W. S., CarlsonM. L. 2003. Evolvability and genetic constraint inDalechampia blossoms: components of varianceand measures of evolvability. Journal of EvolutionaryBiology 16:754–766.

Hansen T. F., Wagner G. P. 2001a. Epistasis and themutation load: a measurement-theoretical ap-proach. Genetics 158:477–485.

Hansen T. F., Wagner G. P. 2001b. Modeling geneticarchitecture: a multilinear theory of gene interac-tion. Theoretical Population Biology 59:61–86.

Harvey P. H., Clutton-Brock T. H. 1985. Life historyvariation in primates. Evolution 39:559–581.

March 2011 31MEASUREMENT AND MEANING IN BIOLOGY

Page 31: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

Hereford J., Hansen T. F., Houle D. 2004. Comparingstrengths of directional selection: how strong isstrong? Evolution 58:2133–2143.

Hermisson J., Hansen T. F., Wagner G. P. 2003. Epis-tasis in polygenic traits and the evolution of ge-netic architecture under stabilizing selection.American Naturalist 161:708–734.

Heusner A. A. 1982. Energy-metabolism and bodysize. II. Dimensional analysis and energetic non-similarity. Respiration Physiology 48:13–25.

Heusner A. A. 1983. Body size, energy-metabolism,and the lungs. Journal of Applied Physiology 54:867–873.

Heusner A. A. 1984. Biological similitude: statisticaland functional relationships in comparative phys-iology. American Journal of Physiology-Regulatory,Integrative and Comparative Physiology 246:R839–R846.

Hinton G. E., Nowlan S. J. 1987. How learning canguide evolution. Complex Systems 1:495–502.

Hone D. W. E., Keesey T. M., Pisani D., Purvis A. 2005.Macroevolutionary trends in the dinosauria:Cope’s rule. Journal of Evolutionary Biology 18:587–595.

Houle D. 1991. Genetic covariance of fitness corre-lates: what genetic correlations are made of andwhy it matters. Evolution 45:630–648.

Houle D. 1992. Comparing evolvability and variabilityof quantitative traits. Genetics 130:195–204.

Houle D. 1998. How should we explain variation inthe genetic variance of traits? Genetica 102–103:241–253.

Houle D., Morikawa B., Lynch M. 1996. Comparingmutational variabilities. Genetics 143:1467–1483.

Huxley J. S. 1924. Constant differential growth-ratiosand their significance. Nature 114:895–896.

Huxley J. S. 1932. Problems of Relative Growth. NewYork: L. MacVeagh, The Dial Press.

Jerison H. J. 1969. Brain evolution and dinosaurbrains. American Naturalist 103:575–588.

Kempthorne O. 1954. The correlation between rela-tives in a random mating population. Proceedings ofthe Royal Society of London, Series B: Biological Sciences143:103–113.

Kingsland S. E. 1985. Modeling Nature: Episodes in theHistory of Population Ecology. Chicago (IL): Univer-sity of Chicago Press.

Kingsolver J. G., Hoekstra H. E., Hoekstra J. M., Ber-rigan D., Vignieri S. N., Hill C. E., Hoang A.,Gibert P., Beerli P. 2001. The strength of pheno-typic selection in natural populations. AmericanNaturalist 157:245–261.

Krantz D. H., Luce R. D., Suppes P., Tversky A. 1971.Foundations of Measurement, Volume I: Additive andPolynomial Representations. New York: AcademicPress.

Lande R. 1979. Quantitative genetic analysis of mul-

tivariate evolution, applied to brain:body size al-lometry. Evolution 33:402–416.

Lande R. 1980. The genetic covariance between char-acters maintained by pleiotropic mutations. Genet-ics 94:203–215.

Lande R., Arnold S. J. 1983. The measurement ofselection on correlated characters. Evolution 37:1210–1226.

Lenski R. E. 1988. Experimental studies of pleiotropyand epistasis in Escherichia coli. I. Variation in com-petitive fitness among mutants resistant to virusT4. Evolution 42:425–432.

Lenski R. E., Rose M. R., Simpson S. C., Tadler S. C.1991. Long-term experimental evolution in Esche-richia coli. I. Adaptation and divergence during2,000 generations. American Naturalist 138:1315–1341.

Levins R. 1962. Theory of fitness in a heterogeneousenvironment. I. Fitness set and adaptive function.American Naturalist 96:361–373.

Levins R. 1966. Strategy of model building in popu-lation biology. American Scientist 54:421–431.

Lewontin R. C. 1974. The Genetic Basis of EvolutionaryChange. New York: Columbia University Press.

Logan M. 2010. Biostatistical Design and Analysis Using R:A Practical Guide. Hoboken (NJ): Wiley-Blackwell.

Lord F. M. 1953. On the statistical treatment of foot-ball numbers. American Psychologist 8:750–751.

Luce R. D. 1959. Individual Choice Behavior: A Theoret-ical Analysis. New York: Wiley and Sons.

Luce R. D., Krantz D. H., Suppes P., Tversky A. 1990.Foundations of Measurement, Volume III: Representa-tion, Axiomatization, and Invariance. New York: Ac-ademic Press.

Maynard Smith J. 1976. Sexual selection and thehandicap principle. Journal of Theoretical Biology 57:239–242.

McGhee K. E., Fuller R. C., Travis J. 2007. Male com-petition and female choice interact to determinemating success in the bluefin killifish. BehavioralEcology 18:822–830.

McMahon T. A., Bonner J. T. 1983. On Size and Life.New York: W. H. Freeman.

Michell J. 1999. Measurement in Psychology: A CriticalHistory of a Methodological Concept. Cambridge(UK): Cambridge University Press.

Mosimann J. E. 1970. Size allometry: size and shapevariables with characterizations of the lognormaland generalized gamma distributions. Journal of theAmerican Statistical Association 65:930–945.

Narens L. 1981. On the scales of measurement. Jour-nal of Mathematical Psychology 24:249–275.

Narens L. 1985. Abstract Measurement Theory. Cam-bridge (MA): MIT Press.

Paenke I., Sendhoff B., Kawecki T. J. 2007. Influenceof plasticity and learning on evolution under di-

32 Volume 86THE QUARTERLY REVIEW OF BIOLOGY

Page 32: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

rectional selection. American Naturalist 170:E47–E58.

Pavlicev M., Le Rouzic A., Cheverud J. M., WagnerG. P., Hansen T. F. 2010. Directionality of epistasisin a murine intercross population. Genetics 185:1489–1505.

Phillips P. C., Arnold S. J. 1989. Visualizing multivar-iate selection. Evolution 43:1209–1222.

Platt J. R. 1964. Strong inference: certain systematicmethods of scientific thinking may produce muchmore rapid progress than others. Science 146:347–353.

Podolsky R. H., Shaw R. G., Shaw F. H. 1997. Popu-lation structure of morphological traits in Clarkiadudleyana. II. Constancy of within-population ge-netic variance. Evolution 51:1785–1796.

Pomiankowski A. 1987. Sexual selection: the handi-cap principle does work-sometimes. Proceedings ofthe Royal Society of London, Series B: Biological Sciences231:123–145.

Pomiankowski A. N. 1988. The evolution of femalemate preferences for male genetic quality. Pages136–184 in Oxford Surveys in Evolutionary Biology,Volume 5, edited by P. H. Harvey and L. Partridge.Oxford (UK): Oxford University Press.

Post E., Forchhammer M. C. 2002. Synchronization ofanimal population dynamics by large-scale cli-mate. Nature 420:168–171.

Postma E., Visser J., van Noordwijk A. J. 2007. Strongartificial selection in the wild results in predictedsmall evolutionary change. Journal of EvolutionaryBiology 20:1823–1832.

Price G. R. 1970. Selection and covariance. Nature227:520–521.

Prothero J. 1986. Methodological aspects of scaling inbiology. Journal of Theoretical Biology 118:259–286.

Prothero J. 2002. Perspectives on dimensional analysisin scaling studies. Perspectives in Biology and Medi-cine 45:175–189.

Remold S. K., Lenski R. E. 2001. Contribution ofindividual random mutations to genotype-by-environment interactions in Escherichia coli. Pro-ceedings of the National Academy of Sciences USA 98:11388–11393.

Reznick D. 1985. Costs of reproduction: an evaluationof the empirical evidence. Oikos 44:257–267.

Roff D. A., Mousseau T. A. 1987. Quantitative geneticsand fitness: lessons from Drosophila. Heredity 58:103–118.

Rose M. R. 1984. Genetic covariation in Drosophila lifehistory: untangling the data. American Naturalist123:565–569.

Rosen R. 1962. The derivation of D’Arcy Thompson’stheory of transformations from theory of optimaldesign. Bulletin of Mathematical Biophysics 24:279–290.

Rosen R. 1978a. Dynamical similarity and theory of

biological transformations. Bulletin of MathematicalBiology 40:549–579.

Rosen R. 1978b. Fundamentals of Measurement and Rep-resentation of Natural Systems. New York: ElsevierNorth-Holland.

Sarle W. S. 14 September 1997. Measurement Theory:Frequently Asked Questions. ftp://ftp.sas.com/pub/neural/measurement.html. (Accessed 10 Janu-ary 2008).

Schmidt-Nielsen K. 1984. Scaling: Why is Animal Size SoImportant? Cambridge (UK): Cambridge Univer-sity Press.

Simmons R. E., Scheepers L. 1996. Winning by aneck: sexual selection in the evolution of giraffe.American Naturalist 148:771–786.

Smith R. J., Jungers W. L. 1997. Body mass in com-parative primatology. Journal of Human Evolution32:523–559.

Sneath P. H. A., Sokal R. R. 1973. Numerical Taxonomy:The Principles and Practice of Numerical Classification.San Francisco (CA): W. H. Freeman.

Sokal R. R., Rohlf F. J. 1995. Biometry: The Principlesand Practice of Statistics in Biological Research. ThirdEdition. New York: W. H. Freeman.

Solow A. R., Wang S. C. 2008. Some problems withassessing Cope’s Rule. Evolution 62:2092–2096.

Stahl W. R. 1961. Dimensional analysis in mathemat-ical biology. I. General discussion. Bulletin of Math-ematical Biology 23:355–376.

Stahl W. R. 1962. Similarity and dimensional methodsin biology. Science 137:205–212.

Stephens D. W., Dunbar S. R. 1993. Dimensional anal-ysis in behavioral ecology. Behavioral Ecology 4:172–183.

Steppan S. J., Phillips P. C., Houle D. 2002. Compar-ative quantitative genetics: evolution of the G ma-trix. Trends in Ecology and Evolution 17:320–327.

Stevens S. S. 1946. On the theory of scales of mea-surement. Science 103:677–680.

Stevens S. S. 1959. Measurement, psychophysics andutility. Pages 18–63 in Measurement: Definitions andTheories, edited by C. W. Churchman and P. Ra-toosh. New York: Wiley.

Stevens S. S. 1968. Measurement, statistics, and theschemapiric view. Science 161:849–856.

Stinchcombe J. R., Agrawal A. F., Hohenlohe P. A.,Arnold S. J., Blows M. W. 2008. Estimating nonlinearselection gradients using quadratic regression coef-ficients: double or nothing? Evolution 62:2435–2440.

Suppes P., Krantz D. H., Luce R. D., Tversky A. 1989.Foundations of Measurement, Volume II: Geometrical,Threshold, and Probabilistic Respresentations. NewYork: Academic Press.

Svennungsen T. O., Holen Ø. H. 2007. The evolution-ary stability of automimicry. Proceedings of the RoyalSociety of London, Series B: Biological Sciences 274:2055–2063.

March 2011 33MEASUREMENT AND MEANING IN BIOLOGY

Page 33: 86, March THE QUARTERLY REVIEW of Biologydhoule/Publications/Houle@@11Measurement.pdfof Biology MEASUREMENT AND MEANING IN BIOLOGY David Houle* Centre for Ecological & Evolutionary

Takahashi M., Arita H., Hiraiwa-Hasegawa M., Hase-gawa T. 2008. Peahens do not prefer peacockswith more elaborate trains. Animal Behaviour 75:1209–1219.

Teissier G. 1936. Croissance comparee des formeslocales d’une meme espece. Memoires du MuseeRoyal D’Histoire Naturelle de Belgique 3:627–634.

Thompson D. W. 1917. On Growth and Form. Cam-bridge (UK): Cambridge University Press.

Turelli M. 1984. Heritable genetic variation viamutation-selection balance: Lerch’s zeta meets theabdominal bristle. Theoretical Population Biology 25:138–193.

Via S., Lande R. 1985. Genotype-environment inter-action and the evolution of phenotypic plasticity.Evolution 39:505–522.

Vik J. O., Stenseth N. C., Tavecchia G., Mysterud A.,Lingjærde O. C. 2004. Ecology (communicationarising): living in synchrony on Greenland coasts?Nature 427:697–698.

Wagner G. P. 2010. The measurement theory of fit-ness. Evolution 64:1358–1376.

Wagner G. P., Altenberg L. 1996. Perspective: com-plex adaptations and the evolution of evolvability.Evolution 50:967–976.

Wagner G. P., Laubichler M. D. 2001. Character iden-tification: the role of the organsm. Pages 141–164in The Character Concept in Evolutionary Biology, ed-ited by G. P. Wagner. San Diego (CA): AcademicPress.

Wagner G. P., Laubichler M. D., Bagheri-Chaichian

H. 1998. Genetic measurement theory of epistaticeffects. Genetica 102–103:569–580.

Weber K. E. 1990. Selection on wing allometry inDrosophila melanogaster. Genetics 126:975–989.

Weber K. E. 1992. How small are the smallest select-able domains of form? Genetics 130:345–353.

Weitzenhoffer A. M. 1951. Mathematical structuresand psychological measurements. Psychometrika 16:387–406.

White J. F., Gould S. J. 1965. Interpretation of thecoefficient in the allometric equation. AmericanNaturalist 99:5–18.

Whitlock M. C., Schluter D. 2009. The Analysis of Bio-logical Data. Greenwood Village (CO): Robertsand Company.

Wiley R. H., Poston J. 1996. Perspective: indirect matechoice, competition for mates, and coevolution ofthe sexes. Evolution 50:1371–1381.

Wilkinson G. S. 1993. Artificial sexual selection altersallometry in the stalk-eyed fly Cyrtodiopsis dalmanni(Diptera: Diopsidae). Genetical Research 62:213–222.

Wolman A. G. 2006. Measurement and meaningful-ness in conservation science. Conservation Biology20:1626–1634.

Wulff J. 2001. Assessing and monitoring coral reefsponges: why and how? Bulletin of Marine Science69:831–846.

Yoccoz N. G. 1991. Use, overuse, and misuse of signifi-cance tests in evolutionary biology and ecology. Bul-letin of the Ecological Society of America 72:106–111.

34 Volume 86THE QUARTERLY REVIEW OF BIOLOGY