retrospectives who invented instrumental variable regression? · one definition of instrumental...

Journal of Economic Perspectives?Volume 17, Number 3?Summer 2003?Pages 177-194

Retrospectives

Who Invented Instrumental Variable

Regression?

James H. Stock and Francesco Trebbi

This feature addresses the history of economic words and ideas. The hope is to

deepen the workaday dialogue of economists, while perhaps also casting new light on ongoing questions. If you have suggestions for future topics or authors, please write to Joseph Persky, c/o Journal of Economic Perspectives, Department of Econom?

ics (M/C 144), University of Illinois at Chicago, 601 South Morgan Street, Room

2103, Chicago, Illinois 60607-7121.

Introduction

The earliest known solution to the identification problem in econometrics?the

problem of identifying and estimating one or more coefficients of a system of simul?

taneous equations?appears in Appendix B of a book written by Philip G. Wright, The

Tariff on Animal and Vegetable Oils, published in 1928. Its first 285 pages are a painfully detailed treatise on animal and vegetable oils, their production, uses, markets and

tariffs. Then, out of the blue, comes Appendix B: a succinct and insightful explanation of why data on price and quantity alone are in general inadequate for estimating either

supply or demand; two separate and correct derivations of the instrumental variables

estimators of the supply and demand elasticities; and an empirical application to butter

and flaxseed. The great breakthrough of Appendix B was showing that instrumental

variables regression can be used to estimate the coefficient on an endogenous regres?

sor, something ordinary least squares regression cannot do, which makes instrumental

variables regression a central technique of modern micro- and macroeconometrics.

? James H. Stock is Professor of Economics, and Francesco Trebbi is a graduate student in the

Department of Economics, Harvard University, Cambridge, Massachusetts. Stock is also

Research Associate, National Bureau of Economic Research, Cambridge, Massachusetts.

178 Journal of Economic Perspectives

Perhaps because Appendix B differs so from the rest of the book, its author-

ship has been in doubt. There is, in fact, a plausible alternative author: Philip

Wright's eldest son, Sewall, who by 1928 was already an important genetic statisti-

cian. Indeed, the second of the two derivations of the instrumental variable

estimator in Appendix B uses the method of "path coefficients," which Sewall had

recently developed (S. Wright, 1921, 1923). Some histories, including Goldberger

(1972), Crow (1978, 1994) and Manski (1988), attribute Appendix B to Sewall

Wright. Morgan (1990) and Angrist and Krueger (2001) attribute authorship to

Philip, but opine that Sewall probably deserves some intellectual credit. Others,

including Christ (1994) and Stock and Watson (2003), state that authorship is in

question, but do not take a stand.

So who wrote Appendix B, and, by inference, who solved the identification

problem in econometrics? The simplest way to solve this puzzle would have been to

ask Sewall, but apparently nobody did, and he died in 1988.

Lacking eyewitnesses, we investigate this mystery by other means: searching for

traces of literary fingerprints hidden in Appendix B. The field of stylometrics?the statistical analysis of literary styles?postulates that subtle differences in style among authors can be used to attribute texts of ambiguous authorship. A classic stylometric

study is Mosteller and Wallace's (1963) authorship analysis of the unsigned Feder-

alist Papers. More recently, Foster (1996) used stylometrics to attribute the author?

ship of the political novel Primary Colors to Joseph Klein, a charge he denied until

confronted by the Washington Post with editorial corrections in his handwriting. Our detective work entails using stylometric data (numerical measures of

word usage and grammatical constructions) to assess whether Appendix B is

most likely by Philip or Sewall?or, potentially, by neither. Although stylometrics sounds exotic, its main methods are just versions of standard econometric tools. In

fact, our stylometric investigation provides a simple (and, we hope, fun) illustra-

tion of some econometric methods for analyzing high-dimensional data sets, in

which the number of explanatory variables exceeds the number of observations. As

we shall see, this econometric sleuthing clearly points to the true author of

Appendix B.

The History of Instrumental Variable Regression and Appendix B

The first known publication in English to describe the identification problem in an empirical context was a book review by Philip Wright of Henry Moore's

Economic Cycles: Their Law and Cause (Moore, 1914; P. G. Wright, 1915). Philip

explained why what Moore famously called a "new type" of demand curve?an

upward-sloping demand curve for pig iron?could just be the supply curve, traced

out by a shifting demand curve. Philip's treatment was very brief (less than one

page) and followed what must have been a difficult discussion of autocorrelations

and frequency domain methods. In any event, Philip's analysis seems to have been

largely overlooked, even though it is cited in E. J. Working's (1927) influential

James H. Stock and Francesco Trebbi 179

exposition of the identification problem.1 Philip Wright (1929) later elaborated on

his 1915 analysis in his review of Henry Schultz's (1928) Statistical Laws of Demand

and Supply with Special Application to Sugar. One definition of instrumental variable estimation is the use of additional

"instrumental" variables, not contained in the equation of interest, to estimate

the unknown parameters of that equation. Thus defined, instrumental variables

estimation predated Appendix B. As discussed in detail by Goldberger (1972), Sewall Wright (1925) used instrumental variables to estimate the coefficients of

a multiple equation model of corn and hog cycles. In that work, he derived the

instrumental variables estimating equations (the equations among correlations

from which the coefficients of interest could in turn be estimated) using the

method of "path coefficients," which he had recently introduced in S. Wright

(1921, 1923).2 But the equations in his 1925 model are not simultaneous and all

the regressors are exogenous, so ordinary least squares would have sufficed;

there was no identification problem to solve, so instrumental variables estima?

tion was unnecessary and appears to have been merely a computational expe- dient.3 Moreover, in his 1925 paper, Sewall stated (footnote 7) that his method

of path coefficients, as it then existed, could not handle systems of simultaneous

equations. The idea that instrumental variables estimation can be used to solve the

identification problem?that is, can be used to estimate the coefficient on an

endogenous variable?first appeared in Appendix B. Elaborating on P. G. Wright

(1915) (but not citing Working, 1927), the author first presented the now-standard

graphical demonstration of why movements in demand and supply can produce an

arbitrary scatterplot of price-quantity points, which will trace out neither supply nor

demand unless one of the curves is fixed; his key figure is reprinted as Exhibit 1.

Then (pp. 311-312):

In the absence of intimate knowledge of demand and supply conditions,

statistical methods for imputing fixity to one of the curves while the other

changes its position must be based on the introduction of additional factors.

Such additional factors may be factors which (A) affect demand conditions

1 Christ (1985, 1994) and Morgan (1990) provide engaging histories of the identification problem in econometrics and its solution. A single paragraph also suggesting that Moore had estimated a supply curve appeared later that year in Lehfeldt's (1915) review of Moore (1914). According to Christ (1985), the first known explanation of the identification problem was in French by Lenoir (1913) (translated as

chapter 17 in Hendry and Morgan, 1995), but this is not referenced in other early work on this problem. 2 The method of path coefficients begins by drawing a flow diagram with one-way arrows pointing from causal variables to intermediate variables to outcomes. This diagram allows one to trace the connection between any two variables by following the paths of arrows between them and produces a set of equations among correlations that can be solved to estimate the path coefficients. In Sewall Wright's (1921,1923) initial

expositions, the method of path coefficients is equivalent to multiple regression using ordinary least squares. Goldberger (1972) provides a clear discussion of path analysis and the estimation of path coefficients. 3 Because S. Wright (1925) set to zero sample correlations that were nearly so, the instrumental variables

estimating equations were simpler than the ordinary least squares equations in his four-regressor models.


Exhibit 1

The Graphical Demonstration of the Identification Problem in Appendix B (p. 296)

FicruRB 4. Price-output Data Fail to Revbal Either Supply or Demand Curve.

without affecting cost conditions or which (B) affect cost conditions without

affecting demand conditions.

Appendix B then provides two derivations of the instrumental variable estimator

as the solution to the identification problem. The first (pp. 313-314) is the "limited-

information," or single-equation, approach, in which the instrumental variable A is

used to estimate the supply elasticity; this derivation is summarized in Exhibit 2. The

second derivation (pp. 315-316) is the "full-information," or system, derivation and

uses Sewall Wright's (1921,1923) method of path coefficients, extended to a system of

two simultaneous equations. This derivation in effect solves the two simultaneous

equations so that price and quantity are expressed as functions of A and B. Because A

and B are exogenous, the resulting coefficients can be estimated by ordinary least

squares, and thence, the supply and demand elasticities can be deduced. In modern

terminology, this estimator of the elasticities is the indirect least squares estimator that,

because the system is exacdy identified, is the instrumental variables estimator obtained

in the first derivation.4

The author of Appendix B refers to instrumental variable estimation as "the

method of introducing external factors," which he then uses to estimate the supply and demand elasticities for butter and flaxseed. The external factors actually used

4 From a modern perspective, the only flaw in the derivations is the loose treatment of the distinction between sample and population moments. This strikes us as a minor slip that can be excused by the early date at which Appendix B was written.

Who Invented Instrumental Variable Regression? 181

Exhibit 2

The Single-Equation Derivation of the Instrumental Variable Estimator in

Appendix B

The derivation in Appendix B of the instrumental variable estimator of the

coefficients of a single equation has two steps. The author tackled the supply curve first. Adopt his original notation, let O be the percentage deviation of

output from its mean (then, as now, typically computed by taking the logarithm of the original quantity data, relative to its sample mean) and let P be the

percentage deviation of price from its mean. Starting with the familiar supply and demand diagram, he first derived the supply curve with an additive distur-

bance,

0= eP+ Sl9

where e is the elasticity of supply, Sx represents the shift in the supply curve

"brought about by a change in supply conditions," relative to when prices and

output are at their long-run mean value, and the intercept is zero because the

variables are deviated from their means. The author rearranges this expression as eP = O -

Sl9 then writes (p. 314):

Now multiply each term in this equation by A (the corresponding deviation

in the price of a substitute) and we shall have:

eA X P= A X O- A X Sx.

Suppose this multiplication to be performed for every pair of price-output deviations and the results added, then:

^ ^ _ 2AX O-^AX S, e2sAXP=ZAXO-Z,AXS1 or

e=-^AXP-*

But A was a factor which did not affect supply conditions; hence it is uncor?

related with Sx; hence 2 A X Sx = 0; and hence e = (2 A X 0)/(2 A X P).

(The shading has been added for the stylometric work carried out later.) The

final expression for e is the formula for the instrumental variable estimator with

a single instrument and a single included endogenous variable.

are not stated, but from context they appear to be the price of a substitute (A, which

shifts demand) and the yield per acre (B, which shifts supply). It is striking that Appendix B provides both limited-information (single equa?

tion) and full-information (system) derivations. Tinbergen (1930), apparently un-

aware of Appendix B, provided only one derivation, a full-information derivation


(using algebra, not path analysis) of the indirect least squares estimator.5 The

limited-information interpretation of the method of external factors apparently was

not rediscovered, also independently, until the postwar work of the Cowles

Commission.

Philip and Sewall Wright

Philip G. Wright (1861-1934) received a bachelor's degree from Tufts in 1884

and an MA. in economics from Harvard in 1887.6 Sewall Wright (1889-1988) was

born in Massachusetts, and in 1892, the family (now including a brother, Quincy, born in 1890) moved to Galesburg, Illinois, where Philip became Professor of

Mathematics and Economics at Lombard College, a small college that later folded

in 1930. At Lombard, Philip taught economics, mathematics (including calculus),

astronomy, fiscal history, writing, literature and physical education; he also ran the

college printing press. Philip had a passion for poetry and used the press to publish the first books of poems by a particularly promising student of his, Carl Sandburg. Sewall graduated from high school in Galesburg in 1906 and attended Lombard

College, where many of Sewall's courses, including his college mathematics

courses, were taught by his father.

In 1912, Philip and Sewall moved to Massachusetts. Philip took a visiting

position teaching at Williams College, and Sewall entered graduate school at

Harvard. In 1913, Philip took a position at Harvard, first as an assistant to his former

advisor, Professor Frank W. Taussig, then as an instructor. Taussig was subsequently

appointed head of the U.S. Tariff Commission in Washington, D.C. In 1917, Philip left Harvard for a position at the Commission, then in 1922 took a research job at

the Institute of Economics, part of what would shortly become the Brookings Institution. In 1915, Sewall received his Sc.D. from Harvard and took a position as

Senior Animal Husbandman at the U.S. Department of Agriculture in Washington,

D.C, where his responsibilities involved applying genetics to livestock breeding. In

1926, Sewall moved to the Department of Zoology at the University of Chicago, where he was promoted to professor in 1930.

When Philip had the time to write, he was prolific. While at Harvard, in

addition to his 1915 review of Moore's book, he wrote a number of articles in the

Quarterly Journal of Economics, and while at Brookings, he wrote several books and

published articles and reviews in the Journal of the American Statistical Association, the

Journal of Political Economy and the American Economic Review. Some of his writings

5 Tinbergen (1930) discusses two estimators, the indirect least squares estimator and the "direct," or

ordinary least squares, estimator. In his empirical application to the demand for potatoes, he averages the indirect least squares and ordinary least squares estimates of the demand elasticity. According to Morgan (1990, footnote 17, p. 182) and Magnus and Morgan (1987), at this point Tinbergen did not understand the statistical implications of the identification problem and saw no flaws with ordinary least squares estimation in simultaneous equations systems. Appendix B does not make this mistake. 6 The biographical information in this section draws on Provine (1986), Crow (1994), Philip Wright's alumnus file archived at Harvard University and his personnel file archived at the Brookings Institution.


used algebra and calculus, typically following graphical expositions. Although

Philip wrote on a wide range of topics, the identification problem was a recurrent

theme in his work (P. G. Wright, 1915, 1929, 1930). In his later years, Philip was particularly concerned about tariffs, and he wrote passionately about the

damage being done by recent tariff increases to international relations (P. G.

Wright, 1933). Sewall Wright became an eminent genetic statistician. In addition to develop?

ing the method of path analysis, he made fundamental contributions to evolution?

ary theory and population genetics. Evolutionary biology remained at the center of

Sewall Wright's interests in his 76 years of publishing activity from 1912 to 1988, the

year of his death at age 98. His only publications in economics were his 1925

analysis of the hog and corn markets undertaken as part of his duties at the USDA

during World War I and a section of S. Wright (1934) that he coauthored with his

father. According to Provine (1986, Table 1.2), Sewall expressed no more interest

in economics than in Greek, Latin, astronomy or athletics.

Although Philip and Sewall may have experienced some tension over Sewall's

choice of biology as a career (Provine, 1986), it appears that the two were intellec-

tually close. In P. G. Wright (1915), Philip thanked Sewall for "valuable suggestions, and assistance in making the computations." Moreover, Philip and Sewall collabo-

rated on a long section of a paper by Sewall explicating the method of path coefficients (S. Wright, 1934). That section elaborates upon the terse system derivation in Appendix B and shows that identification can be achieved by impos?

ing other restrictions. In particular, they show that if there is only one instrument

(for example, an instrument for supply, but not for demand), then system identi?

fication can be achieved by further assuming that the supply and demand errors are

uncorrelated.

In short, it seems that either Philip or Sewall could have written Appendix B:

Philip had a clear understanding of the identification problem as early as 1915, and

Sewall's method of path coefficients was used in the second derivation of the

instrumental variable estimator. If this contextual evidence does not resolve the

mystery, perhaps textual evidence will.

Stylometric Analysis

When we started this project, we knew as much about grammar and style as

most econometricians. Fortunately, we could draw on an established body of

research that uses statistical methods to shed light on the authorship of disputed texts. The premise of stylometric analysis is that authors leave literary fingerprints on their work in the form of subconscious stylistic features that are largely inde?

pendent of the subject matter. Father and son have many sole-authored publica?

tions, and they come from different generations and different literary traditions:

Philip's passion was poetry; Sewall's, biology. Might quantifiable differences in their

writing styles allow a clear attribution of Appendix B?


Stylometric analysis has three steps: collecting the raw texts, computing quan? titative stylometric indicators and analyzing the resulting numerical data. In each,

we break no new ground. For surveys of stylometric analysis, see Holmes (1998), Rudman (1998) and Peng and Hengartner (2002).

Data

The raw data consist of a sample of texts (listed in the Appendix) with sole

authorship known to be Philip or Sewall, plus chapter 1 and Appendix B of The

Tariff on Animal and Vegetable Oils. Photocopies of the originals were converted to

text files using an optical character recognition program and checked for accuracy. The resulting text files were edited to eliminate footnotes, graphs and formulas.

Following Mannion and Dixon (1997), blocks of 1,000 words were selected from

these files. A total of 54 blocks were selected: 20 undisputedly written by Sewall, 25

by Philip, six from Appendix B and three from chapter 1. Although Philip's

authorship of chapter 1 has never been questioned, we treat its three blocks as

unknown to see if the authorship identification procedures correctly (we presume) attribute authorship to Philip.

We soon discovered that several of what we initially thought might be good

stylometric indicators, such as sentence length and use of the passive versus active

voice, have been found not to be useful for authorship identification because they are context specific or because they are subject to conscious manipulation by the

author. Instead, the stylometric literature focuses on subtler elements of style

(Rudman, 1994; Holmes, 1994, 1998). Rather than trying to develop our own

stylometric indicators, we adopted two different sets of indicators directly from the

literature.

The first set of stylometric indicators is the frequency of occurrence in each

block of 70 function words. This list was taken wholesale from Mosteller and

Wallace (1963, Table 2.5) and is presented in Table 1. These 70 function words

produced 70 numerical variables, each of which is the count, per 1,000 words, of an

individual function word in the block under analysis. Because the word "things" occurred only once in the 45 blocks with known authorship, it was dropped from

the data set, leaving 69 function word counts.

The second set of stylometric indicators, taken from Mannion and Dixon

(1997), concerns grammatical constructions. Many of their indicators involved

sequential word counts, for example, the average length of certain sentence seg- ments. We decided that such length-based indicators could be unreliable in the

context of mathematical writing (how many "words" is an equation?), so we did not

compute them. Further excluding overlaps with the function words in Table 1 left

18 grammatical constructions; these indicators, which are either frequency counts

per 1,000 or relative frequency counts, are listed in Table 2.

7 Two blocks (items 3 and 7 under S. Wright's publications in the Appendix) were fewer than 1,000 words, so counts were scaled accordingly.


Table 1

Function Words Used in the Stylometric Analysis

Notes: These are the function words listed in Mosteller and Wallace (1963, Table 2.5). a

Dropped from the data set because it occurred only once in the 45 blocks of known authorship.

Each 1,000-word block was processed to compute these stylometric indicators.

The data set thus consists of 54 observations, each corresponding to a different

block, on one dependent variable, authorship (Philip, Sewall or unknown) and 87

independent variables (69 function word counts and 18 grammatical statistics).8

Preliminary Data Analysis Econometricians are trained to be skeptical. Is there any reason to think that

these data can distinguish between Sewall's and Philip's known works, far less solve

the mystery of Appendix B?

If a stylometric indicator differs substantially between the two bodies of known

works, then it should be possible to detect that difference using a conventional

differences-of-means J-statistic. As it happens, many of these ^-statistics are large: of

the 87 ^-statistics (one for each indicator), 18 percent exceed 3 in absolute value

and 41 percent exceed 2 in absolute value. So many large ^-statistics would be quite

unlikely if there truly were no stylistic differences between the authors and if the

stylometric indicators were independently distributed.9

Table 3 presents summary statistics for the six stylometric indicators with the

largest ^-statistics; these indicators are the fourth grammatical statistic in Table 2 (a noun followed by a coordinating conjunction) and five function words. Evidently,

Philip used the words "to" and "now" much more frequently than Sewall, while

Sewall used the word "in" much more frequently than Philip. Can we glean any preliminary indications of authorship from the counts in

Table 3? One way to do so is to see whether the distribution of these indicators in

8 Additional details on data collection and processing, the code in Perl used to compute the stylometric indicators, an electronic copy of Appendix B, a teaching note on classification analysis, and related material, are available by following the links from Stock's home page at (http://post.economics. harvard.edu/faculty/stock/stock.html). 9 The indicators are not, however, independently distributed; for example, "now" and "then" tend to occur together, as do "if' and "would." Thus, formal joint inference on these ^-statistics (such as a

chi-squared test) is not straightforward.


Table 2

Grammatical Statistics Used in the Stylometric Analysis

occurrences of Saxon genitives forms 's or s' noun followed by adverb noun followed by auxiliary verb noun followed by coordinating conjunction coordinating conjunction followed by noun

coordinating conjunction followed by determiner total occurrences of nouns and pronouns total occurrences of main verbs total occurrences of adjectives total occurrences of adverbs total occurrences of determiners and numerals total occurrences of conjunctions and interrogatives total occurrences of prepositions dogmatic/tentative ratio: assertive elements versus concessive elements relative occurrence of "to be" and "to find" to occurrences of main verbs. relative occurrence of "the" followed by an adjective to occurrences of "the" relative occurrence of "this" and "these" to occurrences of "that" and "those" relative occurrence of "therefore" to occurrences of "thus"; 0 if no occurrences of "thus"

Notes: These grammatical statistics are the subset of those used by Mannion and Dixon (1997) after

dropping statistics that overlap with Table 1 or are sequential word counts, which are ambiguous in mathematical texts.

Appendix B is closer to that in Philip's or Sewall's known texts. The results are

suggestive: for all six indicators in Table 3, the mean and standard deviations of

counts in Appendix B are quite similar to those found in Philip's writings, but

different from those found in Sewall's.

Another way to get some insights into authorship is to see whether any of these

"top six" indicators appear in the single-equation derivation quoted in Exhibit 2; as

it happens, several do and are indicated by shading. The passage contains "now," a

word used 1.6 times per 1,000 words by Philip, but only 0.1 times per 1,000 by Sewall. It also contains an instance, "deviations and," of a noun followed by a

coordinating conjunction, a construction used almost twice as frequently by Philip as by Sewall, and it contains the word "to," which is used almost 50 percent more often

by Philip than Sewall. On the other hand, the passage also contains the word "in," which is used more frequendy by Sewall than by Philip. While this preliminary analysis

points toward Philip as the author of Appendix B, it is not decisive. For firmer evidence, we must examine the full data set, but to do so we need different techniques.

Empirical Methods

An econometrician's first instinct might be to regress the binary authorship variable on the stylometric indicators. But with 87 regressors and only 45 observa?

tions, instinct soon gives way to reason: somehow, the number of regressors must be

reduced before analyzing authorship. Two ways to handle this "dimension" prob? lem are principal components analysis and linear discriminant analysis.

Principal components analysis entails reducing a large number of regressors to

Table 3

Summary Statistics for the Six Stylometric Indicators with the Largest /-Statistics

Notes: The entries in columns 2 and 3 are the mean and standard deviations of the counts per 1,000 words of the stylometric indicator in column 1 in the 25 blocks undisputedly written by Philip Wright. Columns 4 and 5 contain this information for the 20 blocks undisputedly written by Sewall Wright. The next column contains the two-sample ?-statistic testing the hypothesis that the mean counts are the same for the two authors. The final two columns contain means and standard deviations for the 6 blocks from

Appendix B. Shaded indicators occur in the excerpt in Exhibit 2.

a few weighted averages, or linear combinations, chosen to capture as much of the

variation in the regressors as possible. The principal components approach begins

by standardizing each variable, that is, by subtracting its sample mean and dividing

by its sample standard deviation. The first principal component is the linear

combination of the variables with the maximum variance, subject to the restriction

that the squared weights sum to one. This procedure tends to give greater weight to regressors that are highly correlated. The second principal component is the

linear combination of the regressors that has the second highest variance and is not

correlated with the first principal component. The third, fourth and additional

principal components are calculated in the same way.10 For our main analysis, we regressed the binary authorship variable on the first

four principal components of the grammatical statistics, then repeated this for the

function words. This produced a pair of predicted values for each observation,

known or not.11 Authorship of an unknown block is assigned depending on

10 Specifically, let X denote the n X k matrix of n observations on the k standardized regressors. The first

principal component of Xis the linear combination of the regressors, Xal5 that has the largest variance, where a1 is a k X 1 vector of weights normalized so that ajc^ = 1. Because the sample variance of Xa is a' X' Xa/ (n - 1), maximizing this sample variance subject to a'a = 1 is equivalent to maximizing a'X'Xa/a'a, which is done when a is the eigenvector of X' X corresponding to its largest eigenvalue. The second principal component is the linear combination formed using the second eigenvector of X' X, and so forth. For applications of principal components analysis in the stylometric literature, see Burrows (1987), Holmes and Forsyth (1995) and Peng and Hengartner (2002). 11 This procedure can be applied generally to prediction or forecasting when the number of regressors is large, relative to the number of observations. For example, Stock and Watson (1999, 2002) and Forni et al. (2002) report promising results for macroeconomic forecasts based on the principal components of many predictors.


whether its pair of predicted values is closer to the means for Philip's or Sewall's

known blocks, where distance is measured using the inverse covariance matrix of

the pair of predicted values for the respective author. Several variations on this

approach are explored as robustness checks.

Our second method, Fisher's linear discriminant analysis, was used by Mos-

teller and Wallace (1963) to analyze the Federalist Papers, although it is used

infrequently in econometrics. Like principal components analysis, linear discrimi?

nant analysis constructs a linear combination of the stylometric indicators that can

be used to distinguish between the two authors. Unlike principal components

analysis, the linear discriminant analysis weights are computed using data on

authorship. The weight (w) placed on a given variable (X) in Fisher's linear

discriminant is the difference in the means for that variable between the known

works of Philip and Sewall, divided by the sum of the variances of that variable for

the known works of Philip and Sewall; that is,

?, l\j-p -&j-S Z = 2j WjXj, where Wj

= ?' ?' ,

where Xj.P and sJ.P are the sample mean and variance of variable j among works

known to be written by Philip, Xj.s and sj.s are defined similarly for Sewall, and k is

the number of stylometric indicators. When differences in the means are large, the

weights will tend to be large?so that indicators that are quite different between the

two authors receive greater weight than those that are similar. If the indicators are

normally distributed with the same variances for both authors, then Fisher's linear

discriminant is the optimal Bayes procedure for classifying authorship (Duda and

Hart, 1973). The linear discriminant was computed separately for the function words and

grammatical statistics, respectively producing ZFW and ZGS. An unknown work is

assigned to an author if the point (Zfw, Zfs) is closer to the average for Philip or

for Sewall, where distance is measured using the inverse covariance matrix for the

relevant author.

Cross-Validation Analysis We begin the empirical analysis by testing these methods using what is known

as "cross-validation" analysis. The idea of cross-validation is to drop an observation

with a known value of the dependent variable (authorship) and to predict that

value using the other observations; doing so repeatedly for all the observations

provides an estimate of the prediction error rate. Performing this "leave-one-out"

cross-validation analysis here entailed 45 repeated analyses; in each, 44 known texts

are used to predict authorship of the remaining "unknown" text. This produced 45

authorship estimates that, because authorship of the "unknown" text is actually

known, can be used to estimate the accuracy rate of our full sample analysis. The resulting estimated accuracy rates are summarized in Table 4. Depending

on author and statistical method, the estimated accuracy rate is 100 percent (that


Table 4

Cross-Validation Estimates of Accuracy Rates of

Assigned Authorship

Notes: Based on leave-one-out cross-validation analysis of 45 1,000- word blocks of known authorship.

is, all texts are correctly identified) in three of four cases and 90 percent in the

remaining case. The cross-validation estimates of 100 percent accuracy seem unre-

alistically high. Still, these results confirm that Philip and Sewall had different

writing styles that are effectively distinguished by the stylometric indicators.

A Full-Sample Analysis We now turn to our main statistical analysis, in which all 45 known texts are

used to compute the principal components regression coefficients and the linear

discriminant analysis weights.

Figure 1 is a scatterplot of the predicted values of the binary authorship variable from its regression on the first four principal components of the gram? matical statistics (Y axis) and the first four principal components of the function

words (X axis). (These principal components respectively explain 56 percent and 32 percent of the variance of the grammatical statistics and function words.)

Figure 1 shows a clear separation between the works of known authorship. This is

consistent with the high cross-validation accuracy rates and with the authors having

measurably different writing styles. The Figure 1 scatterplot also contains predicted values for the six blocks from

Appendix B and the three blocks from chapter 1. All the blocks from Appendix B

fall within the cluster of points associated with Philip's works, assigning authorship of Appendix B to Philip. All the blocks from chapter 1 also fall within Philip's

cluster, correctly (we presume) assigning its authorship to Philip.

Figure 2 presents the comparable scatterplot of (Zfw, Zfs), the values of the

linear discriminant for the grammatical statistics versus those of the function words.

The conclusions are the same as from Figure T. the values for Appendix B and

chapter 1 fall squarely within the cluster of Philip's known texts.

Robustness Checks

We conducted several robustness checks. First, Mosteller and Wallace (1963)

computed the linear discriminant using the differences of the medians instead of


Figure 1

Scatterplot of Predicted Values from Regression on First Four Principal Compo? nents: Grammatical Statistics versus Function Words

s = block undisputedly written by Sewall Wright p = block undisputedly written by Philip G. Wright 1 = block from chapter 1, The Tariff on Animal and Vegetable Oils B = block from Appendix B, The Tariff on Animal and Vegetable Oils

1.5

1 pb p p p

,PP BP jJ>B

1.5

Function words

Figure 2

Scatterplot of Linear Discriminant Based on Grammatical Statistics versus Linear

Discriminant Based on Function Words

-1 0 1

Function words

the sample means and the squared ranges of the data instead of the sample

variances, so we recalculated our linear discriminants using their alternative weight-

ing scheme. The results are similar to those in Figure 2, assigning all the Appendix B and chapter 1 blocks to Philip.

Second, we computed the two principal components regressions using only the

first two principal components, then again using the first six principal components. The results are similar to those in Figure 1, assigning all the Appendix B and

chapter 1 blocks to Philip.


Third, we regressed authorship against an intercept, the first two principal

components of the grammatical statistics and the first two principal components of

the function word counts, and we attribute authorship depending on whether the

predicted value is greater or less than 0.5. All works of known authorship were

correctly assigned. All blocks from Appendix B and chapter 1 were assigned to

Philip.

Fourth, we pooled all 87 stylometric indicators and computed their first four

principal components (these explained 31 percent of the total variance) and

assigned authorship first by regression, as in the preceding paragraph, and second

by minimum distance in the resulting four-dimensional space. Again, all blocks of

known authorship were correctly assigned, and all blocks from Appendix B and

chapter 1 were assigned to Philip.12

Discussion and Conclusions

Who wrote Appendix B? The stylometric evidence clearly points to Philip G.

Wright. Who first thought of using the instrumental variable estimator to solve the

identification problem in econometrics? Of this we cannot be so sure: perhaps it

was collaborative work or even Sewall's idea that Philip simply wrote up. Discussion

of intellectual attribution, as opposed to authorship, quickly becomes speculative.

Still, there is some relevant evidence.

In Sewall's favor, he was, after all, the inventor of the method of path analysis that was used in the second derivation of instrumental variable, and he had used an

instrumental variables estimator in his earlier work on corn and hog cycles. More?

over, Sewall provided corrections to the first draft of Crow's (1978) biography of

Sewall for the International Encyclopedia of the Social Sciences, in which manuscript Crow wrote of Philip Wright, "Later, in 1928, he wrote a book The Tariff on Animal

and Vegetable Oils, to which Sewall contributed an appendix." While Sewall made

other corrections to the entry, he did not amend this statement. As Crow pointed out in personal communication, however, Sewall missed a known factual error two

sentences earlier, so perhaps (at an age of 88 years) his attention lapsed; alterna-

tively, Sewall might have read "contributed an appendix" as "contributed to an

appendix." Also, Arthur Goldberger brought a telling line to our attention: in a

reprise, many years later, of the material in their 1934 coauthored section on supply and demand, Sewall Wright (1960, p. 431) wrote that "P.G. Wright

[1928] . . . made, at my suggestion, a comparison of the results of this mode of

approach [the method of path analysis] with results that he had arrived at by another method. . . ." Perhaps Sewall suggested the full-information derivation

12 We also repeated the analysis using a different stylometric indicator developed by Benedetto, Caglioti and Loreto (2002), which uses zipped text compression ratios. This indicator also identifies Philip as the author of Appendix B. Their code is proprietary, and their published article is insufficiently detailed to

permit replication, so we did not use this indicator for our main analysis.


using path analysis in Appendix B, even if Philip then carried out and wrote up the

analysis. In Philip's favor, it is evident from his 1915 book review that he clearly

understood the identification problem and how it could be solved if one curve

shifts while the other remains constant. Indeed, there are clear links between

Appendix B and P. G. Wright (1915); in particular, Figure 3A in Appendix B is the

same as Figure 3 in the 1915 book review, aside from unimportant differences in

labeling. The first derivation of the instrumental variable estimator (the single-

equation derivation) used graphical methods that would have been familiar to any economics instructor of the day, but we have not found any comparable derivations

in Sewall's works. Although the full-information derivation used the method of

path coefficients, it is plausible that Philip followed his son's research and saw its

applicability to the identification problem. Also, as Crow pointed out to us, Sewall

typically drew the published versions of path coefficients diagrams himself, but the

path coefficient drawing in Appendix B (Figure 10) is not in Sewall's hand, rather,

it was drawn by a professional draftsman. Finally, Sewall is not thanked anywhere in

The Tariff on Animal and Vegetable Oils. Philip is not elsewhere chary with his

acknowledgments: he thanks Sewall for suggestions and computational help in

P. G. Wright (1915). At the time the book was written, Philip lived near Sewall and

their families interacted (Provine, 1986, p. 102), yet Philip did not include his son

among the dozen people he thanked in the acknowledgment section of the book.

In our view, this evidence points toward Philip as being both the author of

Appendix B and the man who first solved the identification problem, first showed

the role of "extra factors" in that solution and first derived an explicit formula for

the instrumental variable estimator. Yet, as historians of econometrics like Christ

(1985) and Morgan (1990) point out, a greater mystery remains: Why was the

breakthrough in Appendix B ignored by the econometricians of the day, only to be

reinvented two decades later?

Appendix

Analyzed Texts of Known Authorship

Sewall Wright 1. "Inbreeding and Homozygosis," Proceedings of the National Academy of Sciences of the United States of

America, 19:4, pp. 411-20, April 15, 1933. 2. "Inbreeding and Recombination," Proceedings ofthe National Academy of Sciences of the United States

of America, 19:4, pp. 420-33, April 15, 1933. 3. "Complementary Factors for Eye Color in Drosophila," in Shorter Articles and Discussion,

American Naturalist, 66:704, pp. 282-83, May/June 1932. 4. "Statistical Methods in Biology," Journal of the American Statistical Association, Supplement: Proceed?

ings of the American Statistical Association, 26:173, pp. 155-63, March 1931. 5. "Statistical Theory of Evolution," Journal of the American Statistical Association, Supplement: Pro?

ceedings of the American Statistical Association, 26:173, pp. 201-08. March 1931. 6. "The Evolution of Dominance," in Shorter Articles and Discussion, American Naturalist, 63:689, pp.

556-61, November/December 1929.


7. "The Dominance of Bar Over Infra-Bar in Drosophila," in Shorter Articles and Discussion, American Naturalist, 63:688, pp. 479-80, September/October 1929.

8. "Fisher's Theory of Dominance," in Shorter Articles and Discussion, American Naturalist, 63:686, pp. 274-79, May/June 1929.

9. "Effects of Age of Parents on Characteristics of the Guinea Pig," American Naturalist, 60:671, pp. 552-59, November/December 1926.

10. "A Frequency Curve Adapted to Variation in Percentage Occurrence," Journal of the American Statistical Association, 21:154, pp. 162-78, June 1926.

11. "Two New Color Factors of the Guinea Pig," American Naturalist, 57:648, pp. 42-51, January/ February 1923.

Philip Wright 1. "The Bearing of Recent Tariff Legislation on International Relations," American Economic Review,

23:1, pp. 16-26, March 1933. 2. "Moore's Synthetic Economic," Journal of Political Economy, 38:3, pp. 328-44, June 1930. 3. "Cost of Production and Price," in Notes and Memoranda, Quarterly Journal of Economics, 33:3, pp.

560-67, May 1919. 4. "Value Theories Applied to the Sugar Industry," Quarterly Journal of Economics, 32:1, pp. 101-21,

November 1917. 5. "Total Utility and Consumers' Surplus Under Varying Conditions of the Distribution of Income,"

Quarterly Journal of Economics, 31:2, pp. 307-18, February 1917. 6. "The Contest in Congress Between Organized Labor and Organized Business," Quarterly Journal of

Economics, 29:2, pp. 235-61, February 1915.

? We thank James Crow for his recollections and for sharing his records with us, William

Provine for checking his audiotaped interviews of Sewall Wright and Sarah Chilton of the

Brookings Institution for archival research on Philip Wright. We are grateful to Carl Christ,

Arthur Goldberger, Peter Reinhard Hansen, James Heckman and Mark Watson for helpful comments and discussions and to Vittorio Loreto for providing us with results from his zipping

algorithm. We especially thank Alan Krueger for questioning, in a 2001 e-mail exchange, the

first author's assumption that Sewall Wright wrote Appendix B; we decided solid evidence was

needed. This research was supported in part by NSF Grant SBR-9730489.

References

Angrist, Joshua D. and Alan B. Krueger. 2001. "Instrumental Variables and the Search for Iden? tification: From Supply and Demand to Natural

Experiments." Journal of Economic Perspectives. Fall, 15:4, pp. 69-85.

Benedetto, Dario, Emanuele Caglioti and Vit- torio Loreto. 2002. "Language Trees and Zip- ping." Physical Review Letters. 88:4, pp. 048702-1- 048702-4.

Burrows, John F. 1987. "Word Patterns and

Story Shapes: The Statistical Analysis of Narra? tive Style." Literary and Linguistic Computing. 2:2, pp. 61-70.

Christ, Carl F. 1985. "Early Progress in Esti-

mating Quantitative Economic Relationships in America." American Economic Review. December, 75, pp. 39-52.

Christ, Carl F. 1994. "The Cowles Commis? sion's Contributions to Econometrics at Chi?

cago, 1939-1955." Journal of Economic Literature. 32:1, pp. 30-59.

Crow, James F. 1978. "Wright, Sewall." En?

try in the International Encyclopedia of the Social Sciences-Biographical Supplement. New York: Macmillan.

Crow, James F. 1994. "Sewall Wright," in Bio?

graphical Memoirs of the National Academy of Sci? ences. 64, pp. 439-69.


Duda, Richard O. and Peter E. Hart. 1973. Pattern Classification and Scene Analysis. New York:

Wiley. Forni, Mario, Mare Hallin, Marco Lippi and

Lucrezia Reichlin. 2002. "The Generalized Dy? namic Factor Model: One-Sided Estima? tion and Forecasting." CEPR Discussion Paper 3432.

Foster, Donald. 1996. "Primary Culprit: An

Analysis of a Novel of Politics." New York Maga? zine. February 26, 29:8, pp. 50-58.

Goldberger, Arthur S. 1972. "Structural Equa? tion Methods in the Social Sciences." Economet? rica. 40:6, pp. 979-1001.

Hendry, David F. and Mary S. Morgan. 1995. The Foundations of Econometric Analysis. Cam?

bridge, U.K.: Cambridge University Press. Holmes, David I. 1994. "Authorship Attribu-

tion." Computers and the Humanities. 28:2, pp. 87- 106.

Holmes, David I. 1998. "The Evolution of

Stylometry in Humanities Scholarship." Literary and Linguistic Computing. 13:3, pp. 111-17.

Holmes, David I. and Richard S. Forsyth. 1995. "The 'Federalist' Revisited: New Directions in Authorship Attribution." Literary and Linguis? tic Computing. 10:2, pp. 111-27.

Lehfeldt, Robert A. 1915. "Review of Eco? nomic Cycles: Their Law and Cause.'" Economic

Journal. 25, pp. 409-11. Lenoir, Marcel. 1913. Etudes sur la Formation et

le Mouvement des Prix. Paris.

Magnus, Jan R. and Mary S. Morgan. 1987. "The ET Interview: Professor J. Tinbergen." Econometric Theory. 3:1, pp. 117-42.

Mannion, David and Peter Dixon. 1997. "Au?

thorship Attribution: The Case of Oliver Gold- smith." Statistician. 46:1, pp. 1-18.

Manski, Charles F. 1988. Analog Estimation Methods in Econometrics. New York: Chapman and Hall.

Moore, Henry. 1914. Economic Cycles: Their Law and Cause. New York: Macmillan.

Morgan, Mary S. 1990. The History of Economet? ric Ideas. Cambridge, U.K.: Cambridge University Press.

Mosteller, Frederick and David L. Wallace. 1963. "Inference in an Authorship Problem."

Journal of the American Statistical Association. June, 58, pp. 275-309.

Peng, Roger D. and Nicolas W. Hengartner. 2002. "Quantitative Analysis of Literary Styles." American Statistician. 56:3, pp. 175-85.

Provine, William B. 1986. Sewall Wright and

Evolutionary Biology. Chicago: University of Chi?

cago Press. Rudman, Joseph. 1994. "Nontraditional Au-

thorship Attribution Studies in Eighteenth- Century Literature: Stylistics, Statistics and the

Computer." Working paper. Rudman, Joseph. 1998. "The State of Author?

ship Attribution Studies: Some Problems and Solutions." Computers and the Humanities. 31:4, pp. 351-65.

Schultz, Henry. 1928. Statistical Laws of Demand and Supply with Special Application to Sugar. Chi?

cago: University of Chicago Press. Stock, James H. and Mark W. Watson. 1999.

"Forecasting Inflation." Journal of Monetary Eco? nomics. 44:2, pp. 293-335.

Stock, James H. and Mark W. Watson. 2002. "Macroeconomic Forecasting Using Diffusion Indexes." Journal of Business and Economic Statis? tics. 20:2, pp. 147-62.

Stock, James H. and Mark W. Watson. 2003. Introduction to Econometrics. Boston: Addison

Wesley. Tinbergen, Jan. 1930. "Bestimmung und Deu-

tung von Angebotskurven: Ein Beispiel" Zietschrift fur Nationalokonomie. 1, pp. 669-79.

Working, Elmer J. 1927. "What Do Statistical 'Demand Curves' Show?" Quarterly Journal of Eco? nomics. 41:1, pp. 212-35.

Wright, Philip G. 1915. "Moore's Economic

Cycles." Quarterly Journal of Economics. 29:4, pp. 631-641.

Wright, Philip G. 1928. The Tariff on Animal and Vegetable Oils. New York: Macmillan.

Wright, Philip G. 1929. "Statistical Laws of Demand and Supply." Journal of the American Sta? tistical Association. 24:166, pp. 207-15.

Wright, Philip G. 1930. "Moore's 'Synthetic Economics.'" Journal of Political Economy. 38:3, pp. 328-244.

Wright, Philip G. 1933. "The Bearing of Re? cent Tariff Legislation on International Rela? tions." American Economic Review. 23:1, pp. 16? 26.

Wright, Sewall. 1921. "Correlation and Causa- tion." Journal of Agricultural Research. 20, pp. 557- 85.

Wright, Sewall. 1923. "The Theory of Path Coefficients: A Reply to Niles' Criticism." Genet? ics. May, 8, pp. 239-55.

Wright, Sewall. 1925. "Corn and Hog Correla? tions." U.S. Department of Agriculture Bulletin. 1300, pp. 1-60.

Wright, Sewall. 1934. "The Method of Path Coefficients." Annals of Mathematical Statistics. 5:3, pp. 161-215.

Wright, Sewall. 1960. "The Treatment of Re-

ciprocal Interaction, with or without Lag, in Path Analysis." Biometrics. 16:2, pp. 423-45.

retrospectives who invented instrumental variable regression? · one definition of instrumental...

Documents