addonald rubin datametrics research inc and harvard university introduction sample closeout the...

28
EVALUATION OF NEW PROCEDURE FOR ESTIMATING INCOME AND TAX AGGREGATES FROM ADVANCE DATA John Czajka Mathematica Policy Research Inc Sharon Hirabayashi Mathematica Policy Research Inc Roderick J.A Little Datametrics Research Inc and University of California Los Angeles Donald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and sole proprietorship tax sponse adjustment must account for differences returns filed with he Internal Revenue Service between early and late returns vis vis the IRS in any tax year are processed and posted income and tax items for which advance estimates to centralized data base called the Individual are required Master File IMF The Statistics of Income SOl Division of the IRS draws probability Current IRS Procedures sample of returns from the IMF and uses the data Following the customary method of adjustment to produce extensive tabulations for government for unit nonresponse the AD estimation is based clients and for publication in SOI statistical on reweighting of the advance sample series The Office of Tax Analysis and the Specifically the returns in each sampling Joint Committee on Taxation require key tabula stratum see below are weighted up to projec tions by the end of each November months before tion of the final population of returns in that the statistics for the full year are available stratum at the end of the calendar year14 To accommodate this need the SOI Division weeks after the AD closeout The weight prepares an Advance Data AD report from the representing the ratio of the projected full returns sampled through late September Sample year population to the advance sample count may observations are weighted to projections of be expressed as the product of two components total returns for the year by sampling stratum the inverse of the sampling fraction the base For most income and tax variables the weight and the projected growth in the popula advance estimates have tended to be very close tion of returns between the AD closeout and the to the final estimates prepared after the end of end of the year the processing year For some variables how ever the advance estimates have differed from Projected CR population the final estimates by substantial margins Furthermore recent changes in tax regulations AD sample count have contributed to an increase in the propor tion of returns posted to the IMF after the AD sample closeout thereby increasing the AD population Projected CR population potential for error in the AD estimates In ________________________ view of these considerations there is per AD sample AD population ceived need to improve the AD estimating methodology to project more accurately the number of lateposted returns and to adjust for Variation in the projected growth by stratum income and tax differences between early and the second term in the bottom expression is late returns in each stratum This paper the mechanism by which the AD estimation addresses the latter of these two needs We methodology adjusts for differences between propose apply and evaluate new method of advance and late returns For tax year 1981 the weighting the AD sample returns to improve their projected growth ratios ranged from 1.004 to representation of complete year income and tax 1.397 aggregates The new procedures make use of Because it is so central to the AD weighting propensity score methods developed by Rosenbaum procedure the stratification of the SOT sample and Rubin 1983 merits discussion The principal stratifying dimension is classification based on income ESTIMATION WITH ADVANCE DATA and type of return The sample design imple mented with tax year 1982 provides for 29 The returns processed and posted to the IMF categories or sample codes Twentyseven of during given calendar year excluding small the categories represent crossclassification subset primarily residents of Puerto Rico and of nine levels of income and three types of nomresidents constitute the universe of the return business nonbusiness farm and non SOI Complete Report CR for the preceding tax business monfarm The nine income levels year The CR is based on stratified represent combinations of total receipts farm probability sample of the returns present on the and business returns only and the larger IMF at the end of the calendar year with absolute value of positive amounts total PAT returns being sampled on continuing basis and negative amounts total NAT computed over throughout the year The objective of the AD 19 income items Two additional sample codes estimation is to anticipate the CR estimates of set aside for high income nontaxable returns and key income and tax items on the basis of returns business returns with high net profit or loss sampled through late September Given the complete the 29 strata The stratification is projections of total returns the remaining concentrated on the upper tail of the income problem may be viewed as one of adjusting for distribution In tax year 1982 for example 82 unit nonresponse where the missing units are percent of the population of 95.6 million tax returns posted to the IMF after the AD returns fell into the two lowest income non 109

Upload: others

Post on 24-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

EVALUATION OF NEW PROCEDURE FOR ESTIMATING INCOME AND TAX AGGREGATES FROM ADVANCE DATA

John Czajka Mathematica Policy Research IncSharon Hirabayashi Mathematica Policy Research Inc

Roderick J.A Little Datametrics Research Inc and University of California Los Angeles

Donald Rubin Datametrics Research Inc and Harvard University

INTRODUCTION sample closeout The distribution of returns

by posting date is not random hence the nonreIndividual and sole proprietorship tax sponse adjustment must account for differences

returns filed with he Internal Revenue Service between early and late returns vis vis the

IRS in any tax year are processed and posted income and tax items for which advance estimates

to centralized data base called the Individual are requiredMaster File IMF The Statistics of Income

SOl Division of the IRS draws probability Current IRS Procedures

sample of returns from the IMF and uses the data Following the customary method of adjustment

to produce extensive tabulations for government for unit nonresponse the AD estimation is based

clients and for publication in SOI statistical on reweighting of the advance sampleseries The Office of Tax Analysis and the Specifically the returns in each sampling

Joint Committee on Taxation require key tabula stratum see below are weighted up to projections by the end of each November months before tion of the final population of returns in that

the statistics for the full year are available stratum at the end of the calendar year14To accommodate this need the SOI Division weeks after the AD closeout The weight

prepares an Advance Data AD report from the representing the ratio of the projected fullreturns sampled through late September Sample year population to the advance sample count may

observations are weighted to projections of be expressed as the product of two componentstotal returns for the year by sampling stratum the inverse of the sampling fraction the base

For most income and tax variables the weight and the projected growth in the populaadvance estimates have tended to be very close tion of returns between the AD closeout and the

to the final estimates prepared after the end of end of the yearthe processing year For some variables however the advance estimates have differed from Projected CR population

the final estimates by substantial marginsFurthermore recent changes in tax regulations AD sample count

have contributed to an increase in the proportion of returns posted to the IMF after the AD

sample closeout thereby increasing the AD population Projected CR population

potential for error in the AD estimates In ________________________

view of these considerations there is per AD sample AD populationceived need to improve the AD estimating

methodology to project more accurately the

number of lateposted returns and to adjust for Variation in the projected growth by stratum

income and tax differences between early and the second term in the bottom expression is

late returns in each stratum This paper the mechanism by which the AD estimation

addresses the latter of these two needs We methodology adjusts for differences between

propose apply and evaluate new method of advance and late returns For tax year 1981 the

weighting the AD sample returns to improve their projected growth ratios ranged from 1.004 to

representation of complete year income and tax 1.397aggregates The new procedures make use of Because it is so central to the AD weighting

propensity score methods developed by Rosenbaum procedure the stratification of the SOT sample

and Rubin 1983 merits discussion The principal stratifying

dimension is classification based on income

ESTIMATION WITH ADVANCE DATA and type of return The sample design implemented with tax year 1982 provides for 29

The returns processed and posted to the IMF categories or sample codes Twentyseven of

during given calendar year excluding small the categories represent crossclassification

subset primarily residents of Puerto Rico and of nine levels of income and three types of

nomresidents constitute the universe of the return business nonbusiness farm and nonSOI Complete Report CR for the preceding tax business monfarm The nine income levels

year The CR is based on stratified represent combinations of total receipts farmprobability sample of the returns present on the and business returns only and the larger

IMF at the end of the calendar year with absolute value of positive amounts total PATreturns being sampled on continuing basis and negative amounts total NAT computed over

throughout the year The objective of the AD 19 income items Two additional sample codesestimation is to anticipate the CR estimates of set aside for high income nontaxable returns and

key income and tax items on the basis of returns business returns with high net profit or losssampled through late September Given the complete the 29 strata The stratification is

projections of total returns the remaining concentrated on the upper tail of the income

problem may be viewed as one of adjusting for distribution In tax year 1982 for example 82

unit nonresponse where the missing units are percent of the population of 95.6 million

tax returns posted to the IMF after the AD returns fell into the two lowest income non

109

Page 2: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

business nonfarm sample codes leaving 18 Czajka Little and Rubin 1984 In part as

percent to be apportioned among the remaining 27 result the AD overestimated adjusted gross

sample codes The sampling rates varied from income AGI one of the key components of the

.02 percent in the largest sample code to 100 1981 sample code definitions Elsewhere we

percent in the six highest income and two observe both small and large deviations posi

supplemental codes Internal Revenue Service tive and negative with no obvious pattern The

1984 16 numbers of returns with interest and dividends

In addition to sample code there is geogra are both overestimated by about .10 percent but

phic stratification with up to three levels in the interest Table 1981 comparison amounts

given year Returns from small states are are underestimated by .47 percent and dividend

sampled at higher rate than those from large amounts overestimated by .64 percent The

states to insure minimum size for each state number of returns with business net profit is

sample However this geographic stratification underestimated by .14 percent and the amount of

is unimportant in the reweighting of the advance net profit by .55 percent The number of

sample and we shall focus strictly on the returns with net loss is overestimated by .18

sample code percent but the magnitude of the losses is

The proportion of returns posted after the AD underestimated by nearly percent Net capital

closeout varies markedly across sample codes gains are underestimated by 4.52 percent while

The frequency of late posting rises monotonical farm net income is underestimated by 1.0

ly with the income level of the code except at percent The number of returns with income tax

the lowest levels and business returns exhibit is underestimated by only .06 percent but the

higher rates of late posting than nonbusiness amount of tax is overestimated by .61 percentreturns within the same income level Table The largest deviations for both numbers of

illustrates this pattern with tabulations from returns and amounts are found on the additional

the 1981 CR sample file These tabulations use tax items The number of returns with tax for

1982 sample code definitions consistent with tax preferences is underestimated by 5.92

the analyses reported below rather than the 21 percent and the amount of tax by nearly double

strata from which the 1981 sample was actually that The error for minimum tax is somewhat

drawn The percentage of returns posted after smaller while that for alternative minimum tax

the AD closeout in processing weeks or is slightly greater

cycles 3952 ranges from .8 and .9 percent in Estimates of percentage error over period

the lowest income nonbusiness codes to over 32 of years would inform the user as to the

percent in the highest income codes all three expected precision of the advance estimates By

types of returns Business and nonbusiness themselves however such error statistics do

returns are most sharply differentiated at the not provide clear picture of the effectiveness

lowest income level their rates of late posting of the AD methodology at projecting that which

converge as income rises is unknown at the time the AD sample is closed

It is clear from Table that the sample code outnamely the income and tax on returns

is an excellent stratifier for reweighting the posted after the AD closeout To examine this

advance sample to estimate full year totals question Table describes the incidence of

prior to the completion of processing If the late postisng and Table compares the AD and

variation in late posting within sample codes CR estimates among late returns

were unrelated to the income and tax items for The incidence of late posting exhibits wide

which AD estimates are prepared then the use of variation by type of income or tax item The

this single stratifier would be sufficient In distribution of returns and amounts by posting

fact however the performance of the AD period is reported in Table for selected

estimates suggests that further stratification income and tax items in tax years 1981 and 1982

is required as we shall see these data are based on IRS tabulations of the

total population of returns so the yearendPerformance of Advance Estimates totals differ marginally from the CR estimates

Table presents comparison of the AD and in Table In 1981 1.43 percent of returns

CR estimates of selected income and tax items were posted late In subpopulations this

for tax year 1981 Deviations of AD from CR percentage varied from low of .9 percent for

estimates are expressed as absolute quantities returns with an overpayment or with unemployment

numbers of returns or millions of dollars and compensation in their AGI to high of 15.71

as percentages of the CR estimates The AD percent for returns with alternative minimum

estmate of the total number of returns was .12 tax Variation was even greater for dollar

percent below the CR estimate If the returns amounts In 1981 lateposted returns accounted

Postedlate were undifferentiated from same for 1.81 percent of AGI the percentage of other

stratum returns posted prior to the AD close dollar amounts in lateposted returns varied

out and if the percentage error in the from 1.07 percent for unemployment compensation

projecti\ns of total returns were constant in AGI to 28.65 percent for alternative minimum

across a\ll the strata then the percentage tax Late posting increased by about 50 percent

deviations for all of the items in the table over all returns between 1981 and 1982 with

wiuld be -.12 percent That the percentage some income and tax items showing larger

dØvations vary widely around this amount increases and some smaller This increase was

indicates\that at least one of these conditions due largely to the lengthening of the automatic

is not met\\ extension from two months to fourIn factIRS overprojected the total high Table reports the deviation of AD from CR

income returns and underpredicted the low income estimates of selected income and tax items as

returns in preparing the AD estimates for 1981 percentage of the corresponding late returns or

110

Page 3: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

amounts Whereas the AD projection of total proportion of cycle 3952 returns in the

returns was only .12 percent below the CR population of returns with values on the

estimate for the year Table this 111000 predictors The theory of Rosenbaum and Rubinshortfall represented 8.15 percent of the 1983 discussed in the context of surveyreturns actually posted late Likewise the AD nonresponse in David et al 1983 shows that

estimate of total AGI exceeded the CR estimate the distribution of is the same for early

by .38 percent but this represented 21 and later filers within strata with constant

percent overestimate of the AGI on returns values of px and weighting cells defined

posted after AD closeout For the 15 other with px as one of the stratifiers yielditems the error estimates range from less than unbiased estimates of the distributions of

one percent to nearly 50 percent for numbers of outcome variables This theory suggestsreturns and from percent to 51 percent for stratifying on an estimate of propensity to be

dollar amounts The magnitudes suggest consid posted late Specifically the procedureerable room for improvementmuch more so on defines the binary variable with value for

some items than others In addition while current year late returns and for current yearTable shows some items with comparable errors early returns calculates an estimate px of

on number of returns and dollar amount minimum px by logistic regression of on usingand alternative minimum tax itemized deduc data from the previous year forms or stratations payments to an IRA it includes other with grouped values of px and then uses the

examples where the errors are in opposite direc grouped variable as an additional stratifier in

tions interest received income tax before the formation of weighting classes just as

credits balance due overpayment For items complexity would have been used in the firstof the former kind improving the AD estimate of approach In view of the aforementionedlate returns will improve the projected amount importance of sample code as stratifier wefor items of the latter kind improving the propose calculating separate logistic regresprojected number of returns other things being sions for each sample code if this is

equal will actually increase the magnitude of practically feasible

the error on the amount Decisions required in implementing this

approach include the choice of predictorsProposed New Method and the choice of cut points for defining

Any proposal to alter the stratification of strata based on px The specifics of both arethe AD file must take the current sample design detailed in our description of test applicaas given Therefore changes to the weighting tion below Theoretical considerations in the

scheme must be implemented by poststratifica selection of predictors are laid Out heretion This together with the known substantial Three characteristics determine goodvariation in late posting by sample code predictors for inclusion in the logisticsuggests searching out as potential new strati regression modelsfiers variables that can improve the AD filesrepresentation of the CR file within each of the should be good predictor of the

present sample codes propensity to be posted lateOne known characteristic of late returns that

is not obviously addressed by the current stra should be good predictor of

tification is complexity Late returns tend to outcome variablesy1

include larger number of schedules than do tabulated in the AD reportearly returns Sailer et al 1982 Figure

If this remains true within strata then late the relationship between and the

complex returns end up being represented by propensity to be posted late shouldearly simple returns thereby biasing the AD be relatively stable across adjacentestimates of whatever items are related to the yearsfiling of additional schedules One approach to

improving the weighting of AD returns there Variables that fail to satisfy have minor

fore is to add to the current stratification influence on the weights and hence on the nonrescheme dimension for complexity operational sponse adjustments Variables that satisfyized as the number of schedules attached to the but fail to satisfy tend to increase the

basic tax form One appeal of this approach is mean squared error of AD estimates by inflatingits sinplicity and we entertained it in our their variance without compensating reductioninvestigation of new weighting procedures in bias Finally variables that satisfybecause it would be relatively easy for IRS to and but not introduce bias because priorimplement However empirical investigation year data are used to estimate probabilities ofbased on the 1981 CR microdata file found no late posting which in turn determine theclear evidence of positive relationship weights for current year databetween the number of schedules and the

probability that return will be posted late IMPLEMENTATION OF NEW METHODwithin individual sample codes Apparently the

current stratification already captures most of The application and evaluation of the newthe relationship between the number of schedules weighting procedure entailed the use of 1981 CRand the lateness of posting sample microdata to estimate the propensity

Our second approach was more general in models which were then applied to the earlynature Let be set of predictor variables returns on the 1982 CR sample file to generateavailable for early and late returns and define predicted propensities and construct new weightthe propensity to be posted late px as the classes within the sample codes To free the

111

Page 4: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

results of any influence from projection error Subsampling was done at the sample code

which will be reduced in the future owing to level to create analysis files for the eventualnew procedures introduced with the tax year 1983 estimation of stratumspecific logistic regresAD estimates we used full year CR estimates in sions Within each sample code we selected

place of the required projections Estimates of fraction of early returns approximately equal to

large set of income and tax aggregates pre the number of late returns except where the

pared with the new weights and alternatively latter was much below 100 The combined subsamuniform weights within each sampling stratum

pie totaled just under 20000 records with late

simulation of the current method were then returns comprising 48.4 percent The OLS

compared with estimates based on the final CR regression was run on the full subsample with

weights to assess the relative accuracy of the dummy indicators for sample code being forced

current and proposed weighting schemes net of into the equation On the basis of the regresprojection error sion results we dropped from further considera

tion 31 of the 64 variables using fairlySelection of Variables liberal criteria for retention As result

Ideally one would conduct search over all eight of the 28 items originally proposed by IRS

plausible variables perhaps giving special were eliminated from any further representationattention to those which IRS intends to tabu whatsoever i.e neither as flags nor amountslate With literally hundreds of possible

variables however it was not practical to do Estimation of Propensity Modelsso Instead we employed threestage The relationship between propensity tofile

procedure involving first an priori selec late and the predictors was found to varytion of subset of all possible predictors greatly across sample code Hence final

further screening of these variables on the logistic regression models of the propensitybasis of stepwise OLS regression application toward late posting were calculated separatelyof stepwise logistic regression estimation for 14 strata obtained by collapsing the 29

within strata to derive the final set of sample codes on the basis of similar proportionsstratumspecific models of late returns in the complete file The

Drawing on its subject matter expertise IRS collapsed strata are shown in the column

provided list of 28 items to be investigated headings of Table Definitions of the sampleas possible predictors of late posting Condi codes are given in Table

tions and for the choice of variables The logistic regression models were developeddiscussed in the preceding section were through three rounds of alternative modeladdressed at this point The 28 items were also estimation Each round consisted of the

believed to be related to late posting condi estimation of an initial model using pretion After further consideration we specified set of predictors and the subsequentexcluded one of the 28 items strongly related stepping in of additional predictors subjectto late posting because the nature of that to minimum statistical criteria for entry and

relationship was known to have changed over retention The initial model was defined withtime For most of the remaining 27 items both common set of predictors across the 14 strataan amount and flag indicating whether the item The additional predictors included all of the

was present on the return needed to be consid remaining variables plus variables introducedered In addition some of the items could into the analysis at this stage stratumassume negative values requiring at least one specific indicators of high and low income as

additional flag and amount to permit the identi well as several higher order interaction termsfication of distinct effects of net income and The objective behind including common set

net loss In all 64 variables were defined for of predictors in the equations for all 14 strata

empirical testing This required second stage was to moderate the influence of sampling errorof screening upon the equation specifications across

To reduce the number of variables sufficient strata Interstratum variation in the

ly to allow us to estimate sizeable number of specifications was restricted to variables that

logistic regressions without undue costs we exhibited net effects at conventionalestimated by ordinary least squares OLS significance levels over and above the commonforward stepwise regression of dichotomous

predictors In round one the common predictorsearly/late indicator upon the 64 variables noted comprised in eight of the first nine variablesabove Because of the overall size of the selected by the OLS procedure In round two wemicrodata file 144322 records and the expanded this set to include the sample code

extremely skewed distribution of the dependent indicators for strata combining two or morevariable we subsampled the early returns to sample codes plus 10 variables that had beenreduce the number of observations and to more stepped into the round one equations in at leastnearly equalize the numbers of early and late four strata At the same time we excluded from

returns Such subsampling on the dependent further consideration variables that had beenvariable biases the OLS parameter estimates and stepped into the round one equations in fewerthe logistic regression intercept estimates In than three strata In the third and final roundthe latter case the magnitude of the bias is

we dropped four variables from the common set

simple function of the sampling rate so theand repeated the forward stepwise procedure

intercepts can be corrected In the former case The final 14 equations are reported in Tablethere is no correction However in using the The variable designations are spelled out inOLS procedure simply to screen out the weakest Table The variables forced into the

predictors of late posting we judged the bias equations are listed in the top half of the

to be inconsequential table CAPGAIN and above Note that some of

112

Page 5: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

the forced in variables were excluded from one four variables to enter and the associated R2s

or more equations This occurred either because were as followsthe variable in each case flag was undefined

in that stratum or because it was effectively Income averaging computation flag 12.58%

constant Except where specifically noted inForeign investment credit flag 12.62

the full variable name flag predictor distin PAT15OK 12.65

guishes nonzero amount coded from no Credit for tax on gasoline flag 12.67amount coded The money amount variables

are scaled in $100000s The intercept and The indicator for overseas returns which we had

sample code indicator coefficients have been excluded from many equations because of its veryadjusted to reflect the subsampling rates used limited distribution entered at the 13th stepfor early returns as explained above raised the R2 by only .02 percent and fell

The coefficients exhibit substantial short of statistical significance Because of

variation across strata refer to Table forthe enormous sample size all of the preceding

sample code definitions In part this can be variables entered with statistically significantattributed to the differing distributions of the contributions but the largest was only 11.7variables across strata together with the fact compared to 2795.7 for the propensity score We

that the coefficients reported in the table are concluded from these results that we had not

not standardized This is especially true for overlooked any important predictors of lateincome variables where the amounts range from

posting propensity among those potentialbarely hundreds of dollars in the low income predictors identified at the outsetstrata to perhaps millions of dollars in the

highest income strata The coefficients of Calculation of Weightsthese variables decrease substantially between The calculation of weights under the proposedthe low and high income strata

new weighting scheme entails three stepsThe most consistent predictors across strata

are ITEMDUC ELOSS TAXPREFFLG NATGTPAT and calculation of propensity scores for allPARTNRLOSFLG With the exception of ELOSS the observations in the AD sample using the

presence of an amount or the value of the amount stratumspecific equations describedwas directly associated with the probability abovethat return would be posted late Overseas

returns had high probability of being posted assignment of each observation to one oflate in the four strata in which we included the propensity classes defined for thatOVERSEAS indicator However we found little observations sample code andevidence of variation in late posting by total

income within strata Only three of the final calculation of weight for eachmodels include PAT indicators

propensity classTo ascertain whether our final model

specification did indeed exhaust the ability of Each observation is assigned weightthe full set of potential variables proposed corresponding to its sample code stratum and

by IRS to predict late posting we performed the propensity classfollowing test We combined the 14 subsamples The propensity score for given observationinto single sample of close to 20000 cases

in stratum isand regressed dichotomous indicator of

early/late posing on the following variables

using forward stepwise OLS procedurep1 BXi1e

predicted propensity constructed from the

14 equations where is the vector of coefficients from

Table and X1 the vector of values on the65 proposed predictors drawn from the IRS

predictorslist To evaluate the propensity score method we

computed two alternative sets of weights basedPAT indicators and

on two different approaches to step The

two methods utilize the same individual propen15 higher order interaction terms

sity scores and propensity classes Method one

defines the weight for propensity class inWe forced in the predicted propensity score and stratum to beallowed the remainder to be stepped in From

the results we sought to learn whether anyvariable added anything beyond the explanatory

power captured in the propensity score CRTjkThe initial equation was estimated as Wjk

AD5jkCYCLE .011 .988P

wherewhere CYCLE is coded if late and if early CRTJk

is the estimated CR population that

and is the propensity score The equation would fall into propensity class of stratum

produced an R2 of 12.53 percent andADSjk

is the number of AD sample returns inNo other variable added appreciably to the

explanatory power of the equation The next the same class and stratum Method two defines

113

Page 6: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

preliminary weight as classes in each stratum The potential impact

_____________________

ADSjk/of small weight classes on the variance of the

final AD estimates suggested something like

equal sized classes Because of the muchjk

greater error in current AD estimates of money

where ijk is the predicted propensity for the amounts than in numbers of returns recall Table

ith observation in propensity class of stratumthere seemed more to gain by defining the

classes on the basis of aggregate income rather

The quantity 1/lPijk is the theoretical than numbers of sample returns In other words

weight for an observation with predictedthe boundary between the first lowest and

second propensity classes would be that propen

propensity and the preliminary class sity score which corresponded to the first

weight is simply the mean of such individualsextile of cumulative AGI for that stratum the

weights for the sample observations in thatboundary between the second and third classes

class would be that propensity score which corre

The preliminary weights are rescaled so thatsponded to the second sextile of cumulative AGI

the weighted sum of the sample observations byand so on Upon reviewing distributions of

stratum that is summed over propensitypropensity scores for early and late returns we

classes equals the projected CR population ofdetermined that the distribution of weights

that stratum could be improved by reducing the size of the

highest class and to compensate enlarging the

CRTlowest class Accordingly we fixed the bounda

______________ ries between propensity classes at threejk jk

w.kADs.k five seven nine and eleventwelfths of the

cumulative AGI distribution

The boundaries among the propensity classes

Method two assigns relatively greater weight tofor all 29 strata are shown in Table The

the higher propensity classes than does method weights computed for methods one and two and the

one the more so the higher the propensitysimulated current method are reproduced in Table

scores Method two yields monotonicallyThese weights are expressed as multipliers

increasing weights whereas the weights computedthe growth ratios of equation aboveshowing

under method one will not necessarily increase the relative increase in weight over and above

and could decrease between given propensitythe simple inverse of the sampling fraction

class and the next higher class Thus weight of 1.2 implies that to make the AD

Applying method one in practice would entail estimate of the full year population returns in

projecting the total number of late returns forthat class must represent 20 percent more

each propensity class within each sample code returns than they did when sampled

i.e disaggregating the projected sample codeIt may be noted that in few of the strata

total Unlike the projections by sample code we collapsed two or more propensity classes into

projections for individual propensity classes single class We did so whenever the range of

would have to be made without the benefit of propensity scores for given class turned out

periodic tabulations of returns processed duringto be extremely small In implementing the

the current year One way to do this would becalculation of cut points between classes we had

to compute for each sample code the proportiondivided the range of possible propensity scores

of late returns by propensity class for thethat is from zero to one excluding the end

previous year and apply this distribution to thepoints into 81 discrete intervals We used

projected current year total The assumption ofdiscrete intervals primarily so that we could

stability in this distribution that is theeasily review the distributions of scores but

distribution of late returns by percentile ofwe chose to define the propensity classes in

AGI between successive years does not seem terms of these same discrete intervals It

unreasonable In the evaluation however we happened that in some cases the lower bounds of

used the actual 1982 sample estimates of late two or more propensity classes fell into the

returns in each propensity class to construct same discrete interval When this occurred we

the weights We did so in order that thesimply combined the classes The most extreme

performance of method one not be weakened byexample is provided by sample code 40 where

errors in the projections of the distribution offour of the six propensity classes were assigned

late returns by propensity class This isto the same discrete interval .0075 to .0100

consistent with our desire to evaluate the three

methods in their purest form divorced from RESULTS

projection error as the refinement of projection methods is independent of the method of

To evaluate the new methods we computed three

calculating weights This choice may afford sets of advance estimates of selected Income and

comparative advantage to method one relative totax items by applying the alternative weight

method two and the current IRS procedure in thismultipliers in Table to the 1982 CR sample

evaluation as method one weights will returns posted prior to cycle 39 We then

incorporate more information about actual latecompared these estimates with tabulations based

returns in 1982 than either of the other sets ofon all returns on the CR file The results are

weights reported below

With regard to the definition of propensityThe selected income and tax items on which

classes we decided first of all to create sixthe evaluation is based were chosen because of

their prominence in the AD tabulations circu

114

BOX 1_SOI_RARR_1 986-1 987_00999

Page 7: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

lated by IRS They include variables present in number of returns and the monetary amount Forthe propensity equations and therefore contri itemized deductions the current method yieldsbutors to the propensity scores that factor into result that departs from the CR estimate bythe new weights as well as items with no close to 1.4 billion dollars or .48 percentobvious relation to any of the variables in the The new methods approximate the CR estimate to

models This distinction is potentially within 366 million and 221 million dollarsimportant We expect the propensity score respectively

approach to produce improved estimates of For business net profit the new methods and

outcome variables that happen to be included the current method yield very comparable estiamong the variables in the models because we mates However the current method underestiknow that their relationships to late posting mates the magnitude of the net loss amount byhave been incorporated into the propensity i.o billion dollars or 5.75 percent while the

scores We anticipate improvements in the other new methods deviate from the CR estimate by onlyvariables as well because the propensity scores 228 million and 179 million dollars Curiouslyare intended to represent the tendency toward this combination of better loss estimates andlate posting generally However for variables comparable profit estimates produces less

not tested as possible predictors of late accurate estimates of net profit less loss under

posting we have no prior empirical evidence to the new methods relative to the current This

suggest that their relationships with late happens because the current method estimates of

posting are indeed captured in the propensity net profit and net loss diverge from the CR

scores Accordingly our expectations for estimates in opposite directions i.e oneimproved performance among these variables are understates positive amount while the otherless strong Where the new procedures do not understates negative amount and the deviaproduce significantly improved estimates we tions largely nullify each other when the twowould recommend that such variables be tested as estimates are summed This kind of result is

possible predictors in future application of not repeated for any other of the net profitthe new method less loss amounts reported in Table however

For partnership income we again find noVariables Represented in the Propensity Models appreciable differences among the advance esti

Absolute and percentage deviations from CR mates of net income whereas the new methodsestimates are reported in Table for the three yield markedly smaller deviations than the

alternative advance estimates of selected income current method by onehalf to twothirds onand tax items that wereincluded in some form in

net loss Here unlike business income thethe propensity models The table reports as relative magnitudes of the deviations on incomewell the CR estimates from which the deviations and loss are such that the two new methods lead

are measuredto substantially better advance estimates of CR

For most income and tax amounts the advance net income less loss than does the currentestimates based on propensity score methods one method The current method differs from the CRand two lie substantially closer to the CR

by 2.6 billion dollars on net income less lossestimates than do those based on the simulated amount of 899 million dollars The new methodscurrent method The new methods also yield deviate from the CR estimate by 569 million andcloser approximations to the CR estimates of 220 million dollars respectivelynumbers of returns on items where the current

On the whole the new methods yield nomethod produces deviations of one percent or improvement over the current method on amountsmore There is generally little difference of capital assets sales Similarly the newamong the alternative advance estimates of methods show little improvement relative to thenumbers of returns where the current method and current method on employee business expensesanCR estimates are themselves very close item on which the current method yields results

The most dramatic improvements to the advance very close to the CR estimates of both number ofestimates of money amounts occur on dividends returns and money amountitemized deductions total tax preferences and

alternative minimum tax On total dividends Variables Not Represented in the Propensitywhere the current method yields an excess of 737 Modelsmillion dollars or 1.36 percent over the CR Absolute and percentage deviations from CRestimate the method one estimate falls within estimates are reported in Tablu 10 for the three76 million dollars 0.14 percent and the method alternative advance estimates of selected incometwo estimate within 28 million dollars 0.05 and tax items that were not represented directlypercent For total tax preferences deviation in the propensity models The variables notof 205 million dollars or 13.50 percent obtained included in the propensity models but selectedwith current method is reduced to 16 and 52 for this evaluation constitute major income andmillion dollars respectively under methods one tax items frequently included in summary reportsand two Here even the estimates of the number of advance estimates or highlighted in publicaof returns are much closer to the CR estimate tions based on the CR tabulations The resultswith methods one and two than with the current reported in Table 10 show that the gains inmethod Under the current method the number of

accuracy seen in Table for variables includedreturns with additional tax for tax preferences in the propensity models do indeed generalize tois underestimated by 16.7 thousand or 7.43 variables that were not used to construct propercent the new methods reduce this to 5.2 and pensity scores although they do not generalize2.9 thousand The alternative minimum tax to all the variables we examined On advancecomponent of total tax preferences presents estimates of money amounts the propensity score

comparable picture with respect to both the methods often show sizable improvements over the

115

Page 8: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

current method rarely do the new methods result for twoearner couplethe propensity scorein accurate estimates For numbers of methods show little or inconsistent improvementreturns on the other hand we find generally relative to the current method On exemptionslittle or at best modest improvement and contributions deductions the new methods

Where the propensity score methods yield fare somewhat worse than the current methodsignificantly better estimates of money amountsthe margin of improvement over the current Within Stratum Comparisonsmethod expressed as the proportionate reduction The weighted AD file is used routinely to

in the deviation from the CR estimate generally estimate not only aggregates over all taxpayersrange between 50 and 75 percent Improvements but also distributions by detailed AGI classof this magnitude may be seen on AGI net losses It is of interest therefore to consider to

on farm estate or trust Small Business what extent the comparative advantage of the

Corporation and other income taxable income propensity score methods extends to the subagand alternative measures of income tax On AGI gregate level Because they use more weightthe estimate based on the current method exceeds classes estimates based on the propensity scorethe CR estimate by 9.7 billion dollars or .53 methods might exhibit less bias but greaterpercent The propensity score methods reduce variance than comparable estimates based on the

this deviation to 4.6 billion and 2.9 billion current method Another issue is whether the

dollars On farm net loss the current method relative superiority of the propensity scoreunderstates in absolute magnitude the CR esti methods increases with the frequency of late

mate by 334 million dollars or 1.87 percent posting We have seen evidence of this in theThe propensity score methods reduce this devia comparative performance of the alternativetion to 81 million and 42 million dollars The advance estimates of items with substantial

improvement for net profit less loss is compar versus little late posting but there the

able with the percentage deviation being results may reflect the ability of our propensireduced from 3.51 percent to .80 and .59

ty models to capture the tendency toward late

percent posting in those particular variablesOn Small Business Corporation income the new Comparisons by sample code provide more general

methods produce estimates with somewhat greater evidence in that sample codes with relativelydeviations than the current method on net high rates of late posting overall will exhibit

profit but they compensate with substantially late posting in all variables Finally the

smaller deviations on net loss lie net effect application of the propensity score methodis such that whereas the current method yields entailed the estimation of separate equations byan estimate of the small and volatile net profit groups of sample càdes with differentialless loss that is over one billion dollars wide results It remains to be seen whether there is

of the mark the new methods produce estimates any evidence to support changes in the specifthat are within 441 and 279 million dollars On cations of the models or the grouping of sampleother income the new methods generate slightly codesimproved estimates of net profit and much more Table 11 presents the CR estimate and

substantially improved estimates of net loss percentage deviations from that estimate byOn net income less loss the current method sample code for three of the items from Tableunderstates the magnitude of the 10.3 billion and three from Table 10 The items selected aredollar aggregate loss by 2.0 billion dollars or dollar amounts and they include items on which19.2 percent The new methods reduce the the propensity score methods performed substanmargins on other income profit and loss to 921 tially better than the current method as well asand 568 million respectively or 8.9 and 5.5 items on which the propensity score methodspercent produced no aggregate improvement By including

Perhaps not surprisingly the improvements the latter items we seek to determine whetherfor taxable income are very comparable to those the lack of improvement on those items characregistered for AGI Here there are parallel terizes all sample codes or whether it reflects

improvements for both number of returns and systematically better performance in some samplemoney amount Comparable improvements are codes and worse performance In othersrealized for income tax before and after As reported above the aggregate AD estimatecredits total income tax and total tax liabil of total dividends received is 1.36 percentities On tax after credits the current method above the CR whereas the estimate based on

produces an estimate that is high by 2.8 billion propensity score method one PSM1 falls withindollars or 1.02 percent Propensity score .14 percent and method two PSM2 within .05method one reduces the error to less than 1.2 percent Reviewing the results by sample codebillion and method two lowers it still further we find first of all that the simulated ADto 1.0 billion or .36 percent of the CR esti estimate tends to overstate the CR estimate bymate Even the estimated number of returns an increasing amount as the sample code incomeshows proportionate Improvements on this order level and with It the rate of late postingalthough the current method estimate is itself rises that is from codes ending in to codes

very close to the CR estimate differing by less ending in There is no such pattern amongthan onetenth of one percent the propensity score estimates although there

Small improvements are recorded for Interest Is suggestion of Increasing variance thepaid deduction and for payments to an IRA and estimates being somewhat farther from the CRsmaller still for salaries and wages medical estimates in both the positive and negativeand dental expenses and taxes paid deduction direction as sample code rises Within theOn other Itemsunemployment compensation nonbusiness farm returns 5Os codes the

pensions and annuities alimony and deduction current AD method performs as well as method two

116

Page 9: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

overall and except for the two highest codes amounts differ little from each other the

also as well as method one Within the 40s and sample code estimates tend to be similar as

60s codes the propensity score methods perform wellat least as well relative to the AD method in Estimates among nonbusiness farm returns

the high income sample codes as they do in the tend not to show the level of improvement as we

low income codes find among nonfarm returns This may be

The AD estimates of AGI exhibit to an even attributable to the propensity equations

more pronounced degree the pattern observed for Because the samples within the farm codes are

gross dividends The propensity score methods quite small we combined nonbusiness farm

also show evidence of increasing distance from returns with nonbusiness nonfarm returns by

the CR estimates with rising sample code in the income level based on comparable rates of late

nonbusiness nonfarm and the business codes posting in estimating the equations and we did

but the growth is less rapid than for the AD not attempt to estimate farmspecific interacestimates As result the propensity score tions other than through the interceptsestimates increase their advantage over the Perhaps another strategy for aggregating farm

current AD estimates as the propensity for late returns would have yielded better predictions of

posting rises Within the nonbusiness farm late posting among farm returns This would be

returns method two falls increasingly below the an issue for future research

CR estimate as sample code increases while

method one exhibits no clear pattern DiscussionOn itemized deductions the strong aggregate In assessing the implications of the evalua

performance of the propensity score methods tion it is very significant to note that the

relative to the AD method is seen to be the gains in accuracy realized with the propensityresult of superior performance within small score methods are by no means limited to variasubset of sample codes together with small bles that were used to construct the propensity

overestimates in very large strata counterbal scores One appeal of the propensity score

ancing the underestimates in most other approach to weighting is that an adequate model

strata The propensity score methods provide of the underlying propensity phenomenon should

closer approximations to the CR estimates in permit improved estimates of whatever observed

code ranges 4245 and 6063 Within the farm variables exhibit such tendencies In the

strata the AD method is if anything somewhat present study the empirical development of

closer to the CR estimates than are the propen propensity models was restricted by an priori

sity score methods selection of potential covariates Yet the

On total tax preferences the propensity score method yielded improved estimates of keymethods produce much better estimates than the variables outside this basic set Refinements

AD method through sample codes 4346 and 6267 of the method in this particular context would

which account for most of the aggregate amount broaden the set of variables searched to developof total tax preferences Within the farm codes the models adding variables for which furtherthere is again basically no advantage to any of improvements are desiredthe methods Another aspect of the results that surprised

On salaries and wages and estate or trust net us was the marginal superiority of method two

profitsitems on which the aggregate estimates over method one Method one used the actual

of the AD and two propensity score applications complete report estimates of the numbers of

are fairly comparablethe sample codespecific returns in each propensity class to construct

estimates are themselves very similar especial the weights by propensity class Method twoly on estate/trust The top two or three codes like the simulated current method used only the

for each type of return generate the largest sample code population estimates relying on the

errors but below that there is little differ propensity scores themselves to generate the

entiation by sample code in the average magni differential weights Yet the results presentedtudes of the errors in Tables and 10 show the following

This brief survey of results by sample code performance of the two methodssuggests several conclusions For those items ________________________________________________

where the aggregate estimates based on propensi Method of Comparison of Returns Amount

ty score methods are substantially closer to the

CR estimates than are the simulated AD esti Variables Included in Propensity Models

mates we do not find this level of performance

repeated in every sample code On the other Method better than

hand neither do we find that propensity score Method better than 11

methods achieve their lower overall bias with

higher variance at the sample code level In Variables Not Included in Models

approximately threequarters of the sample codes

the estimates based on the propensity score Method better than 18 22

methods deviate less from the CR estimates than Method better than 11 11

do the estimates based on the current AD

procedures Moreover the relative superiority On the whole the estimates based on method twoof the new methods tends to be most pronounced are superior to those based on method onein those sample codes where the current method Since method two is more directly applicable in

produces the largest deviations from the CR practice than is method one these results

estimatesnamely those sample codes with high suggest that the gains observed in this

proportions of returns posted late For items evaluation should indeed be realizable inwhere the three advance estimates of aggregate practice

117

Page 10: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

SUMMARY maintaining or even improving upon the accuracy

of key variables for which the current magniThis paper has proposed and tested new tudes of error are already comparatively low

method of preparing annual estimates of income

and tax statistics from an advance or truncated FOOTNOTES

sample The problem examined here is one of

adjusting for unit nonresponse where the Improvements to the projection methodology

missing units are tax returns yet to be added to were the subject of another research effort

the sample and the missing returns are known to under this same project The results of

differ from the advance returns on the variables that effort are detailed in Czajka 1984afor which advance estimates are desired No 1984b 1985information is available from the missing

returns for the current year but all returns Personal communication from Ray Shadid IRSare available for previous years Currently IRS Statistics of Income Division

projects the total number of late returns by

sampling stratum and computes stratumspecific The implied individual weight grows very

weights to inflate the sampled returns to the rapidly as approaches unity To avoid

projected totals The proposed new method excessively high weights for the top

involves differentially weighting the returns propensity class we computed the class

within the sample strata on the basis of their weight as the inverse of one less the mean

estimated propensity to be late Individual propensity

propensity scores are computed on the basis of

logistic regression models of late posting The intervals were as follows zero to

estimated on the fullyear sample from the .0300 in increments of .0025 12 intervalsprevious year Within each sample stratum the .030 to .040 in increments of .005

advance returns are then divided into six intervals .04 to .40 in increments of .01

propensity classes and each class is assigned 37 intervals and .4 to 1.0 in increments

weight translating the propensity into of .02 30 intervals We divided the

projected increase in the number of returns of intervals this way because of the extreme

that type bunching of propensity scores at low values

Weights for the new method were estimated in strata with low mean propensities

using tax year 1981 data and then applied to

data from tax year 1982 Estimates were Note that the net loss amount was included

prepared from the advance sample using two among the predictors in the propensitydifferent methods of reweighting the propensity equations whereas the net profit amount was

classes Deviations of these estimates from the not Net profit was tested but rejected as

complete year estimates were compared with predictor at an early stage of the model

deviations based on assigning single weight development possible explanation is that

per stratum the current method The new the relationship between net business incomemethods produced estimates that were generally and late posting changed between 1981 and

much closer to the final estimates for variables 1982 In further research using 1982 data

included in the propensity equations Estimates we found net profit to be significnatof dollar amounts showed much greater improve predictor of late posting in number of

ment proportionally than estimates of the number strataof returns on which those income or tax items

were present Currently there is greater error

on amounts than on numbers of returns Because ACKNOWLEDGMENTSthe propensity equations are intended to

represent the general tendency to be posted This research was performed under contract

late improvements are expected for variables with the Internal Revenue Service The authors

not included in the equations as well as those wish to thank Fritz Scheuren Director of the

that are included Such improvements are Statistics of Income Division for his valuablelimited by the extent to which the equations do input to the research the assistance providedindeed represent general propensity We by members of his staff and his helpful

registered improvements on several variables comments on an earlier draft The assistance of

that were not included in the equations Lucia Wesley who typed the manuscript is also

Variables on which no improvements were recorded gratefully acknowledgedwould be prime candidates for inclusion in

future model development as they reflect

aspects of late posting not captured in the REFERENCESmodels presented here

The results demonstrate the value of Czajka John Evaluation of the Projectionsreweighting on the basis of propensity scores of 1983 Individual Tax Returns MemoranThe methods tested here could be applied on dum Washington Mathematica Policyregular basis to the preparation of advance Research Inc.1985estimates and provide sizeable error reductions

for income and tax items that are poorly esti Projections of Total Tax Returns frommated from advance data using the current an Advance Date Evaluation of Alternamethodology With further research and devel tives Project Report Washington

opment the method could be tailored to provide Mathematica Policy Research Inc 1984aimprovements where they are most needed while

118

Page 11: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

_________Revised Projections of Total Tax ______ Statistics of Income1981 Individual

Returns by Sample Code Tax Year 1983 Income Tax Returns U.S Government

Memorandum Washington Mathematica Policy Printing Office 1983

Research Inc 1984b

_______Advance Data 1981 Individual Income

Czajka John Donald Rubin and Roderick

J.A Little Modeling the PropensityTax Returns Mimeograph WashingtonInternal Revenue Service 1982

Toward Late Posting of Tax Returns An

Application to Tax Year 1981 Project

Report Washington Mathematica Policy

Research Inc 1984 Rosenbaum Paul and Donald Rubin TheCentral Role of the Propensity Score in

David Martin Roderick J.A Little Michael Observational Studies for Causal Effects

Samuhel and Robert Triest Imputation Biometrika vol 70 1983 pp 4155

Models Based on the Propensity to

Respond 1983 Proceedings of the Business

and Economics Section Washington Sailer Peter Charles Hicks Dave Watson and

American Statistical Association Dan Trevors Results of Coverage and

Processing Changes to the 1980 Individual

Internal Revenue Service Statistics of Income Statistics of Income Program 1982

1982 Individual Income Tax Returns Proceedings of the Section on Survey Research

Washington U.S Government Printing Methods Washington American Statistical

Office 1984 Association

119

Page 12: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

4-

4-

fl

fl L-

flfl

fl

4-

LA

fl fl fl%

fl

t. as

Lfl fl

0-fl

4-

50 50 50 50

LU 00

_______I-

LU

WL

on

____________

-L 50 LA 50 50

0%- fl

%5

U- ____fl %fl

fl LA

05 05 50

LA

.._0

___-J

04-

LA fl 50

fl %flfl fl LI

0Cit

4-

LULU

LA fl fiLA

___4-

LA

5050 LA50 LA 00

50

50 4-

505003 OL

LU51 05 CL

4- CC-J

.5

oi4-

LAO 05

.- 5050

U- 30

4-4-4-

%-LLoo4-

LU

on 40LA

05 00 00

04 .... 04

4- fl

00LU 08 08 00 00 00 4-0 4-

d-o8 80 8o 00

-0I.

LU

fl fl LA %fl

00 SI fl sO LA

SO 44

4-

fl 4-

Cj

020.5

4-

UL4-On

05

fl

fl .- 0O SO 50 SO 4-

00 00 o8 .5

A- 0- A-V8000 00 00.000LQ 00 00000 00 8800 0000 00 00 LU

fl fl

.2 sO 00 SO .0 fl

-.1

Page 13: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

OcC_ sIN-c La5nNa 54n05 LaN-a.9 N.LaLaOn.5La a.LaLas-sLaaoo C94c50 40090 LaLa0 aLaLaNa LaSn0ncs-sSnN-.0 nLaU

a. so 5--s e- Na sa La assaL 1151111 ii I_a

Sn

4-

ao.2 .4 N- Sn Sn La La La Na Sn so La La La so

-sii40 Sn IS N- N- Sn Sn La La La N- NaN- N- Sn N- Sn 519 119 Na 00

.- La Sn Sn 50 La Na Sn 09 Na .-s N- 1a

-s so so La _s La 10 .s s-I La 40 5-a Sn Na 010 Sn La

Q- s0 NanNaLaN- U9a5-.aLa ID nsa r- 01 Sn La Sn La Sn La La .jt ID SJ Sn

10 La La La La Sn La s-I La La Sn La Ill 55 La La Na La 10

s1LaC LaLaS aa oSn LaaS

N- La La La Sn Sn La La La La Sn LaN- Na Na Na NJ

.0

La N-La NJ La s--I Na La Sn La La N- La 50 La La

co sn.-.a nSn_s sn- LaN.L0n e-a.Lasnaau9NaNa rntna.osnLa 104- La Sn N- La La La N- Sn N- N- -IN-N-a 10 14 .s NJLu Sn -Ia La 10 La Na La La SnN- sn La La La Na La 55 La s4 515-s--s..--

OP La La Sn Sn Sn La La Sn .1 NINa to 50 La fl s-s La La LaN- La Sn La Sn La La La La Sn LaN- -e -.5 NJ Na Na 5-a

La to

Lu

s--La

451_a.La aoaLsJw 0015I-- 0-

4-1 45 Na 50 La to-N- La La La La N- La ID La La Na La 10 r-

IC La Sn La aLa Na Na Na 519 Sn

ddd ddc dddd dd11111 11111 III liii

N- La La Sn Na Na to NJ La La La Na .-a 10 41

..- In s-sNa Sn Sn Sn to- La to- Na La Na UI La La Na s-s

440 Sn N-nLa NaLaN- sOs-5 N-LaN-0 Snnn4OLanLaNa Las-SLaLaSnO9d Sna.N 41

.44 .a La s-S La Sn s-s La ann Sn naIliii .g .0

SJIS5In

U-

s-S N- La La s-a ID Sn La 519 as-s .o La La s-s .0La .0 Na0 La Na La La Na N- La On La Na Na a-i

s-i La La ID N- La .a Sn N- Na La ISa Na Sn 50441-

4- La La La La sI 519 Sn N- Na La N- La La tO La N-n N-

.0 OLaLa SnSnN aLaLa oLanN-sn N-Lan.-.N-.--ILa-.c- NaSnNaaNaNJ La La so ann La La N- N- Na La La ID Na .- N-

La DI 10 10 Sn N- Na sS Na Sn La Pa Sn La 01 ID LaSn0 4-On Sn La N- N- NJ 15

Sn La N- N- La 10 N- N- SnaSn La N- La La Nan 10 La 10 N- Na N-so La La 10 La La Na Sn La Na ss La La Na La La Na N-

La aa.La NaLa aLa LaLan.Sn NaLaLanLaLaLO09 OsonpaLaNnI a.La LaNJN a0Na La0Iaa asn5-Inunsan.-l NNOLa.aa s.snLa OLaSn OnNaIDO nN-LaSn.-sa.aN-a SnNaLa0InNJ N- La La to La La N- ID N- Na 551 La 10 Na s-I .-5 I0N-s-4

s-s

La La Sn 50 Na Na Sn La Na Sn La La La NaLan_s Sn La N- N- N- Na N-

________ Na

Sn

Ins.445 -s- .0

9- -D 00_s_a 1-

.0 IOc isSnSn Cs- 4- La

In La Inc05 ao cc Sn 15 05

50 Sn 55-0

LII Sn 4.1 5444La -s-a na 45Es-I.ulac 4015 a-.0 Sn LII 14 s- III 15 is

4-I _J Sn Sn s-i 0._a 44 Sn 4-

La I- a- a- is

-.- 515 0.a In 44

ala 51 OSnCs- 5.5.9- s-ais csna 10 aa.sa. ao oa rsswa

Sn _i 4.1 51 La -J Sn Sn .J CO I544 4-

Sn ao -_ aa a-s-EUso-c xa s-a 4.5

aI 44 44 45 a-I 044 51 a-saL ca-oc- a--- -i-ac-oooao u-s-s Is-

.0 us- Sn Is LI Sn II- 5- a- a- Is 4400 15 II 4-5 .J0Sn0 -s-515-s-00a0L.05150 ace OaUISaa-o a.4-s-415-01.-s-15O10UI.O5-0Sna-Snass1aU II--c Cs

a.4oSna_a oa..a000...a CO coa-sa oos_ae Sn

---a aa o.ca.aa-s-a-s-a51 In 454.544 Sn 444.141 .44-5444 a-Is- 0.0 44 4-4 4450 aaa-.-QswaaaaaeaaacsasnEEEcoo.s-o-s-a

44I 51 .XXX a- XC 154-5004-3 10 tO fleE UCa.jEc00 0100 LaLa 00 La L1. 0. La 0. Ui II- La Is

121

Page 14: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

Cu 440 1.-I LI en NJ N- .I NJ NJ NJ en en in 4008 en en en44 Cu 0508 N- 10 en NJ N- 08 N- Cfl NJ NJ 08 00

_C._ enN 00 -4.- C5310.0 NJ .-I

Cu 05 N- fl Cu 100 fl 080 08 00 fl en 4005Cu ..I ..I Cu NJ CO NJ N- 04

-c lOU 08 NJ NJ In en CO en CJ fl 00 -I US 008 00 -008 -en en .0 -08 -08

0511 00 NJ en en 400 0511 NJ N- 400 -40Cu 08 -4 en en N- NJ N- 08 C4

Cu 4.4 en Cu NJ N- 10 10 Cu N- .4 _1_4 08 08NJ Cu .U08 NJ .4 -4

0844-In-C

.2

4-I

Cu nN-en N-N- .4 fl In 0440 N- en 05 NJ 08 LI 05Cu 4- en N- NJ 10 10 N- NJ

Cu .0 l0 05 .C CSCu ON- N- 500% 11 P4 In N- N-In 10 NJ 005 en NJ en N-LI N-C-- NI OS en Cu LI Cu NJ P4

Cu U.4 NIN- NJ COO UNJ N- ION- 044

Ifl enN- .I LI 08 08 NJ N- LI NJ P4 N-05 08 en

.-4 .-

Cu I.a ID en 40 001 N- NJ ID N- NJ 40 001 N- 08 00II N- Cs en Os en

80Cu InC 4101 NJ N- ID 05 10 0% N- 10 N- IOU N- en IC

Cu en .0 en N- .I .05Cu 4-I en 0.Cu 05 NJ 08 en N- -0 00 Cu N- fl 08 05 en 08 40Cu 10 01 N- -4

0-I- LOU 8080 LO UNJ Cu0 N-OS N-en N- 010 0805 ..-I en 08

08

I-001

Cu en 10 010 10.-I en NJ 00 010 ON- OSLO 400 NJ.-U USC OIC .Cu.Cu NJN- IC Cuen 058 0-I 10 40

UI Cu .4 en en LI NJ en-_

LIJ 0.414

N-

4-Cu Cu0 NJ050 P.1-I 005 1N- Cu OU IOU

2-080 14 In NJ NJ0%

Cu .0114

Cu00-V

CuN- .4 NJ .-I .I N-.-4 N- NJ LI en 08 80

CO .0 s- en 08 NJ N- NJ 104- Cu NJ 08 10 NJ Cu N- 0110 NJ 10 NJ N- en en N- NJ en 0%

114 NJ N- 08 NJ N- 05 01 N-

4-IUU 010 N-U 110 NJ NJ N- en 08 In NJ NJ 10 10-4 N- en

Cu II 451 en N- 05 08 01Cu 40 N-C 400 040 400 0140 1010 040 -I NJ In 4001 10

-4 Cu NJ CuCuN- PJCu 80 05 N- NJ CO NJ en .-NJ en0_ -4 -4

Cu 154 08 NJ 0110 en 011 N- 10 N- 400 en 1009o5 QU e.4 csi

00 0g NJ NU .4 CJ154 ID N- 08 NJ en .-4

100 80 en en en 08 05 NJ NJ N- N- .4IO NJ NJ .-I0 00N- N- Cu en IIN- en Cu 01.-I 10 CI 08 NJ N- NJ en NJ NJ 0% Cu

05 Cu .I en 08

Cu -ICu 40

Cu -.4 VI

UI Cu In

Cu VI Cu

Cu -.4

Cu ..4 4.1

Cu Cu 10

Cu 444 IA 4-I VI UI IA IA 4.I Cu VI Cu Cu VINC ...IC Cu Cu CC44 USC 01 04. 4- lI-I. .144- 04- 0.4- 4- 01

CII 040 00 00 CO CO 03 -.0 -344 4.4 Cu 44 41 .41 Cu 44 44 044 44 4.144 4.1

DVI Cu Cu OW 3.01 1001 CuCu Cu 4-10 CCu 00 Cu

44 VI Cu 00 00 .00 Cu 00UI 4.1 4.1

4- II- Cu II- 014- 4- II- CII- 14. II. 044- CuII. 014-

CuO 00 0- Cu0 440 00 0000 4-4.1 4-I 4. 44 4-44 VI 4. 41 1-4.4 44 4-44 Cu 1- 44 4.4 VI 41 44 4-I

UI UIWC 44100 CUIC VIGIC 0010 0010 000 OCuC 4.IUIC CuWC 00101-41 -.Q 01.0 .0 CI .0 .0 fl 0.0 .0 CO .0 .OWIn 4-CO 4-CO VICe CEO 4-CEO CO -CO 0_CO CuEO CO .000.00 CuOC woe VIOE sOC 30 EOE VIOE E0C FOE EE 100

I- 4.4 lC VI 4.4 1. CI Cu IC

4- Cu Cu Cu 44 Cu

UI 0814- 4--

122

Page 15: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

05 en Qt CM N.041 CM In Si at SJ en

1.76-441 NJ. en to 000.0._i 54 en

.0

ate 04 0-4to

..i at ctJ en CM to NJ en to to OOten -0 -to CM -en-.4 -40 to en Den en ..4

en at Si CMto NJ CM 54 05 CSJ

44

.0

s_C 4-50.4 at

.0X.0 C. -4 .40 CS Ca 00

CM en4- to 05 CD to .760iJ to en Si -to NJUS at en en 40 NJ 44

fl en at .0 4.

400 to en CM .76

NJ -40 .0CM U-s at -4

-.4

.- 4.

00

04 CM 0N.to-UatU.enN44 CM en en to .a

en 00 0CM to Qt 4-44-4 40 .U0.4 at 0CM 04 00 en 45

44 en at0-I CS CM 454 fl .4 CM at en_Ni to

atNJ to CMCM

44

4U at Itt .0 44en to 0. CM too 10 .-

U4J4- 4-CM CM en en NJ 0CM 4400_I -4CM 7.4CM

4-

NJ

.4. 5.U at .- 0% too S.en NJ en 05 45

._ 00 toO 400 050 40--en

to to to en 14 at en ON en10 04 .0 44ON en to to 7-7 45

44.4.4 4-

at -444

.0SC0044-

.0 CMat NJ en 40 10 ON5.0 OS en

at CM to to 00 enUi ...4 NJ -0

i-I CM 010 0. toen CM

CM to en fl SJ-o -en -u

CM ID CM.7-

..4 CM CD.7 .044

.4-

44en 040 at en at CMCM 0% CM enoat en en en at en en Oat

45 -10 05 en CS0.0 4.4 0CM CM to to to444 -4 CM

to en 01 040-en -UI0CM to en at --

NJ to

.4.4

4.4

44

0.4

04

444.4

44

In ut 44 44

.4004 USS 4-45 CL 4. 5.

5.0 4.40at04.7 4.4 04.4 C4 4-4

.7- 4.0 44 0441 .4

.74.

4- ii4. 114 014 041- 4414.

4.40 150 40 Co.4 4. 7-

5.4.4 55.4 4.45.4 4.4 45.4.4 1.4.4 04-4en 00 540 4400 4440 460 64440 460 CV

0.0 .0 4-4.0 an on .0 .0EEO E0 4410 EEO 1.00 cEO 0.60 .40044 4500 0041 .0E 64444 500 ..4E.0

4-In 44

Page 16: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

TABLE

DEVIATION OF ADVANCE DATA FROM COMPLETE REPORT ESTIMATES

EXPRESSED AS PERCENTAGE OF LATE POSTED RETURNS

AND AMOUNTS TAX YEAR 1981

Percentage Error Percentage Error

Item on Late Posted on Late Posted

Number of Returns Money Amounts

Total Number of Returns 8.15%

Adjusted Gross Income Less Deficit 21.07%

Salaries and Wages 1.03 13.51

Interest Received 6.81 13.90

Gross Dividends Received 5.88 12.02

Business Net Profit Less Loss 0.88 19.83

Net Capital Gain Less Loss 19.66 31.95

Farm Net Income Less Loss 20.19 35.71

Pensions and Annuities in AGI 15.54 30.33

Payments to an IRA 29.66 22.09

Itemized Deductions 9.35 11.32

Taxable Income 3.83 18.24

Income Tax Before Credits 3.50 26.99

Minimum Tax 38.64 40.35

Alternative Minimum Tax 49.17 -50.55

Balance Due 32.55 11.53

Overpayment 19.13 7.92

SOURCE Tables and Entries are absolute deviations from Table divided

by quantities posted late from Table then multiplied by 100 percent

124

Page 17: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

4C

cc en cc c.i in en c. in

N- en en .c in N- in

.0 C1 in .0 -4 in en

le 9C IC IC

en in ri cc in .0 C4 in .-4

en en cc e.i cc en in in

.0 C.I CJ ill

IC IC IC IC IC

in in in N- .0 cc .0 N- .0

-4 -4 -4 in cc en cc

.0 en e.i c.j c.i e-i in en

IC IC IC

en in r- cc .o in N-

.o en ci en a- N- .0 N- it cc

.0 -4 en .- en C4 en c-.i C4

IC IC IC IC

e-i cc cc a- N- IC N- in

NI en in cc N- N- .0 NI -4 cc .0 .0 N-

.0 .0 .0 en NI en en .o N- in N-

NI

IC IC IC IC

.0 .0 ..O IC cc in IC

en a- cc cc in cc en en en .o

en en en r.i -I ci en c- cc

en cc in

Cl IC IC IC

cc NI in -4 NI N- ...4 en en en

N- cc N- cc en cc c.i cc cc .o in N-

in in -4 -4 -4 N- NI en en NI -4

__________IC IC IC IC IC IC IC

a- en in in c-i en in .0 cc

4-4 .0 .0 cc en .0 NI cc N- in -4 N- NI

in .o in en en in cc

_____IC IC IC IC IC IC IC IC IC

in N- .-4 N- .0 .0 en -4 -.t it

in in .0 in N- N- N- NI -4

in NI .-4 en en ..- c-i en en

-.1 ra

E-4 IC Ic IC IC

in N- in en en in NI .0

NI -4 N- -4 cc en .o en in

in -4 in en en en N- NI en en

I.4

-l

I-I

rx4IC IC

cc .0 -4 en cc in en NI in

ri en N- cc .-i en in cc .o N- -4

in in cc en in en en en -4 NI NI NI cc N-

IC IC IC IC

cc cc in en in

-4 en in cc .0 -4 in NI in N- en en

in in cc c-i -4 NI -4 N- -4 -4 in cc

Ia-4 en

Ia Ic IC IC IC IC IC IC IC IC

N- en in in in .0 cc cc cc N- .0 NI NI

cc .o en en c- in .0 NI en in NI NI

en .o en .o en CCC c-i c--i

4-4

Cl _____4-4

a- en -.0 NI N- N- N- en in

cc a- -.0 ci in c--.i c-i in in N- cc in cc

c--i cc en .0 NI c4 cI N- en

___Ia Ia

1.4 4- Cl

OW WQ Ia Ia Ia 1-4

1-4 Ia CE CO 1a4

Ia P. I-I000 -4 1-4 E- 1-4

-4 E- P. Ia 000Ia -H Ia .-

-4 1-44-4p.4 c-i en P. 1-4 Ia 4-44-10 Ia 00

125

Page 18: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

-4

rcc cc.0 cc .0

-IC

cc r-

.0 cc

.01.J

-4

-IC -IC

Lr r4

lr --.0 .0 c-sj cc N- Ct

4-4

-K .I

.0 Ct

N-

.0 c1 Cl

______ .0

-IC -K-IC

cc

N- CN I- 4J

csJ cc Ct

-IC Ct

-IC

cc N-

Cfl

cI Ct

Ct-K -K IC IC -IC -IC

N- .s-

cc cc cc cc N- 4-4

Læ Ll .i r1

CJ

-IC -IC -IC -IC

.0 N- Si-I

0.0 Lfl Læ -l 4-i

..Lr4 cc Ct

-4

-IC -IC -IC IC

cc .0 Ct r4

cc Cl

.Lrl ccCt

CIC.I Cl

-IC -IC -IC

.0 c-j I-

.0ri in4J

--

-IC

cc -IC LUCI C-I ri

I- IJLr in .0 cc

________-4 _C

1-J--l

Cl 4-4-t -t

4-I

iI CtW-I Ct

_______ 1-IC-C

-IC -IC IC cawN- N-

00 N- N- N- v-4 .r0cc

--I

s-I

s-

cc-I Cl Ct

N- cc ccin -5

-4 cc______ C- Ct00 uct00Cl IS -IC i4OW Cl Cl 1-i

L1 00 00-K 1-4 CJ rSCCl -i i- 00 i-1 1-IC

--Sct Cl IS CI 1-40 Cl 1-4 in OCt Ct$iE-t Cl 001-4 1-i N- N- C- N- E- -IC 4-14

1- Cl 1-4 F-I 1-41-4 U-tCt ICt 10 i-i OWQ-r s-I 1-40 VC.Z -ICr5

126

Page 19: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

TABLE

MNEMONIC DESIGNATIONS FOR PREDICTIONS IN FINAL MODELS

MNEMONIC Full Variable Name

ITEMDUCFLG Total itemized deductions flag

PARTNRLOSFLG Partnership net loss flag

ALTMINTAXFLG Alternative miriimun tax flag

TAXPREFFLG Total tax preference flag

DIVGT400 Flag for dividend income greater than $400

EINCOMEFLG Schedule net income flag

INTGT400 Flag for interest income greater than $400

ITEMDLJC Total itemized deductions amount

DIVINCOME Dividend income amount

ELOSS Schedule net loss amount

BUSEXPFLG Employee business expenses flag

NATGTPAT Flag for negative amounts total exceeding positive

CLOSS Schedule net loss amount

CAPGAIN Net capital gain amount reported on schedule

OVERSEAS Return from overseas

PARTNRLOS Partnership net loss

ALTMINTAX Alternative minimum tax amount

ELOSSFLG Schedule net loss flag

INVSTCRDFLG Investment credit flag

ENERGYFLG Residential energy credit flagTAXPREF Total tax preferences amount

CLOSSFLG Schedule net loss flagJOBCREDFLG Jobs credit flagGASTAXFLG Credit for tax on gasoline flag

PARTNRGAINFLG Partnership net gain flag

THEFTLOS Casualty and theft loss amount

PAT75K Flag indicating PAT $75000PAT75OK Flag indicating PAT $750000PAT3500K Flag indicating PAT $3500000PAT7500K Flag indicating PAT $7500000PAT7500KCODE58 Product of indicator PAT7500K and sample code 58 indicator

PAT3500KCODE58 Product of indicator PAT3500K and sample code 58 indicator

PARTNRTAXPRF Product of partnership net loss flag and tax preferences flagDIVALTMIN Product of flag for dividend income greater than $400 and

alternative minimum tax flag

127

Page 20: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

TABLE

LOWER BOUNDS OF PROPENSITY CLASSES TWO THROUGH SIX BY SMIPLE CODE

Propensity Class

Sample

Code

28 .1600 .2300 .2900 .4000 .5600

38 .1800 .2300 .2900 .4000 .6600

40 .0075 .0100

41 .0050 .0075 .0125

42 .0125 .0150 .0175 .0225

43 .0225 .0275 .0350 .0600

44 .0275 .0400 .0500 .0700 .1500

45 .0500 .0700 .0900 .1200 .2200

46 .0600 .0700 .1000 .1600 .3200

47 .1000 .1200 .1700 .2800 .4600

48 .1100 .1400 .2300 .4000 .6000

50 .0050 .0075 .0275

51 .0075 .0100 .0125 .0400

52 .0125 .0150 .0175 .0225 .0400

53 .0275 .0400 .0500 .0800 .1200

54 .0400 .0500 .0800 .1100 .1600

55 .0600 .0900 .1300 .1800 .2400

56 .0700 .0900 .1600 .2100 .3000

57 .1000 .1300 .2300 .3300 .5200

58 .0350 .0600 .1100 .2400 .6000

60 .0225 .0275 .0500

61 .0150 .0175 .0225 .0275 .0500

62 .0275 .0350 .0400 .0600 .0900

63 .0400 .0500 .0700 .1000 .1900

64 .0800 .0900 .1200 .1600 .2500

65 .1100 .1400 .1600 .2200 .3300

66 .1300 .1600 .2000 .2700 .4600

67 .1700 .1900 .2400 .3300 .4200

68 .1600 .1900 .2600 .3500 .5600

SOURCE Mathematica Policy Research Inc

NOTE The lower bound of the first propensity class is zero

This propensity class is combined with the preceding classes

128

Page 21: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

N- 00 .0 Cfl N- 03 N- .0003 .0 N-

003 -4 Cl .000 03 in .0 Lr .o -z cfl N-

.0 C4 Cfl 00 In Q\ Ui 000 C1 ui N- .0

fl .-4 .- C4 Cfl .-4 -4 -4 -4

.c.Ic

CflO\ QC.J03Q-N- 03c4.4N-03 .-4 ..-4 C.1 N- C4 03 -l .0 .-41 Cfl 03 Ui

Ui .0 c- ci .0 .-4 .4 Cfl .0 -1 0J

/D _4 .4

cl_i

.4-i

4C1C4C4C 4C4CN- C-I In Cfl -4 Cfl Cfl In N- N- In -4

Cfl Cfl -I C4 .C C-I .0 40 0300 Ui

-t CflL1 00000CICfl 00004CNCJC-I4-4 -4_ _I

031_i

4C.3C1C4C 4C4-cI 0-.t-400-.1.0 004.00030.0CO Ui .-4.-4 C4 .0 N- C4 .0 40 cfl N- .0 -1

.-ci 00000CI 00004--4C10_I -4 -4 -4 -4 -4 -4 -4 -4 ._ -4 -4 _4 _4 .-4 -4

0. cj IC

i-I N- in 00 c.4 N- C.4 CC -0 CCC N- 0303 Lf 03.-IIn cI In .0 -i- in .i0 .0 03 c-I -1

C4 0C.I 000000---4 00000-00\00

4..JI4 -I .4 .-4 .-4 .-4 -4 -I -4 .-4 -I -4

P-i 03 In 03 00003 .0 N- In .0 C4 -.1 C-I cfl N-

O\.0 00 .0IC-.103 00InN-C00N-O\-l 000000-4Q 000000400

031_i.-4 .-4 .-4 .-4 .4 .-4 .-4 .-I -4 -4 -4 l0

ciz.I-I -4 -t 03 N- In CC .-4 C-I 0300 C-4 In N- N-

N- C-I CC 03 c-I -1- cn N- -1- -.4 C-I .0

InN- 0004CIIn0C-J--4 000-4--InIn411.4 -l C4 -4 .-4 -4 .4 .4 c-I C-I .-4 .-4 .-4 .-I ..-I c4

i-4ICIC

i-i -1 N- .-4 030 In CC Ci CO .0 N- In In 003 0C-I Cfl In

In .0 C-I .0 N- N- In 00 N-

111x1 In C-ICC 000 C-IIn 0000CflC-JCflN-Cl_i T4 .-4 -4 -4 -4 .4 -4 -4 -4 C-I .-4 .-4 .-1 -4 .-4 .-I .-.4

03

HZ ICIC-C 1C4Ci-0 N- CC0-.1-4030.1--1.0 0C4.-4InCC00.0--IP.1 C-I N- C-I .0 In In .-4 C-I Cfl C-I

.0 -4 C- ce_i C- C-I C-I

Hrxi-I -.4 .-4 .-4 -4 .-4 -4 .-4 .-4 .-4 .-4 -4 -4

4CICICIC ICIC030 -.1 in N- .0 C.4 In In 003 ..-1-

4J -I N- 03 .403 In .0 CC In In N- In

ci c-I 000004C-IIn 00000C-JC-Im4-4 -4-4-4 -4-4-4

IC IC

-4.- coc-zrc-.4--lc-.I0In.0 0N-CCO.--100HO C-IO\ 0-In-N-C.0CN 0InC.000C-I Cfl C-I 0000000 In 0000 -I C-I C-I

-44111 N- N- In 00 -4 C- 00 cfl .0 .000 -4

-In -In0CC -IOC-IN--4.ON--.InC-I 000000000 0000-40.-I -4-4

.-4 .-4 .-4 .-4 .-4 .-I .-4 .-4 -4 .-4 .-4 .-4 .-i .-I

P-I

I-i

Cfl0 0C1C0In.0.0In InIn00400004C-100 000 .-4iC-ICO0N-N-N-C .-4.-4Cfl0In-I.04In

-4-4-4-4-4

-1000cC 04C-ICfl-.IIn0N-CO 0.-4CICfl-.In0N-00C-ICfl 1.ll InInInInInInInInin

I_i

129

Page 22: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

0-t0cflCOcflLr00

.O -i -J -t CO

_____

N- .-i c-.i cO ri00orrr--Lf Læ CO cj

4CON-NCflrCO0O\ Ci

-.j N- cfl Lfl

N-

-4 .-4 -4 -4 -4 .-4

_____ $-i

-4 Cfl Lf

N- C\

-4

-4 _4 -4 -4 -4 -4 -4 -4 -4

CO N- N- Lr Lr CO Lr

Cr Lfl CO Lfl Lfl Cr Cr

00-44J .-1 -4 -4 .-4 -l -4 -4 -4

-f-I

Cl

r4

Cr .0 N-

Cr -1 N- Cr If Cr000044-JCrCrCl CU

p_I .I

Ci____ CO

41

-4

-4 Cr Cr .0 Cr

N- CO Cr

.0 Cr -U CO CO Lfl Cr

4J

-4 .I -I Cr

__r4

4J 4C biD

CO Lr N- CO

If Lf N- Cr Cr

Lr 00 -4 Cr Lf CO If ID

4J -4 Ci CU

biD II

-i-I 4C

CO N- If .-

.0 CO N- N- N- N- CO 4-i

Cr -z1 N- Ci

_I

-4 -4 -4 CU .IJ

ID

Cl

iD

CO .0 N-

4.j Lfl Cr N- CO Cr r4

Cr 000-4NCrLrLrCi

.4 .-J .-4 rI

r-I

p_I

Cr -zt Cr Lf CU

v-i -4 -4 -4 -I -4 v-I -4 -4 .J

___ CU

N- CO CO CO CU

Cr N- CO Lr

-I 0000 -4 Ci

-4 -4 -4 _4 -4 -4 -4 -4 -4

CU

r1

.i 4-I vI Lr CO .0 .0 N-

If N- 4f4 -1 Cr N- CU CU000v4NCr-.0N-r4.r14.J IDQJH ii

1-4 1-1

p_I

1-4

HGCr .0 N- CO

.0 .0 .0 -0 .0 .0 .0 .0

E-4

El ci

130

Page 23: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

en en 0% .40 0% 0% 0% 0%enCe 0%

cSd c.d dd c.- Of-0.0.-

%_

IS

40 40 0% 10 0%.- c.o fO 400 enlO 4.N 00N4 .- en

1500%

04

P10%15 f-14

.0 01 CJ en 0% 01.2 en enf-I

0% -4a_

0%%001 U.0 _____0%

a- 0% 0% .4 10 en 0% 0%.- 4.- en 10 f-a 0%tn -n 10 IV 44 40 0% IV

14- 0% 40 ..4 .1 0%15.0 0% -4 .-4 en 0%4-I

ta

4-UCISL4.U

________0%

lu IS 4.44 10 r- 01 .-4 10.0 4- a- en 0% 10 .4 en IV 10IS IS 01 Cl 0% en en 10 10 01 IV ICen en 0e -0 en 10

-a 4.- 15 en 10 NJ en 0% ..4 N-CE 0.2 04 en N.01W

000%

0.0-J CI Cl en 40 10 15 en 0% 0%

a- a- 0%01 1. 15 en 0% 10 .-I 15 0% NJ

en NJ .10 --4 -IV-4 0% 040 00 -440 en PJ en IV en

0% III..4 0% NJ .44 40 en 40

LI .2 0% %0en en 40

150004.4.J 0%0ow lanE on 1.4 L4J IS 4.4.2 4.4 141 _.0 30.04.039-COEO 0040 VC 400 1500 44% E0...IEOC.EO4J OE WOn .0 .00 0.03 30 30III 04 04 I. 4- .IWOE ISOE lOnE WOE CI SI

4.2 UI

0%

131

Page 24: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

4- 4-

.0

44 at to at to Utto fl at 00 Ui

.-to dsi .- dd.5at.- 4-S

C______.2

4-

.0

CC at to ID 00 to to tofl It .4 I-.n-a

.4 4-S

-c toat.-

Ca

..-

0.400 at SC 00 0%

.4 at at to to

.-4 sc.iID

_0

4-S .0U0U14

I.- aI.S

.O4-5 a.J 44

.5.5 to fl .5- 5.

.4 .- ID .-l-.-

S. 14 5. at 10 10 as

5. 5.

ti .c at to .0 5.

ats sn CCS to.-4 aaa asac 44 aj

4_S 5. 4105

_____0.4.Eaa soaa3

...4 to ID 1054-S 45

to .5Us to .4 to at at to I.. Ut 4.5

5- ID to to to 00 -5

.5 -4 ID 10 at4.5 to 5.

45 -C5. aea.05.00..ua05.15 44

to -.-a

.-

aaa5.1 US

4J 4-S CC.4 to 10 to at to .5

4-S to to at ID 45 .4

to to CID 4--.0 SD 00 505 to to ID US 5.

10 5. 4.5

.-LU a-.- 5cUS________ a--ao-o .5

.4 .5 aoUs oa

.5 a-a105 -S at .2

4.54.5 4.5 to _S 5.

5. at at at at to .- Cl 44

to .5 50 ID 05 4.5 ..-

a.-.- .a at to 05 to 45

45 to .5 5.55515 to to to 10 to at to

5.4

toa.- 4.5

4.5 115aa 4.55.30 a.U

54 4.5 0.5In In 51 US

Ill

4. OIC In aaO51 -.-

5. 5. III. .551 US 4554 US 504.5

54-SU05ID

as

5. 5. Ia ..s .5 In 44 14 5- 55 45

515 Us aaa0 5. 5- 5- 5. 4.5 ._ 0.44_J OS ..-

4.5 5- Is 5- 5- ..- 4.5a0I. .o .0 Ia -0 so- .-S .l 5- 4.5 5- 4.5 to USa5.aEO...SEO 00 .cOassaCoas a....aa .4-aso aS aO aO S..00.05S..05 a5..4 SU.544

at 4.54.5 4.5 _s as 4.5

.5 0.0 5- 4.5.4 .4US 4.5 44

4-Ca di.-aa.0 .5 .505- 5- aC5- 5.5 554 Si .0 5-

132

Page 25: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

TABLE 10

ABSOLUTE AND PERCENTAGE DEVIATIONS FROM COMPLETE REPORT FOR THREE ALTERNATIVE ADVANCE ESTIMATES

OF SELECTED INCOME AND TAX ITEMS NOT USED AS PREDICTORS OF LATE POSTING TAX YEAR 1982

Money amounts are in millions of dollars

Absolute Deviation from Complete Report Percentage Deviation

Complete Simulated Propensity Propensity Simulated Propensity Propensity

Income or Tax Item Report Advance Data Score Score Advance Data Score Score

Estimate Estimate Method Method Estimate Method Method

Adjusted Gross Income AGLess Deficit 1852.072.4 9748.7 4594.0 2903.0 0.53 0.25 0.16

Salaries and Wages

Number of returns 83105532 170836 117162 148571 0.21 0.14 0.18

Amount 1564948.3 4295.7 3559.5 3561.4 0.27 0.23 0.23

Unemployment Compensation

Total

Number of returns 10105079 57860 51625 56460 0.57 0.51 0.56

Amount 19818.4 108.9 98.3 110.6 0.55 0.50 0.56

Included in AGI

Number of returns 5347.634 32.900 30784 30699 0.62 0.58 0.57

Amount 7089.1 12.7 12.0 11.5 0.18 0.17 0.16

Farm Income Schedule FaNet Profit

Number of returns 932986 4191 3287 1200 0.45 J.35 0.13

Amount 7994.2 10.9 -2.3 -16.1 0.14 -0.03 -0.20

Net Loss

Number of returns 1755632 -5113 -2580 472 -0.29 O.15 0.03

Amount -17822.8 334.1 81.3 -42.1 -1.87 -0.46 0.24

Net Profit Less Loss

Number of returns 2688618 -922 708 1672 -0.03 0.03 0.06

Amount -9828.6 344.9 79.1 -58.1 -3.51 -0.80 0.59

Pensions and Annuities in AGI

Number of returns 8824875 38287 54554 29021 0.43 0.62 0.33

Amount 60122.9 182.4 207.8 85.2 0.30 0.35 0.14

Alimony Received

Number of returns 316617 1251 1724 1385 0.40 0.54 0.44

Amount 1946.1 -122.2 -118.1 -120.2 -6.28 -6.07 -6.18

Estate or Trust Incomeb

Net Income

Number of returns 797391 -25870 -23286 -23160 -3.24 -2.92 -2.90

Amount 6088.8 -319.0 -340.9 -351.8 -5.24 -5.60 -5.78

Net Loss

Number of returns 61810 -5397 -4764 -4548 -8.73 7.71 7.36Amount -342.7 37.5 22.7 18.0 -10.95 -6.61 -5.26

Net Income Less Loss

Nunber of returns 859201 31267 -28050 -27708 -3.64 -3.26 -3.22

Amount 5746.2 -281.5 -318.3 -333.8 -4.90 -5.54 -5.81

Small Business Corporationb

Net Profit

Number of returns 416549 -13482 -12115 -12191 -3.24 -2.91 -2.93

Amount 5580.3 -186.4 -324.8 -296.3 -3.34 -5.82 5.31

Net Loss

Number of returns 421219 -31291 -25037 -23375 -7.43 5.94 5.55

Amount -6426.4 1227.2 765.7 574.8 -19.10 -11.91 -8.94

Net Profit Less Loss

Number of returns 837768 -44773 -37152 -35567 5.34 -4.43 4.25

Amount 846.1 1040.8 440.9 278.5 -123.01 -52.11 32.92

Other Income

Net Income

Number of returns 3703370 -28463 24862 -25754 -0.77 -0.67 -0.70

Amount 7641.0 -424.5 -380.2 -373.0 -5.56 -4.98 -4.88

Net Loss

Number of returns 553778 -38502 -27939 -24953 -6.95 -5.05 -4.51

Amount -17942.3 2403.8 1303.9 941.2 -13.40 -7.27 -5.25

Net Income Less Loss

Number of returns 4257148 -66965 -52801 -50707 -1.57 -1.24 -1.19

Amount 10301.3 1979.3 920.7 568.2 -19.21 -8.94 -5.52

133

Page 26: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

Table 10 continued

Absolute Deviation from Corn lete Report Percentage Deviation

Complete Simulated Propensity Propensity Simulated Propensity Propensity

Income or Tax Item Report Advance Data Score Score Advance Data Score Score

Estimate Estimate Method Method Estimate Method Method

Payments to an IRA

Number of returns 12101016 101.150 84888 77179 0.84 0.71 0.64

Amount 28273.8 238.9 182.7 165.7 0.84 0.65 0.59

Deduction for Two-Earner

Coupl eC

Number of returns 21.689907 136655 126726 124770 0.63 0.58 0.58

Amount 9047.7 61.3 54.9 55.2 0.68 0.61 0.61

Taxable Income

Number of returns 89716570 67444 39341 17128 0.08 0.04 0.02

Amount 1473318.0 6846.5 2976.0 2313.8 0.46 0.20 0.16

Income Tax

lax Before Credits

Number of returns 79348582 50762 27285 1417 0.06 0.03 0.00

Amoun 283926.4 2549.3 946.7 789.1 0.90 0.33 0.28

Tax Credits

Number of returns 18882528 -18745 -20400 -26094 -0 10 -0 11 -0 14

Amount 7854.2 267.5 -208.3 215.6 -3.41 -2.65 -2 75

Tax After Credits

Number of returns 76959571 72677 46981 20573 0.09 0.06 0.03

Amount 276072.2 2816.9 1155.0 1004.7 1.02 0.42 0.36

Total Income Tax

Number of returns 77033971 63653 42539 17645 0.08 0.06 0.02

Amount 277588.5 2612.2 1138.7 1056.2 0.94 0.41 0.38

Total Tax Liabilities

Number of returns 78680509 3076 -5507 -28404 0.00 -0.01 -0.04

Amount 284699.0 2353.7 897.7 811.7 0.83 0.32 0.29

Selected Itemized Deductions

Medical and Dental Expenses

Number of returns 21980291 48403 46155 58787 0.22 0.21 0.27

Amount 21704.6 -275.5 -217.5 205.6 -1.27 -1.00 -0.95

Taxes Paid

Number of returns 33079548 28700 26653 44074 0.09 0.08 0.13

Amount 88032 498.1 452.3 452.5 57 0.51 0.51

Interest Paid

Number of returns 30242954 29373 35097 50923 0.10 0.12 0.17

Amount 121822.6 1795.3 -954.2 -845.7 -1.47 -0.78 0.69

Contributions

Number of returns 30509886 66578 63254 78195 0.22 0.21 0.26

Amount 33467.9 282.7 338.2 325.6 0.84 1.01 0.97

Exemptions

Total Number 232188753 72683 133721 114854 0.03 0.06 0.05

Age or Blindness 14211172 12845 19949 -28421 -0.09 0.14 -0.20

Other 217977581 85528 113772 143275 0.04 0.05 0.07

SOURCE Prepared by Mathematica Policy Research from 1982 SQl Complete Report microdata file

aFlags and amounts were tested but rejected as predictors of propensity to be posted latebThis income source is included on Supplemental Income Schedule which encompasses rent royalties partnerships small business corporations

estate and trust income and selected other sources Schedule income flag and loss amount appear in the propensity equations for all strataThe net income amount was tested but rejected

Clhis variable was not available prior to 1982

134

Page 27: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

NI

0r are-4

cis ON 0lt5O0fl 0-fl7N 0ONINI0NI.74 NI

-4

Ow00ow ON oN.N...oS-4aNIr- N.4ofl0mclN NIariNIaO

cn NI .7 NI Os .7 NI C- CS -.1

NIl ii III IC005.00.0

5.

o-.7 ooa-. os..o ON .0cflNIOdo ordr dcc

NI NI NI .7

50 II 1111111J

5-

sO 05 i- as so .7 NI 5405.0-4 0.7 05 50 CS 05 C- C50 fl .4

o.o. r-.flflcn NINIsNI -4so0so.tNI45 NI07

02

NI tSNI7.7.7ll5 05S0N.S01NI 7-4N.N.NI7S0-4 02 000000 45 00 NI N.sO 00000 .-

II lUll 1-414 III02

-IC- 04NI00s05 flOfl0N.OcnNIr. 05NINIflasO0000-7 ooooddd ddC-I--.dSNI NI

00WI

.7 NIC N.0.7CIOSSONI -3C-7OC-C .7s00a0rN.0.-- dooo-.d. dd0ddd ..d-d5NI 11111 III NI 11111111II

-4 -4

50 50 CSa 0N.OCfl-4N.0fl .N.7404 ea4S0N.Cnr..r-..7

oi NI Csl Os NI NI C0 07 N.54.4 4J 4J 1.4 50 -7 C- NI 0.7 It .7 ..7 505 -0.70

50 ..7 7C0C0a0 00-7050NI4. a0NI0.40.0NI.4 NI 50- NI 50-0 5C

4.4 NI NI7 NI

02

0027 NI

1-

NI.7 .7 CS .-4 CC N.50 45 NI NI SC INI 50.7 40

05.1.4 02 -.0 .7 -7 NI 0000 00oO ii III III III II

507.4

-7.C .4 .7.7 0.7 NI sO NI NI NI 0.0 N-SO

024.0-Crno..7NICs.7.7Cn d-dddd-50 04

00CU020 sO

04.4 flO0U00-.0O flN.C4..t-0.0.0 a5N.Oa0s0OsO

S-.0NI-7UNIO dCNC04 .4 I- II NI

02

50

02 CS a0 0USUN.N.N N.0N.aN.0500U -0.Ca0N.moNI02 oi-..1

50 ONI 45.7 05 NI NI NI NI 077 SOCS NIO NI

..0 0N.NIN..flNICNI 0CflU7O0-.N OSNI-.ttSN.a.N.l0.0 N.054SN.-.N.O0U 4NICNISNI-450021.4 50a-7O NINIs5

02

-0 NI SON 050 NI fl NI 505005050 .7 45.7 70

NI NI sO 05 NI NI NI -7.00 -04 NI NIm-CC-ors

50

4.1

50

I.

50

50 50

50 50

--4

50..4 -4

5050 50 50

.050 .-s NI -7 0N NI S7 UsON .40 NI C-7 0N

NICS C777777777 7505050sO0s.OsOsOC

I-

135

Page 28: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and

NIN- N- 11 I1 .0 N- .0 .0 N- 0.0 N- OS OS

.4..- dcd d-N N..OsO.0N-0N-OSV

Ni NI N- .- 11 NI NI NIIII.1-IOW

ISNI 0ON-VNIOVO ONS0NI.-.NIcSlS OS00S0N-NIUbO -..-WV

N- NS NI NI NSIII III

NI SNS NIV0CINI N-VCSQ0fl-I 0SVCNI NI 1S 1S NI NI N- .0 .0 fl NI NI

NI N- CS NI NI fl

I-lWIa NI NI CS N- N- IS .0 CS ON- CS N-VUV08 NI VI.VN-V.C 0NICI1000.CVN-

VNI0N-OS.SCS CSCCNISONINIC 00000CS0SVC1C NI CS NI NI Ia

NI IS

NI NI NI NI N- OS N- CS NI N- OS CC OS NI IS CS---4 fl .0 0000000 N- .0 00

I-I liii -4

-4lb

-C

NI NI 0NIlIS0C1O0 -CNI-s00000 fl0OSflNIC-CNIVflWV 00000N.-CS0 000.0cSNININI .1CCI-SObO

II III II

oW01VII N-

NI ON- NI-IONIC1C.001I0 mN.mV0m 00C-VN-VCS .0

V-b 00000-IC-COCfl 000.IaNI00.IaOS Nl-NI-NINI00 -4

.4 I1 II IIIi-I _I

_________-4 lb

lb

CS N- C1 OS NI 05 05 CI.O OS Cfl NI OS C1 I-I

-vN-fl V.0CIC1CN-N-0CC 00OC-ICCflN- N-0.0OSC-COS

UI-lU .1I VNIN-N-00V-IN- I1S0N-NICSIa00 IVOS0CV-CCSN-1.1

NI OS N- 05 05 CS NI CIS NI Ia OS

OP -- .4

0..4 NI os N- NI -1 NI NI N- SI-V .Ia NI OS

US mN-.I

______OINI

CS OS .Ia Ia UI N- .3 .3 .0 NI .Ia .0 .Ia .Ia NI IS .0-4UI -COS 0000-INICVV 00005.IaNION.N- 0000-101

ViOl 11114 IICSINI.4 III

C-i

UC .4NI 05 .0 05 NI NI -V .0 CS NI NI N-0 NI Cl

.lbOIZC/ NI Ia NI NI NI CS CS .0 NI

Ia NI

1401

0NI0OS0NI N-N-VN-ONI.0

Cl 0- 000 NI N- N- 00 NI NI 000 NI N- NI .0NI Ia NI CS NI NI

cD 1.1IC N-VICCCCNI0C 0.CN.00.0 0iI1CNICN..-0l.0UUUVUV NI SN- ONINI-VN-00CII 0N-Vc0N-CV NINIOSIlVN-V

-i N- OS.0 00 as OC -I US -C NI

3.0.-I IflNIVVN-/N10Ifl .-COSVC.0C1NI N.I0NiIfl0-CVN- lb

VI-Cl NI VC5005VCCIa N.0N-NI 0SN.NICNIC

UC N- 5$ -i

VN-NI -1 -4.0

I-I

_________ -4

OS Ni ON- CS CS 15$It N- NI.0 .0

-i 0I CVCNCVbIN- .oV.oCCm3.14 NIC NIN-0OSNCOSNINI VVCVmNI 0N-.-NI.--IC.--O .4 .4

8.-I------ III

lb Cl 05 N- Ia NI NI 0OC NI

.F

______ .0

1.1 .3

.-4

5-

_i .4

--.3.30-I NI -V Ii N- 0.- Ni CS ON-0-.NICVfl0N-

NIC 0VVVVVVVV-V CC VOOOOOOOOOI-I

136