addonald rubin datametrics research inc and harvard university introduction sample closeout the...
TRANSCRIPT
![Page 1: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/1.jpg)
EVALUATION OF NEW PROCEDURE FOR ESTIMATING INCOME AND TAX AGGREGATES FROM ADVANCE DATA
John Czajka Mathematica Policy Research IncSharon Hirabayashi Mathematica Policy Research Inc
Roderick J.A Little Datametrics Research Inc and University of California Los Angeles
Donald Rubin Datametrics Research Inc and Harvard University
INTRODUCTION sample closeout The distribution of returns
by posting date is not random hence the nonreIndividual and sole proprietorship tax sponse adjustment must account for differences
returns filed with he Internal Revenue Service between early and late returns vis vis the
IRS in any tax year are processed and posted income and tax items for which advance estimates
to centralized data base called the Individual are requiredMaster File IMF The Statistics of Income
SOl Division of the IRS draws probability Current IRS Procedures
sample of returns from the IMF and uses the data Following the customary method of adjustment
to produce extensive tabulations for government for unit nonresponse the AD estimation is based
clients and for publication in SOI statistical on reweighting of the advance sampleseries The Office of Tax Analysis and the Specifically the returns in each sampling
Joint Committee on Taxation require key tabula stratum see below are weighted up to projections by the end of each November months before tion of the final population of returns in that
the statistics for the full year are available stratum at the end of the calendar year14To accommodate this need the SOI Division weeks after the AD closeout The weight
prepares an Advance Data AD report from the representing the ratio of the projected fullreturns sampled through late September Sample year population to the advance sample count may
observations are weighted to projections of be expressed as the product of two componentstotal returns for the year by sampling stratum the inverse of the sampling fraction the base
For most income and tax variables the weight and the projected growth in the populaadvance estimates have tended to be very close tion of returns between the AD closeout and the
to the final estimates prepared after the end of end of the yearthe processing year For some variables however the advance estimates have differed from Projected CR population
the final estimates by substantial marginsFurthermore recent changes in tax regulations AD sample count
have contributed to an increase in the proportion of returns posted to the IMF after the AD
sample closeout thereby increasing the AD population Projected CR population
potential for error in the AD estimates In ________________________
view of these considerations there is per AD sample AD populationceived need to improve the AD estimating
methodology to project more accurately the
number of lateposted returns and to adjust for Variation in the projected growth by stratum
income and tax differences between early and the second term in the bottom expression is
late returns in each stratum This paper the mechanism by which the AD estimation
addresses the latter of these two needs We methodology adjusts for differences between
propose apply and evaluate new method of advance and late returns For tax year 1981 the
weighting the AD sample returns to improve their projected growth ratios ranged from 1.004 to
representation of complete year income and tax 1.397aggregates The new procedures make use of Because it is so central to the AD weighting
propensity score methods developed by Rosenbaum procedure the stratification of the SOT sample
and Rubin 1983 merits discussion The principal stratifying
dimension is classification based on income
ESTIMATION WITH ADVANCE DATA and type of return The sample design implemented with tax year 1982 provides for 29
The returns processed and posted to the IMF categories or sample codes Twentyseven of
during given calendar year excluding small the categories represent crossclassification
subset primarily residents of Puerto Rico and of nine levels of income and three types of
nomresidents constitute the universe of the return business nonbusiness farm and nonSOI Complete Report CR for the preceding tax business monfarm The nine income levels
year The CR is based on stratified represent combinations of total receipts farmprobability sample of the returns present on the and business returns only and the larger
IMF at the end of the calendar year with absolute value of positive amounts total PATreturns being sampled on continuing basis and negative amounts total NAT computed over
throughout the year The objective of the AD 19 income items Two additional sample codesestimation is to anticipate the CR estimates of set aside for high income nontaxable returns and
key income and tax items on the basis of returns business returns with high net profit or losssampled through late September Given the complete the 29 strata The stratification is
projections of total returns the remaining concentrated on the upper tail of the income
problem may be viewed as one of adjusting for distribution In tax year 1982 for example 82
unit nonresponse where the missing units are percent of the population of 95.6 million
tax returns posted to the IMF after the AD returns fell into the two lowest income non
109
![Page 2: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/2.jpg)
business nonfarm sample codes leaving 18 Czajka Little and Rubin 1984 In part as
percent to be apportioned among the remaining 27 result the AD overestimated adjusted gross
sample codes The sampling rates varied from income AGI one of the key components of the
.02 percent in the largest sample code to 100 1981 sample code definitions Elsewhere we
percent in the six highest income and two observe both small and large deviations posi
supplemental codes Internal Revenue Service tive and negative with no obvious pattern The
1984 16 numbers of returns with interest and dividends
In addition to sample code there is geogra are both overestimated by about .10 percent but
phic stratification with up to three levels in the interest Table 1981 comparison amounts
given year Returns from small states are are underestimated by .47 percent and dividend
sampled at higher rate than those from large amounts overestimated by .64 percent The
states to insure minimum size for each state number of returns with business net profit is
sample However this geographic stratification underestimated by .14 percent and the amount of
is unimportant in the reweighting of the advance net profit by .55 percent The number of
sample and we shall focus strictly on the returns with net loss is overestimated by .18
sample code percent but the magnitude of the losses is
The proportion of returns posted after the AD underestimated by nearly percent Net capital
closeout varies markedly across sample codes gains are underestimated by 4.52 percent while
The frequency of late posting rises monotonical farm net income is underestimated by 1.0
ly with the income level of the code except at percent The number of returns with income tax
the lowest levels and business returns exhibit is underestimated by only .06 percent but the
higher rates of late posting than nonbusiness amount of tax is overestimated by .61 percentreturns within the same income level Table The largest deviations for both numbers of
illustrates this pattern with tabulations from returns and amounts are found on the additional
the 1981 CR sample file These tabulations use tax items The number of returns with tax for
1982 sample code definitions consistent with tax preferences is underestimated by 5.92
the analyses reported below rather than the 21 percent and the amount of tax by nearly double
strata from which the 1981 sample was actually that The error for minimum tax is somewhat
drawn The percentage of returns posted after smaller while that for alternative minimum tax
the AD closeout in processing weeks or is slightly greater
cycles 3952 ranges from .8 and .9 percent in Estimates of percentage error over period
the lowest income nonbusiness codes to over 32 of years would inform the user as to the
percent in the highest income codes all three expected precision of the advance estimates By
types of returns Business and nonbusiness themselves however such error statistics do
returns are most sharply differentiated at the not provide clear picture of the effectiveness
lowest income level their rates of late posting of the AD methodology at projecting that which
converge as income rises is unknown at the time the AD sample is closed
It is clear from Table that the sample code outnamely the income and tax on returns
is an excellent stratifier for reweighting the posted after the AD closeout To examine this
advance sample to estimate full year totals question Table describes the incidence of
prior to the completion of processing If the late postisng and Table compares the AD and
variation in late posting within sample codes CR estimates among late returns
were unrelated to the income and tax items for The incidence of late posting exhibits wide
which AD estimates are prepared then the use of variation by type of income or tax item The
this single stratifier would be sufficient In distribution of returns and amounts by posting
fact however the performance of the AD period is reported in Table for selected
estimates suggests that further stratification income and tax items in tax years 1981 and 1982
is required as we shall see these data are based on IRS tabulations of the
total population of returns so the yearendPerformance of Advance Estimates totals differ marginally from the CR estimates
Table presents comparison of the AD and in Table In 1981 1.43 percent of returns
CR estimates of selected income and tax items were posted late In subpopulations this
for tax year 1981 Deviations of AD from CR percentage varied from low of .9 percent for
estimates are expressed as absolute quantities returns with an overpayment or with unemployment
numbers of returns or millions of dollars and compensation in their AGI to high of 15.71
as percentages of the CR estimates The AD percent for returns with alternative minimum
estmate of the total number of returns was .12 tax Variation was even greater for dollar
percent below the CR estimate If the returns amounts In 1981 lateposted returns accounted
Postedlate were undifferentiated from same for 1.81 percent of AGI the percentage of other
stratum returns posted prior to the AD close dollar amounts in lateposted returns varied
out and if the percentage error in the from 1.07 percent for unemployment compensation
projecti\ns of total returns were constant in AGI to 28.65 percent for alternative minimum
across a\ll the strata then the percentage tax Late posting increased by about 50 percent
deviations for all of the items in the table over all returns between 1981 and 1982 with
wiuld be -.12 percent That the percentage some income and tax items showing larger
dØvations vary widely around this amount increases and some smaller This increase was
indicates\that at least one of these conditions due largely to the lengthening of the automatic
is not met\\ extension from two months to fourIn factIRS overprojected the total high Table reports the deviation of AD from CR
income returns and underpredicted the low income estimates of selected income and tax items as
returns in preparing the AD estimates for 1981 percentage of the corresponding late returns or
110
![Page 3: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/3.jpg)
amounts Whereas the AD projection of total proportion of cycle 3952 returns in the
returns was only .12 percent below the CR population of returns with values on the
estimate for the year Table this 111000 predictors The theory of Rosenbaum and Rubinshortfall represented 8.15 percent of the 1983 discussed in the context of surveyreturns actually posted late Likewise the AD nonresponse in David et al 1983 shows that
estimate of total AGI exceeded the CR estimate the distribution of is the same for early
by .38 percent but this represented 21 and later filers within strata with constant
percent overestimate of the AGI on returns values of px and weighting cells defined
posted after AD closeout For the 15 other with px as one of the stratifiers yielditems the error estimates range from less than unbiased estimates of the distributions of
one percent to nearly 50 percent for numbers of outcome variables This theory suggestsreturns and from percent to 51 percent for stratifying on an estimate of propensity to be
dollar amounts The magnitudes suggest consid posted late Specifically the procedureerable room for improvementmuch more so on defines the binary variable with value for
some items than others In addition while current year late returns and for current yearTable shows some items with comparable errors early returns calculates an estimate px of
on number of returns and dollar amount minimum px by logistic regression of on usingand alternative minimum tax itemized deduc data from the previous year forms or stratations payments to an IRA it includes other with grouped values of px and then uses the
examples where the errors are in opposite direc grouped variable as an additional stratifier in
tions interest received income tax before the formation of weighting classes just as
credits balance due overpayment For items complexity would have been used in the firstof the former kind improving the AD estimate of approach In view of the aforementionedlate returns will improve the projected amount importance of sample code as stratifier wefor items of the latter kind improving the propose calculating separate logistic regresprojected number of returns other things being sions for each sample code if this is
equal will actually increase the magnitude of practically feasible
the error on the amount Decisions required in implementing this
approach include the choice of predictorsProposed New Method and the choice of cut points for defining
Any proposal to alter the stratification of strata based on px The specifics of both arethe AD file must take the current sample design detailed in our description of test applicaas given Therefore changes to the weighting tion below Theoretical considerations in the
scheme must be implemented by poststratifica selection of predictors are laid Out heretion This together with the known substantial Three characteristics determine goodvariation in late posting by sample code predictors for inclusion in the logisticsuggests searching out as potential new strati regression modelsfiers variables that can improve the AD filesrepresentation of the CR file within each of the should be good predictor of the
present sample codes propensity to be posted lateOne known characteristic of late returns that
is not obviously addressed by the current stra should be good predictor of
tification is complexity Late returns tend to outcome variablesy1
include larger number of schedules than do tabulated in the AD reportearly returns Sailer et al 1982 Figure
If this remains true within strata then late the relationship between and the
complex returns end up being represented by propensity to be posted late shouldearly simple returns thereby biasing the AD be relatively stable across adjacentestimates of whatever items are related to the yearsfiling of additional schedules One approach to
improving the weighting of AD returns there Variables that fail to satisfy have minor
fore is to add to the current stratification influence on the weights and hence on the nonrescheme dimension for complexity operational sponse adjustments Variables that satisfyized as the number of schedules attached to the but fail to satisfy tend to increase the
basic tax form One appeal of this approach is mean squared error of AD estimates by inflatingits sinplicity and we entertained it in our their variance without compensating reductioninvestigation of new weighting procedures in bias Finally variables that satisfybecause it would be relatively easy for IRS to and but not introduce bias because priorimplement However empirical investigation year data are used to estimate probabilities ofbased on the 1981 CR microdata file found no late posting which in turn determine theclear evidence of positive relationship weights for current year databetween the number of schedules and the
probability that return will be posted late IMPLEMENTATION OF NEW METHODwithin individual sample codes Apparently the
current stratification already captures most of The application and evaluation of the newthe relationship between the number of schedules weighting procedure entailed the use of 1981 CRand the lateness of posting sample microdata to estimate the propensity
Our second approach was more general in models which were then applied to the earlynature Let be set of predictor variables returns on the 1982 CR sample file to generateavailable for early and late returns and define predicted propensities and construct new weightthe propensity to be posted late px as the classes within the sample codes To free the
111
![Page 4: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/4.jpg)
results of any influence from projection error Subsampling was done at the sample code
which will be reduced in the future owing to level to create analysis files for the eventualnew procedures introduced with the tax year 1983 estimation of stratumspecific logistic regresAD estimates we used full year CR estimates in sions Within each sample code we selected
place of the required projections Estimates of fraction of early returns approximately equal to
large set of income and tax aggregates pre the number of late returns except where the
pared with the new weights and alternatively latter was much below 100 The combined subsamuniform weights within each sampling stratum
pie totaled just under 20000 records with late
simulation of the current method were then returns comprising 48.4 percent The OLS
compared with estimates based on the final CR regression was run on the full subsample with
weights to assess the relative accuracy of the dummy indicators for sample code being forced
current and proposed weighting schemes net of into the equation On the basis of the regresprojection error sion results we dropped from further considera
tion 31 of the 64 variables using fairlySelection of Variables liberal criteria for retention As result
Ideally one would conduct search over all eight of the 28 items originally proposed by IRS
plausible variables perhaps giving special were eliminated from any further representationattention to those which IRS intends to tabu whatsoever i.e neither as flags nor amountslate With literally hundreds of possible
variables however it was not practical to do Estimation of Propensity Modelsso Instead we employed threestage The relationship between propensity tofile
procedure involving first an priori selec late and the predictors was found to varytion of subset of all possible predictors greatly across sample code Hence final
further screening of these variables on the logistic regression models of the propensitybasis of stepwise OLS regression application toward late posting were calculated separatelyof stepwise logistic regression estimation for 14 strata obtained by collapsing the 29
within strata to derive the final set of sample codes on the basis of similar proportionsstratumspecific models of late returns in the complete file The
Drawing on its subject matter expertise IRS collapsed strata are shown in the column
provided list of 28 items to be investigated headings of Table Definitions of the sampleas possible predictors of late posting Condi codes are given in Table
tions and for the choice of variables The logistic regression models were developeddiscussed in the preceding section were through three rounds of alternative modeladdressed at this point The 28 items were also estimation Each round consisted of the
believed to be related to late posting condi estimation of an initial model using pretion After further consideration we specified set of predictors and the subsequentexcluded one of the 28 items strongly related stepping in of additional predictors subjectto late posting because the nature of that to minimum statistical criteria for entry and
relationship was known to have changed over retention The initial model was defined withtime For most of the remaining 27 items both common set of predictors across the 14 strataan amount and flag indicating whether the item The additional predictors included all of the
was present on the return needed to be consid remaining variables plus variables introducedered In addition some of the items could into the analysis at this stage stratumassume negative values requiring at least one specific indicators of high and low income as
additional flag and amount to permit the identi well as several higher order interaction termsfication of distinct effects of net income and The objective behind including common set
net loss In all 64 variables were defined for of predictors in the equations for all 14 strata
empirical testing This required second stage was to moderate the influence of sampling errorof screening upon the equation specifications across
To reduce the number of variables sufficient strata Interstratum variation in the
ly to allow us to estimate sizeable number of specifications was restricted to variables that
logistic regressions without undue costs we exhibited net effects at conventionalestimated by ordinary least squares OLS significance levels over and above the commonforward stepwise regression of dichotomous
predictors In round one the common predictorsearly/late indicator upon the 64 variables noted comprised in eight of the first nine variablesabove Because of the overall size of the selected by the OLS procedure In round two wemicrodata file 144322 records and the expanded this set to include the sample code
extremely skewed distribution of the dependent indicators for strata combining two or morevariable we subsampled the early returns to sample codes plus 10 variables that had beenreduce the number of observations and to more stepped into the round one equations in at leastnearly equalize the numbers of early and late four strata At the same time we excluded from
returns Such subsampling on the dependent further consideration variables that had beenvariable biases the OLS parameter estimates and stepped into the round one equations in fewerthe logistic regression intercept estimates In than three strata In the third and final roundthe latter case the magnitude of the bias is
we dropped four variables from the common set
simple function of the sampling rate so theand repeated the forward stepwise procedure
intercepts can be corrected In the former case The final 14 equations are reported in Tablethere is no correction However in using the The variable designations are spelled out inOLS procedure simply to screen out the weakest Table The variables forced into the
predictors of late posting we judged the bias equations are listed in the top half of the
to be inconsequential table CAPGAIN and above Note that some of
112
![Page 5: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/5.jpg)
the forced in variables were excluded from one four variables to enter and the associated R2s
or more equations This occurred either because were as followsthe variable in each case flag was undefined
in that stratum or because it was effectively Income averaging computation flag 12.58%
constant Except where specifically noted inForeign investment credit flag 12.62
the full variable name flag predictor distin PAT15OK 12.65
guishes nonzero amount coded from no Credit for tax on gasoline flag 12.67amount coded The money amount variables
are scaled in $100000s The intercept and The indicator for overseas returns which we had
sample code indicator coefficients have been excluded from many equations because of its veryadjusted to reflect the subsampling rates used limited distribution entered at the 13th stepfor early returns as explained above raised the R2 by only .02 percent and fell
The coefficients exhibit substantial short of statistical significance Because of
variation across strata refer to Table forthe enormous sample size all of the preceding
sample code definitions In part this can be variables entered with statistically significantattributed to the differing distributions of the contributions but the largest was only 11.7variables across strata together with the fact compared to 2795.7 for the propensity score We
that the coefficients reported in the table are concluded from these results that we had not
not standardized This is especially true for overlooked any important predictors of lateincome variables where the amounts range from
posting propensity among those potentialbarely hundreds of dollars in the low income predictors identified at the outsetstrata to perhaps millions of dollars in the
highest income strata The coefficients of Calculation of Weightsthese variables decrease substantially between The calculation of weights under the proposedthe low and high income strata
new weighting scheme entails three stepsThe most consistent predictors across strata
are ITEMDUC ELOSS TAXPREFFLG NATGTPAT and calculation of propensity scores for allPARTNRLOSFLG With the exception of ELOSS the observations in the AD sample using the
presence of an amount or the value of the amount stratumspecific equations describedwas directly associated with the probability abovethat return would be posted late Overseas
returns had high probability of being posted assignment of each observation to one oflate in the four strata in which we included the propensity classes defined for thatOVERSEAS indicator However we found little observations sample code andevidence of variation in late posting by total
income within strata Only three of the final calculation of weight for eachmodels include PAT indicators
propensity classTo ascertain whether our final model
specification did indeed exhaust the ability of Each observation is assigned weightthe full set of potential variables proposed corresponding to its sample code stratum and
by IRS to predict late posting we performed the propensity classfollowing test We combined the 14 subsamples The propensity score for given observationinto single sample of close to 20000 cases
in stratum isand regressed dichotomous indicator of
early/late posing on the following variables
using forward stepwise OLS procedurep1 BXi1e
predicted propensity constructed from the
14 equations where is the vector of coefficients from
Table and X1 the vector of values on the65 proposed predictors drawn from the IRS
predictorslist To evaluate the propensity score method we
computed two alternative sets of weights basedPAT indicators and
on two different approaches to step The
two methods utilize the same individual propen15 higher order interaction terms
sity scores and propensity classes Method one
defines the weight for propensity class inWe forced in the predicted propensity score and stratum to beallowed the remainder to be stepped in From
the results we sought to learn whether anyvariable added anything beyond the explanatory
power captured in the propensity score CRTjkThe initial equation was estimated as Wjk
AD5jkCYCLE .011 .988P
wherewhere CYCLE is coded if late and if early CRTJk
is the estimated CR population that
and is the propensity score The equation would fall into propensity class of stratum
produced an R2 of 12.53 percent andADSjk
is the number of AD sample returns inNo other variable added appreciably to the
explanatory power of the equation The next the same class and stratum Method two defines
113
![Page 6: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/6.jpg)
preliminary weight as classes in each stratum The potential impact
_____________________
ADSjk/of small weight classes on the variance of the
final AD estimates suggested something like
equal sized classes Because of the muchjk
greater error in current AD estimates of money
where ijk is the predicted propensity for the amounts than in numbers of returns recall Table
ith observation in propensity class of stratumthere seemed more to gain by defining the
classes on the basis of aggregate income rather
The quantity 1/lPijk is the theoretical than numbers of sample returns In other words
weight for an observation with predictedthe boundary between the first lowest and
second propensity classes would be that propen
propensity and the preliminary class sity score which corresponded to the first
weight is simply the mean of such individualsextile of cumulative AGI for that stratum the
weights for the sample observations in thatboundary between the second and third classes
class would be that propensity score which corre
The preliminary weights are rescaled so thatsponded to the second sextile of cumulative AGI
the weighted sum of the sample observations byand so on Upon reviewing distributions of
stratum that is summed over propensitypropensity scores for early and late returns we
classes equals the projected CR population ofdetermined that the distribution of weights
that stratum could be improved by reducing the size of the
highest class and to compensate enlarging the
CRTlowest class Accordingly we fixed the bounda
______________ ries between propensity classes at threejk jk
w.kADs.k five seven nine and eleventwelfths of the
cumulative AGI distribution
The boundaries among the propensity classes
Method two assigns relatively greater weight tofor all 29 strata are shown in Table The
the higher propensity classes than does method weights computed for methods one and two and the
one the more so the higher the propensitysimulated current method are reproduced in Table
scores Method two yields monotonicallyThese weights are expressed as multipliers
increasing weights whereas the weights computedthe growth ratios of equation aboveshowing
under method one will not necessarily increase the relative increase in weight over and above
and could decrease between given propensitythe simple inverse of the sampling fraction
class and the next higher class Thus weight of 1.2 implies that to make the AD
Applying method one in practice would entail estimate of the full year population returns in
projecting the total number of late returns forthat class must represent 20 percent more
each propensity class within each sample code returns than they did when sampled
i.e disaggregating the projected sample codeIt may be noted that in few of the strata
total Unlike the projections by sample code we collapsed two or more propensity classes into
projections for individual propensity classes single class We did so whenever the range of
would have to be made without the benefit of propensity scores for given class turned out
periodic tabulations of returns processed duringto be extremely small In implementing the
the current year One way to do this would becalculation of cut points between classes we had
to compute for each sample code the proportiondivided the range of possible propensity scores
of late returns by propensity class for thethat is from zero to one excluding the end
previous year and apply this distribution to thepoints into 81 discrete intervals We used
projected current year total The assumption ofdiscrete intervals primarily so that we could
stability in this distribution that is theeasily review the distributions of scores but
distribution of late returns by percentile ofwe chose to define the propensity classes in
AGI between successive years does not seem terms of these same discrete intervals It
unreasonable In the evaluation however we happened that in some cases the lower bounds of
used the actual 1982 sample estimates of late two or more propensity classes fell into the
returns in each propensity class to construct same discrete interval When this occurred we
the weights We did so in order that thesimply combined the classes The most extreme
performance of method one not be weakened byexample is provided by sample code 40 where
errors in the projections of the distribution offour of the six propensity classes were assigned
late returns by propensity class This isto the same discrete interval .0075 to .0100
consistent with our desire to evaluate the three
methods in their purest form divorced from RESULTS
projection error as the refinement of projection methods is independent of the method of
To evaluate the new methods we computed three
calculating weights This choice may afford sets of advance estimates of selected Income and
comparative advantage to method one relative totax items by applying the alternative weight
method two and the current IRS procedure in thismultipliers in Table to the 1982 CR sample
evaluation as method one weights will returns posted prior to cycle 39 We then
incorporate more information about actual latecompared these estimates with tabulations based
returns in 1982 than either of the other sets ofon all returns on the CR file The results are
weights reported below
With regard to the definition of propensityThe selected income and tax items on which
classes we decided first of all to create sixthe evaluation is based were chosen because of
their prominence in the AD tabulations circu
114
BOX 1_SOI_RARR_1 986-1 987_00999
![Page 7: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/7.jpg)
lated by IRS They include variables present in number of returns and the monetary amount Forthe propensity equations and therefore contri itemized deductions the current method yieldsbutors to the propensity scores that factor into result that departs from the CR estimate bythe new weights as well as items with no close to 1.4 billion dollars or .48 percentobvious relation to any of the variables in the The new methods approximate the CR estimate to
models This distinction is potentially within 366 million and 221 million dollarsimportant We expect the propensity score respectively
approach to produce improved estimates of For business net profit the new methods and
outcome variables that happen to be included the current method yield very comparable estiamong the variables in the models because we mates However the current method underestiknow that their relationships to late posting mates the magnitude of the net loss amount byhave been incorporated into the propensity i.o billion dollars or 5.75 percent while the
scores We anticipate improvements in the other new methods deviate from the CR estimate by onlyvariables as well because the propensity scores 228 million and 179 million dollars Curiouslyare intended to represent the tendency toward this combination of better loss estimates andlate posting generally However for variables comparable profit estimates produces less
not tested as possible predictors of late accurate estimates of net profit less loss under
posting we have no prior empirical evidence to the new methods relative to the current This
suggest that their relationships with late happens because the current method estimates of
posting are indeed captured in the propensity net profit and net loss diverge from the CR
scores Accordingly our expectations for estimates in opposite directions i.e oneimproved performance among these variables are understates positive amount while the otherless strong Where the new procedures do not understates negative amount and the deviaproduce significantly improved estimates we tions largely nullify each other when the twowould recommend that such variables be tested as estimates are summed This kind of result is
possible predictors in future application of not repeated for any other of the net profitthe new method less loss amounts reported in Table however
For partnership income we again find noVariables Represented in the Propensity Models appreciable differences among the advance esti
Absolute and percentage deviations from CR mates of net income whereas the new methodsestimates are reported in Table for the three yield markedly smaller deviations than the
alternative advance estimates of selected income current method by onehalf to twothirds onand tax items that wereincluded in some form in
net loss Here unlike business income thethe propensity models The table reports as relative magnitudes of the deviations on incomewell the CR estimates from which the deviations and loss are such that the two new methods lead
are measuredto substantially better advance estimates of CR
For most income and tax amounts the advance net income less loss than does the currentestimates based on propensity score methods one method The current method differs from the CRand two lie substantially closer to the CR
by 2.6 billion dollars on net income less lossestimates than do those based on the simulated amount of 899 million dollars The new methodscurrent method The new methods also yield deviate from the CR estimate by 569 million andcloser approximations to the CR estimates of 220 million dollars respectivelynumbers of returns on items where the current
On the whole the new methods yield nomethod produces deviations of one percent or improvement over the current method on amountsmore There is generally little difference of capital assets sales Similarly the newamong the alternative advance estimates of methods show little improvement relative to thenumbers of returns where the current method and current method on employee business expensesanCR estimates are themselves very close item on which the current method yields results
The most dramatic improvements to the advance very close to the CR estimates of both number ofestimates of money amounts occur on dividends returns and money amountitemized deductions total tax preferences and
alternative minimum tax On total dividends Variables Not Represented in the Propensitywhere the current method yields an excess of 737 Modelsmillion dollars or 1.36 percent over the CR Absolute and percentage deviations from CRestimate the method one estimate falls within estimates are reported in Tablu 10 for the three76 million dollars 0.14 percent and the method alternative advance estimates of selected incometwo estimate within 28 million dollars 0.05 and tax items that were not represented directlypercent For total tax preferences deviation in the propensity models The variables notof 205 million dollars or 13.50 percent obtained included in the propensity models but selectedwith current method is reduced to 16 and 52 for this evaluation constitute major income andmillion dollars respectively under methods one tax items frequently included in summary reportsand two Here even the estimates of the number of advance estimates or highlighted in publicaof returns are much closer to the CR estimate tions based on the CR tabulations The resultswith methods one and two than with the current reported in Table 10 show that the gains inmethod Under the current method the number of
accuracy seen in Table for variables includedreturns with additional tax for tax preferences in the propensity models do indeed generalize tois underestimated by 16.7 thousand or 7.43 variables that were not used to construct propercent the new methods reduce this to 5.2 and pensity scores although they do not generalize2.9 thousand The alternative minimum tax to all the variables we examined On advancecomponent of total tax preferences presents estimates of money amounts the propensity score
comparable picture with respect to both the methods often show sizable improvements over the
115
![Page 8: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/8.jpg)
current method rarely do the new methods result for twoearner couplethe propensity scorein accurate estimates For numbers of methods show little or inconsistent improvementreturns on the other hand we find generally relative to the current method On exemptionslittle or at best modest improvement and contributions deductions the new methods
Where the propensity score methods yield fare somewhat worse than the current methodsignificantly better estimates of money amountsthe margin of improvement over the current Within Stratum Comparisonsmethod expressed as the proportionate reduction The weighted AD file is used routinely to
in the deviation from the CR estimate generally estimate not only aggregates over all taxpayersrange between 50 and 75 percent Improvements but also distributions by detailed AGI classof this magnitude may be seen on AGI net losses It is of interest therefore to consider to
on farm estate or trust Small Business what extent the comparative advantage of the
Corporation and other income taxable income propensity score methods extends to the subagand alternative measures of income tax On AGI gregate level Because they use more weightthe estimate based on the current method exceeds classes estimates based on the propensity scorethe CR estimate by 9.7 billion dollars or .53 methods might exhibit less bias but greaterpercent The propensity score methods reduce variance than comparable estimates based on the
this deviation to 4.6 billion and 2.9 billion current method Another issue is whether the
dollars On farm net loss the current method relative superiority of the propensity scoreunderstates in absolute magnitude the CR esti methods increases with the frequency of late
mate by 334 million dollars or 1.87 percent posting We have seen evidence of this in theThe propensity score methods reduce this devia comparative performance of the alternativetion to 81 million and 42 million dollars The advance estimates of items with substantial
improvement for net profit less loss is compar versus little late posting but there the
able with the percentage deviation being results may reflect the ability of our propensireduced from 3.51 percent to .80 and .59
ty models to capture the tendency toward late
percent posting in those particular variablesOn Small Business Corporation income the new Comparisons by sample code provide more general
methods produce estimates with somewhat greater evidence in that sample codes with relativelydeviations than the current method on net high rates of late posting overall will exhibit
profit but they compensate with substantially late posting in all variables Finally the
smaller deviations on net loss lie net effect application of the propensity score methodis such that whereas the current method yields entailed the estimation of separate equations byan estimate of the small and volatile net profit groups of sample càdes with differentialless loss that is over one billion dollars wide results It remains to be seen whether there is
of the mark the new methods produce estimates any evidence to support changes in the specifthat are within 441 and 279 million dollars On cations of the models or the grouping of sampleother income the new methods generate slightly codesimproved estimates of net profit and much more Table 11 presents the CR estimate and
substantially improved estimates of net loss percentage deviations from that estimate byOn net income less loss the current method sample code for three of the items from Tableunderstates the magnitude of the 10.3 billion and three from Table 10 The items selected aredollar aggregate loss by 2.0 billion dollars or dollar amounts and they include items on which19.2 percent The new methods reduce the the propensity score methods performed substanmargins on other income profit and loss to 921 tially better than the current method as well asand 568 million respectively or 8.9 and 5.5 items on which the propensity score methodspercent produced no aggregate improvement By including
Perhaps not surprisingly the improvements the latter items we seek to determine whetherfor taxable income are very comparable to those the lack of improvement on those items characregistered for AGI Here there are parallel terizes all sample codes or whether it reflects
improvements for both number of returns and systematically better performance in some samplemoney amount Comparable improvements are codes and worse performance In othersrealized for income tax before and after As reported above the aggregate AD estimatecredits total income tax and total tax liabil of total dividends received is 1.36 percentities On tax after credits the current method above the CR whereas the estimate based on
produces an estimate that is high by 2.8 billion propensity score method one PSM1 falls withindollars or 1.02 percent Propensity score .14 percent and method two PSM2 within .05method one reduces the error to less than 1.2 percent Reviewing the results by sample codebillion and method two lowers it still further we find first of all that the simulated ADto 1.0 billion or .36 percent of the CR esti estimate tends to overstate the CR estimate bymate Even the estimated number of returns an increasing amount as the sample code incomeshows proportionate Improvements on this order level and with It the rate of late postingalthough the current method estimate is itself rises that is from codes ending in to codes
very close to the CR estimate differing by less ending in There is no such pattern amongthan onetenth of one percent the propensity score estimates although there
Small improvements are recorded for Interest Is suggestion of Increasing variance thepaid deduction and for payments to an IRA and estimates being somewhat farther from the CRsmaller still for salaries and wages medical estimates in both the positive and negativeand dental expenses and taxes paid deduction direction as sample code rises Within theOn other Itemsunemployment compensation nonbusiness farm returns 5Os codes the
pensions and annuities alimony and deduction current AD method performs as well as method two
116
![Page 9: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/9.jpg)
overall and except for the two highest codes amounts differ little from each other the
also as well as method one Within the 40s and sample code estimates tend to be similar as
60s codes the propensity score methods perform wellat least as well relative to the AD method in Estimates among nonbusiness farm returns
the high income sample codes as they do in the tend not to show the level of improvement as we
low income codes find among nonfarm returns This may be
The AD estimates of AGI exhibit to an even attributable to the propensity equations
more pronounced degree the pattern observed for Because the samples within the farm codes are
gross dividends The propensity score methods quite small we combined nonbusiness farm
also show evidence of increasing distance from returns with nonbusiness nonfarm returns by
the CR estimates with rising sample code in the income level based on comparable rates of late
nonbusiness nonfarm and the business codes posting in estimating the equations and we did
but the growth is less rapid than for the AD not attempt to estimate farmspecific interacestimates As result the propensity score tions other than through the interceptsestimates increase their advantage over the Perhaps another strategy for aggregating farm
current AD estimates as the propensity for late returns would have yielded better predictions of
posting rises Within the nonbusiness farm late posting among farm returns This would be
returns method two falls increasingly below the an issue for future research
CR estimate as sample code increases while
method one exhibits no clear pattern DiscussionOn itemized deductions the strong aggregate In assessing the implications of the evalua
performance of the propensity score methods tion it is very significant to note that the
relative to the AD method is seen to be the gains in accuracy realized with the propensityresult of superior performance within small score methods are by no means limited to variasubset of sample codes together with small bles that were used to construct the propensity
overestimates in very large strata counterbal scores One appeal of the propensity score
ancing the underestimates in most other approach to weighting is that an adequate model
strata The propensity score methods provide of the underlying propensity phenomenon should
closer approximations to the CR estimates in permit improved estimates of whatever observed
code ranges 4245 and 6063 Within the farm variables exhibit such tendencies In the
strata the AD method is if anything somewhat present study the empirical development of
closer to the CR estimates than are the propen propensity models was restricted by an priori
sity score methods selection of potential covariates Yet the
On total tax preferences the propensity score method yielded improved estimates of keymethods produce much better estimates than the variables outside this basic set Refinements
AD method through sample codes 4346 and 6267 of the method in this particular context would
which account for most of the aggregate amount broaden the set of variables searched to developof total tax preferences Within the farm codes the models adding variables for which furtherthere is again basically no advantage to any of improvements are desiredthe methods Another aspect of the results that surprised
On salaries and wages and estate or trust net us was the marginal superiority of method two
profitsitems on which the aggregate estimates over method one Method one used the actual
of the AD and two propensity score applications complete report estimates of the numbers of
are fairly comparablethe sample codespecific returns in each propensity class to construct
estimates are themselves very similar especial the weights by propensity class Method twoly on estate/trust The top two or three codes like the simulated current method used only the
for each type of return generate the largest sample code population estimates relying on the
errors but below that there is little differ propensity scores themselves to generate the
entiation by sample code in the average magni differential weights Yet the results presentedtudes of the errors in Tables and 10 show the following
This brief survey of results by sample code performance of the two methodssuggests several conclusions For those items ________________________________________________
where the aggregate estimates based on propensi Method of Comparison of Returns Amount
ty score methods are substantially closer to the
CR estimates than are the simulated AD esti Variables Included in Propensity Models
mates we do not find this level of performance
repeated in every sample code On the other Method better than
hand neither do we find that propensity score Method better than 11
methods achieve their lower overall bias with
higher variance at the sample code level In Variables Not Included in Models
approximately threequarters of the sample codes
the estimates based on the propensity score Method better than 18 22
methods deviate less from the CR estimates than Method better than 11 11
do the estimates based on the current AD
procedures Moreover the relative superiority On the whole the estimates based on method twoof the new methods tends to be most pronounced are superior to those based on method onein those sample codes where the current method Since method two is more directly applicable in
produces the largest deviations from the CR practice than is method one these results
estimatesnamely those sample codes with high suggest that the gains observed in this
proportions of returns posted late For items evaluation should indeed be realizable inwhere the three advance estimates of aggregate practice
117
![Page 10: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/10.jpg)
SUMMARY maintaining or even improving upon the accuracy
of key variables for which the current magniThis paper has proposed and tested new tudes of error are already comparatively low
method of preparing annual estimates of income
and tax statistics from an advance or truncated FOOTNOTES
sample The problem examined here is one of
adjusting for unit nonresponse where the Improvements to the projection methodology
missing units are tax returns yet to be added to were the subject of another research effort
the sample and the missing returns are known to under this same project The results of
differ from the advance returns on the variables that effort are detailed in Czajka 1984afor which advance estimates are desired No 1984b 1985information is available from the missing
returns for the current year but all returns Personal communication from Ray Shadid IRSare available for previous years Currently IRS Statistics of Income Division
projects the total number of late returns by
sampling stratum and computes stratumspecific The implied individual weight grows very
weights to inflate the sampled returns to the rapidly as approaches unity To avoid
projected totals The proposed new method excessively high weights for the top
involves differentially weighting the returns propensity class we computed the class
within the sample strata on the basis of their weight as the inverse of one less the mean
estimated propensity to be late Individual propensity
propensity scores are computed on the basis of
logistic regression models of late posting The intervals were as follows zero to
estimated on the fullyear sample from the .0300 in increments of .0025 12 intervalsprevious year Within each sample stratum the .030 to .040 in increments of .005
advance returns are then divided into six intervals .04 to .40 in increments of .01
propensity classes and each class is assigned 37 intervals and .4 to 1.0 in increments
weight translating the propensity into of .02 30 intervals We divided the
projected increase in the number of returns of intervals this way because of the extreme
that type bunching of propensity scores at low values
Weights for the new method were estimated in strata with low mean propensities
using tax year 1981 data and then applied to
data from tax year 1982 Estimates were Note that the net loss amount was included
prepared from the advance sample using two among the predictors in the propensitydifferent methods of reweighting the propensity equations whereas the net profit amount was
classes Deviations of these estimates from the not Net profit was tested but rejected as
complete year estimates were compared with predictor at an early stage of the model
deviations based on assigning single weight development possible explanation is that
per stratum the current method The new the relationship between net business incomemethods produced estimates that were generally and late posting changed between 1981 and
much closer to the final estimates for variables 1982 In further research using 1982 data
included in the propensity equations Estimates we found net profit to be significnatof dollar amounts showed much greater improve predictor of late posting in number of
ment proportionally than estimates of the number strataof returns on which those income or tax items
were present Currently there is greater error
on amounts than on numbers of returns Because ACKNOWLEDGMENTSthe propensity equations are intended to
represent the general tendency to be posted This research was performed under contract
late improvements are expected for variables with the Internal Revenue Service The authors
not included in the equations as well as those wish to thank Fritz Scheuren Director of the
that are included Such improvements are Statistics of Income Division for his valuablelimited by the extent to which the equations do input to the research the assistance providedindeed represent general propensity We by members of his staff and his helpful
registered improvements on several variables comments on an earlier draft The assistance of
that were not included in the equations Lucia Wesley who typed the manuscript is also
Variables on which no improvements were recorded gratefully acknowledgedwould be prime candidates for inclusion in
future model development as they reflect
aspects of late posting not captured in the REFERENCESmodels presented here
The results demonstrate the value of Czajka John Evaluation of the Projectionsreweighting on the basis of propensity scores of 1983 Individual Tax Returns MemoranThe methods tested here could be applied on dum Washington Mathematica Policyregular basis to the preparation of advance Research Inc.1985estimates and provide sizeable error reductions
for income and tax items that are poorly esti Projections of Total Tax Returns frommated from advance data using the current an Advance Date Evaluation of Alternamethodology With further research and devel tives Project Report Washington
opment the method could be tailored to provide Mathematica Policy Research Inc 1984aimprovements where they are most needed while
118
![Page 11: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/11.jpg)
_________Revised Projections of Total Tax ______ Statistics of Income1981 Individual
Returns by Sample Code Tax Year 1983 Income Tax Returns U.S Government
Memorandum Washington Mathematica Policy Printing Office 1983
Research Inc 1984b
_______Advance Data 1981 Individual Income
Czajka John Donald Rubin and Roderick
J.A Little Modeling the PropensityTax Returns Mimeograph WashingtonInternal Revenue Service 1982
Toward Late Posting of Tax Returns An
Application to Tax Year 1981 Project
Report Washington Mathematica Policy
Research Inc 1984 Rosenbaum Paul and Donald Rubin TheCentral Role of the Propensity Score in
David Martin Roderick J.A Little Michael Observational Studies for Causal Effects
Samuhel and Robert Triest Imputation Biometrika vol 70 1983 pp 4155
Models Based on the Propensity to
Respond 1983 Proceedings of the Business
and Economics Section Washington Sailer Peter Charles Hicks Dave Watson and
American Statistical Association Dan Trevors Results of Coverage and
Processing Changes to the 1980 Individual
Internal Revenue Service Statistics of Income Statistics of Income Program 1982
1982 Individual Income Tax Returns Proceedings of the Section on Survey Research
Washington U.S Government Printing Methods Washington American Statistical
Office 1984 Association
119
![Page 12: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/12.jpg)
4-
4-
fl
fl L-
flfl
fl
4-
LA
fl fl fl%
fl
t. as
Lfl fl
0-fl
4-
50 50 50 50
LU 00
_______I-
LU
WL
on
____________
-L 50 LA 50 50
0%- fl
%5
U- ____fl %fl
fl LA
05 05 50
LA
.._0
___-J
04-
LA fl 50
fl %flfl fl LI
0Cit
4-
LULU
LA fl fiLA
___4-
LA
5050 LA50 LA 00
50
50 4-
505003 OL
LU51 05 CL
4- CC-J
.5
oi4-
LAO 05
.- 5050
U- 30
4-4-4-
%-LLoo4-
LU
on 40LA
05 00 00
04 .... 04
4- fl
00LU 08 08 00 00 00 4-0 4-
d-o8 80 8o 00
-0I.
LU
fl fl LA %fl
00 SI fl sO LA
SO 44
4-
fl 4-
Cj
020.5
4-
UL4-On
05
fl
fl .- 0O SO 50 SO 4-
00 00 o8 .5
A- 0- A-V8000 00 00.000LQ 00 00000 00 8800 0000 00 00 LU
fl fl
.2 sO 00 SO .0 fl
-.1
![Page 13: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/13.jpg)
OcC_ sIN-c La5nNa 54n05 LaN-a.9 N.LaLaOn.5La a.LaLas-sLaaoo C94c50 40090 LaLa0 aLaLaNa LaSn0ncs-sSnN-.0 nLaU
a. so 5--s e- Na sa La assaL 1151111 ii I_a
Sn
4-
ao.2 .4 N- Sn Sn La La La Na Sn so La La La so
-sii40 Sn IS N- N- Sn Sn La La La N- NaN- N- Sn N- Sn 519 119 Na 00
.- La Sn Sn 50 La Na Sn 09 Na .-s N- 1a
-s so so La _s La 10 .s s-I La 40 5-a Sn Na 010 Sn La
Q- s0 NanNaLaN- U9a5-.aLa ID nsa r- 01 Sn La Sn La Sn La La .jt ID SJ Sn
10 La La La La Sn La s-I La La Sn La Ill 55 La La Na La 10
s1LaC LaLaS aa oSn LaaS
N- La La La Sn Sn La La La La Sn LaN- Na Na Na NJ
.0
La N-La NJ La s--I Na La Sn La La N- La 50 La La
co sn.-.a nSn_s sn- LaN.L0n e-a.Lasnaau9NaNa rntna.osnLa 104- La Sn N- La La La N- Sn N- N- -IN-N-a 10 14 .s NJLu Sn -Ia La 10 La Na La La SnN- sn La La La Na La 55 La s4 515-s--s..--
OP La La Sn Sn Sn La La Sn .1 NINa to 50 La fl s-s La La LaN- La Sn La Sn La La La La Sn LaN- -e -.5 NJ Na Na 5-a
La to
Lu
s--La
451_a.La aoaLsJw 0015I-- 0-
4-1 45 Na 50 La to-N- La La La La N- La ID La La Na La 10 r-
IC La Sn La aLa Na Na Na 519 Sn
ddd ddc dddd dd11111 11111 III liii
N- La La Sn Na Na to NJ La La La Na .-a 10 41
..- In s-sNa Sn Sn Sn to- La to- Na La Na UI La La Na s-s
440 Sn N-nLa NaLaN- sOs-5 N-LaN-0 Snnn4OLanLaNa Las-SLaLaSnO9d Sna.N 41
.44 .a La s-S La Sn s-s La ann Sn naIliii .g .0
SJIS5In
U-
s-S N- La La s-a ID Sn La 519 as-s .o La La s-s .0La .0 Na0 La Na La La Na N- La On La Na Na a-i
s-i La La ID N- La .a Sn N- Na La ISa Na Sn 50441-
4- La La La La sI 519 Sn N- Na La N- La La tO La N-n N-
.0 OLaLa SnSnN aLaLa oLanN-sn N-Lan.-.N-.--ILa-.c- NaSnNaaNaNJ La La so ann La La N- N- Na La La ID Na .- N-
La DI 10 10 Sn N- Na sS Na Sn La Pa Sn La 01 ID LaSn0 4-On Sn La N- N- NJ 15
Sn La N- N- La 10 N- N- SnaSn La N- La La Nan 10 La 10 N- Na N-so La La 10 La La Na Sn La Na ss La La Na La La Na N-
La aa.La NaLa aLa LaLan.Sn NaLaLanLaLaLO09 OsonpaLaNnI a.La LaNJN a0Na La0Iaa asn5-Inunsan.-l NNOLa.aa s.snLa OLaSn OnNaIDO nN-LaSn.-sa.aN-a SnNaLa0InNJ N- La La to La La N- ID N- Na 551 La 10 Na s-I .-5 I0N-s-4
s-s
La La Sn 50 Na Na Sn La Na Sn La La La NaLan_s Sn La N- N- N- Na N-
________ Na
Sn
Ins.445 -s- .0
9- -D 00_s_a 1-
.0 IOc isSnSn Cs- 4- La
In La Inc05 ao cc Sn 15 05
50 Sn 55-0
LII Sn 4.1 5444La -s-a na 45Es-I.ulac 4015 a-.0 Sn LII 14 s- III 15 is
4-I _J Sn Sn s-i 0._a 44 Sn 4-
La I- a- a- is
-.- 515 0.a In 44
ala 51 OSnCs- 5.5.9- s-ais csna 10 aa.sa. ao oa rsswa
Sn _i 4.1 51 La -J Sn Sn .J CO I544 4-
Sn ao -_ aa a-s-EUso-c xa s-a 4.5
aI 44 44 45 a-I 044 51 a-saL ca-oc- a--- -i-ac-oooao u-s-s Is-
.0 us- Sn Is LI Sn II- 5- a- a- Is 4400 15 II 4-5 .J0Sn0 -s-515-s-00a0L.05150 ace OaUISaa-o a.4-s-415-01.-s-15O10UI.O5-0Sna-Snass1aU II--c Cs
a.4oSna_a oa..a000...a CO coa-sa oos_ae Sn
---a aa o.ca.aa-s-a-s-a51 In 454.544 Sn 444.141 .44-5444 a-Is- 0.0 44 4-4 4450 aaa-.-QswaaaaaeaaacsasnEEEcoo.s-o-s-a
44I 51 .XXX a- XC 154-5004-3 10 tO fleE UCa.jEc00 0100 LaLa 00 La L1. 0. La 0. Ui II- La Is
121
![Page 14: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/14.jpg)
Cu 440 1.-I LI en NJ N- .I NJ NJ NJ en en in 4008 en en en44 Cu 0508 N- 10 en NJ N- 08 N- Cfl NJ NJ 08 00
_C._ enN 00 -4.- C5310.0 NJ .-I
Cu 05 N- fl Cu 100 fl 080 08 00 fl en 4005Cu ..I ..I Cu NJ CO NJ N- 04
-c lOU 08 NJ NJ In en CO en CJ fl 00 -I US 008 00 -008 -en en .0 -08 -08
0511 00 NJ en en 400 0511 NJ N- 400 -40Cu 08 -4 en en N- NJ N- 08 C4
Cu 4.4 en Cu NJ N- 10 10 Cu N- .4 _1_4 08 08NJ Cu .U08 NJ .4 -4
0844-In-C
.2
4-I
Cu nN-en N-N- .4 fl In 0440 N- en 05 NJ 08 LI 05Cu 4- en N- NJ 10 10 N- NJ
Cu .0 l0 05 .C CSCu ON- N- 500% 11 P4 In N- N-In 10 NJ 005 en NJ en N-LI N-C-- NI OS en Cu LI Cu NJ P4
Cu U.4 NIN- NJ COO UNJ N- ION- 044
Ifl enN- .I LI 08 08 NJ N- LI NJ P4 N-05 08 en
.-4 .-
Cu I.a ID en 40 001 N- NJ ID N- NJ 40 001 N- 08 00II N- Cs en Os en
80Cu InC 4101 NJ N- ID 05 10 0% N- 10 N- IOU N- en IC
Cu en .0 en N- .I .05Cu 4-I en 0.Cu 05 NJ 08 en N- -0 00 Cu N- fl 08 05 en 08 40Cu 10 01 N- -4
0-I- LOU 8080 LO UNJ Cu0 N-OS N-en N- 010 0805 ..-I en 08
08
I-001
Cu en 10 010 10.-I en NJ 00 010 ON- OSLO 400 NJ.-U USC OIC .Cu.Cu NJN- IC Cuen 058 0-I 10 40
UI Cu .4 en en LI NJ en-_
LIJ 0.414
N-
4-Cu Cu0 NJ050 P.1-I 005 1N- Cu OU IOU
2-080 14 In NJ NJ0%
Cu .0114
Cu00-V
CuN- .4 NJ .-I .I N-.-4 N- NJ LI en 08 80
CO .0 s- en 08 NJ N- NJ 104- Cu NJ 08 10 NJ Cu N- 0110 NJ 10 NJ N- en en N- NJ en 0%
114 NJ N- 08 NJ N- 05 01 N-
4-IUU 010 N-U 110 NJ NJ N- en 08 In NJ NJ 10 10-4 N- en
Cu II 451 en N- 05 08 01Cu 40 N-C 400 040 400 0140 1010 040 -I NJ In 4001 10
-4 Cu NJ CuCuN- PJCu 80 05 N- NJ CO NJ en .-NJ en0_ -4 -4
Cu 154 08 NJ 0110 en 011 N- 10 N- 400 en 1009o5 QU e.4 csi
00 0g NJ NU .4 CJ154 ID N- 08 NJ en .-4
100 80 en en en 08 05 NJ NJ N- N- .4IO NJ NJ .-I0 00N- N- Cu en IIN- en Cu 01.-I 10 CI 08 NJ N- NJ en NJ NJ 0% Cu
05 Cu .I en 08
Cu -ICu 40
Cu -.4 VI
UI Cu In
Cu VI Cu
Cu -.4
Cu ..4 4.1
Cu Cu 10
Cu 444 IA 4-I VI UI IA IA 4.I Cu VI Cu Cu VINC ...IC Cu Cu CC44 USC 01 04. 4- lI-I. .144- 04- 0.4- 4- 01
CII 040 00 00 CO CO 03 -.0 -344 4.4 Cu 44 41 .41 Cu 44 44 044 44 4.144 4.1
DVI Cu Cu OW 3.01 1001 CuCu Cu 4-10 CCu 00 Cu
44 VI Cu 00 00 .00 Cu 00UI 4.1 4.1
4- II- Cu II- 014- 4- II- CII- 14. II. 044- CuII. 014-
CuO 00 0- Cu0 440 00 0000 4-4.1 4-I 4. 44 4-44 VI 4. 41 1-4.4 44 4-44 Cu 1- 44 4.4 VI 41 44 4-I
UI UIWC 44100 CUIC VIGIC 0010 0010 000 OCuC 4.IUIC CuWC 00101-41 -.Q 01.0 .0 CI .0 .0 fl 0.0 .0 CO .0 .OWIn 4-CO 4-CO VICe CEO 4-CEO CO -CO 0_CO CuEO CO .000.00 CuOC woe VIOE sOC 30 EOE VIOE E0C FOE EE 100
I- 4.4 lC VI 4.4 1. CI Cu IC
4- Cu Cu Cu 44 Cu
UI 0814- 4--
122
![Page 15: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/15.jpg)
05 en Qt CM N.041 CM In Si at SJ en
1.76-441 NJ. en to 000.0._i 54 en
.0
ate 04 0-4to
..i at ctJ en CM to NJ en to to OOten -0 -to CM -en-.4 -40 to en Den en ..4
en at Si CMto NJ CM 54 05 CSJ
44
.0
s_C 4-50.4 at
.0X.0 C. -4 .40 CS Ca 00
CM en4- to 05 CD to .760iJ to en Si -to NJUS at en en 40 NJ 44
fl en at .0 4.
400 to en CM .76
NJ -40 .0CM U-s at -4
-.4
.- 4.
00
04 CM 0N.to-UatU.enN44 CM en en to .a
en 00 0CM to Qt 4-44-4 40 .U0.4 at 0CM 04 00 en 45
44 en at0-I CS CM 454 fl .4 CM at en_Ni to
atNJ to CMCM
44
4U at Itt .0 44en to 0. CM too 10 .-
U4J4- 4-CM CM en en NJ 0CM 4400_I -4CM 7.4CM
4-
NJ
.4. 5.U at .- 0% too S.en NJ en 05 45
._ 00 toO 400 050 40--en
to to to en 14 at en ON en10 04 .0 44ON en to to 7-7 45
44.4.4 4-
at -444
.0SC0044-
.0 CMat NJ en 40 10 ON5.0 OS en
at CM to to 00 enUi ...4 NJ -0
i-I CM 010 0. toen CM
CM to en fl SJ-o -en -u
CM ID CM.7-
..4 CM CD.7 .044
.4-
44en 040 at en at CMCM 0% CM enoat en en en at en en Oat
45 -10 05 en CS0.0 4.4 0CM CM to to to444 -4 CM
to en 01 040-en -UI0CM to en at --
NJ to
.4.4
4.4
44
0.4
04
444.4
44
In ut 44 44
.4004 USS 4-45 CL 4. 5.
5.0 4.40at04.7 4.4 04.4 C4 4-4
.7- 4.0 44 0441 .4
.74.
4- ii4. 114 014 041- 4414.
4.40 150 40 Co.4 4. 7-
5.4.4 55.4 4.45.4 4.4 45.4.4 1.4.4 04-4en 00 540 4400 4440 460 64440 460 CV
0.0 .0 4-4.0 an on .0 .0EEO E0 4410 EEO 1.00 cEO 0.60 .40044 4500 0041 .0E 64444 500 ..4E.0
4-In 44
![Page 16: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/16.jpg)
TABLE
DEVIATION OF ADVANCE DATA FROM COMPLETE REPORT ESTIMATES
EXPRESSED AS PERCENTAGE OF LATE POSTED RETURNS
AND AMOUNTS TAX YEAR 1981
Percentage Error Percentage Error
Item on Late Posted on Late Posted
Number of Returns Money Amounts
Total Number of Returns 8.15%
Adjusted Gross Income Less Deficit 21.07%
Salaries and Wages 1.03 13.51
Interest Received 6.81 13.90
Gross Dividends Received 5.88 12.02
Business Net Profit Less Loss 0.88 19.83
Net Capital Gain Less Loss 19.66 31.95
Farm Net Income Less Loss 20.19 35.71
Pensions and Annuities in AGI 15.54 30.33
Payments to an IRA 29.66 22.09
Itemized Deductions 9.35 11.32
Taxable Income 3.83 18.24
Income Tax Before Credits 3.50 26.99
Minimum Tax 38.64 40.35
Alternative Minimum Tax 49.17 -50.55
Balance Due 32.55 11.53
Overpayment 19.13 7.92
SOURCE Tables and Entries are absolute deviations from Table divided
by quantities posted late from Table then multiplied by 100 percent
124
![Page 17: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/17.jpg)
4C
cc en cc c.i in en c. in
N- en en .c in N- in
.0 C1 in .0 -4 in en
le 9C IC IC
en in ri cc in .0 C4 in .-4
en en cc e.i cc en in in
.0 C.I CJ ill
IC IC IC IC IC
in in in N- .0 cc .0 N- .0
-4 -4 -4 in cc en cc
.0 en e.i c.j c.i e-i in en
IC IC IC
en in r- cc .o in N-
.o en ci en a- N- .0 N- it cc
.0 -4 en .- en C4 en c-.i C4
IC IC IC IC
e-i cc cc a- N- IC N- in
NI en in cc N- N- .0 NI -4 cc .0 .0 N-
.0 .0 .0 en NI en en .o N- in N-
NI
IC IC IC IC
.0 .0 ..O IC cc in IC
en a- cc cc in cc en en en .o
en en en r.i -I ci en c- cc
en cc in
Cl IC IC IC
cc NI in -4 NI N- ...4 en en en
N- cc N- cc en cc c.i cc cc .o in N-
in in -4 -4 -4 N- NI en en NI -4
__________IC IC IC IC IC IC IC
a- en in in c-i en in .0 cc
4-4 .0 .0 cc en .0 NI cc N- in -4 N- NI
in .o in en en in cc
_____IC IC IC IC IC IC IC IC IC
in N- .-4 N- .0 .0 en -4 -.t it
in in .0 in N- N- N- NI -4
in NI .-4 en en ..- c-i en en
-.1 ra
E-4 IC Ic IC IC
in N- in en en in NI .0
NI -4 N- -4 cc en .o en in
in -4 in en en en N- NI en en
I.4
-l
I-I
rx4IC IC
cc .0 -4 en cc in en NI in
ri en N- cc .-i en in cc .o N- -4
in in cc en in en en en -4 NI NI NI cc N-
IC IC IC IC
cc cc in en in
-4 en in cc .0 -4 in NI in N- en en
in in cc c-i -4 NI -4 N- -4 -4 in cc
Ia-4 en
Ia Ic IC IC IC IC IC IC IC IC
N- en in in in .0 cc cc cc N- .0 NI NI
cc .o en en c- in .0 NI en in NI NI
en .o en .o en CCC c-i c--i
4-4
Cl _____4-4
a- en -.0 NI N- N- N- en in
cc a- -.0 ci in c--.i c-i in in N- cc in cc
c--i cc en .0 NI c4 cI N- en
___Ia Ia
1.4 4- Cl
OW WQ Ia Ia Ia 1-4
1-4 Ia CE CO 1a4
Ia P. I-I000 -4 1-4 E- 1-4
-4 E- P. Ia 000Ia -H Ia .-
-4 1-44-4p.4 c-i en P. 1-4 Ia 4-44-10 Ia 00
125
![Page 18: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/18.jpg)
-4
rcc cc.0 cc .0
-IC
cc r-
.0 cc
.01.J
-4
-IC -IC
Lr r4
lr --.0 .0 c-sj cc N- Ct
4-4
-K .I
.0 Ct
N-
.0 c1 Cl
______ .0
-IC -K-IC
cc
N- CN I- 4J
csJ cc Ct
-IC Ct
-IC
cc N-
Cfl
cI Ct
Ct-K -K IC IC -IC -IC
N- .s-
cc cc cc cc N- 4-4
Læ Ll .i r1
CJ
-IC -IC -IC -IC
.0 N- Si-I
0.0 Lfl Læ -l 4-i
..Lr4 cc Ct
-4
-IC -IC -IC IC
cc .0 Ct r4
cc Cl
.Lrl ccCt
CIC.I Cl
-IC -IC -IC
.0 c-j I-
.0ri in4J
--
-IC
cc -IC LUCI C-I ri
I- IJLr in .0 cc
________-4 _C
1-J--l
Cl 4-4-t -t
4-I
iI CtW-I Ct
_______ 1-IC-C
-IC -IC IC cawN- N-
00 N- N- N- v-4 .r0cc
--I
s-I
s-
cc-I Cl Ct
N- cc ccin -5
-4 cc______ C- Ct00 uct00Cl IS -IC i4OW Cl Cl 1-i
L1 00 00-K 1-4 CJ rSCCl -i i- 00 i-1 1-IC
--Sct Cl IS CI 1-40 Cl 1-4 in OCt Ct$iE-t Cl 001-4 1-i N- N- C- N- E- -IC 4-14
1- Cl 1-4 F-I 1-41-4 U-tCt ICt 10 i-i OWQ-r s-I 1-40 VC.Z -ICr5
126
![Page 19: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/19.jpg)
TABLE
MNEMONIC DESIGNATIONS FOR PREDICTIONS IN FINAL MODELS
MNEMONIC Full Variable Name
ITEMDUCFLG Total itemized deductions flag
PARTNRLOSFLG Partnership net loss flag
ALTMINTAXFLG Alternative miriimun tax flag
TAXPREFFLG Total tax preference flag
DIVGT400 Flag for dividend income greater than $400
EINCOMEFLG Schedule net income flag
INTGT400 Flag for interest income greater than $400
ITEMDLJC Total itemized deductions amount
DIVINCOME Dividend income amount
ELOSS Schedule net loss amount
BUSEXPFLG Employee business expenses flag
NATGTPAT Flag for negative amounts total exceeding positive
CLOSS Schedule net loss amount
CAPGAIN Net capital gain amount reported on schedule
OVERSEAS Return from overseas
PARTNRLOS Partnership net loss
ALTMINTAX Alternative minimum tax amount
ELOSSFLG Schedule net loss flag
INVSTCRDFLG Investment credit flag
ENERGYFLG Residential energy credit flagTAXPREF Total tax preferences amount
CLOSSFLG Schedule net loss flagJOBCREDFLG Jobs credit flagGASTAXFLG Credit for tax on gasoline flag
PARTNRGAINFLG Partnership net gain flag
THEFTLOS Casualty and theft loss amount
PAT75K Flag indicating PAT $75000PAT75OK Flag indicating PAT $750000PAT3500K Flag indicating PAT $3500000PAT7500K Flag indicating PAT $7500000PAT7500KCODE58 Product of indicator PAT7500K and sample code 58 indicator
PAT3500KCODE58 Product of indicator PAT3500K and sample code 58 indicator
PARTNRTAXPRF Product of partnership net loss flag and tax preferences flagDIVALTMIN Product of flag for dividend income greater than $400 and
alternative minimum tax flag
127
![Page 20: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/20.jpg)
TABLE
LOWER BOUNDS OF PROPENSITY CLASSES TWO THROUGH SIX BY SMIPLE CODE
Propensity Class
Sample
Code
28 .1600 .2300 .2900 .4000 .5600
38 .1800 .2300 .2900 .4000 .6600
40 .0075 .0100
41 .0050 .0075 .0125
42 .0125 .0150 .0175 .0225
43 .0225 .0275 .0350 .0600
44 .0275 .0400 .0500 .0700 .1500
45 .0500 .0700 .0900 .1200 .2200
46 .0600 .0700 .1000 .1600 .3200
47 .1000 .1200 .1700 .2800 .4600
48 .1100 .1400 .2300 .4000 .6000
50 .0050 .0075 .0275
51 .0075 .0100 .0125 .0400
52 .0125 .0150 .0175 .0225 .0400
53 .0275 .0400 .0500 .0800 .1200
54 .0400 .0500 .0800 .1100 .1600
55 .0600 .0900 .1300 .1800 .2400
56 .0700 .0900 .1600 .2100 .3000
57 .1000 .1300 .2300 .3300 .5200
58 .0350 .0600 .1100 .2400 .6000
60 .0225 .0275 .0500
61 .0150 .0175 .0225 .0275 .0500
62 .0275 .0350 .0400 .0600 .0900
63 .0400 .0500 .0700 .1000 .1900
64 .0800 .0900 .1200 .1600 .2500
65 .1100 .1400 .1600 .2200 .3300
66 .1300 .1600 .2000 .2700 .4600
67 .1700 .1900 .2400 .3300 .4200
68 .1600 .1900 .2600 .3500 .5600
SOURCE Mathematica Policy Research Inc
NOTE The lower bound of the first propensity class is zero
This propensity class is combined with the preceding classes
128
![Page 21: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/21.jpg)
N- 00 .0 Cfl N- 03 N- .0003 .0 N-
003 -4 Cl .000 03 in .0 Lr .o -z cfl N-
.0 C4 Cfl 00 In Q\ Ui 000 C1 ui N- .0
fl .-4 .- C4 Cfl .-4 -4 -4 -4
.c.Ic
CflO\ QC.J03Q-N- 03c4.4N-03 .-4 ..-4 C.1 N- C4 03 -l .0 .-41 Cfl 03 Ui
Ui .0 c- ci .0 .-4 .4 Cfl .0 -1 0J
/D _4 .4
cl_i
.4-i
4C1C4C4C 4C4CN- C-I In Cfl -4 Cfl Cfl In N- N- In -4
Cfl Cfl -I C4 .C C-I .0 40 0300 Ui
-t CflL1 00000CICfl 00004CNCJC-I4-4 -4_ _I
031_i
4C.3C1C4C 4C4-cI 0-.t-400-.1.0 004.00030.0CO Ui .-4.-4 C4 .0 N- C4 .0 40 cfl N- .0 -1
.-ci 00000CI 00004--4C10_I -4 -4 -4 -4 -4 -4 -4 -4 ._ -4 -4 _4 _4 .-4 -4
0. cj IC
i-I N- in 00 c.4 N- C.4 CC -0 CCC N- 0303 Lf 03.-IIn cI In .0 -i- in .i0 .0 03 c-I -1
C4 0C.I 000000---4 00000-00\00
4..JI4 -I .4 .-4 .-4 .-4 -4 -I -4 .-4 -I -4
P-i 03 In 03 00003 .0 N- In .0 C4 -.1 C-I cfl N-
O\.0 00 .0IC-.103 00InN-C00N-O\-l 000000-4Q 000000400
031_i.-4 .-4 .-4 .-4 .4 .-4 .-4 .-I -4 -4 -4 l0
ciz.I-I -4 -t 03 N- In CC .-4 C-I 0300 C-4 In N- N-
N- C-I CC 03 c-I -1- cn N- -1- -.4 C-I .0
InN- 0004CIIn0C-J--4 000-4--InIn411.4 -l C4 -4 .-4 -4 .4 .4 c-I C-I .-4 .-4 .-4 .-I ..-I c4
i-4ICIC
i-i -1 N- .-4 030 In CC Ci CO .0 N- In In 003 0C-I Cfl In
In .0 C-I .0 N- N- In 00 N-
111x1 In C-ICC 000 C-IIn 0000CflC-JCflN-Cl_i T4 .-4 -4 -4 -4 .4 -4 -4 -4 C-I .-4 .-4 .-1 -4 .-4 .-I .-.4
03
HZ ICIC-C 1C4Ci-0 N- CC0-.1-4030.1--1.0 0C4.-4InCC00.0--IP.1 C-I N- C-I .0 In In .-4 C-I Cfl C-I
.0 -4 C- ce_i C- C-I C-I
Hrxi-I -.4 .-4 .-4 -4 .-4 -4 .-4 .-4 .-4 .-4 -4 -4
4CICICIC ICIC030 -.1 in N- .0 C.4 In In 003 ..-1-
4J -I N- 03 .403 In .0 CC In In N- In
ci c-I 000004C-IIn 00000C-JC-Im4-4 -4-4-4 -4-4-4
IC IC
-4.- coc-zrc-.4--lc-.I0In.0 0N-CCO.--100HO C-IO\ 0-In-N-C.0CN 0InC.000C-I Cfl C-I 0000000 In 0000 -I C-I C-I
-44111 N- N- In 00 -4 C- 00 cfl .0 .000 -4
-In -In0CC -IOC-IN--4.ON--.InC-I 000000000 0000-40.-I -4-4
.-4 .-4 .-4 .-4 .-4 .-I .-4 .-4 -4 .-4 .-4 .-4 .-i .-I
P-I
I-i
Cfl0 0C1C0In.0.0In InIn00400004C-100 000 .-4iC-ICO0N-N-N-C .-4.-4Cfl0In-I.04In
-4-4-4-4-4
-1000cC 04C-ICfl-.IIn0N-CO 0.-4CICfl-.In0N-00C-ICfl 1.ll InInInInInInInInin
I_i
129
![Page 22: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/22.jpg)
0-t0cflCOcflLr00
.O -i -J -t CO
_____
N- .-i c-.i cO ri00orrr--Lf Læ CO cj
4CON-NCflrCO0O\ Ci
-.j N- cfl Lfl
N-
-4 .-4 -4 -4 -4 .-4
_____ $-i
-4 Cfl Lf
N- C\
-4
-4 _4 -4 -4 -4 -4 -4 -4 -4
CO N- N- Lr Lr CO Lr
Cr Lfl CO Lfl Lfl Cr Cr
00-44J .-1 -4 -4 .-4 -l -4 -4 -4
-f-I
Cl
r4
Cr .0 N-
Cr -1 N- Cr If Cr000044-JCrCrCl CU
p_I .I
Ci____ CO
41
-4
-4 Cr Cr .0 Cr
N- CO Cr
.0 Cr -U CO CO Lfl Cr
4J
-4 .I -I Cr
__r4
4J 4C biD
CO Lr N- CO
If Lf N- Cr Cr
Lr 00 -4 Cr Lf CO If ID
4J -4 Ci CU
biD II
-i-I 4C
CO N- If .-
.0 CO N- N- N- N- CO 4-i
Cr -z1 N- Ci
_I
-4 -4 -4 CU .IJ
ID
Cl
iD
CO .0 N-
4.j Lfl Cr N- CO Cr r4
Cr 000-4NCrLrLrCi
.4 .-J .-4 rI
r-I
p_I
Cr -zt Cr Lf CU
v-i -4 -4 -4 -I -4 v-I -4 -4 .J
___ CU
N- CO CO CO CU
Cr N- CO Lr
-I 0000 -4 Ci
-4 -4 -4 _4 -4 -4 -4 -4 -4
CU
r1
.i 4-I vI Lr CO .0 .0 N-
If N- 4f4 -1 Cr N- CU CU000v4NCr-.0N-r4.r14.J IDQJH ii
1-4 1-1
p_I
1-4
HGCr .0 N- CO
.0 .0 .0 -0 .0 .0 .0 .0
E-4
El ci
130
![Page 23: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/23.jpg)
en en 0% .40 0% 0% 0% 0%enCe 0%
cSd c.d dd c.- Of-0.0.-
%_
IS
40 40 0% 10 0%.- c.o fO 400 enlO 4.N 00N4 .- en
1500%
04
P10%15 f-14
.0 01 CJ en 0% 01.2 en enf-I
0% -4a_
0%%001 U.0 _____0%
a- 0% 0% .4 10 en 0% 0%.- 4.- en 10 f-a 0%tn -n 10 IV 44 40 0% IV
14- 0% 40 ..4 .1 0%15.0 0% -4 .-4 en 0%4-I
ta
4-UCISL4.U
________0%
lu IS 4.44 10 r- 01 .-4 10.0 4- a- en 0% 10 .4 en IV 10IS IS 01 Cl 0% en en 10 10 01 IV ICen en 0e -0 en 10
-a 4.- 15 en 10 NJ en 0% ..4 N-CE 0.2 04 en N.01W
000%
0.0-J CI Cl en 40 10 15 en 0% 0%
a- a- 0%01 1. 15 en 0% 10 .-I 15 0% NJ
en NJ .10 --4 -IV-4 0% 040 00 -440 en PJ en IV en
0% III..4 0% NJ .44 40 en 40
LI .2 0% %0en en 40
150004.4.J 0%0ow lanE on 1.4 L4J IS 4.4.2 4.4 141 _.0 30.04.039-COEO 0040 VC 400 1500 44% E0...IEOC.EO4J OE WOn .0 .00 0.03 30 30III 04 04 I. 4- .IWOE ISOE lOnE WOE CI SI
4.2 UI
0%
131
![Page 24: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/24.jpg)
4- 4-
.0
44 at to at to Utto fl at 00 Ui
.-to dsi .- dd.5at.- 4-S
C______.2
4-
.0
CC at to ID 00 to to tofl It .4 I-.n-a
.4 4-S
-c toat.-
Ca
..-
0.400 at SC 00 0%
.4 at at to to
.-4 sc.iID
_0
4-S .0U0U14
I.- aI.S
.O4-5 a.J 44
.5.5 to fl .5- 5.
.4 .- ID .-l-.-
S. 14 5. at 10 10 as
5. 5.
ti .c at to .0 5.
ats sn CCS to.-4 aaa asac 44 aj
4_S 5. 4105
_____0.4.Eaa soaa3
...4 to ID 1054-S 45
to .5Us to .4 to at at to I.. Ut 4.5
5- ID to to to 00 -5
.5 -4 ID 10 at4.5 to 5.
45 -C5. aea.05.00..ua05.15 44
to -.-a
.-
aaa5.1 US
4J 4-S CC.4 to 10 to at to .5
4-S to to at ID 45 .4
to to CID 4--.0 SD 00 505 to to ID US 5.
10 5. 4.5
.-LU a-.- 5cUS________ a--ao-o .5
.4 .5 aoUs oa
.5 a-a105 -S at .2
4.54.5 4.5 to _S 5.
5. at at at at to .- Cl 44
to .5 50 ID 05 4.5 ..-
a.-.- .a at to 05 to 45
45 to .5 5.55515 to to to 10 to at to
5.4
toa.- 4.5
4.5 115aa 4.55.30 a.U
54 4.5 0.5In In 51 US
Ill
4. OIC In aaO51 -.-
5. 5. III. .551 US 4554 US 504.5
54-SU05ID
as
5. 5. Ia ..s .5 In 44 14 5- 55 45
515 Us aaa0 5. 5- 5- 5. 4.5 ._ 0.44_J OS ..-
4.5 5- Is 5- 5- ..- 4.5a0I. .o .0 Ia -0 so- .-S .l 5- 4.5 5- 4.5 to USa5.aEO...SEO 00 .cOassaCoas a....aa .4-aso aS aO aO S..00.05S..05 a5..4 SU.544
at 4.54.5 4.5 _s as 4.5
.5 0.0 5- 4.5.4 .4US 4.5 44
4-Ca di.-aa.0 .5 .505- 5- aC5- 5.5 554 Si .0 5-
132
![Page 25: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/25.jpg)
TABLE 10
ABSOLUTE AND PERCENTAGE DEVIATIONS FROM COMPLETE REPORT FOR THREE ALTERNATIVE ADVANCE ESTIMATES
OF SELECTED INCOME AND TAX ITEMS NOT USED AS PREDICTORS OF LATE POSTING TAX YEAR 1982
Money amounts are in millions of dollars
Absolute Deviation from Complete Report Percentage Deviation
Complete Simulated Propensity Propensity Simulated Propensity Propensity
Income or Tax Item Report Advance Data Score Score Advance Data Score Score
Estimate Estimate Method Method Estimate Method Method
Adjusted Gross Income AGLess Deficit 1852.072.4 9748.7 4594.0 2903.0 0.53 0.25 0.16
Salaries and Wages
Number of returns 83105532 170836 117162 148571 0.21 0.14 0.18
Amount 1564948.3 4295.7 3559.5 3561.4 0.27 0.23 0.23
Unemployment Compensation
Total
Number of returns 10105079 57860 51625 56460 0.57 0.51 0.56
Amount 19818.4 108.9 98.3 110.6 0.55 0.50 0.56
Included in AGI
Number of returns 5347.634 32.900 30784 30699 0.62 0.58 0.57
Amount 7089.1 12.7 12.0 11.5 0.18 0.17 0.16
Farm Income Schedule FaNet Profit
Number of returns 932986 4191 3287 1200 0.45 J.35 0.13
Amount 7994.2 10.9 -2.3 -16.1 0.14 -0.03 -0.20
Net Loss
Number of returns 1755632 -5113 -2580 472 -0.29 O.15 0.03
Amount -17822.8 334.1 81.3 -42.1 -1.87 -0.46 0.24
Net Profit Less Loss
Number of returns 2688618 -922 708 1672 -0.03 0.03 0.06
Amount -9828.6 344.9 79.1 -58.1 -3.51 -0.80 0.59
Pensions and Annuities in AGI
Number of returns 8824875 38287 54554 29021 0.43 0.62 0.33
Amount 60122.9 182.4 207.8 85.2 0.30 0.35 0.14
Alimony Received
Number of returns 316617 1251 1724 1385 0.40 0.54 0.44
Amount 1946.1 -122.2 -118.1 -120.2 -6.28 -6.07 -6.18
Estate or Trust Incomeb
Net Income
Number of returns 797391 -25870 -23286 -23160 -3.24 -2.92 -2.90
Amount 6088.8 -319.0 -340.9 -351.8 -5.24 -5.60 -5.78
Net Loss
Number of returns 61810 -5397 -4764 -4548 -8.73 7.71 7.36Amount -342.7 37.5 22.7 18.0 -10.95 -6.61 -5.26
Net Income Less Loss
Nunber of returns 859201 31267 -28050 -27708 -3.64 -3.26 -3.22
Amount 5746.2 -281.5 -318.3 -333.8 -4.90 -5.54 -5.81
Small Business Corporationb
Net Profit
Number of returns 416549 -13482 -12115 -12191 -3.24 -2.91 -2.93
Amount 5580.3 -186.4 -324.8 -296.3 -3.34 -5.82 5.31
Net Loss
Number of returns 421219 -31291 -25037 -23375 -7.43 5.94 5.55
Amount -6426.4 1227.2 765.7 574.8 -19.10 -11.91 -8.94
Net Profit Less Loss
Number of returns 837768 -44773 -37152 -35567 5.34 -4.43 4.25
Amount 846.1 1040.8 440.9 278.5 -123.01 -52.11 32.92
Other Income
Net Income
Number of returns 3703370 -28463 24862 -25754 -0.77 -0.67 -0.70
Amount 7641.0 -424.5 -380.2 -373.0 -5.56 -4.98 -4.88
Net Loss
Number of returns 553778 -38502 -27939 -24953 -6.95 -5.05 -4.51
Amount -17942.3 2403.8 1303.9 941.2 -13.40 -7.27 -5.25
Net Income Less Loss
Number of returns 4257148 -66965 -52801 -50707 -1.57 -1.24 -1.19
Amount 10301.3 1979.3 920.7 568.2 -19.21 -8.94 -5.52
133
![Page 26: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/26.jpg)
Table 10 continued
Absolute Deviation from Corn lete Report Percentage Deviation
Complete Simulated Propensity Propensity Simulated Propensity Propensity
Income or Tax Item Report Advance Data Score Score Advance Data Score Score
Estimate Estimate Method Method Estimate Method Method
Payments to an IRA
Number of returns 12101016 101.150 84888 77179 0.84 0.71 0.64
Amount 28273.8 238.9 182.7 165.7 0.84 0.65 0.59
Deduction for Two-Earner
Coupl eC
Number of returns 21.689907 136655 126726 124770 0.63 0.58 0.58
Amount 9047.7 61.3 54.9 55.2 0.68 0.61 0.61
Taxable Income
Number of returns 89716570 67444 39341 17128 0.08 0.04 0.02
Amount 1473318.0 6846.5 2976.0 2313.8 0.46 0.20 0.16
Income Tax
lax Before Credits
Number of returns 79348582 50762 27285 1417 0.06 0.03 0.00
Amoun 283926.4 2549.3 946.7 789.1 0.90 0.33 0.28
Tax Credits
Number of returns 18882528 -18745 -20400 -26094 -0 10 -0 11 -0 14
Amount 7854.2 267.5 -208.3 215.6 -3.41 -2.65 -2 75
Tax After Credits
Number of returns 76959571 72677 46981 20573 0.09 0.06 0.03
Amount 276072.2 2816.9 1155.0 1004.7 1.02 0.42 0.36
Total Income Tax
Number of returns 77033971 63653 42539 17645 0.08 0.06 0.02
Amount 277588.5 2612.2 1138.7 1056.2 0.94 0.41 0.38
Total Tax Liabilities
Number of returns 78680509 3076 -5507 -28404 0.00 -0.01 -0.04
Amount 284699.0 2353.7 897.7 811.7 0.83 0.32 0.29
Selected Itemized Deductions
Medical and Dental Expenses
Number of returns 21980291 48403 46155 58787 0.22 0.21 0.27
Amount 21704.6 -275.5 -217.5 205.6 -1.27 -1.00 -0.95
Taxes Paid
Number of returns 33079548 28700 26653 44074 0.09 0.08 0.13
Amount 88032 498.1 452.3 452.5 57 0.51 0.51
Interest Paid
Number of returns 30242954 29373 35097 50923 0.10 0.12 0.17
Amount 121822.6 1795.3 -954.2 -845.7 -1.47 -0.78 0.69
Contributions
Number of returns 30509886 66578 63254 78195 0.22 0.21 0.26
Amount 33467.9 282.7 338.2 325.6 0.84 1.01 0.97
Exemptions
Total Number 232188753 72683 133721 114854 0.03 0.06 0.05
Age or Blindness 14211172 12845 19949 -28421 -0.09 0.14 -0.20
Other 217977581 85528 113772 143275 0.04 0.05 0.07
SOURCE Prepared by Mathematica Policy Research from 1982 SQl Complete Report microdata file
aFlags and amounts were tested but rejected as predictors of propensity to be posted latebThis income source is included on Supplemental Income Schedule which encompasses rent royalties partnerships small business corporations
estate and trust income and selected other sources Schedule income flag and loss amount appear in the propensity equations for all strataThe net income amount was tested but rejected
Clhis variable was not available prior to 1982
134
![Page 27: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/27.jpg)
NI
0r are-4
cis ON 0lt5O0fl 0-fl7N 0ONINI0NI.74 NI
-4
Ow00ow ON oN.N...oS-4aNIr- N.4ofl0mclN NIariNIaO
cn NI .7 NI Os .7 NI C- CS -.1
NIl ii III IC005.00.0
5.
o-.7 ooa-. os..o ON .0cflNIOdo ordr dcc
NI NI NI .7
50 II 1111111J
5-
sO 05 i- as so .7 NI 5405.0-4 0.7 05 50 CS 05 C- C50 fl .4
o.o. r-.flflcn NINIsNI -4so0so.tNI45 NI07
02
NI tSNI7.7.7ll5 05S0N.S01NI 7-4N.N.NI7S0-4 02 000000 45 00 NI N.sO 00000 .-
II lUll 1-414 III02
-IC- 04NI00s05 flOfl0N.OcnNIr. 05NINIflasO0000-7 ooooddd ddC-I--.dSNI NI
00WI
.7 NIC N.0.7CIOSSONI -3C-7OC-C .7s00a0rN.0.-- dooo-.d. dd0ddd ..d-d5NI 11111 III NI 11111111II
-4 -4
50 50 CSa 0N.OCfl-4N.0fl .N.7404 ea4S0N.Cnr..r-..7
oi NI Csl Os NI NI C0 07 N.54.4 4J 4J 1.4 50 -7 C- NI 0.7 It .7 ..7 505 -0.70
50 ..7 7C0C0a0 00-7050NI4. a0NI0.40.0NI.4 NI 50- NI 50-0 5C
4.4 NI NI7 NI
02
0027 NI
1-
NI.7 .7 CS .-4 CC N.50 45 NI NI SC INI 50.7 40
05.1.4 02 -.0 .7 -7 NI 0000 00oO ii III III III II
507.4
-7.C .4 .7.7 0.7 NI sO NI NI NI 0.0 N-SO
024.0-Crno..7NICs.7.7Cn d-dddd-50 04
00CU020 sO
04.4 flO0U00-.0O flN.C4..t-0.0.0 a5N.Oa0s0OsO
S-.0NI-7UNIO dCNC04 .4 I- II NI
02
50
02 CS a0 0USUN.N.N N.0N.aN.0500U -0.Ca0N.moNI02 oi-..1
50 ONI 45.7 05 NI NI NI NI 077 SOCS NIO NI
..0 0N.NIN..flNICNI 0CflU7O0-.N OSNI-.ttSN.a.N.l0.0 N.054SN.-.N.O0U 4NICNISNI-450021.4 50a-7O NINIs5
02
-0 NI SON 050 NI fl NI 505005050 .7 45.7 70
NI NI sO 05 NI NI NI -7.00 -04 NI NIm-CC-ors
50
4.1
50
I.
50
50 50
50 50
--4
50..4 -4
5050 50 50
.050 .-s NI -7 0N NI S7 UsON .40 NI C-7 0N
NICS C777777777 7505050sO0s.OsOsOC
I-
135
![Page 28: ADDonald Rubin Datametrics Research Inc and Harvard University INTRODUCTION sample closeout The distribution of returns by posting date is not random hence the nonre Individual and](https://reader036.vdocuments.net/reader036/viewer/2022071020/5fd4d07b3401fb5c3d145e8a/html5/thumbnails/28.jpg)
NIN- N- 11 I1 .0 N- .0 .0 N- 0.0 N- OS OS
.4..- dcd d-N N..OsO.0N-0N-OSV
Ni NI N- .- 11 NI NI NIIII.1-IOW
ISNI 0ON-VNIOVO ONS0NI.-.NIcSlS OS00S0N-NIUbO -..-WV
N- NS NI NI NSIII III
NI SNS NIV0CINI N-VCSQ0fl-I 0SVCNI NI 1S 1S NI NI N- .0 .0 fl NI NI
NI N- CS NI NI fl
I-lWIa NI NI CS N- N- IS .0 CS ON- CS N-VUV08 NI VI.VN-V.C 0NICI1000.CVN-
VNI0N-OS.SCS CSCCNISONINIC 00000CS0SVC1C NI CS NI NI Ia
NI IS
NI NI NI NI N- OS N- CS NI N- OS CC OS NI IS CS---4 fl .0 0000000 N- .0 00
I-I liii -4
-4lb
-C
NI NI 0NIlIS0C1O0 -CNI-s00000 fl0OSflNIC-CNIVflWV 00000N.-CS0 000.0cSNININI .1CCI-SObO
II III II
oW01VII N-
NI ON- NI-IONIC1C.001I0 mN.mV0m 00C-VN-VCS .0
V-b 00000-IC-COCfl 000.IaNI00.IaOS Nl-NI-NINI00 -4
.4 I1 II IIIi-I _I
_________-4 lb
lb
CS N- C1 OS NI 05 05 CI.O OS Cfl NI OS C1 I-I
-vN-fl V.0CIC1CN-N-0CC 00OC-ICCflN- N-0.0OSC-COS
UI-lU .1I VNIN-N-00V-IN- I1S0N-NICSIa00 IVOS0CV-CCSN-1.1
NI OS N- 05 05 CS NI CIS NI Ia OS
OP -- .4
0..4 NI os N- NI -1 NI NI N- SI-V .Ia NI OS
US mN-.I
______OINI
CS OS .Ia Ia UI N- .3 .3 .0 NI .Ia .0 .Ia .Ia NI IS .0-4UI -COS 0000-INICVV 00005.IaNION.N- 0000-101
ViOl 11114 IICSINI.4 III
C-i
UC .4NI 05 .0 05 NI NI -V .0 CS NI NI N-0 NI Cl
.lbOIZC/ NI Ia NI NI NI CS CS .0 NI
Ia NI
1401
0NI0OS0NI N-N-VN-ONI.0
Cl 0- 000 NI N- N- 00 NI NI 000 NI N- NI .0NI Ia NI CS NI NI
cD 1.1IC N-VICCCCNI0C 0.CN.00.0 0iI1CNICN..-0l.0UUUVUV NI SN- ONINI-VN-00CII 0N-Vc0N-CV NINIOSIlVN-V
-i N- OS.0 00 as OC -I US -C NI
3.0.-I IflNIVVN-/N10Ifl .-COSVC.0C1NI N.I0NiIfl0-CVN- lb
VI-Cl NI VC5005VCCIa N.0N-NI 0SN.NICNIC
UC N- 5$ -i
VN-NI -1 -4.0
I-I
_________ -4
OS Ni ON- CS CS 15$It N- NI.0 .0
-i 0I CVCNCVbIN- .oV.oCCm3.14 NIC NIN-0OSNCOSNINI VVCVmNI 0N-.-NI.--IC.--O .4 .4
8.-I------ III
lb Cl 05 N- Ia NI NI 0OC NI
.F
______ .0
1.1 .3
.-4
5-
_i .4
--.3.30-I NI -V Ii N- 0.- Ni CS ON-0-.NICVfl0N-
NIC 0VVVVVVVV-V CC VOOOOOOOOOI-I
136