Improving the Quality of Tax Statistics: Recent Innovations in Editing and
Imputation Techniques at the Statistics of Income Division of the U.S. Internal Revenue Service
Scott Hollenbeck – [email protected]
Johnson – [email protected]
Melissa Ludlum – [email protected]
Today’s Presentation
Overview of Statistics of Income (SOI)
Dealing with Missing Data
Recent Innovations
Future Plans
What Does SOI Do?
Primary source of U.S. tax data
Data from 110 tax returns and information documents
Test and correct data collected during administrative processing (IRS Masterfile)
Collect extensive additional data from forms, schedules and attachments
Most projects collect data from samples
Products
  Micro data files for U.S. Treasury Department & Congress
  Public-use files
  Tables and analysis (www.irs.gov/taxstats)
SOI Data Collection Systems
Maintains computer network separate from main IRS processing
Data collection takes place in IRS Submissions Processing Centers
Graphical User Interface (GUI) systems based in ORACLE
Data tested for internal consistency
Post-edit processing overseen by headquarters staff
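The internal consistency testing mentioned above can be pictured with a minimal sketch. The field names, the tolerance, and the `check_total` helper are hypothetical and chosen only for illustration; they are not SOI's actual tests. The idea shown is a single check that reported components add up to a reported total.

```python
from decimal import Decimal

def check_total(record, component_fields, total_field, tolerance=Decimal("1")):
    """Return None if the components add up to the reported total, else a message.

    Hypothetical helper: field names and tolerance are illustrative assumptions.
    """
    computed = sum((record[f] for f in component_fields), Decimal("0"))
    reported = record[total_field]
    if abs(computed - reported) > tolerance:
        return f"{total_field}: reported {reported}, components sum to {computed}"
    return None

# A made-up record whose components do not match the reported total.
record = {
    "wages": Decimal("52000"),
    "interest": Decimal("300"),
    "total_income": Decimal("53300"),
}
issue = check_total(record, ["wages", "interest"], "total_income")
print(issue)
```

A record that fails such a check would be one that an editor or a post-edit program needs to look at; a record that passes returns None.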
Three Major SOI Programs
Individual Income Tax
  Filed by individuals and married couples to report most forms of personal income
  133 million returns filed in 2006
Corporation Income Tax
  Filed by incorporated businesses to report income from parent corporation and subsidiaries
  2.5 million returns filed in 2006
Tax-exempt Organizations
  Annual information returns report assets, income, expenses
  833,000 returns filed in 2006
Missing Data – Unit Nonresponse
Causes
  Extensions/late-filed returns
  Tax evasion
Strategies
  Update values from prior year using survey responses
  Utilize records for recent prior years filed during the selection period
Missing Data – Item Nonresponse
Causes
  Taxpayer neglects to provide attachments
  Paper return is being used by another IRS function
Strategies
  Use IRS Masterfile data for key values
  Impute values based on existing data and information provided on prior and/or subsequent return
  Surveys and direct contact with preparers
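The item nonresponse strategies above amount to a fallback chain, which the following minimal sketch illustrates. The order of the fallbacks, the `impute_item` helper, and the field names are assumptions made for illustration; the slides do not say how SOI prioritizes its sources.

```python
def impute_item(current, masterfile, prior_year, field):
    """Return (value, source) for one field, preferring data reported on the return.

    Hypothetical helper: the fallback order is an illustrative assumption.
    """
    sources = (
        ("reported", current),
        ("masterfile", masterfile),
        ("prior_year", prior_year),
    )
    for source_name, source in sources:
        value = source.get(field)
        if value is not None:
            return value, source_name
    # Nothing usable: the item would need a survey or preparer contact.
    return None, "unresolved"

current = {"total_assets": None, "gross_receipts": 830000}
masterfile = {"total_assets": 1250000}
prior_year = {"total_assets": 1100000, "depreciation": 42000}

imputed = {
    field: impute_item(current, masterfile, prior_year, field)
    for field in ("total_assets", "gross_receipts", "depreciation", "officer_compensation")
}
```

Keeping the source label next to each value lets later review separate reported values from imputed ones.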
What’s New?
Digital images of tax returns
Electronic filing
Automated error correction/imputation routines
Digital Return Images
In 1998, SOI began scanning operations
Images stored in Tagged Image File Format (TIFF)
In 2006, imaged more than 71.5 million pages from 30 different tax and information returns
Many users:
  SOI headquarters staff
  SOI edit operations
  IRS Functions
  General Public (tax-exempt organizations only)
Split-Screen Edit Systems
Combines scanned image and GUI edit system on a single 24-inch wide-aspect monitor
Image displayed using Adobe Acrobat or specially adapted ORACLE programs
Image and edit systems are synchronized
Online access to instructions, dictionaries, other tools
Split-Screen Edit Systems
Positive feedback from editors
Slight overall improvement in productivity and quality
Images available to geographically dispersed work force
Reduced storage of paper documents
Reduced impact on other IRS functions
Electronic Filing of Tax Returns
2004 – Modernized electronic filing (MeF) began
Uses Extensible Markup Language (XML) to capture:
  Numeric and character strings supplied by taxpayer
  Information tags
2005 – Mandatory e-file for large business and tax-exempt organizations
  20.5% SOI sample of corporate income taxes
  13.5% SOI sample of tax-exempt organizations
SOI Use of MeF Data
In 2006, SOI developed programs to render digital images from XML data
Edit returns using split-screen applications
In 2007, will populate ORACLE data tables directly with XML data
  Editors will validate data, supply codes and allocate certain data items
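Populating data tables directly from tagged XML can be illustrated with a minimal sketch. The element names below are invented and do not follow the real MeF schema, and the output is a plain dictionary standing in for a database row rather than an ORACLE table.

```python
import xml.etree.ElementTree as ET

# Invented sample document; tag names are NOT taken from the MeF schema.
SAMPLE_XML = """<Return>
  <TotalAssets>1250000</TotalAssets>
  <GrossReceipts>830000</GrossReceipts>
  <BusinessName>Example Corp</BusinessName>
</Return>"""

# Hypothetical list of tags that should be stored as numbers.
NUMERIC_FIELDS = {"TotalAssets", "GrossReceipts"}

def xml_to_row(xml_text):
    """Flatten the top-level elements of a return into a column -> value mapping."""
    root = ET.fromstring(xml_text)
    row = {}
    for element in root:
        text = (element.text or "").strip()
        if element.tag in NUMERIC_FIELDS:
            row[element.tag] = int(text) if text else None
        else:
            row[element.tag] = text
    return row

row = xml_to_row(SAMPLE_XML)
print(row)
```

Because each value arrives with its tag, no transcription step is needed; what remains for editors is validation, coding and allocation, as the slide notes.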
Electronic Filing of Tax Returns
Individual income tax returns
  1986 – E-file through paid preparers
  1992 – E-file from home computers allowed
  1994 – 98% of all filers eligible to e-file
  2006 – 73 million returns, or 54%, e-filed
Data stored in Tax Return Database (TRDB)
  ASCII data, not tagged XML
  2010 – Scheduled for conversion to MeF
SOI Individual Income Tax Program
Sample of returns processed differently depending on certain criteria
Edited returns
“Missing returns”
Forced closed returns
Individual Processing Programs
Online editing system – editors transcribe, code and review any potential data discrepancies
Post Edit Reconciliation Process (PERP) – automated computer program which validates and adjusts data
Edited Returns
Edited returns are processed through the online editing system by an editor, then reviewed using the PERP program
Prior to Tax Year 2004, all sampled returns which were not “missing” were manually edited
Currently only paper returns and electronically filed returns with specific characteristics are edited through online system
“Missing Returns”
Each year, approximately 250 paper returns selected for the sample are not located
Limited IRS Masterfile data available
PERP program used to impute missing details of forms and schedules
Forced Closed Returns
Automated processing of certain E-filed returns in the SOI sample
These returns bypass the online editing system and are processed through the PERP program
Returns with possible discrepancies are reviewed by National Office analyst
Returns that pass all tests are considered “forced closed” and added to final data file
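The forced-close workflow described above can be sketched as a small routing function. The eligibility rule, the two tests, and all field names are hypothetical; the slides do not list the actual PERP tests or selection criteria.

```python
def route_return(ret, is_eligible, tests):
    """Return (status, failed_test_names) for one sampled return."""
    # Paper returns and ineligible e-filed returns go to the online editing system.
    if not (ret["e_filed"] and is_eligible(ret)):
        return "online_edit", []
    # Eligible e-filed returns skip manual editing and go straight to automated tests.
    failures = [name for name, test in tests.items() if not test(ret)]
    if failures:
        # Possible discrepancies are sent to an analyst for review.
        return "analyst_review", failures
    return "forced_closed", []

# Hypothetical eligibility rule and tests, for illustration only.
def is_eligible(ret):
    return not ret.get("has_complex_schedules", False)

tests = {
    "income_adds_up": lambda r: r["wages"] + r["interest"] == r["total_income"],
    "tax_not_negative": lambda r: r["tax"] >= 0,
}

clean = {"e_filed": True, "wages": 40000, "interest": 100, "total_income": 40100, "tax": 4200}
flagged = dict(clean, total_income=41000)
paper = dict(clean, e_filed=False)

results = [route_return(r, is_eligible, tests) for r in (clean, flagged, paper)]
```

Returning the names of the failed tests along with the status gives the reviewing analyst a starting point for each flagged return.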
Results from Forced Closing Returns
Tax Year 2004 – First year using automated closing of selected electronically filed returns
Total sample size – 200,295 returns
Electronically filed – 64,670 returns
“Forced Closed” – 18,193 returns
Editing hours saved – 1,400 hours
Results from Forced Closing Returns
Tax Year 2005 – Second year of program, expanded criteria for returns eligible to be “forced closed”
Total sample size – 292,837 returns
Electronically filed – 114,897 returns
“Forced Closed” – 47,753 returns
Editing hours saved – 4,100 hours
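To compare the two years, the counts reported on the Tax Year 2004 and Tax Year 2005 slides can be turned into shares. The sketch below uses only those reported figures; the derived ratios and their names are illustrative calculations, not statistics published by SOI.

```python
# Figures as reported on the Tax Year 2004 and Tax Year 2005 results slides.
results = {
    2004: {"sample": 200_295, "e_filed": 64_670, "forced_closed": 18_193, "hours_saved": 1_400},
    2005: {"sample": 292_837, "e_filed": 114_897, "forced_closed": 47_753, "hours_saved": 4_100},
}

def summarize(year_data):
    """Derive simple ratios from one year's reported counts."""
    return {
        "e_filed_share_of_sample": year_data["e_filed"] / year_data["sample"],
        "forced_closed_share_of_e_filed": year_data["forced_closed"] / year_data["e_filed"],
        "minutes_saved_per_forced_closed_return": 60 * year_data["hours_saved"] / year_data["forced_closed"],
    }

summary = {year: summarize(data) for year, data in results.items()}
for year, stats in summary.items():
    print(year, {name: round(value, 3) for name, value in stats.items()})
```

On these figures, about 28 percent of the electronically filed sample returns were forced closed in Tax Year 2004 and about 42 percent in Tax Year 2005, which is consistent with the expanded eligibility criteria.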
The Future – Data
More returns and information documents will be filed electronically
Optical Character Recognition or Intelligent Character Recognition will be used to capture data from paper-filed returns
Data will be available in real time
  Enable larger sample sizes and increased use of population files
The Future – Field Operations
Increased resources dedicated to resolving data inconsistencies as opposed to data transcription
Paperless environment – use of electronic data or digital images created from paper returns
Increased use of prior year data to identify and correct data anomalies
The Future – Products
Improvements in technology and increased use of electronic filing will allow SOI to produce more data, more quickly and more efficiently
Increased sample sizes will allow small area estimates
Population files will allow for creation of ad hoc panels, linkage of data items across tax form types and research on infrequent data items