Unleash the power of ABS statistics through methodological innovation
By Dr Siu-Ming Tam, Chief Methodologist
March, 2017
Views expressed in this talk are those of the author and do not necessarily represent those of the Australian Bureau of Statistics
Outline
• ABS transformation
• Methodology transformation
– Big Data challenges
– Data integration and data fusion
– Data access
• Methodological innovation
– Borrowing strength over time
– Measurement error model
Drivers of change in ABS statistics
• Need for faster decisions
• Data deluge
• More evidence-based decision making
• Ageing systems and manual processes
• Growing expectations
• New statistical possibilities and opportunities
ABS Transformation - Who are we transforming for?
Our partners
• Greater responsiveness
• Improved collaboration
• Quicker to market
• Less red tape
Our community
• Improved data matching
• Informed use of statistics
• Evidence based policy and programs
• Less burden on households and businesses
Our organisation
• Ongoing sustainability
• Greater influence and reach
• More dynamic – able to respond to future challenges
Our people
• Greater flexibility
• More satisfying work
• New skills and opportunities
• More diverse and engaged culture
Six dimensions of transformation to achieve ABS goals
Methodology Architecture (MA) - Tam (2014)
• Being part of EA, MA is a transformation plan for methodology
• Vision for ABS MA – To provide a set of methods that underpins the products and process vision of the ABS Transformation program
• MA is supported by 5 key “rules of engagement”
– Innovate
– Industrialise
– Build capability
– Contemporise
– Build support
Methodology Transformation
Transformational change in methodologies
• From classical to contemporary statistical methods
– from Designed data to Found data
– from direct measurements to modelled data
– from single source to multiple sources
– from siloed data sets to integrated data sets
• From limited access to unit record files (URFs) to more liberal access
Data Deluge - Big Data opportunities
Big Data and Big Challenges - (Tam and Clarke, 2015)
• ABS objective
– Harness Big Data sources to create a richer, more dynamic and focused statistical picture of Australia for better informed decision-making
• Challenges
– Business benefit
– Privacy and public trust
– Technological feasibility
– Data acquisition
– Data integrity
– Methodological soundness
• How to make valid statistical inferences – Tam (2015)
Big Data = Big Sources, but
• Not entirely foreign to official statisticians, e.g. administrative records, scanner data
• Behaviour metrics and online opinion – potentially large inherent statistical biases
• Administrative Records
– Tax Records
– Medical Records
– Bank Records
• Commercial Transactions
– Credit Card Transactions
– Scanner Transactions
– Online Purchases
• Sensor Data
– Satellite Imagery
– Ground Sensor Data
– Location Data
• Behaviour Metrics
– Search Engine Queries
– Web Page Views and Navigation
– Media Subscriptions
• Online Opinion
– Social Media Comments
– Twitter Feeds
Data Analytics – What problems are they trying to solve?
Machine Learning methods
Statistical methods
Big Inference – One possible approach - (Tam, 1987, 2015)
Using Big Data
• Use a sample to calibrate the Big Data (treated as “covariates”) using ground truths
• Calibrate using a linear model (with time varying coefficients – Dynamic Model)
• Estimate parameters (using Frequentist/Bayesian approaches)
• Predict the non-sampled values using the covariates
• Or use the Generalised Regression Estimation (GREG) framework for estimation (parameters estimated using design-based methods)
The simple case – no missing data and no missing covariates
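The GREG option in the list above can be sketched in a few lines. This is a minimal illustration, not the ABS implementation: the function name `greg_total`, the toy population, the 1-in-10 systematic sample and the noise-free linear relationship are all assumptions made for the example. The Big Data covariates are taken to be available for every population unit, and the ground truth `y` only for the sample.

```python
import numpy as np

def greg_total(x_pop, x_sample, y_sample, weights):
    """GREG estimate of the population total of y.

    x_pop:    (N, p) covariates (e.g. Big Data features) for every population unit
    x_sample: (n, p) covariates for the sampled units
    y_sample: (n,)   ground-truth values observed in the sample
    weights:  (n,)   design weights (inverse inclusion probabilities)
    """
    xw = x_sample * weights[:, None]
    # Design-weighted least squares: beta = (X' W X)^-1 X' W y
    beta = np.linalg.solve(xw.T @ x_sample, xw.T @ y_sample)
    # Model predictions summed over the whole population ...
    predicted_total = (x_pop @ beta).sum()
    # ... plus the weighted sample residuals, which correct for model misfit
    residual_total = (weights * (y_sample - x_sample @ beta)).sum()
    return predicted_total + residual_total, beta

# Hypothetical toy population: intercept plus one covariate, no noise
N = 1000
x_pop = np.column_stack([np.ones(N), np.linspace(0.0, 10.0, N)])
y_pop = x_pop @ np.array([2.0, 3.0])
idx = np.arange(0, N, 10)          # systematic 1-in-10 sample
w = np.full(idx.size, 10.0)        # design weight = 1 / inclusion probability
total, beta = greg_total(x_pop, x_pop[idx], y_pop[idx], w)
```

Because the toy relationship is exactly linear, the estimate reproduces the true total. With real data the residual term is what protects the estimate when the calibration model fits poorly.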
An ABS Pilot Study
To determine the feasibility, from Earth Observations data, of:
• Distinguishing crop types
• Estimating area of land under each crop
• Predicting crop yield
Barley or Wheat?
Summary of Indicative Results
Task                  Region    Average Proportion Correctly Classified
Crop classification   SE QLD    78.5%
Crop presence         Mallee    83%
Survey Process Augmented by Big Data*
*Big Data process augmented by survey data – Paul Biemer (2016)
[Diagram labels: Frame; Population; Sample; Randomization; Observations; Data Integration and Processing; Modelling & Adjustment; Estimation; Statistical Inference; Validation]
Missing covariate and missing data challenges
Data integration vis-a-vis fusion
• Integration – Fellegi and Sunter (1969)
• Fusion – Kim et al. (2016)
One file – Sample A∪B
Note – ABS uses the EM algorithm to estimate the “m” and “u” probabilities (Samuels, 2012)
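The note on estimating the “m” and “u” probabilities can be illustrated with a small EM routine. This is a sketch under stated assumptions rather than the ABS procedure described in Samuels (2012): it assumes binary agree/disagree comparisons that are conditionally independent given match status, and the function names, starting values and toy parameters are all hypothetical. Here m is the probability that a field agrees for a true match and u the probability that it agrees for a non-match.

```python
import itertools
import numpy as np

def pattern_likelihood(patterns, agree_prob):
    """P(pattern | class) when fields agree independently with agree_prob."""
    return np.prod(agree_prob**patterns * (1 - agree_prob)**(1 - patterns), axis=1)

def fellegi_sunter_em(patterns, counts, p=0.2, m_init=0.8, u_init=0.2, n_iter=500):
    """EM estimates of (p, m, u) from agreement patterns and their pair counts."""
    g = np.asarray(patterns, dtype=float)
    c = np.asarray(counts, dtype=float)
    m = np.full(g.shape[1], m_init)
    u = np.full(g.shape[1], u_init)
    for _ in range(n_iter):
        # E-step: posterior probability that a pair with this pattern is a match
        match = p * pattern_likelihood(g, m)
        non_match = (1 - p) * pattern_likelihood(g, u)
        post = match / (match + non_match)
        # M-step: count-weighted updates of the match rate and agreement rates
        wm, wu = c * post, c * (1 - post)
        p = wm.sum() / c.sum()
        m = wm @ g / wm.sum()
        u = wu @ g / wu.sum()
    return p, m, u

# Toy input: expected pattern counts generated from assumed "true" parameters
true_p = 0.1
true_m = np.array([0.90, 0.85, 0.95, 0.80])
true_u = np.array([0.10, 0.20, 0.05, 0.15])
patterns = np.array(list(itertools.product([0, 1], repeat=4)))
prob = (true_p * pattern_likelihood(patterns, true_m)
        + (1 - true_p) * pattern_likelihood(patterns, true_u))
counts = 100_000 * prob
p_hat, m_hat, u_hat = fellegi_sunter_em(patterns, counts)
```

Working on distinct agreement patterns with counts, rather than on individual record pairs, keeps the loop cheap even when the number of candidate pairs is very large.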
Data Utility versus Disclosure Risk for Unit Record Files
Data Utility
• Ability to use the data to draw valid conclusions
Disclosure Risk
• Spontaneous recognition
• Matching risk
• Higher risk for unit record than aggregated data
Protections
• Perturbation
• Cell suppression
• Collapsing of categories
• Sampling
• Record masking
• Substitution of values
From “4 Safes” to the “Five Safes” Framework - (Ritchie, 2014)
• Safe people – Can the person be trusted to use the data appropriately?
• Safe project – Is the specific use of the data appropriate?
• Safe setting – How does the mode of access limit the risk of disclosure?
• Safe data – How much protection is to be applied to the data?
• Safe output – What controls are applied to ensure the output is non-disclosive?
A multidimensional approach to disclosure risk assessment
Key Equation: Pr(D) = Pr(D|A)Pr(A)
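As a worked illustration of the key equation, the sketch below multiplies the two factors for two access scenarios. The slide does not define D and A. The example assumes D is a disclosure and A is an attempt at disclosure, and every number and scenario name is hypothetical.

```python
def disclosure_risk(p_disclosure_given_attempt, p_attempt):
    """Pr(D) = Pr(D|A) * Pr(A), with A read (by assumption) as an attempt."""
    for prob in (p_disclosure_given_attempt, p_attempt):
        if not 0.0 <= prob <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_disclosure_given_attempt * p_attempt

# Hypothetical scenarios: same data protections, different access arrangements
scenarios = {
    "open release": (0.05, 0.20),   # (Pr(D|A), Pr(A))
    "secure lab": (0.05, 0.01),
}
risks = {name: disclosure_risk(*args) for name, args in scenarios.items()}
```

Under this reading, controls on who accesses the data and where would act mainly on Pr(A), while protections applied to the data and outputs would act mainly on Pr(D|A).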
Temporal modelling
• The shape of the curve is not constant over time
• Temporal modelling would seem appropriate
– Dynamic linear model for production data
– Dynamic logistic regression for binary data
• Modelling the “beta” in GREG over time
– State Transition Equation
State Space Modelling for Satellite Imagery data (Tam, 1987, 2015)
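A minimal sketch of a dynamic linear model with a time-varying “beta” follows. The slide's own equations did not survive in this transcript, so the form used here is an assumption: observation equation y_t = x_t'β_t + e_t and a random-walk state transition equation β_t = β_{t-1} + η_t, filtered with the standard Kalman recursions. The function name and the toy data are hypothetical.

```python
import numpy as np

def kalman_dynamic_regression(y, X, obs_var, state_var, beta0, P0):
    """Filter y_t = x_t' beta_t + e_t with beta_t = beta_{t-1} + eta_t.

    Returns the filtered coefficient vectors beta_{t|t}, one row per period.
    """
    beta = np.asarray(beta0, dtype=float)
    P = np.asarray(P0, dtype=float)
    Q = state_var * np.eye(beta.size)
    filtered = []
    for y_t, x_t in zip(y, X):
        # State transition (random walk): mean unchanged, uncertainty grows
        P = P + Q
        # Measurement update with the new observation
        innovation = y_t - x_t @ beta
        S = x_t @ P @ x_t + obs_var        # innovation variance
        K = P @ x_t / S                    # Kalman gain
        beta = beta + K * innovation
        P = P - np.outer(K, x_t) @ P
        filtered.append(beta.copy())
    return np.array(filtered)

# Toy data: constant true coefficients, no noise, vague prior on beta
t = np.arange(20.0)
X = np.column_stack([np.ones_like(t), t])
y = X @ np.array([1.0, 2.0])
betas = kalman_dynamic_regression(y, X, obs_var=1.0, state_var=0.01,
                                  beta0=np.zeros(2), P0=1e6 * np.eye(2))
```

Setting `state_var` to zero reduces the recursion to ordinary recursive least squares. A positive value lets the coefficients drift, which is one way to accommodate a curve whose shape changes over time.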
Where do survey sampling errors go if they are not removed?
Better signal extraction using Structural Time Series model
• Explicit modelling of sampling error for survey estimates
– $y_t = \vartheta_t + u_t$; $\vartheta_t = T_t + S_t + I_t$
– Modelling for trend (e.g. local linear trend), seasonal effects (e.g. dummy seasonal)
• Option 1 – put $\hat{\vartheta}_{t|t}$ through the linear filters (e.g. X13 ARIMA) to decompose into trend, seasonal effects and irregular (ABS option)
• Option 2 – use $\hat{T}_{t|t}$ and $\hat{S}_{t|t}$ as trend and seasonal effects estimates
• Benefit for seasonally adjusted estimates for areas with relatively small sample sizes
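The idea of separating the signal from the sampling error can be sketched with a deliberately simplified model. The assumptions are loud ones: the signal is a local linear trend only (no seasonal or irregular component), and the sampling error is treated as white noise with a known variance in each period, whereas sampling errors in a rotating-panel survey are typically autocorrelated. The function name and the toy series are hypothetical.

```python
import numpy as np

def filter_signal(y, sampling_var, level_var, slope_var):
    """Filtered signal for y_t = theta_t + u_t, theta_t a local linear trend.

    sampling_var[t] is the (assumed known) variance of the sampling error u_t.
    Returns the filtered level, i.e. the estimate of theta_t given y_1..y_t.
    """
    T = np.array([[1.0, 1.0], [0.0, 1.0]])   # level_t = level_{t-1} + slope_{t-1}
    Q = np.diag([level_var, slope_var])
    z = np.array([1.0, 0.0])                  # only the level is observed
    a = np.zeros(2)                           # state: [level, slope]
    P = 1e6 * np.eye(2)                       # vague initial state
    signal = []
    for y_t, h_t in zip(y, sampling_var):
        # Update: weigh the survey estimate by its sampling variance
        S = z @ P @ z + h_t
        K = P @ z / S
        a = a + K * (y_t - z @ a)
        P = P - np.outer(K, z @ P)
        signal.append(a[0])
        # Predict the state for the next period
        a = T @ a
        P = T @ P @ T.T + Q
    return np.array(signal)

# Toy series: an exact linear trend, so the filter should reproduce it
y = 100.0 + 2.0 * np.arange(24)
signal = filter_signal(y, sampling_var=np.ones(24), level_var=0.1, slope_var=0.01)
```

The larger the sampling variance relative to the trend variances, the more the filtered signal leans on the model rather than on the latest survey estimate, which is where the benefit for small sample sizes would come from.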
ACT Employment and Unemployment Estimates
Small Domain Estimation
SDE with repeated surveys
Multiple data sources to improve survey estimates
• Borrowing strength from multiple sources (Harvey and Chung, 2000; Zhang and Honchar, 2016)
• Using Unemployment Benefit Claimant Counts to improve the Labour Force Survey (LFS) estimates
– Exploit the correlation in the error covariance matrix
– Bivariate State Space Model (aka Seemingly Unrelated Time Series Equations model - SUTSE)
• Quality assurance tool for ABS LFS unemployment estimates
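The step of exploiting the correlation in the error covariance matrix can be shown in isolation. Suppose a bivariate model has produced a joint one-step-ahead forecast for the LFS series and the claimant count (CC) series, and the CC figure then becomes known first. Under joint normality, the LFS prediction is the conditional mean given the CC value. This sketch covers only that conditioning step, not a full SUTSE model, and all numbers are hypothetical.

```python
import math

def predict_given_partner(mean_1, mean_2, var_1, var_2, cov_12, observed_2):
    """Conditional mean and variance of series 1 once series 2 is observed.

    Assumes the two one-step-ahead forecast errors are jointly normal with
    variances var_1, var_2 and covariance cov_12.
    """
    gain = cov_12 / var_2
    cond_mean = mean_1 + gain * (observed_2 - mean_2)
    cond_var = var_1 - gain * cov_12
    return cond_mean, cond_var

# Hypothetical forecasts: LFS-type series (1) and claimant-count series (2)
lfs_pred, lfs_var = predict_given_partner(
    mean_1=100.0, mean_2=50.0, var_1=4.0, var_2=9.0, cov_12=3.0, observed_2=56.0)
lower = lfs_pred - 1.96 * math.sqrt(lfs_var)
upper = lfs_pred + 1.96 * math.sqrt(lfs_var)
```

The stronger the correlation, the more the conditional variance shrinks below `var_1`. With zero covariance the claimant count carries no information and the prediction is unchanged.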
Seemingly Unrelated Time Series Equations Model (SUTSE)
Case study of Unemployment – LFS estimates vs benefit claimant count
[Figure: Smoothed estimates of LFS unemployment trend slope ($\nu_{1,t|T}$, panel L_lfs_unemp_o-Slope) vs Claimant Count trend slope ($\nu_{2,t|T}$, panel L_cc_total-Slope), 1996 to 2016]
Case study of Unemployment (cont.)
[Figure: Prediction of total unemployment using known CC data, March 2016. Bars: LFS (761.1) and SSM predicted (763.3), with 95% low and high bounds]
Concluding remarks
• Methodology innovation is fundamental to support ABS transformation
• Methodological transformational change
– More measured use of statistical models
– Reforming data access
• Methodological innovation
– Use measurement error model for aggregate stats
– Use of SSM for a number of applications involving time
• Change programs
– Need to build support and buy-in from subject matter colleagues
Selected References
Biemer, P. (2016). Key note address to the 2016 International Survey Error Workshop.
Fellegi, I. and Sunter, A.B. (1969). A theory of record linkage. Journal of the American Statistical Association, 64, 1183-1210.
Harvey, A. and Chung, C.H. (2000). Estimating the underlying change in unemployment in the UK. Journal of the Royal Statistical Society, Series A, 3, 303-339.
Kim, J.K., Berg, E. and Park, T. (2016). Statistical matching using fractional imputation. Survey Methodology, 40, 19-40.
Ritchie, F. (2014). Access to Sensitive Data: Satisfying Objectives Rather than Constraints. Journal of Official Statistics, 30, 533-545.
Samuels, C. (2012). Using the EM algorithm to estimate the parameters of the Fellegi-Sunter model for data linking.
Tam, S-M. (1987). Analysis of a repeated survey using a dynamic linear model. International Statistical Review, 55, 63-73.
Tam, S-M. (2014). Methodology architecture – a roadmap for new methodological directions in the Australian Bureau of Statistics. Journal of Official Statistics, 30, 371-375.
Tam, S-M. (2015). A Statistical Framework for Analysing Big Data. Survey Statistician, 72, 36-51.
Tam, S-M. and Clarke, F. (2015). Big Data, Official Statistics and Some Initiatives by the ABS. International Statistical Review, 83, 436-448.
Thomsen, I.B. (1973). A note on the efficiency of weighting subclass means to remove the effects of non-response when analysing survey data. Statistics Norway. Unpublished manuscript.
Zhang, M. and Honchar, O. (2016). Predicting survey estimates by state space models using multiple data sources. Australian Bureau of Statistics. Unpublished manuscript.