TRANSCRIPT
SAS Government
Analytics Leadership
Forum
April 2018
Anil Arora, Chief Statistician of Canada
• Translating data into evidence for 100
years
• Using statistical science and sophisticated
methods to produce reliable information
about Canadians
• A lot goes on behind the scenes to
produce the census…
Statistics Canada
Census: Behind the Scenes
• New data sources, together with the growing sophistication and capacity of our users, underpin the need to modernize our methods and outputs
• Leading-edge methods and data integration are a key pillar of our modernization agenda
The data revolution is changing
Canada’s society and the
expectations of Canadians
Statistics Canada is undertaking a significant transformation and leading efforts to be more responsive to the data needs of policy leaders by:
• Moving beyond a survey-first approach with new methods and integrating data from a variety of existing sources
• Making data easier to access and use by adopting new tools to analyze and visualize data
• Enabling Canadians to use data to make evidence-based decisions
Design and collection
Optimize designs and processes (samples, collection, coding, record linkage)
Processing and inference
Statistical error detection and correction, weighting, weight adjustments, use of statistical models
Analysis
Time series analysis, statistical data validation and confrontation, data interpretation
Consumption
Supporting quality decisions by citizens, their governments and businesses based on evidence
Statistical analysis is at the center of every
step in the cycle of translating data to
evidence
G-SAM, G-CODE, G-LINK
BANFF, CANCEIS
G-SERIES
G-CONFID
All processing systems (G-SAM, etc.) are coded in SAS
Statistical analysis is critical to producing high-quality information in the most cost-efficient manner
Dissemination
Measurement of accuracy, statistical disclosure control (privacy)
Leading-edge methods to
integrate new data types:
Model-based crop yield estimates
Responding to rapidly evolving
policy needs:
January 11, 2018 print edition
Statistics Canada’s linkable file environment
Administrative data files from departments, agencies and Crown corporations
Existing survey and administrative data files at Statistics Canada
Basic descriptive statistics
Before-after analysis
Cohort analysis
Linked file for ongoing research
Integrating data to enable the Horizontal Review of Innovation and Clean Tech
✓ Gathering data efficiently and strategically
✓ Leveraging existing data holdings across government
✓ Creating a new research dataset to allow further analysis
Evolving with the times
SAS first introduced at Statistics Canada in the late 1980s
From:
• Character-based green screens on the mainframe
• Primitive Windows user interfaces
• Enterprise Guide
Moving to:
• Visual Analytics, Enterprise Miner and Viya
Canadian Housing Statistics Program
• TransUnion data (43 million records) linked to tax information (165 million records)
• 233 million possible pairs created
• Runs in about 40 hours on the SAS Grid
• Would not be possible on a dedicated Windows Server
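The slide does not say how the 233 million possible pairs were generated. That figure is far smaller than the full cross product of 43 million by 165 million records, which is consistent with comparing only records that share some key. The Python sketch below illustrates one common way to do that, known as blocking. It is an assumption for illustration, not Statistics Canada's G-LINK method, and the record fields (`postal`, `birth_year`) and the `candidate_pairs` helper are hypothetical.

```python
from collections import defaultdict


def candidate_pairs(file_a, file_b, block_key):
    """Return (id_a, id_b) pairs for records that share a blocking key."""
    # Index the second file by blocking key so each record in the first
    # file is only compared with records in the same block.
    blocks = defaultdict(list)
    for rec in file_b:
        blocks[block_key(rec)].append(rec["id"])

    pairs = []
    for rec in file_a:
        for id_b in blocks.get(block_key(rec), []):
            pairs.append((rec["id"], id_b))
    return pairs


# Hypothetical toy records; field names are illustrative only.
credit = [
    {"id": "c1", "postal": "K1A", "birth_year": 1980},
    {"id": "c2", "postal": "K1A", "birth_year": 1975},
    {"id": "c3", "postal": "M5V", "birth_year": 1980},
]
tax = [
    {"id": "t1", "postal": "K1A", "birth_year": 1980},
    {"id": "t2", "postal": "K1A", "birth_year": 1980},
    {"id": "t3", "postal": "M5V", "birth_year": 1980},
    {"id": "t4", "postal": "V6B", "birth_year": 1990},
]


def key(rec):
    # Assumed blocking key: same postal prefix and same birth year.
    return (rec["postal"], rec["birth_year"])


pairs = candidate_pairs(credit, tax, key)
full_cross = len(credit) * len(tax)  # pairs without any blocking
print(len(pairs), "candidate pairs instead of", full_cross)
```

Each candidate pair would then be scored by a comparison rule; only the pair-generation step is sketched here.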
StatCan SAS Grid
- Started as a research project made up of 4 workstations
- Evolved to be the largest SAS Grid implementation in Canada:
- 16 Grid nodes, each with 16 cores
- 256 compute cores and 60 terabytes (TB) of shared file system
Allows many processes to run concurrently:
- large record linkages
- complex estimation processes
- multi-dimensional tabulations
Continued improvement: with the StatCan SAS Grid and the new SAS application G-Tab Census, creating a table takes 95% less time than creating the same table with the 2016 Tabulation system
Pure Data Analytic (Netezza)
• Capacity to store, process and analyze Big Data
• Planned use-cases:
• CPI alternate data source
• Canadian Housing Statistics Program linkage
• Admin Data Lake
Old and new: combining traditional and AI methods in the 2016 Census
Immigration-related variables:
Traditional: data were added through record linkage instead of collection
- Result: 24,000-hour reduction in respondent burden
AI: to fill in missing values, machine learning identified the best combination of respondent characteristics for making corrections
- Result: complete data; up to 10% more accurate
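The slide does not describe the machine learning method used. As a rough illustration of what "identifying the best combination of respondent characteristics" can mean, the Python sketch below hides known values, fills them in from donors who share a given set of characteristics, and keeps the combination that recovers the hidden values most often. This is an assumed, simplified approach, not the 2016 Census procedure; the field names (`age_group`, `region`, `language`, `status`) and helper functions are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def impute_mode(train, features, target):
    """Map each combination of feature values to the most common target value."""
    groups = defaultdict(Counter)
    for rec in train:
        groups[tuple(rec[f] for f in features)][rec[target]] += 1
    return {k: c.most_common(1)[0][0] for k, c in groups.items()}


def score(train, holdout, features, target):
    """Share of held-out records whose known value is recovered."""
    lookup = impute_mode(train, features, target)
    hits = sum(
        lookup.get(tuple(r[f] for f in features)) == r[target] for r in holdout
    )
    return hits / len(holdout)


def best_combination(train, holdout, candidates, target):
    """Try every non-empty subset of candidates; ties go to the earliest,
    which is the smallest combination because subsets are ordered by size."""
    combos = [
        c
        for n in range(1, len(candidates) + 1)
        for c in combinations(candidates, n)
    ]
    return max(combos, key=lambda c: score(train, holdout, c, target))


def rec(age_group, region, language, status):
    return {"age_group": age_group, "region": region,
            "language": language, "status": status}


# Hypothetical toy data: records with a known target value.
train = [
    rec("young", "east", "en", "A"),
    rec("young", "west", "fr", "A"),
    rec("old", "east", "fr", "B"),
    rec("old", "west", "en", "B"),
    rec("young", "east", "fr", "A"),
    rec("young", "west", "en", "A"),
]
# Records whose known value is hidden to test each combination.
holdout = [
    rec("young", "west", "en", "A"),
    rec("old", "east", "en", "B"),
]

candidates = ["age_group", "region", "language"]
best = best_combination(train, holdout, candidates, "status")
print(best, score(train, holdout, best, "status"))
```

Exhaustive search over subsets is only practical for a handful of characteristics; a production system would need a more scalable selection method.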
OUTCOMES
Now: More accurate data for IRCC policymakers
Later: Proof of concept for Census 2021
SAS and Statistics Canada
THANK YOU!
For more information,
please visit
www.statcan.gc.ca
#StatCan100