
Program and Abstracts


Welcome

The Organizing Committee of Statistics Canada's 2016 International Symposium on Methodological Issues would like to welcome you to this event. The theme of this 30th symposium is “Growth in Statistical Information: Challenges and Benefits”. We hope that the proposed scientific program will appeal to you and will create fruitful learning and networking opportunities as well as pleasant encounters with former and new colleagues. Have a great symposium!

Symposium 2016 Organizing Committee

Joseph Duggan, Chair

Logistics Committee: Gildas Kleim (Chair), Susan Demedash, Catherine Deshaies-Moreault, Geneviève Vézina

Scientific Committee: Marie-Claude Duval (Chair), Melanie Abeysundera, Jean-Francois Beaumont, Kenneth Chu, Abel Dasylva, Susie Fortier, Pierre Lavallée, Lenka Mach

Registrar: Nick Budko, Carole Jean-Marie


Table of Contents

Notes to participants

Sessions on March 22nd: Schedule, Program, Abstracts

Sessions on March 23rd: Schedule, Program, Abstracts

Sessions on March 24th: Schedule, Program, Abstracts

Final Thoughts


Notes to participants

Conference Rooms
Symposium talks will be held on the third floor of the Palais des Congrès de Gatineau (PCG), in the Chapleau and Gatineau rooms.

Translation Devices
Translation devices will be available at the registration table located in the Prefunction Space on the third floor.

Breaks
Time will be allotted for a morning break and an afternoon break in the Papineau room of the Palais des Congrès. Time will also be allotted for lunch. A list of restaurants and coffee shops within walking distance of the PCG is provided to participants.

Notes
Blank sheets are appended to the end of the program for notes. A note pad and the Symposium schedule are provided to participants who do not have a printed program.

Coat Room
Coat racks are available in the Chapleau and Gatineau rooms (3rd floor).

Parking
Many pay parking spots are available at the PCG.

Public Transit
The PCG is close to Place du Portage (Portage 2) and is well served by mass transit. Bus transit information for Ottawa and Gatineau can be found at www.octranspo.com and www.sto.ca respectively.

Palais des Congrès de Gatineau Address
Palais des Congrès de Gatineau
50 boulevard Maisonneuve, 3rd floor
Gatineau, Québec J8X 4H4, Canada
Telephone: 1-819-595-8000 or 1-888-595-8001
Fax: 1-819-595-8012
Email: [email protected]

Day 1 – Tuesday – March 22, 2016

08:00 – 17:00 Registration (Third Floor)

08:45 – 09:00 Opening Remarks (Chapleau and Gatineau Rooms)

09:00 – 10:00 Plenary Session

Session 1: Keynote Address (Chapleau and Gatineau Rooms)

10:00 – 10:30 Morning Break (Papineau Room)

10:30 – 12:00 Concurrent Sessions

Session 2A: Big Data in Official Statistics (Chapleau Room)

Session 2B: Applications Related to Growth in Statistical Information (Gatineau Room)

12:00 – 13:30 Lunch

13:30 – 15:00 Concurrent Sessions

Session 3A: Total Survey Error (Chapleau Room)

Session 3B: Alternative Data Sources to Replace or Complement Survey Data (Gatineau Room)

15:00 – 15:30 Afternoon Break (Papineau Room)

15:30 – 17:00 Concurrent Sessions

Session 4A: Open Data (Chapleau Room)

Session 4B: Quality of Administrative Data (Gatineau Room)

Day 2 – Wednesday – March 23, 2016

08:00 – 17:00 Registration (Third Floor)

08:45 – 09:45 Plenary Session

Session 5: Waksberg Award Winner Address (Chapleau and Gatineau Rooms)

09:45 – 10:00 Plenary Session

Speed Advertisement for Posters and Software Demonstration (Chapleau and Gatineau Rooms)

10:00 – 10:30 Poster Session, Software Demonstration and Morning Break (Papineau Room)

10:30 – 12:00 Concurrent Sessions

Session 6A: New Advancements in Record Linkage (Chapleau Room)

Session 6B: Confidentiality (Gatineau Room)

12:00 – 13:30 Lunch

(Poster Session, Software Demonstration: Papineau Room, 13:00 – 13:30)

13:30 – 15:00 Concurrent Sessions

Session 7A: Non-traditional Methods for Analysis of Survey Data (Chapleau Room)

Session 7B: Applications of Record Linkage and Statistical Matching (Gatineau Room)

15:00 – 15:30 Poster Session, Software Demonstration and Afternoon Break (Papineau Room)

15:30 – 17:00 Concurrent Sessions

Session 8A: Paradata (Chapleau Room)

Session 8B: Use of Administrative Data (Gatineau Room)

Day 3 – Thursday – March 24, 2016

08:00 – 12:00 Registration (Third Floor)

08:45 – 10:15 Concurrent Sessions

Session 9A: Scanner Data (Chapleau Room)

Session 9B: Health Data (Gatineau Room)

10:15 – 10:45 Morning Break (Papineau Room)

10:45 – 11:45 Plenary Session

Session 10: Data Science for Dynamic Data Systems: Implications for Official Statistics (Chapleau and Gatineau Rooms)

11:45 – 12:00 Plenary Session

Session 11: Closing Remarks (Chapleau and Gatineau Rooms)

Tuesday, March 22, 2016

PROGRAM

8:00 - 17:00 Registration – Third Floor

8:45 - 9:00 (E, F) Opening Remarks – Chapleau and Gatineau Rooms

Sylvie Michaud, Assistant Chief Statistician, Statistics Canada

9:00 - 10:00 Session 1 – Keynote Address – Chapleau and Gatineau Rooms

Chair: Joseph Duggan, Statistics Canada

(E) Methodological Issues and Challenges in the Production of Official Statistics

Danny Pfeffermann, Government Statistician of Israel, Hebrew University of Jerusalem, Israel, Southampton Statistical Sciences Research Institute, United Kingdom

10:00 - 10:30 Morning Break

10:30 - 12:00 Session 2A – Big Data in Official Statistics – Chapleau Room

Organizer / Chair: Pierre Lavallée, Statistics Canada

(E) Challenges to Methodological Research in Official Statistics
Kees Zeelenberg, Statistics Netherlands, Netherlands

(E) Profiling of Twitter data: a Big Data selectivity study
Joep Burger, Quan Le, Olav ten Bosch and Piet Daas, Statistics Netherlands, Netherlands

(E) The Alternative Data Solution – Experience from Statistics Canada’s Producer Prices Division
Gaétan Garneau and Mary Beth Garneau, Statistics Canada

10:30 - 12:00 Session 2B – Applications Related to Growth in Statistical Information – Gatineau Room

Chair: Claude Poirier, Statistics Canada

The capital letter in front of the title indicates the language of the presentation.

(E) = English (F) = French

The slides are in both official languages for most presentations. Simultaneous translation is offered for all presentations.


(E) Challenges and results in using Audit Trail data to monitor Labour Force Survey data quality
Justin Francis and Yves Lafortune, Statistics Canada

(E) Statistics Canada’s Household Survey Frames Programme – Strategic Research Enabling a Shift to Increased Use of Administrative Data as Input to the Social Statistics Program
Tim Werschler, Edward Chen, Kim Charland and Crystal Sewards, Statistics Canada

(F) Road Congestion Measures Using Instantaneous Speed Information from CVUS
Émile Allie, Transport Canada, Canada

(E) Data warehouse and analytical tools to facilitate the integration of the Canadian macroeconomic accounts
Alistair Macfarlane and Jordan-Daniel Sabourin, Statistics Canada

12:00 - 13:30 Lunch

13:30 - 15:00 Session 3A – Total Survey Error – Chapleau Room

Organizer / Chair: Dave Dolson, Statistics Canada

(E) Using Administrative Records to Evaluate Survey Data
Mary H. Mulry, Elizabeth M. Nichols and Jennifer Hunter Childs, U.S. Census Bureau, USA

(E) Trends in Nonresponse and Linkage Consent Bias in a Panel Survey
Joseph Sakshaug, University of Manchester, United Kingdom and Martina Huber, Institute for Employment Research, United Kingdom

(E) Big Data: A Survey Research Perspective

Reg Baker, Marketing Research Institute International, USA

13:30 - 15:00 Session 3B – Alternative Data Sources to Replace or Complement Survey Data – Gatineau Room

Chair: Josée Morel, Statistics Canada

(E) A case study in administrative data informing policy development while cutting costs
Yves Gingras, Tony Haddad, Stéphanie Roberge, Georges Awad and Andy Handouyahia, Employment and Social Development Canada, Canada

(E) Towards an integrated census–administration data approach to item-level imputation for the 2021 UK Census
Steven Rogers and Fern Leather, Office for National Statistics, United Kingdom

(E) Comparison of Survey Data to Administrative Sources

James Hemeon, Statistics Canada

(F) Student Pathways and Graduate Outcomes: Studies from linked data
Aimé Ntwari, Éric Fecteau, Rubab Arim, Christine Hinchley and Sylvie Gauthier, Statistics Canada

(E) Estimating the effects related to the timing of participation in Employment Assistance Services using rich administrative data
Stéphanie Roberge, Andy Handouyahia, Tony Haddad, Georges Awad and Yves Gingras, Employment and Social Development Canada, Canada

15:00 - 15:30 Afternoon Break

15:30 - 17:00 Session 4A – Open Data – Chapleau Room

Organizer / Chair: Hélène Bérard, Statistics Canada

(E) An International Overview of Open Data Experiences

Timothy Herzog, World Bank, USA

(E) Open data at Statistics Canada

Bill Joyce, Statistics Canada

(E) Exploring Canada’s Open Government Portal

Ashley Casovan, Treasury Board of Canada Secretariat, Canada

15:30 - 17:00 Session 4B – Quality of Administrative Data – Gatineau Room

Chair: Laurie Reedman, Statistics Canada

(E) Assimilation and Coverage of the Foreign-Born Population in Administrative Records
Renuka Bhaskar, Leticia Fernandez and Sonya Rastogi, U.S. Census Bureau, USA

(E) When Race and Hispanic Origin Reporting are Discrepant Across Administrative Records Sources: Exploring Methods to Assign Responses
Sharon R. Ennis, Sonya Rastogi and James Noon, U.S. Census Bureau, USA

(F) The challenges of linking and using administrative data from different sources
Philippe Gamache, Institut national de santé publique du Québec, Canada

(F) Using data linkage to evaluate address matches between census data and tax data
Julien Bérard-Chagnon and Georgina House, Statistics Canada

(F) Estimating Internal Migration: Issues Related to Using Tax Data

Guylaine Dubreuil and Georgina House, Statistics Canada

Wednesday, March 23, 2016

8:00 - 17:00 Registration – Third Floor

8:45 - 9:45 Session 5 – Waksberg Award Winner Address – Chapleau and Gatineau Rooms

Chair: Mike Hidiroglou, Statistics Canada

(E) Towards a Quality Framework for Blends of Designed and Organic Data
Dr. Robert Groves, Georgetown University, USA

9:45 - 10:00 Speed Advertisement for Posters and Software Demonstration – Chapleau and Gatineau Rooms

Organizer / Chair: Susie Fortier, Statistics Canada

10:00 - 10:30 Poster Session, Software Demonstration and Morning Break – Papineau Room

Poster Session

(E) Handling survey feedback in business statistics
Jörgen Brewitz, Eva Elvers and Fredrik Jonsson, Statistics Sweden, Sweden

(E) Measuring Data Quality of Price Indexes: the Producer Prices Division’s Performance Measure Grading Scheme
Kate Burnett-Isaacs, Statistics Canada

(E) Creation and use of large synthetic data sets in 2021 Census Transformation Programme
Cal Ghee, Rob Rendell, Orlaith Fraser, Steve Rogers, Fern Leather, Keith Spicer and Peter Youens, Office for National Statistics, United Kingdom

Software Demonstration

(E/F) The use of a SAS Grid at Statistics Canada
Yves Deguire, Statistics Canada

(E/F) SAS® High-Performance Forecasting Software at Statistics Canada
Frédéric Picard, Statistics Canada

(E) High Performance Analytics – How SAS can help you save time and make better decisions with modern analytics!

Steve Holder, SAS Canada, Canada

(E) Machine learning in the service of official statistics

Valentin Todorov, United Nations Industrial Development Organization (UNIDO), Austria

(E) Common Statistical Production Architecture and "confidentiality-on-the-fly"

Robert McLellan and Predrag Mizdrak, Statistics Canada


10:30 - 12:00 Session 6A – New Advancements in Record Linkage – Chapleau Room

Organizer / Chair: Abel Dasylva, Statistics Canada

(E) Statistical Modeling for Errors in Record Linkage Applied to SEER Cancer Registry Data

Michael D. Larsen, The George Washington University, USA

(E) Sampling procedures for assessing accuracy of record linkage
Paul Smith, University of Southampton, United Kingdom; Shelley Gammon, Sarah Cummins, Christos Chatzoglou and Dick Heasman, Office for National Statistics, United Kingdom

(E) Bayesian Estimation of Bipartite Matchings for Record Linkage
Mauricio Sadinle, Duke University, USA

10:30 - 12:00 Session 6B – Confidentiality – Gatineau Room

Chair: Peter Wright, Statistics Canada

(E) Finding a Needle in a Haystack: The Theoretical and Empirical Foundations of Assessing Disclosure Risk for Contextualized Microdata
Kevin T. Leicht, University of Illinois, USA

(E) A modern Job Submission Application to access IAB's confidential administrative and survey research data
Johanna Eberle, Jörg Heining, Dana Müller and David Schiller, Institute for Employment Research, Germany

(E) Enhancing Data Sharing via “Safe Designs”
Kristine Witkowski, University of Michigan, USA

(E) Privacy and Security Aspects Related to the Use of Big Data – progress of work in the European Statistical System (ESS)
Pascal Jacques, EUROSTAT, Luxembourg

(E) Practical Applications of Secure Computation for Disclosure Control
Luk Arbuckle, Children’s Hospital of Eastern Ontario Research Institute, Canada and Khaled El Emam, Children’s Hospital of Eastern Ontario Research Institute, University of Ottawa, Canada

12:00 - 13:30 Lunch

13:00 - 13:30 Poster Session and Software Demonstration – Papineau Room

Poster Session

(E) Handling survey feedback in business statistics
Jörgen Brewitz, Eva Elvers and Fredrik Jonsson, Statistics Sweden, Sweden

(E) Measuring Data Quality of Price Indexes: the Producer Prices Division’s Performance Measure Grading Scheme
Kate Burnett-Isaacs, Statistics Canada

(E) Creation and use of large synthetic data sets in 2021 Census Transformation Programme
Cal Ghee, Rob Rendell, Orlaith Fraser, Steve Rogers, Fern Leather, Keith Spicer and Peter Youens, Office for National Statistics, United Kingdom

Software Demonstration

(E/F) The use of a SAS Grid at Statistics Canada
Yves Deguire, Statistics Canada

(E/F) SAS® High-Performance Forecasting Software at Statistics Canada
Frédéric Picard, Statistics Canada

(E) High Performance Analytics – How SAS can help you save time and make better decisions with modern analytics!

Steve Holder, SAS Canada, Canada

(E) Machine learning in the service of official statistics

Valentin Todorov, United Nations Industrial Development Organization (UNIDO), Austria

(E) Common Statistical Production Architecture and "confidentiality-on-the-fly"
Robert McLellan and Predrag Mizdrak, Statistics Canada

13:30 - 15:00 Session 7A – Non-traditional Methods for Analysis of Survey Data – Chapleau Room

Organizer / Chair: Lenka Mach, Statistics Canada

(E) Empirical Likelihood Confidence Intervals for Finite Population Proportions

Changbao Wu, University of Waterloo, Canada

(E) Hypotheses Testing from complex survey data using bootstrap weights: a unified approach
J. N. K. Rao, Carleton University, Canada and Jae Kwang Kim, Iowa State University, USA

(E) An objective stepwise Bayes approach to account for missing observations
Glen Meeden, University of Minnesota, USA

13:30 - 15:00 Session 7B – Applications of Record Linkage and Statistical Matching – Gatineau Room

Chair: Colin Babyak, Statistics Canada

(E) Linking 2006 Census data to the 2011 mortality file
Mohan Kumar and Rose Evra, Statistics Canada

(E) Estimating the impact of active labour market programs using administrative data and matching methods
Andy Handouyahia, Tony Haddad, Stéphanie Roberge and Georges Awad, Employment and Social Development Canada, Canada

(E) An Overview of Business Record Linkage at Statistics Canada: How to link the “unlinkable”
Javier Oyarzun and Laura Wile, Statistics Canada

(E) Linking Canadian Patent Records from the US Patent Office to Statistics Canada’s Business Register, 2000 to 2011
Paul Holness, Statistics Canada

(E) Measuring the Quality of a Probabilistic Linkage through Clerical Reviews
Abel Dasylva, Melanie Abeysundera, Blache Akpoué, Mohammed Haddou and Abdelnasser Saïdi, Statistics Canada

15:00 - 15:30 Poster Session, Software Demonstration and Afternoon Break – Papineau Room

Poster Session

(E) Handling survey feedback in business statistics
Jörgen Brewitz, Eva Elvers and Fredrik Jonsson, Statistics Sweden, Sweden

(E) Measuring Data Quality of Price Indexes: the Producer Prices Division’s Performance Measure Grading Scheme
Kate Burnett-Isaacs, Statistics Canada

(E) Creation and use of large synthetic data sets in 2021 Census Transformation Programme
Cal Ghee, Rob Rendell, Orlaith Fraser, Steve Rogers, Fern Leather, Keith Spicer and Peter Youens, Office for National Statistics, United Kingdom

Software Demonstration

(E/F) The use of a SAS Grid at Statistics Canada

Yves Deguire, Statistics Canada

(E/F) SAS® High-Performance Forecasting Software at Statistics Canada

Frédéric Picard, Statistics Canada

(E) High Performance Analytics – How SAS can help you save time and make better decisions with modern analytics!

Steve Holder, SAS Canada, Canada

(E) Machine learning in the service of official statistics

Valentin Todorov, United Nations Industrial Development Organization (UNIDO), Austria

(E) Common Statistical Production Architecture and "confidentiality-on-the-fly"

Robert McLellan and Predrag Mizdrak, Statistics Canada

15:30 - 17:00 Session 8A – Paradata – Chapleau Room

Organizer / Chair: Michelle Simard, Statistics Canada

(E) On the Utility of Paradata in Major National Surveys: Challenges and Benefits
Brady West, University of Michigan, USA and Frauke Kreuter, University of Maryland, USA

(E) A Bayesian analysis of survey design parameters
Barry Schouten, Joep Burger, Lisette Bruin and Nini Mushkudiani, Statistics Netherlands, Netherlands

(E) Statistics Canada’s Experiences in Using Paradata to Manage Responsive Collection Design CATI household surveys
François Laflamme, Sylvain Hamel and Dominique Chabot-Hallé, Statistics Canada

15:30 - 17:00 Session 8B – Use of Administrative Data – Gatineau Room

Chair: Martin Renaud, Statistics Canada

(E) Redesign of the longitudinal immigration database (IMDB)

Rose Evra, Statistics Canada

(F) Creating a longitudinal database based on linked administrative registers: An example
Philippe Wanner, Université de Genève and NCCR On The Move, Switzerland and Ilka Steiner, Université de Genève, Switzerland

(E) Use of admin data to increase the efficiency of the sample design of the new National Travel Survey

Charles Choi, Statistics Canada

(E) Using Administrative Data to Study Education in Canada
Martin Pantel, Statistics Canada

Thursday, March 24, 2016

8:00 - 12:00 Registration – Third Floor

8:45 - 10:15 Session 9A – Scanner Data – Chapleau Room

Organizer / Chair: Martin Beaulieu, Statistics Canada

(F) Challenges Associated with Using Scanner Data for the Consumer Price Index
Catherine Deshaies-Moreault and Nelson Émond, Statistics Canada

(E) Product homogeneity and weighting when using scanner data for price index calculation
Antonio G. Chessa, Statistics Netherlands, Netherlands

(E) A look into the future – Scanner data and big data
Muhanad Sammar, Statistics Sweden, Sweden

8:45 - 10:15 Session 9B – Health Data – Gatineau Room

Chair: François Brisebois, Statistics Canada

(E) Comparing Canada's Healthcare System: Benefits and Challenges
Katerina Gapanenko, Grace Cheung, Deborah Schwartz and Mark McPherson, Canadian Institute for Health Information (CIHI), Canada

(E) A systematic review: Evaluating extant data sources for potential linkage
Erin Tanenbaum, NORC at the University of Chicago, USA; Michael Sinclair, Mathematica Policy Research, USA; Jennifer Hasche, NORC at the University of Chicago, USA and Christina Park, National Institute of Child Health and Human Development (NICHD), USA

(E) Providing Meaningful and Actionable Health System Performance Information via a Unique Pan-Canadian Interactive Web Tool
Jeanie Lacroix and Kristine Cooper, Canadian Institute for Health Information (CIHI), Canada

(E) Epidemiological observatory on Brazilian health data
Raphael de Freitas Saldanha and Ronaldo Rocha Bastos, Universidade Federal de Juiz de Fora, Brazil

(E) Data surveillance on the clinical data used for health system funding in Ontario
Lori Kirby and Maureen Kelly, Canadian Institute for Health Information (CIHI), Canada

10:15 - 10:45 Morning Break (Papineau Room)

10:45 – 11:45 Session 10 – Plenary Session – Chapleau and Gatineau Rooms

Organizer / Chair: Pierre Lavallée, Statistics Canada

(E) Data Science for Dynamic Data Systems: Implications for Official Statistics

Mary E. Thompson, University of Waterloo, Canada


11:45 – 12:00 Session 11 – Closing remarks – Chapleau and Gatineau Rooms

Claude Julien, Director General of the Methodology Branch, Statistics Canada

Tuesday, March 22, 2016

ABSTRACTS

Session 1 – Keynote Address

(E) Methodological Issues and Challenges in the Production of Official Statistics

Danny Pfeffermann, Government Statistician of Israel, Hebrew University of Jerusalem, Israel, Southampton Statistical Sciences Research Institute, United Kingdom

Rapid advances in technology, coupled with the increased availability of 'big data' and growing demand for more accurate, more detailed and more timely official data under tightened budgets, place enormous challenges before producers of official statistics across the world. In this presentation I shall discuss some of the major challenges as I see them and, in some cases, offer ways of dealing with them. Examples include the potential use of big data; privacy and confidentiality; the possible use of data obtained from web panels; accounting for mode effects; and the integration of administrative data and small area estimation for future censuses. In the last part of my talk I shall confront the question of whether universities train their students to work at National Statistical Offices.

Session 2A – Big Data in Official Statistics

(E) Challenges to Methodological Research in Official Statistics

Kees Zeelenberg, Statistics Netherlands, Netherlands

We identify several research areas and topics for methodological research in official statistics. We argue why these are important, and why these are the most important ones for official statistics. We describe the main topics in these research areas and sketch what seem to be the most promising ways to address them. The areas we touch upon are:

Quality of National accounts, in particular the rate of growth of GNI

Big data, in particular how to create representative estimates and how to make the most of big data when this is difficult or impossible

Nonresponse, in particular for web surveys

Statistical analysis, in particular of complex and coherent phenomena

Disclosure control, in particular with open data and big data

Increasing timeliness of preliminary and final statistical estimates

Dynamics of activities of individuals, in particular flows on the labor market

Frameworks for statistical development, in particular when development is decentralized

These topics are elements in the present Strategic Methodological Research Program that has recently been adopted at Statistics Netherlands, but they seem to be of wider interest to other national statistical institutes and the statistical research community.

(E) Profiling of Twitter data: a Big Data selectivity study
Joep Burger, Quan Le, Olav ten Bosch and Piet Daas, Statistics Netherlands, Netherlands

An ever-increasing amount of data about human behavior and economic activity is automatically logged by social media, road sensors, mobile phones and the like. This so-called Big Data is a potential data source for official statistics. Its high volume and rapid availability could be exploited for quick indicators on a diverse number of topics and for more precise estimates about smaller areas. One of the major challenges is to infer unbiased estimates from Big Data. Unlike in sample surveys, the mechanism generating Big Data is not a probability sample. As a result, Big Data typically covers a selective part of the target population. Auxiliary information explaining the missingness could be used to correct for this selectivity. Auxiliary variables might be linked from administrative registers, but often this is not feasible as the units in Big Data sources are difficult to relate to the units in administrative data. We wondered if it would be possible to obtain auxiliary information in another way. In this presentation, we will show how auxiliary information can be derived from the Big Data source itself, an approach called profiling, using Twitter as an example. We will show that we can reliably determine gender from Twitter accounts using the user name, the bio information, the profile picture and public tweets. From an associated LinkedIn account several additional characteristics can be derived.
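A minimal sketch of the profiling idea described in this abstract is given below: derive a gender indicator from the user name and bio text of an account. The name lists, bio keywords and field handling are assumptions made purely for illustration; they are not the authors' classifier, which also draws on the profile picture and public tweets.

```python
# Illustrative sketch of deriving an auxiliary "gender" variable from Twitter
# profile fields, in the spirit of the profiling approach described above.
# The name lists and keywords are toy assumptions, not the authors' method.
FEMALE_NAMES = {"anna", "maria", "sophie", "emma"}
MALE_NAMES = {"jan", "peter", "piet", "kees"}
FEMALE_HINTS = {"mother", "mom", "wife", "she/her"}
MALE_HINTS = {"father", "dad", "husband", "he/him"}

def profile_gender(user_name: str, bio: str) -> str:
    """Classify an account as 'female', 'male' or 'unknown' from profile text."""
    first = user_name.strip().split()[0].lower() if user_name.strip() else ""
    if first in FEMALE_NAMES:
        return "female"
    if first in MALE_NAMES:
        return "male"
    bio_words = {w.strip(".,;:!") for w in bio.lower().split()}
    if bio_words & FEMALE_HINTS:
        return "female"
    if bio_words & MALE_HINTS:
        return "male"
    return "unknown"  # could fall back to the profile picture or tweets

accounts = [("Anna de Vries", "Statistician and mother of two"),
            ("JP1987", "he/him, cycling and data")]
for name, bio in accounts:
    print(name, "->", profile_gender(name, bio))
```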

(E) The Alternative Data Solution – Experience from Statistics Canada’s Producer Prices Division
Gaétan Garneau and Mary Beth Garneau, Statistics Canada

Over the last decade, the Producer Prices Division of Statistics Canada has been expanding its program of Service Producer Price Indexes while continuing to improve its Goods and Construction Producer Price Indexes program. While the majority of Producer Price Indexes are based on traditional survey methods, efforts have been made to increase the use of administrative and alternative data in order to minimize the response burden on our respondents. These alternative data come in many forms, such as data from the web (list prices), statistics from other collecting agencies, third-party data files, purely administrative files and microdata from non-price surveys. Focusing exclusively on producer price programs, this paper provides information on the way Statistics Canada makes use of alternative data sources. It also presents the operational challenges and risks that statistical offices could face when relying more and more on third-party outputs. Data quality, certification processes, data availability, financial considerations, risks of data manipulation and other factors that change over time are only a subset of the risks and features that need to be assessed in a robust manner. In addition, the paper describes how Statistics Canada is currently developing tools for producer prices to integrate upcoming alternative data sets while collecting metadata, and is building a reporting tool that will help to populate Statistics Canada’s Administrative Data Inventory.

Session 2B – Applications Related to Growth in Statistical Information

(E) Challenges and results in using Audit Trail data to monitor Labour Force Survey data quality
Justin Francis and Yves Lafortune, Statistics Canada

The Labour Force Survey (LFS) is a monthly household survey of about 56,000 households that provides information on the Canadian labour market. Audit Trail is a Blaise programming option, for surveys like the LFS with Computer Assisted Interviewing (CAI), which creates files containing every keystroke, edit and timestamp of every data collection attempt on all households. Combining such a large survey with such a complete source of paradata opens the door to in-depth data quality analysis but also quickly leads to Big Data challenges. How can meaningful information be extracted from this large set of keystrokes and timestamps? How can it help assess the quality of LFS data collection? The presentation will describe some of the challenges that were encountered, the solutions used to address them, and the results of the analysis on data quality.
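For a sense of how a keystroke-and-timestamp stream can be condensed into usable paradata, the sketch below computes the time spent on each question per household. The three-field record layout is an assumption for illustration only and is not the Blaise Audit Trail file format.

```python
# Illustrative sketch: collapse a keystroke/timestamp stream into time spent
# per question. The three-field record layout is an assumption, not the
# actual Blaise Audit Trail format.
from datetime import datetime

# (household id, question id, timestamp of keystroke)
keystrokes = [
    ("HH001", "Q1", "2016-03-01 09:00:02"),
    ("HH001", "Q1", "2016-03-01 09:00:11"),
    ("HH001", "Q2", "2016-03-01 09:00:40"),
    ("HH001", "Q2", "2016-03-01 09:01:05"),
]

def seconds_per_question(records):
    """Return {(household, question): seconds between first and last keystroke}."""
    first_last = {}
    for hh, q, ts in records:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        lo, hi = first_last.get((hh, q), (t, t))
        first_last[(hh, q)] = (min(lo, t), max(hi, t))
    return {key: (hi - lo).total_seconds() for key, (lo, hi) in first_last.items()}

print(seconds_per_question(keystrokes))
# {('HH001', 'Q1'): 9.0, ('HH001', 'Q2'): 25.0}
```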

(E) Statistics Canada’s Household Survey Frames Programme – Strategic Research Enabling a Shift to Increased Use of Administrative Data as Input to the Social Statistics Program
Tim Werschler, Edward Chen, Kim Charland and Crystal Sewards, Statistics Canada

Statistics Canada’s Household Survey Frames (HSF) Programme provides various universe files that can be used alone or in combination to improve survey design, sampling, collection, and processing in the traditional “need to contact a household” model. Even as surveys migrate onto this core suite of products, the HSF is starting to plan the changes to infrastructure, organisation, and linkages with other data assets at Statistics Canada that will help enable a shift to increased use of a wide variety of administrative data as input to the social statistics programme. The presentation will provide an overview of the HSF Programme and the foundational concepts that will need to be implemented to expand linkage potential, and will identify strategic research being undertaken toward 2021.

(F) Road Congestion Measures Using Instantaneous Speed Information from CVUS

Émile Allie, Transport Canada, Canada

Traffic congestion was identified by the Urban Transportation Task Force (2012) as a growing problem in Canada. The problem is not limited to large cities; it is also becoming a problem in medium-sized cities and on roads going through cities. The congestion analysis is limited to the provinces that participated in the Canadian Vehicle Use Study (CVUS) for all four quarters of 2014: Québec, Ontario, Manitoba and Saskatchewan. For each participant in the CVUS, when a vehicle is active, the GPS information, instantaneous speed and fuel consumption are collected every second. The GPS information was linked to a road segment. Observations that could not be linked to a road segment were excluded from the analysis, leaving 180 million of 350 million records for the CVUS light component and 58 million of 140 million records for the heavy component. Because of the nature of the data, we had to adapt and develop concepts to measure congestion. The measures show that congestion varies by day of the week and hour of the day, and also by province, Census Metropolitan Area, quarter and road segment. Graphs and maps of the Toronto area are used to illustrate those measures.
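As a rough illustration of the kind of congestion measure such data can support, the sketch below expresses mean observed speed on a road segment, by hour of day, relative to an assumed free-flow reference speed. The record layout and the free-flow values are illustrative assumptions, not the definitions used in the study.

```python
# Illustrative sketch: a congestion index per (road segment, hour of day),
# defined here as mean observed speed divided by a free-flow reference speed.
# Record layout and free-flow references are assumptions for illustration.
from collections import defaultdict
from statistics import mean

# (segment id, hour of day, instantaneous speed in km/h)
readings = [("S1", 8, 32.0), ("S1", 8, 28.5), ("S1", 14, 58.0),
            ("S2", 8, 95.0), ("S2", 8, 102.0)]
free_flow = {"S1": 60.0, "S2": 100.0}  # assumed reference speed per segment

by_segment_hour = defaultdict(list)
for seg, hour, speed in readings:
    by_segment_hour[(seg, hour)].append(speed)

for (seg, hour), speeds in sorted(by_segment_hour.items()):
    index = mean(speeds) / free_flow[seg]  # values below 1 indicate slowdown
    print(f"segment {seg}, {hour:02d}:00  congestion index = {index:.2f}")
```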

Canada’s System of National Accounts is a highly integrated macroeconomic accounting system which requires the integration of concepts, methods and data across a series of economic accounts dealing with production, consumption, incomes, financial flows and stocks. This integration adds to the overall quality of statistics produced within the system; however it not without it challenges. One of the ways Statistics Canada has addressed these challenges is through the development of a data warehouse which facilitates this integration process. This paper outlines the integrated nature of the Canadian System of macroeconomic accounts and how the data warehouse facilitates both the compilation of the data and acts as a multidimensional analytical tool and that provides Business Intelligence (BI) Analytics to the compilers of the Canadian System of National Accounts at Statistics Canada.

Session 3A – Total Survey Error

(E) Using Administrative Records to Evaluate Survey Data
Mary H. Mulry, Elizabeth M. Nichols and Jennifer Hunter Childs, U.S. Census Bureau, USA

Administrative records provide a source of data to use in evaluating errors in survey responses. Such evaluations have the potential to aid in the design of data collection and estimation using survey data, but also to provide insight for designing estimation methodology that relies on a combination of survey and administrative records data, or a transition from survey data to an administrative source. Although administrative records contain copious amounts of data, they are collected for their own purposes and have their own error sources. Using administrative records to evaluate survey data is not always as straightforward as it may sound. In this paper, we draw on our experience with two studies of error in survey reports of moves that used administrative records, and on the experience of others, to discuss some of the challenges that researchers must address when using administrative records to evaluate survey error.

(E) Trends in Nonresponse and Linkage Consent Bias in a Panel Survey

Joseph Sakshaug, University of Manchester, United Kingdom and Martina Huber, Institute for Employment Research, United Kingdom

Surveys are susceptible to multiple error sources which threaten the validity of inferences drawn from them. While much of the survey methods literature has focused on identifying errors in cross-sectional surveys, errors in panel surveys have received less attention. Administrative records linked to the entire sample (respondents and nonrespondents) can be useful for studying various errors in panel surveys, including nonresponse, which tends to accumulate over multiple waves of the study. Record data can also be used to study errors due to linkage consent, which is commonly asked for in panel surveys, but not provided by all respondents. In this paper, we present bias estimates for both error sources from a panel survey in Germany. The bias estimates are derived from administrative data collected on a sample of employees who were invited to participate in the panel. We find evidence of increasing nonresponse bias over time for cross-sectional and longitudinal measured outcomes. The opposite pattern is observed for linkage consent bias, which decreases over time when respondents who do not provide consent in a prior wave are asked to reconsider their decision in subsequent waves. We conclude the paper with a discussion of the practical implications of these findings and propose suggestions for future research.
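For the flavour of the bias estimates referred to here: when an administrative outcome is available for the whole sample, the nonresponse bias of a respondent-based mean in a given wave can be estimated as the respondent mean minus the full-sample mean. The sketch below illustrates this standard construction with made-up values; the variable names are not from the study.

```python
# Illustrative sketch: nonresponse bias for each wave estimated from an
# administrative outcome known for the full sample (respondents and
# nonrespondents alike). Values and names are illustrative only.
from statistics import mean

# administrative earnings for every sampled person, and who responded per wave
admin_earnings = [31000, 45000, 52000, 28000, 39000, 61000, 24000, 47000]
responded = {
    "wave 1": [True, True, True, False, True, True, False, True],
    "wave 2": [True, False, True, False, True, True, False, False],
}

full_sample_mean = mean(admin_earnings)
for wave, flags in responded.items():
    resp_mean = mean(y for y, r in zip(admin_earnings, flags) if r)
    print(f"{wave}: respondent mean {resp_mean:.0f}, "
          f"estimated nonresponse bias {resp_mean - full_sample_mean:+.0f}")
```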

(E) Big Data: A Survey Research Perspective
Reg Baker, Marketing Research Institute International, USA

Big data is one of those terms that can mean different things to different people. To some, it simply means taking advantage of existing datasets of all sizes and finding ways to merge them in the hope of generating new insights. To others, it means datasets so large that our traditional processing and analytic systems can no longer accommodate them. In this more intriguing view, big data involves the merging of data from three principal sources: (1) transaction data generated when someone interacts with a company or institution; (2) the mostly unstructured data of social media; and (3) the Internet of Things, meaning the increasing use of interconnected objects—mobile phones, appliances, cars, traffic scanners, etc.—capable of measuring and transmitting information. Researchers are enthusiastic about big data because of the potential it offers not only to study behavior but also to build models to predict it. Many also see it as a substitute for surveys, which have become increasingly difficult, expensive, and sometimes less reliable. This presentation will explore four broad themes: (1) the amorphous character of big data; (2) the inherent data quality challenge; (3) the analytic challenge; and (4) big data’s potential impact on the future of surveys. It will attempt to separate out the hype surrounding big data from the actual progress being made across sectors.

Session 3B – Alternative Data Sources to Replace or Complement Survey Data

(E) A case study in administrative data informing policy development while cutting costs
Yves Gingras, Tony Haddad, Stéphanie Roberge, Georges Awad and Andy Handouyahia, Employment and Social Development Canada, Canada

The Labour Market Development Agreements (LMDAs) are agreements between Canada and the provinces and territories to fund labour market training and support services for Employment Insurance clients. The objective of this paper is to discuss the improvements over the years in the impact assessment methodology. The paper describes the LMDAs and past evaluation work, and discusses the drivers for making better use of large and detailed administrative data holdings. It then provides a detailed explanation of how the new approach made the evaluation process less resource-intensive for the Government of Canada and for the provinces and territories, while producing results that are more relevant to policy development. The paper also outlines the lessons learned from a methodological perspective, and provides insight into ways of making this type of use of administrative data effective and efficient, especially in the context of large programs delivered by various organisations.

(E) Towards an integrated census–administration data approach to item-level imputation for the 2021 UK Census
Steven Rogers and Fern Leather, Office for National Statistics, United Kingdom

In preparation for the 2021 UK Census, the ONS has committed to an extensive research programme exploring how linked administration data can be used to support conventional statistical processes. Item-level edit and imputation (E&I) will play an important role in adjusting the 2021 Census database. However, uncertainty associated with the accuracy and quality of available administration data renders the efficacy of an integrated census–administration data approach to E&I unclear. Current constraints that dictate an anonymised ‘hash-key’ approach to record linkage to ensure confidentiality add to that uncertainty. Here, we provide preliminary results from a simulation study comparing the predictive and distributional accuracy of the conventional E&I strategy implemented in CANCEIS for the 2011 UK Census with that of an integrated approach using synthetic administration data, with systematically increasing error, as auxiliary information. In this initial phase of research we focus on imputing single year of age. The aim of the study is to gain insight into the point at which the degree of error in the administration data leads to poorer performance than a conventional E&I strategy such as that used for the 2011 UK Census.
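A stripped-down sketch of the simulation logic is shown below: a missing single year of age is imputed either from a random donor (a crude stand-in for a conventional E&I strategy) or from a linked administrative value that is wrong with probability eps, and accuracy is tracked as eps grows. This is purely illustrative; the ONS study uses CANCEIS and much richer synthetic administration data.

```python
# Illustrative sketch: impute missing age either from a random donor (a crude
# stand-in for conventional E&I) or from a linked admin value that is wrong
# with probability eps, and compare accuracy as eps grows. The donor rule here
# ignores auxiliary variables, so it is deliberately simplistic.
import random

random.seed(2011)
true_age = [random.randint(0, 90) for _ in range(5000)]
missing = [random.random() < 0.1 for _ in true_age]           # 10% item nonresponse
donor_pool = [a for a, m in zip(true_age, missing) if not m]  # observed ages

for eps in (0.0, 0.1, 0.2, 0.4):
    hits_admin = hits_donor = total = 0
    for age, miss in zip(true_age, missing):
        if not miss:
            continue
        total += 1
        admin_age = age if random.random() >= eps else random.randint(0, 90)
        hits_admin += (admin_age == age)
        hits_donor += (random.choice(donor_pool) == age)
    print(f"admin error {eps:.1f}: admin-based accuracy {hits_admin/total:.2f}, "
          f"donor-based accuracy {hits_donor/total:.2f}")
```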

(E) Comparison of Survey Data to Administrative Sources
James Hemeon, Statistics Canada

Administrative data, depending on its source and original purpose, can be considered a more reliable source of information than survey-collected data. It does not require a respondent to be present and to understand question wording, and it is not limited by the respondent’s ability to recall events retrospectively. This paper compares selected survey data, such as demographic variables, from the Longitudinal and International Study of Adults (LISA) with various administrative sources for which LISA has linkage agreements in place. The agreement between data sources, and some factors that might affect it, are analyzed for various aspects of the survey.

(F) Student Pathways and Graduate Outcomes: Studies from linked data
Aimé Ntwari, Éric Fecteau, Rubab Arim, Christine Hinchley and Sylvie Gauthier, Statistics Canada

Files with linked data from the Postsecondary Student Information System (PSIS) and tax data can be used to examine the trajectories of students who pursue postsecondary education (PSE) programs and their post-schooling labour market outcomes. On one hand, administrative data on students, linked longitudinally, can provide aggregate information on student pathways during postsecondary studies, such as persistence rates, graduation rates and mobility. On the other hand, the tax data can provide information on employment outcomes, such as average earnings or earnings progression by employment sector (industry), field of study, education level and other demographic information, year over year after graduation. Two longitudinal pilot studies are being done using administrative data on postsecondary students of Maritime institutions, which have been longitudinally linked and linked to Statistics Canada tax data (the T1 Family File) for the relevant years. This presentation first focuses on the quality of information in the administrative data and the methodology used to conduct these longitudinal studies and derive indicators. Second, it focuses on some limitations when using administrative data, rather than a survey, to define some concepts.

(E) Estimating the effects related to the timing of participation in Employment Assistance Services using rich administrative data

Stéphanie Roberge, Andy Handouyahia, Tony Haddad, Georges Awad and Yves Gingras, Employment and Social Development Canada, Canada

This study aims to determine whether starting unemployed Canadians' participation in Employment Assistance Services (EAS) earlier after initiating an Employment Insurance (EI) claim leads to better impacts for participants than participating later during the EI benefit period. Specifically, the main question is: do the impacts on earnings, employment and EI use vary according to early, medium and late entry into EAS-only? We investigate the dependence of the program effect on varying entry times for EAS participants. Based on unique data derived from several administrative sources in Employment and Social Development Canada (ESDC), we analyze the effects of the EAS program for EI claimants who started their participation between April 2001 and March 2003, with respect to the timing of treatment. To do so, we apply a stratified propensity score matching approach conditional on the discretized duration of unemployment until the program starts (in quarters). Program effects are estimated up to five years after the employment assistance services program has started. The results show that individuals who participated in EAS-only within the first four weeks after initiating an EI claim had the best impacts on earnings and incidence of employment, while also experiencing reduced use of EI starting in the second year post-program.
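The following is a compressed, self-contained sketch of a stratified propensity score matching estimator of the kind described, with strata defined by the quarter of unemployment duration before program entry. The simulated data, covariates and matching rule are assumptions for illustration and do not reflect the ESDC data or specification.

```python
# Illustrative sketch: stratified propensity score matching, with strata
# defined by the (discretized) quarter of unemployment duration before the
# program starts. Data, covariates and matching rule are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2016)
n = 400
quarter = rng.integers(0, 4, n)              # stratum: quarters before entry
x = rng.normal(size=(n, 2))                  # covariates (e.g., age, past earnings)
treated = rng.random(n) < 1 / (1 + np.exp(-(0.5 * x[:, 0] - 0.3 * x[:, 1])))
outcome = 2.0 * treated + x[:, 0] + rng.normal(size=n)   # earnings-type outcome

effects = []
for q in np.unique(quarter):
    idx = quarter == q
    xs, ts, ys = x[idx], treated[idx], outcome[idx]
    ps = LogisticRegression().fit(xs, ts).predict_proba(xs)[:, 1]
    # match each treated unit to the closest control on the propensity score
    diffs = []
    controls = np.where(~ts)[0]
    for i in np.where(ts)[0]:
        j = controls[np.argmin(np.abs(ps[controls] - ps[i]))]
        diffs.append(ys[i] - ys[j])
    effects.append(np.mean(diffs))
    print(f"quarter {q}: matched-pair effect estimate {np.mean(diffs):+.2f}")

print(f"average over strata: {np.mean(effects):+.2f}")
```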

Session 4A – Open Data

(E) An International Overview of Open Data Experiences
Timothy Herzog, World Bank, USA

Open Data initiatives are transforming how governments and other public institutions interact with and provide services to their constituents. They increase transparency and value to citizens, reduce inefficiencies and barriers to information, enable data-driven applications that improve public service delivery, and provide public data that can stimulate innovative business opportunities. As one of the first international organizations to adopt an open data policy, the World Bank has been providing guidance and technical expertise to developing countries that are considering or designing their own initiatives. This presentation will give an overview of developments in open data at the international level, along with current and future experiences, challenges, and opportunities. Mr. Herzog will discuss the rationales under which governments are embracing open data, the demonstrated benefits to both the public and private sectors, the range of different approaches that governments are taking, and the availability of tools for policymakers, with special emphasis on the roles and perspectives of National Statistics Offices within a government-wide initiative.

(E) Open data at Statistics Canada
Bill Joyce, Statistics Canada

Although the term Open Data is relatively new, the concept itself has a long history with roots in the data liberation movement. This presentation will trace the evolution of the principles of Open Data at Statistics Canada and will describe the agency’s current Open Data Strategy. Statistics Canada’s contribution to the wider Government of Canada Open Government environment will also be discussed. Statistics Canada plays a dual role within the wider federal government context: the agency is the main contributor of non-geospatial data to the Government of Canada Open Government Portal, and it acts as a service provider for open.canada.ca.

(E) Exploring Canada’s Open Government Portal

Ashley Casovan, Treasury Board of Canada Secretariat, Canada

Open data is becoming an increasingly important expectation of Canadians, researchers, and developers. Learn how and why the Government of Canada has centralized the distribution of all Government of Canada open data through Open.Canada.ca, and how this initiative will continue to support the consumption of statistical information.

Session 4B – Quality of Administrative Data

(E) Assimilation and Coverage of the Foreign-Born Population in Administrative Records
Renuka Bhaskar, Leticia Fernandez and Sonya Rastogi, U.S. Census Bureau, USA

The U.S. Census Bureau is researching ways to incorporate administrative data in decennial census and survey operations. Critical to this work is an understanding of the coverage of the population by administrative records. Using federal and third party administrative data linked to the American Community Survey (ACS), we evaluate the extent to which administrative records provide data on foreign-born individuals in the ACS and employ multinomial logistic regression techniques to evaluate characteristics of those who are in administrative records relative to those who are not. We find that overall, administrative records provide high coverage of foreign-born individuals in our sample for whom a match can be determined. The odds of being in administrative records are found to be tied to the processes of immigrant assimilation – naturalization, higher English proficiency, educational attainment, and full-time employment are associated with greater odds of being in administrative records. These findings suggest that as immigrants adapt and integrate into U.S. society, they are more likely to be involved in government and commercial processes and programs for which we are including data. We further explore administrative records coverage for the two largest race/ethnic groups in our sample – Hispanic and non-Hispanic single-race Asian foreign born – finding again that characteristics related to assimilation are associated with administrative records coverage for both groups. However, we observe that neighborhood context impacts Hispanics and Asians differently.

(E) When Race and Hispanic Origin Reporting are Discrepant across Administrative Records Sources: Exploring Methods to Assign Responses
Sharon R. Ennis, Sonya Rastogi and James Noon, U.S. Census Bureau, USA

The U.S. Census Bureau is researching uses of administrative records in survey and decennial operations in order to reduce costs and respondent burden while preserving data quality. One potential use of administrative records is to utilize the data when race and Hispanic origin responses are missing. When federal and third party administrative records are compiled, race and Hispanic origin responses are not always the same for an individual across different administrative records sources. We explore different sets of business rules used to assign one race and one Hispanic origin response when these responses are discrepant across sources. We also describe the characteristics of individuals with matching, non-matching, and missing race and Hispanic origin data across several demographic, household, and contextual variables. We find that minorities, especially Hispanics, are more likely to have non-matching Hispanic origin and race responses in administrative records than in the 2010 Census. Hispanics are less likely to have missing Hispanic origin data but more likely to have missing race data in administrative records. Non-Hispanic Asians and non-Hispanic Pacific Islanders are more likely to have missing race and Hispanic origin data in administrative records. Younger individuals, renters, individuals living in households with two or more people, individuals who responded to the census in the nonresponse follow-up operation, and individuals residing in urban areas are more likely to have non-matching race and Hispanic origin responses.

(F) The challenges of linking and using administrative data from different sources

Philippe Gamache, Institut national de santé publique du Québec, Canada

At the Institut national de santé publique du Québec, the Quebec Integrated Chronic Disease Surveillance System (QICDSS) has been used daily for approximately four years. The benefits of this system are numerous for measuring the extent of diseases more accurately, evaluating the use of health services properly and identifying certain groups at risk. However, in the past months, various problems have arisen that have required a great deal of careful thought. The problems have affected various areas of activity, such as data linkage, data quality, coordinating multiple users and meeting legal obligations. The purpose of this presentation is to describe the main challenges associated with using QICDSS data and to present some possible solutions. In particular, this presentation discusses the processing of five data sources that are not mainly used for chronic disease surveillance and that come from different sources. The varying quality of the data, both across files and within a given file, will also be discussed. Certain situations associated with the simultaneous use of the system by multiple users will also be examined. Examples will be given of analyses of large data sets that have caused problems. As well, a few challenges involving disclosure and the fulfillment of legal agreements will be briefly discussed.

(F) Using data linkage to evaluate address matches between census data and tax data
Julien Bérard-Chagnon and Georgina House, Statistics Canada

The purpose of Statistics Canada’s Population Estimates Program is to produce population estimates and estimates of the components of population growth. These estimates are calculated by combining census, administrative and survey data. Final estimates of internal migration are produced using tax data from the T1 Family File (T1FF), by comparing the addresses of tax filers for two consecutive years. This approach therefore rests on the assumption that tax filers’ addresses in the tax files match those in the census files. This presentation evaluates both the degree to which the addresses in the two sources match and the characteristics of tax filers associated with a lower level of comparability. To that end, data linkage is performed on data from the 2011 Census, the National Household Survey and the 2010 T1FF. The analysis shows that, even though the tax data and census data have different concepts of place of residence, the comparability of addresses in the two sources exceeds 90%. However, this rate varies, sometimes considerably, depending on tax filer characteristics. The tax addresses of filers who moved in the last year, who are in their twenties or who live in a collective dwelling are particularly less likely to match the census address.

(F) Estimating Internal Migration: Issues Related to Using Tax Data
Guylaine Dubreuil and Georgina House, Statistics Canada

The purpose of Statistics Canada’s Population Estimates Program (PEP) is to produce population estimates and estimates of the components of population growth. Internal migration, which combines interprovincial and intra-provincial migration, is one of these components. It is estimated by comparing tax filer addresses at the beginning and end of a given period. The PEP does this using Canada Child Tax Benefit (CCTB) data and T1 Family File (T1FF) data. The quality of addresses in the tax data therefore plays a crucial role in estimating internal migration. Access to more sources of tax data at Statistics Canada makes it possible to do more evaluations and to test previously untested assumptions. This presentation will look at certain evaluations of files that are currently used, sometimes in comparison with other sources and sometimes in combination with them. It will discuss issues such as the coverage of certain subpopulations that are more likely to migrate and the accuracy of addresses for which late change-of-address updates have been identified. As well, other aspects of address quality will be presented, along with solutions to a few of the issues.
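A bare-bones sketch of the address-comparison step described above is given below: a tax filer is flagged as an internal migrant when the region attached to their address changes between two consecutive years. Identifiers, field names and regions are illustrative; the actual T1FF processing involves address standardization and many edge cases.

```python
# Illustrative sketch: flag internal migrants by comparing the region of a
# tax filer's address in two consecutive years. Identifiers and regions are
# assumptions; the real T1FF processing is considerably more involved.
addresses_2013 = {"F001": "ON", "F002": "QC", "F003": "MB"}
addresses_2014 = {"F001": "ON", "F002": "SK", "F004": "BC"}

def internal_migrants(year1, year2):
    """Return filers present in both years whose region changed."""
    common = year1.keys() & year2.keys()
    return {fid: (year1[fid], year2[fid]) for fid in common if year1[fid] != year2[fid]}

print(internal_migrants(addresses_2013, addresses_2014))
# {'F002': ('QC', 'SK')} -- F003/F004 appear in only one year and are not compared
```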

Wednesday, March 23, 2016

Session 5 – Waksberg Award Winner Address

(E) Towards a Quality Framework for Blends of Designed and Organic Data

Dr. Robert Groves, Georgetown University, USA

Probability samples of near-universal frames of households and persons, administered standardized measures, yielding long multivariate data records, and analyzed with statistical procedures reflecting the design – these have been the cornerstones of the empirical social sciences for 75 years. That measurement structure has given the developed world almost all of what we know about our societies and their economies. The stored survey data form a unique historical record.

We live now in a different data world than that in which the leadership of statistical agencies and the social sciences were raised. High-dimensional data are ubiquitously being produced from Internet search activities, mobile Internet devices, social media, sensors, retail store scanners, and other devices. Some estimate that these data sources are increasing in size at the rate of 40% per year. Together their sizes swamp that of the probability-based sample surveys.

Further, the state of sample surveys in the developed world is not healthy. Falling rates of survey participation are linked with ever-inflated costs of data collection. Despite growing needs for information, the creation of new survey vehicles is hampered by strained budgets for official statistical agencies and social science funders.

These combined observations are unprecedented challenges for the basic paradigm of inference in the social and economic sciences. This paper discusses alternative ways forward at this moment in history.

Poster Session

(E) Handling survey feedback in business statistics
Jörgen Brewitz, Eva Elvers and Fredrik Jonsson, Statistics Sweden, Sweden

The Swedish business register is updated continuously from several different sources. Coordinated samples are drawn from frames built from the business register. Coordination is done over time and between different surveys using permanent random numbers. This technique has many benefits, but suffers from a drawback concerning the updating of register data. Survey feedback means information on sampled units fed back from a sample survey to a register that is used to build a frame for future surveys. It may seem natural to update a register with survey feedback in order to make it as accurate as possible; however, methodological problems may occur because of the dependence between samples. This paper explores the bias in estimators that arises from survey feedback and points out some ways to reduce it. The feedback comprises data on status, business size, sector and industry, as well as contact information. The effect of survey feedback on the sampling design, on auxiliary information for estimation and on the distribution by domains of study is examined. Industry updates in the register, and thereby in the frame, are shown to have a biasing effect on the estimators. It seems hard to adjust the estimators for the presence of survey feedback. Another approach is to implement source and time stamps in the business register. Survey feedback can then be used for contact information and for distribution by domains of study, but removed when frames are created for sampling purposes. In that way, estimation will not be disturbed by survey feedback.
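For readers unfamiliar with the coordination mechanism mentioned above, the sketch below illustrates permanent random number (PRN) sampling: each unit keeps a fixed uniform random number, and a survey selects the units whose number falls in its selection interval, so overlap between surveys and over time can be controlled. This is a generic textbook illustration, not Statistics Sweden's implementation.

```python
# Illustrative sketch of coordinated sampling with permanent random numbers
# (PRNs): each business keeps a fixed uniform(0,1) number, and a survey
# selects the units whose PRN falls in its interval. A generic illustration.
import random

random.seed(42)
register = {f"BUS{i:03d}": random.random() for i in range(1, 21)}  # unit -> PRN

def prn_sample(prns, start, fraction):
    """Select units whose PRN lies in [start, start + fraction), wrapping at 1."""
    end = (start + fraction) % 1.0
    if start < end:
        return {u for u, r in prns.items() if start <= r < end}
    return {u for u, r in prns.items() if r >= start or r < end}

survey_a = prn_sample(register, start=0.0, fraction=0.25)  # positively coordinated
survey_b = prn_sample(register, start=0.0, fraction=0.25)  # same interval: same sample
survey_c = prn_sample(register, start=0.5, fraction=0.25)  # shifted start: no overlap
print(sorted(survey_a), survey_a == survey_b, survey_a & survey_c)
```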

(E) Measuring Data Quality of Price Indexes: the Producer Prices Division’s Performance Measure Grading Scheme

Kate Burnett-Isaacs, Statistics Canada The Producer Prices Division (PPD) has developed a Performance Measure Grading Scheme to evaluate each PPD index on key performance indicators, to promote sound methodological practices and to convey the overall quality and reliability of published index numbers. This grading scheme was developed to meet the recommendations of the agency-wide Quality Assurance Review Committee and to support the divisional Performance Measurement Strategy. Its components were drawn from the OECD Generic Statistical Business Process Model and Statistics Canada’s six dimensions of quality. Assessing the quality of an index is multi-faceted because of the complexities of index numbers and calculations and the different components of index compilation. An index number is comprised of price relatives, weights and a variety of treatments applied to these data. The quality of an index must be assessed on the individual parts as well as on the whole. The grading scheme is intended to capture this and to provide a measure of quality for the entire index as well as for its individual components, from a qualitative conceptual assessment through to a quantitative processing perspective. PPD produces 25 indexes that cover a wide scope of the business sector, including goods production and manufacturing, construction, financial, transportation and professional services. These industries each have their own sources of data and standards of price measurement. The diversity of PPD’s index coverage brings with it a complexity when developing a standard method of assessing data quality. This paper will discuss the complexities of measuring data quality for indexes, explain the development of the grading scheme and the choice of performance measures, and discuss the difficulties and considerations involved in establishing a standard measure of quality that covers a broad range of industries and data sources. (E) Creation and use of large synthetic data sets in 2021 Census Transformation Programme Cal Ghee, Rob Rendell, Orlaith Fraser, Steve Rogers, Fern Leather, Keith Spicer

and Peter Youens, Office for National Statistics, United Kingdom The Census Transformation Programme of the Office for National Statistics in England and Wales is investigating the use of synthetic data for a number of purposes. One potential application is the creation of a household microdata sample. Due to disclosure concerns, we have not been able to provide a 2011 Census household microdata sample with sufficient utility for users that is accessible outside secure research environments using standard disclosure control techniques. The method being tested to create a household microdata file punches holes in a microdata sample and uses the edit and imputation process from the 2011 Census (using CANCEIS) to fill in the holes. This attempts to preserve the relationships between variables within households and individuals, but introduces sufficient uncertainty to mitigate disclosure risk. Methods are being investigated to test utility and risk in the resultant data. This poster will demonstrate the issues we have to overcome and the methods we are investigating to provide useful, non-disclosive microdata to a wider range of users. Given these issues, this is just a proof of concept at this stage, to test whether the approach is feasible.
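As a very rough sketch of the hole-punch-and-impute idea (the CANCEIS edit and imputation system is not reproduced here, and the records, variables and matching rules below are invented for illustration), one can blank a random share of values in a toy microdata sample and refill each hole from a donor record that agrees on a few matching variables:

```python
import random

# Toy household microdata: household size, tenure, age group, income band
random.seed(1)
sample = [
    {"hh_size": random.choice([1, 2, 3, 4, 5]),
     "tenure": random.choice(["own", "rent"]),
     "age_grp": random.choice(["16-34", "35-64", "65+"]),
     "income": random.choice(["low", "mid", "high"])}
    for _ in range(200)
]

def punch_holes(records, variables, rate=0.3):
    """Blank a random share of the values of the listed variables."""
    holed = [dict(r) for r in records]
    for r in holed:
        for v in variables:
            if random.random() < rate:
                r[v] = None
    return holed

def donor_impute(holed, donors, match_vars):
    """Fill each hole from a randomly chosen donor that agrees on match_vars
    (a crude stand-in for the much richer CANCEIS edit and imputation process)."""
    for r in holed:
        for v, val in r.items():
            if val is None:
                pool = [d for d in donors
                        if all(d[m] == r[m] for m in match_vars if r[m] is not None)]
                r[v] = random.choice(pool or donors)[v]
    return holed

synthetic = donor_impute(punch_holes(sample, ["income", "age_grp"]),
                         donors=sample, match_vars=["hh_size", "tenure"])
print(synthetic[:3])
```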

Software Demonstration (E/F) The use of a SAS Grid at Statistics Canada

Yves Deguire, Statistics Canada A SAS grid is a sophisticated computing platform that provides load balancing, high availability, and scalability. This presentation will demystify the SAS grid and show how it has been deployed at Statistics Canada to support a large number of users as well as to perform huge amounts of statistical processing. Several use cases will also be proposed.


(E/F) SAS® High-Performance Forecasting Software at Statistics Canada

Frédéric Picard, Statistics Canada Statistics Canada recently started to use SAS® High-Performance Forecasting (HPF). SAS® HPF is a large-scale automatic system that can evaluate and select appropriate models and rapidly generate a large number of time series forecasts. It can be used by writing SAS code or through its graphical user interface. We will give an overview of some of the system's features and useful options and a presentation of the graphical user interface. We will also briefly describe examples of projects at Statistics Canada that have already benefitted from the software.

(E) High Performance Analytics – How SAS can help you save time and make better decisions with modern analytics!

Steve Holder, SAS Canada, Canada What would you do with an extra 269 minutes? The SAS High Performance Analytics (HPA) framework helps organizations move from dated and inefficient processes to modern analytics; in one case it reduced the time needed to make critical business decisions from 4.5 hours to just 60 seconds. Grab your coffee and come meet Steve Holder, National Lead, Analytics, SAS Canada. As an analytics practitioner, you can find out how to make decisions in real time, transform your big data into relevant business value, and do this in an easy-to-use, governed way with the SAS analytics portfolio.

(E) Machine learning in the service of official statistics

Valentin Todorov, United Nations Industrial Development Organization (UNIDO), Austria Machine learning (ML) is a popular, data-intensive, computer science discipline. It is fairly generic and can be applied in various settings, but applications in official statistics have become known only recently. To shed light on this, to identify the techniques that have been explored, and to investigate the opportunities for extending the links between official statistics and machine learning in particular and data science in general, a survey across national statistical offices was recently conducted by Statistics Canada. In a paper presented at the workshop of the Modernisation Committee on Production and Methods in 2014, an overview of the machine learning techniques currently in use or under consideration at statistical agencies worldwide was given, and the main reasons why statistical agencies should start exploring the use of machine learning techniques were outlined. The best choices of software tools for the practical implementation of ML algorithms are Python and R. The purpose of this contribution is to present an update of the survey mentioned above, to map its findings to the R packages currently available in the public domain, and to sketch a possible way forward. A mini-tutorial of R packages and illustrations with several particular applications will be presented: automatic coding of item responses, outlier detection and imputation, and record linkage, all of which can reduce the manual examination of records.
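To make the automatic-coding application concrete, here is a minimal sketch (not any agency's production coder; the training phrases and code labels are invented) of a supervised text classifier that assigns a standard code to a free-text item response and reports a confidence that could be used to route doubtful cases to manual coders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training examples: free-text responses with their (hypothetical) codes.
training_text = [
    "retail sales clerk in a clothing store",
    "cashier at a grocery store",
    "software developer building web applications",
    "computer programmer, mobile apps",
    "registered nurse in a hospital ward",
    "nurse practitioner, community clinic",
]
training_codes = ["RETAIL", "RETAIL", "IT", "IT", "HEALTH", "HEALTH"]

# Bag-of-words features feeding a multinomial logistic regression classifier.
coder = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
coder.fit(training_text, training_codes)

new_responses = ["store cashier", "develops software"]
for text, code, prob in zip(new_responses,
                            coder.predict(new_responses),
                            coder.predict_proba(new_responses).max(axis=1)):
    # Low-confidence predictions would be routed to manual coders.
    print(f"{text!r} -> {code} (confidence {prob:.2f})")
```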

(E) Common Statistical Production Architecture and "confidentiality-on-the-fly"

Robert McLellan and Predrag Mizdrak, Statistics Canada

The Common Statistical Production Architecture (CSPA) is a framework for "plug and play" statistical components. This presentation will showcase the work done at Statistics Canada in the CSPA context for a confidentiality tool developed by the Australian Bureau of Statistics (ABS). “Confid-on-the-fly” is an analytical tool where results are automatically and instantly made confidential. The presenters will illustrate how they turned the R-code implementation from ABS into a set of services that allow for model exploration and generation by researchers with access to confidential microdata.


Session 6A – New Advancements in Record Linkage (E) Statistical Modeling for Errors in Record Linkage Applied to SEER Cancer Registry Data

Michael D. Larsen, The George Washington University, USA Record linkage joins together records from two or more sources. The product of record linkage is a file with one record per individual containing all the information about the individual from the multiple files. The problem is difficult when a unique identification key is not available, there are errors in some variables, some data are missing, and the files are large. Probabilistic record linkage computes a probability that records on different files pertain to the same individual. Some true links are given low probabilities of matching, whereas some non-links are given high probabilities. Errors in linkage designations can cause bias in analyses based on the composite database. The SEER cancer registries contain information on breast cancer cases in their registry areas. A diagnostic test based on the Oncotype DX assay, performed by Genomic Health, Inc. (GHI), is often performed for certain types of breast cancers. Record linkage using personally identifiable information was conducted to associate Oncotype DX assay results with SEER cancer registry information. The software Link Plus was used to generate a score describing the similarity of records and to identify the apparent best match of SEER cancer registry individuals to the GHI database. Clerical review was used to check samples of likely matches, possible matches, and unlikely matches. Models are proposed for jointly modeling the record linkage process and subsequent statistical analysis in this and other applications. (E) Sampling procedures for assessing accuracy of record linkage Paul Smith, University of Southampton, United Kingdom; Shelley Gammon, Sarah

Cummins, Christos Chatzoglou and Dick Heasman, Office for National Statistics, United Kingdom The use of administrative datasets as a data source in official statistics has become much more common as there is a drive for more outputs to be produced more efficiently. Many of these rely on linkage between two or more datasets, and this is often undertaken in a number of phases with different methods and rules. In these situations we would like to be able to assess the quality of the linkage, and this involves some re-assessment of both links and non-links. In this paper we discuss sampling approaches to obtain estimates of false negatives and false positives with reasonable control of both accuracy of estimates and cost. Approaches to stratification of links (non-links) to sample are evaluated using information from the 2011 England and Wales population census. (E) Bayesian Estimation of Bipartite Matchings for Record Linkage

Mauricio Sadinle, Duke University, USA

In this presentation we are concerned with the most traditional scenario of record linkage, which consists of linking two disparate datafiles containing overlapping information on a set of entities, where it is assumed that each entity is recorded at most once in each datafile. This is an important task with a wide variety of applications, given that it needs to be solved whenever we have to combine information from different sources. Most statistical techniques currently in use are derived from a seminal paper by Fellegi and Sunter (1969), who formalized procedures that had been used earlier by other researchers. These techniques usually assume independence in the matching status of record pairs to derive estimation procedures and point estimators (e.g., the Fellegi-Sunter decision rule). We argue that this independence assumption is unreasonable and target instead a bipartite matching between the two sets of records coming from the two files as our parameter of interest. The Bayesian implementation presented here allows us to incorporate prior information on the quality of the fields in the datafiles, which in


turn helps to obtain better results when the datafiles do not share a large amount of identifying information. Our Bayesian implementation also permits us to properly quantify linkage uncertainty and obtain point estimators under different loss functions. In particular, we propose partial Bayes estimates that allow uncertain parts of the bipartite matching to be left unresolved. We demonstrate the improvements of our approach over traditional methodologies in a number of realistic simulation studies.
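Since both talks in this session build on the Fellegi-Sunter framework, a bare-bones sketch of its scoring step may help (illustrative only, not any presenter's method; the comparison fields and the m- and u-probabilities are invented):

```python
import math

# Fellegi-Sunter style scoring: each comparison field contributes a
# log-likelihood-ratio weight.  m = P(agreement | true match),
# u = P(agreement | non-match); values below are illustrative only.
FIELDS = {
    "surname":    (0.95, 0.02),
    "birth_year": (0.90, 0.05),
    "postcode":   (0.85, 0.10),
}

def pair_weight(rec_a, rec_b):
    """Sum agreement/disagreement weights over the comparison fields."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(m / u)               # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))   # disagreement weight (negative)
    return total

a = {"surname": "TREMBLAY", "birth_year": 1972, "postcode": "J8X4H4"}
b = {"surname": "TREMBLAY", "birth_year": 1972, "postcode": "J8X4H5"}
# Pairs above an upper threshold are declared links, below a lower threshold
# non-links, and those in between are sent to clerical review.
print(f"score = {pair_weight(a, b):.2f}")
```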

Session 6B – Confidentiality (E) Finding a Needle in a Haystack: The Theoretical and Empirical Foundations of Assessing Disclosure Risk for Contextualized Microdata

Kevin T. Leicht, University of Illinois, USA In their efforts to broadly release information that has high scientific value, producers may consider releasing the attributes of geographies instead of directly identifying the locations of respondents. Informing the design and production of such data files, this study describes various factors that are of concern when evaluating disclosure risk of contextualized microdata and some of the empirical steps that are involved in their assessment. Utilizing synthetic sets of survey respondents, I illustrate how different postulates shape the assessment of risk when considering: (1) estimated probabilities that unidentified geographic areas are represented within a survey; (2) the number of people in the population who share the same personal and contextual identifiers as a respondent; and (3) the anticipated amount of coverage error in census population counts and extant files that provide identifying information (i.e., name, address). Informing the construction of anonymized research data files that contain the attributes of spatial units, I then conduct reidentification experiments for nearly 15,000 simulated datasets to assess likely patterns of disclosure risk for alternative database designs, particularly those relating to: (1) direct geographic identifiers, as determined by known division, state, and MSA-status of study locations; (2) the type of geographic entity, as determined by well-known spatial units used in the administration of surveys and governments; (3) the number of indirect geographic identifiers provided in a dataset, as determined by samples of geographic attribute sets; and (4) the coarseness of these contextual measures, as determined by global recoding schema. (E) A modern Job Submission Application to access IABs confidential administrative and survey research data Johanna Eberle, Jörg Heining, Dana Müller and David Schiller, Institute for

Employment Research, Germany The Institute for Employment Research (IAB) is the research institute of the German Employment Agency. Via the Research Data Centre (FDZ), administrative and survey data are provided in a standardized way to researchers. These microdata files provide very detailed information for researchers; accordingly, they also come with a high disclosure risk, and access can only be given in a controlled environment. There are two ways to access the IAB's confidential data: first, on-site stays in safe rooms, and second, analysis via job submission. When using job submission, researchers cannot see the actual data source. They submit program code that runs on the data and get back output files that are checked for disclosure risk. IAB provides a sophisticated Job Submission Application that allows users to interact with IAB via a specialist interface and to distinguish between interim and publication outputs. Interim output cannot be downloaded; it can only be viewed as pictures within a secured environment. Accordingly, it can be provided immediately; only a short script runs on the output to avoid disclosure issues. Publication output is sent to the researchers and therefore needs to be checked more closely for any disclosure problems. This is done by IAB experts, and it may take up to five working days before the output can be sent to the user. This solution provides the best possible usability while ensuring a high standard of data security. Researchers work on the highly detailed data files and can even jump


from on-site access to job submission, depending on the needs of the current research phase. This presentation will describe the Job Submission Application used at IAB and provide some feedback from researchers. (E) Enhancing data sharing via "safe designs"

Kristine Witkowski, University of Michigan, USA The social value of data collections is dramatically enhanced by the broad dissemination of research files and the resulting increase in scientific productivity. Currently, most studies are designed with a focus on collecting information that is analytically useful and accurate, with little forethought as to how it will be shared. Both the literature and practice also presume that disclosure analysis will take place after data collection. But to produce public-use data of the highest analytical utility for the largest user group, disclosure risk must be considered at the beginning of the research process. Drawing upon economic and statistical decision-theoretic frameworks and survey methodology research, this study seeks to enhance the scientific productivity of shared research data by describing how disclosure risk can be addressed in the earliest stages of research with the formulation of "safe designs" and "disclosure simulations", where an applied statistical approach has been taken in: (1) developing and validating models that predict the composition of survey data under different sampling designs; (2) selecting and/or developing measures and methods used in the assessments of disclosure risk, analytical utility, and disclosure survey costs that are best suited for evaluating sampling and database designs; and (3) conducting simulations to gather estimates of risk, utility, and cost for studies with a wide range of sampling and database design characteristics. (E) Privacy and Security Aspects Related to the Use of Big Data – progress of work in the European Statistical System (ESS)

Pascal Jacques, EUROSTAT, Luxembourg Data protection and privacy are key challenges that need to be tackled with high priority in order to enable the use of Big Data in the production of Official Statistics. This was emphasized in 2013 by the Directors of the National Statistical Institutes (NSIs) of the European Statistical System Committee (ESSC) in the Scheveningen Memorandum. The ESSC requested Eurostat and the NSIs to elaborate an action plan with a roadmap for following up the implementation of the Memorandum. At the Riga meeting on September 26, 2014, the ESSC endorsed the Big Data Action Plan and Roadmap 1.0 (BDAR) presented by the Eurostat Task Force on Big Data (TFBD) and agreed to integrate it into the ESS Vision 2020 portfolio. Eurostat also collaborates in this field with external partners such as the United Nations Economic Commission for Europe (UNECE). The big data project of the UNECE High-Level Group is an international project on the role of big data in the modernization of statistical production. It comprised four ‘task teams’ addressing different aspects of Big Data issues relevant for official statistics: Privacy, Partnerships, Sandbox, and Quality. The Privacy Task Team finished its work in 2014; it gave an overview of the existing tools for risk management regarding privacy issues, described how the risk of identification relates to Big Data characteristics, and drafted recommendations for National Statistical Offices (NSOs). It mainly concluded that extensions to existing frameworks, including the use of new technologies, were needed in order to deal with privacy risks related to the use of Big Data. The BDAR builds on the work achieved by the UNECE task teams. Specifically, it recognizes that a number of big data sources contain sensitive information, that their use for official statistics may induce negative perceptions with the general public and other stakeholders, and that this risk should be mitigated in the short to medium term. It proposes to launch multiple actions, such as an adequate review of the ethical principles governing the roles and activities of the NSIs and a strong communication strategy.


The paper presents the different actions undertaken within the ESS and in collaboration with UNECE, as well as potential technical and legal solutions to be put in place in order to address the data protection and privacy risks in the use of Big Data for Official Statistics. (E) Practical Applications of Secure Computation for Disclosure Control Luk Arbuckle, Children’s Hospital of Eastern Ontario Research Institute, Canada

and Khaled El Emam, Children’s Hospital of Eastern Ontario Research Institute, University of Ottawa, Canada Microdata dissemination normally requires that data reduction and modification methods be applied, and the degree to which these methods are applied depends on the control methods that will be required to access and use the data. An approach that is in some circumstances more suitable for accessing data for statistical purposes is secure computation, which involves computing analytic functions on encrypted data without the need to decrypt the underlying source data to run a statistical analysis. Considering a risk-based approach to disclosure control, secure computation can be formulated as “protected” pseudonymous data. The encryption ensures the “protection” in that no human works directly with the data: analysts see only the statistical results on key variables. Although the source data are protected by encryption for secure computation, some of the concerns that exist with remote analysis systems in general would still need to be addressed. Secure computation is well suited to scenarios in which there is a continuous, systematic collection and analysis of data, because the computations can be defined, optimized and then applied continuously. This approach also allows multiple sites to contribute data while providing strong privacy guarantees: the data can be pooled and the contributors can compute analytic functions without any party seeing the others’ inputs. We will explain how secure computation can be applied in practical contexts, with some theoretical results and real healthcare examples.
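To give a flavour of how analytic functions can be computed without any party revealing its inputs (a toy sketch only, unrelated to the presenters' system; real secure-computation deployments rely on vetted cryptographic protocols), here is secure summation by additive secret sharing:

```python
import random

PRIME = 2_147_483_647  # arithmetic is done modulo a large prime

def share(value, n_parties):
    """Split a value into n random shares that sum to the value mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

# Three data holders each hold a private count they do not want to reveal.
private_counts = {"site_A": 120, "site_B": 45, "site_C": 310}

# Each site splits its count into one share per site and distributes them.
all_shares = {site: share(v, 3) for site, v in private_counts.items()}

# Each site sums the shares it received (one from every site) and publishes
# only that partial sum; no individual count is disclosed.
partial_sums = [sum(all_shares[site][i] for site in all_shares) % PRIME
                for i in range(3)]

# The published partial sums reconstruct the overall total.
total = sum(partial_sums) % PRIME
print(total)  # 475, without any site revealing its own count
```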


Session 7A – Non-traditional Methods for Analysis of Survey Data (E) Empirical Likelihood Confidence Intervals for Finite Population Proportions

Changbao Wu, University of Waterloo, Canada Empirical likelihood (EL) ratio confidence intervals are very attractive for parameters with range restrictions such as population proportions or distribution functions. Wu and Rao (2006) studied the pseudo EL confidence intervals using complex survey data and Rao and Wu (2010) developed a Bayesian approach based on the pseudo empirical likelihood function. In this paper, we examine the performance of the pseudo EL and Bayesian EL intervals for finite population proportions using complex survey data. We also address a practically important scenario where the basic design weights and second-order inclusion probabilities are not available but instead the final adjusted or calibrated weights with suitable replication weights are provided by the data file producers. Results from simulation studies will be reported. The research is joint work with J.N.K. Rao of Carleton University. (E) Hypotheses Testing for Complex Survey Data Using Bootstrap Weights: a unified approach J.N.K. Rao, Carleton University, Canada and Jae Kwang Kim, Iowa State

University, USA Standard statistical methods that do not take proper account of the complexity of the survey design can lead to erroneous inferences when applied to survey data. In particular, the actual type I error rates of tests of hypotheses based on standard tests can be much larger than the nominal level. Methods that take account of survey design features in testing hypotheses have been proposed, including Wald tests and quasi-score tests (Rao, Scott and Skinner, 1998) that involve the estimated covariance matrices of the parameter estimates. The bootstrap method of Rao and Wu (1983) is often applied at Statistics Canada to estimate the covariance matrices, using a data file containing columns of bootstrap weights. Standard statistical packages often permit the use of survey-weighted test statistics, and it is attractive to approximate their distributions under the null hypothesis by their bootstrap analogues computed from the bootstrap weights supplied in the data file. Beaumont and Bocci (2009) applied this bootstrap method to testing hypotheses on regression parameters under a linear regression model, using weighted F statistics. In this paper, we present a unified approach to the above method by constructing bootstrap approximations to weighted likelihood ratio statistics and weighted quasi-score statistics. We report the results of a simulation study on testing independence in a two-way table of categorical survey data. We studied the relative performance of the proposed method against alternative methods, including the Rao-Scott corrected chi-squared statistic for categorical survey data.
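As a hedged illustration of how the columns of bootstrap weights supplied on a data file are typically used (a toy sketch, not the authors' unified testing procedure; the study variable and the weights are simulated rather than taken from any real survey), one recomputes the weighted estimate under every replicate weight column and uses the spread of the replicates:

```python
import random

# Toy sketch: a weighted estimate and its bootstrap variance computed from
# replicate (bootstrap) weights.  In practice the replicate weights are
# constructed by the data producer and supplied on the file; here they are
# simulated crudely only so that the example runs.
random.seed(7)
n, n_reps = 500, 200
y = [random.random() < 0.3 for _ in range(n)]           # binary study variable
w = [random.uniform(50, 150) for _ in range(n)]          # final survey weights
rep_w = [[wi * random.choice([0.0, 2.0]) for wi in w]    # crude stand-in replicates
         for _ in range(n_reps)]

def weighted_prop(weights, indicator):
    """Survey-weighted proportion."""
    return sum(wi for wi, yi in zip(weights, indicator) if yi) / sum(weights)

theta_hat = weighted_prop(w, y)                          # point estimate
theta_reps = [weighted_prop(wr, y) for wr in rep_w]      # one estimate per replicate
var_boot = sum((t - theta_hat) ** 2 for t in theta_reps) / n_reps
print(f"estimate = {theta_hat:.3f}, bootstrap standard error = {var_boot ** 0.5:.4f}")
```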

(E) An objective stepwise Bayes approach to account for missing observations Glen Meeden, University of Minnesota, USA

Given a sampling design, let πi denote the probability that unit i is selected in the sample. The standard approach to observations missing at random is to assume that for each i there is a probability, say ρi, that unit i will be observed when it is in the sample. This probability is assumed to be independent of the sampling design, so the probability that we actually observe yi in our sample is πiρi. Unfortunately, the ρi's are not known. To overcome this lack of knowledge of the ρi's, the statistician uses the observed values of x, some auxiliary variable, to construct weighting adjustment classes, with the hope that responders and non-responders in the same class are similar, that is, that the ρi's within each class are roughly constant. This approach essentially assumes that ρ is a function of x, and so we write ρ(xi) for the probability that unit i will respond given that it is in the sample. The idea is to use the pattern of responders and non-responders, as a function of x, to get estimated values of ρ(xi) for each responder in the sample. These estimated values, along with the inclusion probabilities, are then combined to define a constrained Dirichlet distribution which is used to simulate complete copies of the population. These simulated copies can be used to generate point and interval estimates of population parameters. We will give examples where the resulting point estimators do better than standard estimators and the interval estimators have good frequentist properties.
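The reweighting step described above can be illustrated with a small sketch (this shows only the weighting-adjustment-class idea, not the stepwise Bayes or constrained Dirichlet machinery; the auxiliary variable, response indicators and y-values are invented):

```python
# Toy sample: (x, responded, y if observed else None)
sample = [
    (1, True, 12.0), (1, False, None), (1, True, 10.0), (1, True, 11.5),
    (2, True, 20.0), (2, True, 22.5), (2, False, None), (2, False, None),
    (3, True, 35.0), (3, False, None), (3, True, 33.0), (3, True, 31.0),
]

# Form adjustment classes from x and estimate a response probability per class,
# assuming the response probability is roughly constant within each class.
classes = sorted({x for x, _, _ in sample})
rho_hat = {}
for c in classes:
    responded = [r for x, r, _ in sample if x == c]
    rho_hat[c] = sum(responded) / len(responded)

# Reweight responders by the inverse of the estimated response probability
# and compute an adjusted mean of y.
adjusted = [(x, y, 1.0 / rho_hat[x]) for x, r, y in sample if r]
estimate = sum(y * w for _, y, w in adjusted) / sum(w for _, _, w in adjusted)
print(rho_hat, round(estimate, 2))
```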

Session 7B – Applications of record linkage and statistical matching (E) Linking 2006 Census data to the 2011 mortality file Mohan Kumar and Rose Evra, Statistics Canada

A project to link the 2006-2011 mortality file to the 2006 Census was carried out to support analyses such as calculating the life expectancy and weighted mortality rates of populations of interest, for example immigrants and Aboriginal people. Linking the data provides key information that would otherwise be unavailable. This project is a follow-up to the linkage of the 1991 Census file with the mortality files. At that time, names and the Aboriginal identity variable were not available; the new linkage includes this information and deals with a more recent population, namely persons living in Canada in 2006. The data were linked in multiple steps, including pre-processing the variables and creating linkage blocks. Matching was performed using the deterministic hierarchical method, which involves linking common ID variables across the files in several successive matching waves. The MixMatch linkage software was used. This presentation discusses the data linkage steps, the methods used to perform the linkage and ways to assess the quality of the links.
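A toy sketch of the deterministic hierarchical (multi-wave) matching idea may help (illustrative only, not the MixMatch implementation; the records, identifiers and key hierarchy below are invented):

```python
# Records are matched on the strictest key set first; unmatched records move
# on to successively looser key sets.
file_a = [
    {"id": "A1", "sin": "111", "name": "LEE", "dob": "1980-01-02"},
    {"id": "A2", "sin": None,  "name": "MARTIN", "dob": "1975-06-30"},
    {"id": "A3", "sin": None,  "name": "NGUYEN", "dob": "1990-12-12"},
]
file_b = [
    {"id": "B1", "sin": "111", "name": "LEE", "dob": "1980-01-02"},
    {"id": "B2", "sin": "222", "name": "MARTIN", "dob": "1975-06-30"},
    {"id": "B3", "sin": "333", "name": "NGUYEN", "dob": "1990-12-11"},
]

waves = [("sin",), ("name", "dob"), ("name",)]  # strictest keys first

links, matched_a, matched_b = [], set(), set()
for keys in waves:
    # Index the still-unmatched file B records by the current key set.
    index = {tuple(r[k] for k in keys): r for r in file_b
             if r["id"] not in matched_b and all(r[k] for k in keys)}
    for r in file_a:
        if r["id"] in matched_a or not all(r[k] for k in keys):
            continue
        hit = index.get(tuple(r[k] for k in keys))
        if hit and hit["id"] not in matched_b:
            links.append((r["id"], hit["id"], "+".join(keys)))
            matched_a.add(r["id"]); matched_b.add(hit["id"])

print(links)  # [('A1', 'B1', 'sin'), ('A2', 'B2', 'name+dob'), ('A3', 'B3', 'name')]
```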

(E) Estimating the impact of active labour market programs using administrative data and matching methods Andy Handouyahia, Tony Haddad, Stéphanie Roberge and Georges Awad, Employment and Social Development Canada, Canada A review of the literature on the impacts of active labor market policy programs (ALMPs) on participants shows clearly that the Canadian experience is conspicuously missing. In this paper we present the effects of the Employment Benefit and Support Measures (EBSM) that are financed by the Government of Canada and administered by the provinces and territories. The impact estimates on the increased incidence of employment and earnings, as well as on the reduction of Employment Insurance use, are based on rich administrative data and a quasi-experimental methodology. As a case study, we use linked longitudinal administrative data covering all program participants between 2002 and 2005. Applying propensity score matching as in Blundell et al. (2004), Gerfin and Lechner (2002), and Sianesi (2004), the national net impact estimates were produced using difference-in-differences (DID) and kernel matching (KM) estimators (Heckman and Smith, 1999). We compared the KM estimates with inverse probability weighting and nearest-neighbour estimates as robustness checks against possible unobserved influences. In terms of employment and earnings, the central findings suggest overall effectiveness in the selection of recipients (Employment Insurance active claimants) into the different programs.


For both employment assistance services (EAS) and employment benefit programs such as Skills Development (SD) and Wage Subsidies (WS), we observe significant, sustained and moderate positive effects on earnings and employment. However, the cumulative effects over a six-year period are larger for SD participants than for EAS-only participants. Among all participants, participation in WS led to the largest increases in the incidence of employment.
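The matching and difference-in-differences strategy described above can be sketched in a few lines (a stylized illustration, not the EBSM evaluation itself; the propensity scores and earnings figures are invented):

```python
# (propensity score, earnings before program, earnings after program)
participants = [
    (0.62, 30_000, 36_000), (0.48, 25_000, 29_500), (0.71, 40_000, 47_000),
]
comparisons = [
    (0.60, 31_000, 33_500), (0.50, 24_000, 26_000),
    (0.70, 41_000, 44_000), (0.33, 20_000, 21_000),
]

def nearest_neighbour(score, pool):
    """Match on the propensity score: the comparison unit with the closest
    score (1-to-1 nearest-neighbour matching, with replacement)."""
    return min(pool, key=lambda c: abs(c[0] - score))

did_effects = []
for p_score, before_t, after_t in participants:
    _, before_c, after_c = nearest_neighbour(p_score, comparisons)
    # Difference-in-differences: participant change minus matched-comparison change.
    did_effects.append((after_t - before_t) - (after_c - before_c))

avg_effect = sum(did_effects) / len(did_effects)
print(f"average DID impact on earnings: {avg_effect:,.0f}")
```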

(E) An Overview of Business Record Linkage at Statistics Canada: How to link the “unlinkable” Javier Oyarzun and Laura Wile, Statistics Canada

Statistics Canada’s mandate includes producing statistical data to shed light on current business issues. The linking of business records is an important technique used in the development, production, evaluation and analysis of these statistical data. As record linkage can intrude on one’s privacy, Statistics Canada uses the technique only when the public good is clearly evident and outweighs the intrusion. Record linkage is experiencing a revival triggered by the greater use of administrative data in many statistical programs. There are many challenges to business record linkage: many administrative files do not have common identifiers, information is recorded in non-standardized formats, information contains typographical errors, administrative data files are usually large in size, and the evaluation of all possible record pairings makes exhaustive comparison impractical and sometimes impossible. Because of the importance of, and the challenges associated with, record linkage, Statistics Canada has been developing a record linkage standard to help users optimize their business record linkage process. For example, this process includes building on a record linkage blocking strategy that reduces the number of record pairs to compare and match, and creating a standard business name that will be available on Statistics Canada’s Business Register. This presentation will give an overview of the business record linkage methodology and will look at the various economic projects which use record linkage at Statistics Canada, including projects in the National Accounts, International Trade, Agriculture and the Business Register. (E) Linking Canadian Patent records from the U.S. Patent office to Statistics Canada’s Business Register, 2000 to 2011

Paul Holness, Statistics Canada This paper describes the Quick Match System (QMS), an in-house application designed to match business microdata records, and the methods used to link the United States Patent and Trademark Office (USPTO) dataset to Statistics Canada’s Business Register (BR) for the period from 2000 to 2011. The paper illustrates the record-linkage framework and outlines the techniques used to prepare and classify each record and evaluate the match results. The USPTO dataset consisted of 41,619 U.S. patents granted to 14,162 distinct Canadian entities. The record-linkage process matched the names, city, province and postal codes of the patent assignees in the USPTO dataset with those of businesses in the January editions of the Generic Survey Universe File (GSUF) from the BR for the same reference period. As the vast majority of individual patent assignees are not engaged in commercial activity to provide taxable property or services, they tend not to appear in the BR. The relatively poor match rate of 24.5% among individuals, compared to 84.7% among institutions, reflects this tendency. Although the 8,844 individual patent assignees outnumbered the 5,318 institutions, the institutions accounted for 73.0% of the patents, compared to 27.0% held by individuals. Consequently, this study and its conclusions focus primarily on institutional patent assignees. The linkage of the USPTO institutions to the BR is significant because it provides access to business micro-level data on firm characteristics, employment, revenue, assets and liabilities. In addition, the retrieval of robust administrative identifiers enables subsequent linkage to other survey and administrative data sources. The


integrated dataset will support direct and comparative analytical studies on the performance of Canadian institutions that obtained patents in the United States between 2000 and 2011. (E) Measuring the Quality of a Probabilistic Linkage through Clerical Reviews Abel Dasylva, Melanie Abeysundera, Blache Akpoué, Mohammed Haddou and

Abdelnasser Saïdi, Statistics Canada Probabilistic linkage is susceptible to linkage errors such as missed links and false links. In many cases, these errors may be reliably measured through clerical reviews, i.e., the visual inspection of a sample of record pairs to determine whether they are matched. A framework is described to carry out such clerical reviews effectively. It is based on a probabilistic sample of pairs, repeated independent reviews of some pairs, and latent class analysis to account for clerical errors.
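As a very small sketch of how a probabilistic sample of pairs can feed a quality measure (illustrative only; it ignores the repeated reviews and latent class modelling described above, and all counts are invented), linked pairs can be stratified by linkage score, a sample from each stratum clerically reviewed, and the stratum results combined:

```python
# (stratum label, number of linked pairs, pairs reviewed, false links found)
strata = [
    ("high score",   50_000, 200, 1),
    ("medium score", 20_000, 200, 6),
    ("low score",     5_000, 200, 30),
]

# Weight each stratum's observed false-link proportion by its size in the
# full set of links to estimate the overall false-link rate.
total_pairs = sum(n for _, n, _, _ in strata)
false_link_rate = sum(n * (bad / reviewed)
                      for _, n, reviewed, bad in strata) / total_pairs
print(f"estimated false-link rate: {false_link_rate:.3%}")
```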


Session 8A – Paradata (E) On the Utility of Paradata in Major National Surveys: Challenges and Benefits Brady West, University of Michigan, USA and Frauke Kreuter, University of

Maryland, USA This presentation will begin with Dr. West providing a summary of research that has been conducted on the quality and utility of paradata collected as part of the United States National Survey of Family Growth (NSFG). The NSFG is the major national fertility survey in the U.S., and an important source of data on sexual activity, sexual behavior, and reproductive health for policy makers. For many years, the NSFG has been collecting various forms of paradata, including keystroke information (e.g., Couper and Kreuter 2013), call record information, detailed case disposition information, and interviewer observations related to key


NSFG measures (e.g., West 2013). Dr. West will discuss some of the challenges of working with these data, in addition to evidence of their utility for nonresponse adjustment, interviewer evaluation, and/or responsive survey design purposes. Dr. Kreuter will then present research done using paradata collected as part of two panel surveys: the Medical Expenditure Panel Survey (MEPS) in the United States, and the Panel Labour Market and Social Security (PASS) in Germany. In both surveys, information from contacts in prior waves was experimentally used to improve contact and response rates in subsequent waves. In addition, research from PASS will be presented in which interviewer observations on key outcome variables were collected for use in nonresponse adjustment or responsive survey design decisions. Dr. Kreuter will present not only the research results but also the practical challenges in implementing the collection and use of both sets of paradata. (E) A Bayesian analysis of survey design parameters

Barry Schouten, Joep Burger, Lisette Bruin and Nini Mushkudiani, Statistics Netherlands, Netherlands In the design of surveys, a number of parameters such as contact propensities, participation propensities and costs per sample unit play a decisive role. In ongoing surveys, these survey design parameters are usually estimated from previous experience and updated gradually with new experience. In new surveys, these parameters are estimated from expert opinion and experience with similar surveys. Although survey institutes have a fair amount of expertise and experience, the postulation, estimation and updating of survey design parameters is rarely done in a systematic way. This paper presents a Bayesian framework to include and update prior knowledge and expert opinion about the parameters. This framework is set in the context of adaptive survey designs, in which different population units may receive different treatments given quality and cost objectives. For this type of survey, the accuracy of the design parameters becomes even more crucial to effective design decisions. The framework allows for a Bayesian analysis of the performance of a survey during data collection and in between waves of a survey. We demonstrate the Bayesian analysis using a realistic simulation study. (E) Statistics Canada’s Experiences in Using Paradata to Manage Responsive Collection Design CATI household surveys François Laflamme, Sylvain Hamel and Dominique Chabot-Hallé, Statistics

Canada

Over the past decade, paradata research has focused on identifying strategic data collection improvement opportunities that could be operationally viable and lead to improvements in quality or cost efficiency. To that end, Statistics Canada has developed and implemented a Responsive Collection Design (RCD) strategy for Computer-Assisted Telephone Interview (CATI) household surveys to maximize quality and potentially reduce costs. RCD is an adaptive approach to survey data collection that uses information available prior to and during data collection to adjust the collection strategy for the remaining in-progress cases. In practice, the survey managers monitor and analyze collection progress against a pre-determined set of indicators for two purposes: to identify critical data collection milestones that require significant changes to the collection approach, and to adjust collection strategies to make the most efficient use of the remaining available resources. In the RCD context, numerous considerations come into play when determining which aspects of data collection to adjust and how to adjust them. Paradata sources play a key role in the planning, development and implementation of active management for RCD surveys. Since 2009, Statistics Canada has conducted several RCD surveys. This paper describes Statistics Canada’s experiences in implementing and, especially, in monitoring these surveys. In particular, it presents the plans, tools and strategies used to actively manage the RCD surveys and how these strategies evolved and improved over time.
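To make the idea of updating a survey design parameter with paradata concrete, here is a minimal sketch (the Beta prior for a contact propensity and the call-record counts are invented; this is neither presentation's actual framework or data):

```python
# Conjugate Beta-Binomial update of a contact propensity from call records.
prior_a, prior_b = 8.0, 12.0          # expert prior: contact propensity around 0.40

# Paradata observed so far in the current collection period:
attempts, contacts = 250, 115

# Posterior parameters after observing the call outcomes.
post_a = prior_a + contacts
post_b = prior_b + (attempts - contacts)
post_mean = post_a / (post_a + post_b)
post_var = (post_a * post_b) / ((post_a + post_b) ** 2 * (post_a + post_b + 1))

print(f"prior mean propensity: {prior_a / (prior_a + prior_b):.2f}")
print(f"posterior mean: {post_mean:.2f} (s.d. {post_var ** 0.5:.3f})")
```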


Session 8B – Use of Administrative Data

(E) Redesign of the longitudinal immigration database (IMDB)

Rose Evra, Statistics Canada

The Longitudinal Immigration Database (IMDB) combines Citizenship and Immigration Canada's (CIC) Immigrant Landing File (ILF) with annual T1 Family Files (T1FF). This record linkage is performed using a tax filer database, the Linkage Control File (LCF), which covers all tax filers in Canada since 1981. The ILF includes all immigrants who have landed in Canada since 1980. To enhance the IMDB, the possibility of adding temporary residents (TR) and immigrants who landed between 1952 and 1979 (PRE80) was studied. CIC provided Statistics Canada with TR and PRE80 files similar to the ILF. Adding this information would give a more complete picture of the foreign-born population living in Canada: the entire migration history of TRs who become permanent residents would be available, and it would be possible to compare the socioeconomic integration of immigrants with pre-landing experience to that of immigrants without any pre-landing experience. To integrate the TR and PRE80 files into the IMDB, record linkages between these two files and the LCF were performed. In addition, there is an administrative link between the TR and ILF files already made by CIC. This exercise was challenging, in part because of duplicates in the files and conflicting links between the different record linkages. The presentation will go over the process used and the challenges faced in enhancing the IMDB with these two new files.
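The abstract mentions duplicates and conflicting links between the record linkages, but not the resolution rules that were applied. One simple, commonly used rule is to keep, for each source record, only the candidate link with the highest linkage weight; the sketch below shows that rule with invented identifiers and weights and is not a description of the IMDB's actual procedure.

```python
# Sketch of one simple way to resolve conflicting candidate links:
# keep the highest-weight link for each source record. Identifiers and
# weights are invented; the IMDB's actual rules are not reproduced here.

candidate_links = [
    # (source_id, target_id, linkage_weight)
    ("TR001", "LCF_9001", 18.2),
    ("TR001", "LCF_9175", 11.5),   # conflicting link for the same source record
    ("TR002", "LCF_8044", 20.1),
    ("TR003", "LCF_8044", 9.8),    # two sources pointing at the same target
]

best_by_source = {}
for src, tgt, weight in candidate_links:
    if src not in best_by_source or weight > best_by_source[src][1]:
        best_by_source[src] = (tgt, weight)

for src, (tgt, weight) in sorted(best_by_source.items()):
    print(f"{src} -> {tgt} (weight {weight})")
```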

(F) Creating a longitudinal database based on linked administrative registers: An example

Philippe Wanner, Université de Genève and NCCR On The Move, Switzerland, and Ilka Steiner, Université de Genève, Switzerland

In the last few decades, international migration has played a key role in the population growth of numerous industrialized countries and has sparked numerous debates on the economic and social integration of migrants. As a result, new data are essential for documenting migration. Against this background, this presentation describes the creation of a demographic database in Switzerland based on the harmonized register of residents, the registers of foreign nationals, the Structural Survey (which replaces the traditional census) and other social insurance registers. The purpose of this work, performed as part of a research project on migration (NCCR On the Move) in co-operation with the Federal Statistical Office, is to longitudinally track foreign nationals, from arrival to departure, while identifying changes in occupational, economic or demographic status during that period. Currently, 15 years of tracking are available for nearly 4 million foreign nationals who lived in Switzerland between 1998 and 2013. For these 4 million foreign nationals, linking the various registers provides information on socioeconomic status, residential mobility and naturalization-related actions. The linkage methods used, the results obtained and the research contributions are described. In particular, we show the value of linking administrative registers for describing social phenomena with longitudinal approaches based on individual data. We end with a discussion of the benefits and limitations of registers for social statistics.

(E) Use of admin data to increase the efficiency of the sample design of the new National Travel Survey

Charles Choi, Statistics Canada


As part of the redesign of the Tourism Statistics Program, Statistics Canada is developing the National Travel Survey (NTS) to collect travel information from Canadian travellers. The new survey will replace the Travel Survey of Residents of Canada and the Canadian Residents component of the International Travel Survey. The NTS will take advantage of Statistics Canada's common sampling frames and common processing tools while maximizing the use of administrative data. This presentation will discuss the potential uses of administrative data, such as Passport Canada files, Canada Border Services Agency files and Canada Revenue Agency files, to increase the efficiency of the NTS sample design.

(E) Using Administrative Data to Study Education in Canada

Martin Pantel, Statistics Canada

The Educational Master File (EMF) is a system built to allow the analysis of educational programs in Canada. At the core of the system are administrative files that record all registrations in post-secondary and apprenticeship programs in Canada. New administrative files become available on an annual basis. Once a new file becomes available, a first round of processing and cleanup is performed; this processing includes a linkage to other administrative tax records managed by Statistics Canada. The information obtained from this linkage lets us further improve the quality of the file, link to other data describing labour market outcomes, and take the first step in adding the file to the EMF. Once the file is part of the EMF, its information can be included in cross-sectional and longitudinal projects: analysts can study the pathways that form an academic career and the labour market outcomes after graduation. The EMF currently consists of data from 2005 to 2013, but it evolves as new data become available. This presentation will give an overview of the mechanisms used to build the EMF, but will focus on the structure of the final system and describe some of its analytical potential.


Session 9A – Scanner Data

(F) Challenges Associated with Using Scanner Data for the Consumer Price Index

Catherine Deshaies-Moreault and Nelson Émond, Statistics Canada

Practically all major retailers use optical scanners to record the details of their transactions with clients (consumers). These data normally include the product code, a brief description, the price and the quantity sold. This is a highly relevant data source for statistical programs such as Statistics Canada's Consumer Price Index (CPI), one of Canada's most important economic indicators. Using scanner data could improve the quality of the CPI by increasing the number of prices used in calculations, expanding geographic coverage and including the quantities sold, among other things, while lowering data collection costs. However, using these data presents many challenges. An examination of scanner data from one retailer revealed a high rate of change in product identification codes over a one-year period. These changes pose challenges from the perspectives of product classification and estimate quality. This article focuses on the issues associated with acquiring, classifying and exploring these data to assess their quality for use in the CPI.
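The abstract points to a high rate of change in product identification codes over a year but does not give a formula; one simple way to quantify such churn is the share of codes observed in the first period that no longer appear in the second, as sketched below with invented codes.

```python
# Illustrative measure of product-code churn between two periods: the share
# of codes present in period 1 that are no longer observed in period 2.
# The codes below are invented for the example.

codes_2014 = {"0123", "0456", "0789", "1111", "2222"}
codes_2015 = {"0456", "1111", "3333", "4444"}

disappeared = codes_2014 - codes_2015
new_codes = codes_2015 - codes_2014
churn_rate = len(disappeared) / len(codes_2014)

print(f"codes dropped: {sorted(disappeared)}, codes added: {sorted(new_codes)}")
print(f"churn rate: {churn_rate:.0%}")
```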

(E) Product homogeneity and weighting when using scanner data for price index calculation

Antonio G. Chessa, Statistics Netherlands, Netherlands

The availability in scanner data of both prices and quantities sold at the article (Global Trade Item Number, GTIN) level opens up possibilities for compiling more accurate price indices. A number of critical questions need to be answered to this end: what are the individual consumer products, and how should these be weighted in price index calculations? The first problem may be complicated by the "relaunch" phenomenon, in which the bar code changes while the consumable part of an article remains the same. Broader product definitions are needed, so that GTINs are combined in a way that captures the associated price increases. This can be achieved by creating GTIN groups ("products") in which GTINs share the same article characteristics. Characteristics can be selected by applying statistical model selection methods (information criteria). These methods are also used in this paper to compare different weighting schemes. Several examples illustrate the sensitivity of price indices to different selections of article characteristics when defining products. The results also show that GTINs treated as homogeneous products are not suited for clothing and drugstore articles. Price indices also turn out to be sensitive to the type of weighting scheme. From a statistical perspective, weighting products by their share in turnover or quantities sold is clearly superior to equal weighting. Price indices calculated from scanner data show that the traditional use of the Jevons index for elementary aggregates should be abandoned. The results also imply that weights from an additional source are needed when using prices collected by web scraping.
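The abstract contrasts the unweighted Jevons index with turnover-weighted alternatives. As a toy illustration, the sketch below compares an unweighted Jevons index (geometric mean of price relatives) with a geometric index weighted by base-period turnover shares; the prices and quantities are invented, and this weighted form is only one of several schemes the paper could be comparing.

```python
import math

# Toy comparison of an unweighted Jevons index with a turnover-weighted
# geometric index for one product group. Prices and quantities are invented.

items = [
    # (price_period_0, price_period_1, quantity_period_0)
    (2.00, 2.20, 100),
    (3.50, 3.50, 10),
    (1.20, 1.50, 400),
]

relatives = [p1 / p0 for p0, p1, _ in items]

# Jevons: unweighted geometric mean of the price relatives.
jevons = math.prod(relatives) ** (1 / len(relatives))

# Weighted geometric mean, using base-period turnover shares as weights.
turnover = [p0 * q for p0, _, q in items]
shares = [t / sum(turnover) for t in turnover]
weighted = math.exp(sum(w * math.log(r) for w, r in zip(shares, relatives)))

print(f"Jevons index: {jevons:.4f}")
print(f"Turnover-weighted geometric index: {weighted:.4f}")
```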

(E) A look into the future – Scanner data and big data

Muhanad Sammar, Statistics Sweden, Sweden

The fact that the world is in continuous change and that new technologies are becoming widely available creates new opportunities and challenges for National Statistical Institutes (NSIs) worldwide. What if NSIs could access vast amounts of sophisticated data for free (or at low cost) from enterprises? Could this allow NSIs to disseminate more accurate indicators for policy-makers and users, significantly reduce the response burden on companies, reduce costs for the NSIs and, in the long run, improve the living standards of the people in a country?


The time has now come for NSIs to find best practices for aligning legislation, regulations and practices relating to scanner data and big data. Without common ground, consensus is unlikely to be reached. The discussions need to start with how to define quality: if NSIs define and approach quality differently, the result will be a highly undesirable situation in which NSIs move further away from harmonisation. Sweden was one of the leading countries that put these issues on the agenda for European cooperation; in 2012, Sweden implemented scanner data in the national Consumer Price Index after research studies and statistical analyses showed that scanner data were significantly better than the manually collected data. The author has extensive experience of working with scanner data and will gladly share his experiences in this field. Let us inspire each other, let us be inspired by each other!

Session 9B – Health Data

(E) Comparing Canada's Healthcare System: Benefits and Challenges

Katerina Gapanenko, Grace Cheung, Deborah Schwartz and Mark McPherson, Canadian Institute for Health Information (CIHI), Canada

Background: There is increasing interest in measuring and benchmarking health system performance. We compared Canada's health system with those of other countries in the Organisation for Economic Co-operation and Development (OECD), at both the national and provincial levels, across 50 indicators of health system performance. This analysis can help provinces identify potential areas for improvement and select appropriate comparators for international comparisons.

Methods: OECD Health Data from 2013 was used to compare Canada’s results internationally. We also calculated provincial results for OECD’s indicators on health system performance, using OECD methodology. We normalized the indicator results to present multiple indicators on the same scale and compared them to the OECD average, 25th and 75th percentiles.

Results: Presenting normalized values allows Canada's results to be compared across multiple OECD indicators on the same scale. No country or province consistently outperforms the others. For most indicators, Canadian results are similar to those of other countries, but there remain areas where Canada performs particularly well (e.g., smoking rates) or poorly (e.g., patient safety). The data were presented in an interactive eTool.

Conclusion: Comparing Canada’s provinces internationally can highlight areas where improvement is needed, and help to identify potential strategies for improvement.
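The Methods paragraph above describes normalizing indicator results to a common scale and comparing them with the OECD average and the 25th and 75th percentiles, without naming the transformation; a z-score against the OECD distribution is one standard choice and is sketched below with invented values.

```python
from statistics import mean, stdev, quantiles

# Illustrative normalization of one indicator across countries: z-scores
# against the OECD distribution plus quartile bands. All values are invented.

indicator = {"CAN": 9.6, "FRA": 11.1, "DEU": 10.4, "JPN": 12.3,
             "USA": 8.1, "SWE": 11.8, "AUS": 10.9}

values = list(indicator.values())
avg, sd = mean(values), stdev(values)
q1, _, q3 = quantiles(values, n=4)  # 25th, 50th and 75th percentiles

for country, v in sorted(indicator.items()):
    z = (v - avg) / sd
    band = "below 25th" if v < q1 else "above 75th" if v > q3 else "middle"
    print(f"{country}: value={v}, z={z:+.2f}, {band}")
```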

(E) A systematic review: Evaluating extant data sources for potential linkage

Erin Tanenbaum, NORC at the University of Chicago, USA; Michael Sinclair, Mathematica Policy Research, USA; Jennifer Hasche, NORC at the University of Chicago, USA; and Christina Park, National Institute of Child Health and Human Development (NICHD), USA

The National Children's Vanguard Study was a pilot for a large-scale epidemiological cohort study of children and their parents, with measures to be taken from pre-pregnancy until the children reached adulthood. Plans called for the use of extant data (e.g., administrative data, registry data, or other external sources) to supplement the survey data. These data were intended to reduce respondent burden and to supply information that respondents could not be expected to know. A researcher should investigate an extant data source in order to make informed judgments about its use.


This paper outlines a strategy for cataloging and evaluating potential extant data sources. Evaluating extant data may require hands-on experience with the proposed source, and such evaluations can be costly. We therefore present a tiered system for collecting metadata on extant sources, so that sources can first be prioritized into a more manageable subset for in-depth investigation. We build a set of criteria from a literature review and assign criteria to tiers according to the expected availability of the information and the difficulty of collecting it.

Through our review, we selected five evaluation factors to guide researchers through available data sources: 1) relevance, 2) timeliness, 3) spatial, 4) accessibility, and 5) accuracy. We use these evaluation factors to build a metadata library that helps researchers quickly identify data sources that could assist in answering a specific research question.
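The abstract lists the five factors but not how they are scored or combined; one minimal way to operationalize the tiered screen is to score each candidate source on the factors that are cheapest to assess and carry only the highest-scoring sources into in-depth review. The sources, scores, factor subset and threshold below are all hypothetical.

```python
# Hypothetical first-tier screen: score candidate extant sources on factors
# that are cheap to assess, and keep top scorers for in-depth (tier 2) review.

TIER1_FACTORS = ("relevance", "timeliness", "accessibility")  # assumed cheap to assess

sources = {
    "birth registry":        {"relevance": 3, "timeliness": 2, "accessibility": 3},
    "hospital discharge":    {"relevance": 3, "timeliness": 3, "accessibility": 1},
    "commercial panel data": {"relevance": 1, "timeliness": 3, "accessibility": 2},
}

THRESHOLD = 7  # illustrative cut-off for advancing to in-depth review
for name, scores in sources.items():
    total = sum(scores[f] for f in TIER1_FACTORS)
    decision = "advance to in-depth review" if total >= THRESHOLD else "hold"
    print(f"{name}: tier-1 score {total} -> {decision}")
```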

(E) Providing Meaningful and Actionable Health System Performance Information via a Unique Pan-Canadian Interactive Web Tool

Jeanie Lacroix and Kristine Cooper, Canadian Institute for Health Information (CIHI), Canada

How can we bring together multidimensional health system performance data in a simplified way that is easy to access and organized for actionability? The Canadian Institute for Health Information has developed a suite of Your Health System (YHS) web tools to meet performance measurement needs across different audiences, identify improvement priorities, show how regions and facilities compare with their peers, and support transparency and accountability. The pan-Canadian tools consolidate reporting of 45 key performance indicators in a structured way; the indicators are comparable over time and at different geographic levels, including the national, provincial, regional and facility levels. Through simple visuals, such as a 3x3 matrix, users can assess their health system's overall performance and use this information to help prioritize areas for improvement. Multidimensional data are incorporated into the matrix methodology through relative performance assessment against peer comparators and through performance over time, where significant change is flagged. Geospatial visuals with interactive maps also allow users to view performance across an area and drill down to different geographic layers. Examples of achievable top results are highlighted using a methodology that identifies true and consistent top performance, is easy to understand and enables benchmarking. The YHS tools aim to simplify large quantities of information from different health sectors into a usable and relevant decision-making tool for health system managers and the general public. This presentation will explore the methodological approaches and considerations taken to create a dynamic tool that enables benchmarking and comparisons in a sustainable way for health system performance improvement.
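CIHI's exact matrix rules are not given in the abstract; the general idea of crossing relative position against peers with direction of change over time can be sketched as below, with invented indicator values and arbitrary cut-offs standing in for the real methodology.

```python
# Illustrative 3x3 classification of one indicator: position relative to a
# peer group crossed with direction of change over time. Values, quartile
# approximation and cut-offs are invented; this is not CIHI's methodology.
# Assumes higher values of the indicator are better.

def classify(value, peer_values, previous_value, change_cutoff=0.05):
    peers = sorted(peer_values)
    q1 = peers[len(peers) // 4]          # rough 25th percentile
    q3 = peers[(3 * len(peers)) // 4]    # rough 75th percentile
    position = ("below peers" if value < q1
                else "above peers" if value > q3 else "similar to peers")
    rel_change = (value - previous_value) / previous_value
    trend = ("improving" if rel_change > change_cutoff
             else "declining" if rel_change < -change_cutoff else "stable")
    return position, trend

pos, trend = classify(value=0.82,
                      peer_values=[0.70, 0.75, 0.78, 0.80, 0.84, 0.88, 0.90, 0.91],
                      previous_value=0.74)
print(f"matrix cell: ({pos}, {trend})")
```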

(E) Epidemiological observatory on Brazilian health data

Raphael de Freitas Saldanha and Ronaldo Rocha Bastos, Universidade Federal de Juiz de Fora, Brazil

The Unified Brazilian Health System (SUS) was created in 1988. With the aim of organizing the sundry health information systems and databases already in use, a unified databank (DataSUS) was created in 1991 to centralize the control and management of those systems. DataSUS files are freely and openly available over the Internet to the general population. Access to and visualization of these data are currently done through customized tables and simple diagrams, which do not entirely meet the needs of health managers, researchers and the population at large for a flexible and easy-to-use tool that can address the different aspects of health relevant to their knowledge-seeking and decision-making.


The current project proposes the interactive monthly generation of synthetic epidemiological reports that are not only easily accessible but also easy to interpret and understand. Emphasis is put on data visualization through more informative diagrams and maps. The monthly dbc files of raw health data released by DataSUS are read using R, the free software environment for statistical computing and graphics, totaling tens of thousands of records for a one-year period. Database management strategies are used to minimize hardware needs. Standardized and easily updated reports are generated with dynamic algorithms implemented in the Markdown markup language. Access to the project is free through a clear and user-friendly website (episus.org), which is currently under development. The available prototype features reports on themes such as hospitalizations for primary-care-sensitive conditions and traffic-related injuries for the 5,570 municipalities and other relevant administrative regions in Brazil.
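The project itself reads the DataSUS files with R and writes its report templates in Markdown; purely to illustrate the idea of a standardized, automatically regenerated report, the sketch below fills a small Markdown table from aggregated counts. The municipality names and counts are invented.

```python
# Illustration only: the described system uses R and Markdown, but the core
# idea of regenerating a standardized report from fresh monthly counts can be
# shown in a few lines. Municipality names and counts are invented.

monthly_hospitalizations = {"Juiz de Fora": 1240, "Ouro Preto": 310, "Mariana": 190}

lines = ["# Monthly hospitalization report", "",
         "| Municipality | Hospitalizations |", "| --- | --- |"]
for city, count in sorted(monthly_hospitalizations.items()):
    lines.append(f"| {city} | {count} |")

report_md = "\n".join(lines)
print(report_md)  # the real system would render such a report for the web site
```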

(E) Data surveillance on the clinical data used for health system funding in Ontario

Lori Kirby and Maureen Kelly, Canadian Institute for Health Information (CIHI), Canada

Several Canadian jurisdictions, including Ontario, are using patient-based healthcare data in their funding models. These initiatives can influence the quality of the data both positively and negatively, as people tend to pay more attention to data and their quality when financial decisions are based upon them. Ontario's funding formula uses data from several national databases housed at the Canadian Institute for Health Information (CIHI). These databases provide information on patient activity and clinical status across the continuum of care. Because funding models may influence coding behaviour, CIHI is collaborating with the Ontario Ministry of Health and Long-Term Care to assess and monitor the quality of the data. CIHI is using data mining software and modelling techniques (of the kind often associated with "big data") to identify data anomalies across multiple factors. The models identify the typical clinical coding patterns for key patient groups (for example, patients seen in special care units or discharged to home care), so that outliers, cases that do not fit the expected pattern, can be identified. A key component of the modelling is segmenting the data by patient, provider and hospital characteristics to take into account key differences in the delivery of health care and in patient populations across the province. CIHI's analysis identified several hospitals whose coding practices appear to be changing or to differ significantly from those of their peer group. Further investigation is required to understand why these differences exist and to develop appropriate strategies to mitigate variations.
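The abstract does not detail the models; a bare-bones version of segment-based screening is to compare each hospital's coding rate for a patient group with the distribution among its peers and flag large deviations, as sketched below with invented hospitals and rates.

```python
from statistics import mean, stdev

# Bare-bones sketch of segment-based screening: flag hospitals whose coding
# rate for a patient group sits far from their peer-group distribution.
# Hospital names, rates and the z-score cut-off are all invented.

segment_rates = {  # share of cases coded to a special care unit, by hospital
    "Hospital A": 0.12, "Hospital B": 0.10, "Hospital C": 0.11,
    "Hospital D": 0.13, "Hospital E": 0.24,
}

for hospital, rate in segment_rates.items():
    peers = [r for name, r in segment_rates.items() if name != hospital]
    z = (rate - mean(peers)) / stdev(peers)
    if abs(z) > 2:  # illustrative cut-off
        print(f"{hospital}: rate {rate:.2f} (z = {z:+.1f}) flagged for review")
```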

Session 10 – Plenary Session

(E) Data Science for Dynamic Data Systems: Implications for Official Statistics

Mary E. Thompson, University of Waterloo, Canada

Many of the challenges and opportunities of modern data science have to do with dynamic aspects: evolving populations, the growing volume of administrative and commercial data on individuals and establishments, continuous flows of data and the capacity to analyze and summarize them in real time, and the deterioration of data absent the resources to maintain them. With its emphasis on data quality and supportable results, the domain of official statistics is ideal for highlighting statistical and data science issues in a variety of contexts. Examples in the talk will illustrate the importance of population frames and their maintenance; the use of multi-frame methods and linkages in estimating the extent of a population; the use of large-scale non-survey data as auxiliary information for estimate and model calibration; issues for researchers, service providers and policy makers around the timing of data processing, integration and release; recursive methods and regularization for forecasting high-dimensional time series; and the benefits and limitations of sophisticated data visualization tools in capturing change.


Final Thoughts

Our warmest thanks for attending this symposium. We hope that you have found it valuable and beneficial. Please complete the symposium evaluation form before leaving and drop it into one of the specially marked boxes. An electronic evaluation form will be sent to Statistics Canada employees in the coming days. Proceedings from Symposium 2016 should be available in January 2017.

The 2016 Symposium Organizing Committee

Joseph Duggan, Chair