
European Conference on Quality in Official Statistics, Rome, July 2008

Community Innovation Survey: a Flexible Approach to the Dissemination of Microdata Files for Research

Daniela Ichim


Outline

• Dissemination of Microdata Files for Research
• Risk assessment
• Disclosure limitation
• Data quality
– Record linkage
– Data utility


Confidentiality against Dissemination

Find the right balance!

Disclosure scenarios


Community Innovation Survey

• IDENTIFYING VARIABLES (STRUCTURAL VARIABLES)
– Nace
– Nuts
– Size
– Turnover (TURN)
• CONFIDENTIAL VARIABLES (VARIABLES INVOLVED IN ANALYSES)
– Expenditures in innovation (RTOT, …)
– Number of patents, …


Confounding

[Figure: categorical and numerical variables, $t = (t^c, t^n)$; k-anonymity (A, A, …, A); units marked safe or unsafe]


General risk function

a) Given a threshold $M^*$ (on units)
b) Local Outlier Factor as a measure of difference in density between a unit and its nearest neighbours

Distance between $t_1$ and $t_2$: $d(t_1, t_2)$, defined from $d_e(t_1^n, t_2^n)$ and $I(t_1^c, t_2^c) \in \{0, 1\}$

Density around $t$: $LRD_{M^*}(t)$

$$LOF_{M^*}(t) = \frac{1}{|N_{M^*}(t)|} \sum_{t' \in N_{M^*}(t)} \frac{LRD_{M^*}(t')}{LRD_{M^*}(t)}$$
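As an illustration of the density-based risk measure, the following Python sketch computes a local reachability density and a Local Outlier Factor for each unit, with the threshold $M^*$ used as the neighbourhood size. It is a sketch under stated assumptions, not the implementation used for the CIS data: records are purely numerical, the distance is Euclidean, exactly $M^*$ neighbours are taken (ties broken by index), and there are no duplicate records. All function names are hypothetical.

```python
import math

def euclid(a, b):
    """Euclidean distance between two numerical records (assumed distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(points, i, m):
    """Indices of the m nearest neighbours of points[i], ties broken by index."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: (euclid(points[i], points[j]), j))
    return others[:m]

def lof_scores(points, m):
    """Local Outlier Factor of every point, with neighbourhood size m."""
    n = len(points)
    neigh = [nearest(points, i, m) for i in range(n)]
    # Distance from each point to its m-th nearest neighbour.
    kdist = [euclid(points[i], points[neigh[i][-1]]) for i in range(n)]
    lrd = []
    for i in range(n):
        # Reachability distance of i from each neighbour j.
        reach = [max(kdist[j], euclid(points[i], points[j])) for j in neigh[i]]
        lrd.append(len(reach) / sum(reach))  # assumes no duplicate records
    # Average ratio of the neighbours' density to the unit's own density.
    return [sum(lrd[j] for j in neigh[i]) / (len(neigh[i]) * lrd[i])
            for i in range(n)]

# Four units in a tight cluster and one isolated unit.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = lof_scores(points, m=2)
```

Units in a homogeneous region get a score close to 1, while an isolated unit gets a much larger score, so a cut-off on the score separates units at risk from the others.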


Parameters

• Threshold $M^*$ - dissemination policy
• Cut-off point for density (LOF)
– quantiles
– automatic


Stratification variables

[Figure: TURN, analysis by Nace; panels: Nace A, all Nace]


Disclosure limitation

MFR Selective masking

k-anonymity Nearest neighbour

Micro-aggregation on tails
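The slide names micro-aggregation on tails without giving details. One possible reading, sketched below in Python, replaces the largest values of a numerical variable by the mean of small groups of neighbouring values, so that each masked value is shared by at least `k` units while the total is preserved. The function name, the fixed-size grouping rule and the `tail_size` parameter are assumptions made for illustration, and `tail_size` is assumed to be at least `k`.

```python
def microaggregate_upper_tail(values, k, tail_size):
    """Replace the `tail_size` largest values by means of groups of size >= k."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    tail = order[len(values) - tail_size:]  # indices of the largest values
    out = list(values)
    start = 0
    while start < len(tail):
        end = start + k
        # Merge a too-small remainder into the current group.
        if len(tail) - end < k:
            end = len(tail)
        group = tail[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            out[i] = mean
        start = end
    return out

turnover = [1, 2, 3, 100, 200, 300, 400, 500]
masked = microaggregate_upper_tail(turnover, k=2, tail_size=5)
```

Records outside the tail are left untouched, which matches the idea of modifying only the records at risk.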


Quality assessment

Dissemination

Confidentiality


Risk measure assessment

Quality of the external database

[Diagram: record linkage between databases D and E; Chambers of Commerce database]


Record linkage

M* = 3

           1 equal unit   less than 3 units   less than 3 units   less than 3 units
           within 10%     within 10%          within 20%          within 30%
NACE       88%            84%                 97%                 100%
NACEEMP    63%            60% (a)             74% (a)             87% (a)

M* = 5

           1 equal unit   less than 5 units   less than 5 units   less than 5 units
           within 10%     within 10%          within 20%          within 30%
NACE       88%            73%                 87%                 96%
NACEEMP    63%            58% (a)             70% (a)             80% (a)

(a) 100% for enterprises with more than 250 employees
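The column headings can be read as counting, for each released unit, how many units of the external database fall in the same block and within a relative tolerance of its value. The Python sketch below illustrates that counting rule. The field names (`nace`, `emp`, `turn`), the use of turnover as the compared variable and the function names are assumptions, since the slide does not state which variable the tolerance applies to.

```python
def candidates_within(target, external, tol, keys=("nace",)):
    """Count external records sharing the blocking keys with `target`
    whose turnover lies within the relative tolerance `tol` of it."""
    count = 0
    for rec in external:
        same_block = all(rec[k] == target[k] for k in keys)
        if same_block and abs(rec["turn"] - target["turn"]) <= tol * abs(target["turn"]):
            count += 1
    return count

def is_at_risk(target, external, tol, m_star, keys=("nace",)):
    """A unit is flagged when fewer than m_star candidates are found."""
    return candidates_within(target, external, tol, keys) < m_star

# Hypothetical external database and released unit.
external = [
    {"nace": "A", "emp": 1, "turn": 100.0},
    {"nace": "A", "emp": 1, "turn": 105.0},
    {"nace": "A", "emp": 2, "turn": 95.0},
    {"nace": "B", "emp": 1, "turn": 101.0},
    {"nace": "A", "emp": 1, "turn": 150.0},
]
unit = {"nace": "A", "emp": 1, "turn": 100.0}
n_nace = candidates_within(unit, external, tol=0.10)
n_nace_emp = candidates_within(unit, external, tol=0.10, keys=("nace", "emp"))
```

Blocking on more variables (NACE and employment class instead of NACE alone) can only reduce the number of candidates, so a unit is flagged at least as often under the finer blocking.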


Information content analysis

Information preservation

• Selective masking
– Data utility
– Only identifying and confidential variables were modified.
– Only records at risk were modified.
• The weights were not modified.
– weighted totals (coherence with the already published information)

Some statistical indicators were slightly modified: variances


Information content analysis: Data utility

Assessment of the perturbation impact on ratios like RTOT/TURN

[Figure: comparison of Original, Selective masking and Individual ranking]
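One simple way to quantify the perturbation impact on a ratio such as RTOT/TURN is to compare the record-level ratio before and after masking. The Python sketch below does this for paired records. The function name, the choice of mean and maximum absolute change as summaries, and the example figures are all assumptions for illustration, not results from the survey.

```python
def ratio_impact(original, masked):
    """Mean and maximum absolute change of RTOT/TURN between paired records.

    Both arguments are lists of (rtot, turn) pairs in the same record order;
    turnover is assumed to be non-zero.
    """
    changes = []
    for (r0, t0), (r1, t1) in zip(original, masked):
        changes.append(abs(r1 / t1 - r0 / t0))
    return sum(changes) / len(changes), max(changes)

# Hypothetical records: only the third one was modified by the masking.
original = [(10.0, 100.0), (20.0, 100.0), (50.0, 200.0)]
masked = [(10.0, 100.0), (20.0, 100.0), (60.0, 200.0)]
mean_change, max_change = ratio_impact(original, masked)
```

Running the same comparison for each protection method gives a common scale on which the methods can be ranked by information loss.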


Conclusions

1. Confidentiality: Risk measure based on the k-anonymity principle

Flexible:
a) continuous and categorical variables
b) easy to implement
c) consistent for extreme choices

2. Data utility: Selective protection to achieve k-anonymity

3. Comparable dissemination: Control both risk of re-identification and information loss

QUALITY DIMENSIONS