European Conference on Quality in Official Statistics, Rome, July 2008
Community Innovation Survey: a Flexible Approach to the Dissemination of Microdata Files for Research
Daniela Ichim
Outline
• Dissemination of Microdata Files for Research
• Risk assessment
• Disclosure limitation
• Data quality
  – Record linkage
  – Data utility
Confidentiality against Dissemination
Find the right balance!
Disclosure scenarios
Community Innovation Survey
• IDENTIFYING VARIABLES
  – Nace
  – Nuts
  – Size
  – Turnover (TURN)
  (STRUCTURAL VARIABLES)
• CONFIDENTIAL VARIABLES
  – Expenditures in innovation (RTOT, …)
  – Number of patents, …
  (VARIABLES INVOLVED IN ANALYSES)
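Confounding
Unit: $t = (t_n, t_c)$, with numerical part $t_n$ and categorical part $t_c$
[Figure: categorical and numerical variables; safe and unsafe units; labels A, A, …, A; k-anonymity]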
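The k-anonymity idea on the categorical identifying variables can be sketched as below. This is a minimal illustration, not the procedure used for the survey: the toy records, the dictionary keys `nace`, `nuts`, `size` and the helper `unsafe_records` are assumptions made for the example.

```python
from collections import Counter

def unsafe_records(records, keys, k):
    """Return indices of records whose key combination occurs fewer than k times."""
    combos = [tuple(r[key] for key in keys) for r in records]
    counts = Counter(combos)
    return [i for i, c in enumerate(combos) if counts[c] < k]

# Hypothetical toy data, not taken from the survey.
records = [
    {"nace": "A", "nuts": "ITC", "size": "small"},
    {"nace": "A", "nuts": "ITC", "size": "small"},
    {"nace": "A", "nuts": "ITC", "size": "small"},
    {"nace": "B", "nuts": "ITF", "size": "large"},
]
unsafe = unsafe_records(records, ["nace", "nuts", "size"], k=3)
print(unsafe)
```

Only the last record is flagged here, because its combination of categories is shared by fewer than k = 3 units.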
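General risk function
a) Given a threshold $M^*$ (on units)
b) Local Outlier Factor as a measure of difference in density between a unit and its nearest neighbours
Distance between $t^1$ and $t^2$: $d(t^1, t^2)$, combining $e(t_n^1, t_n^2)$ on the numerical variables with $I(t_c^1, t_c^2) \in \{0, 1\}$ on the categorical variables
Density around $t$: $LRD_{M^*}(t)$
$$LOF_{M^*}(t) = \frac{1}{|N_{M^*}(t)|} \sum_{t' \in N_{M^*}(t)} \frac{LRD_{M^*}(t')}{LRD_{M^*}(t)}$$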
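The density comparison behind the Local Outlier Factor can be sketched in a few lines. This is a minimal illustration under stated assumptions: it uses the standard LOF definition with plain Euclidean distance on numerical variables only (the categorical part of the distance is ignored), the parameter `m` plays the role of M*, and the toy points are hypothetical.

```python
import math

def lof_scores(points, m):
    """LOF of each point from its m nearest neighbours (m plays the role of M*)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    kdist, neigh = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[m - 1]]  # distance to the m-th nearest neighbour
        kdist.append(kd)
        # Neighbourhood of unit i; ties at the m-th distance are kept.
        neigh.append([j for j in order if dist[i][j] <= kd])

    def lrd(i):
        # Local reachability density: inverse of the mean reachability distance.
        reach = [max(kdist[j], dist[i][j]) for j in neigh[i]]
        return len(reach) / sum(reach)  # assumes no group of m+1 identical points

    dens = [lrd(i) for i in range(n)]
    return [sum(dens[j] for j in neigh[i]) / (len(neigh[i]) * dens[i])
            for i in range(n)]

# Hypothetical toy data: four units close together and one isolated unit.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = lof_scores(points, m=2)
print(scores)
```

Units inside a dense group get a score close to 1, while the isolated unit gets a much larger score and would be treated as at risk.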
Parameters
• Threshold M*: dissemination policy
• Cut-off point for density (LOF)
  – quantiles
  – automatic
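A quantile-based cut-off on the LOF scores could look like the sketch below. The scores, the 80% level and the helper `quantile_cutoff` are hypothetical; the slides do not state which quantile was used.

```python
import math

def quantile_cutoff(scores, q):
    """Empirical q-quantile: smallest score with at least a share q of scores at or below it."""
    ordered = sorted(scores)
    idx = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[idx]

# Hypothetical LOF scores for ten units.
scores = [0.9, 1.0, 1.0, 1.1, 1.2, 1.0, 0.95, 1.05, 3.5, 7.2]
cut = quantile_cutoff(scores, 0.8)
at_risk = [i for i, s in enumerate(scores) if s > cut]
print(cut, at_risk)
```

Units whose score exceeds the cut-off are the ones passed on to disclosure limitation.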
Stratification variables
[Figure: TURN; analysis by Nace; panels: Nace A, all Nace]
Disclosure limitation
MFR Selective masking
k-anonymity Nearest neighbour
Micro-aggregation on tails
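One simple reading of micro-aggregation on tails is to replace the largest values by their group mean, which leaves the total unchanged. The sketch below assumes that reading; the function name, the group size `k` and the toy turnover values are hypothetical, and the variant actually applied to the survey may differ.

```python
def microaggregate_upper_tail(values, k):
    """Replace the k largest values by their mean; other values are unchanged."""
    top = sorted(range(len(values)), key=lambda i: values[i])[-k:]
    mean = sum(values[i] for i in top) / k
    out = list(values)
    for i in top:
        out[i] = mean
    return out

# Hypothetical turnover values with three large enterprises in the upper tail.
turn = [10.0, 12.0, 11.0, 300.0, 500.0, 400.0]
masked = microaggregate_upper_tail(turn, 3)
print(masked)
```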
Quality assessment
Dissemination
Confidentiality
Risk measure assessment
Quality of the external database
D
E
Chambers of Commerce database
Record linkage
Record linkage
M*=3
          1 equal unit   less than 3 units   less than 3 units   less than 3 units
          within 10%     within 10%          within 20%          within 30%
NACE      88%            84%                 97%                 100%
NACEEMP   63%            60% (a)             74% (a)             87% (a)

M*=5
          1 equal unit   less than 5 units   less than 5 units   less than 5 units
          within 10%     within 10%          within 20%          within 30%
NACE      88%            73%                 87%                 96%
NACEEMP   63%            58% (a)             70% (a)             80% (a)

(a) 100% for enterprises with more than 250 employees
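The "units within 10%" criterion can be illustrated by counting, for each released value, how many units in an external register fall within a relative tolerance. This is only a sketch of that counting step: the function name, the toy turnover values and the interpretation of the threshold are assumptions, not the linkage procedure behind the figures above.

```python
def candidates_within(value, external, tol):
    """Number of external values within a relative tolerance tol of value."""
    return sum(1 for x in external if abs(x - value) <= tol * abs(value))

# Hypothetical turnover values in an external register and in the released file.
external_turn = [100.0, 104.0, 109.0, 150.0, 500.0]
released = [105.0, 480.0]
m_star = 3  # threshold on the number of candidate units

few_candidates = [candidates_within(v, external_turn, 0.10) < m_star for v in released]
print(few_candidates)
```

A released unit with fewer than `m_star` candidates within the tolerance is easier to link to a single external record.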
Information content analysis
Information preservation
• Selective masking
  – Data utility
  – Only identifying and confidential variables were modified.
  – Only records at risk were modified.
• The weights were not modified.
  – weighted totals (coherence with the already published information)
Some statistical indicators were slightly modified: variances
Information content analysis: data utility
Assessment of the perturbation impact on ratios like RTOT/TURN
[Figure legend: Original, Selective masking, Individual ranking]
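The impact of perturbation on a ratio such as RTOT/TURN can be checked record by record. The sketch below assumes each record is a hypothetical pair (RTOT, TURN) before and after masking; the values and the helper `ratio_changes` are illustrative only.

```python
def ratio_changes(original, masked):
    """Absolute change in RTOT/TURN for each record."""
    return [abs(m_r / m_t - o_r / o_t)
            for (o_r, o_t), (m_r, m_t) in zip(original, masked)]

# Hypothetical (RTOT, TURN) pairs; only the second record was masked.
original = [(10.0, 100.0), (30.0, 200.0)]
masked = [(10.0, 100.0), (30.0, 240.0)]
changes = ratio_changes(original, masked)
print(changes)
```

Records that were not at risk keep their ratio exactly, so any change is concentrated in the masked records.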
Conclusions
1. Confidentiality: Risk measure based on the k-anonymity principle
Flexible
  a) continuous and categorical variables
  b) easy to implement
  c) consistent for extreme choices
2. Data utility: Selective protection to achieve the k-anonymity
3. Comparable dissemination: Control both risk of re-identification and information loss
QUALITY DIMENSIONS