
Post on 31-Jan-2016


Page 1: Privacy Statistics and  Data Linkage

Privacy Statistics and Data Linkage

Mark Elliot

Confidentiality and Privacy Group

University of Manchester


Overview

• The disclosure risk problem

• Some e-science possibilities

– Monitored data access

– Grid-based Data Environment Analysis

• The meaning of privacy


Data Data Everywhere…

• Massive and exponential increase in data; Mackey and Purdam (2002); Purdam and Elliot (2002).

– These studies have led to the setting up of the data monitoring service.

• Singer (1999) noted three behavioural tendencies:

– Collect more information on each population unit

– Replace aggregate data with person-specific databases

– Given the opportunity, collect personal information

• Purdam and Elliot add:

– Link data whenever you can


Disclosure Risk I: Microdata


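The Disclosure Risk Problem: Type I: Identification

[Figure: an identification file and a target file linked through shared key variables]

Identification file: Name, Address (ID variables); Sex, Age, .. (key variables)

Target file: Sex, Age, .. (key variables); Income, .. (target variables)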
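As a rough illustration of the identification problem, the sketch below links a toy identification file to a toy target file on the key variables sex and age. All records, field names and the helper `link_unique` are invented for this example; real linkage would have to allow for data errors and for the fact that a record that is unique in a sample need not be unique in the population.

```python
from collections import Counter

# Hypothetical toy records; every name and value is invented for illustration.
identification_file = [
    {"name": "A. Smith", "address": "1 High St", "sex": "F", "age": 34},
    {"name": "B. Jones", "address": "2 Low Rd", "sex": "M", "age": 34},
    {"name": "C. Brown", "address": "3 Mid Ln", "sex": "M", "age": 51},
]
target_file = [
    {"sex": "F", "age": 34, "income": 41000},
    {"sex": "M", "age": 51, "income": 29000},
    {"sex": "M", "age": 51, "income": 63000},
]
KEY_VARIABLES = ("sex", "age")  # variables present in both files


def key_of(record):
    """Combination of key-variable values for one record."""
    return tuple(record[v] for v in KEY_VARIABLES)


def link_unique(id_file, tgt_file):
    """Return (name, target record) pairs whose key is unique in both files."""
    id_counts = Counter(key_of(r) for r in id_file)
    tgt_counts = Counter(key_of(r) for r in tgt_file)
    tgt_by_key = {key_of(r): r for r in tgt_file}
    matches = []
    for record in id_file:
        k = key_of(record)
        if id_counts[k] == 1 and tgt_counts.get(k, 0) == 1:
            matches.append((record["name"], tgt_by_key[k]))
    return matches


matches = link_unique(identification_file, target_file)
for name, target in matches:
    print(name, "-> income", target["income"])
```

Only one person is linked here: one key combination has no counterpart in the target file, and another matches two target records and so stays ambiguous.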


Disclosure Risk II: Aggregate Tables of Counts


The Disclosure Risk Problem: Type II: Attribution

             High  Medium  Low  Total
Academics       0     100   50    150
Lawyers       100      50    5    155
Total         100     150   55    305

Income levels for two occupations
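The zero in the table above is what makes attribution possible: anyone known to be an academic can be inferred not to have a high income. The sketch below scans a table of counts for such cells. The function names `ruled_out` and `attributable` are invented for this example, and checking for zeros is only the simplest form of attribution checking.

```python
# Counts copied from the table above (income level by occupation).
table = {
    "Academics": {"High": 0, "Medium": 100, "Low": 50},
    "Lawyers": {"High": 100, "Medium": 50, "Low": 5},
}


def ruled_out(counts):
    """For each row, list the categories with a zero count.

    Knowing that someone belongs to the row rules these categories out.
    """
    return {
        row: [col for col, n in cells.items() if n == 0]
        for row, cells in counts.items()
    }


def attributable(counts):
    """Rows with exactly one non-zero cell: membership reveals the category."""
    result = {}
    for row, cells in counts.items():
        non_zero = [col for col, n in cells.items() if n > 0]
        if len(non_zero) == 1:
            result[row] = non_zero[0]
    return result


print(ruled_out(table))
print(attributable(table))
```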


The Disclosure Risk Problem: Type II: Attribution

             High  Medium  Low  Total
Academics       1     100   50    150
Lawyers       100      50    5    155
Total         100     150   55    305

Income levels for two occupations


The Disclosure Risk Problem: Type II: Attribution

             High  Medium  Low  Total
Academics       0     100   50    150
Lawyers       100      50    5    155
Total         100     150   55    305

Income levels for two occupations


Multiple datasets

• Disclosure risk assessment for single datasets is a reasonably well-understood problem.

• But what happens with multiple datasets?


Data Mining and the Grid

• Traditional Data Mining examines and identifies patterns on single (if massive) datasets.

• But Data Mining is really a method/approach/technology that has been waiting for the grid to happen.


• Smith and Elliot (2005,06,07)

• Increases in data availability lead inexorably to an increase in disclosure risk

• My ability to make linkages (disclosive or otherwise) between datasets X and Y is facilitated by the copresence of dataset Z.

• It’s all about information!
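The point about dataset Z can be made concrete with a toy example: X and Y share no variables and cannot be joined directly, but Z shares variables with both and so acts as a bridge. The datasets, field names and the `join` helper below are all invented for illustration; this is a sketch of exact-match joining, not of any particular record linkage method.

```python
# Hypothetical toy datasets; all fields and values are invented.
dataset_x = [
    {"postcode": "M13", "birth_year": 1970, "diagnosis": "asthma"},
    {"postcode": "M14", "birth_year": 1982, "diagnosis": "diabetes"},
]
dataset_y = [
    {"email": "a@example.org", "employer": "Acme"},
    {"email": "b@example.org", "employer": "Globex"},
]
dataset_z = [
    {"postcode": "M13", "birth_year": 1970, "email": "a@example.org"},
    {"postcode": "M15", "birth_year": 1990, "email": "c@example.org"},
]


def join(a, b):
    """Exact-match join of two record lists on whatever variables they share."""
    if not a or not b:
        return []
    keys = sorted(set(a[0]) & set(b[0]))
    if not keys:
        return []  # no common variables, so no linkage is possible
    return [
        {**ra, **rb}
        for ra in a
        for rb in b
        if all(ra[k] == rb[k] for k in keys)
    ]


direct = join(dataset_x, dataset_y)                    # X and Y alone
via_z = join(join(dataset_x, dataset_z), dataset_y)    # X linked to Y through Z

print(len(direct), "direct links;", len(via_z), "links via Z")
```

With Z present, a diagnosis from X ends up attached to an employer from Y, even though neither X nor Y changed.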


CLEF: Clinical e-Science Framework

A solution involving monitored access


CLEF Consortium

Approximately 40 staff from:

• University of Manchester

• University of Sheffield

• University College London

• University of Brighton

• Royal Marsden Hospital, London


Purpose

• To provide a system for allowing research access to patient data, whilst maintaining privacy.

• Patient records

– Database

• Texts such as referral letters and other clinical texts

– Text-mining system converts these to microdata


CLEF: one possible architecture

[Figure: architecture diagram with components Raw Data, Pre-access SDRA/SDC, Pre-access DQI Monitor, Treated Data, Firewall, Workbench, Data Intrusion Sentry, Pre-output SDRA/SDC, Pre-output DQI Monitor]


Data Sentry: an AI system

• Monitors patterns of analytical requests

– 3 levels: users, institution, world.

– Looking for intrusive patterns.

– Numbers of requests

• Stores analytical requests for future use.
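One very reduced way to picture the request monitoring described above is a log that counts requests against the same set of variables at each of the three levels and flags any level whose limit is exceeded. The class `RequestLog`, the limits and the rule itself are assumptions made for this sketch; the slides describe an AI system looking for intrusive patterns, which would be considerably more than counting.

```python
from collections import Counter

# Hypothetical limits on repeated requests against the same variable set.
LIMITS = {"user": 3, "institution": 5, "world": 8}


class RequestLog:
    """Stores analytical requests and counts them at three levels."""

    def __init__(self):
        self.counts = {level: Counter() for level in LIMITS}
        self.history = []  # stored requests, kept for future use

    def record(self, user, institution, variables):
        """Store one request; return the levels whose limit is now exceeded."""
        target = frozenset(variables)
        self.history.append((user, institution, target))
        keys = {
            "user": (user, target),
            "institution": (institution, target),
            "world": target,
        }
        flagged = []
        for level, key in keys.items():
            self.counts[level][key] += 1
            if self.counts[level][key] > LIMITS[level]:
                flagged.append(level)
        return flagged


log = RequestLog()
flags = [log.record("u1", "uni_a", ["age", "sex", "diagnosis"]) for _ in range(4)]
print(flags[-1])
```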


CLEF Proposed Architecture

[Figure: architecture diagram with components Raw Data, Pre-access SDRA/SDC, Pre-access DQI Monitor, Treated Data, Firewall, Workbench, Data Intrusion Sentry, Pre-output SDRA/SDC, Pre-output DQI Monitor]


Data Quality

• User analyses are run on both treated and untreated data.

– Outputs are compared and assessed for difference.

– Major research area: Knowledge Engineering

• Analyses are stored and collectively run over pre- and post-SDC files for assessment of impact.
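The comparison of outputs on treated and untreated data can be sketched as follows. Top-coding stands in for the SDC treatment, the mean stands in for a user analysis, and the data and function names are invented; the slides do not specify which treatments or difference measures CLEF would use.

```python
from statistics import mean

# Hypothetical untreated values (for example, ages); invented for illustration.
raw = [21, 34, 35, 47, 52, 68]


def top_code(values, cap):
    """A simple SDC treatment: any value above cap is replaced by cap."""
    return [min(v, cap) for v in values]


def relative_difference(analysis, raw_data, treated_data):
    """Run the same analysis on both files and compare the two outputs.

    Assumes the analysis returns a single non-zero number on the raw data.
    """
    raw_out = analysis(raw_data)
    treated_out = analysis(treated_data)
    return abs(treated_out - raw_out) / abs(raw_out)


treated = top_code(raw, 50)
impact = relative_difference(mean, raw, treated)
print(f"relative change in the mean: {impact:.3f}")
```

Storing many such analyses and rerunning them over each candidate treated file would give an overall picture of the impact of a given treatment.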


The Grid: the context for massive combining.

• “Integrated infrastructure for high-performance distributed computation” Cannataro and Talia (2002)

– Grid middleware handles the technical issues: communication, security, access/authentication, etc. (Cole et al., 2002)

• Data grid

• Knowledge grid


Grid based Data Environment Analysis


What’s it about?

• Disclosure risk analysis is forever constrained by the fact that we tend to look only at the release object.

– This is a bit like evaluating the risk of a house being vulnerable to flooding without looking at where it is located!

• Data Environment Analysis aims to remedy that situation and completely change the face of disclosure control in so doing…


What would it involve?

• Web Crawling

• Data Monitoring

• Synthetic Data Generation

• Grid based disclosure risk analysis


Web crawling

• Untrained screen scraping of all web sites that collect personal data.

• Generic gathering of personal information published on the web (personal web pages, MySpace, etc.)


Data Monitoring

• The development of sophisticated metadatabases representing available info fields

• Combined database of web-available data.

– Involves intelligent interpretation of web data, record linkage and other AI crossover techniques.
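A metadatabase of available information fields could, in its simplest form, be a mapping from each source to the fields it exposes, which can then be queried against the key variables of a proposed release. The source names, fields and helper functions below are invented for this sketch and do not describe the actual data monitoring service.

```python
# Hypothetical metadata: the fields each web-available source exposes.
metadatabase = {
    "social_profile": {"name", "age", "town", "employer"},
    "electoral_listing": {"name", "address", "town"},
    "club_results": {"name", "age", "sex"},
}


def coverage(key_variables, metadata):
    """For each source, the key variables of the release that it also holds."""
    keys = set(key_variables)
    return {src: sorted(fields & keys) for src, fields in metadata.items()}


def full_matches(key_variables, metadata):
    """Sources holding every key variable, and so usable for linkage as they stand."""
    keys = set(key_variables)
    return sorted(src for src, fields in metadata.items() if keys <= fields)


release_keys = ["age", "sex"]
print(coverage(release_keys, metadatabase))
print(full_matches(release_keys, metadatabase))
```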


Architecture

[Figure: architecture diagram with five Web Crawlers, a Data monitor, a Synthesiser, an SDRA system, and a Repository of Data & Metadata]


What next?

• Decide on roles.

• Identify funder.

• Develop grant application.


Synthetic Data Generation

• Uses techniques like multiple imputation to generate artificial data from the metadata generated by the data monitors and from data stored and accessed through data repositories.
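To give a feel for generating artificial records from metadata alone, the sketch below draws each field independently from a category distribution held as metadata. This is deliberately much simpler than multiple imputation, which would model relationships between variables; the metadata values and the `synthesise` function are assumptions made for the example.

```python
import random

# Hypothetical metadata: category proportions for each field.
metadata = {
    "sex": {"F": 0.5, "M": 0.5},
    "age_band": {"16-34": 0.3, "35-64": 0.5, "65+": 0.2},
}


def synthesise(field_metadata, n, seed=0):
    """Generate n artificial records by sampling each field independently."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    records = []
    for _ in range(n):
        record = {}
        for field, distribution in field_metadata.items():
            values = list(distribution)
            weights = list(distribution.values())
            record[field] = rng.choices(values, weights=weights, k=1)[0]
        records.append(record)
    return records


synthetic = synthesise(metadata, 5)
for row in synthetic:
    print(row)
```

Because the fields are sampled independently, any association between sex and age band in the real data is lost; preserving such structure is what imputation-based approaches are for.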


Closing thoughts


A Blurring of Concepts

• The boundaries between data and processes become less distinct.

• Cyberidentities

– I am my data?

• The distinction between informational and physical privacy becomes less distinct.


Data Growth

• There is no reason to suppose that data growth will not continue at the same breakneck pace.

– The data environment will become increasingly rich.

• In this context the meaning of “privacy” will undoubtedly change.

– But how?


The meaning of Privacy

• Do people care about privacy in an orthodox, absolute sense?

– What does a blog mean?

• Private-public: Public Privacy

– Control and ownership are more important than the absolute right to secrecy.


From Data Subjects to Data Citizens

• A data-actualised individual, in control of and self-aware about their own data.

• What would data citizens be concerned about?

– Ownership

– The use/abuse of their data

– Harm

– Permission/Consent

• This suggests that the law should focus on data abuse rather than privacy per se.


Summary

• Statistical disclosure presents a problem for the use of data.

• Multiple linkable datasets exacerbate that problem.

• E-science provides some tools for new modes of data access


But…..

• Assuming that the global culture continues to feed and be fed by the information explosion:

– Our view of ourselves/our data will/must change.

– The meaning of privacy must change with it.

• The key question is what sort of society we are constructing; the meaning of privacy will reflect this.