sci datacon neil-walker-2016-09-12

False promises of data anonymity jeopardize data access

Neil Walker
12th September 2016
JDRF/Wellcome Trust Diabetes and Inflammation Laboratory
University of Cambridge
nmw24@cam.ac.uk
ORCID: http://orcid.org/0000-0001-9796-7688

Contents

In the context of clinical trial data, I’ll discuss:
• The proposal to share individual-level data
• False promises of anonymity
• Consequences
• Alternatives

Neil Walker 2

The proposal

In a Jan 2016 editorial, the International Committee of Medical Journal Editors (ICMJE)

“Proposes to require authors to share with others the deidentified individual patient data (IPD) underlying the results presented in the article.”

Implementation delayed a year [1] to allow new consent models

1. Post adoption

… response to …

Institute of Medicine (IoM) report (2015)

… which includes …

(p144, my emphasis): “De-identification is commonly used to protect the privacy of participants in a clinical trial (see also Appendix B). Various jurisdictions may differ on the degree to which the risk of re-identification must be reduced for the data to be considered sufficiently de-identified to justify more widespread sharing, particularly in the absence of specific informed consent of the data subjects.”

[Appendix B is 54 pages on "Concepts and Methods for De-identifying Clinical Trial Data", by Khaled El Emam and Bradley Malin]

… and cites and relies on …

… which is a poor implementation of …

Why poor? ISTDB-2 has 19435 participants, and 112 variables, e.g.:

Randomisation data;
HOSPNUM;Hospital number
RDELAY;Delay between stroke and randomisation in hours
RCONSC;Conscious state at randomisation (F - fully alert, D - drowsy, U - unconscious)
SEX;"M=male; F=female"
AGE;Age in years
RSLEEP;Symptoms noted on waking (Y/N)
RATRIAL;"Atrial fibrillation (Y/N); not coded for pilot phase - 984 patients"
...
COUNTRY;Abbreviated country code
CNTRYNUM;Country code
...

This should be enough people, right?

Let’s count the people from each country

$ cut -f82 IST_corrected.txt | sort | uniq -c | sort -nr

6257 UK
3437 ITAL
1631 SWIT
 759 POLA
...
   9 JAPA
   2 FRAN
   1 COUNTRY

NB: dataset superseded by ISTDB-3, currently embargoed due to "UK NHS Information Governance"
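The country count above generalises to a small-cell check: count how many participants share each combination of indirectly identifying variables and flag the combinations shared by fewer than k people. The following is a minimal sketch of that idea, not the method used on the trial data. The records, the threshold k=3 and the function name `small_cells` are all invented for illustration; only the column names COUNTRY, SEX and AGE come from the variable list.

```python
from collections import Counter

# Toy rows standing in for trial participants (values invented for illustration).
records = [
    {"COUNTRY": "UK", "SEX": "M", "AGE": 71},
    {"COUNTRY": "UK", "SEX": "M", "AGE": 68},
    {"COUNTRY": "UK", "SEX": "M", "AGE": 80},
    {"COUNTRY": "UK", "SEX": "F", "AGE": 75},
    {"COUNTRY": "ITAL", "SEX": "F", "AGE": 66},
    {"COUNTRY": "ITAL", "SEX": "F", "AGE": 79},
    {"COUNTRY": "ITAL", "SEX": "M", "AGE": 70},
    {"COUNTRY": "FRAN", "SEX": "M", "AGE": 82},
    {"COUNTRY": "FRAN", "SEX": "F", "AGE": 64},
]


def small_cells(rows, keys, k):
    """Return the key combinations shared by fewer than k rows, with their counts."""
    counts = Counter(tuple(row[key] for key in keys) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}


# One variable: only the smallest country falls below the (assumed) threshold.
risky_by_country = small_cells(records, ["COUNTRY"], k=3)

# Two variables: most combinations now fall below it.
risky_by_country_sex = small_cells(records, ["COUNTRY", "SEX"], k=3)

print(risky_by_country)
print(len(risky_by_country_sex), "small cells for COUNTRY + SEX")
```

Each extra variable splits the cells further, so a dataset that looks large in total can still contain many participants who are unique on a handful of columns.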

Is this just isolated sloppiness?

And noting a released dataset cannot be retrieved

Examples - "Anecdata" (from Daniel C Barth-Jones, to whom many thanks)

1. Governor Weld - identified in insurance dataset in 1997

2. Netflix - customers identified in a dataset released to improve recommendations

3. Y-Chromosome STR surname inference - demonstration from Yaniv Erlich's lab

4. PGP - subjects identified in (Open) Personal Genome Project

5. Washington State Hospital Discharge data - patients identified in data sold by hospital

6. NYC Taxi - celebrities identified in FOIL request

7. Mobile phone - theoretical identification from mobile phone location data

Failure modes?

• 1. and 4. are cases where too much data was released (ZIP codes, DOBs)

• 6. and 7. are breached by linking multiple records that are (probably) harmless individually, though 6. also had a key hacked

But all rely on data available outside the dataset to make the (often small number of) identifications - some of it not obvious
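The reliance on outside data can be illustrated with a toy linkage: join a released table that has no names to a separate named register on the variables they share, and keep only the combinations that are unique in both. This is a sketch under stated assumptions, not a reconstruction of any of the attacks listed. Every record, name and diagnosis is invented, and `group_by` and `link` are hypothetical helpers.

```python
from collections import defaultdict

# Released rows: names removed, but ZIP code, date of birth and sex retained.
released = [
    {"zip": "10001", "dob": "1950-01-02", "sex": "M", "diagnosis": "condition A"},
    {"zip": "10001", "dob": "1962-05-17", "sex": "F", "diagnosis": "condition B"},
    {"zip": "10002", "dob": "1962-05-17", "sex": "F", "diagnosis": "condition C"},
    {"zip": "10002", "dob": "1962-05-17", "sex": "F", "diagnosis": "condition D"},
]

# A separate register, available outside the dataset, carrying names.
register = [
    {"name": "Person One", "zip": "10001", "dob": "1950-01-02", "sex": "M"},
    {"name": "Person Two", "zip": "10001", "dob": "1962-05-17", "sex": "F"},
    {"name": "Person Three", "zip": "10002", "dob": "1962-05-17", "sex": "F"},
    {"name": "Person Four", "zip": "10003", "dob": "1970-01-01", "sex": "M"},
]

KEYS = ["zip", "dob", "sex"]


def group_by(rows, keys):
    """Group rows by their values for the given keys."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return groups


def link(released_rows, register_rows, keys):
    """Map names to diagnoses where the key combination is unique in both sources."""
    released_groups = group_by(released_rows, keys)
    register_groups = group_by(register_rows, keys)
    matches = {}
    for combo, rows in released_groups.items():
        candidates = register_groups.get(combo, [])
        if len(rows) == 1 and len(candidates) == 1:
            matches[candidates[0]["name"]] = rows[0]["diagnosis"]
    return matches


matches = link(released, register, KEYS)
print(matches)
```

In this toy data two of the four released rows are re-identified. The other two share a key combination, so the match is ambiguous, which is why only a small number of identifications is typical.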

So, de-identification hotly debated…

“There is no evidence that de-identification works either in theory or in practice and attempts to quantify its efficacy are unscientific and promote a false sense of security by assuming unrealistic, artificially constrained models of what an adversary might do.”

How does this jeopardise data access?

• And not just bad publicity, though that doesn’t help!

Image from Fast Company

Data access issue #1: where consent was not sought for data sharing

Data is being redacted e.g. from https://clinicalstudydatarequest.com

GSK’s exclusion criteria include:

Whether GSK consider it feasible to anonymise the data without compromising the privacy and confidentiality of research participants. For example, anonymisation of data from studies of rare diseases is more difficult to achieve and will be reviewed on a case-by-case basis.

Data access issue #2: where there is no experience of sharing data with consent

Should have lots of choices

Where is clinical data sharing now?

EBI and NIH like this …

Where should it be?

Aggregate Consented, anonymised

Understanding Society [1], at UK Data Archive

1. https://www.understandingsociety.ac.uk/documentation/getting-started

Downloads, 2014     3           285                        2510
Datasets            2           29                         3
Time to decision    3 months    2 weeks                    1 day
Decision by         DAC         Staff, reporting to DAC    Registration, delegated by DAC

i.e. some people do it well

Data access issue #3: no elegant way to respond to a new attack

This paper led to all genotype summary statistics being placed behind firewalls

Data access issue #4: people take risks

STOP PRESS - September 7th 2016

NHGRI give up on access control?

https://www.genome.gov/director/

https://www.genome.gov/27566089/Workshop-on-Sharing-Aggregate-Genomic-Data

“NHGRI should recommend that NIH reconsider the policy for maintaining all genomic summary statistics under controlled access, and develop a default public access model based on transparent policy considerations for most genomics studies.”

The elephant in the room?

From Banksy’s Barely Legal show, LA, 2006

Anonymous data is seen as an asset to buy and sell

However, not all subjects will agree to data sharing: a recent health-data poll found that 17%

“objected to private companies having access to health data under any circumstances.”

(Ipsos MORI 2016)

So, to repeat: this is not a matter of consent or anonymise

Do both
