sci datacon neil-walker-2016-09-12
Post on 23-Jan-2017
False promises of data anonymity jeopardize data access
Neil Walker
12th September 2016
JDRF/Wellcome Trust Diabetes and Inflammation Laboratory
University of Cambridge
nmw24@cam.ac.uk
ORCID: http://orcid.org/0000-0001-9796-7688
Contents
In the context of clinical trial data, I’ll discuss
• The proposal to share individual-level data
• False promises of anonymity
• Consequences
• Alternatives
Neil Walker 2
The proposal
In a Jan 2016 editorial, the International Committee of Medical Journal Editors (ICMJE)
“Proposes to require authors to share with others the deidentified individual patient data (IPD) underlying the results presented in the article.”
Implementation delayed a year¹ to allow new consent models
1. Post adoption
… response to …
Institute of Medicine (IoM) report (2015)
… which includes …
(p144, my emphasis): “De-identification is commonly used to protect the privacy of participants in a clinical trial (see also Appendix B). Various jurisdictions may differ on the degree to which the risk of re-identification must be reduced for the data to be considered sufficiently de-identified to justify more widespread sharing, particularly in the absence of specific informed consent of the data subjects.”
[Appendix B is 54 pages on "Concepts and Methods for De-identifying Clinical Trial Data", by Khaled El Emam and Bradley Malin]
… and cites and relies on …
… which is a poor implementation of …
Why poor? ISTDB-2 has 19435 participants, and 112 variables, e.g.:
Randomisation data;
HOSPNUM;Hospital number
RDELAY;Delay between stroke and randomisation in hours
RCONSC;Conscious state at randomisation (F - fully alert, D - drowsy, U - unconscious)
SEX;"M=male; F=female"
AGE;Age in years
RSLEEP;Symptoms noted on waking (Y/N)
RATRIAL;"Atrial fibrillation (Y/N); not coded for pilot phase - 984 patients"
...
COUNTRY;Abbreviated country code
CNTRYNUM;Country code
...
This should be enough people, right?
Let’s count the people from each country
$ cut -f82 IST_corrected.txt | sort | uniq -c | sort -nr
6257 UK
3437 ITAL
1631 SWIT
 759 POLA
...
   9 JAPA
   2 FRAN
   1 COUNTRY
NB: dataset superseded by ISTDB-3, currently embargoed due to "UK NHS Information Governance"
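The per-country count already shows very small groups. The same idea can be pushed to combinations of variables, which is where small cells usually appear. What follows is only a minimal sketch, not the method used for the slide: the function names `sample_rows` and `small_cells` are hypothetical, the rows are made up for illustration and are not taken from IST_corrected.txt, and the input is assumed to be tab-separated with one header line.

```shell
# Made-up rows for illustration only; NOT taken from IST_corrected.txt.
# Column names follow the variable list above (SEX, AGE, COUNTRY).
sample_rows() {
  printf 'SEX\tAGE\tCOUNTRY\n'
  printf 'M\t71\tUK\n'
  printf 'M\t71\tUK\n'
  printf 'F\t80\tUK\n'
  printf 'F\t64\tFRAN\n'
  printf 'M\t58\tJAPA\n'
}

# small_cells K
# Reads tab-separated rows (with a header line) on stdin and prints every
# combination of values shared by fewer than K rows, as "count<TAB>combination".
small_cells() {
  awk -F'\t' -v k="$1" '
    NR > 1 { n[$0]++ }    # skip the header, count identical rows
    END { for (c in n) if (n[c] < k) print n[c] "\t" c }
  ' | LC_ALL=C sort
}

# List the combinations that describe exactly one person in the sample.
sample_rows | small_cells 2
```

On real data one would first `cut` down to the columns an outsider could plausibly know, then pipe the result through `small_cells` with a chosen threshold.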
Is this just isolated sloppiness?
Note also that a dataset, once released, cannot be recalled
Examples - "Anecdata" (from Daniel C Barth-Jones, to whom many thanks)
1. Governor Weld - identified in insurance dataset in 1997
2. Netflix - customers identified in a dataset released to improve recommendations
3. Y-Chromosome STR surname inference - demonstration from Yaniv Erlich's lab
4. PGP - subjects identified in (Open) Personal Genome Project
5. Washington State Hospital Discharge data - patients identified in data sold by hospital
6. NYC Taxi - celebrities identified in FOIL request
7. Mobile phone - theoretical identification from mobile phone location data
Failure modes?
• Examples 1 and 4 are cases where too much data was released (ZIP codes, dates of birth)
• Examples 6 and 7 were breached by linking multiple records, each (probably) OK on its own - though 6 had a key hacked too
But all rely on data available outside the dataset to make the (often small number of) identifications - some of it not obvious
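A linkage of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not a reconstruction of any of the cases listed: the function names `link_records` and `demo_linkage`, the file layout (ZIP, DOB, SEX, then one payload column) and every record are hypothetical.

```shell
# link_records RELEASED OUTSIDE
# Both files are tab-separated: ZIP, DOB, SEX, then one payload column
# (a diagnosis in RELEASED, a name in OUTSIDE). The layout is an assumption.
link_records() {
  awk -F'\t' '
    # First file on the command line (OUTSIDE): index names by ZIP+DOB+SEX.
    NR == FNR { key = $1 FS $2 FS $3; count[key]++; name[key] = $4; next }
    # Second file (RELEASED): report rows whose key matches exactly one person.
    {
      key = $1 FS $2 FS $3
      if (count[key] == 1) print name[key] "\t" $4
    }
  ' "$2" "$1"
}

# Run the linkage on made-up records written to a temporary directory.
demo_linkage() {
  dir=$(mktemp -d) || return 1
  printf '00001\t1950-01-01\tM\tcondition A\n' >  "$dir/released.tsv"
  printf '00002\t1960-01-01\tF\tcondition B\n' >> "$dir/released.tsv"
  printf '00001\t1950-01-01\tM\tPerson One\n'   >  "$dir/outside.tsv"
  printf '00002\t1960-01-01\tF\tPerson Two\n'   >> "$dir/outside.tsv"
  printf '00002\t1960-01-01\tF\tPerson Three\n' >> "$dir/outside.tsv"
  link_records "$dir/released.tsv" "$dir/outside.tsv"
  rm -rf "$dir"
}

demo_linkage
```

Only the first released row is linked to a name, because its ZIP, date of birth and sex match a single outside record; the second matches two people and stays ambiguous. The released file carries no names at all, which is why removing direct identifiers is not enough on its own.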
So, de-identification hotly debated…
“There is no evidence that de-identification works either in theory or in practice and attempts to quantify its efficacy are unscientific and promote a false sense of security by assuming unrealistic, artificially constrained models of what an adversary might do.”
How does this jeopardise data access?
• And not just bad publicity, though that doesn’t help!
Image from Fast Company
Data access issue #1: where consent was not sought for data sharing
Data is being redacted e.g. from https://clinicalstudydatarequest.com
GSK’s exclusion criteria include:
Whether GSK consider it feasible to anonymise the data without compromising the privacy and confidentiality of research participants. For example, anonymisation of data from studies of rare diseases is more difficult to achieve and will be reviewed on a case-by-case basis.
Data access issue #2: where there is no experience of sharing data with consent
Should have lots of choices
Where is clinical data sharing now?
EBI and NIH like this …
Where should it be?
Aggregate Consented, anonymised
Understanding Society¹, at UK Data Archive
1. https://www.understandingsociety.ac.uk/documentation/getting-started
Downloads, 2014 | 3 | 285 | 2510
Datasets | 2 | 29 | 3
Time to decision | 3 months | 2 weeks | 1 day
Decision by | DAC | Staff, reporting to DAC | Registration, delegated by DAC
i.e. some people do it well
Data access issue #3: no elegant way to respond to a new attack
This paper led to all genotype summary statistics being placed behind firewalls
Data access issue #4: people take risks
STOP PRESS - September 7th 2016 NHGRI give up on access control?
https://www.genome.gov/director/
https://www.genome.gov/27566089/Workshop-on-Sharing-Aggregate-Genomic-Data
“NHGRI should recommend that NIH reconsider the policy for maintaining all genomic summary statistics under controlled access, and develop a default public access model based on transparent policy considerations for most genomics studies.”
The elephant in the room?
From Banksy’s Barely Legal show, LA, 2006
Anonymous data is seen as an asset to buy and sell
However, not all subjects will agree to data sharing, with a recent (health-data-related) poll finding 17%
“objected to private companies having access to health data under any circumstances.”
(Ipsos MORI 2016)
So, to repeat: this is not a matter of consent or anonymise
Do both