Derive Value From Excellence …
Evaluation of Re-identification Risk for Anonymized Clinical Documents
Agenda
1) Recent Background
2) EMA Policy 0070
3) Categorization of Clinical Data
4) Assessment of Anonymization
5) Criteria for Anonymization
6) Qualitative Risk
7) Quantitative Risk
a) Component 1
b) Component 2
8) Questions
RECENT BACKGROUND
EMA Policy 0070
Policy 0070 on clinical data
Phase 1 for clinical reports
• Module 2.5 (clinical overview)
• Module 2.7 (clinical summary)
• Module 5 (clinical study reports (CSRs) and Appendices 16.1.1 (protocol and protocol amendments), 16.1.2 (sample case report form) and 16.1.9 (documentation of statistical methods))
Phase 2 for IPD
Yet to be implemented
Key Note: The External Guidance was recently updated to provide a specific list of documents to be published under Phase 1.
Categorization of Clinical Data
Quasi-Identifiers:
– Demographic: Race, Ethnicity, Date of Birth, Age, Sex, Country
– Longitudinal: Adverse Events (Serious): Death, Hospitalization; Start/End of AE
– MHTERM
– SITEID
Direct Identifiers: USUBJID, INVNAME
Sensitive Information:
– Substance abuse, mental health disorders, HIV, reproductive health, sexually transmitted diseases
– MHTERM (text)
– medical history
Risk: The data may lead to identification of subjects participating in the study, directly or indirectly. Per EU legislation, this personal data must be protected.
Adversaries: Deliberate attempt, acquaintance, data breach
Remedy: Anonymization of clinical trial data to significantly reduce the risk of re-identification of subjects.
Assessment of Anonymization
Methods to establish if data has been properly anonymized:
1) Demonstrate that after anonymization, it is no longer possible to:
§ Single out an individual from the rest of the data ("Singling Out")
§ Link datasets and records related to an individual together and use the combined information to identify that individual ("Linkability")
§ Infer an attribute of a subject from other attributes with some significant probability ("Inference")
Note: It is clear that it is very difficult to meet the above criteria and maintain data utility at the same time. The only possibility is that if CSRs do not contain any information on direct and quasi-identifiers, then the above criteria can be met.
OR
2) Perform an analysis of re-identification risk, including:
§ An appropriate risk metric
§ A suitable/established threshold
§ An actual, calculated measurement of risk
Criteria for Anonymization
1) Meet established thresholds for risk of re-identification – for EMA: 0.09
2) Confirm anonymization methods do not impact data utility, generally defined as the ability to replicate analyses/results
[Figure: Anonymization; Data Utility; Risk of Re-identification]
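Read together with the 1/group-size risk that the deck quotes from EMA under Component 2, the 0.09 threshold can be restated as a minimum group size. The sketch below works through that arithmetic; the function names are illustrative and do not come from the slides.

```python
import math

EMA_THRESHOLD = 0.09  # threshold quoted on the slide


def min_group_size(threshold: float) -> int:
    """Smallest group size whose risk (1 / size) does not exceed the threshold."""
    return math.ceil(1 / threshold)


def meets_threshold(risk: float, threshold: float = EMA_THRESHOLD) -> bool:
    """True when a measured risk is at or below the threshold."""
    return risk <= threshold


print(min_group_size(EMA_THRESHOLD))  # 12
print(meets_threshold(1 / 12))        # True
print(meets_threshold(1 / 11))        # False
```

Under this reading, any subject who shares quasi-identifier values with fewer than 11 other participants would exceed the threshold.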
Qualitative Risk Assessment
Approach measures risk as a qualitative value, i.e., low, medium or high.
Other factors to be considered: age range, race, number of identifiers within the data/report.
* copied from TransCelerate guidance document
Quantitative Risk Assessment
Approach measures risk as a numerical value. Recent literature [Scaiano et al. (2016) & Kniola (2016)] has identified two components of the risk of re-identification for de-identified data:
1) Risk due to failing to annotate/redact any direct or quasi-identifier that was planned to be annotated.
2) Risk due to quasi-identifiers that are caught by the process and transformed or left unchanged in the clinical document.
Component 1
Probability of re-identification due to leakage of identifiers:
Pr(reid, leak) = Pr(reid | leak) × Pr(leak)
Where:
– Pr(reid | leak) = 1 for direct identifiers
– Pr(reid | leak) is computed for quasi-identifiers based on empirically supported heuristics, assuming that a leak in two quasi-identifiers would lead to re-identification of the patient
Including the "Hiding in plain sight" factor:
– Pr(leak) = 1 − recall, where recall is the proportion of identifiers that the process correctly annotates out of all identifiers that should be annotated (process: tool alone, tool + manual, or manual only)
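The Component 1 formula can be sketched in a few lines. The recall values and the 0.5 conditional probability for quasi-identifiers below are hypothetical placeholders chosen for illustration; the slides do not give specific numbers, and the function names are assumptions.

```python
def pr_leak(recall: float) -> float:
    """Probability that an identifier is missed: 1 - recall."""
    if not 0.0 <= recall <= 1.0:
        raise ValueError("recall must be between 0 and 1")
    return 1.0 - recall


def pr_reid_and_leak(pr_reid_given_leak: float, recall: float) -> float:
    """Pr(reid, leak) = Pr(reid | leak) * Pr(leak)."""
    return pr_reid_given_leak * pr_leak(recall)


# Direct identifiers: Pr(reid | leak) = 1, as stated on the slide.
direct = pr_reid_and_leak(1.0, recall=0.99)  # recall is a hypothetical value

# Quasi-identifiers: 0.5 is a hypothetical heuristic value, not from the slides.
quasi = pr_reid_and_leak(0.5, recall=0.95)

print(round(direct, 4), round(quasi, 4))
```

Because the direct-identifier case uses a conditional probability of 1, its risk equals the leak rate itself, which is why recall on direct identifiers dominates this component.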
Component 2
• EMA quotes that "probability of re-identification of a record in a dataset is 1 divided by the frequency of trial participants with same category/value of a set of the quasi identifiers (group size)." For example:
• Structured IPD is available to the trial sponsor but not to the researcher, who receives de-identified clinical documents.
Risk due to Component 2 will be calculated using IPD data, considering quasi-identifiers appearing directly in clinical documents.
Note: Population proportion consideration is not included here.
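The 1/group-size rule quoted from EMA can be illustrated with a small sketch. The records and subject IDs below are hypothetical and are not taken from the deck's tables; the function name is likewise an assumption.

```python
from collections import Counter


def reid_risk_by_record(records, quasi_identifiers):
    """Risk per subject = 1 / number of records sharing the same quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    group_sizes = Counter(keys)
    return {r["USUBJID"]: 1 / group_sizes[k] for r, k in zip(records, keys)}


# Hypothetical IPD records, for illustration only.
records = [
    {"USUBJID": "S1", "SEX": "M", "AGE": 30},
    {"USUBJID": "S2", "SEX": "M", "AGE": 30},
    {"USUBJID": "S3", "SEX": "F", "AGE": 41},
]

risk = reid_risk_by_record(records, ["SEX", "AGE"])
print(risk)  # S1 and S2 share a group of size 2; S3 is alone in its group
```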
Scenario 1 – Conservative approach
Risk assessed considering Component 1 and Component 2 jointly, if quasi-identifiers are replaced with de-identified values from IPD data, will be given by:
[Formula: Component 1, Component 2]
Component 1 will be calculated as suggested in previous slides, and Component 2 will be the major part of the risk. Component 2 will be the maximum of the individual risks across all subjects. For example, referring to table 1:
• If all subjects appear in clinical reports as in table 1, then the risk due to Component 2 will be 1.
• If only subjects 101 and 102 appear in the clinical report, then the risk will be 0.5.
This approach does not consider the appearance of quasi-identifiers in clinical reports.
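A minimal sketch of the conservative approach, assuming equivalence classes are formed on the full IPD and the maximum is taken over the subjects who appear in the report. Table 1 is not reproduced in this transcript, so the IPD below is hypothetical; it was chosen only so that the two outcomes match the figures stated on the slide (1 and 0.5).

```python
from collections import Counter


def conservative_component2(ipd, quasi_identifiers, subjects_in_report):
    """Maximum individual risk (1 / group size) over subjects appearing in the report."""
    key_of = {r["USUBJID"]: tuple(r[q] for q in quasi_identifiers) for r in ipd}
    group_sizes = Counter(key_of.values())
    return max(1 / group_sizes[key_of[s]] for s in subjects_in_report)


# Hypothetical IPD (not the deck's table 1).
ipd = [
    {"USUBJID": "101", "SEX": "M", "AGE": 26},
    {"USUBJID": "102", "SEX": "M", "AGE": 26},
    {"USUBJID": "103", "SEX": "F", "AGE": 40},
]

all_subjects = conservative_component2(ipd, ["SEX", "AGE"], ["101", "102", "103"])
two_subjects = conservative_component2(ipd, ["SEX", "AGE"], ["101", "102"])
print(all_subjects, two_subjects)  # 1.0 0.5
```

The approach is conservative because every quasi-identifier in the IPD is used for grouping, whether or not it actually appears in the document.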
Scenario 2 – Exact approach
If the process of de-identification of clinical documents is able to detect the information around quasi-identifiers for each subject appearing in the clinical document, then the risk assessed for Component 2 will be more exact. For example, referring to table 1, if only sex appears for subjects in clinical reports, then the risk will be as below:
[Table: re-identification risk when only sex appears in clinical reports]
If subjects 101 and 102 appear in the clinical report with age information, then the risk due to Component 2 will be 0.2.
Continued
What if not all quasi-identifiers appear for every subject? For example, as shown in the table below, age does not appear for two subjects, i.e., 103 and 104.
USUBJID   SEX   AGE   Equivalence Class   Re-id Risk
CT1/101   M     26    A1(3)               0.33
CT1/102   F     25    B(2)                0.5
CT1/103   M           A(3)                0.33
CT1/104   F           C(1)                0.5
CT1/105   M     26    A1(3)               0.33
• Re-id risk for subjects 103 and 104 will be calculated considering SEX variable information only, i.e., 1/number of repeated values in the current dataset.
• Re-id risk for the other subjects would need some value for the missing age of the two subjects:
1) Assuming the missing value could be any value and counting these subjects in each equivalence class
2) Assigning missing values using probabilities calculated on the remaining data.
• In the above example, it was assumed that a subject with missing age will be counted in equivalence classes where the value of the other variable is matched. 103 is counted in the equivalence class of subjects 101 and 105.
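One way to read the first option (a missing value could be any value) is to treat a missing quasi-identifier as a wildcard when counting group members. The sketch below applies that reading to the slide's five subjects; the function names are illustrative, and the wildcard matching is an interpretation of the rule rather than a documented algorithm.

```python
def compatible(a, b, quasi_identifiers):
    """Records match if each quasi-identifier is equal, or missing (None) in either record."""
    return all(
        a[q] is None or b[q] is None or a[q] == b[q]
        for q in quasi_identifiers
    )


def reid_risk_with_missing(records, quasi_identifiers):
    """Risk per subject = 1 / number of records compatible with that subject (itself included)."""
    risk = {}
    for r in records:
        group_size = sum(compatible(r, other, quasi_identifiers) for other in records)
        risk[r["USUBJID"]] = 1 / group_size
    return risk


# Values from the slide's table; None marks the ages that do not appear.
table = [
    {"USUBJID": "CT1/101", "SEX": "M", "AGE": 26},
    {"USUBJID": "CT1/102", "SEX": "F", "AGE": 25},
    {"USUBJID": "CT1/103", "SEX": "M", "AGE": None},
    {"USUBJID": "CT1/104", "SEX": "F", "AGE": None},
    {"USUBJID": "CT1/105", "SEX": "M", "AGE": 26},
]

risk = reid_risk_with_missing(table, ["SEX", "AGE"])
for subject, value in risk.items():
    print(subject, round(value, 2))
```

With this data the sketch gives 0.33 for subjects 101, 103 and 105 and 0.5 for subjects 102 and 104, matching the Re-id Risk column. Wildcard matching is not transitive in general, so on larger datasets the resulting groups may not form clean equivalence classes.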
Thoughts on Component 1
Component 1 of re-id risk can be eliminated from the calculation based on the following justification:
• The process is robust enough and will not miss annotating any identifier.
• It is expected that, at the least, no direct identifiers are leaked in clinical documents. One direct identifier that fails to be anonymized means definite re-identification in case of public disclosure of the documents. Hence, missing the anonymization of even a single direct identifier for a single patient is not affordable.
• Clinical documents that are to be submitted for phase 1 of Policy 0070 are not expected to be very lengthy, as appendices containing individual patient data listings are not in scope for phase 1. Medical writers or someone with good exposure to ICH E3 will have an idea of which sections may contain PPD information. This way, any document anonymized using a tool can be quickly reviewed for missed annotations by an experienced person. This will ensure that there is no missed annotation for direct or quasi-identifiers.
Q&A Thank you