G - Derive Value From Excellence … Evaluation of Re-identification Risk for Anonymized Clinical Documents

Post on 25-Jul-2021

Page 1

Evaluation of Re-identification Risk for Anonymized Clinical Documents

Page 2

Agenda

1) Recent Background
2) EMA Policy 0070
3) Categorization of Clinical Data
4) Assessment of Anonymization
5) Criteria for Anonymization
6) Qualitative Risk
7) Quantitative Risk
   a) Component 1
   b) Component 2
8) Questions

Page 3

RECENT BACKGROUND

Page 4

EMA Policy 0070

Policy 0070 on clinical data

Phase 1 for clinical reports

• Module 2.5 (clinical overview)
• Module 2.7 (clinical summary)
• Module 5 (clinical study reports (CSRs) and Appendices 16.1.1 (protocol and protocol amendments), 16.1.2 (sample case report form) and 16.1.9 (documentation of statistical methods))

Phase 2 for IPD (individual patient data)

Yet to be implemented

Key Note: The External Guidance was recently updated to provide a specific list of documents to be published under Phase 1.

Page 5

Categorization of Clinical Data

Quasi-Identifiers:
• Demographic: Race, Ethnicity, Date of Birth, Age, Sex, Country
• Longitudinal: Adverse Events (Serious): Death, Hospitalization; Start/End of AE
– MHTERM
– SITEID

Direct Identifiers:
• USUBJID, INVNAME

Sensitive Information:
• Substance abuse, mental health disorders, HIV, reproductive health, sexually transmitted diseases
– MHTERM (text)
– medical history

Risk: The data may lead, directly or indirectly, to identification of subjects participating in the study. Under EU legislation, this personal data must be protected.
Adversaries: Deliberate attempt, acquaintance, data breach
Remedy: Anonymization of clinical trial data to significantly reduce the risk of re-identification of subjects.

Page 6

Assessment of Anonymization

Methods to establish if data has been properly anonymized:

1) Demonstrate that after anonymization, it is no longer possible to:
§ Single out an individual from the rest of the data ("Singling Out")
§ Link datasets and records related to an individual together and use the combined information to identify that individual ("Linkability")
§ Infer an attribute of a subject from another attribute with some significant probability ("Inference")

Note: It is very difficult to meet the above criteria and maintain data utility at the same time. The criteria can only be met if the CSRs do not contain any information on direct and quasi-identifiers.

OR

2) Perform an analysis of re-identification risk, including:
§ An appropriate risk metric
§ A suitable/established threshold
§ An actual, calculated measurement of risk

Page 7

Criteria for Anonymization

[Figure labels: Anonymization, Data Utility, Risk of Re-identification]

1) Meet established thresholds for risk of re-identification – for EMA: 0.09

2) Confirm anonymization methods do not impact data utility, i.e., generally defined as the ability to replicate analyses/results

Page 8

Qualitative Risk Assessment

The approach measures risk as a qualitative value, i.e. low, medium or high.

* copied from TransCelerate guidance document

Other factors to be considered: Age range, race, number of identifiers within the data/report.

Page 9

Quantitative Risk Assessment

The approach measures risk as a numerical value… Recent literature [Scaiano et al. (2016) & Kniola (2016)] has identified two components of the risk of re-identification for de-identified data:

1) Risk due to failing to annotate/redact any direct or quasi-identifier that was planned to be annotated.

2) Risk due to quasi-identifiers that are caught by the process and either transformed or left unchanged in the clinical document.

Page 10

Component 1

Probability of re-identification due to leakage of identifiers:

Pr(reid, leak) = Pr(reid | leak) × Pr(leak)

Where…

– Pr(reid | leak) = 1 for direct identifiers
– Pr(reid | leak) is computed for quasi-identifiers based on empirically supported heuristics, assuming that a leak in two quasi-identifiers would lead to re-identification of the patient

Including "Hiding in plain sight" factor:

– Pr(leak) = 1 − recall, where recall is the proportion of true annotations to total annotations using the process (tool alone, tool + manual, manual only)
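
The Component 1 formula can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `component1_risk`, the 98% recall figure, and the 0.5 heuristic value for quasi-identifiers are assumptions made for the example, not values from the slides.

```python
def component1_risk(recall: float, pr_reid_given_leak: float = 1.0) -> float:
    """Pr(reid, leak) = Pr(reid | leak) * Pr(leak), with Pr(leak) = 1 - recall.

    pr_reid_given_leak defaults to 1.0, the value the slide gives for
    direct identifiers; for quasi-identifiers a heuristic value is passed in.
    """
    if not 0.0 <= recall <= 1.0:
        raise ValueError("recall must be between 0 and 1")
    if not 0.0 <= pr_reid_given_leak <= 1.0:
        raise ValueError("pr_reid_given_leak must be between 0 and 1")
    pr_leak = 1.0 - recall
    return pr_reid_given_leak * pr_leak


# Hypothetical inputs: the annotation process has 98% recall.
direct_risk = component1_risk(recall=0.98)                         # direct identifiers
quasi_risk = component1_risk(recall=0.98, pr_reid_given_leak=0.5)  # assumed heuristic
print(f"direct: {direct_risk:.3f}, quasi: {quasi_risk:.3f}")
```

With perfect recall the function returns 0, which is the situation the later slide "Thoughts on Component 1" argues for.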

Page 11

Component 2

• EMA states that the "probability of re-identification of a record in a data set is 1 divided by the frequency of trial participants with same category/value of a set of the quasi identifiers (group size)." For example:

[Example table not captured in the transcript]

• Structured IPD is available to the trial sponsor but not to the researcher, who receives de-identified clinical documents.

Risk due to Component 2 will be calculated using IPD data, considering the quasi-identifiers appearing directly in clinical documents.

Note: Population proportion consideration is not included here.
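
The "1 divided by group size" rule quoted from EMA can be sketched as follows. The rows in `ipd` are invented for illustration and are not the slide's example table; the function name `reid_risk_by_record` and the subject IDs are likewise assumptions.

```python
from collections import Counter


def reid_risk_by_record(records, quasi_identifiers):
    """Return {USUBJID: 1 / group size} for the chosen quasi-identifiers.

    A group (equivalence class) is the set of records sharing the same
    values for every quasi-identifier in `quasi_identifiers`.
    """
    keys = {r["USUBJID"]: tuple(r[q] for q in quasi_identifiers) for r in records}
    group_sizes = Counter(keys.values())
    return {subject: 1.0 / group_sizes[key] for subject, key in keys.items()}


# Hypothetical IPD rows, invented for this example only.
ipd = [
    {"USUBJID": "S1", "SEX": "M", "AGE": 40},
    {"USUBJID": "S2", "SEX": "M", "AGE": 40},
    {"USUBJID": "S3", "SEX": "F", "AGE": 40},
    {"USUBJID": "S4", "SEX": "F", "AGE": 52},
]

risk_sex_age = reid_risk_by_record(ipd, ["SEX", "AGE"])
risk_sex_only = reid_risk_by_record(ipd, ["SEX"])
print(risk_sex_age)
print(risk_sex_only)
```

Using fewer quasi-identifiers gives larger groups and therefore lower per-record risk, which is why it matters which quasi-identifiers actually appear in the clinical documents.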

Page 12

Scenario 1 – Conservative approach

If quasi-identifiers are replaced with de-identified values from IPD data, the risk assessed considering Component 1 and Component 2 jointly will be given by:

[Formula combining Component 1 and Component 2; the operator is not captured in the transcript]

Component 1 will be calculated as suggested in the previous slides, and Component 2 will be the major part of the risk. Component 2 will be the maximum of the individual risks over all subjects. For example, referring to table 1:
• If all subjects appeared in clinical reports as in table 1, then the risk due to Component 2 will be 1.
• If only subjects 101 and 102 appeared in the clinical report, then the risk will be 0.5.

This approach does not consider the appearance of quasi-identifiers in clinical reports.
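
A minimal sketch of the conservative Component 2 calculation: group sizes come from the full IPD, and the result is the maximum individual risk among the subjects who appear in the document. The data, subject IDs, and the function name `component2_conservative` are hypothetical; table 1 is not available in the transcript, so the rows were chosen only to mirror the pattern of the two bullet points (1 when all subjects appear, 0.5 when only two matching subjects appear).

```python
from collections import Counter


def component2_conservative(ipd, quasi_identifiers, subjects_in_document):
    """Maximum of 1 / group size over the subjects appearing in the document.

    Group sizes are counted over all IPD records, using every listed
    quasi-identifier whether or not it appears in the document.
    """
    keys = {r["USUBJID"]: tuple(r[q] for q in quasi_identifiers) for r in ipd}
    sizes = Counter(keys.values())
    risks = [1.0 / sizes[keys[s]] for s in subjects_in_document]
    return max(risks) if risks else 0.0


# Hypothetical IPD, invented for illustration (not the slide's table 1).
ipd = [
    {"USUBJID": "H1", "SEX": "M", "AGE": 30},
    {"USUBJID": "H2", "SEX": "M", "AGE": 30},
    {"USUBJID": "H3", "SEX": "F", "AGE": 61},
]

all_subjects = component2_conservative(ipd, ["SEX", "AGE"], ["H1", "H2", "H3"])
two_subjects = component2_conservative(ipd, ["SEX", "AGE"], ["H1", "H2"])
print(all_subjects, two_subjects)
```

Taking the maximum means a single subject in a group of size one drives the whole document's Component 2 risk to 1.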

Page 13

Scenario 2 – Exact approach

If the process of de-identification of clinical documents is able to detect the information around quasi-identifiers for each subject appearing in the clinical document, then the risk assessed for Component 2 will be more exact. For example, referring to table 1, if only sex appears for subjects in clinical reports, then the risk will be as below:

[Table not captured in the transcript]

If subjects 101 and 102 appear in the clinical report with age information, then the risk due to Component 2 will be 0.2.

Page 14

Continued

What if not all quasi-identifiers appear for every subject? For example, as shown in the table below, age does not appear for two subjects, i.e. 103 and 104.

USUBJID   SEX   AGE   Equivalence Class   Re-id Risk
CT1/101   M     26    A1(3)               0.33
CT1/102   F     25    B(2)                0.5
CT1/103   M           A(3)                0.33
CT1/104   F           C(1)                0.5
CT1/105   M     26    A1(3)               0.33

• Re-id risk for subjects 103 and 104 will be calculated considering the SEX variable only, i.e. 1/number of repeated values in the current dataset.

• Re-id risk for the other subjects needs some value for the missing age of the two subjects. Options:

1) Assuming the missing value could be any value, and counting these subjects in each equivalence class

2) Assigning missing values using probabilities calculated on the remaining data.

• In the above example, it was assumed that a subject with missing age will be counted in the equivalence classes where the value of the other variable matches. Subject 103 is counted in the equivalence class of subjects 101 and 105.
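
The rule used in this slide's example can be sketched as follows, using the SEX and AGE values from the table with missing ages encoded as `None`. The function name `reid_risk_with_missing_age` and the `None` encoding are assumptions; the sketch implements the "could be any value" assumption (option 1) and reproduces the Re-id Risk column only, not the equivalence class labels.

```python
def reid_risk_with_missing_age(records):
    """Per-subject re-id risk when AGE may be missing (None).

    - A subject with missing AGE is grouped on SEX alone.
    - A subject with known AGE is grouped with records of the same SEX
      whose AGE either matches or is missing (a missing age is assumed
      to be able to take any value).
    """
    risks = {}
    for rec in records:
        same_sex = [r for r in records if r["SEX"] == rec["SEX"]]
        if rec["AGE"] is None:
            group = same_sex
        else:
            group = [r for r in same_sex if r["AGE"] in (rec["AGE"], None)]
        risks[rec["USUBJID"]] = 1.0 / len(group)
    return risks


# SEX and AGE values as shown in the slide's table; None marks a missing age.
table = [
    {"USUBJID": "CT1/101", "SEX": "M", "AGE": 26},
    {"USUBJID": "CT1/102", "SEX": "F", "AGE": 25},
    {"USUBJID": "CT1/103", "SEX": "M", "AGE": None},
    {"USUBJID": "CT1/104", "SEX": "F", "AGE": None},
    {"USUBJID": "CT1/105", "SEX": "M", "AGE": 26},
]

risks = reid_risk_with_missing_age(table)
for subject, risk in risks.items():
    print(subject, round(risk, 2))
```

Option 2 (assigning missing values by probability) would need a model of the age distribution and is not sketched here.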

Page 15

Thoughts on Component 1

Component 1 of re-id risk can be eliminated from the calculation based on the following justification:

• The process is robust enough and will not fail to annotate any identifier.

• It is expected that, at the least, no direct identifiers are leaked in clinical documents. One direct identifier missed during anonymization means definite re-identification if the documents are publicly disclosed. Hence, missing the anonymization of even a single direct identifier for a single patient is not affordable.

• Clinical documents to be submitted for phase 1 of Policy 0070 are not expected to be very lengthy, as appendices containing individual patient data listings are not in scope for phase 1. Medical writers, or someone with good exposure to ICH E3, will have an idea of which sections may contain PPD information. Any document anonymized using a tool can therefore be quickly reviewed for missed annotations by an experienced person. This will ensure that there is no missed annotation for direct or quasi-identifiers.

Page 16

Q&A

Thank you