Poster: Hye-Chung Kum, PhD, Darshana Pathak, Gautam Sanka




THREE MODELS OF DATA ACCESS

1. DECOUPLED DATA SYSTEM

• A simple but powerful data system for Privacy Preserving Interactive Record Linkage.

• Decouples (i.e., isolates) sensitive data (SD) from the personally identifying information (PII).

• Provides both error management in data integration and privacy protection by blocking attribute disclosure and minimizing identity disclosure.

4 STEPS IN DECOUPLING DATA

1. Split: Divide the data set into two tables, one for the identifying information (PII) and the other for the remaining, mostly sensitive, information: $T = T_{PII} + T_{SD}$.

2. Shuffling: Randomly shuffle the rows in the PII table, $T_{PII}$.

3. Chaffing: Add fake rows of PII to $T_{PII}$.

4. Encryption: Apply asymmetric encryption to lock the row association between $T_{PII}$ and $T_{SD}$.

2. MINIMUM INFORMATION SHARING

Information suppression during clerical review

3. CHAFFING & UNIVERSE MANIPULATION

Identity disclosure without sensitive attribute disclosure has little potential for harm.

We evaluate three methods for information disclosure:

1. Chaffing
2. Manipulation of universe
   2.1 Fabrication
   2.2 Non-disclosure
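The four decoupling steps (split, shuffle, chaff, encrypt) described for the decoupled data system can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SDLink implementation: the function names `decouple` and `relink`, the `pii_id` and `lock` columns, and the sample field names are all hypothetical. The poster specifies asymmetric encryption for step 4; because the Python standard library has no asymmetric cipher, the sketch substitutes a keyed HMAC pseudonym as a stand-in for the locked row association. It also shuffles after chaffing so that the fake rows do not sit together at the end of the table.

```python
import hashlib
import hmac
import random
import secrets


def decouple(rows, pii_fields, fake_pii_rows, key, rng=None):
    """Split rows into a PII table and an SD table, then chaff, shuffle, and lock."""
    rng = rng or random.Random()
    t_pii, t_sd = [], []
    for row in rows:
        # Step 1: split each record into its PII part and its SD part.
        pii = {f: row[f] for f in pii_fields}
        sd = {f: v for f, v in row.items() if f not in pii_fields}
        # Random identifier that carries no information about the person.
        pii_id = secrets.token_hex(8)
        # Step 4 (stand-in for asymmetric encryption): only a key holder
        # can recompute this token and so recover the row association.
        lock = hmac.new(key, pii_id.encode(), hashlib.sha256).hexdigest()
        t_pii.append({"pii_id": pii_id, **pii})
        t_sd.append({"lock": lock, **sd})
    # Step 3: chaff the PII table with fake rows that have no SD counterpart.
    for fake in fake_pii_rows:
        t_pii.append({"pii_id": secrets.token_hex(8), **fake})
    # Step 2: shuffle the PII table so row position reveals nothing.
    rng.shuffle(t_pii)
    return t_pii, t_sd


def relink(pii_row, t_sd, key):
    """Key holder only: return the SD row for a PII row, or None for chaff."""
    lock = hmac.new(key, pii_row["pii_id"].encode(), hashlib.sha256).hexdigest()
    return next((sd for sd in t_sd if sd["lock"] == lock), None)
```

In this sketch, someone who sees only the PII table cannot tell real rows from chaff, and someone who sees only the SD table has no identifiers; joining the two requires the key.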

ABSTRACT

Ambiguous links must be manually reviewed during approximate record linkage to enable accurate data integration. This requirement would seem to make it impossible for researchers to protect patients' privacy when integrating health informatics data. To address this problem, we propose a novel decoupled data system that blocks sensitive attribute disclosure via encryption and chaffing. We also evaluate three methods (chaffing, display control for clerical review, and manipulation of the universe around the data) that can minimize identity disclosure.

INTRODUCTION

• Why is record linkage (RL) important?
There is a constant need for record linkage to create a coherent "Big Data" system from data originating in heterogeneous, uncoordinated systems.

• Why is record linkage challenging?
Redundant and fragmented datasets are split over multiple systems. Missing and erroneous attribute values, with no unique, error-free identifiers, require approximate record linkage, which results in errors from false or uncertain matches [3,4].

• What is Privacy Preserving Record Linkage?
Identifying the records in one or more datasets that represent the same real-world entity without compromising the privacy of the subjects involved [5,8].

• What is Interactive Record Linkage?
Record linkage in which people tune and manage the false matches from the approximate record linkage algorithms. We define the properly tuned output from a hybrid human-machine data integration system as high quality record linkage [7].

PRIVACY PRESERVING RECORD LINKAGE

• First generation: Hash-based exact match (2003) [1].

• Second generation: Improves the quality of linkages by allowing approximate matches, using privacy preserving approximate string comparison operations such as Bloom filters (2009) [6].

• Third generation [our model]: High quality RL using a hybrid human-machine data integration system for privacy preserving interactive record linkage (2012) [5].
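To make the difference between the first two generations concrete, the sketch below contrasts a hashed exact match with a Bloom filter comparison of character bigrams scored by the Dice coefficient. It is a simplified illustration, not the encoding of reference 6: the function names, the filter size of 1000 bits, the 10 hash functions, and the unkeyed salted SHA-256 hashing are all assumptions chosen for brevity (a real deployment would use keyed hashes shared only between the data holders).

```python
import hashlib


def exact_token(value):
    """First-generation style: a hash that only supports exact matching."""
    return hashlib.sha256(value.lower().encode()).hexdigest()


def bigrams(value):
    """Character bigrams, padded so the first and last letters count too."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, size=1000, num_hashes=10):
    """Return the set of bit positions set in a Bloom filter of the bigrams."""
    bits = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            # Salting with k simulates num_hashes independent hash functions.
            digest = hashlib.sha256(f"{k}:{gram}".encode()).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(bits_a, bits_b):
    """Dice coefficient of two bit-position sets (1.0 means identical)."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))
```

With these definitions, `exact_token("Smith")` and `exact_token("Smyth")` differ completely, so a single typo defeats the exact match, while `dice(bloom_encode("Smith"), bloom_encode("Smyth"))` stays high because four of the six bigrams are shared. Pairs in the uncertain middle of that score range are the ones that still need a human reviewer.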

Decoupled Information System for Privacy Preserving Interactive Record Linkage

A tractable computational model for privacy preserving interactive record linkage (PPIRL) focusing on protection against attribute disclosure.

Three techniques SDLink utilizes for privacy protection:

1. Strict decoupling via Trusted Platform Module (TPM) based encryption (pseudonym method).

2. Minimum information sharing during human interaction via information suppression.

3. Chaffing: adding fake data to block attribute inference from group membership.

PROBLEM STATEMENT: PRIVACY PRESERVING INTERACTIVE RECORD LINKAGE (PPIRL)

[Figure labels: Approximate Record Linkage; Human in the loop to resolve ambiguous links; Threat of sensitive attribute disclosure]

Let

$I_{PPIRL}$ = the category of information $I$ in the Minimal Sharing model;

$h$ = a person manually tuning the false matches;

$\alpha, \epsilon$ = respective error terms;

such that

• InteractiveRL($h$, $\alpha$) is the minimum amount of information the person $h$ needs to make linkage decisions with high confidence, and

• Disclosure($h$, $\epsilon$) is the level of information disclosed to the honest-but-curious user $h$.

Then Privacy Preserving Interactive Record Linkage (PPIRL) is defined as the query operation PPIRL($D_R$, $D_S$, $I_{PPIRL}$, $h$) in the minimal sharing model*, where $D_R$ and $D_S$ are the two tables to be linked, $h$ is an honest-but-curious human in the loop making a final judgment on linkage, and $I_{PPIRL}$ is the minimal information to be shared with the human $h$.

METHOD: SECURE DECOUPLED LINKAGE (SDLink)

KEY INSIGHT

• The innovation in decoupling data is the focus on revealing information rather than hiding it.

• The key is to understand the minimum information required for quality linkage, and then to design protocols that reveal, in a secure manner, only that information.
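One way to picture "reveal only the minimum information" during clerical review is a display function that shows the reviewer only the linkage fields, and within those only the characters where two candidate records disagree. The sketch below is purely illustrative: the poster does not specify a masking rule, so `mask_agreeing`, `review_view`, the position-by-position comparison, and the sample field names are all hypothetical.

```python
from itertools import zip_longest


def mask_agreeing(a, b, mask="*"):
    """Mask positions where a and b agree; reveal only where they differ."""
    out_a, out_b = [], []
    # Pad the shorter string with "_" so a length difference stays visible.
    for ca, cb in zip_longest(a, b, fillvalue="_"):
        if ca == cb:
            out_a.append(mask)
            out_b.append(mask)
        else:
            out_a.append(ca)
            out_b.append(cb)
    return "".join(out_a), "".join(out_b)


def review_view(rec_a, rec_b, linkage_fields):
    """Build the reviewer's view: linkage fields only, agreeing text masked."""
    view = {}
    for field in linkage_fields:
        # Fields outside linkage_fields (e.g. sensitive attributes) are
        # never copied into the view, so they are suppressed entirely.
        view[field] = mask_agreeing(str(rec_a.get(field, "")),
                                    str(rec_b.get(field, "")))
    return view
```

For example, `mask_agreeing("Smith", "Smyth")` yields `("**i**", "**y**")`: the reviewer can see that the pair differs by one letter without seeing the full name. A position-wise comparison is crude when characters are inserted or deleted, so a practical design would likely align the strings first.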

• The survey results confirmed that chaffing, together with either falsifying or not defining the universe around the data, was effective in introducing uncertainty into the information disclosed.

• Under non-disclosure of the universe, 56% of the participants were uncertain about the identity given a common name.

• Even for rare names, if the list is chaffed and the universe is not defined, 66% of the participants were uncertain about the identity.
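The intuition behind chaffing can be shown with a short sketch: fake names are mixed into the displayed list, so seeing a name on the list no longer implies that the person belongs to the underlying group. The helper names `chaff` and `prob_real` and the sample names are hypothetical, and the probability shown assumes the viewer cannot distinguish fake rows from real ones, which is an idealization rather than a result from the survey.

```python
import random


def chaff(real_names, fake_pool, n_fake, rng):
    """Mix n_fake names from fake_pool into the real list and shuffle."""
    # Exclude candidates that collide with a real name.
    candidates = [name for name in fake_pool if name not in real_names]
    fakes = rng.sample(candidates, n_fake)
    mixed = list(real_names) + fakes
    rng.shuffle(mixed)
    return mixed


def prob_real(n_real, n_fake):
    """Chance that a displayed row is real if fakes are indistinguishable."""
    return n_real / (n_real + n_fake)
```

Under that assumption, a list of 2 real names chaffed with 3 fakes leaves a viewer with `prob_real(2, 3)`, i.e. a 0.4 chance that any given row is a true member. Leaving the universe undefined adds further uncertainty, because the viewer does not know which group the list was drawn from.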

*Minimal Sharing Model [Agrawal 2003]

Let there be two parties R (receiver) and S (sender) with databases $D_R$ and $D_S$ respectively. Given a database query Q spanning the tables in $D_R$ and $D_S$, and some categories of information I, compute the answer to Q and return it to R without revealing any additional information to either party except for information contained in I.

$$\mathrm{InteractiveRL}(h, \alpha) \le I_{PPIRL} < \mathrm{Disclosure}(h, \epsilon)$$

It is important to note that the current norm for data integration in the US is full disclosure of all information to a fully trusted human entity. For example, full disclosure of both attribute and identity to certain trusted parties is HIPAA compliant.

CONCLUSION

Information suppression is essential during clerical review to avoid sensitive attribute disclosure. Furthermore, when chaffing is used in combination with non-disclosure of the universe, even rare names can be displayed with minimal risk of attribute disclosure during clerical review. Our proposed methods are effective in the presence of missing and erroneous data.

REFERENCES

1. Agrawal R, Evfimievski A, Srikant R. Information sharing across private databases. In SIGMOD 2003, pp 86-97, New York, NY, USA, 2003. ACM.

2. Boyd A, Saxman P, Hunscher D, et al. The University of Michigan Honest Broker: A Web-based Service for Clinical and Translational Research and Practice. J Am Med Inform Assoc. 2009 Nov-Dec; 16(6): 784-791.

3. Elfeky M, Verykios V, Elmagarmid A. TAILOR: A Record Linkage Toolbox. In ICDE 2002. IEEE Computer Society, Washington, DC, USA.

4. Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng. 2007; 19(1): 1-16.

5. Kum HC, Ahalt S, Pathak D. Privacy Preserving Data Integration Using Decoupled Data. In Security and Privacy in Social Networks, Y. Elovici, Y. Altshuler, A. Cremers, N. Aharony, A. Pentland (Eds), Springer 2012.

6. Schnell R, Bachteler T, Reiher J. Privacy-preserving record linkage using Bloom filters. BMC Medical Informatics and Decision Making. 2009; 9(41).

7. Wang J, Kraska T, Franklin MJ, Feng J. CrowdER: Crowdsourcing Entity Resolution. Proceedings of the VLDB Endowment (PVLDB) 5(11), 2012.

8. Vatsalan D, Christen P, Verykios VS. A taxonomy of privacy-preserving record linkage techniques. Information Systems. Available online 27 Nov 2012.

CONTACT / ACKNOWLEDGMENTS

Hye-Chung Kum, PhD ([email protected])

We thank Mike Reiter and Ashwin Machanavajjhala for their insightful comments, Fabian Monrose for supporting the research, and Ian Sang-Jun Kim and Ren Bauer for their assistance with the experiment. This research was supported in part by funding from the NC Department of Health and Human Services, NIH CTSA UL1TR000083, and NSF award no. CNS-0915364.

FUTURE WORK: POPULATION INFORMATICS

Today, nearly all of our activities from birth until death leave digital traces in large databases. Together, these digital traces capture our social genome, the footprints of our society. Like the human genome, the social genome has much buried in its massive, almost chaotic data. If properly analyzed and interpreted, this social genome could offer crucial insights into many of the most challenging problems facing our society (e.g., affordable and accessible quality healthcare). The burgeoning field of population informatics is the systematic study of populations via secondary analysis of massive data collections (termed "big data") about people. In particular, health informatics analyzes electronic health records to improve health outcomes for a population.