exploring new methods for protecting and distributing confidential research data

Post on 12-May-2015


DESCRIPTION

I gave this presentation at the Fall 2009 CNI Membership meeting.

TRANSCRIPT

Exploring New Methods for Protecting and Distributing Confidential Research Data

Bryan Beecher
Felicia LeClere
ICPSR/University of Michigan

Today’s Talk

• What’s ICPSR?
• How do organizations distribute confidential research data today?
• What are the problems?
• What can we improve?

What’s ICPSR?

• Inter-university Consortium for Political and Social Research
  – JSTOR for social science data

• Serving billions since 1962

Who does ICPSR serve?

• Research universities
  – Discover and download data
• Teaching universities and colleges
  – On-line analysis
• Federal agencies
  – Data management, preservation, and dissemination

Distributing data

• Most of our content is public-use
  – Anonymized public opinion
  – Aggregate government data
• Little risk of disclosure
• But what about the good stuff?

Distributing sensitive data

• Higher risk of breach of confidentiality
  – Variables that give geographic information that might be combined with other data sources to identify a respondent

• Requires special handling

Distributing sensitive data

• Researcher agrees to protect the data and identities

• Delivered securely

• Harsh penalty deterrent

http://www.flickr.com/photos/lwr/521394398

National Longitudinal Study of Adolescent Health

• Add Health
  – Highly used and cited study
• Very frank questions
  – Kids in 7th through 12th grade
• Carolina Population Center
• Gold standard in data protection

Traditional Approach

http://www.flickr.com/photos/videolux/2389320345/

http://www.flickr.com/photos/curiousexpeditions/3767246490/

Traditional Approach

Confidential research data

Apply for access

Write security plan

Repeat

Can we improve upon it?

• Paperwork
  – How do we speed the application process?
• Security
  – How do we ensure the data are going to a good home?

Paperwork

• Web portal
  – Research plan
  – IRB approval
  – CVs
  – Confidentiality agreements

Paperwork

• Web portal
  – Behavioral questionnaire
  – Electronic copy of contract (HTML, PDF)
  – Database back-end to drive workflow systems

Restricted Data Contracting System

• Integrated with ICPSR’s existing Web download mechanism

• Collects information that would ordinarily be provided through paper

• “Tickler” system to send reminders, nag about missing items
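The "tickler" idea above can be illustrated with a small sketch. This is not ICPSR's implementation: the checklist identifiers, the `Application` class, and the message format are all hypothetical, chosen only to mirror the application items named on the Paperwork slides.

```python
from dataclasses import dataclass, field

# Checklist drawn from the items named on the Paperwork slides.
# The identifiers themselves are illustrative assumptions.
REQUIRED_ITEMS = (
    "research_plan",
    "irb_approval",
    "cvs",
    "confidentiality_agreements",
    "behavioral_questionnaire",
)


@dataclass
class Application:
    """Hypothetical record of what one applicant has submitted so far."""
    applicant_email: str
    received: set = field(default_factory=set)


def missing_items(app):
    """Return the required items not yet received, in checklist order."""
    return [item for item in REQUIRED_ITEMS if item not in app.received]


def reminder(app):
    """Build a reminder message, or return None if nothing is missing."""
    missing = missing_items(app)
    if not missing:
        return None
    return f"To: {app.applicant_email}\nStill missing: {', '.join(missing)}"


app = Application("researcher@example.edu", {"research_plan", "cvs"})
print(reminder(app))
```

A real system would presumably read the received items from the database back-end and send the message on a schedule; here the reminder is only printed.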

Security

• Current system relies on…
  – The data provider to maintain security templates
  – The researcher to write an IT security plan
  – The data provider to read and understand the plan
  – The researcher to execute the plan

Current access model

[Diagram: Researcher, Workstation, ICPSR]

A new access model?

[Diagram: Researcher, Workstation, ICPSR, with a Secure Area]

Secure area = the cloud?

• Cloud-based access
  – Convenient
  – Scalable
  – Economical
  – Perfect?

http://www.flickr.com/photos/docbudie/2240764187/

What could go wrong?

Almost everything

• Is the cloud reliable?
• Will the data be safe?
• We are building an analytic environment for a researcher; how will we know what to provide?

• Will this perform well for the researcher?

• This is the main story…

Cloud reliability

• Already using the cloud for DR purposes since January 2009

• The Merit Network Operations Center monitors all of our stuff

• Ping, HTTP GET every minute, 24 x 7
• Results?
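Probing every minute, as described above, produces a long series of pass/fail results. The following is a minimal sketch, assuming each probe is recorded as a boolean, of how such a series could be reduced to an availability figure. The function names and the sample data are made up for illustration; they are not Merit's or ICPSR's tooling or measurements.

```python
def availability(probe_results):
    """Fraction of probes that succeeded (probe_results: iterable of bools)."""
    results = list(probe_results)
    if not results:
        raise ValueError("no probes recorded")
    return sum(results) / len(results)


def downtime_minutes(probe_results, interval_minutes=1):
    """Approximate downtime, counting each failed probe as one full interval."""
    return sum(1 for ok in probe_results if not ok) * interval_minutes


# Made-up example: one day of one-minute probes, three of which failed.
day = [True] * 1437 + [False] * 3
print(f"availability: {availability(day):.4%}")
print(f"downtime: {downtime_minutes(day)} min")
```

Running the same calculation over the local and the cloud probe logs would give two directly comparable numbers.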

Local v. cloud – CY 2009

Conclusion

• Cloud has been more reliable than local environment

• If local power were more reliable, the cloud would still come out ahead, but only by a little

• Certainly seems to be good enough

Cloud security

• Absolute security?
  – Who cares?
• More secure than the typical WinTel desktop of a social science researcher?
  – That’s the goal

http://www.flickr.com/photos/amagill/235453953

Current practice

• Data archive maintains per-platform guidelines on IT security

• Researcher downloads a template and writes his/her own IT security plan

• Data provider reviews plan; approves or iterates until approved or rejected

Sample items

– I secured the computer on which the Add Health data resides in a locked room, or secured the computer to a table with a lock and cable (locking the case so the battery cannot be removed).

– I turned off all unneeded services and disabled unneeded network protocols.

Brutal facts

• Data providers are not IT experts
• Researchers are not experts in IT security
• Even if the system is secure on Day One, what assurance is there that it continues to be secure?

http://www.flickr.com/photos/42dreams/1878611309

Our approach to security

• Leverage tools from the cloud provider (AWS access control lists)

• Leverage tools from UMich (regular Retina and Nessus scans)

• Engage a white hat hacker to probe and evaluate the system
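To make the access-control-list idea concrete, here is a conceptual sketch of the check such a list performs: admit a connection only if its source address falls inside an approved network. It uses Python's standard `ipaddress` module and does not call any AWS API; the networks shown are reserved documentation ranges, not real ICPSR or researcher addresses.

```python
import ipaddress


def is_allowed(source_ip, allowed_networks):
    """Return True if source_ip falls inside any of the allowed CIDR blocks."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(net) for net in allowed_networks)


# Hypothetical allowlist: one campus block and one single researcher host.
ALLOWED = ["192.0.2.0/24", "198.51.100.17/32"]

print(is_allowed("192.0.2.45", ALLOWED))   # inside the campus block
print(is_allowed("203.0.113.9", ALLOWED))  # outside every listed network
```

The cloud provider would enforce such a rule at the network edge; the point of the sketch is only that the rule is a short, auditable list rather than a per-researcher security plan.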

Conclusion

• Expecting researchers to build and maintain secure IT environments is not reasonable

• We think we can build something at least as secure in the cloud

• We’ll evaluate our environment using outside evaluators

What to deploy?

• The new model means we need to distribute a working analytic environment, not just the data

• Also gives the researcher the opportunity to limit access to only a subset of contractees

May I Take Your Order?

• Operating system?

• Analysis software?

• Who’s allowed to use the system?

• Anything else?

http://www.flickr.com/photos/stephenpougas/2267503544

The ACI Chooser

• Analytic Cloud Instance
  – Cumulus
• The ACI Chooser
• Takes your order
• Brings your ACI to your table (in the cloud)
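One way to picture the "order" is as a small record holding the answers to the questions on the previous slide. The sketch below is an assumption about what such a record might look like, not the actual ACI Chooser; the class name, field names, and the example values ("Windows", "Stata") are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AciOrder:
    """Hypothetical order for an Analytic Cloud Instance."""
    operating_system: str
    analysis_software: tuple
    authorized_users: tuple

    def problems(self):
        """List what is still unanswered before the order can be filled."""
        issues = []
        if not self.operating_system:
            issues.append("no operating system chosen")
        if not self.analysis_software:
            issues.append("no analysis software chosen")
        if not self.authorized_users:
            issues.append("no authorized users listed")
        return issues


# Example values are placeholders, not options the real chooser is known to offer.
order = AciOrder("Windows", ("Stata",), ())
print(order.problems())
```

An order with an empty problem list could then be handed to whatever process launches the instance.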

Conclusion

• We’re building this now
• Issues to resolve
  – How do we get passwords to people?
  – Remote access mechanism?
    • Citrix? Terminal Services?
  – Should we encrypt the data?

Performance

• Will a cloud-based analysis system meet the expectations of a researcher?

• Will one size fit all?

Amazon EC2

• Regular
  – S (1 CPU, 2GB, $0.12)
  – L (4 CPU, 7GB, $0.48)
  – XL (8 CPU, 15GB, $0.96)
• High memory
  – XXL (13 CPU, 34GB, $1.44)
  – XXXXL (26 CPU, 68GB, $2.88)
• High CPU
  – M (5 CPU, 2GB, $0.29)
  – XL (20 CPU, 7GB, $1.16)
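The instance list above lends itself to a simple selection rule: pick the cheapest size that meets a researcher's CPU and memory needs. The sketch below encodes the figures from the slide as data. The `InstanceType` class and `cheapest` function are illustrative, and treating the dollar figure as a price per hour is an assumption, since the slide does not state the unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    family: str
    size: str
    cpu: int          # CPU figure as listed on the slide
    memory_gb: int
    price: float      # dollar figure from the slide; assumed to be per hour


# Values copied from the slide; nothing here is looked up from a live price list.
CATALOG = [
    InstanceType("regular", "S", 1, 2, 0.12),
    InstanceType("regular", "L", 4, 7, 0.48),
    InstanceType("regular", "XL", 8, 15, 0.96),
    InstanceType("high-memory", "XXL", 13, 34, 1.44),
    InstanceType("high-memory", "XXXXL", 26, 68, 2.88),
    InstanceType("high-cpu", "M", 5, 2, 0.29),
    InstanceType("high-cpu", "XL", 20, 7, 1.16),
]


def cheapest(min_cpu, min_memory_gb):
    """Return the lowest-priced type meeting both minimums, or None."""
    candidates = [
        t for t in CATALOG
        if t.cpu >= min_cpu and t.memory_gb >= min_memory_gb
    ]
    return min(candidates, key=lambda t: t.price, default=None)


choice = cheapest(min_cpu=4, min_memory_gb=4)
print(choice)
```

Re-running `cheapest` with larger minimums is one way to model moving an image to a bigger instance when a researcher outgrows the first one.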

Strategy

• Balance cost and performance
• Start small, but give opportunity to grow
  – Easy to move an image from one instance size to another

• Measure performance via researcher’s experience

Conclusion

• Partners
  – Panel Study of Income Dynamics (PSID)
  – Los Angeles Family and Neighborhood Study (LA FANS)
• Start small; re-launch larger
• Ask how well it works

Thanks and Final Thoughts

• Could preserve machine image + data + software + “program” for replication purposes

• enclavecloud.blogspot.com charts our adventures

• Cloud-related work sponsored by a recent NIH Challenge Grant
