A Mayo Clinic Big Data Implementation

Mayo Clinic Big Data Projects Experience
Brian Brownlow, Big Data Professional

Upload: bdpa-education-and-technology-foundation

Post on 25-Jun-2015


DESCRIPTION

Brian Brownlow is an experienced senior analyst programmer at Mayo Clinic. He gave a workshop presentation at the 2014 BDPA Technology Conference on the topic 'Big Data Implementation - Mayo Clinic Case Study'. This presentation tells part of the Mayo Clinic story of embarking on an exploration of 'Big Data' technologies. 'Big Data' is seen as one set of tools that can be used to enhance medical research, medical education, and practice management. Mayo Clinic is always searching for better, faster, and cheaper ways to use its data to improve patient care and sustain financial outcomes in a challenging reimbursement environment. Our approach combines several open source components with data from various sources to provide information to decision makers in near real time. We have created a center of 'Big Data' excellence using in-house staff and vendor engagements. 'Big Data' is one element of our Enterprise Data Trust framework.

TRANSCRIPT

Page 1: A Mayo Clinic Big Data Implementation

Mayo Clinic Big Data Projects

Experience

Brian Brownlow

Big Data Professional

Page 2: A Mayo Clinic Big Data Implementation

04/13/2023 2

What is Big Data?

• A silver bullet that will solve all the world's problems? NO

• An arrow in the IT quiver to help solve customer problems? YES

• Does anyone have large data problems? All sales transactions, log reviews, device output, text processing?

• How does your relational DB handle index creation or backup for 500,000,000,000 row tables?

• Popular things that are similar:
• SETI@home, many networked computers doing small pieces of work
• Watson, many networked computers working together to solve a problem
• What's one computer that beat a chess master? Kasparov – Deep Blue (1996–1997); there are others…

• Big data has been around a long time

• Why now? Bigger, cheaper, faster processing, memory, networking and disk

Page 3: A Mayo Clinic Big Data Implementation


Mayo Big Data Elements

• Patient Information
• Appointments
• Labs
• Images
• Genome
• Appointment Check-in/Check-out
• Report text
• Vitals
• Device reporting, e.g. Holter Monitor
• Many more, it keeps growing…

Page 4: A Mayo Clinic Big Data Implementation


Mayo Big Data Elements Potentially Affecting Patient Care

• ALL OF THEM!

• The more we know about a patient, the better we can build tools and models to help the care team improve patient care and help the business manage reimbursement.

Page 5: A Mayo Clinic Big Data Implementation


Mayo Big Data Initial Evaluations

• Hortonworks HDP on a Virtual Machine on my laptop
• HDP 1.3.2, 2.1 on Oracle VM
• HDP 1.3.2, 2.1 on VMware

• What can HDP do?
• Pig, Hive, HBase, HDFS, Ambari, Hue, MapReduce, Flume, Storm, ElasticSearch, Sqoop…
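The components above can be exercised even before a cluster exists; MapReduce jobs in particular are often prototyped in Hadoop Streaming style, where any program that reads lines and emits key/value pairs can serve as mapper and reducer. A minimal sketch in plain Python, runnable locally (the sample text is illustrative, not from the Mayo project):

```python
# Hadoop Streaming-style word count, runnable without a cluster.
# Streaming pipes lines into a mapper, sorts by key, then feeds the
# grouped pairs to a reducer; the same two functions could back a
# streaming job on HDP.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit (word, 1) for every word in every line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Sum counts per word; assumes pairs arrive sorted by word."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["Big Data at Mayo", "big data big tools"]
    counts = dict(reducer(sorted(mapper(text))))
    print(counts)  # {'at': 1, 'big': 3, 'data': 2, 'mayo': 1, 'tools': 1}
```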

Page 6: A Mayo Clinic Big Data Implementation


Mayo Big Data Presentations to Leadership

• What is “BIG DATA”? What is Hadoop?

• What are “BIG DATA” capabilities?

• Here is one way you can answer your customer queries about big data!

• Many people want to have a “BIG DATA” story

• Proved out at Mayo by some initial proof of concept projects

• Genomics on Cloudera (early work)
• HDP on Oracle VM (my project)
• Multi-node DEV environment on HDP 1.3.2 running CentOS on XenCenter and an outside edge node

• Helped by media hype.

Page 7: A Mayo Clinic Big Data Implementation


The Virtual Machine!

• Show it.

Page 8: A Mayo Clinic Big Data Implementation


Mayo Big Data DEV

Page 9: A Mayo Clinic Big Data Implementation


Big Data DEV Setup

• Lots of help on the web, Hortonworks website, other websites

• Using the latest version of CentOS: 6.5 (x64)

• Exported VM to CentOS6.5_Hadoop1.32_SSD3.ova

• Installed as a VM from Oracle VirtualBox on Citrix XenCenter

• Installed or Updated latest packages for yum, rpm, wget, curl, scp, pdsh, …

• Downloaded and generated local HDP repository in /etc/yum.repos.d (Note: 3 versions of HDP Hadoop stacks – 1.3.2, 1.3.3, 2.0.6)

• Configured network (hosts, security, firewall…)

• Installed Ambari (v1.4.4.23) and embedded postgresql DB (v8.4.18)

• Installed Hadoop components from Ambari

Page 10: A Mayo Clinic Big Data Implementation


Big Data DEV Environment

• Was it Perfect? NO
• Less stable than preferred due to enabled updates
• Lightly used
• Checked daily
• By the time of heavier use we had our INT and PROD environments, so we didn't need DEV

• Was It Good Enough? YES

Page 11: A Mayo Clinic Big Data Implementation


Mayo Big Data Platform RFP

• Sent out RFP, got demos based on a use case we submitted with the RFP

• IBM BigInsights
• Cloudera Hadoop Distribution
• Teradata/Hortonworks Hadoop Distribution

• Selected Teradata/Hortonworks on a Teradata hardware frame

• TDH (Teradata Hadoop) is not an exact copy of HDP (Hortonworks Data Platform)

• The Teradata appliance brings some good things to the table: Viewpoint, HCLI, …

Page 12: A Mayo Clinic Big Data Implementation


Big Data INT and PROD

• TDH INT in one cabinet, TDH PROD in the other, asked Teradata for a VM version

• Additional expansion space available in existing INT and PROD racks, want a big data project? Fund a new edge or data node!

• Teradata add-ons: RAID, InfiniBand, Viewpoint, HCLI

• TDH 1.3.2 not HDP 1.3.2; same source base but minor differences to support the Teradata infrastructure

• Ideal: DEV=INT=PROD, hardware and software

Page 13: A Mayo Clinic Big Data Implementation


[Rack diagram: PROD cabinet – Master Prod 1–2, Edge Prod 1–2, Data Prod 1–6, Viewpoint TMS, System VMs, KVM, cabling slot, primary/secondary SM Enet switches, Network-0 and Network-1 InfiniBand switches, space for additional nodes. Integration Test cabinet – Master Test 1–2, Edge Test 1–2, Data Test 1–6, Cabinet VMs, primary/secondary SM Enet switches, space for additional nodes.]

• 20 Hadoop nodes total – 10 per cabinet

• 2 Hadoop clusters, one per cabinet:

• Prod: 2 Master, 2 Edge, 1 Viewpoint TMS, 6 Data nodes (can add up to 7 more Edge and/or Data nodes in-cabinet, plus add additional cabinets to the cluster)

• Integration Test: 2 Master, 2 Edge, 6 Data nodes (can add up to 8 more Edge and/or Data nodes in-cabinet, plus add additional cabinets to the cluster)

• Raw user data capacity per cluster: 57+ TB

• Includes HDFS 3x replication & work space

• Does NOT include any compression!

• Example: at 2x compression, user data space per cluster is 114+ TB

• Power: 3 phase; 2 x 60 amps per cabinet; bottom egress

• HDP 1.3.2; Storm, Elasticsearch, and WebSphere MQ to be installed on appliance by project team

• Teradata Managed Server (TMS) for Viewpoint

TDH INT and PROD
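The capacity figures above can be sanity-checked with simple arithmetic. The per-node disk size and work-space reserve below are assumptions chosen for illustration; the slide states only the resulting 57+ TB and 114+ TB figures:

```python
# Back-of-envelope check of the per-cluster capacity figures.
# raw_tb_per_node and workspace_fraction are illustrative assumptions,
# not numbers from the slide.
data_nodes = 6
raw_tb_per_node = 36.0      # assumed raw disk per data node
replication = 3             # HDFS default replication factor
workspace_fraction = 0.20   # assumed scratch/work-space reserve

usable = data_nodes * raw_tb_per_node * (1 - workspace_fraction) / replication
print(f"user data capacity: ~{usable:.0f} TB")     # ~58 TB, matching "57+"
print(f"at 2x compression: ~{usable * 2:.0f} TB")  # ~115 TB, matching "114+"
```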

Page 14: A Mayo Clinic Big Data Implementation


Big Data Project Setup

• Agile development – 2 week sprints, daily scrums

• Extreme Programming

• Java Development Environment tool tree
• SVN (Subversion)
• Jenkins
• Maven
• Eclipse – Kepler

• Open Source Components
• Storm
• Flume
• Elasticsearch (Marvel)
• NLP – cTAKES

• Acquired training for all components as needed, e.g. Storm, Flume, Elastic Search, SVN, Drools

• Used in DEV, INT and PROD environments

• Consulting engagements

Page 15: A Mayo Clinic Big Data Implementation


DEV Team

• The team
• Executive support
• Project manager
• Senior Technical staff member
• 4 very experienced Programmers
• Very motivated, flexible, hearts of teachers and learners

• Agile and Extreme programming relatively new to Mayo IT

• Parts of the tool tree were also relatively new to Mayo IT

Page 16: A Mayo Clinic Big Data Implementation


Part 1

• Verify the development tool tree

• Verify the development process

• Verify the open source components

• Define first use cases

• Start and manage the project backlog list

Page 17: A Mayo Clinic Big Data Implementation


Part 1 Projects

• Natural Language Processing
• Let's get more value from unstructured text!

• Standard big data use cases
• Exploration
• Log exploration
• Search
• …
• Data lake
• Cohort identification
• …
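Cohort identification over unstructured text is one of the use cases above. The real pipeline uses clinical NLP (cTAKES), but a toy keyword-based sketch shows the shape of the problem; the patient IDs and note text here are made up:

```python
# Toy cohort identification over free-text notes. A production pipeline
# would use cTAKES concept extraction rather than substring matching;
# this only illustrates the input/output shape of the task.
notes = {
    "pt-001": "Patient reports chest pain; Holter monitor ordered.",
    "pt-002": "Routine follow-up, no complaints.",
    "pt-003": "History of atrial fibrillation, on anticoagulants.",
}

def find_cohort(notes, concepts):
    """Return sorted patient IDs whose notes mention any target concept."""
    return sorted(
        pid for pid, text in notes.items()
        if any(concept in text.lower() for concept in concepts)
    )

cohort = find_cohort(notes, ["atrial fibrillation", "chest pain"])
print(cohort)  # ['pt-001', 'pt-003']
```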

Page 18: A Mayo Clinic Big Data Implementation


Part 1 Pig, Hive

Pig:

A = LOAD 'default.bnb_test_from_file' USING org.apache.hcatalog.pig.HCatLoader();
DUMP A;

Hive:

SELECT * FROM default.bnb_test_from_file LIMIT 2;

Page 19: A Mayo Clinic Big Data Implementation


Part 1

• In production!

• Well received

• Met expectations for the development process and schedule

• Lots of people lined up now to use the environment!

Page 20: A Mayo Clinic Big Data Implementation


Part 2

• More NLP work
• Get more source data from more sources
• Explore via Drools, ElasticSearch, MapReduce

• Many more lined up
• Security – log examination
• Clinical Trials cohort discovery
• Genomics/Phenomics
• Molecular biology
• Protein studies
• …

Page 21: A Mayo Clinic Big Data Implementation


Conclusion

• Big Data via Hadoop is a relevant choice in certain problem spaces

• Open source can provide valuable tools for our customers

• Questions?