Big data challenges and opportunities in healthcare: application to detecting faint signals

Dr. Greg Slabaugh, City University London, School of Informatics

Post on 07-Aug-2020

Page 1: Big data challenges and opportunities in healthcare: application to detecting faint signals

Dr. Greg Slabaugh, City University London, School of Informatics

Page 2: Data, data, and more data

• According to IBM, 90% of the data in the world today was created in the past two years. [1]

• According to International Data Corporation, the total amount of global data is expected to grow to 2.7 zettabytes during 2012. [2]

• The data is growing exponentially (43% growth rate) and is estimated to be 7.9 zettabytes by 2015. [3]

Term              Size (bytes)
kilobyte (KB)     10^3
megabyte (MB)     10^6
gigabyte (GB)     10^9
terabyte (TB)     10^12
petabyte (PB)     10^15
exabyte (EB)      10^18
zettabyte (ZB)    10^21
yottabyte (YB)    10^24
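
The 2015 projection can be sanity-checked with compound growth. This is a minimal sketch, assuming the 43% figure is an annual rate applied to the 2.7 zettabyte estimate for 2012; the slide does not state the compounding period, so that is an assumption, as is the function name `project`.

```python
# Compound-growth check of the figures quoted on this slide.
# ASSUMPTION: the 43% growth rate is annual and starts from the 2012 estimate.

ZETTABYTE = 10**21  # bytes, decimal convention as in the table above


def project(volume_zb: float, annual_growth: float, years: int) -> float:
    """Return the volume in zettabytes after compounding for `years` years."""
    return volume_zb * (1.0 + annual_growth) ** years


volume_2015 = project(2.7, 0.43, 2015 - 2012)
print(f"Projected 2015 volume: {volume_2015:.1f} ZB")
print(f"In bytes: {volume_2015 * ZETTABYTE:.2e}")
```

Under that assumption the result rounds to 7.9 zettabytes, which agrees with the estimate cited in [3].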

Page 3: US Library of Congress

• The world’s largest library

◦ 151.8 million items on 838 miles of bookshelves

◦ 34.5 million books and other print materials

◦ 13.4 million photographs

◦ 5.4 million maps

◦ 6.5 million pieces of sheet music

◦ 66.6 million manuscripts

By one estimate [4], the entire print collection is roughly 10 petabytes of data

Page 4: 7.9 zettabytes (2015) is >700,000 Libraries of Congress
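
The comparison follows directly from the two figures already quoted. A short check, taking the 10 petabyte estimate for the print collection at face value (variable names are illustrative):

```python
# How many 10-petabyte Libraries of Congress fit into 7.9 zettabytes?
PETABYTE = 10**15
ZETTABYTE = 10**21

library_of_congress_bytes = 10 * PETABYTE       # estimate cited in [4]
global_data_2015_bytes = 79 * ZETTABYTE // 10   # 7.9 ZB, kept as an exact integer

libraries = global_data_2015_bytes // library_of_congress_bytes
print(f"{libraries:,} Libraries of Congress")
```

The quotient is 790,000, consistent with the ">700,000" on the slide.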

Page 5: Sources

Images from [4]

Page 6: Not limited to internet media

Large datasets are impacting nearly all areas of business

• Transactional data

◦ Walmart: >1M transactions / hour

◦ Visa: >12.5M transactions / hour [5]

• Networked sensors

◦ Surveillance

◦ Automobiles

• Product development and manufacturing

◦ Integration of data from R&D, engineering, manufacturing

◦ RFID

• Healthcare

◦ Patient records

◦ Diagnostic tests

◦ Imaging

Page 7: Why does big data matter?

• Big data is not just about storing large datasets

• Rather, it is about leveraging datasets

◦ Mining datasets to find new meaning

◦ Combining datasets that have never been combined before

◦ Making more informed decisions

◦ Offering new products and services

Data is a vital asset, and analytics are the key to unlocking its potential

“We don’t have better algorithms than anyone else, we just have more data.” [7]

‐ Peter Norvig, Director of Research, Google, spoken in 2010

Page 8: Recognising the value

• Oracle

◦ Integration of R (statistical programming language) into database software

• IBM

◦ InfoSphere BigInsights

• Google

◦ BigQuery (preview – invitation only): web service that enables interactive analysis on massive datasets – billions of rows

• Opera Solutions

◦ Big data analytics software based on machine learning

• Explorys

◦ Explore and compare populations of patients based on medical data records

• Apache: Hadoop

Page 9: Big data in healthcare

• The past

◦ Data is a by‐product of providing healthcare services

◦ Datasets filed and never seen again; some datasets discarded; hardcopies

◦ Electronic health records (EHRs): primary value is that they make life easier for doctors and bring down storage costs

• The future

◦ Data is a central asset

◦ New analytics to mine data and help extract meaning

◦ Datasets integrated and cross‐referenced – personalised medicine

◦ Digital records aggregated across patients – population studies

• Potential applications [8]

◦ Spotting unwanted drug interactions

◦ Identifying the most effective treatments

◦ Predicting onset of disease before symptoms emerge

◦ Analysing disease patterns

Page 10: Challenges in healthcare

Access to data

IT infrastructure

Analytics

Legal issues

Data privacy

Data integrity

Slide adapted from [9]

Page 11: Opportunities in healthcare

$165B Clinical

$9B Public health

$108B R&D

$5B Business model

$47B Accounts

Slide adapted from [9]

Page 12: A case study: colorectal disease

◦ Colorectal cancer is the second most prevalent cancer in Western countries [10]

◦ 940,000 cases occur annually

◦ 655,000 deaths annually

◦ If detected early, 90% of patients live at least ten years

Pre‐cancerous polyp

Page 13: Colorectal cancer

• Pathophysiology

◦ Adenomatous polyps

• Benign tumours of a glandular organ (colon is a mucosal organ)

• Common, particularly in patients 50+

• Greater than 10 mm: higher likelihood of developing into cancer

◦ Cancer

• Can invade below colon surface and spread to other organs

Progression is slow (5+ years to polyp, 5+ more to cancer)

Page 14: Screening methodologies

• Faecal occult blood test (FOBT)

• Optical colonoscopy (OC)

+ Effectiveness, cost, can remove polyps during procedure

‐ Invasive, sedation, risk of perforation, occlusions, physical limitations, patient compliance

Page 15: Screening methodologies

• CT Colonography (CTC), or “Virtual Colonoscopy”

◦ Examination of the colon using CT imaging

◦ Patient given laxatives to clear the colon

◦ Patient consumes a faecal tagging solution designed to coat any residual stools or liquid

◦ Thin tube inserted into rectum to inflate colon with gas (CO2)

◦ Images taken with patient in prone and supine positions

◦ Images are analysed by radiologist for colorectal lesions

Page 16: Data overload

• Optical colonoscopy

◦ Approx. 20 minutes of HD video per patient (often not recorded)

◦ 30,000,000 procedures performed annually worldwide

◦ Roughly 15 petabytes of data per year (and growing exponentially)

• CT colonography

◦ Each CT series (prone, supine) has roughly 500 images of size 512x512

◦ 1,000,000 procedures performed annually worldwide

◦ Roughly 0.5 petabytes of data per year (and growing exponentially)

All this data must be reviewed by a physician (gastroenterologist or radiologist)
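
The CT colonography estimate can be reproduced from the figures on this slide. The sketch below is a back-of-envelope check only. It assumes 2 bytes per pixel (16-bit storage), which the slide does not state. For optical colonoscopy it works backwards from the quoted 15 petabytes to the per-procedure size and video bit rate that figure implies.

```python
# Back-of-envelope data volumes for the two screening modalities.
PETABYTE = 10**15

# CT colonography: figures from the slide, plus one labelled assumption.
SERIES_PER_PROCEDURE = 2          # prone and supine
IMAGES_PER_SERIES = 500
PIXELS_PER_IMAGE = 512 * 512
BYTES_PER_PIXEL = 2               # ASSUMPTION: 16-bit pixels, not stated on the slide
CTC_PROCEDURES_PER_YEAR = 1_000_000

ctc_bytes_per_procedure = (
    SERIES_PER_PROCEDURE * IMAGES_PER_SERIES * PIXELS_PER_IMAGE * BYTES_PER_PIXEL
)
ctc_petabytes_per_year = ctc_bytes_per_procedure * CTC_PROCEDURES_PER_YEAR / PETABYTE

# Optical colonoscopy: derive what the quoted 15 PB per year implies.
OC_PROCEDURES_PER_YEAR = 30_000_000
OC_PETABYTES_PER_YEAR = 15
VIDEO_SECONDS = 20 * 60           # approx. 20 minutes of video per patient

oc_bytes_per_procedure = OC_PETABYTES_PER_YEAR * PETABYTE / OC_PROCEDURES_PER_YEAR
oc_implied_mbit_per_s = oc_bytes_per_procedure * 8 / VIDEO_SECONDS / 1e6

print(f"CTC: {ctc_bytes_per_procedure / 1e6:.0f} MB per procedure, "
      f"{ctc_petabytes_per_year:.2f} PB per year")
print(f"OC:  {oc_bytes_per_procedure / 1e6:.0f} MB per procedure, "
      f"implied bit rate {oc_implied_mbit_per_s:.1f} Mbit/s")
```

With 16-bit pixels the CT figure comes out at about 0.52 petabytes per year, in line with the "roughly 0.5 petabytes" above. The optical colonoscopy figure corresponds to about 500 MB of video per procedure.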

Page 17: CT Colonography images

• This patient has a polyp in their colon. Did you see it?

Polyps can be very subtle and difficult to detect, even for expert radiologists

Page 18: Computer‐aided detection (CAD)

• CAD consists of image processing and pattern recognition algorithms designed to detect polyps that may be of interest to a physician

• CAD draws the radiologist’s attention to regions that may have otherwise been overlooked

• “Spell‐checker” for medical images

• Characterised by

◦ Sensitivity (percentage of polyps found)

◦ Number of false positives

• CAD is designed to be complementary – it is not a replacement for the physician
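
The two characteristics above can be computed from per-series results. The sketch below is illustrative only: the `SeriesResult` type and all the numbers in it are hypothetical, not data from any study, and sensitivity is computed per polyp, as in the definition on the slide.

```python
from dataclasses import dataclass


@dataclass
class SeriesResult:
    """CAD output for one CT series, compared against a reference reading."""
    true_polyps: int       # polyps present according to the reference standard
    detected_polyps: int   # how many of those polyps CAD marked
    false_positives: int   # CAD marks that do not correspond to a polyp


def sensitivity(results):
    """Fraction of all true polyps that CAD found."""
    total = sum(r.true_polyps for r in results)
    found = sum(r.detected_polyps for r in results)
    return found / total if total else float("nan")


def false_positives_per_series(results):
    """Average number of false CAD marks per series."""
    return sum(r.false_positives for r in results) / len(results)


# Hypothetical results for four series (illustrative numbers only).
results = [
    SeriesResult(true_polyps=2, detected_polyps=2, false_positives=5),
    SeriesResult(true_polyps=0, detected_polyps=0, false_positives=3),
    SeriesResult(true_polyps=1, detected_polyps=0, false_positives=6),
    SeriesResult(true_polyps=1, detected_polyps=1, false_positives=4),
]

print(f"Sensitivity: {sensitivity(results):.0%}")
print(f"False positives per series: {false_positives_per_series(results):.1f}")
```

The two numbers trade off against each other: a more permissive detector finds more polyps but gives the physician more false marks to dismiss.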

Page 19: How does CAD work?

Image pre‐processing

Organ segmentation

Candidate generation

Feature calculation

Classifier

Results

A Robust and Fast System for CTC Computer‐Aided Detection of Colorectal Lesions, Greg Slabaugh, Xiaoyun Yang, Xujiong Ye, Richard Boyes, Gareth Beddoe, Algorithms, 3(1):21‐43, special journal issue on Machine Learning for Medical Imaging, 2010.
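
The stage order above can be sketched as a chain of functions. This is a toy illustration on a tiny 2D grid, not the algorithms of the cited system: every stage is a deliberately naive stand-in, and all function names, thresholds, and pixel values are hypothetical.

```python
# Toy CAD pipeline: pre-processing -> segmentation -> candidate generation
# -> feature calculation -> classifier -> results.
# ASSUMPTION: each stage is a naive placeholder chosen only to show the data flow.

def neighbours(image, r, c):
    """Values of the 4-connected neighbours that lie inside the image."""
    rows, cols = len(image), len(image[0])
    return [image[rr][cc]
            for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            if 0 <= rr < rows and 0 <= cc < cols]


def preprocess(image, lo=0, hi=255):
    """Clamp intensities into a fixed range."""
    return [[min(max(v, lo), hi) for v in row] for row in image]


def segment(image, threshold=1):
    """Region of interest: coordinates of pixels at or above the threshold."""
    return {(r, c) for r, row in enumerate(image)
            for c, v in enumerate(row) if v >= threshold}


def generate_candidates(image, mask):
    """Pixels in the mask that are strictly brighter than all their neighbours."""
    return sorted((r, c) for r, c in mask
                  if all(image[r][c] > n for n in neighbours(image, r, c)))


def compute_features(image, rc):
    """Intensity and contrast against the mean of the neighbouring pixels."""
    r, c = rc
    around = neighbours(image, r, c)
    return {"intensity": image[r][c],
            "contrast": image[r][c] - sum(around) / len(around)}


def classify(features, min_contrast=50):
    """Keep a candidate only if it stands out clearly from its surroundings."""
    return features["contrast"] >= min_contrast


def run_cad(image):
    image = preprocess(image)
    mask = segment(image)
    candidates = generate_candidates(image, mask)
    return [rc for rc in candidates if classify(compute_features(image, rc))]


toy = [
    [0, 10, 10, 10, 0],
    [10, 20, 200, 20, 10],
    [10, 20, 20, 20, 10],
    [10, 30, 20, 20, 10],
    [0, 10, 10, 10, 0],
]
print(run_cad(toy))
```

On this grid, candidate generation proposes two locations, and the classifier keeps only the high-contrast one at row 1, column 2. That mirrors the role of the classifier stage in reducing false positives.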

Page 20: Performance

• Largest ever colon CAD clinical study (3000+ patients)

◦ Dr. Perry Pickhardt (U. Wisconsin)

◦ Published in Radiology 2010 [12]; picked up in radiology press

◦ 4.7 false positives per series

CAD identified 15 polyps that were missed by expert radiologists

Page 21: Advanced analytics

Manifold learning [13]

Page 22: Advanced analytics

Population regression [14]

Page 23: Multi‐modality, multi‐scale and heterogeneous

Organism Organ Tissue Cells

Proteomics Genomics Atomic Records

Page 24: Other “dimensions” to consider

data

time

Page 25: References

[1] IBM quote, microscope.co.uk

[2] International Data Corporation 2012 prediction, IDC website

[3] CenturyLink 2015 prediction, ReadWriteWeb website

[4] Estimated data in US Library of Congress, Wikipedia

[5] Visa transactions, FastCompany website

[6] McKinsey Global Institute, “Big data: The next frontier for innovation, competition, and productivity,” 2011

[7] Peter Norvig quote, CNET website

[8] Economist report on Big Data

[9] Wipro infographic

[10] World Health Organization, 2008 leading causes of death

[11] A Robust and Fast System for CTC Computer‐Aided Detection of Colorectal Lesions, Slabaugh et al., Algorithms, 3(1):21‐43, 2010.

[12] Colorectal polyps: stand‐alone performance of computer‐aided detection in a large asymptomatic screening population, Lawrence, E.M., Pickhardt, P.J., Kim, D.H., & Robbins, J.B. (2010). In Radiology, 256, 791‐798.

[13] http://scikit‐learn.github.com/scikit‐learn.org/dev/auto_examples/manifold/plot_lle_digits.html

[14] Population Shape Regression From Random Design Data, Davis et al., ICCV 2007