predicting breast cancer proliferation scores with apache systemml - uc berkeley - 09.07.16 mwd mjm

Predicting Breast Cancer Proliferation Scores with Apache SystemML Mike Dusenberry & Madison J Myers IBM Spark Technology Center, SF

Upload: mike-dusenberry

Post on 15-Apr-2017


Predicting Breast Cancer Proliferation Scores with Apache SystemML

Mike Dusenberry & Madison J Myers
IBM Spark Technology Center, SF

Let us introduce ourselves.

We like health and we like data.

Found a Challenge

Breast Cancer Tumor Proliferation Challenge

● Images of Tumors: Can be analyzed and given a score for medical assessment.

● Tumor Score: Difficult to determine and takes a trained eye.

● Currently assessed by Pathologists (M.D./D.O.).

● Dataset contains 500 images of breast cancer tissue, each at more than 15GB.

Context

• Breast cancer is a leading cause of cancer death in women.
• Survival rates increase with earlier detection, which is an incentive for quicker detection.
• Tumor cell proliferation is a strong indicator of a patient’s prognosis.
• Currently, pathologists classify tumors based on proliferation by counting the dividing cell nuclei in hematoxylin & eosin stained slides by hand with a microscope.
• This manual assessment suffers from underlying subjectivity.

Example Image:

Example Zoom-In of Image

Looking for Nuclei Characteristics

Grading System in Invasive Breast Cancer

Goal: Predict tumor scores from slide images.

Okay easy, where do we start?

Blockers:

1. Using really, really large images
2. Limited number of images
3. Current state of the art model for this type of task is a deep CNN

Seriously, where do we start?

Reference Paper: “Automated Grading of Gliomas using Deep Learning in Digital Pathology Images”

Daniel L. Rubin, MD, MS Lab
Department of Radiology & Department of Medicine (Biomedical Informatics Research), Stanford University

“Automated Grading of Gliomas using Deep Learning in Digital Pathology Images”

1. Cut a “whole-slide” image into square “tiles” at 20x magnification.
2. Filter the “tiles” to remove any without tissue.
3. Cut the remaining “tiles” into smaller “samples”.
4. Assign a tumor score label to each sample based on the tumor score of the “whole-slide” image.
5. Repeat 1-4 for all “whole-slide” images.
6. Train a convolutional neural network with the resulting dataset of labeled “samples”.
7. Good results!
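Steps 1 and 4 of this recipe can be sketched in a few lines of plain Python. This is only an illustration: the function names, the toy slide dimensions, and the choice to drop partial tiles at the slide edges are assumptions, not details taken from the paper or from the talk.

```python
from typing import Iterator, List, Tuple


def tile_origins(width: int, height: int, tile_size: int) -> Iterator[Tuple[int, int]]:
    """Yield the (x, y) top-left corner of every full tile in a slide.

    Assumption: partial tiles at the right and bottom edges are dropped.
    """
    for y in range(0, height - tile_size + 1, tile_size):
        for x in range(0, width - tile_size + 1, tile_size):
            yield x, y


def label_samples(sample_ids: List[str], slide_score: int) -> List[Tuple[str, int]]:
    """Step 4: every sample inherits the tumor score of its whole-slide image."""
    return [(sample_id, slide_score) for sample_id in sample_ids]


# Toy slide of 4096 x 3000 pixels cut into 1024 x 1024 tiles.
origins = list(tile_origins(width=4096, height=3000, tile_size=1024))
labeled = label_samples(["slide1_sample0", "slide1_sample1"], slide_score=2)
print(len(origins), "tiles;", labeled)
```

Because each sample simply inherits its slide's score, the labels are cheap to produce but noisy: not every region of a slide necessarily reflects the slide-level score.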

20 slides vs 500 slides

… we have lots of data

Our Approach:

● Utilize Apache Spark to cut and filter all 500 labeled, extremely high-resolution tumor slide images into 4.7 million smaller square samples.

● Utilize Apache SystemML on top of Spark to train a convolutional neural network on the labeled samples.

After preprocessing, over 7 terabytes of data...
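A quick back-of-the-envelope check shows how a figure of this size can arise. The sample count (about 4.7 million) and the 256x256x3 sample shape come from the slides; the 8 bytes per value is an assumption (double-precision storage), since the slides do not state the storage format.

```python
# Figures taken from the slides (the sample count is approximate).
NUM_SAMPLES = 4_700_000
SAMPLE_SHAPE = (256, 256, 3)  # height, width, color channels

# Assumption, not stated in the slides: each value is stored as an 8-byte double.
BYTES_PER_VALUE = 8

values_per_sample = SAMPLE_SHAPE[0] * SAMPLE_SHAPE[1] * SAMPLE_SHAPE[2]
total_bytes = NUM_SAMPLES * values_per_sample * BYTES_PER_VALUE
total_terabytes = total_bytes / 1e12  # decimal terabytes

print(f"{values_per_sample} values per sample")
print(f"{total_terabytes:.2f} TB in total")
```

Under that assumption the total comes to roughly 7.4 TB, which is consistent with "over 7 terabytes". Stored as single bytes per value, the same samples would take under 1 TB.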

What is Apache Spark?

● Apache Spark is a fast and general engine for large-scale data processing.

● Combines ML, SQL, streaming, and other complex analytics.

● Extends Scala idioms, as well as R/Python DataFrame idioms to cluster computing.

● APIs for Scala, Java, Python, R.

● Simple to use!

● Much more information at https://spark.apache.org/.

What is Apache SystemML?

● Apache SystemML is a machine learning system for running distributed linear algebra on top of Apache Spark.

● Exposes high-level R-like & Python-like languages focused on linear algebra.

● APIs for Python, Scala, Java.

● Much more information at http://systemml.apache.org/.

Preprocessing

Preprocessing Approach at a High Level

“Whole-Slide” Image → Image Tiles (1024x1024x3 pixel tiles)

Preprocessing Approach at a High Level (cont.)

Example “Tile” Image (1024x1024x3)

Now that we have tiles, we need to filter out non-tissue. We did this with thresholding.

% of breast cancer tissue in image

We measure this after applying thresholding to the tiles and closing small gaps and adipose tissue:

If >= 90%, we keep the tile.
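The 90% rule can be sketched as follows. This is a simplified stand-in, not the authors' filter: it treats any pixel darker than a near-white cutoff as tissue, and it skips the step that closes small gaps and adipose tissue. The cutoff value of 220 and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255


def tissue_fraction(tile: List[List[Pixel]], background_cutoff: int = 220) -> float:
    """Return the fraction of pixels darker than the near-white background.

    Assumption: a pixel counts as tissue when its mean intensity is below
    `background_cutoff`. The gap-closing step from the slides is not modeled.
    """
    total = 0
    tissue = 0
    for row in tile:
        for r, g, b in row:
            total += 1
            if (r + g + b) / 3 < background_cutoff:
                tissue += 1
    return tissue / total if total else 0.0


def keep_tile(tile: List[List[Pixel]], min_tissue: float = 0.90) -> bool:
    """Apply the rule from the slides: keep tiles with >= 90% tissue."""
    return tissue_fraction(tile) >= min_tissue


# Toy 10 x 10 tile: 95 stained (pinkish) pixels and 5 white background pixels.
pink, white = (200, 120, 160), (255, 255, 255)
pixels = [pink] * 95 + [white] * 5
toy_tile = [pixels[i:i + 10] for i in range(0, 100, 10)]
print(tissue_fraction(toy_tile), keep_tile(toy_tile))
```

In a Spark job, a predicate like `keep_tile` would typically be passed to a filter over the collection of tiles, so that tiles without enough tissue are discarded before any samples are cut.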

Preprocessing Approach at a High Level (cont.)

Image Tiles → Example Filtered “Tile” Image → Tile Samples → Example “Sample” Image (256x256x3)
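Cutting a filtered tile into samples is a plain grid split: a 1024x1024 tile yields (1024 / 256)² = 16 samples of 256x256. The sketch below shows the idea on nested lists with a toy size so it runs instantly; the function name is hypothetical, and a real pipeline would more likely slice arrays than lists.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_samples(tile: Sequence[Sequence[T]], sample_size: int) -> List[List[List[T]]]:
    """Cut a square tile into non-overlapping sample_size x sample_size blocks.

    Blocks are returned row by row, left to right. Assumption: any leftover
    pixels that do not fill a whole block are dropped.
    """
    height = len(tile)
    width = len(tile[0]) if height else 0
    samples = []
    for top in range(0, height - sample_size + 1, sample_size):
        for left in range(0, width - sample_size + 1, sample_size):
            block = [list(row[left:left + sample_size])
                     for row in tile[top:top + sample_size]]
            samples.append(block)
    return samples


# Toy stand-in: an 8 x 8 "tile" cut into 2 x 2 "samples" has the same
# 4 x 4 grid layout as a 1024 x 1024 tile cut into 256 x 256 samples.
toy_tile = [[(r, c) for c in range(8)] for r in range(8)]
samples = split_into_samples(toy_tile, sample_size=2)
print(len(samples), "samples per tile")
```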

Preprocessing Code Example:

Now for the good stuff… Machine Learning

What are CNNs, or Convolutional Neural Networks?

● Deep Learning model

● State of the art for computer vision tasks

● and audio…

● and…

Example Convolutional Neural Network
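The core operation behind the name is a small filter slid across the image. The sketch below is a minimal single-channel version in plain Python (no padding, stride 1, followed by a ReLU), meant only to show the mechanics. It is not the network used in this project, and the function names are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide `kernel` over `image` with no padding and stride 1.

    As in most deep learning libraries, the kernel is not flipped
    (strictly speaking this is cross-correlation).
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


def relu(feature_map: Matrix) -> Matrix:
    """Zero out negative activations."""
    return [[max(0.0, value) for value in row] for row in feature_map]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
feature_map = relu(conv2d_valid(image, kernel))
print(feature_map)
```

A convolutional network stacks many such filters, learns their weights from the labeled samples, and ends with layers that map the resulting feature maps to class scores.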

Large data, remember?

Apache SystemML & Apache Spark

Breast Cancer ConvNet w/ SystemML

Training ConvNet w/ PySpark API

Entire Pipeline Diagram

“Whole-Slide” Image → Image Tiles → Example Filtered “Tile” Image → Tile Samples → Example “Sample” Image → ConvNet → “Tumor Proliferation Score”

Thank You

Backup


Where are we now?

● Preprocessing: Complete

● Machine Learning:

○ Small-scale tests complete

○ Large-scale tests in progress.