predicting breast cancer proliferation scores with apache systemml - uc berkeley - 09.07.16 mwd mjm
Predicting Breast Cancer Proliferation Scores with Apache SystemML
Mike Dusenberry & Madison J Myers
IBM Spark Technology Center, SF
Breast Cancer Tumor Proliferation Challenge
● Images of Tumors: Can be analyzed and given a score for medical assessment.
● Tumor Score: Difficult to determine and takes a trained eye.
● Currently assessed by pathologists (M.D./D.O.).
● Dataset contains 500 images of breast cancer tissue, each at more than 15GB.
Context
• Breast cancer is a leading cause of cancer death in women.
• Survival rates increase as early detection increases, incentivizing quicker detection.
• Tumor cell proliferation is a strong indicator of a patient’s prognosis.
• Currently, pathologists classify tumors based on proliferation by counting the dividing cell nuclei in hematoxylin & eosin stained slides by hand with a microscope.
• This manual counting suffers from underlying subjectivity.
1. Using really, really large images
2. Limited number of images
3. Current state of the art model for this type of task is a deep CNN
Reference Paper: “Automated Grading of Gliomas using Deep Learning in Digital Pathology Images”
Daniel L. Rubin, MD, MS Lab
Department of Radiology & Department of Medicine (Biomedical Informatics Research), Stanford University
“Automated Grading of Gliomas using Deep Learning in Digital Pathology Images”
1. Cut a “whole-slide” image into square “tiles” at 20x magnification.
2. Filter the “tiles” to remove any without tissue.
3. Cut the remaining “tiles” into smaller “samples”.
4. Assign a tumor score label to each sample based on the tumor score of the “whole-slide” image.
5. Repeat 1-4 for all “whole-slide” images.
6. Train a convolutional neural network with the resulting dataset of labeled “samples”.
7. Good results!
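Steps 1-4 can be sketched on coordinates alone, without loading any pixels. This is a minimal illustration, not the paper's or the authors' code: the function names, the 1024-pixel tile size, and the `keep_tile` callback are assumptions made for the example; only the 256-pixel sample size matches the sample image shown in this deck.

```python
from typing import Callable, Iterator, List, Tuple


def grid_origins(height: int, width: int, size: int) -> Iterator[Tuple[int, int]]:
    """Yield the (row, col) top-left corner of every full size x size square."""
    for row in range(0, height - size + 1, size):
        for col in range(0, width - size + 1, size):
            yield row, col


def label_samples(
    slide_h: int,
    slide_w: int,
    tile_size: int,
    sample_size: int,
    slide_score: int,
    keep_tile: Callable[[int, int], bool],
) -> List[Tuple[int, int, int]]:
    """Return (row, col, label) for every sample cut from the kept tiles."""
    samples = []
    for t_row, t_col in grid_origins(slide_h, slide_w, tile_size):  # step 1: tiles
        if not keep_tile(t_row, t_col):  # step 2: drop tiles without tissue
            continue
        for s_row, s_col in grid_origins(tile_size, tile_size, sample_size):  # step 3
            # step 4: every sample inherits the score of its whole-slide image
            samples.append((t_row + s_row, t_col + s_col, slide_score))
    return samples


# Hypothetical slide of 2048 x 3072 pixels with score 2; the filter here simply
# pretends that the top-left tile contains no tissue.
samples = label_samples(
    slide_h=2048,
    slide_w=3072,
    tile_size=1024,   # assumption, not stated in the slides
    sample_size=256,  # matches the 256x256x3 sample image
    slide_score=2,
    keep_tile=lambda row, col: (row, col) != (0, 0),
)
print(len(samples))
```

Step 5 is this function applied to every slide, and step 6 consumes the combined list as the training set.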
Our Approach:
● Utilize Apache Spark to cut and filter all 500 labeled, extremely high-resolution tumor slide images into 4.7 million smaller square samples.
● Utilize Apache SystemML on top of Spark to train a convolutional neural network on the labeled samples.
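A quick back-of-the-envelope calculation shows why a cluster is attractive here. The slide and sample counts and the 256x256x3 sample shape come from this deck; storing each value as one uncompressed byte is an assumption.

```python
NUM_SLIDES = 500               # from the deck
NUM_SAMPLES = 4_700_000        # from the deck
SAMPLE_SHAPE = (256, 256, 3)   # from the example sample image
BYTES_PER_VALUE = 1            # assumption: uncompressed 8-bit channels

samples_per_slide = NUM_SAMPLES / NUM_SLIDES

height, width, channels = SAMPLE_SHAPE
bytes_per_sample = height * width * channels * BYTES_PER_VALUE

total_bytes = NUM_SAMPLES * bytes_per_sample
total_gb = total_bytes / 1e9   # decimal gigabytes

print(f"{samples_per_slide:.0f} samples per slide on average")
print(f"{bytes_per_sample} bytes per sample")
print(f"{total_gb:.0f} GB for the full sample set")
```

Under these assumptions the sample set alone is on the order of 900 GB, before counting the source slides at more than 15GB each.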
What is Apache Spark?
● Apache Spark is a fast and general engine for large-scale data processing.
● Combines ML, SQL, streaming, and other complex analytics.
● Extends Scala idioms, as well as R/Python DataFrame idioms to cluster computing.
● APIs for Scala, Java, Python, R.
● Simple to use!
● Much more information at https://spark.apache.org/.
What is Apache SystemML?
● Apache SystemML is a machine learning system for running distributed linear algebra on top of Apache Spark.
● Exposes high-level R-like & Python-like languages focused on linear algebra.
● APIs for Python, Scala, Java.
● Much more information at http://systemml.apache.org/.
% of breast cancer tissue in image
After applying thresholding to tiles to close small gaps and adipose tissue:
If >= 90%, we keep the tile.
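The 90% rule can be sketched as follows. This is a simplified illustration, not the authors' filter: it treats any grayscale pixel darker than a hypothetical `background_cutoff` as tissue, and it skips the step that closes small gaps and adipose tissue.

```python
from typing import Sequence

TISSUE_THRESHOLD = 0.90  # from the slide: keep tiles with >= 90% tissue


def tissue_fraction(
    gray_tile: Sequence[Sequence[int]],
    background_cutoff: int = 200,  # assumption: brighter pixels are background
) -> float:
    """Fraction of pixels darker than background_cutoff."""
    total = sum(len(row) for row in gray_tile)
    if total == 0:
        return 0.0
    tissue = sum(1 for row in gray_tile for px in row if px < background_cutoff)
    return tissue / total


def keep_tile(
    gray_tile: Sequence[Sequence[int]],
    threshold: float = TISSUE_THRESHOLD,
) -> bool:
    """True when the tile has at least `threshold` tissue coverage."""
    return tissue_fraction(gray_tile) >= threshold


# Two toy 10x10 grayscale tiles: dark rows stand in for stained tissue (100),
# bright rows for empty slide background (255).
mostly_tissue = [[100] * 10] * 9 + [[255] * 10]
too_sparse = [[100] * 10] * 8 + [[255] * 10] * 2

print(keep_tile(mostly_tissue), keep_tile(too_sparse))
```

Because the comparison is `>=`, a tile with exactly 90% tissue is kept, matching the rule on the slide.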
Preprocessing Approach at a High Level (cont.)
[Diagram panels: Image Tiles; Example Filtered “Tile” Image; Tile Samples; Example “Sample” Image (256x256x3)]
Entire Pipeline Diagram
[Diagram panels: “Whole-Slide” Image; Image Tiles; Example Filtered “Tile” Image; Tile Samples; Example “Sample” Image; ConvNet: “Tumor Proliferation Score”]