anusua trivedi, data scientist at texas advanced computing center (tacc), ut austin at mlconf atl -...
TRANSCRIPT
Machine Learning (ML) and TACC Supercomputers
A little about me
• Data Scientist at Texas Advanced Computing Center (TACC)
• My Contact: [email protected]
• TACC - Independent research center at UT Austin
• TACC - One of the largest HIPAA-compliant supercomputer centers
• ~250 faculty, researchers, students and staff
• We provide support for large-scale computing problems
2
Some Basic Observations
There are fundamental differences in data access patterns between Data Intensive Computing and High Performance Computing (HPC)
Today, most ML researchers want or need to work with big data, vectorization, code optimization, etc.
3
Data Intensive Computing
Specialized in dealing effectively with vast quantities of data in distributed environments
Generates high demand for computational resources, e.g. storage capacity, processing power, etc.
4
Data Intensive Computing & Big Data
Big data plays the key role in the popularity and growth of data intensive computing
Increased the volume of data
• Improves accuracy of existing algorithms
• Helps create better predictive models
Increased the complexity
5
What’s the challenge with the big data analysis?
6
Big Data Analysis requires even more computational resources
Storage is triple the standard data size
Algorithms use large numbers of data points and are memory intensive
Big data analysis takes much longer
Typical hard drive read speed is about 150 MB/sec, so reading 1 TB takes ~2 hours
Analysis could require processing time proportional to the size of the data
• Data analysis at a rate of 1 GB/second would require about 11 days to finish for 1 PB of data (about 17 minutes for 1 TB)
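The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 TB = 10^12 bytes) and a single sequential pass at a constant rate; the helper name is illustrative. Under those assumptions, roughly 11 days at 1 GB/second corresponds to about 1 PB of data, while 1 TB at the same rate takes about 17 minutes.

```python
# Decimal units are an assumption; binary units would shift the numbers slightly.
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15


def seconds_to_process(size_bytes, rate_bytes_per_sec):
    """Time for one sequential pass over the data at a constant rate."""
    return size_bytes / rate_bytes_per_sec


# Reading 1 TB from a single hard drive at 150 MB/s.
read_1tb_hours = seconds_to_process(1 * TB, 150 * MB) / 3600

# Analysis at 1 GB/s over 1 TB and over 1 PB.
scan_1tb_minutes = seconds_to_process(1 * TB, 1 * GB) / 60
scan_1pb_days = seconds_to_process(1 * PB, 1 * GB) / 86400

print(f"1 TB at 150 MB/s: {read_1tb_hours:.2f} hours")
print(f"1 TB at 1 GB/s:   {scan_1tb_minutes:.2f} minutes")
print(f"1 PB at 1 GB/s:   {scan_1pb_days:.2f} days")
```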
7
High Performance Computing (HPC)
Hardware with more computational power per compute node
Computation can be done with multiple nodes
Provides highly efficient numeric processing in distributed environments
HPC has seen a recent growth in shared memory architectures
8
Sample TACC Computing Cluster
9
Combine HPC & Data intensive computing
The intersection of these two domains is mainly driven by the use of machine learning (ML)
ML methodologies help extract knowledge from big data
These hybrid environments:
• take advantage of data locality
• keep the data exchanges over the network at a manageable level
• offer high performance through distributed libraries
10
TACC Ecosystem
Stampede – Traditional cluster HPC system
Stockyard and Corral – 25 Petabytes of combined disk storage for all data needs
Ranch – 160 Petabytes of tape archive storage
Maverick/Rustler/Rodeo – “Niche” systems with GPU clusters, great for data analytics and visualization
Wrangler - A New Generation of Data-intensive Supercomputer
11
TACC Ecosystem Goals
Goal to address the data problem in multiple dimensions
• Supports data in large and small scales
• Supports data reliability
• Supports data security
• Supports multiple data types: structured and unstructured
• Supports sequential access
• Fast for large files
Goal to support a wide range of applications and interfaces
• Hadoop (and Mahout) & Spark (and MLlib)
• Traditional R, GIS, DBs, and other HPC style performing workflows
Goal to support the full data lifecycle
• Metadata and collection management support
12
Why use TACC Supercomputers?
Need to analyze large datasets quickly
Need a more on-demand interactive analysis environment
Need to work with databases at high transaction rates
Have a Hadoop or Spark workflow that needs a large HDFS datastore
Have a dataset that many users will compute with or analyze
In need of a system with data management capabilities
Have a job that is currently IO bound
13
TACC Success Stories
14
15
16
Available ML tools/libraries in TACC Supercomputers
Scikit-learn
Caffe
Theano
CUDA/cuDNN
Hadoop
PyHadoop
RHadoop
Mahout
Spark
PySpark
SparkR
MLlib
17
Two Sample ML workflows in TACC Supercomputers
GPU Powered Deep Learning on MRI images with NVIDIA DIGITS in Maverick Supercomputer
Pubmed Recommender System in Wrangler Supercomputer
18
Deep Learning on Images
Deep Neural Networks are computationally quite demanding
The input data is large even at a small image resolution
• A 256 x 256 RGB image implies 196,608 input neurons (256 x 256 x 3)
Many of the involved floating point matrix operations can be addressed by GPUs
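To make the input-size arithmetic concrete, here is a minimal sketch. The function names and the choice of 1000 hidden units are illustrative assumptions, not taken from the talk; the point is only that a single fully connected layer on raw pixels already needs a very large weight matrix.

```python
def input_neurons(width, height, channels=3):
    """Number of input values for one image: one per pixel per color channel."""
    return width * height * channels


def dense_layer_weights(n_inputs, n_hidden):
    """Weights in one fully connected layer (biases not counted)."""
    return n_inputs * n_hidden


n_in = input_neurons(256, 256)

# 1000 hidden units is an arbitrary, illustrative choice.
n_weights = dense_layer_weights(n_in, 1000)

print(f"inputs per image: {n_in:,}")
print(f"weights in first dense layer: {n_weights:,}")
```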
19
Deep Learning on MRI using TACC Supercomputers
Maverick has large GPU clusters
There are three major GPU-utilizing Deep Learning frameworks available: Theano, Torch and Caffe
We use NVIDIA DIGITS (based on Caffe), which is a web server providing a convenient web interface for training and testing Deep Neural Networks
For classification of MRI/images we use a convolutional DNN to learn the features
We use CUDA 7, cuDNN, Caffe and DIGITS on Maverick to classify our MRI/images
In the course of 30 epochs, our classification accuracy ranges from 74.21% to 82.09%
20
Pubmed Recommender System in Wrangler
21
What is a Recommendation System?
A recommender system helps match users with items
Implicit or explicit user feedback or item suggestion
Our recommendation system: We try to build a model which recommends Pubmed documents to users, based on the user search profile
22
Types of Recommender System
Types, Pros, Cons
Knowledge-based (i.e., search)
• Pros: Deterministic recommendations, assured quality, no cold start
• Cons: Knowledge engineering effort to bootstrap, basically static
Content-based
• Pros: No community required, comparison between items possible
• Cons: Content descriptions necessary, cold start for new users
Collaborative
• Pros: No knowledge-engineering effort, serendipity of results
• Cons: Requires some form of rating feedback, cold start for new users and new items
23
Using Vector Space Model (VSM) for Pubmed
Given:
• A set of Pubmed documents
• N features (unique terms) describing the documents in the set
VSM builds an N-dimensional Vector Space
Each item/document is represented as a point in the Vector Space
Information Retrieval based on search
Query: A point in the Vector Space
• We apply TFIDF to the tokenized documents to weight the documents and convert the documents to vectors
• We compute cosine similarity between the tokenized documents and the query term
• We select top 3 documents matching our query
• We weight the query term in the sparse matrix and rank documents
24
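The retrieval steps for the Vector Space Model (TF-IDF weighting, cosine similarity, top 3 selection) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: the toy documents, the whitespace tokenizer, the smoothed IDF formula, and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Whitespace tokenizer; a real pipeline would also strip punctuation."""
    return text.lower().split()


def build_idf(tokenized_docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as {term: weight}; unknown terms are dropped."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query, docs, k=3):
    """Return (doc_index, score) for the k documents most similar to the query."""
    tokenized = [tokenize(d) for d in docs]
    idf = build_idf(tokenized)
    doc_vecs = [tfidf_vector(t, idf) for t in tokenized]
    q_vec = tfidf_vector(tokenize(query), idf)
    scored = sorted(((cosine(q_vec, v), i) for i, v in enumerate(doc_vecs)), reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy stand-ins for Pubmed abstracts (made up for illustration).
docs = [
    "deep learning for mri image classification",
    "gene expression analysis in breast cancer",
    "mri imaging of brain tumors",
    "recommender systems for biomedical literature",
]
ranked = top_k("mri brain imaging", docs, k=3)
for index, score in ranked:
    print(f"{score:.3f}  {docs[index]}")
```

Keeping the vectors as sparse dictionaries mirrors the sparse-matrix representation mentioned on the slide: only terms that actually occur in a document take up space.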
MPI or Hadoop or Spark?
Which is really more suitable for this ML problem in an HPC system?
25
Message Passing in HPC
Message Passing Interface (MPI) was one of the key factors which supported the initial growth of cluster computing
MPI helped shape what the HPC world has become today
MPI supported a substantial majority of all supercomputing work
Scientists and engineers have relied upon MPI for the past decades
MPI works great for data intensive computing in a GPU cluster
26
Why MPI is not the best tool for ML
A researcher/developer working with MPI needs to manually decompose the common data structures across processors
Every update of the data structure needs to be recast into a flurry of messages, syncs, and data exchange
Programming at the transport layer is an awkward fit for numerical application developers
This led to the advent of other techniques
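The manual decomposition described above can be illustrated without any MPI calls. The sketch below only computes which slice of a shared array each rank would own and which neighbors it would have to exchange boundary data with; the function names are hypothetical, and in a real MPI program every one of these ranges would additionally have to be turned into explicit send and receive operations by the developer.

```python
def block_range(n_items, n_ranks, rank):
    """Half-open [start, stop) range of items owned by `rank`.

    Items are split as evenly as possible; the first `n_items % n_ranks`
    ranks each receive one extra item.
    """
    base, extra = divmod(n_items, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def neighbors(rank, n_ranks):
    """Ranks a process would exchange boundary data with in a 1-D layout."""
    left = rank - 1 if rank > 0 else None
    right = rank + 1 if rank < n_ranks - 1 else None
    return left, right


n_items, n_ranks = 10, 4
ranges = [block_range(n_items, n_ranks, r) for r in range(n_ranks)]
for r, (start, stop) in enumerate(ranges):
    print(f"rank {r}: items [{start}, {stop}), neighbors {neighbors(r, n_ranks)}")
```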
27
Choosing Hadoop over MPI
Hadoop is an open source implementation of the MapReduce programming model in Java
It has interfaces to other programming languages such as R, Python, etc.
The Hadoop ecosystem includes:
• HDFS: A distributed file system based on the Google File System (GFS)
• YARN: A resource manager to assign resources to the computational tasks
• MapReduce: A library to enable efficient distributed data processing easily
• Mahout: Scalable machine learning and data mining library
• Hadoop streaming: A generic API which allows writing Mappers and Reducers in any language
Hadoop is a good fit for large single-pass data processing, but has its own limitations
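Hadoop streaming exchanges records with mappers and reducers as lines of text, conventionally a key and a value separated by a tab. The word-count sketch below follows that convention but runs entirely locally: `run_locally` is a hypothetical stand-in for the framework's shuffle and sort, and in an actual streaming job the mapper and reducer would each be a separate script reading from standard input.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive records that share the same key.

    Relies on the input being sorted by key, which the framework guarantees
    between the map and reduce stages.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Hypothetical local driver: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


result = run_locally(["Spark and Hadoop", "hadoop streaming"])
for record in result:
    print(record)
```

The sort between the two stages is the single pass over intermediate data that makes this model a good fit for one-shot processing and a poor fit for iterative algorithms.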
28
Limitations of Hadoop in HPC
Hadoop comes with mandatory MapReduce logging of output to disk after every Map/Reduce stage
In HPC, logging output to disk could be sped up with caching or SSDs
In general, this fact rendered Hadoop unusable for many ML approaches which required iteration, or interactive use
The real issue with Hadoop was its HDFS file system. The HDFS file system was intimately tied to Hadoop cluster scheduling
The large-scale ML community sought in-memory approaches to avoid this problem
29
Spark
For large-scale technical computing, one very promising in-memory approach is Spark
Spark does not impose MapReduce-style requirements
Spark can run standalone, without a scheduler like YARN
It has interfaces to other programming languages such as R, python etc.
Spark supports HDFS through YARN
MLlib: Scalable machine learning and data mining library
Spark streaming: Enables stream processing of live data streams
30
Our Recommendation Model
We apply collaborative filtering on the weighted/ranked documents
We use Alternating Least Squares (pyspark.mllib.recommendation.ALS) for recommending Pubmed documents
• MatrixFactorizationModel.recommendProducts(user, num), where num is the number of products to recommend
We use collaborative filtering in Scikit-learn & Hadoop as baselines
• We use the python-recsys library along with Python Scikit-learn: svd.recommend(int product_id)
• We use Mahout's Alternating Least Squares for Hadoop
Comparative study of our model shows improved performance in Spark
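The idea behind Alternating Least Squares can be shown with a deliberately small sketch. It is not the MLlib or Mahout implementation: it uses a single latent factor (rank 1) so that each regularized least-squares solve reduces to a scalar division, and the ratings, function names, and parameter values are all made up for illustration.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=20):
    """Rank-1 ALS on a sparse {(user, item): rating} dictionary.

    Alternates between solving for user factors with item factors fixed
    and solving for item factors with user factors fixed.
    """
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        for u in range(n_users):
            rated = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            num = sum(r * item_f[i] for i, r in rated)
            den = reg + sum(item_f[i] ** 2 for i, _ in rated)
            user_f[u] = num / den
        for i in range(n_items):
            rated = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            num = sum(r * user_f[u] for u, r in rated)
            den = reg + sum(user_f[u] ** 2 for u, _ in rated)
            item_f[i] = num / den
    return user_f, item_f


def recommend(user, user_f, item_f, ratings, num):
    """Top `num` items the user has not rated, ordered by predicted rating."""
    unseen = [i for i in range(len(item_f)) if (user, i) not in ratings]
    unseen.sort(key=lambda i: user_f[user] * item_f[i], reverse=True)
    return unseen[:num]


# Toy (user, document) -> rating data; values are invented.
ratings = {
    (0, 0): 1.0, (0, 1): 2.0, (0, 2): 0.5, (0, 3): 2.5,
    (1, 0): 2.0, (1, 2): 1.0,
    (2, 1): 4.0, (2, 3): 5.0,
}
user_f, item_f = als_rank1(ratings, n_users=3, n_items=4)
suggestions = recommend(1, user_f, item_f, ratings, num=2)
print("documents suggested for user 1:", suggestions)
```

Each inner loop touches only the ratings of one user or one item, which is why the updates can be distributed across workers in frameworks such as Spark.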
31
Performance Evaluation of Pubmed Recommendation Model
We evaluate our recommendation model using Python Scikit-learn, Apache Mahout and PySpark MLlib in Wrangler
We use Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) to evaluate the recommendation model
The lower the errors, the more accurate the model
The lower the time taken to train/test the model, the better the performance
Algo type        Public dataset              ML library      Eval test                  Model training time   Model test time
Recommendation   Weighted Pubmed documents   Python Scikit   RMSE=17.96%, MAE=16.53%    42 secs               19 secs
Recommendation   Weighted Pubmed documents   Hadoop Mahout   RMSE=16.02%, MAE=14.98%    38 secs               14 secs
Recommendation   Weighted Pubmed documents   PySpark MLlib   RMSE=15.88%, MAE=14.23%    34 secs               11 secs
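For reference, the two error metrics can be computed as in the sketch below. The rating values are toy numbers chosen for the example, and the sketch reports raw errors; how the percentages in the table were normalized is not stated in the slides.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error: penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean Absolute Error: average magnitude of the errors."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Toy held-out ratings and model predictions (invented values).
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.0, 3.0, 4.0, 3.0]

print(f"RMSE = {rmse(actual, predicted):.4f}")
print(f"MAE  = {mae(actual, predicted):.4f}")
```

Because RMSE squares each error before averaging, it is always at least as large as MAE on the same data, which matches the ordering of the two columns in the table.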
32
THANK YOU !
Questions?
33