Sunnie Chung Cleveland State University
• Data Scientist
• Big Data Processing
• Data Mining
• The INTERSECTION of Computer Scientists and Statisticians, with Knowledge of Data Mining AND Big Data Processing Skills:
• to Handle Big Data
• to Collect, Process and Extract value from Big Data (giant and diverse data sets)
• to Understand, Visualize and Present their findings to non-data scientists
• Ability to Create Data-driven Solutions that boost profits, reduce costs and even help save the world
And tackle big data projects on every level
• Big Data and Cloud Projects are on Every CEO’s To-Do List
• The Defense Department
• NASA: Predict Earthquakes (especially after Nepal’s Earthquake)
• NSA, Homeland Security: Predict and Prevent Terrorist Acts
• Internet start-ups
• Financial institutions
• Volume: Unprecedentedly Huge Volume of Data fueled by web-based business, social networking, and microblogs (e.g., click streams captured in web server logs)
e.g.) eBay processes 8 petabytes of data per night
• Variety: Various Structures of Data (often No Structure):
Structured (Database, Data Warehouse),
Semi-structured (Web pages) and
Unstructured (Web Server Log, Sensor Data) – most of the time!
• Velocity: New data is generated at an unprecedentedly high rate
e.g.) Streaming Twitter Messages
Machine-generated data streaming in from smart devices, sensors, monitors and meters needs big data analytics
• Numerous new analytic and business intelligence opportunities like:
• Fraud detection
• Customer profiling
• Customer loyalty analysis
• All of which directly affect business revenue and critical business decisions.
• Identifying Field Specific Motive/Purposes
• Identify Nature of Big Data Source and Data Specific Processes
• Decisions on Building IT Infrastructure of Big Data Processing Systems
• Public Cloud/Private Cloud
• Which MPP Big Data Systems should be built for our specific Big Data Source and Volume
• Execution of Data Analytics
• Data Source Modeling
• Apply Data Mining Strategies
• Research solutions
• Implement Big Data Processing Steps for Solutions/Strategies
• Analyze Results/Interpretation -- Feedback
Massively Parallel Processing (MPP) Systems
• Parallel Data Warehouse (PDW) System
Oracle, IBM, Teradata, Microsoft
• Hadoop System with MapReduce
Hive, HBase, MongoDB, Cassandra, and more
by Google, Yahoo, Facebook, Twitter, LinkedIn
• Hybrid of Both
• MPP System on Cloud
Amazon, Google, Microsoft, Oracle
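The MapReduce model named above can be illustrated with a minimal single-process sketch. This is not how Hadoop itself is implemented; it only mimics the map, shuffle and reduce phases in plain Python. The log format (status code as the last field) and the names map_phase, shuffle, reduce_phase and run_job are assumptions made for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit (status_code, 1) for one web server log line."""
    parts = line.split()
    # Assumption: the status code is the last whitespace-separated field.
    if parts:
        yield parts[-1], 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on a list of log lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical log lines, invented for this example.
log_lines = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
]
counts = run_job(log_lines)
print(counts)
```

On an MPP system, the map and reduce steps would run in parallel on many nodes, each node handling its own partition of the log data; the sketch runs them one after another only to show the data flow.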
• Massively Parallel Processing (MPP) Systems
• Virtual Machine (VM)
• Cloud Type
Cloud as Service
Cloud as Platform
Cloud as Service
• Amazon Elastic Compute Cloud (EC2)
• Google Cloud
• Microsoft Cloud: Azure
Anomaly detection: The identification of unusual data records that might be interesting, or data errors that require further investigation.
Association rule learning (dependency modelling): Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
Clustering: The task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
Classification: The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
Regression: Attempts to find a function which models the data with the least error.
Summarization: Providing a more compact representation of the data set, including visualization and report generation.
Results validation
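The market basket example above can be sketched in a few lines of Python. This is a simplified illustration limited to pairs of items, not a full algorithm such as Apriori. The baskets, the function name pair_rules and the 0.5 support threshold are made up for this example. Support is the fraction of baskets containing both items; the confidence of a rule A -> B is the fraction of baskets containing A that also contain B.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is not bought together often enough
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


# Hypothetical supermarket baskets, invented for this example.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
]
rules = pair_rules(baskets)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data, every basket that contains butter also contains bread, so the rule butter -> bread has confidence 1.0, while bread -> butter is weaker because bread is also bought without butter.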
• Statistics
Naive Bayes, Clustering
• Machine Learning
• Classification Algorithms: Decision Tree, Neural Network, Support Vector Machine
New Algorithm: Convolutional Neural Network - still evolving at a fast rate
• Database
• Association Rule Mining, Data Warehouse OLAP
• Big Data Processing -> Most Recent - still evolving at a fast rate
• Information Retrieval
• Google Search Engine -> Artificial Intelligence - still evolving at a fast rate
• Databases
Advanced Modern Databases and Data Processing Strategies
• Big Data Processing with:
• Parallel Data Warehouse and OLAP (Online Analytic Processing)
• Map Reduce
• Hadoop Based MPP Systems
• Statistics
• Data Mining
- Database: Association Rule Mining, Data Warehouse OLAP
- Statistics: Bayesian, Clustering
- Machine Learning: SVM, Neural Network: CNN, RNN, LSTM
And More on recent developments
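As a small illustration of the clustering technique listed above, the following is a minimal sketch of k-means on one-dimensional data. The function name kmeans_1d, the sample values and the starting centers are assumptions chosen for this example; practical work would normally use an established library rather than hand-written code.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for 1-D data.

    values: list of numbers to cluster.
    centers: initial guesses for the cluster centers (assumed given).
    iterations: number of assign/update rounds (must be at least 1).
    """
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical measurements forming two obvious groups.
values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

No group labels are supplied: the two groups emerge only from the distances between the values, which is what distinguishes clustering from classification.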
Massively Parallel Processing (MPP) Systems
• Parallel Data Warehouse Based Systems:
• Oracle, Teradata, Microsoft PDW, IBM
• In-Memory NewSQL Systems
• Hadoop/MapReduce Based Systems: NoSQL systems
• MongoDB
• Pig Latin
• HBase
• Hive
• Stream Processing: Spark
• Cloud: Big Data Processing Systems on Cloud
• Google Cloud, Amazon Cloud, Microsoft Azure, Oracle, IBM
http://blogs.the451group.com/opensource/2011/04/15/nosql-newsql-and-beyond-the-answer-to-sprained-relational-databases/
Popular Free Open Source
• R / MapR: R is a programming language and software environment for statistical computing, data mining, and graphics (GNU Project).
• Spark: Streaming Data Processing
• Google TensorFlow: Python/C++-based Image Processing Library, Natural Language Processing Library
• Weka: A suite of machine learning software applications written in the Java programming language
• UIMA (Unstructured Information Management Architecture): A component framework for analyzing unstructured content such as text, audio and video – originally developed by IBM
Major Commercial:
• SAS Enterprise Miner
• Microsoft Business Intelligence Data Analytic Tool using Databases
On Databases
CIS 530 : Intro to Database Systems and Processing
CIS 611 : Enterprise Database Systems and Data Warehouse
- Advanced Data Processing Techniques
- Parallel Data Warehouse and OLAP
- Big Data Processing and Management Systems
CIS 612 : Big Data and Parallel Data Processing Systems
- Hadoop and MapReduce
- NoSQL Systems on VM(Virtual Machine), Cloud
- Stream Data Processing: Spark
CIS 695: Practicum in Data Analytics and Big Data Processing
(Scheduled to be created)
• Data Analytics
CIS 660: Data Mining Techniques from Database, Statistics
and Machine Learning, Text/Web Mining Techniques
EEC 525 Data Mining
• Math and Statistics
Graduate Certificate in Applied Predictive Modeling
MTH 521 : Time Series Analysis
MTH 531 : Categorical Data Analysis
MTH 537 : Operations Research
MTH 567 : Applied Linear Models I
MTH 638 : Operations Research II
MTH 668 : Applied Linear Models II
MTH 675 : Applied Multivariate Statistics
• Business Analytic Certificates
Focus on SAS Certificate with SAS Enterprise Miner Tool
BUS 575 : Introduction to Business Analytics
BUS 600 : Applied Business Analytics
BUS 601 : Managing Databases for Business Analytics
BUS 602 : Strategy for Business Analytics
BUS 603 : SAS for Data and Statistical Analysis
BUS 604: Advanced Business Analytics I
BUS 606: Practicum in Business Analytics
• Explorys by IBM
• website: https://www.explorys.com/
• Data Analytics / Big Data Processing on Health and Wellness Data
• Data Analytics for Cleveland Clinic (Teradata PDW), Metro Health
• Progressive
• Big Data Processing on Auto Insurance: Hadoop Based MPP Systems
• PNC (Teradata MPP PDW)
• Big Data Processing Systems on Financial Data
• Hadoop Big Data Processing Workshop/Meetup
The EECS Dept of CSU is planning to host the meeting annually to
connect our students to the local Big Data companies
• Data Scientist Group
Regular webinars on Advanced Data Analytics Topics
Current Research/Publications at CSU (by Sunnie Chung)
• Research on Big Data Analytics on Real Time Sentiment Analysis
• Research on Natural Language Processing with Machine Learning
• Research on Cyber Security with Data Analytics Methods
• Research on Question Answering Systems - Data Analytics Applications
• Research on Data Mining for Machine Fault Detection
• Research on Optimizations in MPP Systems
• Research on Integrating Big Data Management Systems