in3020/4020 –database systems spring 2020, week 16.2, part ...€¦ · in3020/4020 –database...
TRANSCRIPT
IN3020/4020 – Database Systems Spring 2020, Week 16.2, Part 1
Data Science
Dr. M. Naci Akkøk, Chief Architect, Oracle NordicsCo-founder & CEO, In-Virtualis
Also based upon slides from Renée Wikestad, Senior Solution Engineer, Oracle
Major driver of DBMS development/usageo If you haven´t noticed yet, one of the major users of (and hence
drivers in research for) databases and database management systems is data-intensive or data-driven applications:
o Analyticso Data Science (or applied statistics)o Artificial Intelligence (AI) and Machine Learning (ML)o Also including pattern recognition, text and speech
analysis/synthesis etc.
Artificial Intelligence
Machine Learning
Deep Learning
So what is … ?
Data Science
Artificial Intelligence
Machine Learning
Deep Learning
Data Scientist; Skills
Math and Statistics
Business/DomainsKnowledge
Computer Science
Machine LearningDiagnosticAnalysis
Soft
war
e De
velo
pmen
t
What is Machine Learning?
ML is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.
What is Machine Learning?
ML is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.
What is Machine Learning?
ML is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.
What is Machine Learning?
ML is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.
What is Machine Learning?
ML is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.
There are hybrid models, but such applications arealways data driven, data intensive.
The old way
Human written rules guided by experience
Problem: Difficult to determine the right rules, rules may change, impossible to evaluate the fit for every individual
?With Machine Learning
Machine finds patterns in historical data (model)
Uses this knowledge to predict best audience
Example: Address the right audience for my product
? pneumonia
healthy
The old way
Human work by highly skilled doctors.
With Machine Learning
Machine finds patterns in labelled historical images.
Uses this knowledge to diagnose new images.
Example: Medical diagnosis
Example: Fraud Detection
The old way
Rule based systems.
High amount of false positives.
With Machine Learning
Combination of techniques, including Anomaly Detection.
Higher accuracy.
SalePrice
Size (m2)
400K
200100 300
300K
200K
100K
500K
A model is a function
f(x)=Target
x = Input Feature
f(x) = 150 + 1.25 x
How to find a model with Linear regression?
Hypothesis:
Parameters:
Cost Function:
Target:
• We use a training set of historical data to find parameters that provide the function that best fits the training data.
• We then use a test data set of historical data to checkhow well the model fits this test data.
• If the model is good enough, we can use it to predictfuture values.
Machine Learning Types
Supervised
CLASSIFICATIONREGRESSION
Machine Learning Types
Supervised
CLASSIFICATIONREGRESSION CLUSTERING
Unsupervised
Machine Learning Types
CLASSIFICATIONREGRESSION CLUSTERING
Supervised Unsupervised Reinforcement
Agent
Environment
state
reward
action
o Supervised Learning: Uses historical data to find a model (mathematical function) or classification / pattern that best fits the data in order to predict valuesfor new data.
o Unsupervised Learning:Finds trends and patterns in data, without previous training on historicaldata. Here you try to find patterns/classifications of the data with littleor no idea what the result should look like.
Reinforcement Learning: Uses observations to learn. Learning is based on the fact that all actionshave an effect on the model and that feedback is given from the actionsthat guide and improve the learning.
Machine Learning Algorithms can be divided into categories according to their purpose.Machine Learning Models
S
M
Top languages for Data Scientists
Python and SQL are leading, with R in a good third place.
Access to Data
• On-prem/cloud/hybrid data management
• Finding data
• Combining data
Challenges of Data Science
Complex lifecycle to build models
• Multiple roles involved
• Reuse of past work
• Access to tools and infrastructure
• Access to flexible compute power
Operationalizing ML• Packaging models for
consumption
• Monitoring models
• Updating models
Problem Definition
Dataset Discovery and
Access
Data Exploration + Preparation
Model BuildingModel
Evaluation + Selection
Model Deployment
Model Evaluation
The Data Science Process
The Machine Learning Process
Business Understanding
Data Understanding Data Preparation Modeling Evaluation Deployment
Capture DataDescribe DataExplore Data
Select DataClean DataConstruct DataCombine Data
Select Modeling TechniqueBuild ModelAsses Model
Evaluate resultDecide Next Step
ObjectivesSuccess Criteria
Analytics Development
Reports & Dashboards
Data Information Predictions & Insights Appls with ML
Analytical Maturity
Valu
e of
Ana
lytic
s
Diagnostic Analysis & Reports
Predictive / Machine Learning
“ML Enabled” Applications
What Happened?
Why it Happened?
What WILL happen?
Automated ML Appls
Machine Learning Differentiators
Business Understanding
• Predict employees that voluntarily leave• Predict customers that are likely to churn• Target “best” customers • Find items that will help me sell more most profitable items• What is a specific customer most likely to purchase next?• How can I combat fraud?• ..and so on
Start with a Well-Defined Business Problem Statement
“If I had an hour to solve a problem I'd spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.”
― Albert Einstein
Business UnderstandingStart with a Well-Defined Business Problem Statement
Move the Algorithms, Not the Data!
Machine Learning Algorithms Require Data
X1
X2
A1 A2 A3 A4 A5 A6 A7
Algorithms automatically sift through large amounts of data to discover hidden patterns, new insights and make predictions
In-database ML
X1
X2
A1 A2 A3 A4 A5 A6 A7
Identify most important factor (Attribute Importance)
Predict customer behavior (Classification)Find profiles of targeted people or items (Classification)
Predict or estimate a value (Regression)
Segment a population (Clustering)
Find fraudulent or “rare events” (Anomaly Detection)
Determine co-occurring items in a “basket” (Associations)
Built-in ml library. Try these in always free Oracle Autonomous Data Warehouse or Transaction Processing databases: https://www.oracle.com/cloud/free/?source=:ow:o:p:nav:0916BCButton&intcmp=:ow:o:p:nav:0916BCButton
Classification• Naïve Bayes• Logistic Regression (GLM)• Decision Tree• Random Forest• Neural Network• Support Vector Machine• Explicit Semantic Analysis
Clustering• Hierarchical K-Means• Hierarchical O-Cluster• Expectation Maximization (EM)
Anomaly Detection• One-Class SVM
Time Series• Forecasting –
Exponential Smoothing• Includes popular models
e.g. Holt-Winters with trends, seasonality, irregularity, missing data
Regression• Linear Model• Generalized Linear Model• Support Vector Machine (SVM)• Stepwise Linear regression• Neural Network• LASSO
Attribute Importance• Minimum Description Length• Principal Comp Analysis (PCA)• Unsupervised Pair-wise KL Div• CUR decomposition for row & AI
Association Rules• A priori/ market basket
Predictive Queries• Predict, cluster, detect, features
SQL Analytics• SQL Windows• SQL Patterns• SQL Aggregates
Feature Extraction• Principal Comp Analysis (PCA)• Non-negative Matrix Factorization• Singular Value Decomposition (SVD)• Explicit Semantic Analysis (ESA)
Text Mining Support• Algorithms support text• Tokenization and theme extraction• Explicit Semantic Analysis (ESA) for
document similarity
Statistical Functions• Basic statistics:
min, max, median, stdev, t-test, F-test, Pearson’s, Chi-Sq, ANOVA, etc.
Model Deployment• SQL—1st Class Objects • Oracle RESTful API (ORDS)• OML Microservices (for Apps)
X1
X2
A1 A2 A3 A4 A5 A6 A7
o STATISTICAL FUNCTIONS
o Descriptive statistics (e.g. median, stdev, mode, sum, etc.)
o Hypothesis testing (t-test, F-test, Kolmogorov-Smirnov test, Mann Whitney test, Wilcoxon Signed Ranks test
o Correlations analysis(parametric and nonparametric e.g.Pearson’s test for correlation, Spearman's rho coefficient, Kendall's tau-b correlation coefficient)
o Ranking functions
o Cross Tabulations with Chi-square statistics
o Linear regression
o ANOVA (Analysis of variance)
o Test Distribution fit (e.g., Normal distribution test,Binomial test, Weibull test,Uniform test, Exponential test,Poisson test)
o Statistical Aggregates(min, max, mean, median, stdev, mode, quantiles, plus x sigma, minus x sigma, top n outliers, bottom n outliers)
o ANALYTICAL SQL
o SQL Windows
o SQL Aggregate functions
o LAG/LEAD functions
o SQL for Pattern Matching
o Additional approximate query processing: APPROX_COUNT, APPROX_SUM, APPROX_RANK
o Regular Expressions
Statistical Functions and Analytical SQL
In-database ML: Solution to one of the biggest problems
One of the major challenges with ALL data-drive, data-intensive applications is getting the data into the database!
Time consumingSecurity / quality riskOptimization & performance challenges … and others
The way Oracle natively includes of ML algorithmsPerform parallel ML directly in Data Warehouse or OLTP for fast model building and real time scoring of new dataKeep data secure by avoiding copy contagion – copies of data in analytics systems are a common source of data breaches
Obstacles to adoption of ML in database have been removedAbility to call out to latest R and Python algorithmsFor example accessible via SQL, REST API, Oracle R Enterprise and Python
Traditional approaches to building models are cumbersome
Modelbuilding
Scoring
Data extraction
Data exploration and transformation
Model evaluation
Modeldeployment
o There is a better way
§ Optimized algorithms§ Models produced faster and with less effort§ Better data protection and security§ Faster access to data
In-database machine learning is better
Modelbuilding
Modeldeployment
Data extraction
Data exploration and transformation
Model evaluation
Scoring
In-database machine learning
• Parallel, distributed, in-DB execution• Improved memory utilization for
greater scalability• Wide range of algorithms• Flexible model building
SQL notebook R Data Miner (no coding)Python
• Data stays secured in database• Deploy as REST API available
Machine Learning on Data Platforms
OCI Data Science Service Bring Your Own Analytics
Approaches to building ML models
Build ML on databaseManage projects and collaboratively build, deploy,
manage models
Architect your own platform with
open source
Business user friendly approach to ML
• Large amounts of data in database?
• Governance / security?
• Skillset?
Build models in a notebook
Oracle Machine Learning notebooks
o Included in Autonomous Data Warehouse
o Familiar notebook for data scientists and collaborators
o Easy sharing of notebooks and templates with permissions, versioning, and execution scheduling
o In-database machine learning algorithms and analytics functions via SQL, and later Python and R
o Secure access to data in the database
Machine learning is better in the cloudThis is what Oracle Cloud offers:
High performance compute
§ Flexible shapes from VMs to Bare Metal
§ 40% better performance than other clouds
§ 30% cheaper GPUs
Scalable, independent storage
§ NVMe SSD, Object, Block and Archive Storage
§ 50% faster HDFS performance§ 90% lower Data Lake costs with
Data Tiering
Guaranteed network performance
§ Non-over subscribed, flat 25 Gbps bandwidth between nodes
§ Only cloud with Network performance SLA
§ No egress charges
The Evolution of the Database Developer/DBA Role
o Data Engineero Architecture, o “data wrangler”
Machine Learning Solving data-driven problemsDiscovering insights Making predictions
Data SecurityData classification,Data life-cycle mgmt
Application TuningSQL tuning, connection mgmt
o Data remains in Database
• Model building and scoring occur in-database• Leverage investment in Oracle IT
o Eliminate data duplication• Eliminate separate analytical servers
o Deliver enterprise-wide “predictive” applications
Savings with in-database Machine Learning
Don’t Move the Data!Traditional ML
Hours, Days or Weeks
Data Extraction
Data Prep & Transformation
Model Building
Model “Scoring”
Data Prep. &Transformation
Data Import
avings
Model “Scoring”Embedded Data Prep
Data Preparation
Model Building
Oracle Machine Learning
Seconds, Minutes or Hours
.
SELECT SUBSTR(cust_income_level, 1, 22) income_level,AVG(DECODE(cust_gender, 'M', amount_sold, null)) sold_to_men,AVG(DECODE(cust_gender, 'F', amount_sold, null)) sold_to_women,STATS_T_TEST_INDEPU(cust_gender, amount_sold, 'STATISTIC', 'F') t_observed,STATS_T_TEST_INDEPU(cust_gender, amount_sold) two_sided_p_value
FROM customers c, sales sWHERE c.cust_id = s.cust_idGROUP BY ROLLUP(cust_income_level)ORDER BY income_level, sold_to_men, sold_to_women, t_observed;
SQL Statistical Functions
STATS_T_TEST_INDEPU (SQL) Example; P_Values < 0.05 show statistically significantly differences in the amounts purchased by men vs. women
Compare AVG Purchase Amounts Men vs. Women Grouped_By INCOME_LEVELSimple SQL Syntax—Statistical Comparisons (t-tests)
o BEGINo DBMS_DATA_MINING.CREATE_MODEL(o model_name => 'BUY_INSUR1',o mining_function => dbms_data_mining.classification,o data_table_name => 'CUST_INSUR_LTV',o case_id_column_name => 'CUST_ID',o target_column_name => 'BUY_INSURANCE',o settings_table_name => 'CUST_INSUR_LTV_SET');o END;o /
Simple SQL Syntax—Classification ModelOML for SQL Model Build & Apply
Select prediction_probability(BUY_INSUR1, 'Yes'USING 3500 as bank_funds, 825 as checking_amount, 400 as credit_balance, 22 as age, 'Married' as marital_status, 93 as
MONEY_MONTLY_OVERDRAWN, 1 as house_ownership)from dual;
ML Model Build (PL/SQL)
Model Apply (SQL query)
.
o BEGINo DBMS_DATA_MINING.CREATE_MODEL(o model_name => 'BUY_INSURANCE_AI',o mining_function => DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE,o data_table_name => 'CUST_INSUR_LTV',o case_id_column_name => 'cust_id',o target_column_name => 'BUY_INSURANCE',o settings_table_name => 'Att_Import_Mode_Settings');o END;o /
Simple SQL Syntax—Attribute ImportanceOML for SQL Model Build
SELECT attribute_name, explanatory_value, rankFROM BUY_INSURANCE_AIORDER BY rank, attribute_name;
ML Model Build (PL/SQL)
Model Results (SQL query)
ATTRIBUTE_NAME RANK ATTRIBUTE_VALUEBANK_FUNDS 1 0.2161MONEY_MONTLY_OVERDRAWN 2 0.1489N_TRANS_ATM 3 0.1463N_TRANS_TELLER 4 0.1156T_AMOUNT_AUTOM_PAYMENTS 5 0.1095
A1 A2 A3 A4 A5 A6 A7
o Collaborative UI o Based on Apache Zeppelino Supports data scientists, data analysts,
application developers, DBAso Easy sharing of notebooks and templates o Permissions, versioning, and execution scheduling
o Included with Autonomous Databaseo Automatically provisioned, managed, backed upo In-database SQL algorithms and analytics functions o Supports deployment of ML analytics via ORDS
(Oracle REST Data Services)o Soon to be augmented with Python and R
Autonomous Database + Machine Learning PlatformOracle Machine Learning Notebooks
Demo ML in Databaseo Using Oracle Autonomous Database