making r enterprise gradefiles.meetup.com/13745382/session 2 - r server in the enterprise.pdf ·...
TRANSCRIPT
Making R Enterprise Gradewith SQL Server 2016
What is R?
Language
Platform
Community
Ecosystem
• A programming language for statistics, analytics, and data science
• A data visualization framework
• Provided as Open Source
• Used by 3M+ data scientists, statisticians and analysts
• Taught in most university statistics programs
• Active and thriving user groups across the world
• CRAN: 8000+ freely available algorithms, test data and evaluation
• Many of these are applicable to big data if scaled
• New and recent graduates prefer it
History of R
20152009200420032000199719951993
Research Project in
New Zealand
Open Source Project
R-Core Group
R-1.0.0 released
R Foundation
First user
group
New York Times article
R-3.2.0 and R Consortium (founded by Microsoft)
R’s popularity continues to outpace alternatives
IEEE Spectrum
A Vast Community of R Users Share Rich Repositories of Pre-Built Solutions
Standing on the Shoulders of Giants
R is a great tool, but the key question for the enterprise is:
how do we operationalize R?
?
Some key operationalisation questions
How can we
embed R-based
analytics into
business
applications like
CRM?
How can we
build R-based
analytics into BI
dashboards and
deploy quickly?
How can we
maintain data
security for R
users?
How do we
allow data
scientists to
share code and
version control?
SQL Server 2016: Everything to operationalize R
The above graphics were published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from Microsoft. Gartner does not endorse any vendor, product or service depicted in its research
publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties,
expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
Consistent experience from on-premises to cloud
In-memory across all workloads
built-in built-inbuilt-in built-inbuilt-in
TPC-H non-clustered results as of 04/06/15, 5/04/15, 4/15/14 and 11/25/13, respectively. http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster
At massive scale
0 14
0 03
34
29
22
15
5
22
6
43
20
69
18
49
3
0
10
20
30
40
50
60
70
80
2010 2011 2012 2013 2014 2015
SQL Server Oracle MySQL SAP HANA TPC-H non-clustered 10TB
Oracle is #5#2
SQL Server
#1
SQL Server
#3
SQL Server
National Institute of Standards and Technology Comprehensive Vulnerability Database update 2/2016
In-database Advanced AnalyticsBuild intelligent applications with SQL Server R Services
Open Source R with in-memory & massive scale - multi-threading and massive parallel processing
Ad
vance
d A
nalytics
NEW
NEW
R built-in to SQL Server
Mission critical OLTP
R built-in to your T-SQL
Real-time operational analyticswithout moving the data
NEW
Most secure database
Least vulnerable 6 yearsLayers of protection
0 2
118
0 14
0 03
69
53 53 55
3429
22
15
5
22
5
2825
29
21
6
13
30
6
16 16
9 8 6
43
20
69
18
49
3
0
10
20
30
40
50
60
70
80
2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
SQL Server Oracle DB2 MySQL SAP HANA
* National Institute of Standards and Technology Comprehensive Vulnerability Database update 10/2015
in a row & most utilized
Protect data
• Always Encrypted
• Transparent data encryption
Monitor activity
• Advanced Threat Analytics
• SQL Server auditing
Control access
• Windows Authentication
• Row-level security
• Dynamic data masking
NEW
NEW
NEW
NEW
Mo
st secu
re d
ata
base
Dramatically simplify HA & DRoperational efficiencies with hybrid cloud
Enhanced AlwaysOn
Missio
n critica
l OLT
P
NEW
Highest performing DW
#1 TPC-H10 TB TPC-H world record*
@ ½ hardware costsUnparalleled scalability
with Windows Server 2016
12TB of memory WS 2016 max cores#1 SAP performance
2-tier 16-processor SAP benchmark
on Superdome X + SQL Server
Hig
hest p
erfo
rmin
g D
W
* As of 4/6/2015 (http://www.tpc.org/3312)
24X TPS boostIn throughput on the
same hardware
Lightning fast queries & reports
End-to-end mobile BI on any device
Reports in minutes not days
PB scale DW in SQL Server
On all mobile devices
built-in built-in
Windows iOS Android HTML5
End
-to-e
nd
mo
bile
BI NEW
NEW
NEW
The Data Science Process withSQL Server 2016
Data scientists transform data into intelligent actionwith SQL Server 2016
Deployment
Dashboards &
Visualizations
Information
Management
Data Stores Machine Learning
and Analytics
Hadoop
HDFS
Data Intelligence Action
People
Automated Systems
Apps
Web
Mobile
Bots
SQL R Stored
ProcedureSQL Server
2016
Data Catalog
SQL Server
Integration
Services
Microsoft R
Client
Power BI
Data
Sources
Apps
Sensors
and
devices
Data
Azure Blob
Store
PolyBase
R Server for
Hadoop
Excel
Reporting
Services
SQL Server
Data Mining
Tools
DeployR
(web-services)
SQL Server
Data Tools
1. Business and
Technology
Planning
2. Plan and Prepare
Data
3. Ingest Data into
Data Platform
4. Explore and
Visualize Data
5. Generate and
Select Features
6. Train/Retrain
Models
7. Deploy
models/analytics
8. Build dashboards
and/or Apps for
consumption
Data Science Process with SQL Server
20163. Ingest Data into
Data Platform
4. Explore and Visualize Data
5. Generate and Select Features
6. Train Models/optimize
analytics
7. Deploy models/analytics
8. Build dashboards and/or Apps for
consumption
• SQL Integration Services• R Stored Procedure• Data Tools• Import/Export wizards
for ad-hoc data• Directly via R• Polybase to Hadoop
• Microsoft R Client• RTVS or RStudio• Remote from any IDE onto Server• SQL Server Data Mining tools for
non-programmers
• DeployR (one-click)
• Stored Procedures (one-click)
• Azure Machine Learning (Cloud)
• Reporting
Services
• Power BI
• Excel
• Bots
• PowerApps
• Azure
LogicApps
Productivity Tools: Visual Studionow including R
• One-click to deploy an R-based SQL stored procedure in RTVS
• For RStudio, we have a package to allow the deployment in a line of R code
• R with Team Services for source code control management
Productivity Tools: Data Ingestfrom ad-hoc to production
ad-hoc production
Productivity Tools: Microsoft R Client
• Works with RTVS or RStudio
• Speeds up computation locally thanks to MKL
• ScaleR package built-in
• Reproducible R Toolkit built-in
• Work with files on your local file system
• Work with ODBC data from SQL
• Comes with DeployR remote package:
• Remote into SQL Server and do R on the Server
• Switch between server and local environments
• Send R jobs to a server (asynchronously)
Productivity Tools: DashboardingPowerBI
• Create Dashboards in PBI Desktop (free)
• Deploy in the Cloud
• Deploy on-premvia SSRS
• Consume R-based SQL stored procedures
• Caveat: As yet, no pagination
Productivity Tools: DashboardingSQL Server Reporting Services
• On-Premise
• Option to deploy in the cloud via PBI
• Paginated reports consuming R-based SQL Stored Procedures
• Custom branding
• Easy drag-and-drop report authoring via Visual Studio or Report Builder
• Auto-parameter generation
Productivity Tools: Excel Integration
• Allow Excel users to consume your R-based Stored Procedures
• Excel users can consume DeployR Web Services
• No R programming knowledge required for the end user
• Analytics self-service across the organisation
Productivity Tools: Build Intelligent AppsMicrosoft PowerApps & SQL Server R Stored Procedures/DeployR Web-Services
Hadoop with SQL Server and R
Hadoop Data Pipeline
ORC, Parquet, Avro, JSON, Text
Model
Prepare
Store
Deploy
Big Data (TB)
Small Data (MB)
Large Data (GB)
PolyBase
sparklyr
sparklyr
SQL Server 2016
Consume
R in Hadoop: Options
Edge Node
HeadNode
Head Node
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
Edge Node
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
Data Node
Data NodeDeployR
Remote
Data Science Workstation
sparklyr
Provides a scalable, T-SQL-compatible query processing framework for combining data from both universes
Access any data
R in Hadoop: Polybase built-in
PolyBase View
Execute T-SQL queries against
relational data in SQL Server and
semi-structured data in Hadoop or
Azure Blob Storage
Leverage existing T-SQL skills and BI
tools to gain insights from different
data stores
Access any data
PolyBase View in SQL Server 2016
R View
An R Stored procedure has an input SQL query that executes T-SQL queries against relational data in SQL Server and semi-structured data in Hadoop or Azure Blob Storage
The result set of that query is then processed by R to produce an analytics result set, model and/or R graphics – this can be done at Scale with ScaleR
The stored procedure can be set to run on a schedule and/or event trigger.
Another mechanism to operationalize R+Hadoop
Access any data
PolyBase and R
Access any data
PolyBase use cases
High Level Architecture
Data Scientists Workstations
Analytics Environment
PolyBase
Hadoop Cluster
From RStudio/RTVS remote to SQL and/or Hadoop Edge using DeployR R package
Continue to use local (workstation) for ad-hoc work
Once process has been tested in analytics environment, we promote to operational environment.
Operational
Summary
• SQL Server 2016 contains all the components to operationalize data science workloads
• With SQL Stored procedures we can leverage other Microsoft products e.g. PowerBI, PowerApps, etc
• Visual Studio now has R built-in
• SQL Server comes with many tools to aid productivity – e.g. import wizards, SSIS, SSRS, etc.
• Flexibility of compute:
• SQL Server
• Hadoop
• Workstation
Appendix
Revolution R Enterprise and SQL
Big data analytics platform Based on open source R
High-performance, scalable, full-featuredStatistical and machine-learning algorithms are performant, scalable, and distributable
Write once, deploy anywhereScripts and models can be executed on a variety of platforms, including non-Microsoft (Hadoop,
Teradata in-DB)
Integration with the R EcosystemAnalytic algorithms accessed via R function with similar syntax for R users (with arbitrary R functions/packages)
Advanced analytics
Data sourceintegration
Parallel external-memoryalgorithm library
Data
Compute context integration
Reso
urce
s
RequestsR scripts +
CRANalgorithms
SQL Server 2016 R integration scenario
ExplorationUse Revolution R Enterprise (RRE) from R integrated development environment (IDE) to analyze large data sets and build predictive and embedded models with compute on SQL Server machine (SQL Server compute context)
OperationalizationDeveloper can operationalize R script/model over SQL Server data by using T-SQL constructs
DBA can manage resources, plus secure and govern R runtime execution in SQL Server
Advanced analytics
Setup and configuration
SQL Server 2016 setup
Install Advanced Analytics with R feature in setup
Install RRO on SQL Server 2016
machine
Install RRE on SQL Server 2016
machine
Optional: Install R packages on SQL
Server 2016 machine
Database configuration
Enable R language extension in
database
Configure path for RRO runtime in
database
Grant EXECUTE EXTERNAL SCRIPT
permission to users
CREATE EXTERNAL EXTENSION [R]
USING SYSTEM LAUNCHER
WITH (RUNTIME_PATH =
'c:\revolution\bin‘)
GRANT EXECUTE SCRIPT ON
EXTERNAL EXTENSION::R
TO DataScientistsRole;
/* User-defined role /
users */
Advanced analytics
Management and monitoring
R runtime usage
Resource governance via resource pool
Monitoring via DMVs
Troubleshooting via XEvents/
DMVs
CREATE RESOURCE POOL R_runtime
FOR EXTERNAL EXTENSION
WITH MAX_CPU_PERCENT = 20,
MAX_MEMORY_PERCENT = 10;
select * from
sys.dm_resource_governor_resouce_po
ols
where name = 'R_runtime';
Advanced analytics
R script usage from SQL Server
Original R script:
IrisPredict <- function(data, model){library(e1071)predicted_species <- predict(model, data)return(predicted_species)
}
library(RODBC)conn <- odbcConnect("MySqlAzure", uid = myUser, pwd = myPassword);Iris_data <-sqlFetch(conn, "Iris_Data");Iris_model <-sqlQuery(conn, "select model from my_iris_model");IrisPredict (Iris_data, model);
Calling R script from SQL Server:
/* Input table schema */create table Iris_Data (name varchar(100), length int, width int);/* Model table schema */create table my_iris_model (model varbinary(max));
declare @iris_model varbinary(max) = (select model from my_iris_model);exec sp_execute_external_script
@language = 'R', @script = 'IrisPredict <- function(data, model){library(e1071)predicted_species <- predict(model, data)return(predicted_species)
}IrisPredict(input_data_1, model);', @parallel = default, @input_data_1 = N'select * from Iris_Data', @params = N'@model varbinary(max)', @model = @iris_modelwith result sets ((name varchar(100), length int, width int, species varchar(30)));
Values highlighted in yellow are SQL queries embedded in the original R script
Values highlighted in aqua are R variables that bind to SQL variables by name
Advanced analytics
CapabilityExtensible in-database analytics, integrated with R,
exposed through T-SQL
Centralized enterprise library for analytic models
Benefits
SQL Server
Analytical engines
Integrate with R
Become fully extensible
Data management layer
Relational data
Use T-SQL interface
Stream data in-memory
Analytics library
Share and collaborate
Manage and deploy
R +
Data Scientists
Business
Analysts
Publish algorithms, interact
directly with data
Analyze through T-SQL,
tools, and vetted algorithms
DBAsManage storage and
analytics together
Summary: R integration and advanced analytics
Advanced analytics
Summary: Analyze your data
Real-time operational analyticsRun analytical workloads concurrently with OLTP workloads through columnstore
indexes
Analysis Services enhancementsMany-to-many support with bi-directional cross filtering, improved performance, new
DAX functions and tooling
Advanced Analytics with RIntegrated R Services, predictive analytics, big data analytics
Summary: Deeper insights across data
Store your dataPolyBase • Traditional and Modern Data Warehouse • JSON, FILESTREAM, RBS
Make it accessibleEnhanced SSIS • Enhanced MDS
Analyze dataReal-time operational analytics • Enhanced SSAS • Advanced Analytics with R
Share insightsMobile Report Publisher • Modern Browser Support (SSRS) • Power BI Connector