october 2016 hug: architecture of an open source rdbms powered by hbase and spark 

Open Source RDBMS For Mixed Operational and Analytical Workloads. Monte Zweben, 3/30/22. Splice Machine Proprietary and Confidential.

Upload: yahoo-developer-network

Post on 11-Jan-2017

Category: Technology


TRANSCRIPT

Page 1: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Splice Machine Proprietary and Confidential

Open Source RDBMS For Mixed Operational and Analytical Workloads

Monte Zweben

May 1, 2023

Page 2: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Who We Are

The Open Source RDBMS Powered By Hadoop & Spark

SQL, Scale Out, Speed

• ANSI SQL: No retraining or rewrites for SQL-based analysts, reports, and applications
• ¼ the Cost: Scales out on commodity hardware
• Transactions: Ensure reliable updates across multiple rows
• Mixed Workloads: Simultaneously support OLTP and OLAP workloads
• Elastic: Increase scale in just a few minutes
• 10x Faster: Leverages Spark in-memory technology

Page 3: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

DECISIONS IN THE MOMENT

• Life Sciences
• Digital Marketing
• Financial Services
• Supply Chain Optimization

Page 4: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Today’s Reality: Stale Data, Backward-Looking Decisions

How old is the data in your reports?

1 day +

1 day

4 hours +

1 hour +

Real-time

Page 5: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Today’s Reality: Stale Data, Backward-Looking Decisions

How old is the data in your reports?

[Chart: share of respondents by data age. Categories: 1 day +, 1 day, 4 hours +, 1 hour +, Real-time. Values in extraction order: 24%, 50%, 7%, 9%, 9%]

* Source: Webinars on 11-3-15 and 12-10-15, 237 respondents

Page 6: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Legacy ETL Architectures Unable to Keep Up

[Diagram labels: OLTP Systems (ERP, CRM, Supply Chain, HR); ODS; Stream or Batch Updates; Mixed Workload Apps; ETL (Extract, Transform, Load); OLAP Systems (Data Warehouse, Datamart); Ad Hoc Analytics; Executive Business Reports; Operational Reports]

Pain
• Separate OLTP & OLAP systems
• Messy ETL “glue”

Why?
• Different workloads
• Different data structures
• Hard to isolate workloads

No longer adequate
• Can’t afford to wait days or hours to analyze data

Page 7: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Recent Approach: Lambda Architecture

Complex to set up and maintain

[Diagram: Speed Layer, Batch Layer, Serving Layer]

Developer Integrates Specialized Compute Engines

Page 8: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

New Approach: Lambda-In-A-Box Architecture

Easy to use with SQL

[Diagram: Speed Layer, Batch Layer, Serving Layer]

SQL Optimizer Selects Pre-Integrated Compute Engines

Page 9: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Simultaneous OLTP & OLAP Workloads

Unique Dual-Engine Architecture isolates workloads

[Diagram: Traditional RDBMSs (bottlenecks, delays) versus Splice Machine (HBase Engine and Spark Engine, workload isolation). Key: OLAP, OLTP]

Page 10: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Simultaneous OLTP & OLAP Workloads

Unique Dual-Engine Architecture isolates workloads

Traditional RDBMSs: As OLAP load rises, OLTP response times increase

Splice Machine: As OLAP load rises, OLTP response times remain flat

[Charts: OLTP response time (y-axis) versus OLAP load (x-axis), one per system]

Page 11: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Power Old and New Applications

Page 12: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Proven Building Blocks: Spark, Hadoop and Derby

Apache Derby
• ANSI SQL-99 RDBMS
• Java-based
• ODBC/JDBC Compliant

Apache HBase/Hadoop
• Auto-sharding
• High availability
• Scalability to 100s of PBs

Apache Spark
• Analytical engine
• Fast, in-memory technology
• Memory resilient to node failure

Page 13: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

HBase: Proven Scale-Out

• Auto-sharding
• Scales with commodity hardware
• Cost-effective from GBs to PBs
• High availability thru failover and replication
• LSM-trees

Page 14: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Apache Spark

Unmatched Performance
• Fastest sort of 1PB of data

Advanced In-Memory Technology
• Spill-to-disk for large datasets
• Resilient against node failures
• Pipelining for computation parallelism

Most Active Apache Community
• Almost 1000 contributors

Extensive Libraries
• Over 140 and growing
• Libraries for machine learning, streaming and graph processing

Page 15: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Splice Machine: Advanced Spark Integration

Innovative, High-Performance RDD Creation
• Fast access to HFiles in HDFS
• Merged with deltas from Memstore
• Avoids slower HBase API

Universal Execution Plan and Byte Code
• Optimizer, plan and code shared across Spark or HBase execution

[Diagram: each physical node runs an HBase Region Server (Region 1 ... Region N, each with a Memstore) and a Spark Worker (RDD 1 ... RDD N), both over HFiles in HDFS]
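
The "HFiles merged with deltas from Memstore" idea can be illustrated with a generic LSM-style merge. This is only a sketch under stated assumptions, not Splice Machine's implementation: the cell layout `(row_key, sequence_number, value)`, the use of `None` as a delete marker, and the function name `merge_latest` are all hypothetical.

```python
import heapq
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical cell layout: (row_key, sequence_number, value).
# A value of None stands for a delete marker (an assumption of this sketch).
Cell = Tuple[str, int, Optional[str]]


def merge_latest(*sorted_runs: Iterable[Cell]) -> Iterator[Tuple[str, str]]:
    """Merge runs sorted by row key and yield the newest live value per key."""
    # Order by key, then newest sequence number first within a key.
    merged = heapq.merge(*sorted_runs, key=lambda cell: (cell[0], -cell[1]))
    last_key = None
    for key, _seq, value in merged:
        if key == last_key:
            continue  # an older version of a key we already handled
        last_key = key
        if value is not None:  # skip rows whose newest version is a delete
            yield key, value


# Two immutable on-disk runs plus newer in-memory deltas.
hfile_1 = [("row1", 1, "a"), ("row3", 2, "c")]
hfile_2 = [("row2", 3, "b")]
memstore = [("row1", 5, "a2"), ("row3", 6, None)]

rows = list(merge_latest(hfile_1, hfile_2, memstore))
print(rows)
```

Here the in-memory update to `row1` overrides the on-disk value and the delete of `row3` hides it, which is the behaviour a reader needs when scanning files directly instead of going through a serving API.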

Page 16: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Splice Machine Architecture

1. Standard install of HBase Cluster (HBase, HDFS, ZooKeeper) with Spark

2. Distribute Splice Machine JAR to each region server

3. Automatically invoke co-processors on each region

[Diagram: each HBase Region Server hosts regions running HBase co-processors (SPLICE PARSER, SPLICE PLANNER, SPLICE OPTIMIZER, and a SPLICE EXECUTOR with Snapshot Isolation and Indexes) over HDFS; each Spark Worker holds RDDs and Executors (Cache, Tasks); alongside are the Spark Master, HMaster and ZooKeeper. Legend: HBase Co-Processor]

Page 17: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Splice Machine: Query Execution

Page 18: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Splice Machine: Query Execution

1. Parse SQL

• Generate Abstract Syntax Tree (AST)

• Bind AST to Transactional Dictionary

Page 19: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Splice Machine: Query Execution

1. Parse SQL

2. Optimize query plan

• Determine access plan (e.g., base table, index), join order and join algorithm using cost-based statistics (e.g., cardinality estimates)

• Unroll nested subqueries

Page 20: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Splice Machine: Query Execution

1. Parse SQL

2. Optimize query plan

3. Generate optimal byte code

Page 21: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Splice Machine: Query Execution

1. Parse SQL

2. Optimize query plan

3. Generate optimal byte code

OLTP Execution on HBase

4a. Execute OLTP query from byte code

5a. Use block cache and bloom filters to optimize data access

6a. Return results

Page 22: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Splice Machine: Query Execution

1. Parse SQL

2. Optimize query plan

3. Generate optimal byte code

OLTP Execution on HBase

4a. Execute OLTP query from byte code

5a. Use block cache and bloom filters to optimize data access

6a. Return results

OLAP Execution on Spark

4b. Generate Spark execution plan

5b. Submit Spark plan with byte code

6b. Fair scheduling of distributed tasks

7b. Generate RDD from HFiles and Memstore

8b. Execute query and return results
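
The split between the "a" and "b" paths can be sketched as a routing decision made after optimization. The deck says the SQL optimizer selects the engine using cost-based statistics but does not state the rule, so the row-count threshold below, the `Plan` type, and `choose_engine` are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """Hypothetical optimized plan carrying a cardinality estimate."""
    sql: str
    estimated_rows: int


# Assumed cutoff; the slides do not give a real value or rule.
OLAP_ROW_THRESHOLD = 20_000


def choose_engine(plan: Plan, threshold: int = OLAP_ROW_THRESHOLD) -> str:
    """Route small, selective plans to HBase (OLTP) and large scans to Spark (OLAP)."""
    return "spark" if plan.estimated_rows > threshold else "hbase"


point_lookup = Plan("SELECT * FROM orders WHERE id = 42", estimated_rows=1)
full_scan = Plan("SELECT region, SUM(total) FROM orders GROUP BY region",
                 estimated_rows=50_000_000)

print(choose_engine(point_lookup))
print(choose_engine(full_scan))
```

Because the same plan and byte code are shared across both engines, a decision like this only changes where the work runs, not how the query is expressed.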

Page 23: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Isolated Resource Management

Isolate Spark & HBase resources through Linux Cgroups

Page 24: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Isolated Resource Management

Isolate Spark & HBase resources through Linux Cgroups

Page 25: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Configurable Spark Resource Management

Prioritize Spark resources between Query, Admin & Import jobs

Custom resource pools through XML
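
One plausible shape for such XML-defined pools is Apache Spark's fair scheduler allocation file. The sketch below generates one with Python's standard library. That Splice Machine uses exactly this file format is an assumption, and the pool names and weights are hypothetical values chosen to mirror the Query, Admin and Import job types on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical pools and weights; a higher weight gets a larger share.
POOLS = {"query": 4, "admin": 2, "import": 1}


def build_allocations(pools: dict) -> str:
    """Build a Spark fair scheduler allocation document as an XML string."""
    root = ET.Element("allocations")
    for name, weight in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        ET.SubElement(pool, "schedulingMode").text = "FAIR"
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = "1"
    return ET.tostring(root, encoding="unicode")


allocations_xml = build_allocations(POOLS)
print(allocations_xml)
```

In stock Spark, a file like this is picked up by setting `spark.scheduler.mode` to `FAIR` and pointing `spark.scheduler.allocation.file` at it; a job then selects a pool through the `spark.scheduler.pool` local property.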

Page 26: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Spark Query Management

Visualization of active and completed queries

Page 27: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Spark Query Management (cont’d)

Visualization of stages for each query, plus kill function

Page 28: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Spark Query Management (cont’d)

Visualization of stages for query plan, plus kill function

Page 29: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Spark Query Management (cont’d)

Detailed metrics for tasks in each stage

Page 30: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Spark Query Management (cont’d)

Page 31: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Working With External Data and Compute Engines

Virtual Table Interface (VTI)
• Execute federated queries against external files, libraries or databases

External Databases
• Use JDBC to access data in DBs such as Oracle and DB2

External Libraries
• Access over 140 Spark libraries for machine learning and streaming

External Files
• Pre-defined or dynamic schema
• Access local FS, HDFS, AWS S3
• Sample query:

MapReduce I/O Formats
• Accept federated queries from MapReduce, Pig, and Hive
• Register Splice Machine schema in HCATALOG
• Merge structured (Splice) and unstructured data in ad-hoc query
• Seamless integration to Hadoop ecosystem

Page 32: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

ANSI SQL-99+ Coverage

• Data types – e.g., INTEGER, REAL, CHARACTER, DATE, BOOLEAN, BIGINT
• DDL – e.g., CREATE TABLE, CREATE SCHEMA, ALTER TABLE, DELETE, UPDATE TABLE
• Predicates – e.g., IN, BETWEEN, LIKE, EXISTS
• DML – e.g., INSERT, DELETE, UPDATE, SELECT
• Query specification – e.g., GROUP BY, HAVING
• SET functions – e.g., UNION, ABS, MOD, ALL, INTERSECT, EXCEPT
• Aggregation functions – e.g., AVG, MAX, COUNT
• String functions – e.g., SUBSTRING, concatenation, UPPER, LOWER, TRIM, LENGTH
• Constraints – e.g., PRIMARY KEY, CHECK, FOREIGN KEY, UNIQUE, NOT NULL
• Conditional functions – e.g., CASE, searched CASE
• Privileges – e.g., privileges for SELECT, DELETE, INSERT, EXECUTE
• Joins – e.g., INNER JOIN, LEFT OUTER JOIN
• Transactions – e.g., COMMIT, ROLLBACK, Snapshot Isolation
• Sub-queries
• Triggers
• User-defined functions (UDFs)
• Views – including grouped views
• Window Functions – e.g., FIRST_VALUE, LAST_VALUE, LEAD, LAG

Page 33: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

High Concurrency, ACID transactions

Required to support OLTP applications

share_quantity              share_price
TIMESTAMP   VALUE           TIMESTAMP   VALUE
T12         4,000           T7          $15.11
T7          2,000           T5          $15.65
T3          5,000           T2          $15.74
T1          3,000           T0          $15.27

[Diagram highlights the “virtual” snapshot seen by a transaction at T6: share_quantity from T3 (5,000) and share_price from T5 ($15.65)]

value_held = share_quantity * share_price

@T6: value_held = 5,000 * $15.65

@T3: value_held = 5,000 * $15.74

• State-of-the-art, distributed snapshot isolation
• Form of Multi-Version Concurrency Control (MVCC)
• Writers do not block readers
• Fast, high concurrency
• Delivers performance for small reads/writes & batch loads
• Extends research from Google Percolator & Yahoo Labs
• Patent pending technology
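
The snapshot reads on this slide can be reproduced with a small multi-version lookup: each transaction sees, for every column, the newest version committed at or before its own timestamp. This is a minimal single-process sketch using the slide's numbers; the helper names `read_as_of` and `value_held` are hypothetical, and the sketch ignores uncommitted writes, write-write conflict detection and distributed timestamp allocation.

```python
from bisect import bisect_right
from decimal import Decimal

# Version histories from the slide: (commit_timestamp, value), oldest first.
SHARE_QUANTITY = [(1, 3000), (3, 5000), (7, 2000), (12, 4000)]
SHARE_PRICE = [
    (0, Decimal("15.27")),
    (2, Decimal("15.74")),
    (5, Decimal("15.65")),
    (7, Decimal("15.11")),
]


def read_as_of(versions, snapshot_ts):
    """Return the newest value committed at or before snapshot_ts."""
    timestamps = [ts for ts, _ in versions]
    idx = bisect_right(timestamps, snapshot_ts)
    if idx == 0:
        raise LookupError("no version visible at this timestamp")
    return versions[idx - 1][1]


def value_held(snapshot_ts):
    """Compute share_quantity * share_price within one consistent snapshot."""
    quantity = read_as_of(SHARE_QUANTITY, snapshot_ts)
    price = read_as_of(SHARE_PRICE, snapshot_ts)
    return quantity * price


print(value_held(6))  # 5,000 * $15.65
print(value_held(3))  # 5,000 * $15.74
```

Later writes at T7 and T12 never disturb the transaction reading at T6, which is why writers do not block readers under this scheme.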

Page 34: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

BI and SQL tool support via ODBC/JDBC

No application rewrites needed

Page 35: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Open Source

Features | Community Edition | Enterprise Edition
Scale-out Architecture, ANSI SQL & Concurrent ACID Transactions | ✓ | ✓
OLAP and OLTP Resource Isolation | ✓ | ✓
Distributed In-Memory Joins, Aggregations, Scans and Groupings | ✓ | ✓
Cost-Based Statistics, Query Optimizer, Management Console | ✓ | ✓
Compaction Optimization | ✓ | ✓
Apache Kafka-enabled Streaming | ✓ | ✓
Virtual Table Interfaces | ✓ | ✓
New Releases and Maintenance Updates | ✓ | ✓
Tutorials, Forums, Videos, Documentation, Community Support | ✓ | ✓
Backup and Restore, Column Access Control | | ✓
Encryption, Kerberos, LDAP Support | | ✓
24/7 Support via Web and Phone | | ✓
Complimentary Account Management Services | | ✓

Page 36: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Try it at scale immediately on AWS Sandbox

• 5 Click Sand Box
• Cluster has full system deployed
• SSH for CLI
• URL to Management Consoles
• Open SQL connection on any node
• Customize template

Page 37: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Community

• Slack channel - #splicecommunity
• Video and code tutorials
• GitHub

Page 38: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Advisory Board

Advisory Board includes luminaries in databases and technology

Roger Bamford
Former Principal Architect at Oracle
Father of Oracle RAC

Mike Franklin
Chair, Dept of Computer Science, UChicago
Director, UC Berkeley AMPLab
Founder of Apache Spark

Marie-Anne Neimat
Co-Founder, Times-Ten Database
Former VP, Database Eng. at Oracle

Ken Rudin
Head of Growth and Analytics for Google Search
Head of Analytics at Facebook

Abhinav Gupta
Co-Founder, Rocket Fuel
Runs 15PB HBase Cluster

Page 39: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

WE ARE HIRING

Page 40: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Seasoned Team

Monte Zweben
Co-Founder & Chief Executive Officer

John Leach
Co-Founder & Chief Technology Officer

St. Louis Hadoop User Group

Krishnan Parasuraman
VP of Sales and Business Development

Eran Pilovsky
Chief Financial Officer

Gene Davis
Co-Founder & VP of Products & Operations

Eric Kalabacos
VP of Customer Solutions

Page 41: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Next Steps

Try Us!

splicemachine.com/get-started

GitHub • Tutorials • Sandbox

Page 42: October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and Spark 

Powering Real-Time Applications & Analytics

Enabling Decisions in the Moment

May 1, 2023