October 2016 HUG: Architecture of an Open Source RDBMS Powered by HBase and Spark
TRANSCRIPT
Splice Machine Proprietary and Confidential
Open Source RDBMS For Mixed Operational and Analytical Workloads
Monte Zweben
May 1, 2023
Who We Are
The Open Source RDBMS Powered By Hadoop & Spark
SQL, Scale Out, Speed
• ANSI SQL: No retraining or rewrites for SQL-based analysts, reports, and applications
• ¼ the Cost: Scales out on commodity hardware
• Transactions: Ensure reliable updates across multiple rows
• Mixed Workloads: Simultaneously support OLTP and OLAP workloads
• Elastic: Increase scale in just a few minutes
• 10x Faster: Leverages Spark in-memory technology
DECISIONS IN THE MOMENT
Life Sciences
Digital Marketing
Financial Services
Supply Chain Optimization
Today’s Reality: Stale Data, Backward-Looking Decisions
24%
50%
7%
9%
9%
* Source: Webinars on 11-3-15 and 12-10-15, 237 respondents
How old is the data in your reports?
1 day +
1 day
4 hours +
1 hour +
Real-time
Legacy ETL Architectures Unable to Keep Up
Ad Hoc Analytics
Executive Business Reports
Operational Reports
ERP
CRM
Supply Chain
HR
…
Data Warehouse
Datamart
Stream or Batch Updates
Mixed Workload Apps
ODS
ETL
OLTP Systems
Extract
Transform
Load
OLAP Systems
Pain
• Separate OLTP & OLAP systems
• Messy ETL “glue”
Why?
• Different workloads
• Different data structures
• Hard to isolate workloads
No longer adequate
• Can’t afford to wait days or hours to analyze data
Recent Approach: Lambda Architecture
Complex to set up and maintain
Speed Layer
Batch Layer
Serving Layer
Developer Integrates Specialized Compute Engines
New Approach: Lambda-In-A-Box Architecture
Easy to use with SQL
Speed Layer
Batch Layer
SQL Optimizer Selects Pre-Integrated Compute Engines
Serving Layer
Simultaneous OLTP & OLAP Workloads
Unique Dual-Engine Architecture isolates workloads
[Figure: Traditional RDBMSs (bottlenecks, delays) vs. Splice Machine (HBase engine and Spark engine, workload isolation). Key: OLAP, OLTP]
[Figure: OLTP response time plotted against OLAP load. Traditional RDBMSs: as OLAP load rises, OLTP response times increase. Splice Machine: as OLAP load rises, OLTP response times remain flat.]
Power Old and New Applications
Proven Building Blocks: Spark, Hadoop and Derby
Apache Derby
• ANSI SQL-99 RDBMS
• Java-based
• ODBC/JDBC Compliant
Apache HBase/Hadoop
• Auto-sharding
• High availability
• Scalability to 100s of PBs
Apache Spark
• Analytical engine
• Fast, in-memory technology
• Memory resilient to node failure
HBase: Proven Scale-Out
• Auto-sharding
• Scales with commodity hardware
• Cost-effective from GBs to PBs
• High availability thru failover and replication
• LSM-trees
Apache Spark
Unmatched Performance
• Fastest sort of 1PB of data
Advanced In-Memory Technology
• Spill-to-disk for large datasets
• Resilient against node failures
• Pipelining for computation parallelism
Most Active Apache Community
• Almost 1000 contributors
Extensive Libraries
• Over 140 and growing
• Libraries for machine learning, streaming and graph processing
Splice Machine: Advanced Spark Integration
Innovative, High-Performance RDD Creation
• Fast access to HFiles in HDFS
• Merged with deltas from Memstore
• Avoids slower HBase API
Universal Execution Plan and Byte Code
• Optimizer, plan and code shared across Spark or HBase execution
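The RDD-creation bullets describe reading a region's persisted HFiles directly and merging them with newer, unflushed edits held in the Memstore. The idea can be sketched roughly as below. This is only an illustration: the data layout (lists of (row_key, timestamp, value) tuples sorted by row key, one version per key per input, None as a delete marker) and the name merge_region_view are assumptions made for the sketch, not Splice Machine's storage format or API.

```python
import heapq


def merge_region_view(hfiles, memstore):
    """Merge sorted HFile rows with Memstore deltas; the newest version of a key wins.

    hfiles:   list of lists of (row_key, timestamp, value), each sorted by row_key
    memstore: list of (row_key, timestamp, value), sorted by row_key
    Assumptions for this sketch: each input holds at most one version per key,
    and a value of None marks a deleted row.
    """
    # Order by key, then newest timestamp first, so the first entry seen
    # for a key is the one a reader should observe.
    def sort_key(entry):
        row_key, timestamp, _ = entry
        return (row_key, -timestamp)

    result = []
    last_key = object()  # sentinel that compares unequal to every real key
    for row_key, timestamp, value in heapq.merge(*hfiles, memstore, key=sort_key):
        if row_key == last_key:
            continue  # an older version of a key that was already handled
        last_key = row_key
        if value is not None:
            result.append((row_key, value))
    return result


# Two HFiles on disk plus in-memory deltas: "a" was deleted, "b" was updated.
hfiles = [
    [("a", 1, "x"), ("c", 1, "z")],
    [("b", 2, "y"), ("c", 3, "z2")],
]
memstore = [("a", 5, None), ("b", 4, "y2")]
rows = merge_region_view(hfiles, memstore)
```

Because every input is already sorted, the merge streams through the data once without loading or re-sorting whole files, which is the kind of saving the slide attributes to bypassing the HBase API.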
[Figure: each physical node runs an HBase Region Server (Regions 1 to N, each with a Memstore) and a Spark Worker (RDDs 1 to N), both over HFiles stored in HDFS]
Splice Machine Architecture
1. Standard install of HBase Cluster (HBase, HDFS, ZooKeeper) with Spark
2. Distribute Splice Machine JAR to each region server
3. Automatically invoke co-processors on each region
[Figure: HMaster and ZooKeeper; HBase Region Servers on HDFS, each with Splice Parser, Splice Planner, Splice Optimizer and, per Region, a Splice Executor (Snapshot Isolation, Indexes); Spark Master and Spark Workers, each with RDDs and Executors holding Tasks and a Cache. Legend: HBase Co-Processor]
Splice Machine: Query Execution
1. Parse SQL
• Generate Abstract Syntax Tree (AST)
• Bind AST to Transactional Dictionary
2. Optimize query plan
• Determine access plan (e.g., base table, index), join order and join algorithm using cost-based statistics (e.g., cardinality estimates)
• Unroll nested subqueries
3. Generate optimal byte code
OLTP Execution on HBase
4a. Execute OLTP query from byte code
5a. Use block cache and bloom filters to optimize data access
6a. Return results
OLAP Execution on Spark
4b. Generate Spark execution plan
5b. Submit Spark plan with byte code
6b. Fair scheduling of distributed tasks
7b. Generate RDD from HFiles and Memstore
8b. Execute query and return results
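The fork between the "a" and "b" paths can be pictured as a decision made on the optimizer's cost estimate. The sketch below is a guess at the shape of such a decision, not the product's actual rule: the row-count threshold, the PlanEstimate fields and the function names are all hypothetical, and the slides do not say which statistic drives the choice.

```python
from dataclasses import dataclass

# Hypothetical cutoff: plans expected to touch more rows than this go to Spark.
OLAP_ROW_THRESHOLD = 10_000


@dataclass
class PlanEstimate:
    """Cost-based estimate for a chosen plan (illustrative fields only)."""
    access_path: str     # e.g. "index" or "base table"
    estimated_rows: int  # cardinality estimate from table statistics


def choose_engine(plan: PlanEstimate, threshold: int = OLAP_ROW_THRESHOLD) -> str:
    """Return "hbase" for short OLTP-style work and "spark" for large scans."""
    return "spark" if plan.estimated_rows > threshold else "hbase"


def remaining_steps(engine: str) -> list:
    """Steps from the slide that follow byte-code generation for each engine."""
    if engine == "hbase":
        return ["4a", "5a", "6a"]
    return ["4b", "5b", "6b", "7b", "8b"]


point_lookup = PlanEstimate(access_path="index", estimated_rows=1)
full_scan = PlanEstimate(access_path="base table", estimated_rows=5_000_000)
lookup_engine = choose_engine(point_lookup)
scan_engine = choose_engine(full_scan)
```

Since steps 1 to 3 are common to both paths, the same generated byte code can be handed to either engine; only the steps returned by remaining_steps differ.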
Isolated Resource Management
Isolate Spark & HBase resources through Linux Cgroups
Configurable Spark Resource Management
Prioritize Spark resources between Query, Admin & Import jobs
Custom resource pools through XML
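The slide does not show the XML itself. One plausible form, assuming the pools follow Apache Spark's fair scheduler allocation file format (an `allocations` root with `pool` elements carrying `schedulingMode`, `weight` and `minShare`), is generated by the sketch below. The pool names mirror the Query, Admin and Import job types on the slide; the weights and minimum shares are made-up values for illustration.

```python
import xml.etree.ElementTree as ET


def build_pools_xml(pools):
    """Build a Spark fair-scheduler style allocation document.

    pools: dict mapping pool name -> (weight, min_share); values here are
    illustrative assumptions, not settings taken from the presentation.
    """
    root = ET.Element("allocations")
    for name, (weight, min_share) in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        ET.SubElement(pool, "schedulingMode").text = "FAIR"
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = str(min_share)
    return ET.tostring(root, encoding="unicode")


# Give interactive queries three times the share of admin jobs.
xml_text = build_pools_xml({"query": (3, 2), "admin": (1, 1), "import": (2, 1)})
print(xml_text)
```

A higher weight gives a pool a proportionally larger share of the cluster when pools compete, while minShare reserves a floor of cores for it.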
Spark Query Management
Visualization of active and completed queries
Spark Query Management (cont’d)
Visualization of stages for each query, plus kill function
Spark Query Management (cont’d)
Visualization of stages for query plan, plus kill function
Spark Query Management (cont’d)
Detailed metrics for tasks in each stage
Spark Query Management (cont’d)
Working With External Data and Compute Engines
Virtual Table Interface (VTI)
• Execute federated queries against external files, libraries or databases
External Databases
• Use JDBC to access data in DBs such as Oracle and DB2
External Libraries
• Access over 140 Spark libraries for machine learning and streaming
External Files
• Pre-defined or dynamic schema
• Access local FS, HDFS, AWS S3
• Sample query:
MapReduce I/O Formats
• Accept federated queries from MapReduce, Pig, and Hive
• Register Splice Machine schema in HCatalog
• Merge structured (Splice) and unstructured data in ad-hoc query
• Seamless integration to Hadoop ecosystem
ANSI SQL-99+ Coverage
• Data types – e.g., INTEGER, REAL, CHARACTER, DATE, BOOLEAN, BIGINT
• DDL – e.g., CREATE TABLE, CREATE SCHEMA, ALTER TABLE, DELETE, UPDATE TABLE
• Predicates – e.g., IN, BETWEEN, LIKE, EXISTS
• DML – e.g., INSERT, DELETE, UPDATE, SELECT
• Query specification – e.g., GROUP BY, HAVING
• SET functions – e.g., UNION, ABS, MOD, ALL, INTERSECT, EXCEPT
• Aggregation functions – e.g., AVG, MAX, COUNT
• String functions – e.g., SUBSTRING, concatenation, UPPER, LOWER, TRIM, LENGTH
• Constraints – e.g., PRIMARY KEY, CHECK, FOREIGN KEY, UNIQUE, NOT NULL
• Conditional functions – e.g., CASE, searched CASE
• Privileges – e.g., privileges for SELECT, DELETE, INSERT, EXECUTE
• Joins – e.g., INNER JOIN, LEFT OUTER JOIN
• Transactions – e.g., COMMIT, ROLLBACK, Snapshot Isolation
• Sub-queries
• Triggers
• User-defined functions (UDFs)
• Views – including grouped views
• Window Functions – e.g., FIRST_VALUE, LAST_VALUE, LEAD, LAG
High Concurrency, ACID transactions
Required to support OLTP applications

share_quantity          share_price
TIMESTAMP  VALUE        TIMESTAMP  VALUE
T12        4,000        T7         $15.11
T7         2,000        T5         $15.65
T3         5,000        T2         $15.74
T1         3,000        T0         $15.27

“Virtual” snapshot for Transaction @T6: share_quantity = 5,000 (T3), share_price = $15.65 (T5)

value_held = share_quantity * share_price
@T6: value_held = 5,000 * $15.65
@T3: value_held = 5,000 * $15.74

• State-of-the-art, distributed snapshot isolation
• Form of Multi-Version Concurrency Control (MVCC)
• Writers do not block readers
• Fast, high concurrency
• Delivers performance for small reads/writes & batch loads
• Extends research from Google Percolator & Yahoo Labs
• Patent pending technology
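The snapshot example on this slide can be reproduced with a small multi-version store. The sketch below shows only the read side of MVCC under simplifying assumptions: every listed version is already committed, a version written at exactly the snapshot timestamp is visible, and the class and function names are invented for illustration. It does not model in-flight transactions or write-write conflict detection.

```python
from bisect import bisect_right
from decimal import Decimal


class VersionedCell:
    """Keeps every committed (timestamp, value) version of one cell."""

    def __init__(self, versions):
        self._versions = sorted(versions, key=lambda v: v[0])
        self._timestamps = [ts for ts, _ in self._versions]

    def read_at(self, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts."""
        idx = bisect_right(self._timestamps, snapshot_ts)
        if idx == 0:
            raise KeyError("no version visible at this snapshot")
        return self._versions[idx - 1][1]


# Versions taken from the slide (T1 -> 1, T3 -> 3, and so on).
share_quantity = VersionedCell([(1, 3000), (3, 5000), (7, 2000), (12, 4000)])
share_price = VersionedCell([
    (0, Decimal("15.27")), (2, Decimal("15.74")),
    (5, Decimal("15.65")), (7, Decimal("15.11")),
])


def value_held(snapshot_ts):
    """Compute share_quantity * share_price as seen by one snapshot."""
    return share_quantity.read_at(snapshot_ts) * share_price.read_at(snapshot_ts)


print(value_held(6))  # 5,000 * $15.65
print(value_held(3))  # 5,000 * $15.74
```

Readers only ever look up versions at or before their own timestamp, so a writer adding a version at T7 or T12 never has to block a transaction reading at T6.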
BI and SQL tool support via ODBC/JDBC
No application rewrites needed
Open Source Features | Community Edition | Enterprise Edition
Scale-out Architecture, ANSI SQL & Concurrent ACID Transactions | ✓ | ✓
OLAP and OLTP Resource Isolation | ✓ | ✓
Distributed In-Memory Joins, Aggregations, Scans and Groupings | ✓ | ✓
Cost-Based Statistics, Query Optimizer, Management Console | ✓ | ✓
Compaction Optimization | ✓ | ✓
Apache Kafka-enabled Streaming | ✓ | ✓
Virtual Table Interfaces | ✓ | ✓
New Releases and Maintenance Updates | ✓ | ✓
Tutorials, Forums, Videos, Documentation, Community Support | ✓ | ✓
Backup and Restore, Column Access Control | | ✓
Encryption, Kerberos, LDAP Support | | ✓
24/7 Support via Web and Phone | | ✓
Complimentary Account Management Services | | ✓
Try it at scale immediately on AWS Sandbox
• 5 Click Sandbox
• Cluster has full system deployed
• SSH for CLI
• URL to Management Consoles
• Open SQL connection on any node
• Customize template
Community
• Slack channel - #splicecommunity
• Video and code tutorials
• GitHub
Advisory Board
Advisory Board includes luminaries in databases and technology
Roger Bamford
• Former Principal Architect at Oracle
• Father of Oracle RAC
Mike Franklin
• Chair, Dept of Computer Science, UChicago
• Director, UC Berkeley AMPLab
• Founder of Apache Spark
Marie-Anne Neimat
• Co-Founder, Times-Ten Database
• Former VP, Database Eng. at Oracle
Ken Rudin
• Head of Growth and Analytics for Google Search
• Head of Analytics at Facebook
Abhinav Gupta
• Co-Founder, Rocket Fuel
• Runs 15PB HBase Cluster
WE ARE HIRING
Seasoned Team
Monte Zweben: Co-Founder & Chief Executive Officer
John Leach: Co-Founder & Chief Technology Officer
St. Louis Hadoop User Group
Krishnan Parasuraman: VP of Sales and Business Development
Eran Pilovsky: Chief Financial Officer
Gene Davis: Co-Founder & VP of Products & Operations
Eric Kalabacos: VP of Customer Solutions
Next Steps
Try Us! splicemachine.com/get-started
GitHub • Tutorials • Sandbox
Powering Real-Time Applications & Analytics
Enabling Decisions in the Moment
May 1, 2023