Vertica & HPE Big Data
Structured Data Insights
The Vertica Architecture Advantage
#HighPerformanceAnalytics #SeizeTheData
Vertica: Analytics made Actionable
SQL relational database...
– Structured data
– Tables consisting of rows and columns
– Standard: Structured Query Language
– Finding
– Aggregating
– Analyzing
– Joining data from multiple tables
– “Slice & Dice”
– …
– Leverage Tools / Skills
– ODBC, .net, JDBC, Python
– BI, Reporting, ETL (Transformations)
– Ad-Hoc & Discovery
But big and fast!
Designed from scratch for analytics
– Tens of trillions of records (thousands per man, woman, and child in the world)
– Terabytes to Petabytes of data
– Hundreds of computers with tens of thousands of CPUs to crunch the data
– Overnight becomes Hourly / Stream
– Batch becomes Interactive
– Impossible becomes x86 Economical
Leading customers across industries finding answers
– Promotional testing
– Claims analyses
– Patient records analyses
– Clinical data analyses
– Fraud monitoring
– Financial tracking
– Tick data back-testing
– Behavior analytics
– Click stream analyses
– Network analyses
– Customer analytics
– Compliance testing
– Loyalty analysis
– Campaign management
Zynga: Winning analytics in a data-driven culture
Challenge
– Provide near real-time analysis on 40-60 billion rows of data ingested per day for 1,000+ employees
Solution
– HPE Vertica Analytics Platform
Result
– Ability to proactively determine what is analyzable, then structure collected data for fast results from HPE Vertica
– Analytics cluster scales 70 times for both Poker and Words With Friends in their fifth year
– 400-600 A/B tests running concurrently with clear metrics
Accelerating health information with an analytics platform
– Used by an IT healthcare provider's platform to detect how long certain application functions take to run
– 6,000% is the improvement in how long it took to analyze a single client's timers; with HPE Vertica, it now takes only 20 seconds
– Greater scale: prior to HPE Vertica, Cerner was collecting 6 billion timers a month; now it's 10 billion
– 2,000 timers
General-Purpose Data Analytics Framework… which happens to have a SQL interface
Design goals / basic architecture
– SQL, for the ecosystem and knowledge pool
– Clusters of commodity hardware
– Linux, x86, Ethernet
– Software-only solution (for flexibility)
– Special-purpose hardware has poor track record in databases
– Shared-Nothing MPP
– Cheaper, but puts more complexity in the software
– Run large queries many times faster than a legacy DB, load as fast, but feel free to snarl and growl at UPDATEs and DELETEs
– Sorted, compressed column store for cost and speed, no in-place updates
– Smart algorithms, query optimizer, etc.
Architecture & Extensibility
– Load: AVRO, JSON, CSV; File, Stream (Kafka, Spark), S3 / Swift; User Defined Parse / Source (Marketplace)
– Query: ODBC, .NET, JDBC, Native Python
– WOS: consolidates "chatty" writes / updates
– ROS: high-efficiency native columnar storage
– Optimizer: cost based, resource reservation; node and column pruning; stats on external tables
– Execution Engine: distributed, appropriate threads per node; partition pruning, skip within column; multi-phase distributed execution for network savings; SQL, Java (Scala), R user-defined functions
– Storage Access: fault tolerance, HDFS access
Start from how data is stored on disk…
SELECT SUM(volume) FROM trades WHERE symbol = 'HPQ' AND date = '5/13/2011'
Symbol Date Time Price Volume Etc
… … … … … …
HPQ 05/13/11 01:02:02 PM 40.01 100 …
IBM 05/13/11 01:02:03 PM 171.22 10 …
AAPL 05/13/11 01:02:03 PM 338.02 5 …
GOOG 05/13/11 01:02:04 PM 524.03 150 …
HPQ 05/13/11 01:02:05 PM 39.97 40 …
AAPL 05/13/11 01:02:07 PM 338.02 20 …
GOOG 05/13/11 01:02:07 PM 524.02 40 …
… … … … … …
Sorted data
Sort by symbol, date, and time
Symbol Date Time Price Volume Etc
… … … … … …
AAPL 05/13/11 01:02:07 PM 338.02 20 …
AAPL 05/13/11 01:02:03 PM 338.02 5 …
… … … … … …
GOOG 05/13/11 01:02:04 PM 524.03 150 …
GOOG 05/13/11 01:02:07 PM 524.02 40 …
… … … … … …
HPQ 05/13/11 01:02:02 PM 40.01 100 …
HPQ 05/13/11 01:02:05 PM 39.97 40 …
… … … … … …
IBM 05/13/11 01:02:03 PM 171.22 10 …
… … … … … …
Column files
Split into columns
Symbol: …, AAPL, AAPL, …, GOOG, GOOG, …, HPQ, HPQ, …, IBM, …
Date: …, 05/13/11, 05/13/11, …, 05/13/11, 05/13/11, …, 05/13/11, 05/13/11, …, 05/13/11, …
Time: …, 01:02:07 PM, 01:02:03 PM, …, 01:02:04 PM, 01:02:07 PM, …, 01:02:02 PM, 01:02:05 PM, …, 01:02:03 PM, …
Price: …, 338.02, 338.02, …, 524.03, 524.02, …, 40.01, 39.97, …, 171.22, …
Volume: …, 20, 5, …, 150, 40, …, 100, 40, …, 10, …
Etc: …
Compression + RLE
Symbol column (8K distinct values), run-length encoded: GOOG (×18M), HPQ (×22M), IBM (×19M), …
Date column (250 values/yr), run-length encoded within each symbol run: …, 05/13/2011 (×150K), …, 05/13/2011 (×220K), …, 05/13/2011 (×150K), …
Volume column, stored as individual values: …, 22, 150, 40, …, 99, 100, 40, …, 200, 10, 18, …
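The run-length encoding shown above can be sketched in a few lines of plain Python (a toy illustration, not Vertica's on-disk format): on a sorted column, equal values cluster into long runs, so each run collapses to a single (value, count) pair.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs.
    On a sorted column, runs are long and compression is dramatic."""
    return [(v, len(list(run))) for v, run in groupby(values)]

# A sorted symbol column: sorting creates the long runs RLE exploits.
symbols = ["AAPL"] * 4 + ["GOOG"] * 3 + ["HPQ"] * 5
print(rle_encode(symbols))  # [('AAPL', 4), ('GOOG', 3), ('HPQ', 5)]
```

The same idea scales from 12 rows here to the GOOG (×18M) runs on the slide: one pair per run, regardless of run length.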
True column store on disk
[Chart: compares a row store, a row store with blocks organized by column, and a true column store, showing data transferred from storage (or cached), data processed by the CPU, and data needed for the query, across increasing column selectivity for 4 row selectivities]
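Why the sorted, per-column layout pays off can be sketched in plain Python (toy in-memory columns, not Vertica's storage engine): the `SELECT SUM(volume)` query from the earlier slide touches only two of the columns, and because the symbol column is sorted, a binary search finds the matching run without scanning every row.

```python
import bisect

# Columns stored separately, sorted by symbol (as on the slides).
symbol = ["AAPL", "AAPL", "GOOG", "GOOG", "HPQ", "HPQ", "IBM"]
volume = [20, 5, 150, 40, 100, 40, 10]
# price, time, etc. exist on disk but are never read for this query.

def sum_volume(sym):
    """SELECT SUM(volume) FROM trades WHERE symbol = sym:
    binary-search the sorted symbol column for the run of matching
    rows, then read only that slice of the volume column."""
    lo = bisect.bisect_left(symbol, sym)
    hi = bisect.bisect_right(symbol, sym)
    return sum(volume[lo:hi])

print(sum_volume("HPQ"))  # 140
```

A row store would have to read every column of every row; here the price and time columns are never touched at all.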
Clustering/MPP/scale-out
– Parallel design enables distributed storage and workload
– “Active” redundancy
– Automatic replication, failover, and recovery
– Shared-nothing database architecture provides high scalability on clusters of commodity hardware
– Add nodes to achieve optimal capacity and performance
– Lower data center costs, higher density, scale-out
– No specialized nodes
– All nodes are peers
– Query/Load to any node
– Continuous/ real-time load and query
Client network
Private data network (IP)
Nodes are peers: three identical nodes, each with 2 × 12 cores, 256 GB RAM, and 10+ TB of storage
Distributed query execution
– Client connects to a node and issues a query
– Node the client is connected to becomes the initiator node
– Other nodes in the cluster become executor nodes
– Initiator node parses the query and picks an execution plan
– Initiator node distributes query plan to executor nodes
select sum(volume) from fact;
[Diagram: the initiator node distributes the query plan to two executor nodes]
Distributed query execution
– All nodes execute the query plan locally
– Nodes exchange data during aggregation and joins
– Executor nodes send partial query results back to initiator node
– Initiator node aggregates results from all nodes
– Initiator node returns final result to the user
select sum(volume) from trades;
[Diagram: two executor nodes compute partial sums locally and send them to the initiator node, which aggregates them into the final result]
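The initiator/executor flow above can be sketched as a tiny map-reduce in plain Python (hypothetical shard layout; real segmentation is hash-based): each node sums its own shard of the volume column, and only the small partial results cross the network.

```python
# Each "node" holds a shard of the volume column (shared-nothing).
shards = {
    "node1": [100, 40],     # initiator's local data
    "node2": [20, 5, 150],  # executor
    "node3": [40, 10],      # executor
}

def distributed_sum(shards):
    """Each node computes a partial SUM locally; the initiator then
    adds up the partials. Rows never leave their node."""
    partials = {node: sum(rows) for node, rows in shards.items()}
    return sum(partials.values())

print(distributed_sum(shards))  # 365
```

For SUM, the partials are a few integers regardless of shard size, which is why the multi-phase distributed plan saves network bandwidth.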
Transactions
– Vertica offers full ACID (just at low TPS)
– Queries take a snapshot of the relevant list of files, and need no locks at READ COMMITTED isolation
– Loads do not conflict with each other
– COMMIT – keep the new files
– ROLLBACK – discard them
– Table level locks for SERIALIZABLE
– Database is essentially its own undo/redo log
– Recovery can be as simple as file copies
*All Operations are on-line
[Diagram: segments A, B, C, D distributed across peer nodes, each segment held in two copies (A/B, B/C, C/D, D/A); changes propagate between the replicas]
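The "snapshot of the relevant list of files" model above can be sketched in plain Python (hypothetical class and method names; Vertica's storage is far more elaborate): committed loads add immutable files, a query snapshots the current file list, and later commits or rollbacks never disturb a running query, so reads need no locks.

```python
class Table:
    """Toy snapshot-isolation sketch: data lives in immutable files."""
    def __init__(self):
        self.files = []                 # committed, immutable files

    def load(self, rows):
        return list(rows)               # new file, invisible until commit

    def commit(self, new_file):
        self.files = self.files + [new_file]  # COMMIT: keep the new file

    def rollback(self, new_file):
        pass                            # ROLLBACK: simply discard it

    def snapshot(self):
        return list(self.files)         # a query's stable view

t = Table()
t.commit(t.load([1, 2]))
snap = t.snapshot()                     # a query starts here
t.commit(t.load([3]))                   # a concurrent load commits
print(sum(sum(f) for f in snap))        # 3: the query is unaffected
print(sum(sum(f) for f in t.files))     # 6: new readers see the load
```

Because nothing is updated in place, recovery really can be "as simple as file copies": the committed files are the database.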
Simple query processing
– Optimal data storage and physical schema
– True columnar, Sorted, Compressed + Encoded
– Segmented, Cosegmented, and Replicated
– Partitioning with Partition Elimination
– Large I/O reads + writes
– Lock-free queries
– Optimized, Vectorized, JIT-compiled code
– Fast data types designed for modern CPUs
– Fast predicate application
– Expression Analysis for sorted/partitioned data
Complex query processing
– Sort, segmentation, and RLE Optimizations for expressions, predicates, aggregation, and joins
– Sophisticated query optimizer designed for columnar query execution
– Subqueries flattened into joins
– Segment data around cluster nodes and CPUs for parallelism
– Two-pass algorithms that are skew-tolerant and reduce reliance on optimizer decisions
– Execution passes and joins are interleaved by the planner/executor, so the most effective strategy is chosen at run time
– Special join implementations for “late materialization,” range lookups, and event series
– Detection and optimization of “Top K” queries
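The "Top K" optimization mentioned above has a simple core idea, sketched here in plain Python (an illustration of the technique, not Vertica's executor): an `ORDER BY … LIMIT k` query doesn't need a full sort; a k-element heap over the input suffices.

```python
import heapq

# ORDER BY volume DESC LIMIT 3 without sorting all N rows:
# a K-element heap gives O(N log K) instead of O(N log N).
volumes = [20, 5, 150, 40, 100, 40, 10]

top3 = heapq.nlargest(3, volumes)
print(top3)  # [150, 100, 40]
```

For large N and small K the saving is substantial, and the heap also bounds memory at K entries instead of N.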
Automatic database physical design: the Vertica Database Designer (DBD)
– Inputs: Schema, Data, Queries/DML
– DBD ("magic")
– Outputs: Segmentation, Sort Order, Compression
Workload management
– Don't want reports to take over the entire system, preventing loads or tactical queries
– Keep some resources (e.g. memory) reserved so that high-priority queries can always begin
– Apply run-time prioritization to manage CPU and I/O
– Resource pools: System, Loader, Web refresh, General
Short query bias
– Two independent queries: A = 60 s, B = 1 s
– Scheduling strategies compared: Sequential, "Linear" Interleave, Short Query Bias, Dynamic prioritization
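The A = 60 s / B = 1 s example above can be worked through with a tiny scheduling sketch in plain Python (hypothetical helper, just back-to-back execution): running the short query first barely delays the long one, while the reverse makes the short query wait almost its entire lifetime.

```python
def completion_times(order, durations):
    """Run jobs back-to-back in the given order and record each
    job's finish time."""
    t, finish = 0, {}
    for job in order:
        t += durations[job]
        finish[job] = t
    return finish

durations = {"A": 60, "B": 1}

# Sequential (A first): the 1 s query finishes at t = 61 s.
print(completion_times(["A", "B"], durations))  # {'A': 60, 'B': 61}
# Short-query bias (B first): B finishes at 1 s; A slips by only 1 s.
print(completion_times(["B", "A"], durations))  # {'B': 1, 'A': 61}
```

This asymmetry is the whole argument for short-query bias: the short query's latency improves 61×, while the long query's worsens by under 2%.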
Q: Are optimizer cost model estimates really that bad?
A: Doesn’t matter!
[Chart: cumulative completion (%) vs. time (s), 0–200 s, comparing Unprioritized vs. Dynamic Priority scheduling]
Analytics platform extensions
– Event Series Extensions
– Sessionization
– Pattern Matching
– Gap Filling and Interpolation
– Event Series Joins
– User-Defined Extensions
– Load source, stream filtering, and parsing
– Scalar functions, aggregates, transforms
– A growing variety of languages to choose from
– Packs/examples for
– Geospatial
– Sentiment
– Data Mining, Logistic Regression, etc.
– Data Variety: Flex Tables, files, integration
– Analytics Packs
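Of the event-series extensions listed above, sessionization has the simplest core logic; here is a plain-Python sketch of the gap-based variant (an illustration of the concept, not Vertica's SQL syntax): a new session starts whenever the time since the previous event exceeds a threshold.

```python
def sessionize(timestamps, gap=30):
    """Gap-based sessionization: assign each event a session id,
    starting a new session when the inter-event gap exceeds `gap`."""
    sessions, session_id, prev = [], 0, None
    for t in sorted(timestamps):
        if prev is not None and t - prev > gap:
            session_id += 1
        sessions.append((t, session_id))
        prev = t
    return sessions

clicks = [0, 5, 12, 100, 105, 400]
print(sessionize(clicks))
# [(0, 0), (5, 0), (12, 0), (100, 1), (105, 1), (400, 2)]
```

Doing this in the database, over data already sorted by user and time, is exactly the kind of work the sorted column store handles in a single pass.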
When not to use Vertica
Vertica is NOT an OLTP system
– Single/few record retrievals are, in theory and in practice, way worse in column stores
– While Vertica is ACID compliant, transaction throughput is in the 10s-100s of TPS
– INSERTs must be batched, or use the COPY command
– UPDATEs, and DELETEs are run serially within a table
– Referential integrity constraints are not enforced
– Instead, use Vertica in conjunction…
– Keep a log of what happened in the OLTP DBMS, or in a NoSQL "eventually consistent" system
Vertica is not for huge numbers of small queries
– Data sets much less than a terabyte may not warrant an analytic database
– Use an in-memory database/tool (Membase, Memcached, etc.) with Vertica to handle large numbers of tiny point queries
Keep the environment simple
– Linux x86 64-bit only
– While they “should work,” use of shared storage, filers, etc., will add cost, add potential bottlenecks, and perplex our support department if anything goes wrong
– It is a bit silly to break machines up into VMs only to stitch them back together with an MPP database, so virtualization is not recommended
– Reasonable network performance is essential
– Loads and some queries may use all-to-all bandwidth
– Do not attempt to span WANs
#SeizeTheData