5 imdb talk martin faust

30
8/10/2019 5 IMDB Talk Martin Faust http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 1/30 In-Memory Data Management Martin Faust Research Assistant Research Group of Prof. Hasso Plattner Hasso Plattner Institute for Software Engineering University of Potsdam

Upload: vijay-baalu

Post on 02-Jun-2018

222 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 1/30

In-Memory Data Management 

Martin Faust

Research Assistant

Research Group of Prof. Hasso Plattner

Hasso Plattner Institute for Software EngineeringUniversity of Potsdam

Page 2: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 2/30

Agenda

2

1.  Changed Hardware

2.  Advances in Data Processing

3.  Todays Enterprise Applications

4.  The In-Memory Data Management for

Enterprise Applications

5.  Impact on Enterprise Applications

6.  Influence on Processes 

Page 3: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 3/30

All Areas have to be taken into account

3

Changed

Hardware

Advances in

data processing

(software)

ComplexEnterprise Applications

Our focus

Page 4: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 4/30

Why a New Data Management?

DBMS architecture has not changed over decades

!  Redesign needed to handle the changes in:

Hardware trends (CPU/cache/memory)

Changed workloads"  Data characteristics

Data amount 

Some academic prototypes:MonetDB, C-store, HyPer, HYRISE

!  Several database vendors picked up

the idea and have new databases in place

(e.g., SAP, Vertica, Greenplum, Oracle)

4

Buffer pool

Query engine

Traditional DBMS Architecture

Page 5: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 5/30

A

Changes in Hardware…

! give an opportunity to re-think the assumptions of

yesterday because of what is possible today.

Main Memory becomes cheaper and larger

!  Multi-Core Architecture

(96 cores per server)

!  One blade ~$50.000 =

1 Enterprise Class Server

!  Parallel scaling across

blades

!  64 bit address space

2TB in current servers

!  25GB/s per core

!  Cost-performance ratio

rapidly declining

Memory hierarchies 

5

Page 6: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 6/30

In the MeantimeResearch as come up with…

!  Column-oriented data organization

(the column-store) 

Sequential scans allow best bandwidth utilizationbetween CPU cores and memory

"  Independence of tuples within columns allows easy

partitioning and therefore parallel processing

Lightweight Compression" 

Reducing data amount, while..

"  Increasing processing speed through late materialization

And more, e.g., parallel scan/join/aggregation

6 … several advance in software for processing data

Page 7: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 7/30

Memory Hierarchy

7

Memory Page

Nehalem Quadcore

Core 0 Core 1 Core 2 Core 3

L3 Cache

L2

L1

TLB

Main Memory Main Memory

QPI

Nehalem Quadcore

Core 0Core 1 Core 2 Core 3

L3 Cache

L2

L1

TLB

QPI

L1 Cacheline

L2 Cacheline

L3 Cacheline

Page 8: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 8/30

Two Different Principles of PhysicalData Storage: Row- vs. Column-Store

Row-store:"  Rows are stored consecutively

"  Optimal for row-wise access (e.g. *)

Column-store:

"  Columns are stored consecutively

Optimal for attribute focused access (e.g. SUM, GROUP BY)!  Note: concept is independent from storage type

"  But only in-memory implementation allows fast tuplereconstruction in case of a column store

8

DocNum

DocDate

Sold-To

ValueStatus

SalesOrg

Row

4

Row3

Row2

Row1

Row-Store Column-store

Page 9: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 9/30

OLTP- and OLAP-style QueriesFavor Different Storage Patterns

Column StoreRow Store

SELECT *

FROM Sales OrdersWHERE Document Number = ‘95779216’

SELECT SUM(Order Value)FROM Sales Orders

WHERE Document Date > 2009-01-20

Row4

Row3

Row2

Row1

Row4

Row3

Row2

Row1

DocNum

DocDate

Sold-To

ValueStatus

SalesOrg

DocNum

DocDate

Sold-To

ValueStatus

SalesOrg

99

Page 10: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 10/30

Motivationfor Compression in Databases

10 ! 

Main memory access is the bottleneck

!  Idea: Trade CPU time to compress and decompress data

!  Lightweight Compression

Lossless!  Reduces I/O operations to main memory

!  Leads to less cache misses due to more information on a

cache line

Enables operations directly on compressed data

!  Allows to offset by the use of fixed-length data types 

Page 11: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 11/30

Lightweight Dictionary Encoding forCompression and Late Materialization

Store distinct values once in separate mapping table (thedictionary)

!  Associate unique mapping key (valueID) for each distinct value

!  Store valueID instead of value in attribute vector

!  Enables offsetting with bit-encoded fixed-length data types

11

Attribute Vector

RecId  ValueId

4

13

 

2

4  4

5 3

6 3

7 5

8 4

Dictionary

Inverted Index

………

 !3INTELAUG

 !4SIEMENSJUL

 !5IBMJUN

 !5IBMMAY

 !4INTELAPR

 !2HPMAR

 !2ABBFEB

 !1INTELJANRecId 1

RecId 2

RecId 3

RecId 4

RecId 5

RecId 6

RecId 7

RecId 8

ValueId  RecIdList

1  2

3

3  5,6

1,4,8

7

ValueId   Value

1 ABB

2 HP

3 IBM

4 Intel

5 SIEMENS

Table

 Attribute:Company Name

•  Typical compression factor forenterprise software 10

•  In financial applications up to 50

Page 12: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 12/30

Data Modificationsin a Compressed Store

12

Merge Process

Insert/Update Select

(union) 

Differential Store: two separate in-memory partitions!  Read-optimized main partition (ROS)

!  Write-optimized delta partition (WOS)

!  Both represent the current state of the data

!  WOS/Delta as an intermediate storage for several

modifications

!  Re-compression costs are shared among all recent

modifications (merge process) 

 WOS ROS

(asynchronously)

Page 13: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 13/30

Todays Enterprise Applications

!  Enterprise applications have evolved:

not just OLTP vs. OLAP

"  Demand for real-time analytics on transactional data

"  High throughput analytics ! completely in memory 

!  Examples

"  Available-To-Promise Check – Perform real-time ATP check

directly on transactional data during order entry, withoutmaterialized aggregates of available stocks.

Dunning – Search for open invoices interactively instead ofscheduled batch runs.

"  Operational Analytics – Instant customer sales analytics

with always up-to-date data.

!  Data integration as big challenge (e.g. POS data)

13

Page 14: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 14/30

 Enterprise Workloads are Read-Mostly

14 • 

It is a myth that OLTP is write-oriented, and OLAP is read-oriented•  Real world is more complicated than single tuple access, lots

of range queries

0 %

10 %

20 %

30 %

40 %

50 %60 %

70 %

80 %

90 %

100%

OLTP OLAP

    W   o   r    k    l   o   a    d

0 %

10 %

20 %

30 %

40 %

50 %60 %

70 %

80 %

90 %

100%

TPC-C

    W   o   r    k    l   o   a    d

Select

Insert

Modification

Delete

Write:

Read:

Lookup

Table Scan

Range Select

Insert

Modification

Delete

Write:

Read:

14

Page 15: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 15/30

Enterprise Data is Typically Sparse

15 • 

Enterprise data is wide and sparse•  Most columns are empty or have a low cardinality of

distinct values•  Sparse distribution facilitates high compression

0%

10%

20%

30%

40%

50%

60%

70%

80%

1 - 32 33 - 1023 1024 - 100000000

13%9%

78%

24%

12%

64%

Number of Distinct Values

Inventory ManagementFinancial Accounting

   %   o

   f   C  o   l  u  m  n  s

15

Page 16: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 16/30

TransactionalData Entry

Sources: Machines,Transactional Apps, UserInteraction, etc.

Text Analytics,

Unstructured Data

Sources: web, social, logs,support systems, etc.

Event Processing,

Stream Data

Sources: machines, sensors,high volume systems

Real-time Analytics,Structured Data

Sources: Reporting, ClassicalAnalytics, planning,simulation

Data

Management

CPUs(multi-Core +

Cache + Memory)

Challenge 1 for Enterprises:Dealing with all Sorts of Data

16

Page 17: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 17/30

… create different application-specific silos withredundant data that reduce real-time behavior &increase complexity.

ETL

Transactional ANALYTICAL

ANALYTICAL CUBES

RDB on Disk(Tuples)

RDB on Disk(Star Schemas)

Column Store In-Memory

(Fully Cached Result Sets)

Accelerator

Text Processing

Blobs and TextColumns In-Memory

Challenge 2 for Enterprises:Current application architectures…

17

Page 18: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 18/30

Drawbacks of this Separation

Historically, OLTP and OLAP system are separated because

of resource contention and hardware limitations.

But, this separation has several disadvantages:

!  OLAP system does not have the latest data

!  OLAP system does only have predefined subset of the data

!  Cost-intensive ETL process has to keep both systems

in synch

There is a lot of redundancy

!  Different data schemas introduce complexity for applications

combining sources

1818

Page 19: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 19/30

Approach

Change overall data management systemassumption

In-Memory only

"  Vertically partitioned (column store)

"  CPU-cache optimized

Only one optimization objective – main memory

access

!  Rethink how enterprise application persistence isbuild

"  Single data management system

No redundant data, no materialized views, cubes

Computational application logic closer to the

database

(i.e. complex queries, stored procedures)

19

Backup

Column

ROW TEXT

In-Memory Column + Row

OLTP + OLAP + Text

Page 20: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 20/30

Intermezzo

20 ! 

Hardware advances

"  More computing power through multi-core CPU’s

"  Larger and cheaper main memory

"  Algorithms need to be aware of the “memory wall”

Software advances

"  Column Stores superior for analytic style queries

"  Light-weight compression schemes utilize modern hardware

!  Enterprise applications

"  Need to execute complex queries in real-time

"  One single source of truth is needed

Page 21: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 21/30

How does it all come together?

21 1. Mixed Workload combining OLTP andanalytic-style queries!  Column Stores are best suited for analytic-style queries

!  In-memory database enables fast tuple re-construction

!  In-memory column store allows aggregation on the fly

2. Sparse enterprise data

!  Lightweight compression schemes are optimal

!  Increases query execution

!  Improves feasibility of in-memory database

3. Mostly read workload!  Read-optimized stores provide best throughput

!  i.e. compressed in-memory column-store

!  Write-optimized store as delta partition

to handle data changes is sufficient

ChangedHardware

Advances indata processing

(software)

ComplexEnterprise Applications

Our focus

Page 22: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 22/30

SanssouciDB: An In-MemoryDatabase for Enterprise Applications

In-Memory Database (IMDB)

!  Data resides permanently

in main memory

!  Main Memory is the

primary “persistence”!  Still: logging to disk /recovery

from disk

!  Main memory access is

the new bottleneck

Cache-conscious algorithms/

data structures are crucial

(locality is king) 

22

!"#$ !&'()*"+ ,-".& ! 

/(0

1$"234(+35"33#6& 7"+" 89#3+()*:

;($<=(-"+#-&!&'()*

>&?(6&)*/(00#$0@#'&+)"6&-

7"+""0#$0

AB&)* CD&?B+#($   !&+"."+"   @E !"$"0&) 

F$+&)G"?& 1&)6#?&3 "$. 1&33#($ !"$"0&'&$+

7#3+)#HB+#($ /"*&)"+ ,-".& ! 

!"#$ 1+()&   7#GG&)&$+#"-

1+()&

 E?+#6& 7"+"

     !    &    )    0    &     I

    (     -    B    '    $

     I    (     -    B    '    $

     I    (    '     H     #    $    &     .

     I    (     -    B    '    $

     I    (     -    B    '    $

     I    (     -    B    '    $

     I    (    '     H     #    $    &     .

     I    (     -    B    '    $

F$.&D&3

F$6&)+&.

JHK&?+7"+" LB#.&

Page 23: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 23/30

Impact onApplication Development

TraditionalIn-Memory Column-Store

 Applicationcache

Materializedviews

Prebuiltaggregates

Raw data

"  Less caches needed

"  No redundant objects

No maintenance ofmaterialized views or

aggregates

"  Minimized index

maintenance"  Data movements are

minimized

2323

Page 24: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 24/30

24

Impact on Enterprise Applications:Financials as of Today

24

Page 25: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 25/30

25

Impact on Enterprise Applications:Simplified Financials on In-Memory DB

Only base tables, algorithms, and some indexes!

  Reduces complexity

Lowers TCO

!  While adding more flexibility, integration, and functionality

25

Page 26: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 26/30

IMDB Conclusion

26!

  In-memory column stores are better suited as database

management system (DBMS) for enterprise applications

than conventional DBMS

"  In-memory column stores utilizes modern hardware optimally

Several data processing techniques leverage in-memory only

data processing

!  Enterprise applications show specific characteristics:

"  Sparsely filled data tables

Complex read-mostly workload

!  Real-world experiences have proven the feasibility

of the in-memory column-store

Page 27: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 27/30

Challenge the status quo

27!

  Don’t ask: Where Do we currently have performance

problems

!  Systems are productive with majority of performance issues

addressed through user training, process adaptions,

workarounds …

!  People are used to situations and have adapted

!  Don’t only think incremental

Only making things faster will not create the full potential

!  Design new processes around the assumption of

‘available information’

Page 28: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 28/30

Ask indirectly (1/2)

28!

  Where did we define processes around performance

!  Where did we train people

What did we forbid/disable to do

Where do we have workaround processes!

 

Where do we have exceptions driven processes deriving

from lack of information

!  Where do we have the most change requests in reporting

Which batch processes run multiple times a day!  Where would I love to drill down from reporting into

transactions

Page 29: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 29/30

Ask indirectly (2/2)

29!

  Where do my users work with the pattern data gathering,

intermediate storage in Excel

!  Where have you simplified the data model

!  Automation of information worker tasks

!  Which processes include massive re-work due to data

latency

!  Which processes are too slow

!  Which processes have not been implemented due to

performance requirements & data quantity

!  Which processes require variations of existing aggregates

Page 30: 5 IMDB Talk Martin Faust

8/10/2019 5 IMDB Talk Martin Faust

http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 30/30

Thanks!

Questions?

30

Martin FaustHasso Plattner Institute 

Martin Faust@hpi uni-potsdam de