5 imdb talk martin faust
TRANSCRIPT
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 1/30
In-Memory Data Management
Martin Faust
Research Assistant
Research Group of Prof. Hasso Plattner
Hasso Plattner Institute for Software EngineeringUniversity of Potsdam
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 2/30
Agenda
2
1. Changed Hardware
2. Advances in Data Processing
3. Todays Enterprise Applications
4. The In-Memory Data Management for
Enterprise Applications
5. Impact on Enterprise Applications
6. Influence on Processes
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 3/30
All Areas have to be taken into account
3
Changed
Hardware
Advances in
data processing
(software)
ComplexEnterprise Applications
Our focus
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 4/30
Why a New Data Management?
!
DBMS architecture has not changed over decades
! Redesign needed to handle the changes in:
"
Hardware trends (CPU/cache/memory)
"
Changed workloads" Data characteristics
"
Data amount
!
Some academic prototypes:MonetDB, C-store, HyPer, HYRISE
! Several database vendors picked up
the idea and have new databases in place
(e.g., SAP, Vertica, Greenplum, Oracle)
4
Buffer pool
Query engine
Traditional DBMS Architecture
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 5/30
A
Changes in Hardware…
! give an opportunity to re-think the assumptions of
yesterday because of what is possible today.
!
Main Memory becomes cheaper and larger
! Multi-Core Architecture
(96 cores per server)
! One blade ~$50.000 =
1 Enterprise Class Server
! Parallel scaling across
blades
! 64 bit address space
!
2TB in current servers
! 25GB/s per core
! Cost-performance ratio
rapidly declining
!
Memory hierarchies
5
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 6/30
In the MeantimeResearch as come up with…
! Column-oriented data organization
(the column-store)
"
Sequential scans allow best bandwidth utilizationbetween CPU cores and memory
" Independence of tuples within columns allows easy
partitioning and therefore parallel processing
!
Lightweight Compression"
Reducing data amount, while..
" Increasing processing speed through late materialization
!
And more, e.g., parallel scan/join/aggregation
6 … several advance in software for processing data
+
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 7/30
Memory Hierarchy
7
Memory Page
Nehalem Quadcore
Core 0 Core 1 Core 2 Core 3
L3 Cache
L2
L1
TLB
Main Memory Main Memory
QPI
Nehalem Quadcore
Core 0Core 1 Core 2 Core 3
L3 Cache
L2
L1
TLB
QPI
L1 Cacheline
L2 Cacheline
L3 Cacheline
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 8/30
Two Different Principles of PhysicalData Storage: Row- vs. Column-Store
!
Row-store:" Rows are stored consecutively
" Optimal for row-wise access (e.g. *)
!
Column-store:
" Columns are stored consecutively
"
Optimal for attribute focused access (e.g. SUM, GROUP BY)! Note: concept is independent from storage type
" But only in-memory implementation allows fast tuplereconstruction in case of a column store
8
DocNum
DocDate
Sold-To
ValueStatus
SalesOrg
Row
4
Row3
Row2
Row1
Row-Store Column-store
+
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 9/30
OLTP- and OLAP-style QueriesFavor Different Storage Patterns
Column StoreRow Store
SELECT *
FROM Sales OrdersWHERE Document Number = ‘95779216’
SELECT SUM(Order Value)FROM Sales Orders
WHERE Document Date > 2009-01-20
Row4
Row3
Row2
Row1
Row4
Row3
Row2
Row1
DocNum
DocDate
Sold-To
ValueStatus
SalesOrg
DocNum
DocDate
Sold-To
ValueStatus
SalesOrg
99
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 10/30
Motivationfor Compression in Databases
10 !
Main memory access is the bottleneck
! Idea: Trade CPU time to compress and decompress data
! Lightweight Compression
!
Lossless! Reduces I/O operations to main memory
! Leads to less cache misses due to more information on a
cache line
!
Enables operations directly on compressed data
! Allows to offset by the use of fixed-length data types
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 11/30
Lightweight Dictionary Encoding forCompression and Late Materialization
!
Store distinct values once in separate mapping table (thedictionary)
! Associate unique mapping key (valueID) for each distinct value
! Store valueID instead of value in attribute vector
! Enables offsetting with bit-encoded fixed-length data types
11
Attribute Vector
RecId ValueId
1
4
2
13
2
4 4
5 3
6 3
7 5
8 4
Dictionary
Inverted Index
………
!3INTELAUG
!4SIEMENSJUL
!5IBMJUN
!5IBMMAY
!4INTELAPR
!2HPMAR
!2ABBFEB
!1INTELJANRecId 1
RecId 2
RecId 3
RecId 4
RecId 5
RecId 6
RecId 7
RecId 8
#
ValueId RecIdList
1 2
2
3
3 5,6
4
1,4,8
5
7
ValueId Value
1 ABB
2 HP
3 IBM
4 Intel
5 SIEMENS
Table
Attribute:Company Name
• Typical compression factor forenterprise software 10
• In financial applications up to 50
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 12/30
Data Modificationsin a Compressed Store
12
Merge Process
Insert/Update Select
(union)
!
Differential Store: two separate in-memory partitions! Read-optimized main partition (ROS)
! Write-optimized delta partition (WOS)
! Both represent the current state of the data
! WOS/Delta as an intermediate storage for several
modifications
! Re-compression costs are shared among all recent
modifications (merge process)
WOS ROS
(asynchronously)
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 13/30
Todays Enterprise Applications
! Enterprise applications have evolved:
not just OLTP vs. OLAP
" Demand for real-time analytics on transactional data
" High throughput analytics ! completely in memory
! Examples
" Available-To-Promise Check – Perform real-time ATP check
directly on transactional data during order entry, withoutmaterialized aggregates of available stocks.
"
Dunning – Search for open invoices interactively instead ofscheduled batch runs.
" Operational Analytics – Instant customer sales analytics
with always up-to-date data.
! Data integration as big challenge (e.g. POS data)
13
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 14/30
Enterprise Workloads are Read-Mostly
14 •
It is a myth that OLTP is write-oriented, and OLAP is read-oriented• Real world is more complicated than single tuple access, lots
of range queries
0 %
10 %
20 %
30 %
40 %
50 %60 %
70 %
80 %
90 %
100%
OLTP OLAP
W o r k l o a d
0 %
10 %
20 %
30 %
40 %
50 %60 %
70 %
80 %
90 %
100%
TPC-C
W o r k l o a d
Select
Insert
Modification
Delete
Write:
Read:
Lookup
Table Scan
Range Select
Insert
Modification
Delete
Write:
Read:
14
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 15/30
Enterprise Data is Typically Sparse
15 •
Enterprise data is wide and sparse• Most columns are empty or have a low cardinality of
distinct values• Sparse distribution facilitates high compression
0%
10%
20%
30%
40%
50%
60%
70%
80%
1 - 32 33 - 1023 1024 - 100000000
13%9%
78%
24%
12%
64%
Number of Distinct Values
Inventory ManagementFinancial Accounting
% o
f C o l u m n s
15
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 16/30
TransactionalData Entry
Sources: Machines,Transactional Apps, UserInteraction, etc.
Text Analytics,
Unstructured Data
Sources: web, social, logs,support systems, etc.
Event Processing,
Stream Data
Sources: machines, sensors,high volume systems
Real-time Analytics,Structured Data
Sources: Reporting, ClassicalAnalytics, planning,simulation
Data
Management
CPUs(multi-Core +
Cache + Memory)
Challenge 1 for Enterprises:Dealing with all Sorts of Data
16
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 17/30
… create different application-specific silos withredundant data that reduce real-time behavior &increase complexity.
ETL
Transactional ANALYTICAL
ANALYTICAL CUBES
RDB on Disk(Tuples)
RDB on Disk(Star Schemas)
Column Store In-Memory
(Fully Cached Result Sets)
Accelerator
Text Processing
Blobs and TextColumns In-Memory
Challenge 2 for Enterprises:Current application architectures…
17
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 18/30
Drawbacks of this Separation
!
Historically, OLTP and OLAP system are separated because
of resource contention and hardware limitations.
But, this separation has several disadvantages:
! OLAP system does not have the latest data
! OLAP system does only have predefined subset of the data
! Cost-intensive ETL process has to keep both systems
in synch
!
There is a lot of redundancy
! Different data schemas introduce complexity for applications
combining sources
1818
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 19/30
Approach
!
Change overall data management systemassumption
"
In-Memory only
" Vertically partitioned (column store)
" CPU-cache optimized
"
Only one optimization objective – main memory
access
! Rethink how enterprise application persistence isbuild
" Single data management system
"
No redundant data, no materialized views, cubes
"
Computational application logic closer to the
database
(i.e. complex queries, stored procedures)
19
Backup
Column
ROW TEXT
In-Memory Column + Row
OLTP + OLAP + Text
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 20/30
Intermezzo
20 !
Hardware advances
" More computing power through multi-core CPU’s
" Larger and cheaper main memory
" Algorithms need to be aware of the “memory wall”
!
Software advances
" Column Stores superior for analytic style queries
" Light-weight compression schemes utilize modern hardware
! Enterprise applications
" Need to execute complex queries in real-time
" One single source of truth is needed
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 21/30
How does it all come together?
21 1. Mixed Workload combining OLTP andanalytic-style queries! Column Stores are best suited for analytic-style queries
! In-memory database enables fast tuple re-construction
! In-memory column store allows aggregation on the fly
2. Sparse enterprise data
! Lightweight compression schemes are optimal
! Increases query execution
! Improves feasibility of in-memory database
3. Mostly read workload! Read-optimized stores provide best throughput
! i.e. compressed in-memory column-store
! Write-optimized store as delta partition
to handle data changes is sufficient
ChangedHardware
Advances indata processing
(software)
ComplexEnterprise Applications
Our focus
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 22/30
SanssouciDB: An In-MemoryDatabase for Enterprise Applications
In-Memory Database (IMDB)
! Data resides permanently
in main memory
! Main Memory is the
primary “persistence”! Still: logging to disk /recovery
from disk
! Main memory access is
the new bottleneck
!
Cache-conscious algorithms/
data structures are crucial
(locality is king)
22
!"#$ !&'()*"+ ,-".& !
/(0
1$"234(+35"33#6& 7"+" 89#3+()*:
;($<=(-"+#-&!&'()*
>&?(6&)*/(00#$0@#'&+)"6&-
7"+""0#$0
AB&)* CD&?B+#($ !&+"."+" @E !"$"0&)
F$+&)G"?& 1&)6#?&3 "$. 1&33#($ !"$"0&'&$+
7#3+)#HB+#($ /"*&)"+ ,-".& !
!"#$ 1+()& 7#GG&)&$+#"-
1+()&
E?+#6& 7"+"
! & ) 0 & I
( - B ' $
I ( - B ' $
I ( ' H # $ & .
I ( - B ' $
I ( - B ' $
I ( - B ' $
I ( ' H # $ & .
I ( - B ' $
F$.&D&3
F$6&)+&.
JHK&?+7"+" LB#.&
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 23/30
Impact onApplication Development
TraditionalIn-Memory Column-Store
Applicationcache
Materializedviews
Prebuiltaggregates
Raw data
" Less caches needed
" No redundant objects
"
No maintenance ofmaterialized views or
aggregates
" Minimized index
maintenance" Data movements are
minimized
2323
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 24/30
24
Impact on Enterprise Applications:Financials as of Today
24
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 25/30
25
Impact on Enterprise Applications:Simplified Financials on In-Memory DB
!
Only base tables, algorithms, and some indexes!
Reduces complexity
!
Lowers TCO
! While adding more flexibility, integration, and functionality
25
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 26/30
IMDB Conclusion
26!
In-memory column stores are better suited as database
management system (DBMS) for enterprise applications
than conventional DBMS
" In-memory column stores utilizes modern hardware optimally
"
Several data processing techniques leverage in-memory only
data processing
! Enterprise applications show specific characteristics:
" Sparsely filled data tables
"
Complex read-mostly workload
! Real-world experiences have proven the feasibility
of the in-memory column-store
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 27/30
Challenge the status quo
27!
Don’t ask: Where Do we currently have performance
problems
! Systems are productive with majority of performance issues
addressed through user training, process adaptions,
workarounds …
! People are used to situations and have adapted
! Don’t only think incremental
!
Only making things faster will not create the full potential
! Design new processes around the assumption of
‘available information’
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 28/30
Ask indirectly (1/2)
28!
Where did we define processes around performance
! Where did we train people
!
What did we forbid/disable to do
!
Where do we have workaround processes!
Where do we have exceptions driven processes deriving
from lack of information
! Where do we have the most change requests in reporting
!
Which batch processes run multiple times a day! Where would I love to drill down from reporting into
transactions
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 29/30
Ask indirectly (2/2)
29!
Where do my users work with the pattern data gathering,
intermediate storage in Excel
! Where have you simplified the data model
! Automation of information worker tasks
! Which processes include massive re-work due to data
latency
! Which processes are too slow
! Which processes have not been implemented due to
performance requirements & data quantity
! Which processes require variations of existing aggregates
8/10/2019 5 IMDB Talk Martin Faust
http://slidepdf.com/reader/full/5-imdb-talk-martin-faust 30/30
Thanks!
Questions?
30
Martin FaustHasso Plattner Institute
Martin Faust@hpi uni-potsdam de