big data 2.0 - milwaukee big data user group presentation

Big Data 2.0

Milwaukee Big Data Users Group

12.1.2014

DBMS Technology Overview: Goal

• Provide a technology recommendation for serving reporting needs for the next 3–5 years

• Explore different technologies for their suitability for a strategic reporting data platform

DBMS Technology Overview: Vendor-agnostic approach

• Vendor-agnostic DBMS technologies evaluated
  – Categories
    • RDBMS
      – Row based vs column based
    • In-memory database
      – Row based vs column based
    • NoSQL
      – Document based (disk)
      – Key-value based (IMDG)
      – Graph based (IMDG)
  – Criteria
    • Overall design
    • Pros/cons

DBMS Technology Overview: Representative vendor evaluation criteria

• …followed by a quick evaluation of two vendors representing each technology

– Thought leadership

– Market share / # of production customers

– Capacity / scalability

– Functionality

– Expertise availability

– Resilience

– Cost (license, infrastructure & expertise)

– Interface compatibility (drop-in-ability)

DBMS Technology Overview: Open discussion on drop-in-ability

• Re-tooling interfaces is expensive

– Focus is on query/reporting tools (in my evaluation)

– List of possible solutions drastically reduced by this criterion

– SQL compatibility (very important syntactic sugar)

– ACID compliance (dual use technology for OLTP needs)

• A cost-effective, performant, resilient solution that requires interface re-tooling is DOA for my client’s environment

Striking phrases

• Disk is the new tape, memory is the new disk

• In-memory data grids (IMDGs) are increasingly being referred to as Big Data 2.0

RDBMS: row based

• OLAP needs typically serviced by partitioning (row & column)
• 30 years old (proven technology)
• IMDB implementations typically have the same pros/cons, although cost and performance characteristics are different
• Pros
  – Great OLTP performance
  – Efficient at whole-row operations
• Cons
  – Inefficient at data set operations
  – Scalability is typically not linear

Row-based (each block holds all data columns of one row)

Block    Time       Location  Product  Vendor
Block 1  2/23 0900  IL023     Gown112  ML
Block 2  2/23 0423  OH12      Mask221  123
Block 3  2/24 1543  CN881     Swab993  456

RDBMS: column based

• Optimized for OLAP needs, as it is optimized to answer questions on data characteristics
• Great performance on aggregate functions (avg, count, sum, min, max)
• IMDB implementations typically have the same pros/cons, although cost and performance characteristics are different
• Pros
  – Aggregate functions are very fast as an entire column can be fetched quickly
  – Efficient at data set operations
  – Easily compressed, especially for data that is sparsely populated
• Cons
  – Inefficient at retrieving many columns of a single row
  – Row functions are slower

Column-based (each block holds all values of one column)

Block    Column    Values
Block 0  Time      2/23 0900, 2/23 0423, 2/24 1543
Block 1  Location  IL023, OH12, CN881
Block 2  Product   Gown112, Mask221, Swab993
Block 3  Vendor    ML, 123, 456
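The two layouts above can be sketched in a few lines of Python. This is only an illustration of the idea, assuming plain lists and dicts stand in for storage blocks; the function name `to_columns` is made up for this example and does not come from any product.

```python
# Sample rows taken from the row-based diagram on the slide.
rows = [
    {"time": "2/23 0900", "location": "IL023", "product": "Gown112", "vendor": "ML"},
    {"time": "2/23 0423", "location": "OH12", "product": "Mask221", "vendor": "123"},
    {"time": "2/24 1543", "location": "CN881", "product": "Swab993", "vendor": "456"},
]


def to_columns(row_store):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in row_store:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Whole-row access: a single lookup in the row layout...
second_row = rows[1]
# ...but one lookup per column in the column layout.
second_row_from_columns = {name: values[1] for name, values in columns.items()}

# Column scan: a single lookup in the column layout...
locations = columns["location"]
# ...but every row must be touched in the row layout.
locations_from_rows = [row["location"] for row in rows]
```

Both layouts hold the same data; what differs is which access pattern needs one read and which needs many.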

NoSQL: document/XML based

• Focus is typically on sharding strategy as opposed to up-front data modeling (models typically evolve greatly during construction)

• Similar to key-value stores, where values are stored in a standardized structure (although document stores keep metadata as well)

• An example of data in a document database:
  – {officeName:"3Pillar Noida", {Street:"B-25", City:"Noida", State:"UP", Pincode:"201301"}}
  – {officeName:"3Pillar Timisoara", {Boulevard:"Coriolan Brediceanu No. 10", Block:"B, Ist Floor", City:"Timisoara", Pincode:"300011"}}
  – {officeName:"3Pillar Cluj", {Latitude:"40.748328", Longitude:"-73.985560"}}

• Pros
  – Not limited to querying by keys (can query inside documents using JSON/XML query mechanisms)
  – Maps well to semi-structured or variable structured data
• Cons
  – Sharding strategy can be challenging
  – Doesn’t support relations (no RI), as opposed to key value or graph stores
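Querying inside documents, as opposed to querying by key only, can be sketched as below. This is a toy model, not any vendor's query language: the helper names `get_path` and `find` and the `address` key are assumptions added for the example (the slide's sample documents leave the nested object unnamed).

```python
# Documents with varying structure, modeled on the slide's office examples.
# The "address" key is an assumption added so the nested object can be addressed.
offices = [
    {"officeName": "3Pillar Noida",
     "address": {"Street": "B-25", "City": "Noida", "State": "UP", "Pincode": "201301"}},
    {"officeName": "3Pillar Timisoara",
     "address": {"Boulevard": "Coriolan Brediceanu No. 10", "City": "Timisoara",
                 "Pincode": "300011"}},
    {"officeName": "3Pillar Cluj",
     "address": {"Latitude": "40.748328", "Longitude": "-73.985560"}},
]


def get_path(doc, path):
    """Follow a dotted path such as 'address.City'; return None if a step is missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def find(docs, path, value):
    """Return the documents whose field at `path` equals `value`."""
    return [doc for doc in docs if get_path(doc, path) == value]


noida_offices = find(offices, "address.City", "Noida")
```

Documents that lack the queried field are simply skipped, which is what lets variably structured data live in one collection.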

NoSQL: IMDG (common to key-value / graph)

• IMDGs are referred to as “Big Data 2.0”
• Host data in memory and distribute it across a cluster of commodity servers
• Employ an object-oriented data model that provides read/write times << 1 ms
• As data is stored in a virtual memory pool, parallel data computations are easily performed
• As in document databases, focus is on sharding strategy as opposed to up-front physical data modeling
• Majority of implementations utilize JVMs (although a handful of .NET implementations are out there)
• GC, specifically the unpredictability of GC, is a major concern
  – Vendors utilize off-heap storage to alleviate this by moving LRU data to off-heap JVMs, relying on high-speed messaging for data transport
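How data might be spread across a cluster and then computed on in parallel can be sketched as follows. This is a simplified illustration, assuming hash partitioning and in-process dicts in place of real servers; `NODE_COUNT`, `node_for`, `distribute` and `parallel_sum` are hypothetical names, and real IMDG products use their own partitioning schemes.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NODE_COUNT = 3  # hypothetical cluster size


def node_for(key, node_count=NODE_COUNT):
    """Pick a node by hashing the key; crc32 is stable across runs, unlike hash()."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def distribute(entries, node_count=NODE_COUNT):
    """Place each key/value pair into the partition chosen by node_for."""
    nodes = [dict() for _ in range(node_count)]
    for key, value in entries.items():
        nodes[node_for(key, node_count)][key] = value
    return nodes


def parallel_sum(nodes):
    """Sum each partition independently, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda part: sum(part.values()), nodes))
    return sum(partials)


# Illustrative data only: 100 orders with made-up amounts.
orders = {f"order-{i}": i * 10 for i in range(1, 101)}
nodes = distribute(orders)
total = parallel_sum(nodes)
```

Because each partition is summed where it lives and only the partial results are combined, the work scales with the number of nodes; choosing the key to hash on is the sharding decision the bullets refer to.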

NoSQL: IMDG (key-value)

• Typically stored as a set of distributable maps
• Pros
  – Data distribution is designed from the ground up
  – Keys and values are Java (or .NET) objects
  – No bias between OLTP and OLAP
• Cons
  – Alternate lookup mechanisms require a map with an alternate key (although main data payload can be shared as values are objects that support multiple pointers)
  – Expertise is typically harder to find (on characteristics of memory structure behavior at larger sizes)
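The alternate-key point can be illustrated with two maps that reference the same value object. This is a single-process sketch in Python rather than Java, and the `Customer` class, the map names and the sample record are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: int
    email: str
    name: str


# Primary map keyed by id, plus a second map keyed by an alternate key.
by_id = {}
by_email = {}


def put(customer):
    """Register the same object under both keys; the payload is not copied."""
    by_id[customer.customer_id] = customer
    by_email[customer.email] = customer


put(Customer(1001, "alice@example.com", "Alice"))

# Both lookups return the very same object, so an update made through
# one map is visible through the other.
by_email["alice@example.com"].name = "Alice Smith"
name_via_id = by_id[1001].name
```

The cost of the second lookup path is one extra map entry per record, not a second copy of the data. Keeping the two maps consistent on delete or key change is left to the application in this sketch.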

NoSQL: IMDG (graph)

• Allow a set of nodes (object instances) with dynamic properties (cols/attributes) to be arbitrarily linked to other nodes through edges (associations)
• Each node only knows its adjacent nodes
• As the number of nodes increases, cost of a local hop remains constant
• Whereas an RDBMS is optimized for aggregation, a graph database is optimized for connections
• Fastest growth area in NoSQL in the last year – 250%
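The node/edge model can be sketched as below: each node keeps only its own properties and its own list of outgoing edges. The `Node` class, the `hop` helper, the edge labels and the supplier name are hypothetical; the product and location codes are reused from the earlier sample data.

```python
class Node:
    """A graph node with free-form properties and outgoing labeled edges."""

    def __init__(self, name, **properties):
        self.name = name
        self.properties = dict(properties)
        self.edges = []  # list of (label, Node) pairs

    def link(self, label, other):
        """Add a directed, labeled edge from this node to another node."""
        self.edges.append((label, other))

    def neighbors(self, label=None):
        """Return adjacent nodes, optionally restricted to one edge label."""
        return [node for edge_label, node in self.edges
                if label is None or edge_label == label]


def hop(start, *labels):
    """Follow one edge label per hop and return the nodes reached."""
    frontier = [start]
    for label in labels:
        frontier = [n for node in frontier for n in node.neighbors(label)]
    return frontier


supplier = Node("Acme Textiles", country="US")  # hypothetical supplier
gown = Node("Gown112", category="apparel")
plant = Node("IL023", kind="plant")
supplier.link("SUPPLIES", gown)
gown.link("STOCKED_AT", plant)

reached = [n.name for n in hop(supplier, "SUPPLIES", "STOCKED_AT")]
```

Each hop reads only the edge list of the node it is standing on, so its cost depends on that node's degree rather than on the total number of nodes in the graph.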

NoSQL: IMDG (graph cont’d)

• 60% of Facebook graph is hosted on one instance of Neo4j
• Pros
  – Powerful general purpose (reusable) data model
  – Connected data locally indexed
  – Easy to query
  – Optimized for recursive structures (think BoM)
  – Great at use cases with complex relationships (supply chain management)
• Cons
  – Sharding strategy is difficult
  – Requires re-wiring your brain (think object model instead of data model)
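The bill-of-materials case mentioned under Pros is a recursive walk over parent-to-component edges. A minimal sketch, assuming an acyclic BoM held in a plain dict; the `bom` data and the `explode` function are invented for illustration.

```python
# Hypothetical bill of materials: item -> list of (component, quantity per parent).
bom = {
    "bike": [("frame", 1), ("wheel", 2)],
    "wheel": [("rim", 1), ("spoke", 32), ("tire", 1)],
}


def explode(item, quantity=1, totals=None):
    """Recursively total the leaf parts needed to build `quantity` of `item`.

    Assumes the BoM contains no cycles.
    """
    if totals is None:
        totals = {}
    components = bom.get(item)
    if not components:
        # A leaf part: nothing further to break down.
        totals[item] = totals.get(item, 0) + quantity
        return totals
    for child, per_parent in components:
        explode(child, quantity * per_parent, totals)
    return totals


parts_for_one_bike = explode("bike")
```

In a relational model the same question typically needs a self-join per level or a recursive query, whereas here each level is just another hop along the edges.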

Particular vendor evaluations

<<Vendor evaluation.xls>>

Recap

• In addition to normal criteria (scalability, functionality, cost, etc.), drop-in-ability should be considered as well

• Niche technologies are now viable for more mainstream use cases because of falling hardware prices

Questions/Comments

?

Thank you

… for your time

Michael Vogt
Director, Data [email protected]
(o) 414.347.1303 or 312.985.8100
(c) 312.772.4762

