big data 2.0 - milwaukee big data user group presentation
Big Data 2.0
Milwaukee Big Data Users Group
12.1.2014
DBMS Technology Overview: Goal
• Provide a technology recommendation for serving reporting needs for the next 3 – 5 years
• Explore different technologies for their suitability for a strategic reporting data platform
DBMS Technology Overview: Vendor-agnostic approach
• Vendor-agnostic DBMS technologies evaluated
– Categories
  • RDBMS
    – Row based vs column based
  • In-memory database
    – Row based vs column based
  • NoSQL
    – Document based (disk)
    – Key-value based (IMDG)
    – Graph based (IMDG)
– Criteria
  • Overall design
  • Pros/cons
DBMS Technology Overview: Representative vendor evaluation criteria
• …followed by a quick evaluation of two vendors representing each technology
– Thought leadership
– Market share / # of production customers
– Capacity / scalability
– Functionality
– Expertise availability
– Resilience
– Cost (license, infrastructure & expertise)
– Interface compatibility (drop-in-ability)
DBMS Technology Overview: Open discussion on drop-in-ability
• Re-tooling interfaces is expensive
– Focus is on query/reporting tools (in my evaluation)
– List of possible solutions drastically reduced by this criterion
– SQL compatibility (very important syntactic sugar)
– ACID compliance (dual use technology for OLTP needs)
• A cost-effective, performant, resilient solution that requires interface re-tooling is DOA for my client’s environment
Striking phrases
• Disk is the new tape, memory is the new disk
• IMDGs (in-memory data grids) are increasingly being referred to as Big Data 2.0
RDBMS: row based
• OLAP needs typically serviced by partitioning (row & column)
• 30 years old (proven technology)
• IMDB implementations typically have the same pros/cons, although cost and performance characteristics are different
• Pros
– Great OLTP performance
– Efficient at whole-row operations
• Cons
– Inefficient at data set operations
– Scalability is typically not linear
Row-based
Block    Time       Location  Product  Vendor
Block 1  2/23 0900  IL023     Gown112  ML
Block 2  2/23 0423  OH12      Mask221  123
Block 3  2/24 1543  CN881     Swab993  456
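The row-based layout can be sketched in a few lines of Python. This is only an illustration of the idea, assuming each block holds exactly one row; the function names and the "blocks read" counter are made up for the example and do not describe any particular product.

```python
# Each "block" holds one complete row, as in the row-based layout above.
COLUMNS = ("time", "location", "product", "vendor")
row_blocks = [
    ("2/23 0900", "IL023", "Gown112", "ML"),   # Block 1
    ("2/23 0423", "OH12", "Mask221", "123"),   # Block 2
    ("2/24 1543", "CN881", "Swab993", "456"),  # Block 3
]


def fetch_row(blocks, row_index):
    """Whole-row read: exactly one block is touched."""
    blocks_read = 1
    return dict(zip(COLUMNS, blocks[row_index])), blocks_read


def scan_column(blocks, column):
    """Single-column read: every block is touched to pick out one field."""
    position = COLUMNS.index(column)
    values = [block[position] for block in blocks]
    return values, len(blocks)


row, row_cost = fetch_row(row_blocks, 1)
locations, scan_cost = scan_column(row_blocks, "location")
print(row, "blocks read:", row_cost)
print(locations, "blocks read:", scan_cost)
```

Reading one row costs a single block, while scanning one column costs every block, which is the OLTP-versus-data-set trade-off listed in the pros and cons.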
RDBMS: column based
• Optimized for OLAP needs, since it is built to answer questions on data characteristics
• Great performance on aggregate functions (avg, count, sum, min, max)
• IMDB implementations typically have the same pros/cons, although cost and performance characteristics are different
• Pros
– Aggregate functions are very fast as an entire column can be fetched quickly
– Efficient at data set operations
– Easily compressed, especially for data that is sparsely populated
• Cons
– Inefficient at retrieving many columns of a single row
– Row functions are slower
Column-based
Block    Column    Values
Block 0  Time      2/23 0900  2/23 0423  2/24 1543
Block 1  Location  IL023      OH12       CN881
Block 2  Product   Gown112    Mask221    Swab993
Block 3  Vendor    ML         123        456
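A rough Python sketch of the column-based layout, assuming one block per column. The `discount_code` column is a hypothetical sparse column (not from the slides) added to show why sparsely populated columns compress well; run-length encoding is used here as one simple compression scheme, not as what any given vendor does.

```python
from itertools import groupby

# Each block holds one complete column, as in the column-based layout above.
column_blocks = {
    "time": ["2/23 0900", "2/23 0423", "2/24 1543"],  # Block 0
    "location": ["IL023", "OH12", "CN881"],           # Block 1
    "product": ["Gown112", "Mask221", "Swab993"],     # Block 2
    "vendor": ["ML", "123", "456"],                   # Block 3
}


def count_distinct(blocks, column):
    """Aggregate over one column: only that column's block is read."""
    return len(set(blocks[column])), 1


def fetch_row(blocks, row_index):
    """Whole-row read: one block per column has to be visited."""
    row = {name: values[row_index] for name, values in blocks.items()}
    return row, len(blocks)


def run_length_encode(values):
    """Collapse runs of equal neighbours into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Hypothetical sparse column: mostly empty values.
discount_code = [None, None, None, "SPRING", None, None]
encoded = run_length_encode(discount_code)

print(count_distinct(column_blocks, "location"))
print(fetch_row(column_blocks, 0))
print(encoded)
```

The aggregate reads one block and the whole-row fetch reads four, the mirror image of the row-based case.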
NoSQL: document/XML based
• Focus is typically on sharding strategy as opposed to up-front data modeling (models typically evolve greatly during construction)
• Similar to key-value stores, where values are stored in a standardized structure (although document stores keep metadata as well)
• An example of data in a document database:
– {officeName: "3Pillar Noida", {Street: "B-25", City: "Noida", State: "UP", Pincode: "201301"}}
– {officeName: "3Pillar Timisoara", {Boulevard: "Coriolan Brediceanu No. 10", Block: "B, Ist Floor", City: "Timisoara", Pincode: "300011"}}
– {officeName: "3Pillar Cluj", {Latitude: "40.748328", Longitude: "-73.985560"}}
• Pros
– Not limited to querying by keys (can query inside documents using JSON/XML query mechanisms)
– Maps well to semi-structured or variable structured data
• Cons
– Sharding strategy can be challenging
– Doesn’t support relations (no RI, i.e. referential integrity), as opposed to key value or graph stores
NoSQL: IMDG (common to key-value / graph)
• IMDGs referred to as “Big Data 2.0”
• Host data in memory and distribute across cluster of commodity servers
• Employ an object-oriented data model that provides read/write times << 1 ms
• As data is stored in a virtual memory pool, parallel data computations are easily performed
• As in document databases, focus is on sharding strategy as opposed to up-front physical data modeling
• Majority of implementations utilize JVMs (although a handful of .NET implementations are out there)
• GC, specifically the unpredictability of GC, is a major concern
– Vendors utilize off-heap storage to alleviate this by moving LRU data to off-heap JVMs, relying on high-speed messaging for data transport
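The idea of distributing data across a cluster and computing on it in parallel can be sketched as follows. This is a single-process toy, assuming simple hash partitioning and a hypothetical three-node cluster; the `Grid` class and its methods are invented for illustration, and real IMDG products add replication, rebalancing and network transport.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NODE_COUNT = 3  # hypothetical cluster size


def owner_node(key, node_count=NODE_COUNT):
    """Stable hash, so every client routes a given key to the same node."""
    return zlib.crc32(key.encode("utf-8")) % node_count


class Grid:
    """Toy data grid: one dict per 'node', all held in this process."""

    def __init__(self, node_count=NODE_COUNT):
        self.nodes = [dict() for _ in range(node_count)]

    def put(self, key, value):
        self.nodes[owner_node(key, len(self.nodes))][key] = value

    def get(self, key):
        return self.nodes[owner_node(key, len(self.nodes))].get(key)

    def parallel_sum(self):
        """Each node sums its own partition; partial results are then combined."""
        with ThreadPoolExecutor(max_workers=len(self.nodes)) as pool:
            partials = list(pool.map(lambda part: sum(part.values()), self.nodes))
        return sum(partials)


grid = Grid()
for i in range(100):
    grid.put(f"order-{i}", i)

print(grid.get("order-42"))
print(grid.parallel_sum())
print([len(node) for node in grid.nodes])
```

The choice of hash function is the sharding strategy in miniature: it decides which node owns each key and how evenly the load spreads.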
NoSQL: IMDG (key-value)
• Typically stored as a set of distributable maps
• Pros
– Data distribution is designed from the ground up
– Keys and values are Java (or .NET) objects
– No bias between OLTP and OLAP
• Cons
– Alternate lookup mechanisms require a map with an alternate key (although main data payload can be shared as values are objects that support multiple pointers)
– Expertise is typically harder to find (on characteristics of memory structure behavior at larger sizes)
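The alternate-key point can be illustrated with two maps that point at the same value object. This is a minimal sketch in Python rather than Java; the `Product` class, the `on_hand` field and the assumption that each vendor supplies exactly one product are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Product:
    code: str
    vendor: str
    on_hand: int = 0  # hypothetical field, used to show the shared payload


by_code = {}    # primary map: product code -> Product
by_vendor = {}  # alternate-key map: vendor -> the same Product object


def put(product):
    """Register one object under both keys; the payload is not copied."""
    by_code[product.code] = product
    by_vendor[product.vendor] = product


for code, vendor in [("Gown112", "ML"), ("Mask221", "123"), ("Swab993", "456")]:
    put(Product(code, vendor))

# An update made through one map is visible through the other.
by_vendor["ML"].on_hand = 40
print(by_code["Gown112"])
print(by_code["Gown112"] is by_vendor["ML"])
```

Each extra lookup path costs another map to maintain, but only one copy of the data.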
NoSQL: IMDG (graph)
• Allow a set of nodes (object instances) with dynamic properties (cols/attributes) to be arbitrarily linked to other nodes through edges (associations)
• Each node only knows its adjacent nodes
• As the number of nodes increases, cost of a local hop remains constant
• Whereas an RDBMS is optimized for aggregation, a graph database is optimized for connections
• Fastest growth area in NoSQL in the last year – 250%
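A small sketch of the node/edge model, assuming nothing beyond what the bullets describe: nodes carry free-form properties and know only their own adjacent nodes. The `Node` class, the edge labels (`SUPPLIES`, `STOCKED_AT`) and the `sterile` property are hypothetical names chosen for the example.

```python
class Node:
    """A node with dynamic properties and a list of outgoing edges."""

    def __init__(self, **properties):
        self.properties = dict(properties)
        self.edges = []  # (label, neighbour) pairs

    def link(self, label, other):
        self.edges.append((label, other))

    def neighbours(self, label):
        """Only this node's own edges are inspected, however large the graph is."""
        return [node for edge_label, node in self.edges if edge_label == label]


def hop(start, *labels):
    """Follow a chain of edge labels outward from a starting node."""
    frontier = [start]
    for label in labels:
        frontier = [nxt for node in frontier for nxt in node.neighbours(label)]
    return frontier


vendor = Node(kind="vendor", name="ML")
product = Node(kind="product", name="Gown112")
location = Node(kind="location", name="IL023")
vendor.link("SUPPLIES", product)
product.link("STOCKED_AT", location)
product.properties["sterile"] = True  # properties can be added at any time

stocked = [node.properties["name"] for node in hop(vendor, "SUPPLIES", "STOCKED_AT")]
print(stocked)
```

A hop from `vendor` looks only at that node's edge list, so adding unrelated nodes elsewhere does not change its cost.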
NoSQL: IMDG (graph cont’d)
• 60% of Facebook graph is hosted on one instance of Neo4j
• Pros
– Powerful general purpose (reusable) data model
– Connected data locally indexed
– Easy to query
– Optimized for recursive structures (think BoM)
– Great at use cases with complex relationships (supply chain management)
• Cons
– Sharding strategy is difficult
– Requires re-wiring your brain (think object model instead of data model)
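The bill-of-materials (BoM) case can be sketched as a recursive walk over parent-to-child edges. The bike BoM below is invented sample data, and `explode` is an illustrative helper that assumes the structure has no cycles; a graph database would express the same traversal in its own query language.

```python
# Hypothetical BoM: each part maps to (child part, quantity per parent) edges.
bike_bom = {
    "bike": [("wheel", 2), ("frame", 1)],
    "wheel": [("rim", 1), ("spoke", 32)],
    "frame": [],
    "rim": [],
    "spoke": [],
}


def explode(bom, part, quantity=1, totals=None):
    """Walk the BoM recursively and total the leaf parts that are needed."""
    if totals is None:
        totals = {}
    children = bom.get(part, [])
    if not children:
        # Leaf part: accumulate the quantity multiplied down from the root.
        totals[part] = totals.get(part, 0) + quantity
    for child, per_parent in children:
        explode(bom, child, quantity * per_parent, totals)
    return totals


print(explode(bike_bom, "bike"))
print(explode(bike_bom, "bike", quantity=10))
```

In a relational model the same roll-up needs a self-join per level or a recursive query; here it is a plain walk along the edges.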
Particular vendor evaluations
<<Vendor evaluation.xls>>
Recap
• In addition to normal criteria (scalability, functionality, cost, etc.), drop-in-ability should be considered as well
• Niche technologies are now viable for more mainstream use cases because of falling hardware prices
Questions/Comments
?
Thank you
… for your time
Michael Vogt
Director, Data [email protected]
(o) 414.347.1303 or 312.985.8100
(c) 312.772.4762