Big Data and the growing relevance of NoSQL
![Page 1: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/1.jpg)
Big Data trends and the rising importance of NOSQL
Abhijit Sharma, Architect,
Innovation & Incubation Lab, BMC Software
![Page 2: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/2.jpg)
Trends in cloud, web, and even enterprise-scale apps
Unprecedented growth in:
- Data set sizes that need to be stored and analyzed
  - Big Data: cloud-scale services generate TBs to PBs (FB, eBay, Digg, Foursquare)
- Connectedness and democratization of data
  - Social networks, feeds, blogs, wikis, tags, semantic web
  - Data APIs: mash up data using the Twitter, FB and Flickr APIs
- Semi-structured or unstructured data
- Performance requirements of these apps
  - Humongous R/W scalability
  - High availability
  - Trading consistency for availability: ACID is not mandatory

![Page 3: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/3.jpg)
RDBMS woes
Challenge: storing and scaling humongous amounts of data while remaining highly available
- Mostly vertical scaling, which has an upper limit and is expensive
- Horizontal scaling: no automatic sharding, no rebalancing, no infrastructure for it
- Distributed transactions and joins (a consequence of normalization) inhibit performance and availability
- Schema-less data models are a poor fit: rigid schema, ALTER TABLE, null columns
- Deeply connected data: not designed for this

![Page 4: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/4.jpg)
NOSQL is NOT
No SQL
The NOSQL Alternative
![Page 5: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/5.jpg)
NOSQL is simply
Not only SQL
The NOSQL Alternative
![Page 6: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/6.jpg)
NOSQL – So what else is it?
- “One size fits all” RDBMS is not working
- NOSQL alternatives are polyglot solutions that better fit the new requirements thrown up by the trends. They can be categorized along these axes:
  - Data model: simple to complex
  - Scalability: single node to horizontal
  - Persistence

![Page 7: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/7.jpg)
NOSQL categories

Graph Databases
- Based on graph theory
- Data model: graph, nodes, edges, properties
- Scalability: single node, high performance
- Persistence: on-disk data structures
- Examples: Neo4J, AllegroGraph

Document Databases
- Based loosely on documents/Lotus Notes
- Data model: collections of documents
- Scalability: horizontal, auto-sharding & replication
- Persistence: B-Tree
- Examples: mongoDB, CouchDB

![Page 8: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/8.jpg)
NOSQL categories

Column Stores
- Based on Google’s BigTable design
- Data model: big table, column families
- Scalability: horizontal, auto-sharding & replication
- Persistence: memory + file (on DFS)
- Examples: HBase, Cassandra

Key Value Stores
- Based on DHT, Amazon’s Dynamo design
- Data model: collection of key value pairs
- Scalability: horizontal, auto-sharding & replication
- Persistence: memory or file
- Examples: Redis, Amazon Dynamo, Voldemort

![Page 9: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/9.jpg)
Graph Databases
![Page 10: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/10.jpg)
Graph oriented data
- Graphs are ubiquitous: social networks, wikis, the web, recommendation engines, etc.
- Deep trees, complex networks
- Graph traversal: apt for expressing graph-related problems (shortest path, network size, etc.)

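
As a rough illustration of the two traversal problems named above, the sketch below runs a breadth-first search over a small in-memory adjacency list. The graph, the names in it, and the function names `shortest_path` and `network_size` are all hypothetical; a graph database would run such traversals against its own persistent store rather than a Python dict.

```python
from collections import deque

# Hypothetical undirected social graph: person -> set of direct connections.
GRAPH = {
    "abhijit": {"sameer", "priya"},
    "sameer": {"abhijit", "ravi"},
    "priya": {"abhijit", "ravi"},
    "ravi": {"sameer", "priya", "meera"},
    "meera": {"ravi"},
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def network_size(graph, start, max_depth):
    """Number of distinct people within max_depth hops of start (start excluded)."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbor in graph[node]:
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return len(depth) - 1
```

Calling `network_size(GRAPH, "abhijit", 2)` counts first- and second-degree connections, which is the kind of "my network size" figure a social site might display.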
![Page 11: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/11.jpg)
LinkedIn Social Graph
![Page 12: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/12.jpg)
Why not RDBMS for large scale graphs?
- Difficult to model and traverse graphs in an RDBMS
- Recursive approaches are slow: SQL queries that span many table joins
- Hacks like storing paths for trees

| node | name    |
|------|---------|
| 1    | abhijit |
| 2    | sameer  |

| from | to |
|------|----|
| 1    | 2  |
| 2    | 3  |

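
To make the join cost concrete, here is a small sketch that loads the two tables into an in-memory SQLite database and walks the graph in SQL. Several details are assumptions made for illustration: the edge columns are renamed `src` and `dst` because `from` and `to` are SQL keywords, node 3 is given the placeholder name `node3` since the slide does not show it, and the function names are hypothetical.

```python
import sqlite3

# Node 3 does not appear in the slide's node table; 'node3' is a placeholder name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE edge (src INTEGER, dst INTEGER);
    INSERT INTO node VALUES (1, 'abhijit'), (2, 'sameer'), (3, 'node3');
    INSERT INTO edge VALUES (1, 2), (2, 3);
""")


def reachable_in_two_hops(conn, start_id):
    """Names of nodes exactly two edges away: one extra self-join per hop."""
    rows = conn.execute(
        """
        SELECT n.name
        FROM edge AS e1
        JOIN edge AS e2 ON e2.src = e1.dst
        JOIN node AS n  ON n.id = e2.dst
        WHERE e1.src = ?
        """,
        (start_id,),
    ).fetchall()
    return [name for (name,) in rows]


def reachable_any_depth(conn, start_id):
    """Ids of all nodes reachable from start_id, using a recursive CTE."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(id) AS (
            SELECT dst FROM edge WHERE src = ?
            UNION
            SELECT e.dst FROM edge AS e JOIN reach AS r ON e.src = r.id
        )
        SELECT id FROM reach ORDER BY id
        """,
        (start_id,),
    ).fetchall()
    return [node_id for (node_id,) in rows]
```

Each additional hop in `reachable_in_two_hops` needs another join on `edge`, which is the pattern the slide describes as slow at scale; the recursive query avoids hand-writing the joins but still repeats the join work at every level.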
![Page 13: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/13.jpg)
Graph Databases
- Designed for efficient storage & traversal of large scale graphs
- Natural modeling of a graph network: nodes, relationships and their properties
- Neo4J is a leading graph db
  - Supports billions of nodes/edges; traverses depths of 1000 levels in ms, 1000x the speed of an RDBMS
  - Handles large graphs that don't fit in memory: persistent transactional store optimized for graphs
  - REST API and various language bindings
  - Graph pattern matching, Cypher query language
  - Indexer: Lucene

![Page 14: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/14.jpg)
Graph basics
![Page 15: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/15.jpg)
All Paths & My Network size
![Page 16: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/16.jpg)
Shortest path between …
![Page 17: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/17.jpg)
Is connected to?
![Page 18: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/18.jpg)
You may know…
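
One common way to produce "you may know" suggestions, shown here only as an assumed approach and not necessarily what any particular site does, is to rank friends-of-friends by the number of mutual connections. The graph and the function name `people_you_may_know` are hypothetical.

```python
from collections import Counter

# Hypothetical undirected social graph: person -> set of direct connections.
FRIENDS = {
    "abhijit": {"sameer", "priya"},
    "sameer": {"abhijit", "ravi"},
    "priya": {"abhijit", "ravi"},
    "ravi": {"sameer", "priya", "meera"},
    "meera": {"ravi"},
}


def people_you_may_know(graph, person):
    """Friends-of-friends who are not yet connections, most mutual friends first."""
    mutual = Counter()
    for friend in graph[person]:
        for candidate in graph[friend]:
            if candidate != person and candidate not in graph[person]:
                mutual[candidate] += 1  # one more shared connection
    # Sort by mutual-friend count (descending), then by name for stable output.
    return sorted(mutual.items(), key=lambda item: (-item[1], item[0]))
```

The traversal is only two hops deep, so in a graph store it touches just the neighborhood of one person rather than scanning a whole relationship table.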
![Page 19: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/19.jpg)
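Mining your network
Centrality algorithms
- Degree – who has the most followers on Twitter
- Closeness – who can reach everyone else in the fewest hops
- Betweenness – who lies on the most shortest paths between other people
- Eigenvector – who is followed by the most influential people (PageRank is a variant)
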
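
Two of these measures are simple enough to sketch directly: a plain follower count (usually called degree centrality) and PageRank computed by power iteration. The follow graph, the damping factor of 0.85, the iteration count and the function names are assumptions chosen for illustration.

```python
# Hypothetical follow graph: user -> set of users they follow.
FOLLOWS = {
    "a": {"c"},
    "b": {"c"},
    "c": {"d"},
    "d": set(),
    "e": {"a"},
}


def follower_counts(follows):
    """In-degree of each user, i.e. how many followers they have."""
    counts = {user: 0 for user in follows}
    for followed in follows.values():
        for target in followed:
            counts[target] += 1
    return counts


def pagerank(follows, damping=0.85, iterations=50):
    """Power-iteration PageRank; rank flows from a follower to those they follow."""
    users = list(follows)
    n = len(users)
    rank = {user: 1.0 / n for user in users}
    for _ in range(iterations):
        new_rank = {user: (1.0 - damping) / n for user in users}
        for user, followed in follows.items():
            if followed:
                share = damping * rank[user] / len(followed)
                for target in followed:
                    new_rank[target] += share
            else:
                # A user who follows nobody spreads their rank evenly.
                share = damping * rank[user] / n
                for target in users:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

In this graph `a` and `d` each have exactly one follower, yet `d` ends up with the higher PageRank because its follower `c` is itself well followed. That is the difference between counting followers and weighing who the followers are.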
![Page 20: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/20.jpg)
Document Databases
![Page 21: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/21.jpg)
Flexible document oriented data
- Document style unstructured data, schema-less, e.g. JSON documents
  - No ALTER TABLE needed as in an RDBMS; de-normalized data
  - Useful for iterative/agile development
- Humongous scale: billions of documents, R/W traffic of millions/sec, horizontal scalability, availability
- mongoDB is a leading document database

![Page 22: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/22.jpg)
Document Database – Use cases
- Archiving of historic data which has undergone many schema changes
- Flexible set of performance metrics (web site page views, unique visitors, etc.) that change over time, with no need to update existing JSON documents
- Tracking near-real-time metrics: optimized increment of perf counters
- Geolocation-based mobile and gaming apps (geospatial indices can be key here)

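
The metrics use cases can be pictured with plain Python dicts standing in for JSON documents. This is only an in-memory sketch, not a database client: the collection is a list, and the field names, dates and the helpers `find` and `increment` are hypothetical.

```python
# A list of dicts stands in for a document collection.
metrics = [
    # An older document: only page views were tracked at the time.
    {"page": "/home", "day": "2012-01-01", "views": 120},
    # A newer document adds a field; the older one is left untouched.
    {"page": "/home", "day": "2012-06-01", "views": 95, "unique_visitors": 40},
]


def find(collection, **criteria):
    """Return documents whose fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


def increment(collection, criteria, field, amount=1):
    """Add amount to a counter field, creating the document or field if missing."""
    matches = find(collection, **criteria)
    if matches:
        doc = matches[0]
    else:
        doc = dict(criteria)
        collection.append(doc)
    doc[field] = doc.get(field, 0) + amount
    return doc
```

Documents with and without `unique_visitors` live side by side, so adding a metric never requires rewriting old data, and `increment` shows the "update the counter in place, create it on first use" pattern behind near-real-time counters.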
![Page 23: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/23.jpg)
Craigslist Archival Database
- Premium service that lets customers search over their historical postings
- Archival (no purging) of 10 years of postings: billions of documents
- Schema changes across versions
- MySQL-based archival database: ALTER TABLE took a month to complete

![Page 24: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/24.jpg)
Foursquare
- Find a venue whose name is Starbucks and whose mayor is Abhijit
- Geo: optimized for geolocation queries, e.g. find Starbucks near my current GPS location

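
Both queries can be sketched over a handful of in-memory venue documents. The venues, coordinates and the helper names `haversine_km` and `near` are hypothetical, and the linear scan below is exactly what a geospatial index is meant to avoid; it is here only to show the shape of the query.

```python
import math

# Hypothetical venue documents; "loc" is (latitude, longitude) in degrees.
VENUES = [
    {"name": "Starbucks", "mayor": "Abhijit", "loc": (18.5204, 73.8567)},
    {"name": "Starbucks", "mayor": "Sameer", "loc": (18.5590, 73.7868)},
    {"name": "Other Cafe", "loc": (18.5300, 73.8500)},
]


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def near(venues, point, max_km, **criteria):
    """Venues matching criteria within max_km of point, nearest first."""
    hits = [
        (haversine_km(venue["loc"], point), venue)
        for venue in venues
        if all(venue.get(field) == value for field, value in criteria.items())
    ]
    hits.sort(key=lambda hit: hit[0])
    return [venue for distance, venue in hits if distance <= max_km]
```

`near(VENUES, my_location, 2.0, name="Starbucks")` expresses "Starbucks near me", while passing `mayor="Abhijit"` as an extra criterion expresses the attribute query from the first bullet.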
![Page 25: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/25.jpg)
mongoDB Architecture
[Diagram: Client, Mongo Routers, Shards and Mongo Configuration Servers]
![Page 26: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/26.jpg)
mongoDB Features
- JSON documents, collection oriented storage
- Rich, document-based queries
- Indexes on document attributes
- Fast in-place updates
- Scalability features
  - Horizontal scalability
  - Configurable replication and high availability
  - Auto-sharding & rebalancing
- Language-specific drivers: Java, Scala, Ruby, etc.

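
The routing side of sharding can be pictured as a lookup from a shard-key value to the shard that owns its key range. The sketch below is a deliberately simplified model and not mongoDB's implementation: the class `RangeRouter`, the split points and the shard names are hypothetical, and a real router obtains its range metadata from the configuration servers rather than from a hard-coded list.

```python
import bisect


class RangeRouter:
    """Routes a shard-key value to a shard using sorted split points."""

    def __init__(self, split_points, shards):
        # N split points divide the key space into N + 1 contiguous ranges.
        if len(shards) != len(split_points) + 1:
            raise ValueError("need exactly one more shard than split points")
        self.split_points = sorted(split_points)
        self.shards = list(shards)

    def shard_for(self, key):
        """Shard owning the range that contains key."""
        index = bisect.bisect_right(self.split_points, key)
        return self.shards[index]


# Keys below "h" go to shard0, "h" up to (not including) "q" to shard1, the rest to shard2.
router = RangeRouter(["h", "q"], ["shard0", "shard1", "shard2"])
```

Rebalancing, in this picture, amounts to moving a split point (or adding one) and migrating the documents whose keys change owner; clients keep calling the router and never need to know which shard holds a document.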
![Page 27: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/27.jpg)
Column Stores
![Page 28: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/28.jpg)
Column Store
- Reasonably rich data model: a sparse, distributed, persistent, multi-dimensional sorted map
  - Sorted row keys, columns
- Use cases: large scale data storage and analysis, such as
  - Time series data along with associated dimension data
    - Row keys are timestamps and thus sorted, which helps time range queries
  - Google Analytics
    - Provides aggregate statistics: # unique visitors/day, page views/URL/day
    - Raw click table (~200 TB) has a row for each URL + user session time, which keeps rows for the same URL contiguous and chronologically sorted
[Figure: data cube of CPU metrics with axes Time, OS and DC]
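
The benefit of sorted row keys for time-series data can be shown with a small in-memory sorted map. This is a sketch of the idea only: the class `SortedTable`, its methods and the sample timestamps and values are hypothetical, and a real column store keeps the sorted keys on distributed storage rather than in a Python list.

```python
import bisect


class SortedTable:
    """Rows kept in row-key order so that key ranges can be scanned cheaply."""

    def __init__(self):
        self._keys = []  # row keys, always sorted
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key, columns):
        """Insert or update a row; new keys are placed at their sorted position."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)

    def scan(self, start_key, stop_key):
        """Yield (key, columns) for start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


# Timestamps as row keys; dimension columns (dc, os) sit next to the metric (cpu).
cpu = SortedTable()
cpu.put(1030, {"dc": "PUNE", "os": "linux", "cpu": 35})
cpu.put(1000, {"dc": "PUNE", "os": "linux", "cpu": 20})
cpu.put(1010, {"dc": "PUNE", "os": "linux", "cpu": 25})
```

A time-range query such as `cpu.scan(1000, 1020)` finds its start position by binary search and then reads neighbouring rows in order, instead of examining every row in the table.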
![Page 29: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/29.jpg)
Column Store
- Performance
  - Excellent R/W performance, large storage (PBs)
  - High scalability: horizontal scaling, auto-sharding
  - High availability: transparent replication of data
- HBase is a leading column store, built on Hadoop with HDFS as the underlying persistence

![Page 30: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/30.jpg)
Column Store - HBase
- A table defines Column Families, which group similar attributes (vertical partitioning)
- A (Table, Row, ColumnFamily:Column, Timestamp) tuple maps to a cell value
- A table is split into multiple evenly distributed regions, each of which is a range of sorted keys (partitioned automatically by the key)
- Rows are ordered by key; columns are ordered within a Column Family
- Rows can have different numbers of columns
- Columns have a value and versions (any number)
- Row range, column range and key range queries

| Row Key     | Column Family (dimensions) | Column Family (metric) |
|-------------|----------------------------|------------------------|
| 112334-7782 | server:host1, dc:PUNE      | value:20               |
| 112334-7783 | server:host2               | value:10               |

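
The cell addressing described above, where a row key, a `family:column` name and a timestamp together identify one value, can be modelled in a few lines. This is an in-memory sketch and not the HBase client API: the class `CellStore`, its methods, the integer timestamps and the second version of `metric:value` are assumptions added for illustration.

```python
from collections import defaultdict


class CellStore:
    """Maps (row, "family:column") to a list of timestamped versions."""

    def __init__(self):
        # (row, column) -> [(timestamp, value), ...] with the newest first
        self._cells = defaultdict(list)

    def put(self, row, column, value, timestamp):
        versions = self._cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0], reverse=True)

    def get(self, row, column, timestamp=None):
        """Newest value at or before timestamp (newest overall if None)."""
        for ts, value in self._cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def row(self, row):
        """Latest value of every column present in this row (rows are sparse)."""
        return {
            column: versions[0][1]
            for (row_key, column), versions in self._cells.items()
            if row_key == row
        }


store = CellStore()
store.put("112334-7782", "dimensions:server", "host1", timestamp=1)
store.put("112334-7782", "dimensions:dc", "PUNE", timestamp=1)
store.put("112334-7782", "metric:value", 20, timestamp=1)
store.put("112334-7782", "metric:value", 25, timestamp=2)  # hypothetical newer version
store.put("112334-7783", "dimensions:server", "host2", timestamp=1)
store.put("112334-7783", "metric:value", 10, timestamp=1)
```

Row `112334-7783` simply has no `dimensions:dc` entry, so nothing is stored for it; that is what "sparse" means here, in contrast to a null column in a relational row.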
![Page 31: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/31.jpg)
HBase Architecture
![Page 32: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/32.jpg)
Key Value Stores
![Page 33: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/33.jpg)
Key Value Stores
- Simplest possible data model
  - Caching a user’s personalized, rendered page to avoid hitting the DB
  - S3 bucket storage for blob data against a unique id
- Range of KV stores
  - Distributed, scalable, persistent key-value storage: Dynamo, Voldemort
    - Auto-partitioned key space
    - Replicated KV
    - Highly available
  - Largely in-memory KV stores: Redis, memcached
    - Redis is blazing fast for caching and other interesting operations

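
An auto-partitioned, replicated key space in the Dynamo style is commonly built on consistent hashing. The sketch below shows that technique under stated assumptions: the class `HashRing`, the node names, the use of SHA-256, 64 virtual points per node and a replication factor of 2 are all illustrative choices, not details taken from Dynamo or Voldemort.

```python
import bisect
import hashlib


def _hash(value):
    """Stable integer hash of a string (Python's built-in hash() is salted per run)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual points and N-way replication."""

    def __init__(self, nodes, replicas=2, vnodes=64):
        self.replicas = replicas
        # Each node owns many points on the ring to even out the key distribution.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def nodes_for(self, key):
        """The first `replicas` distinct nodes clockwise from the key's position."""
        start = bisect.bisect(self._points, _hash(key))
        owners = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == self.replicas:
                    break
        return owners


ring = HashRing(["node-a", "node-b", "node-c"])
```

The first node returned acts as the primary for a key and the rest hold replicas. When a node leaves, only the keys it owned move to a neighbour; every other key keeps its primary, which is what makes adding and removing nodes cheap.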
![Page 34: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/34.jpg)
Redis
- In-memory KV store
  - Blazing fast: 100K R/W operations/sec
  - Async snapshot to disk
- More than a KV store: a data structure store
  - Supports lists, queues, sets and operations on them
  - Sorted set range operations
  - Set operations: UNION, INTERSECTION, DIFF

![Page 35: Big Data and the growing relevance of NoSQL](https://reader035.vdocuments.net/reader035/viewer/2022081413/5496f47aac7959222e8b523f/html5/thumbnails/35.jpg)
Redis – Use Cases
- Web session caching, with EXPIRE set for session expiry
- Live real-time bit.ly URL stats such as clicks: fast increments of counters
- Auto-complete: typing the first few characters maps to a sorted set, and a range query is fired
- Publish/subscribe: fan out a message to subscribers
- Set operations: my Twitter <Followees DIFF Followers> tells me who I follow but who doesn't follow me back
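
Three of these use cases can be mimicked with plain Python structures to show the operations involved. This is a stand-in and not a Redis client: the data, key names and the `autocomplete` helper are hypothetical. In Redis the corresponding operations would be a counter increment (`INCR`), a range query over a sorted set, and set commands such as `SDIFF` and `SINTER`.

```python
import bisect
from collections import Counter

# 1. Click counters: one atomic-style increment per click on a short URL.
clicks = Counter()
for _ in range(3):
    clicks["short-url-1"] += 1

# 2. Auto-complete: keep terms sorted, then range-scan from the typed prefix.
TERMS = sorted(["redis", "redshift", "regex", "hbase", "hadoop"])


def autocomplete(sorted_terms, prefix, limit=5):
    """Up to `limit` terms starting with prefix, found by binary search plus scan."""
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(matches) == limit:
            break
        matches.append(term)
    return matches


# 3. Set operations on follower data.
followers = {"sameer", "priya"}  # people who follow me
followees = {"sameer", "ravi"}   # people I follow

not_following_back = followees - followers  # set difference
mutual = followees & followers              # set intersection
```

The difference `followees - followers` gives the people I follow who do not follow me back, while the intersection gives mutual follows; the two are easy to confuse, so it is worth checking which set operation a given question actually needs.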