nosql and the big data hullabaloo

A Practical Look at the NOSQL and Big Data Hullabaloo Level: Intermediate Andrew J. Brust CEO and Founder Blue Badge Insights Sam Bisbee Senior Doing Stuff Person Cloudant (In Absentia)

Upload: andrew-brust

Post on 24-Dec-2014


DESCRIPTION

Presentation given at The Microsoft Business Intelligence + Big Data User Group of NYC, January 14, 2013.

TRANSCRIPT

Page 1: NoSQL and The Big Data Hullabaloo

A Practical Look at the NOSQL and Big Data Hullabaloo

Level: Intermediate

Andrew J. Brust
CEO and Founder
Blue Badge Insights

Sam Bisbee
Senior Doing Stuff Person
Cloudant (In Absentia)

Page 2: NoSQL and The Big Data Hullabaloo

Meet Andrew

• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair VSLive! and 17 years as a speaker
• Founder, Microsoft BI User Group of NYC
  – http://www.msbinyc.com
• Co-moderator, NYC .NET Developers Group
  – http://www.nycdotnetdev.com
• “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
• brustblog.com, Twitter: @andrewbrust

Page 3: NoSQL and The Big Data Hullabaloo

My New Blog (bit.ly/bigondata)

Page 4: NoSQL and The Big Data Hullabaloo

Read all about it!

Page 5: NoSQL and The Big Data Hullabaloo

Meet Sam

• Wait…you can’t. He’s not here.
• Sam Bisbee
  – Director of Technical Business Development, Cloudant
  – He prefers “Senior Doing Stuff Person”
    Which is ironic
• I’ve preserved a few of his slides.
• Look for: From Sam in upper-right-hand corner

Page 6: NoSQL and The Big Data Hullabaloo

Agenda

• Why NoSQL?
• NoSQL Definition(s)
• Concepts
• NoSQL Categories
• Provisioning, market, applicability
• Take-aways

Page 8: NoSQL and The Big Data Hullabaloo

NoSQL Data Fodder

• Addresses
• Preferences
• Notes
• Friends, Followers
• Documents

Page 9: NoSQL and The Big Data Hullabaloo

“Web Scale”

• This is the term used to justify NoSQL
• Scenario is simple needs but “made up for in volume”
  – Millions of concurrent users
• Think of sites like Amazon or Google
• Think of non-transactional tasks like loading catalog data to display a product page, or environment preferences

Page 10: NoSQL and The Big Data Hullabaloo

NOSQL DEFINITION(S)

Page 11: NoSQL and The Big Data Hullabaloo

What is NOSQL?

• “Not Only SQL” - this is not a holy war

• 1870: Modern study of set theory begins

• 1970: Codd writes “A Relational Model of Data for Large Shared Data Banks”

• 1970 – 1980: Commercial implementations of Codd's theory are released

From Sam

Page 12: NoSQL and The Big Data Hullabaloo

What is NOSQL?

• 1970 - ~2000: the same sorts of databases were made (plus a few niche products)

• Dot-Com Bubble forced the same data tier problems but at a new scale (Amazon), forcing innovation out of necessity

• 2000 – present: innovations are becoming open source and “mainstream” (Hadoop)

From Sam

Page 13: NoSQL and The Big Data Hullabaloo

So What is NOSQL Really?

New ways of looking at dynamic data storage and querying for larger scale systems.

(scale = concurrent users and data size)

From Sam

Page 14: NoSQL and The Big Data Hullabaloo

NoSQL Common Traits

• Non-relational
• Non-schematized/schema-free
• Open source
• Distributed
• Eventual consistency
• “Web scale”
• Developed at big Internet companies

Page 15: NoSQL and The Big Data Hullabaloo

CONCEPTS

Page 16: NoSQL and The Big Data Hullabaloo

Consistency

• CAP Theorem
  – A distributed database can fully provide at most two of the following three attributes: consistency, availability and partition tolerance
• NoSQL databases typically do not offer “ACID” guarantees
  – Atomicity, consistency, isolation and durability
• Instead they offer “eventual consistency”
  – Similar to DNS propagation

Page 17: NoSQL and The Big Data Hullabaloo

Consistency

• Things like inventory, account balances should be consistent
  – Imagine updating a server in Seattle that stock was depleted
  – Imagine not updating the server in NY
  – Customer in NY goes to order 50 pieces of the item
  – Order processed even though no stock
• Things like catalog information don’t have to be, at least not immediately
  – If a new item is entered into the catalog, it’s OK for some customers to see it even before the other customers’ server knows about it
• But catalog info must come up quickly
  – Therefore don’t lock data in one location while waiting to update the other
• Therefore, OK to sacrifice consistency for speed, in some cases
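The Seattle/New York scenario can be sketched in a few lines of Python. This is only an illustration of the idea: the `Replica` class, its `pending` queue and the manual `replicate_to` step are hypothetical names invented here, and no particular database works exactly this way.

```python
class Replica:
    """A hypothetical single-site copy of the inventory data."""

    def __init__(self, name, stock):
        self.name = name
        self.stock = dict(stock)
        self.pending = []  # updates not yet sent to the other site

    def write(self, item, quantity):
        # Apply locally right away; propagate later (no cross-site lock).
        self.stock[item] = quantity
        self.pending.append((item, quantity))

    def read(self, item):
        return self.stock[item]

    def replicate_to(self, other):
        # Ship the queued updates; after this the two sites agree.
        for item, quantity in self.pending:
            other.stock[item] = quantity
        self.pending.clear()


seattle = Replica("Seattle", {"widget": 50})
new_york = Replica("New York", {"widget": 50})

seattle.write("widget", 0)           # stock depleted in Seattle
stale = new_york.read("widget")      # New York has not heard yet
order_accepted = stale >= 50         # an order for 50 pieces goes through

seattle.replicate_to(new_york)       # replication eventually catches up
fresh = new_york.read("widget")
```

The stale read is harmless for catalog data and harmful for inventory, which is the trade-off the slide describes.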

Page 18: NoSQL and The Big Data Hullabaloo

CAP Theorem

Consistency

Availability

Partition Tolerance

Relational

NoSQL

Page 19: NoSQL and The Big Data Hullabaloo

Indexing

• Most NoSQL databases are indexed by key
• Some allow so-called “secondary” indexes
• Often the primary key indexes are clustered
• HBase uses HDFS (the Hadoop Distributed File System), which is append-only
  – Writes are logged
  – Logged writes are batched
  – File is re-created and sorted
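The log, batch and re-sort sequence can be illustrated with a toy write path. This is a simplified sketch under stated assumptions, not HBase's actual implementation: `LogStructuredTable`, `flush_threshold` and the in-memory lists standing in for files are all hypothetical.

```python
class LogStructuredTable:
    """Toy write path: log each write, buffer it, then rewrite a sorted file."""

    def __init__(self, flush_threshold=3):
        self.write_log = []      # append-only record of every write
        self.buffer = {}         # batched writes held in memory
        self.sorted_file = []    # key-sorted (key, value) pairs, replaced on flush
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.write_log.append((key, value))   # 1. writes are logged
        self.buffer[key] = value              # 2. logged writes are batched
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. the file is re-created and sorted: merge old contents with the batch
        merged = dict(self.sorted_file)
        merged.update(self.buffer)
        self.sorted_file = sorted(merged.items())
        self.buffer.clear()

    def get(self, key):
        # Check the in-memory batch first, then the sorted file.
        if key in self.buffer:
            return self.buffer[key]
        return dict(self.sorted_file).get(key)


table = LogStructuredTable(flush_threshold=3)
for key, value in [("row3", "c"), ("row1", "a"), ("row2", "b"), ("row0", "z")]:
    table.put(key, value)
```

Nothing is ever updated in place: the third write triggers a flush that produces a new sorted file, and the fourth write waits in the buffer for the next batch.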

Page 20: NoSQL and The Big Data Hullabaloo

Queries

• Typically no query language
• Instead, create procedural program
• Sometimes SQL is supported
• Sometimes MapReduce code is used…
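As a rough sketch of what "create procedural program" means in practice, the Python below does by hand what a SQL join plus WHERE clause would do. The sample rows echo the Customers and Orders data used later in the deck; splitting the price into `Price` and `Currency` fields and the function name are assumptions made for this illustration.

```python
customers = {
    101: {"First_Name": "Andrew", "Last_Name": "Brust", "Last_Order": 1501},
    202: {"First_Name": "Jane", "Last_Name": "Doe", "Last_Order": 1502},
}
orders = {
    1501: {"Price": 300, "Currency": "USD"},
    1502: {"Price": 2500, "Currency": "GBP"},
}


def customers_with_last_order_over(threshold):
    """Procedural stand-in for a join plus WHERE clause.

    Compares the raw price only; currency conversion is ignored here.
    """
    result = []
    for customer_id, customer in customers.items():
        order = orders.get(customer["Last_Order"])   # the "join", done by key lookup
        if order is not None and order["Price"] > threshold:
            result.append((customer_id, customer["Last_Name"]))
    return result


big_spenders = customers_with_last_order_over(1000)
```

Every new question about the data needs another function like this one, which is the productivity cost the deck returns to under "Compromises".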

Page 21: NoSQL and The Big Data Hullabaloo

MapReduce

• Map step: pre-processes data
• Reduce step: summarizes/aggregates data
• Most typical of Hadoop and used with Wide Column Stores, esp. HBase
• Amazon Web Services’ Elastic MapReduce (EMR) can read/write DynamoDB, S3, Relational Database Service (RDS)
• “Hive” offers a HiveQL (SQL-like) abstraction over MR
  – Use with Hive tables
  – Use with HBase
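The map and reduce steps can be shown in miniature with plain Python running in a single process. This is a sketch of the pattern only, not Hadoop's API: the function names are hypothetical, and order 1503 is an invented record added so that one item appears twice.

```python
from collections import defaultdict


def map_step(order):
    """Map: emit one (key, value) pair per item in an order."""
    for item_id in order["items"]:
        yield item_id, 1


def reduce_step(key, values):
    """Reduce: aggregate all values emitted for one key."""
    return key, sum(values)


def map_reduce(records):
    groups = defaultdict(list)
    for record in records:                      # map phase
        for key, value in map_step(record):
            groups[key].append(value)           # shuffle: group values by key
    return dict(reduce_step(k, v) for k, v in groups.items())   # reduce phase


orders = [
    {"id": 1501, "items": [52134, 24457]},
    {"id": 1502, "items": [98456, 59428]},
    {"id": 1503, "items": [52134]},             # hypothetical extra order
]
item_counts = map_reduce(orders)
```

In a real cluster the map calls run on many machines in parallel and the grouping step moves data across the network; the logic per record stays this simple.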

Page 22: NoSQL and The Big Data Hullabaloo

Sharding

• A partitioning pattern where separate servers store partitions
• Fan-out queries supported
• Partitions may be duplicated, so replication also provided
  – Good for disaster recovery
• Since “shards” can be geographically distributed, sharding can act like a CDN
• Good for keeping data close to processing
  – Reduces network traffic when MapReduce splitting takes place
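A minimal sketch of key-based sharding with a fan-out query follows, assuming hash partitioning (one of several possible schemes). The `ShardedStore` class is hypothetical, each "server" is just a dictionary, and the `City` values are invented sample data.

```python
import zlib


class ShardedStore:
    """Toy sharded store: each shard stands in for a separate server."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def _shard_for(self, key):
        # Stable hash, so the same key always lands on the same shard.
        index = zlib.crc32(str(key).encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def put(self, key, row):
        self._shard_for(key)[key] = row

    def get(self, key):
        # A lookup by key touches exactly one shard.
        return self._shard_for(key).get(key)

    def fan_out(self, predicate):
        # A query on anything other than the key must visit every shard
        # and merge the partial results.
        matches = []
        for shard in self.shards:
            matches.extend(key for key, row in shard.items() if predicate(row))
        return sorted(matches)


store = ShardedStore(shard_count=3)
store.put(101, {"Last_Name": "Brust", "City": "New York"})
store.put(202, {"Last_Name": "Doe", "City": "Seattle"})
store.put(303, {"Last_Name": "Smith", "City": "New York"})
new_yorkers = store.fan_out(lambda row: row["City"] == "New York")
```

Replication, which the slide also mentions, would mean writing each row to more than one shard; it is left out to keep the sketch short.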

Page 23: NoSQL and The Big Data Hullabaloo

NOSQL CATEGORIES

Page 24: NoSQL and The Big Data Hullabaloo

Key-Value Stores

• The most common; not necessarily the most popular
• Has rows, each with something like a big dictionary/associative array
  – Schema may differ from row to row
• Common on cloud platforms
  – e.g. Amazon SimpleDB, Azure Table Storage
• MemcacheDB, Voldemort, Couchbase
• DynamoDB (AWS), Dynomite, Redis and Riak

Page 25: NoSQL and The Big Data Hullabaloo

Key-Value Stores

Database

Table: Customers
  Row ID: 101
    First_Name: Andrew
    Last_Name: Brust
    Address: 123 Main Street
    Last_Order: 1501
  Row ID: 202
    First_Name: Jane
    Last_Name: Doe
    Address: 321 Elm Street
    Last_Order: 1502

Table: Orders
  Row ID: 1501
    Price: 300 USD
    Item1: 52134
    Item2: 24457
  Row ID: 1502
    Price: 2500 GBP
    Item1: 98456
    Item2: 59428
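The Customers table in this example maps naturally onto nested dictionaries, which also shows how the schema can differ from row to row. This is a sketch only: `Loyalty_Tier` is a hypothetical property added to one row to make the point, and `get_property` is an invented helper.

```python
customers = {
    101: {"First_Name": "Andrew", "Last_Name": "Brust",
          "Address": "123 Main Street", "Last_Order": 1501},
    # Hypothetical extra property: rows need not share a schema.
    202: {"First_Name": "Jane", "Last_Name": "Doe",
          "Address": "321 Elm Street", "Last_Order": 1502,
          "Loyalty_Tier": "Gold"},
}


def get_property(table, row_id, name, default=None):
    """Look up a row by key, then a property by name; both may be absent."""
    row = table.get(row_id, {})
    return row.get(name, default)


tier_101 = get_property(customers, 101, "Loyalty_Tier")   # row lacks the property
tier_202 = get_property(customers, 202, "Loyalty_Tier")
```

Because nothing enforces a schema, the calling code has to cope with missing properties itself.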

Page 26: NoSQL and The Big Data Hullabaloo

Wide Column Stores

• Has tables with declared column families
  – Each column family has “columns” which are KV pairs that can vary from row to row
• These are the most foundational for large sites
  – BigTable (Google)
  – HBase (Originally part of Yahoo-dominated Hadoop project)
  – Cassandra (Facebook)
    Calls column families “super columns” and tables “super column families”
• They are the most “Big Data”-ready
  – Especially HBase + Hadoop

Page 27: NoSQL and The Big Data Hullabaloo

Wide Column Stores

Table: Customers
  Row ID: 101
    Super Column: Name
      Column: First_Name: Andrew
      Column: Last_Name: Brust
    Super Column: Address
      Column: Number: 123
      Column: Street: Main Street
    Super Column: Orders
      Column: Last_Order: 1501
  Row ID: 202
    Super Column: Name
      Column: First_Name: Jane
      Column: Last_Name: Doe
    Super Column: Address
      Column: Number: 321
      Column: Street: Elm Street
    Super Column: Orders
      Column: Last_Order: 1502

Table: Orders
  Row ID: 1501
    Super Column: Pricing
      Column: Price: 300 USD
    Super Column: Items
      Column: Item1: 52134
      Column: Item2: 24457
  Row ID: 1502
    Super Column: Pricing
      Column: Price: 2500 GBP
    Super Column: Items
      Column: Item1: 98456
      Column: Item2: 59428
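One way to picture the difference from a plain key-value store is that column families are declared up front while the columns inside them stay free-form. The sketch below models that with nested dictionaries; `DECLARED_FAMILIES`, `put_cell` and the `Zip` column are hypothetical and do not correspond to any product's API.

```python
# Column families are fixed when the table is defined.
DECLARED_FAMILIES = {"Name", "Address", "Orders"}

customers = {
    101: {
        "Name": {"First_Name": "Andrew", "Last_Name": "Brust"},
        "Address": {"Number": 123, "Street": "Main Street"},
        "Orders": {"Last_Order": 1501},
    },
}


def put_cell(table, row_id, family, column, value):
    """Write one cell; the family must be declared, the column need not be."""
    if family not in DECLARED_FAMILIES:
        raise KeyError(f"undeclared column family: {family}")
    table.setdefault(row_id, {}).setdefault(family, {})[column] = value


# A new column inside an existing family is fine (hypothetical "Zip" column).
put_cell(customers, 101, "Address", "Zip", "10014")
# A brand-new row can carry a different set of columns.
put_cell(customers, 202, "Name", "First_Name", "Jane")
```

Trying to write into a family that was never declared raises an error, which is the one piece of schema this kind of store does enforce in the sketch.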

Page 28: NoSQL and The Big Data Hullabaloo

Wide Column Stores

Page 29: NoSQL and The Big Data Hullabaloo

Document Stores

• Have “databases,” which are akin to tables
• Have “documents,” akin to rows
  – Documents are typically JSON objects
  – Each document has properties and values
  – Values can be scalars, arrays, links to documents in other databases or sub-documents (i.e. contained JSON objects; allows for hierarchical storage)
  – Can have attachments as well
• Old versions are retained
  – So Doc Stores work well for content management
• Some view doc stores as specialized KV stores
• Most popular with developers, startups, VCs
• The biggies:
  – CouchDB
    – Derivatives
  – MongoDB

Page 30: NoSQL and The Big Data Hullabaloo

Document Store Application Orientation

• Documents can each be addressed by URIs
• CouchDB supports full REST interface
• Very geared towards JavaScript and JSON
  – Documents are JSON objects
  – CouchDB/MongoDB use JavaScript as native language
• In CouchDB, “view functions” also have unique URIs and they return HTML
  – So you can build entire applications in the database

Page 31: NoSQL and The Big Data Hullabaloo

Document Stores

Database: Customers
  Document ID: 101
    First_Name: Andrew
    Last_Name: Brust
    Address:
      Number: 123
      Street: Main Street
    Orders:
      Most_recent: 1501
  Document ID: 202
    First_Name: Jane
    Last_Name: Doe
    Address:
      Number: 321
      Street: Elm Street
    Orders:
      Most_recent: 1502

Database: Orders
  Document ID: 1501
    Price: 300 USD
    Item1: 52134
    Item2: 24457
  Document ID: 1502
    Price: 2500 GBP
    Item1: 98456
    Item2: 59428
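The same customer can be written as a JSON document with sub-documents and a link into the Orders database. The Python below is a sketch using only the standard `json` module; the `databases` dictionary and the `most_recent_order` helper are hypothetical stand-ins for a real document store and its client.

```python
import json

databases = {
    "Customers": {
        "101": {
            "First_Name": "Andrew",
            "Last_Name": "Brust",
            "Address": {"Number": 123, "Street": "Main Street"},   # sub-document
            "Orders": {"Most_recent": "1501"},                     # link by document ID
        },
    },
    "Orders": {
        "1501": {"Price": "300 USD", "Item1": 52134, "Item2": 24457},
    },
}


def most_recent_order(customer_id):
    """Follow the link from a customer document to a document in Orders."""
    customer = databases["Customers"][customer_id]
    order_id = customer["Orders"]["Most_recent"]
    return databases["Orders"][order_id]


# Documents travel as JSON text and come back with their hierarchy intact.
serialized = json.dumps(databases["Customers"]["101"], sort_keys=True)
round_tripped = json.loads(serialized)
linked_order = most_recent_order("101")
```

The address lives inside the customer document rather than in a separate table, so reading a customer needs no join; following the order link is still a second lookup.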

Page 32: NoSQL and The Big Data Hullabaloo

Document Stores

Page 33: NoSQL and The Big Data Hullabaloo

Graph Databases

• Great for social network applications and others where relationships are important

• Nodes and edges
  – Edge like a join
  – Nodes like rows in a table

• Nodes can also have properties and values

• Neo4j is a popular graph db
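Nodes, edges and a simple traversal can be sketched without any graph database at all. The Python below is an illustration only: the specific edges between people are assumptions chosen for the example, and the helper names are hypothetical rather than any product's query API.

```python
# Nodes carry properties, much like rows in a table.
nodes = {
    "Andrew Brust": {"kind": "person"},
    "Jane Doe": {"kind": "person"},
    "Joe Smith": {"kind": "person"},
    "Order 252": {"kind": "order", "Total Price": "300 USD"},
}

# Edges are (source, label, destination) triples; these are assumed for the example.
edges = [
    ("Andrew Brust", "Friend of", "Jane Doe"),
    ("Jane Doe", "Friend of", "Joe Smith"),
    ("Andrew Brust", "Placed order", "Order 252"),
]


def neighbors(node, label):
    """Follow every outgoing edge with the given label (the 'join')."""
    return [dst for src, edge_label, dst in edges
            if src == node and edge_label == label]


def friends_of_friends(node):
    """Two-hop traversal: friends of this node's friends, excluding the node itself."""
    found = []
    for friend in neighbors(node, "Friend of"):
        for candidate in neighbors(friend, "Friend of"):
            if candidate != node and candidate not in found:
                found.append(candidate)
    return found


suggestions = friends_of_friends("Andrew Brust")
andrews_orders = neighbors("Andrew Brust", "Placed order")
```

In a relational schema the two-hop query would be a self-join on a friendship table; in a graph model it is a walk along edges, which is why relationship-heavy applications favor it.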

Page 34: NoSQL and The Big Data Hullabaloo

Graph Databases

Database (graph diagram)

Nodes:
• Joe Smith
• Jane Doe
• Andrew Brust
• George Washington
• Street: 123 Main Street, City: New York, State: NY, Zip: 10014
• ID: 252, Total Price: 300 USD
• ID: 52134, Type: Dress, Color: Blue
• ID: 24457, Type: Shirt, Color: Red

Edge labels:
• Sent invitation to
• Commented on photo by
• Friend of
• Address
• Placed order
• Item1
• Item2

Page 35: NoSQL and The Big Data Hullabaloo

PROVISIONING, MARKET, APPLICABILITY

Page 36: NoSQL and The Big Data Hullabaloo

NoSQL on Windows Azure

• Platform as a Service
  – Cloudant: https://cloudant.com/azure/
  – MongoDB (via MongoLab): http://blog.mongolab.com/2012/10/azure/
• MongoDB, DIY:
  – On an Azure Worker Role: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+Worker+Roles
  – On a Windows VM: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Windows+Installer
  – On a Linux VM:
    http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Linux+Tutorial
    http://www.windowsazure.com/en-us/manage/linux/common-tasks/mongodb-on-a-linux-vm/

Page 38: NoSQL and The Big Data Hullabaloo

The High-Level Shake Out

• Hadoop will continue to crush data warehousing

• MongoDB will be the top MySQL / on-prem alternative

• Cloudant will be the top as-a-Service / Cloud database

• Basho [Riak] is pivoting toward cloud object store

From Sam

Page 39: NoSQL and The Big Data Hullabaloo

NoSQL + BI

• NoSQL databases are bad for ad hoc query and data warehousing

• BI applications involve models; models rely on schema

• Extract, transform and load (ETL) may be your friend

• Wide-column stores, however, are good for “Big Data”
  – See next slide

• Wide-column stores and column-oriented databases are similar technologically
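The ETL point can be made concrete with a small transform that flattens schema-free documents into fixed-schema rows a BI model could consume. This is a sketch under stated assumptions: the target columns, the `transform` and `etl` names, and the second sample document (which deliberately lacks an address) are all invented for the illustration.

```python
# The fixed schema the BI model expects (an assumption for this example).
COLUMNS = ["id", "first_name", "last_name", "street"]


def transform(doc_id, doc):
    """Flatten one document into a row; missing properties become None."""
    address = doc.get("Address", {})
    return {
        "id": doc_id,
        "first_name": doc.get("First_Name"),
        "last_name": doc.get("Last_Name"),
        "street": address.get("Street"),
    }


def etl(documents):
    """Extract every document, transform it, and return rows ready to load."""
    return [transform(doc_id, doc) for doc_id, doc in sorted(documents.items())]


documents = {
    101: {"First_Name": "Andrew", "Last_Name": "Brust",
          "Address": {"Number": 123, "Street": "Main Street"}},
    202: {"First_Name": "Jane", "Last_Name": "Doe"},   # no Address sub-document
}
rows = etl(documents)
```

Every row comes out with the same columns regardless of what the source document contained, which is what a schema-dependent model needs.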

Page 40: NoSQL and The Big Data Hullabaloo

NoSQL + Big Data

• Big Data and NoSQL are interrelated
• Typically, Wide-Column stores used in Big Data scenarios
• Prime example:
  – HBase and Hadoop
• Why?
  – Lack of indexing not a problem
  – Consistency not an issue
  – Fast reads very important
  – Distributed file systems important too
  – Commodity hardware and disk assumptions also important
  – Not Web scale but massive scale-out, so similar concerns

Page 41: NoSQL and The Big Data Hullabaloo

TAKE-AWAYS

Page 42: NoSQL and The Big Data Hullabaloo

Compromises

• Eventual consistency
• Write buffering
• Only primary keys can be indexed
• Queries must be written as programs
• Tooling
  – Productivity (= money)

Page 43: NoSQL and The Big Data Hullabaloo

Summing Up

• Line of Business -> Relational
• Large, public (consumer)-facing sites -> NoSQL

• Complex data structures -> Relational
• Big Data -> NoSQL

• Transactional -> Relational
• Content Management -> NoSQL

• Enterprise -> Relational
• Consumer Web -> NoSQL

Page 44: NoSQL and The Big Data Hullabaloo

Thank you

• [email protected]
• @andrewbrust on Twitter
• Want to get on Blue Badge Insights’ list? Text “bluebadge” to 22828