an introduction to big data, nosql and mongodb

89
1 Open source, high performance database Introduction to NoSQL and MongoDB Will LaForest Senior Director of 10gen Federal [email protected] @WLaForest

Upload: william-laforest

Post on 12-Nov-2014

8.088 views

Category:

Technology


6 download

DESCRIPTION

 

TRANSCRIPT

Page 1: An Introduction to Big Data, NoSQL and MongoDB

1

Open source, high performance database

Introduction to NoSQL and MongoDBWill LaForestSenior Director of 10gen [email protected]@WLaForest

Page 2: An Introduction to Big Data, NoSQL and MongoDB

2

Dawn of Databases to Present

Brewer’s CapbornWWW

born

10genfounded

1965 1970 1975 1980 1985 1990 1995 2000 2005 2010

IBM’s IMS

Codd publishes relational model paper

in 1970

SQLinvented

Oraclefounded

PC’s gaintraction

Client Server

DynamicWeb Content

3 tierarchitecture

Web applications

SOA

CloudComputing

released

NoSQLMovement

BigTable

Page 3: An Introduction to Big Data, NoSQL and MongoDB

3

Relational Databases

Attribute

Relation

Tuple

Page 4: An Introduction to Big Data, NoSQL and MongoDB

4

Relational Databases

Page 5: An Introduction to Big Data, NoSQL and MongoDB

5

• Data stored in a RDBMS is very compact (disk was more expensive)

• SQL and RDBMS made queries flexible with rigid schemas

• Rigid schemas helps optimize joins and storage• Massive ecosystem of tools, libraries and integrations• Been around 40 years!

Relational Database Strengths

Page 6: An Introduction to Big Data, NoSQL and MongoDB

6

• Gartner uses the 3Vs to define• Volume - Big/difficult/extreme volume is relative• Variety

– Changing or evolving data– Uncontrolled formats– Does not easily adhere to a single schema– Unknown at design time

• Velocity– High or volatile inbound data– High query and read operations– Low latency

Enter Big Data

Page 7: An Introduction to Big Data, NoSQL and MongoDB

7

RDBMS Challenges

DATA VARIETY &VOLATILITY• Extremely difficult to

find a single fixed schema

• Don’t know data schema a-priori TRANSACTIONAL MODEL

• N x Inserts or updates• Distributed transactions

• Systems scaling horizontally, not vertically

• Commodity servers• Cloud Computing

VOLUME & NEW ARCHITECTURES

Page 8: An Introduction to Big Data, NoSQL and MongoDB

8

• Non-relational been hanging around (MUMPS?)• Modern NoSQL theory and offerings started in early

2000s• Modern usage of term introduced in 2009• NoSQL = Not Only SQL• A collection of very different products• Alternatives to relational databases when they are a bad

fit• Motives

– Horizontally scalable (commodity server/cloud computing)– Flexibility

Enter NoSQL

Page 9: An Introduction to Big Data, NoSQL and MongoDB

9

• Value (Data) mapped to a key (think primary)• Some typed some just BLOBs• Redis, MemcachDB, Voldemort

NoSQL Databases – Key-Value

• “@WLaForest”“Will”• 12“Chris”• BLOB“Robert”

Page 10: An Introduction to Big Data, NoSQL and MongoDB

10

• Data stored on disk in a column oriented fashion• Predominantly hash based indexing• Data partitioned by range or consistent hashing• Google BigTable, HBase, Cassandra, Accumulo

NoSQL Databases – Big Table Descendents

Page 11: An Introduction to Big Data, NoSQL and MongoDB

11

• Key-Value Stores– Key-Value Stores– Value (Data) mapped to a key (think primary)

• Big Table Descendants– Looks like a distributed multi-dimension map– Data stored in a column oriented fashion– Predominantly hash based indexing

• Document oriented stores– Data stored as either JSON or XML documents

• What do they all have in common?

NoSQL Databases – Key-Value

Page 12: An Introduction to Big Data, NoSQL and MongoDB

12

• Not a database• Map reduce on HDFS or other data source• When you want to use it

– Can’t use a index – Distributing custom algorithms– ETL

• Great for grinding through data• Many NoSQL offerings have native map reduce

functionality

Hadoop

Page 13: An Introduction to Big Data, NoSQL and MongoDB

13

MongoDB & 10gen

Page 14: An Introduction to Big Data, NoSQL and MongoDB

14

• 2007– Eliot Horowitz & Dwight Merriman tired of reinventing the

wheel– 10gen founded– MongoDB Development begins

• 2009 – Initial release of MongoDB

• 73M+ in funding– Funded by Sequoia, NEA, Union Square Ventures, Flybridge

Capital

10gen & MongoDB

Page 15: An Introduction to Big Data, NoSQL and MongoDB

15

• Pre-production Subscriptions for MongoDB– Fixed cost = $30k– 6 month term, unlimited servers

• MongoDB Subscriptions– 32 month term, $4k per server– 24x7 support for development and production, 1 hour response time– Onboarding call and quarterly review

• Consulting– Cost is $300/hr

• Training– MongoDB Developer - $1500/student, 2 day course– MongoDB Administrator - $1500/student, 2 day course– MongoDB Essential - $2250/student, 3 day course

• MongoDB Monitoring Service

10gen Products

Page 16: An Introduction to Big Data, NoSQL and MongoDB

16

• 4 servers with 2 processors each• Total processors equals 8• Oracle Enterprise Edition

– $47,500 per processor plus maintenance– 8 x 47,500 = $380,000 + $76,000– Total Cost = $456,000

• MongoDB Subscription– $4,000 per server (no processor count)– No license fee or maintenance charge– 4 x 4,000 = $16,000– Total Cost = $16,000

• MongoDB is $440,000 less expensive

Cost Comparison

Page 17: An Introduction to Big Data, NoSQL and MongoDB

17

MongoDB is the leading NoSQL solution

Google Searches451 Group

“MongoDB increasing its dominance”

#2 on Indeed’s Fastest Growing Jobs Jaspersoft BigData Index

Demand for MongoDB, the document-oriented NoSQL database, saw the biggest spike with over 200% growth in 2011.

Page 18: An Introduction to Big Data, NoSQL and MongoDB

18

#2 ON INDEED’S FASTEST GROWING JOBS

MongoDB is the leading NoSQL solution

Page 19: An Introduction to Big Data, NoSQL and MongoDB

19

MongoDB is the leading NoSQL solution.

“MongoDB INCREASING ITS DOMINANCE”

Page 20: An Introduction to Big Data, NoSQL and MongoDB

20

• Scale horizontally over commodity hardware• RDBMSs great so keep what works

– Rich data models– Adhoc queries– Fully featured indexes

• What doesn’t distribute well?– Long running multi-row transactions– Joins– Both artifacts of the relational data model

• Do not homogenize programming interfaces• Local storage first class citizen for DB storage

Different Assumptions

Page 21: An Introduction to Big Data, NoSQL and MongoDB

21

• Data stored as documents (JSON)• Schema free• CRUD operations – (Create Read Update Delete)• Atomic document operations• Ad hoc Queries like SQL

– Equality– Regular expression searches– Ranges– Geospatial

• Secondary indexes• Sharding (sometimes called partitioning) for scalability• Replication – HA and read scalability

MongoDB Key Features

Page 22: An Introduction to Big Data, NoSQL and MongoDB

22

• MongoDB does not need any defined data schema.• Every document could have different data!

Document Oriented and Schema Free

{name: “jeff”, eyes: “blue”, height: 72, boss: “ben”}

{name: “brendan”, aliases: [“el diablo”]}

{name: “ben”, hat: ”yes”}

{name: “matt”, pizza: “DiGiorno”, height: 72, boss: 555.555.1212}

{name: “will”, eyes: “blue”, birthplace: “NY”, aliases: [“bill”, “la ciacco”], gender: ”???”, boss: ”ben”}

Page 23: An Introduction to Big Data, NoSQL and MongoDB

23

Seek = 5+ ms Read = really really fast

AuthorComment

Disk seeks and data locality

Post

Page 24: An Introduction to Big Data, NoSQL and MongoDB

24

Post

Author

CommentCommentCommentCommentComment

Disk seeks and data locality

Page 25: An Introduction to Big Data, NoSQL and MongoDB

25

RDBMS MongoDB

Database Database

Table Collection

Row Document

Terminology

Page 26: An Introduction to Big Data, NoSQL and MongoDB

26

• We are running instances on the order of:- 100B objects- 50TB storage- 50K qps per server- ~1200 servers

MongoDB scale

Page 27: An Introduction to Big Data, NoSQL and MongoDB

27

• SSL– between client and server– Intra-cluster communication

• Authorization at the database level– Read Only/Read+Write/Administrator

• Security Roadmap (tentative)– Pluggable authentication 2.4– Auditing 2.4– Cell level security 2.6– Common Criteria certification

MongoDB Security

Page 28: An Introduction to Big Data, NoSQL and MongoDB

28

Usage Examples

Page 29: An Introduction to Big Data, NoSQL and MongoDB

29

var p = { author: “roger”, date: new Date(), text: “Spirited Away”, tags: [“Tezuka”, “Manga”]}

> db.posts.save(p)

Documents

Page 30: An Introduction to Big Data, NoSQL and MongoDB

30

>db.posts.find()

{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"), author : "roger", date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)", text : "Spirited Away", tags : [ "Tezuka", "Manga" ] } Notes: - _id is unique, but can be anything you’d like

Querying

Page 31: An Introduction to Big Data, NoSQL and MongoDB

31

Secondary Indexes

Create index on any Field in Document

// 1 means ascending, -1 means descending

>db.posts.ensureIndex({author: 1})

>db.posts.find({author: 'roger'}) { _id : ObjectId("4c4ba5c0672c685e5e8aabf3"), author : "roger", ... }

Page 32: An Introduction to Big Data, NoSQL and MongoDB

32

• Conditional Operators – $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $size, $type– $lt, $lte, $gt, $gte

// find posts with any tags > db.posts.find( {tags: {$exists: true }} )

// find posts matching a regular expression> db.posts.find( {author: /^rog*/i } )

// count posts by author> db.posts.find( {author: ‘roger’} ).count()

Query Operations

Page 33: An Introduction to Big Data, NoSQL and MongoDB

33

• $set, $unset, $inc, $push, $pushAll, $pull, $pullAll, $bit

> comment = { author: “fred”, date: new Date(), text: “Best Movie Ever”}

> db.posts.update( { _id: “...” }, $push: {comments: comment} );

Atomic Operations

Page 34: An Introduction to Big Data, NoSQL and MongoDB

34

{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"), author : "roger", date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)", text : "Spirited Away", tags : [ "Tezuka", "Manga" ], comments : [{author : "Fred",date : "Sat Jul 24 2010 20:51:03 GMT-0700 (PDT)",text : "Best Movie Ever"} ]}

Nested Documents

Page 35: An Introduction to Big Data, NoSQL and MongoDB

35

// Index nested documents> db.posts.ensureIndex( “comments.author”:1 )db.posts.find({‘comments.author’:’Fred’})

// Index on tags> db.posts.ensureIndex( tags: 1)> db.posts.find( { tags: ’Manga’ } )

// geospatial index> db.posts.ensureIndex( “author.location”: “2d” )> db.posts.find( “author.location” : { $near : [22,42] } )

Secondary Indexes

Page 36: An Introduction to Big Data, NoSQL and MongoDB

36

Aggregates, Statistics, and Analytics

Page 37: An Introduction to Big Data, NoSQL and MongoDB

37

• Native Map/Reduce in JS in MongoDB– Distributes across the cluster with good data locality

• New aggregation framework– Declarative (no JS required)– Pipeline approach (like Unix ps -ax | tee processes.txt | more)

• Hadoop– Intersect the indexing of MongoDB with the brute force parallelization

of hadoop– Hadoop MongoDB connector

Computations & Aggregates Over MongoDB

Page 38: An Introduction to Big Data, NoSQL and MongoDB

38

Native MapReduce

Page 39: An Introduction to Big Data, NoSQL and MongoDB

39

Aggregation Framework

db.article.aggregate( { $project : { author : 1, tags : 1, }}, { $unwind : "$tags" }, { $group : { _id : “$tags”, authors : { $addToSet : "$author" } }});

{ title : “this is my title” , author : “bob” , posted : new Date () , pageViews : 5 , tags : [ “fun” , “good” , “fun” ] , comments : [ { author :“joe” , text : “this is cool” } , { author :“sam” , text : “this is bad” } ], other : { foo : 5 }}

$project $match $limit $skip

$unwind $group $sort

Page 40: An Introduction to Big Data, NoSQL and MongoDB

40

Aggregation Framework

Page 41: An Introduction to Big Data, NoSQL and MongoDB

41

Hadoop

2

1

3

2

1

3

MAP REDUCEInput data Intermediate data Output data

Page 42: An Introduction to Big Data, NoSQL and MongoDB

42

Hadoop with Database

2

1

3

2

1

3

Input data Intermediate data Output dataMAP REDUCE

Page 43: An Introduction to Big Data, NoSQL and MongoDB

43

Deployment & Scaling

Page 44: An Introduction to Big Data, NoSQL and MongoDB

44

Asynchronous Replication

Read

Write

Read

ReadDriv

er

Replica Sets

Primary

Secondary

Secondary

Page 45: An Introduction to Big Data, NoSQL and MongoDB

45

Read

ReadDriv

er

Replica Sets

Primary

Secondary

Secondary

Page 46: An Introduction to Big Data, NoSQL and MongoDB

46

Read

ReadDriv

er

Replica Sets

Primary

Secondary

PrimaryWrite

AutomaticLeader Election

Page 47: An Introduction to Big Data, NoSQL and MongoDB

47

Read

ReadDriv

er

Replica SetsAddition Details

Secondary

PrimaryWrite

ReadSecondary

Page 48: An Introduction to Big Data, NoSQL and MongoDB

48

Primary

Secondary

Secondary

Primary

Secondary

Secondary

Primary

Secondary

Secondary

Primary

Secondary

Secondary

Key Range0..30

Key Range31..60

Key Range61..90

Key Range91.. 100

MongoS MongoS MongoS Config

Config

Config

Sharding Additional Details

Clients

Page 49: An Introduction to Big Data, NoSQL and MongoDB

49

• Sharding Details• Replica Set Details• Consistency Details• Common Deployment Scenarios• Citations

Additional Slides

Page 50: An Introduction to Big Data, NoSQL and MongoDB

50

Replica Set Details

Page 51: An Introduction to Big Data, NoSQL and MongoDB

51

Read

Write

Driv

er

Read

Eventual Consistency

Primary

Secondary

Secondary

Page 52: An Introduction to Big Data, NoSQL and MongoDB

52

Read

Write

Driv

er

Strong Consistency

Primary

Secondary

Secondary

Page 53: An Introduction to Big Data, NoSQL and MongoDB

53

2. Read

1. Write

Driv

er

2. Read

Strong Consistency

1. Replicate

Primary

Secondary

Secondary

Page 54: An Introduction to Big Data, NoSQL and MongoDB

54

• Fire and forget• Wait for error • Wait for fsync• Wait for journal sync • Wait for replication

Durability

Page 55: An Introduction to Big Data, NoSQL and MongoDB

55

Driver Primary

write

apply in memory

Fire and forget

Page 56: An Introduction to Big Data, NoSQL and MongoDB

56

Driver Primary

getLastError apply in memory

write

Get last error

Page 57: An Introduction to Big Data, NoSQL and MongoDB

57

Driver Primary

getLastError apply in memory

write

j:trueWrite to journal

Wait for Journal Sync

Page 58: An Introduction to Big Data, NoSQL and MongoDB

58

Driver Primary

getLastError apply in memory

write

fsync:true

fsync

Wait for fsync

Page 59: An Introduction to Big Data, NoSQL and MongoDB

59

Driver Primary

getLastError apply in memory

write

w:2

Secondary

replicate

Wait for replication

Page 60: An Introduction to Big Data, NoSQL and MongoDB

60

Write Concern Options

Value Meaning

<n:integer> Replicate to N members of replica set

“majority” Replicate to a majority of replica set members

<m:modeName> Use cutom error mode name

Page 61: An Introduction to Big Data, NoSQL and MongoDB

61

Sharding Details

Page 62: An Introduction to Big Data, NoSQL and MongoDB

62

{ name: “Jared”, email: “[email protected]”,}{ name: “Scott”, email: “[email protected]”,}{ name: “Dan”, email: “[email protected]”,}

> db.runCommand( { shardcollection: “test.users”, key: { email: 1 }} )

Keys

Page 63: An Introduction to Big Data, NoSQL and MongoDB

63

-∞ +∞

Chunks

Page 65: An Introduction to Big Data, NoSQL and MongoDB

65

-∞ +∞

[email protected]

[email protected]

[email protected]

Split!

Chunks

Page 66: An Introduction to Big Data, NoSQL and MongoDB

66

-∞ +∞

[email protected]

[email protected]

[email protected]

Split!This is a chunk

This is a chunk

Chunks

Page 68: An Introduction to Big Data, NoSQL and MongoDB

68

-∞ +∞

[email protected]

[email protected]

[email protected]

Split!

Chunks

Page 69: An Introduction to Big Data, NoSQL and MongoDB

69

• Stored in the config serers• Cached in mongos• Used to route requests and keep cluster balanced

Chunks

-∞

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

+∞

11

1

1

Page 70: An Introduction to Big Data, NoSQL and MongoDB

70

Shard 1 Shard 2 Shard 3 Shard 4

5

9

1

6

10

2

7

11

3

8

12

4

17

21

13

18

22

14

19

23

15

20

24

16

29

33

25

30

34

26

31

35

27

32

36

28

41

45

37

42

46

38

43

47

39

44

48

40

mongos

balancerconfig

config

config

Chunks!

Balancing

Page 71: An Introduction to Big Data, NoSQL and MongoDB

71

mongos

balancerconfig

config

config

Shard 1 Shard 2 Shard 3 Shard 4

5

9

1

6

10

2

7

11

3

8

12

4

21 22 23 24 33 34 35 36 45 46 47 48

ImbalanceImbalance

Balancing

Page 72: An Introduction to Big Data, NoSQL and MongoDB

72

mongos

balancer

Move chunk 1 to Shard 2

config

config

config

Shard 1 Shard 2 Shard 3 Shard 4

5

9

1

6

10

2

7

11

3

8

12

4

21 22 23 24 33 34 35 36 45 46 47 48

Balancing

Page 73: An Introduction to Big Data, NoSQL and MongoDB

73

mongos

balancerconfig

config

config

Shard 1 Shard 2 Shard 3 Shard 4

5

9

1

6

10

2

7

11

3

8

12

4

21 22 23 24 33 34 35 36 45 46 47 48

Balancing

Page 74: An Introduction to Big Data, NoSQL and MongoDB

74

mongos

balancer

Chunks 1,2, and 3 have migrated

config

config

config

Shard 1 Shard 2 Shard 3 Shard 4

5

9

16

10

7

11

38

12

4

21 22 23 24 33 34 35 36 45 46 47 48

2

Balancing

Page 75: An Introduction to Big Data, NoSQL and MongoDB

75

Sharding Details

Page 76: An Introduction to Big Data, NoSQL and MongoDB

76

mongos

Shard 1 Shard 2 Shard 3

1

2

3

4

1.Query arrives at mongos

2.mongos routes query to a single shard

3.Shard returns results of query

4.Results returned to client

Routed Request

Page 77: An Introduction to Big Data, NoSQL and MongoDB

77

mongos

Shard 1 Shard 2 Shard 3

1

4

1.Query arrives at mongos

2.mongos broadcasts query to all shards

3.Each shard returns results for query

4.Results combined and returned to client

2 2

33

2

3

Scatter Gather

Page 78: An Introduction to Big Data, NoSQL and MongoDB

78

mongos

Shard 1 Shard 2 Shard 3

1

3

6

1.Query arrives at mongos

2.mongos broadcasts query to all shards

3.Each shard locally sorts results

4.Results returned to mongos

5.mongos merge sorts individual results

6.Combined sorted result returned to client

2 2

3 3

4 4

5

2

4

Distributed Merge Sort

Page 79: An Introduction to Big Data, NoSQL and MongoDB

79

Inserts Requires shard key

db.users.insert({ name: “Jared”, email: “[email protected]”})

Removes Routed db.users.delete({ email: “[email protected]”})

Scattered db.users.delete({name: “Jared”})

Updates Routed db.users.update( {email: “[email protected]”}, {$set: { state: “CA”}})

Scattered db.users.update( {state: “FZ”}, {$set:{ state: “CA”}} )

Writes

Page 80: An Introduction to Big Data, NoSQL and MongoDB

80

By Shard Key

Routed db.users.find( {email: “[email protected]”})

Sorted by shard key

Routed in order db.users.find().sort({email:-1})

Find by non shard key

Scatter Gather db.users.find({state:”CA”})

Sorted by non shard key

Distributed merge sort

db.users.find().sort({state:1})

Queries

Page 81: An Introduction to Big Data, NoSQL and MongoDB

81

Common Deployment Scenarios

Page 82: An Introduction to Big Data, NoSQL and MongoDB

82

Data Center

Typical

Primary Secondary Secondary

Page 83: An Introduction to Big Data, NoSQL and MongoDB

83

Data Center

backups

Backup Node

Primary Secondary Secondaryhidden=true

Page 84: An Introduction to Big Data, NoSQL and MongoDB

84

Active Data Center Standby Data Center

Disaster Recovery

SecondarySecondarypriority = 1

Primarypriority = 1

Page 85: An Introduction to Big Data, NoSQL and MongoDB

85

West Coast DC Central DC East Coast DC

Multi Data Center

Secondary Primarypriority = 1

Secondary

Page 86: An Introduction to Big Data, NoSQL and MongoDB

86

mongo sharding

Page 87: An Introduction to Big Data, NoSQL and MongoDB

87

• History of Database Management (http://bit.ly/w3r0dv)• EMC IDC Study (http://bit.ly/y1mJgJ)• Gartner & Big Data (http://bit.ly/xvRP3a)• SQL (http://en.wikipedia.org/wiki/SQL)

• Database Management Systems http://en.wikipedia.org/wiki/Dbms)• Dynamo: Amazon’s Highly Available Key-value Store

(http://bit.ly/A8F8oy)

• CAP Theorem (http://bit.ly/zvA6O6)• NoSQL Google File System and BigTable (http://oreil.ly/wOXliP)• NoSQL Movement whitepaper (http://bit.ly/A8RBuJ)

• Sample ERD diagram (http://bit.ly/xV30v)

Citations

Page 88: An Introduction to Big Data, NoSQL and MongoDB

88

Backup Slides

Page 89: An Introduction to Big Data, NoSQL and MongoDB

89

• Impossible for a distributed computer system to simultaneously provide all three of the following guarantees– Consistency - All nodes see the same data at the same

time.– Availability - A guarantee that every request receives a

response about whether it was successful or failed.– Partition tolerance - No set of failures less than total

network failure is allowed to cause the system to respond incorrectly

Brewer’s CAP Theorem