big data nosql 1017

Copyright © 2011 LOGTEL NoSQL (big data) Samuel Dratwa [email protected]

Upload: samuel-dratwa

Post on 23-Jan-2018

Category:

Data & Analytics


TRANSCRIPT

Page 1: Big Data NoSQL 1017

Copyright © 2011 LOGTEL

NoSQL (big data)

Samuel Dratwa

Page 2: Big Data NoSQL 1017

Agenda

Big Data / NoSQL – technology aspect

How big is BIG?

The motivation behind NoSQL

CAP theorem

Partitions / fragmentation

The different NoSQL models

Key Value

Column-Based

Document store

Big Table

Graph

The NoSQL way of thinking (using graphs)

Big Data - Applicative (what can we do with it)

Page 3: Big Data NoSQL 1017

It’s a hype (!)

Page 4: Big Data NoSQL 1017

Big Data Definition

No single standard definition…

“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…

Page 5: Big Data NoSQL 1017

NoSQL humor

http://geekandpoke.typepad.com/

Page 6: Big Data NoSQL 1017

10GB ? 10TB ? 10 PB ?

How big is BIG ?

Page 7: Big Data NoSQL 1017

The 4 V’s

Page 8: Big Data NoSQL 1017

Characteristics of Big Data:

1-Scale (Volume)

• Data Volume

• 44x increase from 2009 to 2020

• From 0.8 zettabytes (ZB) to 35 ZB

• Data volume is increasing exponentially

Exponential increase in collected/generated data

Page 9: Big Data NoSQL 1017

The 4 V’s

Page 10: Big Data NoSQL 1017

Characteristics of Big Data:

2-Complexity (Variety)

• Various formats, types, and structures

• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…

• Static data vs. streaming data

• A single application can be generating/collecting many types of data

Page 11: Big Data NoSQL 1017

The 4 V’s

Page 12: Big Data NoSQL 1017

Characteristics of Big Data:

3-Speed (Velocity)

• Data is being generated fast and needs to be processed fast

• Online Data Analytics

• Late decisions mean missed opportunities

• Examples

• E-Promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you

• Healthcare monitoring: sensors monitor your activities and body; any abnormal measurements require immediate reaction

Page 13: Big Data NoSQL 1017

The 4 V’s

Page 14: Big Data NoSQL 1017

Who’s Generating Big Data

Social media and networks (all of us are generating data)

Scientific instruments (collecting all sorts of data)

Mobile devices (tracking all objects all the time)

Sensor technology and networks (measuring all kinds of data)

• Progress and innovation are no longer hindered by the ability to collect data

• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

Page 15: Big Data NoSQL 1017

The Model Has Changed…

• The Model of Generating/Consuming Data has Changed

Old Model: a few companies generate data, all others consume data

New Model: all of us are generating data, and all of us are consuming data

Page 16: Big Data NoSQL 1017

NoSQL motivation

Page 17: Big Data NoSQL 1017

Why Now?

Explosion of social media sites (Facebook, Twitter) with large data needs

Explosion of storage needs in large web sites such as Google, Yahoo

Much of the data is not files

Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service)

Shift to dynamically-typed data with frequent schema changes

Open-source community

Page 18: Big Data NoSQL 1017

Parallel Databases and Data Stores

Relational Databases – mainstay of business

Web-based applications caused spikes

Especially true for public-facing e-Commerce sites

Many application servers, one database

Easy to parallelize application servers to 1000s of servers, harder to parallelize databases to same scale

First solution: memcache (in-memory) or other caching mechanisms to reduce database access

Page 19: Big Data NoSQL 1017

Scaling Up

What if the dataset is huge and the number of transactions per second is very high?

Use multiple servers to host database

‘scaling out’ or ‘horizontal scaling’

Parallel databases have been around for a while

But expensive, and designed for decision support, not OLTP (Online Transaction Processing)

Page 20: Big Data NoSQL 1017

Scaling RDBMS – Master/Slave

Master-Slave

All writes are written to the master. All reads performed against the replicated slave databases

Good for applications that mostly read and rarely update

Critical reads may be incorrect as writes may not have been propagated down

Large data sets can pose problems as master needs to duplicate data to slaves

Page 21: Big Data NoSQL 1017

Scaling RDBMS - Partitioning

Partitioning

Divide the database across many machines

E.g. hash or range partitioning

Handled transparently by parallel databases but they are expensive

“Sharding”: divide data amongst many cheap databases (MySQL/PostgreSQL)

Manage parallel access in the application

Scales well for both reads and writes

Not transparent, application needs to be partition-aware
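The partition-aware application described on this slide can be sketched as follows. This is only an illustration of hash partitioning done in the application: the class name `ShardRouter`, the in-memory maps standing in for the cheap databases, and the sample keys are all assumptions, not part of the original deck.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical application-side shard router; every name here is illustrative.
class ShardRouter {
    // Each map stands in for one cheap database (e.g. a MySQL instance).
    private final List<Map<String, String>> shards = new ArrayList<>();

    ShardRouter(int shardCount) {
        for (int i = 0; i < shardCount; i++) {
            shards.add(new HashMap<>());
        }
    }

    // Hash partitioning: the key alone decides which shard owns the row.
    int shardFor(String key) {
        return Math.floorMod(key.hashCode(), shards.size());
    }

    void put(String key, String value) {
        shards.get(shardFor(key)).put(key, value);
    }

    String get(String key) {
        return shards.get(shardFor(key)).get(key);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(4);
        router.put("customer:10", "Samuel");
        router.put("customer:11", "Katja");
        System.out.println(router.get("customer:10")); // Samuel
    }
}
```

Because routing lives in the application, changing the shard count moves most keys to a different shard, which is one reason this approach is called "not transparent".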

Page 22: Big Data NoSQL 1017

What is NoSQL?

Stands for Not Only SQL

Class of non-relational data storage systems

E.g. BigTable, Dynamo, PNUTS/Sherpa, ..

Usually do not require a fixed table schema, nor do they use the concept of joins

All NoSQL offerings relax one or more of the ACID properties (will talk about the CAP theorem)

Not a backlash/rebellion against RDBMS

SQL is a rich query language that cannot be rivaled by the current list of NoSQL offerings

Page 23: Big Data NoSQL 1017

NoSQL Data Storage: Classification

NoSQL solutions fall into 4 major areas:

Uninterpreted key/value or ‘the big hash table’: Amazon S3 (Dynamo), Voldemort, Scalaris

Column-based, with interpreted keys: Cassandra, BigTable, HBase, Sherpa/PNUTS

Others: CouchDB (document-based), Neo4J (graph-based)

Page 24: Big Data NoSQL 1017

NoSQL ecosystem

Page 25: Big Data NoSQL 1017

Page 26: Big Data NoSQL 1017

Big Data Landscape 2015

Page 27: Big Data NoSQL 1017

Page 28: Big Data NoSQL 1017

Architecture

Page 29: Big Data NoSQL 1017

MapReduce

Page 30: Big Data NoSQL 1017

ACID

Atomicity: Either the whole process of a transaction is done or none of it is.

Consistency: Database constraints (application-specific) are preserved.

Isolation: It appears to the user as if only one process executes at a time. (Two concurrent transactions will not see one another’s changes while “in flight”.)

Durability: The updates made to the database in a committed transaction will be visible to future transactions. (Effects of a process do not get lost if the system crashes.)

Page 31: Big Data NoSQL 1017

CAP Theorem

Three properties of a system

Consistency (all copies have same value)

Availability (system can run even if parts have failed)

Partitions (network can break into two or more parts, each with active systems that can’t talk to other parts)

Brewer’s CAP “Theorem”: You can have at most two of these three properties for any system

Very large systems will partition at some point

Choose one of consistency or availability

Traditional databases choose consistency

Most Web applications choose availability

Except for specific parts such as order processing

Page 32: Big Data NoSQL 1017

The reminder

Dial 1-800-remind ......

Available, Consistent – not Partitioned

Not available ...

Available, Partitioned – not Consistent

Consistent, Partitioned – not Available

Page 33: Big Data NoSQL 1017

The proof…

Page 34: Big Data NoSQL 1017

CAP Theorem

Three properties of a system

Consistency (all copies have same value)

Availability (system can run even if parts have failed)

Partitions (network can break into two or more parts, each with active systems that can’t talk to other parts)

Brewer’s CAP “Theorem”: You can have at most two of these three properties for any system

Very large systems will partition at some point

Choose one of consistency or availability

Traditional databases choose consistency

Most Web applications choose availability

Except for specific parts such as order processing

Page 35: Big Data NoSQL 1017

Availability

Traditionally thought of as the server/process being available “five 9s” (99.999%) of the time.

However, in a system with a large number of nodes, at almost any point in time there’s a good chance that a node is down or that there is a network disruption among the nodes.

We want a system that is resilient in the face of network disruption

Page 36: Big Data NoSQL 1017

Eventual Consistency

When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent

For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service

Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID

Soft state: copies of a data item may be inconsistent

Eventually consistent: copies become consistent at some later time if there are no more updates to that data item

Page 37: Big Data NoSQL 1017
Page 38: Big Data NoSQL 1017
Page 39: Big Data NoSQL 1017
Page 40: Big Data NoSQL 1017
Page 41: Big Data NoSQL 1017

BASE in Cassandra

[Diagram: the client sends a query to the Cassandra cluster. The closest replica (Replica A) returns the result, while digest queries go to Replica B and Replica C, which answer with digest responses. Read repair runs if the digests differ.]
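The read path in this diagram can be sketched roughly as below. This is a simplified illustration, not Cassandra's actual implementation: the `Replica` class, the string-hash digest and the last-write-wins rule based on a timestamp are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of digest-based read repair; not Cassandra's real code.
class ReadRepairSketch {
    // Hypothetical replica: one value plus the timestamp of its last write.
    static class Replica {
        String value;
        long timestamp;

        Replica(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }

        // Stand-in for a real digest (for example a cryptographic hash of the row).
        int digest() {
            return (value + "@" + timestamp).hashCode();
        }
    }

    // Read from the closest replica, compare digests from the others,
    // and copy the newest version everywhere if any digest differs.
    static String read(Replica closest, List<Replica> others) {
        Replica newest = closest;
        boolean mismatch = false;
        for (Replica r : others) {
            if (r.digest() != closest.digest()) {
                mismatch = true;
            }
            if (r.timestamp > newest.timestamp) {
                newest = r;
            }
        }
        if (mismatch) {
            String value = newest.value;
            long timestamp = newest.timestamp;
            closest.value = value;
            closest.timestamp = timestamp;
            for (Replica r : others) {
                r.value = value;
                r.timestamp = timestamp;
            }
        }
        return closest.value;
    }

    public static void main(String[] args) {
        Replica a = new Replica("v1", 1L);
        Replica b = new Replica("v2", 2L); // received a newer write
        Replica c = new Replica("v1", 1L);
        System.out.println(read(a, Arrays.asList(b, c))); // v2
        System.out.println(c.value);                      // v2 (repaired)
    }
}
```

In this sketch the read returns the repaired value; a real system may answer the client before the repair finishes, which is where the "soft state" of BASE shows up.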

Page 42: Big Data NoSQL 1017

Common Advantages

Cheap, easy to implement (open source)

Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned

When data is written, the latest version is on at least one node and then replicated to other nodes

Down nodes easily replaced

No single point of failure

Easy to distribute

Don't require a schema

Page 43: Big Data NoSQL 1017

What am I giving up?

joins

group by

order by

ACID transactions

SQL as a sometimes frustrating but still powerful query language

easy integration with other applications that support SQL

Page 44: Big Data NoSQL 1017

Distributed Key-Value Data Stores

Distributed key-value data storage systems allow key-value pairs to be stored (and retrieved on key) in a massively parallel system

E.g. Google BigTable, Yahoo! Sherpa/PNUTS, Amazon Dynamo, ..

Partitioning, high availability etc. completely transparent to application

Sharding systems and key-value stores don’t support many relational features

No join operations (except within partition)

No referential integrity constraints across partitions

etc.

Page 45: Big Data NoSQL 1017

Flexible Data Model

Rockets

Key 1
  name: Rocket-Powered Roller Skates
  toon: Ready, Set, Zoom
  inventoryQty: 5
  brakes: false

Key 2
  name: Little Giant Do-It-Yourself Rocket-Sled Kit
  toon: Beep Prepared
  inventoryQty: 4
  brakes: false

Key 3
  name: Acme Jet Propelled Unicycle
  toon: Hot Rod and Reel
  inventoryQty: 1
  wheels: 1
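The Rockets rows on this slide can be modelled as a map of maps, which shows why no fixed schema is needed. This is a hedged sketch: the class name `FlexibleRows` is made up, and the values for key 3 (inventoryQty 1, wheels 1) are one reading of the flattened slide text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a schema-free table: every row carries its own set of attribute names.
class FlexibleRows {
    static Map<Integer, Map<String, Object>> rockets() {
        Map<Integer, Map<String, Object>> table = new LinkedHashMap<>();

        Map<String, Object> skates = new LinkedHashMap<>();
        skates.put("name", "Rocket-Powered Roller Skates");
        skates.put("toon", "Ready, Set, Zoom");
        skates.put("inventoryQty", 5);
        skates.put("brakes", false);
        table.put(1, skates);

        Map<String, Object> sled = new LinkedHashMap<>();
        sled.put("name", "Little Giant Do-It-Yourself Rocket-Sled Kit");
        sled.put("toon", "Beep Prepared");
        sled.put("inventoryQty", 4);
        sled.put("brakes", false);
        table.put(2, sled);

        // Row 3 has "wheels" instead of "brakes"; no schema change was required.
        Map<String, Object> unicycle = new LinkedHashMap<>();
        unicycle.put("name", "Acme Jet Propelled Unicycle");
        unicycle.put("toon", "Hot Rod and Reel");
        unicycle.put("inventoryQty", 1); // assumed reading of the slide
        unicycle.put("wheels", 1);       // assumed reading of the slide
        table.put(3, unicycle);

        return table;
    }

    public static void main(String[] args) {
        Map<Integer, Map<String, Object>> table = rockets();
        System.out.println(table.get(1).keySet()); // [name, toon, inventoryQty, brakes]
        System.out.println(table.get(3).keySet()); // [name, toon, inventoryQty, wheels]
    }
}
```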

Page 46: Big Data NoSQL 1017

HBase

Page 47: Big Data NoSQL 1017

Google

Tables are sorted by Row

A table schema defines only its column families. Each family consists of any number of columns

Each column consists of any number of versions

Columns only exist when inserted; NULLs are free.

Columns within a family are sorted and stored together

Everything except table names is byte[]

(Row, Family:Column, Timestamp) → Value

[Diagram labels: Row key, Column Family, value, TimeStamp]
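The (Row, Family:Column, Timestamp) → Value model above maps naturally onto nested sorted maps. The sketch below is an illustration only: `SortedCellStore` is a made-up name, and it uses `String` keys and values where the slide says everything is really `byte[]`.

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of the sorted, versioned cell model; Strings stand in for byte[].
class SortedCellStore {
    // row -> "family:column" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<String, NavigableMap<String, NavigableMap<Long, String>>>();

    void put(String row, String family, String column, long timestamp, String value) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            columns = new TreeMap<String, NavigableMap<Long, String>>();
            rows.put(row, columns);
        }
        String qualifier = family + ":" + column;
        NavigableMap<Long, String> versions = columns.get(qualifier);
        if (versions == null) {
            // Reverse order so the newest version is always the first entry.
            versions = new TreeMap<Long, String>(Collections.<Long>reverseOrder());
            columns.put(qualifier, versions);
        }
        versions.put(timestamp, value);
    }

    // Latest version of a cell, or null when the column was never inserted.
    String get(String row, String family, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(family + ":" + column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // Rows are kept sorted by key, so the smallest key is always first.
    String firstRow() {
        return rows.isEmpty() ? null : rows.firstKey();
    }

    public static void main(String[] args) {
        SortedCellStore store = new SortedCellStore();
        store.put("row2", "info", "name", 1L, "Katja");
        store.put("row1", "info", "name", 1L, "Sam");
        store.put("row1", "info", "name", 2L, "Samuel");
        System.out.println(store.get("row1", "info", "name")); // Samuel
        System.out.println(store.get("row1", "info", "age"));  // null
        System.out.println(store.firstRow());                  // row1
    }
}
```

A column that was never written simply has no entry, which is what "NULLs are free" means in this model.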

Page 48: Big Data NoSQL 1017

Splunk – Document base

Page 49: Big Data NoSQL 1017

Splunk – log analysis

Page 50: Big Data NoSQL 1017

PNUTS Data Storage Architecture

Page 51: Big Data NoSQL 1017

Partitioning And Replication

[Diagram: a hash ring running from 0 through 1/2 to 1 with nodes A, B, C, D, E and F; keys are placed on the ring at h(key1) and h(key2); the ring is labelled N=3.]
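One common way to read this ring diagram is consistent hashing with a replication factor of N=3: a key is stored on the first node at or after its position and on the next two nodes clockwise. The sketch below assumes that reading; the class name `HashRing`, the integer positions 0 to 99 and the node placements are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch of a consistent-hashing ring with N-way replication (assumed reading of the slide).
class HashRing {
    // Ring position -> node name, kept sorted so we can walk clockwise.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String name, int position) {
        ring.put(position, name);
    }

    // Hypothetical h(key): maps a key to a ring position in [0, 100).
    static int h(String key) {
        return Math.floorMod(key.hashCode(), 100);
    }

    // Walk clockwise from the key's position and collect the next n distinct nodes.
    List<String> replicasFor(int keyPosition, int n) {
        List<String> result = new ArrayList<>();
        if (ring.isEmpty()) {
            return result;
        }
        int wanted = Math.min(n, ring.size());
        Integer pos = ring.ceilingKey(keyPosition);
        if (pos == null) {
            pos = ring.firstKey(); // wrap around the end of the ring
        }
        while (result.size() < wanted) {
            result.add(ring.get(pos));
            pos = ring.higherKey(pos);
            if (pos == null) {
                pos = ring.firstKey();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        String[] names = {"A", "B", "C", "D", "E", "F"};
        int[] positions = {10, 25, 40, 55, 70, 85};
        for (int i = 0; i < names.length; i++) {
            ring.addNode(names[i], positions[i]);
        }
        System.out.println(ring.replicasFor(30, 3)); // [C, D, E]
        System.out.println(ring.replicasFor(90, 3)); // [A, B, C]
    }
}
```

Adding or removing one node only changes ownership for the keys next to it on the ring, which is the main attraction over plain modulo hashing.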

Page 52: Big Data NoSQL 1017

Should I be using NoSQL Databases?

For almost all of us, regular relational databases are THE correct solution

NoSQL data storage systems make sense for applications that need to deal with very large semi-structured data

Log Analysis

Social Networking Feeds

Page 53: Big Data NoSQL 1017

Graph in practice (thanks to Luca Garulli)

Page 54: Big Data NoSQL 1017

...how to think «graphically» with one of the most common domains in the enterprise world:

The old-classic CRM* domain

* today in 99% of the cases an RDBMS is used

Let’s take a real example - CRM

Page 55: Big Data NoSQL 1017

Every developer knows the Relational Model(?), but who knows the Graph one?

Page 56: Big Data NoSQL 1017

Back to school: Graph Theory crash course

Page 57: Big Data NoSQL 1017

Basic Graph

[Diagram: Sam --Likes--> NoSQL lecture]

Page 58: Big Data NoSQL 1017

Property Graph Model*

Vertex “Sam”: name: Samuel, surname: Dratwa, company: SADOT

Edge “Likes”: since: 2012

Vertex “NoSQL Lecture”: editions: [Comverse, Tel-Aviv]

Vertices and Edges can have properties

Edges are directed

* https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model

Page 59: Big Data NoSQL 1017

Edges - Arcs

[Diagram: Sam connected by an edge to NoSQL lecture]

An Edge connects 2 vertices: use multiple edges to represent 1-N and N-M relationships

Page 60: Big Data NoSQL 1017

Likes

Avital

Sam

FriendOf

NoSQL

lecture

Doron

Joins

Page 61: Big Data NoSQL 1017

Congratulations, this is your diploma in «Graph Theory»

Page 62: Big Data NoSQL 1017

Customer Address

Order Stock

Registry system

Order system

Domain: minimal CRM

Page 63: Big Data NoSQL 1017

Stock

Registry system

Order

Order system

Customer Address

How does a Relational DBMS manage relationships?

Page 64: Big Data NoSQL 1017

JOIN Customer.Address -> Address.Id

Customer

Id Name Address

10 Samuel 34

11 Katja 44

34 Sylvia 54

56 Mark 66

88 Steve 68

Address

Id Location

34 Rome, London

44 Cologne

54 Rome

66 New Mexico

68 Palo Alto

Relational World: 1-1 Relationships

Page 65: Big Data NoSQL 1017

Inverse JOIN Address.Customer -> Customer.Id

Customer

Id Name

10 Samuel

11 Katja

34 Sylvia

56 Mark

88 Steve

Address

Id Customer Location

24 10 Rome

33 10 London

44 34 Rome

66 11 Cologne

68 88 Palo Alto

Relational World: 1-N Relationships

Page 66: Big Data NoSQL 1017

Additional table with 2 JOINs

(1) CustomerAddress.Id -> Customer.Id and

(2) CustomerAddress.Address -> Address.Id

Customer

Id Name

10 Samuel

11 Katja

34 Sylvia

56 Mark

88 Steve

Address

Id Location

24 Rome

33 London

44 Rome

66 Cologne

68 Palo Alto

CustomerAddress

Id Address

10 24

10 33

34 24

Relational World: N-M Relationships

Page 67: Big Data NoSQL 1017

What’s wrong with the Relational Model?

Page 68: Big Data NoSQL 1017

These are all JOINs executed every time you traverse a relationship!

Customer
Id  Name
10  Samuel
11  Katja
34  Sylvia
56  Mark
88  Steve

Address
Id  Location
24  Rome
33  London
44  Rome
66  Cologne
68  Palo Alto

CustomerAddress
Id  Address
10  24
10  33
34  24

The JOIN is the evil!

Page 69: Big Data NoSQL 1017

Why not JOIN

• A JOIN means searching for a key in another table

• The first rule to improve performance is indexing all the keys

• An index speeds up searches but slows down inserts, updates and deletes

• So in the best case a JOIN is a lookup in an index

• This is done for every single join!

• If you traverse hundreds of relationships you’re executing hundreds of JOINs

Page 70: Big Data NoSQL 1017

Index Lookup: is it really that fast?

Page 71: Big Data NoSQL 1017

Index Lookup: how does it work?

Think of an Address Book where we have to find Samuel’s phone number

A-Z
  A-L
    A-D
      A-B
      C-D
    E-L
      E-G
        E-F
        G
      H-L
        H-J
        K-L
  M-Z
    M-R
    S-Z

Index algorithms are all similar and based on balanced trees

Page 72: Big Data NoSQL 1017

[Diagram: the same index tree, from A-Z down to the leaf ranges]

Found! Each lookup takes X steps, where X grows with the index size!
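To make "X grows with the index size" concrete, the sketch below counts the comparisons of a binary search over a sorted array, used here as a stand-in for a balanced-tree index. The class and method names are invented for the illustration; a real B-tree has a larger fan-out and therefore fewer levels, but the logarithmic growth is the same idea.

```java
// Counts lookup steps in a sorted array as a stand-in for a balanced index tree.
class IndexLookupSteps {
    // Binary search that returns how many comparisons were needed.
    static int stepsToFind(int[] sortedKeys, int target) {
        int lo = 0;
        int hi = sortedKeys.length - 1;
        int steps = 0;
        while (lo <= hi) {
            steps++;
            int mid = (lo + hi) >>> 1;
            if (sortedKeys[mid] == target) {
                return steps;
            }
            if (sortedKeys[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return steps; // target absent: steps spent before giving up
    }

    // Builds the sorted keys 0, 1, ..., n-1.
    static int[] keys(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(stepsToFind(keys(1_000), 999));         // 10
        System.out.println(stepsToFind(keys(1_000_000), 999_999)); // 20
    }
}
```

A thousand times more entries only doubles the steps here, but that cost is paid again for every JOIN in a query.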

Page 73: Big Data NoSQL 1017

An index lookup is executed for each JOIN

Querying more tables can easily produce millions of JOINs/Lookups!

Here is the rule: more entries = more lookup steps = slower JOIN

Page 74: Big Data NoSQL 1017

Is there a better way to manage relationships?

Page 75: Big Data NoSQL 1017

How does GraphDB manage index-free relationships?

Page 76: Big Data NoSQL 1017

An Open Source (Apache 2) document-graph NoSQL DBMS

Supports: transactions, extended SQL, multi-master replication, etc.

Page 77: Big Data NoSQL 1017

OrientDB: traverse a relationship

Vertex “Sam” (RID = #13:35)
  label : ‘Customer’
  name : ‘Sam’
  out : [#14:54]

Edge “Lives” (RID = #14:54)
  label : ‘Lives’
  out : [#13:35]
  in : [#13:100]

Vertex “Rome” (RID = #13:100)
  label = ‘Address’
  name = ‘Rome’
  in : [#14:54]

The Record ID (RID) is a physical position

Page 78: Big Data NoSQL 1017

GraphDB handles relationships as a physical LINK to the record on the other side, assigned when the edge is created

RDBMS computes the relationship every time you query a database

Isn’t that crazy?!

Page 79: Big Data NoSQL 1017

This means jumping from an O(log N) algorithm to a near-O(1) one: traversal cost is no longer affected by database size!

This is huge in the BigData age
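The difference can be sketched with plain object references standing in for physical links. This is an in-memory illustration only, not how OrientDB stores records: the `Vertex` and `Edge` classes and the `link` and `follow` helpers are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of index-free adjacency: each vertex holds direct references to its edges.
class DirectLinkGraph {
    static class Vertex {
        final String name;
        final List<Edge> out = new ArrayList<>();

        Vertex(String name) {
            this.name = name;
        }
    }

    static class Edge {
        final String label;
        final Vertex to;

        Edge(String label, Vertex to) {
            this.label = label;
            this.to = to;
        }
    }

    // The link is stored once, when the edge is created.
    static void link(Vertex from, String label, Vertex to) {
        from.out.add(new Edge(label, to));
    }

    // Traversal follows references; its cost depends on this vertex's own edges,
    // not on how many vertices the whole graph contains.
    static List<String> follow(Vertex from, String label) {
        List<String> names = new ArrayList<>();
        for (Edge e : from.out) {
            if (e.label.equals(label)) {
                names.add(e.to.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Vertex sam = new Vertex("Sam");
        Vertex rome = new Vertex("Rome");
        link(sam, "Lives", rome);
        System.out.println(follow(sam, "Lives")); // [Rome]
    }
}
```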

Page 80: Big Data NoSQL 1017

Create the graph in SQL

$luca> cd bin
$luca> ./console.sh
OrientDB console v.1.2.0-SNAPSHOT (www.orientdb.org)
Type 'help' to display all the commands supported.

orientdb> create vertex V set name = 'Sam', label = 'Customer'
Created vertex #13:35 in 0.03 secs

orientdb> create vertex V set name = 'Rome', label = 'Address'
Created vertex #13:100 in 0.02 secs

orientdb> create edge E from #13:35 to #13:100 set label = 'Lives'
Created edge #14:54 in 0.02 secs

Page 81: Big Data NoSQL 1017

Create the graph in Java

import com.orientechnologies.orient.core.db.graph.OGraphDatabase;
import com.orientechnologies.orient.core.record.impl.ODocument;

OGraphDatabase graph = new OGraphDatabase("local:/tmp/db/graph");

// Vertex for the customer
ODocument sam = graph.createVertex();
sam.field("name", "Sam");
sam.field("label", "Customer");

// Vertex for the address
ODocument rome = graph.createVertex();
rome.field("name", "Rome");
rome.field("label", "Address");

// Edge Sam -Lives-> Rome
ODocument edge = graph.createEdge(sam, rome).field("label", "Lives");
edge.save();

graph.close();

Page 82: Big Data NoSQL 1017

Query the graph in SQL

orientdb> select in[label='Lives'].out from V where label = 'Address' and name = 'Rome'

---+--------+--------------------+--------------------+--------------------+
  #| REC ID |label               |out                 |in                  |
---+--------+--------------------+--------------------+--------------------+
  0|   13:35|Sam                 |[#14:54]            |                    |
---+--------+--------------------+--------------------+--------------------+
1 item(s) found. Query executed in 0.007 sec(s).

orientdb> select * from V where label = 'Address' AND in[label='Lives'].size() > 0

---+--------+--------------------+--------------------+--------------------+
  #| REC ID |label               |out                 |in                  |
---+--------+--------------------+--------------------+--------------------+
  0|  13:100|Rome                |                    |[#14:54]            |
---+--------+--------------------+--------------------+--------------------+
1 item(s) found. Query executed in 0.007 sec(s).

Page 83: Big Data NoSQL 1017

Query the graph in Java

OGraphDatabase graph = new OGraphDatabase("local:/tmp/db/graph");

// GET ALL THE CUSTOMERS FROM ROME, ITALY
List<ODocument> result = graph.command(new OCommandSQL(
    "select in[label='Lives'].out from V where label = 'Address' and name = ?")
).execute("Rome");

// Print one line per matching record
for (ODocument v : result) {
    System.out.println("Result: " + v.field("label"));
}

Output:

Result: Sam

Page 84: Big Data NoSQL 1017

Query vs. traversal

Once you have a well-connected database in the form of a Super Graph, you can traverse records instead of querying them!

All you need is some root vertices from which to start traversing

Page 85: Big Data NoSQL 1017

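Query vs. traversal

[Diagram: the root vertex “Customers” links to the customers Sam, John and Sylvia; the graph also contains Order 2332, Order 8834, the product “White Soap”, and “Stocks” and “Special Customers” vertices. Callout: this is a root vertex.]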

Page 86: Big Data NoSQL 1017

Query the graph in SQL

Supposing that the root node #30:0 (Customers) links all the Customer vertices

Get all the customers:

orientdb> select out.in from #30:0

Get all the customers who bought at least one 'White Soap' product:

orientdb> select * from ( select out.in from #30:0 ) where out.in.out[label='Bought'].in.name = 'White Soap'

Page 88: Big Data NoSQL 1017

Should I be using NoSQL Databases?

For almost all of us, regular relational databases are THE correct solution

NoSQL data storage systems make sense for applications that need to deal with very, very large semi-structured data

Log Analysis

Social Networking Feeds

Page 89: Big Data NoSQL 1017

WHAT CAN WE DO WITH BIG DATA ?

Page 90: Big Data NoSQL 1017

What’s driving Big Data

- Ad-hoc querying and reporting

- Data mining techniques

- Structured data, typical sources

- Small to mid-size datasets

- Optimizations and predictive analytics

- Complex statistical analysis

- All types of data, and many sources

- Very large datasets

- More real-time

Page 91: Big Data NoSQL 1017

Value of Big Data Analytics

• Big data is more real-time in nature than traditional DW applications

• Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps

• Shared nothing, massively parallel processing, scale out architectures are well-suited for big data apps

Page 92: Big Data NoSQL 1017

Page 93: Big Data NoSQL 1017

USN with related technical areas

Page 94: Big Data NoSQL 1017

What is collecting all this data?

Web Browsers and Search Engines

Microsoft’s

Internet Explorer

Mozilla’s FireFox

Google’s Chrome

Apple’s Safari

Google’s

Microsoft’s

Yahoo’s

IAC Search’s

Time-Warner’s AOL

Explorer

(Non-profit foundation,

used to be Netscape)

Page 95: Big Data NoSQL 1017

What is collecting all this data?

Smartphones & Apps

Apple’s iPhone

(Apple O/S)

Samsung, HTC, Nokia, Motorola

(Android O/S)

RIM Corp’s Blackberry

(BlackBerry O/S)

Tablet Computers & Apps

Apple’s iPad

Samsung’s Galaxy

Amazon’s Kindle Fire

Page 96: Big Data NoSQL 1017

What is collecting all this data?

Hospitals & Other Medical Systems Banking & Phone Systems

Can you hear me now?

(Heh heh heh!)

Pharmacies

Laboratories

Imaging Centers

Emergency Medical Services (EMS)

Hospital Information Systems

Doc-in-a-Box

Electronic Medical Records

Blood Banks

Birth & Death Records

Page 97: Big Data NoSQL 1017

What is collecting all this data?

A real pain in the apps! What are they collecting?

• Restaurant reservations (Open Table)

• Weather in L.A. in 3 days (Weather+)

• Side effects of medications (MedWatcher)

• 3-star hotels in New Orleans (Priceline)

• Which PC should I buy and where (PriceCheck)

Page 98: Big Data NoSQL 1017

Big Brother Needs Big Data

In March 2012, the Obama Administration announced the Big Data Research and Development Initiative, $200 million in new R&D investments, which will explore how Big Data could be used to address important problems facing the government. The initiative was composed of 84 different Big Data programs spread across six departments.

http://tinyurl.com/85oytkj

The U.S. Federal Government owns six of the ten most powerful supercomputers in the world.

Page 99: Big Data NoSQL 1017

How Companies Like Amazon Use Big Data To Make You Love Them

Last month, I talked to Amazon customer service about my malfunctioning Kindle, and it was great. Thirty seconds after putting in a service request on Amazon’s website, my phone rang, and the woman on the other end--let’s call her Barbara--greeted me by name and said, "I understand that you have a problem with your Kindle." We resolved my problem in under two minutes, we got to skip the part where I carefully spell out my last name and address, and she didn’t try to upsell me on anything. After nearly a decade of ordering stuff from Amazon, I never loved the company as much as I did at that moment.

The fact is, Amazon has been collecting my information for years--not just addresses and payment information but the identity of everything I’ve ever bought or even looked at. And while dozens of other companies do that, too, Amazon’s doing something remarkable with theirs. They’re using that data to build our relationship.

Article by Sean Madden, May 2012, an expert in service design and innovation strategy.

Page 100: Big Data NoSQL 1017

How Can You Avoid Big Data?

• Pay cash for everything!

• Never go online!

• Don’t use a telephone!

• Don’t use Kroger or Harris Teeter cards!

• Don’t fill any prescriptions!

• Never leave your house!

Page 101: Big Data NoSQL 1017

Key concept of Big Data

• Store everything

• Don’t delete anything

• Schema is a bottleneck

• Always think in parallel

• Remember the CAP theorem

Page 102: Big Data NoSQL 1017

Thank You!!!

…and please fill in the evaluation form