
Big Data and NoSQL

Sources

• P. J. Sadalage, M. Fowler, NoSQL Distilled, Addison-Wesley

• Slides: pages.di.unipi.it/ghelli/bd2/15.bigdata.pdf

Very short history of DBMSs

• The seventies:

– IMS (end of the sixties, built for the Apollo program; today at Version 15) and IDS (later IDMS): hierarchical and network DBMSs, navigational

• The eighties – for twenty years:

– Relational DBMSs

• The nineties: client/server computing, three tiers, thin clients

Object Oriented Databases

• In the nineties, Object Oriented databases were proposed to overcome the impedance mismatch

• They influenced Relational Databases, and disappeared

Big Data

• Mid 2000s, Big Data:

– Volume:

• DBMSs do not scale enough for some applications

– Velocity:

• Computational speed

• Development velocity:

– DBMSs require upfront schema design and data cleaning

– Variety:

• Schemas conflict with variety

Big Data Examples

• Managing and analysing:

– Google searches

– Twitter feeds

– Facebook posts

– Amazon sales

– Connection data for a mobile phone company

– Location data for a car-black-box company

Big Data platforms

• The Google stack:

– Hardware: each Google Modular Data Center houses 1,000 Linux servers with AC and disks

– GFS: distributed and redundant file system

– MapReduce

– BigTable, on top of GFS

• Hadoop (open source):

– HDFS, Hadoop MapReduce

– HBase

– SQL on Hadoop: Apache Hive, IBM Jaql, Apache Pig, Cloudera Impala

Big Data systems: NoSQL systems

• NoSQL: Giving up something to get something more

• Giving up:

– ACID transactions, to gain distribution

– Upfront schema, to gain:

• Velocity

• Variety

– First normal form, to reduce the need for joins

• Different from NewSQL

Types of NoSQL systems

• Key-value stores (Amazon Dynamo, Riak, Voldemort…)

• Document databases:

– XML databases: MarkLogic, eXist

– JSON databases: CouchDB, Membase, Couchbase, MongoDB

• Sparse table databases:

– HBase

• Graph databases (not really about Big Data):

– Neo4j

NewSQL

• NewSQL is a different approach to Velocity, much less disruptive than NoSQL

– Column databases

– In memory databases

NoSQL

Why NoSQL

• Impedance mismatch

• The schema problem:

– Restrictive

– Heavy to set up

• Integration databases -> application databases

• Cluster architecture

– Google BigTable

– Amazon Dynamo

NoSQL: reasons of success

• Support cluster architecture (Velocity, Volume)

– Google BigTable

– Amazon Dynamo

• Remove schema restriction (Variety, Velocity)

• Simple for simple tasks

NoSQL

• A loosely defined set of systems that are not RDBMSs

• Usually do not support SQL

• Are usually Open Source (not always)

• Often cluster-oriented (not always), hence no ACID

• Recent (after 2000)

• Schema free

• Oriented toward a single application

• It is more a ‘movement’ than a technology

Aggregate data models

• From many simple tables -> just one collection of aggregated objects (a simplified object data model)

• Aggregate data model is essential in order to work without transactions and without joins
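As a concrete illustration (a hypothetical e-commerce order, not taken from the slides), here is a minimal Python sketch of one aggregate that collapses what would be several normalized tables (orders, order lines, addresses) into a single nested object:

# A hypothetical order aggregate: customer, order lines and address
# live in one nested object, so the whole order can be read, written,
# and sharded as a single unit.
order = {
    "_id": "order-1001",
    "customer": {"id": "cust-42", "name": "Mario Rossi"},
    "lines": [
        {"product": "prod-7", "qty": 2, "price": 19.90},
        {"product": "prod-9", "qty": 1, "price": 5.00},
    ],
    "shipping_address": {"city": "Pisa", "zip": "56127"},
}

# The order total is computed without any join.
total = sum(l["qty"] * l["price"] for l in order["lines"])
print(total)  # 44.8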

Aggregate data models

• NoSQL data models:

– Aggregate data models:

• Key-value

• Document

• Column family

– Graph model

Graph model

• Set of triples <nodeid, property, nodeid> (FlockDB, Neo4J)

Aggregate orientation

http://martinfowler.com/bliki/AggregateOrientedDatabase.html

Aggregate data models

• Key value stores: the database is a collection of <key,value> pairs, where the value is opaque (Dynamo, Riak, Voldemort)

• Document database: a collection of documents (XML or JSON) that can be searched by content (MarkLogic, MongoDB)

• Column-family stores: a set of <key, record> pairs (BigTable, HBase, Cassandra)

– Columns are grouped in ‘column families’

Key-value stores implementation

• Implementation model:

– Key-based distribution of the pairs on a huge farm of inexpensive machines

– Constant time access

– Constant time parallel execution on all the pairs

– Flexible fault-tolerance

– MapReduce execution model

– Amazon Dynamo, Riak, Voldemort
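A minimal sketch of key-based distribution, assuming a simple modulo placement over three hypothetical nodes (real systems such as Dynamo use consistent hashing instead, which avoids reshuffling almost all keys when a node joins or leaves):

import hashlib

NODES = ["node-A", "node-B", "node-C"]  # hypothetical machines

def node_for(key: str) -> str:
    # Hash the key and map it to a node: every client computes the
    # same placement, so any pair can be located in constant time.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

print(node_for("cart:user:1234"))  # always routes to the same node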

Schemaless databases

• Schema first vs. schema later

• Homogeneous vs. non homogeneous

Materialized views

• OLAP applications greatly benefit from materialized views

• Materialized views can be used to regain the flexibility of the relational model

Key-Value distribution

• Sharding + replication

• Sharding: splitting data among nodes according to a key

• Master-slave replication:

– No update conflict

– Read resilience

– Master election

• P2P replication:

– No single point of failure

• The distributed consistency problem

Levels of Consistency

• With respect to write-write conflicts: avoiding losing an update

• Read consistency:

– Fresh data

– No intermediate data

– Session consistency

• Transactional consistency:

– Only write values that are based on currently valid data

The CAP Theorem: example

• Would like: Consistency + Availability + Partition tolerance

• Store three copies of a value, for Availability

• 1 read, 3 writes:

– Read from any 1 node

– Before committing an update, wait for all three writes to be completed

• 1 write, 3 reads:

– As soon as one write is ok, commit

– Always read all 3 copies and return the newest value

• 2 writes, 2 reads:

– If you read 2 copies, at least one is current

• Consistency + Availability + Partition tolerance = Impossible

The CAP Theorem

• You cannot have all of:

– Consistency

– Availability

– Partition tolerance

• A trade-off between consistency and latency

• Relaxing consistency

– e.g., two concurrent writes to the same shopping cart

• Relaxing durability

Consistency: single operation atomicity

• The problem: avoiding r/w and w/w conflicts on a single operation

• Quorum: in a P2P system, an operation is successful if it gets a quorum of confirmations

– The write quorum:

• W > N/2

– The read quorum:

• R+W > N
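The two conditions can be checked mechanically; a sketch (N, R, W as in the slide, the function name is ours):

def quorum_ok(n: int, r: int, w: int) -> bool:
    # W > N/2: two concurrent writes cannot both obtain a write
    #          quorum, so write-write conflicts are excluded.
    # R + W > N: every read quorum overlaps every write quorum,
    #            so a read always sees the latest successful write.
    return w > n / 2 and r + w > n

print(quorum_ok(3, 2, 2))  # True: the classic N=3, R=W=2 setting
print(quorum_ok(3, 3, 1))  # False: W=1 is not a majority of N=3
print(quorum_ok(3, 1, 3))  # True: read-one / write-all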

Consistency: update consistency

• The problem: only update a value if nobody else has changed it in the meantime

• Optimistic approach:

– You read the data item with a version stamp

– Every time you update, you change the version

– The update operation takes the previous version as a parameter, and fails if the stamp has changed: Compare-And-Set (CAS)
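A minimal in-memory sketch of the optimistic scheme (the store and its API are illustrative, not the interface of any real system):

class StaleVersion(Exception):
    pass

class Store:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Returns the value together with its version stamp.
        return self._data.get(key, (0, None))

    def cas(self, key, expected_version, new_value):
        # Compare-And-Set: fail if somebody updated the key meanwhile.
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            raise StaleVersion(f"expected {expected_version}, found {version}")
        self._data[key] = (version + 1, new_value)

s = Store()
v, _ = s.read("cart")
s.cas("cart", v, ["book"])   # succeeds: the version still matches
# s.cas("cart", v, ["dvd"])  # would raise StaleVersion: version moved to 1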

Consistency: replication optimism

• Assume we do not have a quorum, and two copies with version ids are updated in parallel: what happens?

– When the version is a counter

– When the version is a random GUID

• The P2P consistency problem: deciding the temporal relationship between two different versions

• A local counter or GUID does not help

• The vector clock:

– Assume nodes A, B, C; the version stamp is [A:7; B:5; C:9]
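A sketch of vector-clock comparison (node names as in the slide; the dictionaries and the function are illustrative):

def compare(vc1, vc2):
    # A version happened before another iff it is <= on every node
    # and < on at least one; otherwise the two are concurrent.
    nodes = set(vc1) | set(vc2)
    le = all(vc1.get(n, 0) <= vc2.get(n, 0) for n in nodes)
    ge = all(vc1.get(n, 0) >= vc2.get(n, 0) for n in nodes)
    if le and ge:
        return "equal"
    if le:
        return "before"
    if ge:
        return "after"
    return "concurrent"  # a real conflict: neither dominates

print(compare({"A": 7, "B": 5, "C": 9}, {"A": 8, "B": 5, "C": 9}))  # before
print(compare({"A": 7, "B": 6, "C": 9}, {"A": 8, "B": 5, "C": 9}))  # concurrent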

Parallelism: Map-Reduce

• Map(m): apply m in parallel to each object, to get a set of <key, value> pairs

• Shuffle-sort: collect all pairs with the same <key> to the same node, get sets with shape {<k,v1>,…,<k,vn>}

• Reduce(r): apply r to each set {<k,v1>,…,<k,vn>} to produce a result

Map-Reduce

[Diagram: four input data blocks pass through Map (m: d → seq(k,v)), producing <key, value> pairs; Shuffle-and-Sort groups all pairs with the same key on the same node; Reduce (r: seq(k,v) → (k,v)) aggregates each group into the output.]

Example: word count

• Problem: counting the number of occurrences for each word in a big collection of documents

• Map:

– takes a pair (k, document), ignores k, and returns a pair (w, 1) for each word w in the document

• Shuffle&Sort:

– groups the Map output by w and produces pairs of the form (w, [1, …, 1])

• Reduce:

– takes a pair (w, [1, …, 1]) and outputs (w, 1+…+1)

Example: word count

[Diagram: word count on the four input lines "NoSQL Parallel NoSQL", "Velocity NoSQL DBMS", "Velocity Map Velocity", "NoSQL Parallel NoSQL". Map(m) emits a (word, 1) pair for each occurrence; Shuffle-and-Sort groups the pairs by word; Reduce(r) sums each group, yielding NoSQL 5, Velocity 3, Parallel 2, Map 1, DBMS 1.]

Pseudo code

• Map( _, v ):

– for each w in v do emit(w, 1)

• Reduce(k, v):

– c=0;

– for x in v do c = c +1;

– emit(k, c)
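The pseudo code can be turned into a runnable single-machine simulation of the three phases (the mini framework below is a sketch; a real system runs the same functions distributed over many nodes):

from collections import defaultdict

def map_fn(_, document):
    # Emit a (word, 1) pair for every word occurrence.
    for w in document.split():
        yield (w, 1)

def reduce_fn(word, counts):
    return (word, sum(counts))

def mapreduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    pairs = [p for k, v in records for p in map_fn(k, v)]
    # Shuffle-and-sort phase: group all values by key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # Reduce phase: one call per distinct key.
    return [reduce_fn(k, vs) for k, vs in groups.items()]

docs = [(1, "NoSQL Parallel NoSQL"), (2, "Velocity NoSQL DBMS")]
print(mapreduce(docs, map_fn, reduce_fn))
# [('NoSQL', 3), ('Parallel', 1), ('Velocity', 1), ('DBMS', 1)]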

Exercises

• Sales(Date,StoreId,ProdId,Amount)

• How to compute group_by({Date},{sum(Amount)})?

• Sales as above, plus Stores(StoreId, Region)

• How to compute join(Sales,Stores)?
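One possible approach to the first exercise, expressed as map and reduce functions for the mini framework sketched above (ours, not an official solution): key each sale by Date and sum the Amounts.

def map_sales(_, sale):
    # sale = (Date, StoreId, ProdId, Amount)
    date, store_id, prod_id, amount = sale
    yield (date, amount)

def reduce_sales(date, amounts):
    return (date, sum(amounts))

The join can be sketched the same way: map both Sales and Stores tuples to pairs keyed by StoreId, tag each value with the relation it comes from, and let the reducer combine the Sales and Stores values that meet at the same key.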

Implementing map-reduce: Hadoop

• Input and output of each phase are stored in a distributed file system that manages the partitioning and the replication

• Spark approach: when possible, input and output are just kept in main memory

• The computation is divided among many small tasks

• A task manager assigns the tasks and, when a task fails, re-executes it

Dataflow systems

• Dataflow systems are similar to map-reduce systems, but they implement a wider range of parallel patterns, with vertices that generalize the map and reduce vertices and edges that generalize the key-based communication between map and reduce

Key-Value Databases

• Basically, a persistent hash table

• Sharding + replication

• Consistency:

– Single object

– Riak: for each bucket (data space):

• Newest write wins / create siblings

• Setting read / write quorum

• Query:

– By key

– Full store scan (not always provided)

• Uses: session information, user profiles, shopping cart data by userid…
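A sketch of the typical usage pattern, with a plain dictionary standing in for the remote store (a real client such as Riak's exposes the same get/put shape over the network). Note that the value is opaque: the application serializes and deserializes it, and the store cannot look inside.

import json

store = {}  # stand-in for a remote key-value store

# Put: the shopping cart is serialized into an opaque value.
store["cart:user:1234"] = json.dumps({"items": ["book", "dvd"]})

# Get: the key is the only access path; the store cannot answer
# queries *inside* the value (e.g. "all carts containing a dvd").
cart = json.loads(store["cart:user:1234"])
print(cart["items"])  # ['book', 'dvd']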

Document Databases: MongoDB

• One instance, many databases, many collections

• JSON documents with _id field

• Sharding + replication

Consistency

• Master/slave replication

– Automated failover, server maintenance, disaster recovery, read scaling

• The master is dynamically re-elected on failure

• One can specify a write quorum

• One can specify whether reads can be directed to slaves

Querying

• CouchDB:

– query via views (virtual or materialized)

• MongoDB:

– Selection, projection, aggregation
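A sketch of selection, projection and aggregation with the pymongo driver (it assumes a local mongod and a hypothetical people collection):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["test"]

# Selection + projection: adults, returning only the name field.
for doc in db.people.find({"age": {"$gte": 18}}, {"name": 1, "_id": 0}):
    print(doc)

# Aggregation: average age per city.
pipeline = [{"$group": {"_id": "$city", "avg_age": {"$avg": "$age"}}}]
for doc in db.people.aggregate(pipeline):
    print(doc)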

Column-family Stores

• A ‘column-family’ (similar to a ‘table’ in relational databases) is a set of <key,record> pairs

• It can be vertically partitioned into keyspaces

• Records are not necessarily homogeneous

Consistency

• In Cassandra:

– The DBA fixes the number of replicas for each keyspace

– the programmer decides the quorum for read and write operations (1, majority, all…)

– Transactions:

• Atomicity at the row level

• Possibility to use external transactional libraries

Queries (Cassandra)

• Row retrieval:

– GET Customer[‘johnsmith00012’]

• Field (column) retrieval:

– GET Customer[‘johnsmith00012’][‘age’]

• After you create an index on age:

– GET Customer WHERE age = 35

• Cassandra supports CQL:

– Select-project (no join) SQL
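A sketch of the same query through CQL with the DataStax Python driver (the cluster address, keyspace, and table are assumptions; it relies on the index on age created above):

from cassandra.cluster import Cluster

# Connect to a hypothetical local cluster and keyspace.
session = Cluster(["127.0.0.1"]).connect("shop")

# Select-project, no join: CQL looks like SQL but stays
# within a single column family.
rows = session.execute("SELECT name, age FROM customer WHERE age = %s", (35,))
for row in rows:
    print(row.name, row.age)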

Graph Databases

• A graph database stores a graph

• A graph is, essentially, a database with one ternary table:

– Edges(NodeId1, NodeId2, EdgeAttributes)

– You may also have a Nodes(NodeId, NodeAttributes) table (optional)

• Example: Neo4J
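A sketch of the ternary-table view (data and attribute names are illustrative):

# Edges(NodeId1, NodeId2, EdgeAttributes) as a list of triples.
edges = [
    ("giorgio", "anna",  {"type": "FRIEND"}),
    ("anna",    "neo4j", {"type": "WORKED_WITH"}),
]

def neighbours(node, edge_type):
    # A traversal is just a scan (or an index lookup) on the edge table.
    return [dst for src, dst, attrs in edges
            if src == node and attrs["type"] == edge_type]

print(neighbours("anna", "WORKED_WITH"))  # ['neo4j']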


Consistency

• Graph databases are usually not sharded, and they are transactional

• Neo4J supports master-slave replication

• Data can be sharded at the application level with no database support, which is quite hard

Querying: Cypher

MATCH (me {name:"Giorgio"})

RETURN me

Querying: Cypher

MATCH (expert)

-[:WORKED_WITH]->

(neodb:Database{name:"Neo4j"})

RETURN neodb, expert

Querying: Cypher

MATCH (me {name:"Giorgio"})

MATCH (expert)

-[:WORKED_WITH]->

(neodb:Database {name:"Neo4j"})

MATCH path = shortestPath( (me)-[:FRIEND*..5]-(expert) )

RETURN neodb, expert, path

Querying: Cypher

MATCH pattern matches

WHERE filtering conditions

RETURN what to return

ORDER BY properties to order by

SKIP nodes to skip from the top

LIMIT limit results
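A sketch of running a query that combines these clauses from Python, using the official neo4j driver (the connection URI, credentials, and data are assumptions):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "secret"))

with driver.session() as session:
    result = session.run(
        "MATCH (p:Person) WHERE p.age > $min "
        "RETURN p.name AS name ORDER BY name SKIP 0 LIMIT 10",
        min=30,
    )
    for record in result:
        print(record["name"])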

NoSQL systems advantages

• Support for cluster architecture:

– Volume and Velocity

• Aggregate model, schemaless architecture:

– Velocity of development for simple applications

• Schemaless architecture:

– Supports Variety

• Flexible consistency:

– Supports Velocity

NoSQL systems problems

• Transactional support is limited to a single aggregate

• Flexible consistency is hard to manage

• No SQL, no optimization:

– Complex data needs to be pre-aggregated

– Different queries require the construction of different re-aggregations of the same data

Big Data architectural trends

• The data lake

• Polyglot systems

The Data Lake

• Standard Data Warehouse architecture:

– Long phase of data design to decide the schema

– Complex phase of data cleaning to get high-quality data

– Ready to play

• The Data Lake:

– Just collect all data you have in the Data Lake

– Run ML algorithms on the Lake

Polyglot systems

• Combine transactional RDBMSs, DSSs and NoSQL systems

• Advantages: pay the price of schemas and transactions only where they are needed

• Problems: maintenance and security

SQL on top of MapReduce

• Serdar Yegulalp compiled this list in 2014 (ask Google):

– Apache Hive: the original SQL-on-Hadoop solution

– Stinger: Hortonworks' development of Apache Hive

– Apache Drill: an open source implementation of Google's Dremel (aka BigQuery), to access multiple types of data stores

– Spark SQL: Apache's Spark project for real-time, in-memory, parallelized processing of Hadoop data

– Apache Phoenix: its developers call it a "SQL skin for HBase"

– Cloudera Impala: another implementation of Dremel/Apache Drill for Hadoop

– HAWQ for Pivotal HD: Pivotal's version for its own Hadoop distribution

– Presto: built by Facebook's engineers, reminiscent of Apache Drill

– Oracle Big Data SQL

– IBM BigSQL

Conclusion

• There is no ‘winner’: DBMSs, DSSs, parallel and distributed DBs, NoSQL systems are all here to stay

• There is a terrible trend of moving everything to NoSQL and Machine Learning due to hype: a great occasion for consultants, and for waste

• The only way of making a good choice is having a real understanding of:

– The business problem to be solved

– The current state of the technology
