big data - architectural concerns for the new age

Big Data architectural concerns for the new age Sunday, 2 December 12

Upload: debasish-ghosh

Post on 26-Jan-2015

DESCRIPTION

A brief introduction to Big Data and why you should care about polyglot storage

TRANSCRIPT

Page 1: Big Data - architectural concerns for the new age

Big Data - architectural concerns for the new age

Sunday, 2 December 12

Page 2: Big Data - architectural concerns for the new age

Debasish Ghosh, CTO

(a Nomura Research Institute group company)

Page 3: Big Data - architectural concerns for the new age

@debasishg on Twitter

code @ http://github.com/debasishg

blog @ Ruminations of a Programmer http://debasishg.blogspot.com

Page 4: Big Data - architectural concerns for the new age

some numbers ..

Page 5: Big Data - architectural concerns for the new age

Facebook reaches 1 billion active users

Page 8: Big Data - architectural concerns for the new age

some more numbers ..

Page 9: Big Data - architectural concerns for the new age

• Walmart handles 1M transactions per hour

• Google processes 24PB of data per day

• AT&T transfers 30PB of data per day

• 90 trillion emails are sent every year

• World of Warcraft uses 1.3PB of storage

Page 10: Big Data - architectural concerns for the new age

Big Data - the positive feedback cycle

1. new technologies make using big data efficient

2. more adoption of big data

3. generation of more big data

Page 11: Big Data - architectural concerns for the new age

new technologies

.. new architectural concerns

Page 12: Big Data - architectural concerns for the new age

new ways to store data

Page 13: Big Data - architectural concerns for the new age

new techniques to retrieve data

Page 14: Big Data - architectural concerns for the new age

new ways to scale reads & writes

Page 15: Big Data - architectural concerns for the new age

transparent to the application

Page 16: Big Data - architectural concerns for the new age

new ways to consume data

Page 17: Big Data - architectural concerns for the new age

new techniques to analyze data

Page 18: Big Data - architectural concerns for the new age

new ways to visualize data

Page 19: Big Data - architectural concerns for the new age

at Web scale

Page 20: Big Data - architectural concerns for the new age

The Database Landscape so far ..

• relational database - the bedrock of enterprise data

• irrespective of application development paradigm

• object-relational-mapping considered to be the panacea for impedance mismatch

Page 21: Big Data - architectural concerns for the new age

“Object Relational Mapping is the Vietnam of Computer Science”

- Ted Neward (2006)

blogger, big geek and architectural consultant

Page 22: Big Data - architectural concerns for the new age

RDBMS & Big Data

• once the data volume crosses the limit of a single server, you shard / partition

• sharding implies a lookup node for the hash code => SPOF

• cross-shard joins and transactions don’t scale
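A minimal sketch of why hand-rolled sharding is painful to grow, assuming the simplest possible scheme: the shard is the key's hash modulo the shard count. The names `shard_for` and `moved_fraction` are hypothetical, and md5 is just an arbitrary stable hash.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (md5 is an arbitrary choice)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


keys = [f"user:{i}" for i in range(10_000)]
fraction = moved_fraction(keys, 4, 5)
print(f"{fraction:.0%} of keys change shard when going from 4 to 5 shards")
```

With a uniform hash, a key keeps its shard only when `h % 4 == h % 5`, which holds for 4 of every 20 hash values, so roughly 80% of the data has to be moved to add a single server.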

Page 23: Big Data - architectural concerns for the new age

RDBMS & Big Data

• cost of distributed transactions

  • synchronization overhead

  • 2 phase commit is a blocking protocol (can block indefinitely)

  • as slow as the slowest DB node + network latency
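The cost structure above can be sketched with an in-memory two-phase commit. This is a toy model under stated assumptions, not a real transaction manager: `Participant`, `two_phase_commit`, and the simulated `latency_ms` field are all hypothetical, and no network or crash handling is modeled.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    name: str
    can_commit: bool = True
    latency_ms: int = 0      # simulated round-trip time, not measured
    state: str = "init"

    def prepare(self) -> bool:
        # Phase 1: vote. A "yes" vote means locks are held until phase 2 arrives.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants):
    """Return (committed, elapsed_ms) for one transaction."""
    # Phase 1: the coordinator must hear from every participant,
    # so the round takes as long as the slowest one.
    votes = [p.prepare() for p in participants]
    elapsed_ms = max(p.latency_ms for p in participants)
    # If the coordinator died here, prepared participants would stay
    # blocked; that is the "can block indefinitely" case.
    if all(votes):
        for p in participants:
            p.commit()
        return True, elapsed_ms
    for p in participants:
        p.rollback()
    return False, elapsed_ms


nodes = [Participant("db1", latency_ms=5), Participant("db2", latency_ms=40)]
print(two_phase_commit(nodes))
```

A single "no" vote rolls back every participant, and the elapsed time is always set by the slowest node.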

Page 24: Big Data - architectural concerns for the new age

RDBMS & Big Data

• Master/Slave replication

  • synchronous replication => slow

  • asynchronous replication => can lose data

  • writing to master is a bottleneck and SPOF

Page 25: Big Data - architectural concerns for the new age

Need Distributed Databases

• data is automatically partitioned

• transparent to the application

• add capacity without downtime

• failure tolerant

Page 26: Big Data - architectural concerns for the new age

2 famous papers ..

• Bigtable: A distributed storage system for structured data, 2006

• Dynamo: Amazon’s highly available key-value store, 2007

Page 27: Big Data - architectural concerns for the new age

Addressing 2 Approaches

• Bigtable: “how can we build a distributed database on top of GFS ?”

• Dynamo: “how can we build a distributed hash table appropriate for the data center ?”

Page 28: Big Data - architectural concerns for the new age

Big Data recommendations

• reduce accidental complexity in processing data

• be less rigid (no rigid schema)

• store data in a format closer to the domain model

• hence no universal data model ..

Page 29: Big Data - architectural concerns for the new age

Polyglot Storage

• unfortunately came to be known as NoSQL databases

• document oriented (MongoDB, CouchDB)

• key/value (Dynamo, Bigtable, Riak, Cassandra, Voldemort)

• data structure based (Redis)

• graph based (Neo4j)

Page 30: Big Data - architectural concerns for the new age

richer modeling capabilities

closer to domain model

reduced impedance mismatch

Page 31: Big Data - architectural concerns for the new age

Asynchronous Replication to RDBMS using Message Oriented Middleware

Page 32: Big Data - architectural concerns for the new age

Hybrid Oracle MongoDB storage over Messaging backbone
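A rough sketch of the hybrid arrangement, using only stand-ins from the Python standard library: a dict plays the document store, `queue.Queue` plays the messaging backbone, and an in-memory `sqlite3` database plays the relational side. `DocumentStore`, `replicate`, and the `trades` table are hypothetical names invented for illustration; nothing here is a real Oracle, MongoDB, or message-broker API.

```python
import json
import queue
import sqlite3
import threading


class DocumentStore:
    """Stand-in for the primary document store (a dict, not a real client)."""

    def __init__(self, bus):
        self.docs = {}
        self.bus = bus

    def save(self, doc_id, doc):
        self.docs[doc_id] = doc
        # Publish the change and return immediately; the writer never waits
        # for the relational side.
        self.bus.put(json.dumps({"id": doc_id, "doc": doc}))


def replicate(bus, conn):
    """Consumer: apply change messages to the relational store until None arrives."""
    while True:
        msg = bus.get()
        try:
            if msg is None:
                return
            event = json.loads(msg)
            conn.execute(
                "INSERT OR REPLACE INTO trades (id, symbol, qty) VALUES (?, ?, ?)",
                (event["id"], event["doc"]["symbol"], event["doc"]["qty"]),
            )
            conn.commit()
        finally:
            bus.task_done()


bus = queue.Queue()
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE trades (id TEXT PRIMARY KEY, symbol TEXT, qty INTEGER)")
threading.Thread(target=replicate, args=(bus, conn), daemon=True).start()

store = DocumentStore(bus)
store.save("t1", {"symbol": "IBM", "qty": 100, "tags": ["block"]})
store.save("t1", {"symbol": "IBM", "qty": 150, "tags": ["block", "amended"]})

bus.put(None)   # sentinel: stop the consumer
bus.join()      # wait until every published message has been applied
rows = conn.execute("SELECT id, symbol, qty FROM trades").fetchall()
print(rows)
```

The document keeps its richer shape (the `tags` list) in the primary store, while the relational copy holds only the flat columns that reporting needs and lags behind by however long the queue takes to drain.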

Page 33: Big Data - architectural concerns for the new age

A relational database is just another option, not the only option, when the data set is BIG and semantically rich

Page 34: Big Data - architectural concerns for the new age

10 things never to do with a Relational Database

• Search

• Recommendation

• High Frequency Trading

• Product Cataloging

• User group / ACLs

• Log Analysis

• Media Repository

• Email

• Classified ads

• Time Series / Forecasting

Source: http://www.infoworld.com/d/application-development/10-things-never-do-relational-database-206944?page=0,0

Page 35: Big Data - architectural concerns for the new age

Scalability, Availability ..

• ACID => BASE

• CAP Theorem & Eventual Consistency

• Consistent Hashing

• Vector Clocks

• Hinted Hand-off & Read repair

• Anti-entropy

• Gossip Protocol
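Of the techniques listed, vector clocks are the easiest to show in a few lines. The sketch below assumes a clock is a plain dict from node name to counter; `increment`, `merge`, and `compare` are hypothetical helper names, not the API of any particular database.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a, b):
    """Element-wise maximum: the clock that has seen everything a and b have seen."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a, b):
    """Classify a relative to b: 'equal', 'before', 'after' or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


v1 = increment({}, "A")    # written on node A
v2 = increment(v1, "B")    # updated on node B, having seen v1
v3 = increment(v1, "C")    # updated on node C, also from v1, unaware of v2

print(compare(v1, v2))     # before
print(compare(v2, v3))     # concurrent
print(merge(v2, v3))
```

A "concurrent" result is the signal that two replicas diverged and the application, or a later write that carries the merged clock, has to reconcile them.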

Page 36: Big Data - architectural concerns for the new age

CAP Theorem

• Consistency, Availability & Partition Tolerance

• You can have only 2 of these in a distributed system

• Eric Brewer postulated this in 2000; Gilbert and Lynch proved it formally in 2002

Page 37: Big Data - architectural concerns for the new age

ACID => BASE

• Basically Available, Soft state, Eventual consistency

• Rather than requiring consistency after every transaction, it’s enough for the database to eventually be in a consistent state.

• It’s ok to use stale data and it’s ok to give approximate answers
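A small sketch of what "eventually consistent" can look like in practice, assuming three replicas that each hold a `(version, value)` pair per key and a reader that repairs stale copies as it goes. The integer version is a deliberate simplification (real systems use timestamps or vector clocks), and `Replica` and `read_with_repair` are hypothetical names.

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, version, value):
        # Accept a write only if it is newer than what this replica holds.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        return self.data.get(key)


def read_with_repair(replicas, key):
    """Return the newest value seen and push it back to every replica."""
    known = [r.read(key) for r in replicas]
    known = [entry for entry in known if entry is not None]
    if not known:
        return None
    newest = max(known, key=lambda entry: entry[0])
    for r in replicas:
        r.write(key, *newest)  # no-op on replicas that are already current
    return newest[1]


a, b, c = Replica(), Replica(), Replica()
for replica in (a, b, c):
    replica.write("cart", 1, ["book"])
a.write("cart", 2, ["book", "pen"])   # the update has reached only one replica

print(b.read("cart"))                 # a stale answer, which BASE tolerates
print(read_with_repair([a, b, c], "cart"))
print(b.read("cart"))                 # repaired as a side effect of the read
```

Between the write and the repairing read, a client that talks only to `b` gets the old cart; the system stays available throughout and converges afterwards.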

Page 38: Big Data - architectural concerns for the new age

Consistent Hashing
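The slide's diagram did not survive in this transcript, so here is a minimal sketch of the idea instead: nodes and keys are hashed onto the same ring, and a key belongs to the first node position at or after its own. The `HashRing` class, the md5 hash, and the choice of 100 virtual nodes per server are assumptions made for illustration.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Place a string on the ring; md5 is an arbitrary but stable choice."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes      # virtual nodes per server, to even out load
        self._ring = []           # sorted (position, node) pairs
        self._positions = []      # positions only, kept in step for bisect
        for node in nodes:
            self.add(node)

    def _rebuild(self):
        self._ring.sort()
        self._positions = [pos for pos, _ in self._ring]

    def add(self, node):
        self._ring.extend(
            (ring_position(f"{node}#{i}"), node) for i in range(self.vnodes)
        )
        self._rebuild()

    def remove(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]
        self._rebuild()

    def node_for(self, key):
        if not self._ring:
            raise ValueError("ring has no nodes")
        # First ring position clockwise from the key, wrapping around at the end.
        idx = bisect.bisect(self._positions, ring_position(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["db1", "db2", "db3", "db4"])
keys = [f"user:{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}

ring.add("db5")
moved = [k for k in keys if ring.node_for(k) != before[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved, all of them to the new node")
```

Adding a fifth server relocates only about a fifth of the keys, and every relocated key lands on the new server, so capacity can be added without reshuffling the rest of the cluster.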

Page 39: Big Data - architectural concerns for the new age

Big Data in the wild

• Hadoop

  • started as a batch processing engine (HDFS & Map/Reduce)

  • with bigger and bigger data, you need to make it available to users in near real time

  • stream processing, CEP ..
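For readers who have not seen the Map/Reduce model, a single-process word count shows its three steps. This is plain Python, not Hadoop code; `map_phase`, `shuffle`, `reduce_phase`, and `map_reduce` are hypothetical names, and a real job would run the map and reduce steps on many machines over HDFS blocks.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def map_reduce(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = map_reduce(["big data", "Big storage for big data"])
print(counts)
```

Each map call needs only its own line and each reduce call only its own key, which is what makes the batch easy to spread over a cluster and also why results arrive only after the whole job finishes.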

Page 40: Big Data - architectural concerns for the new age

complementing Map/Reduce in Hadoop

• Hive: a data warehouse system for Hadoop for easy data summarization, ad-hoc queries & analysis of large datasets stored in Hadoop compatible file systems

• Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs

• Cloudera Impala: real time ad hoc query capability to Hadoop, complementing traditional MapReduce batch processing

Page 41: Big Data - architectural concerns for the new age

Real time queries in Hadoop

• currently people use Hadoop connectors to massively parallel databases to do real time queries in Hadoop

• expensive and may need lots of data movement between the database & the Hadoop clusters

Page 42: Big Data - architectural concerns for the new age

.. and the Hadoop ecosystem continues to grow, with lots of real time tools being actively developed that are compatible with the current base ..

Page 43: Big Data - architectural concerns for the new age

Shark from UC Berkeley

• a large scale data warehouse system for Spark, compatible with Hive

• supports HiveQL, Hive data formats and user defined functions. In addition, Shark can be used to query data in HDFS, HBase and Amazon S3

Page 44: Big Data - architectural concerns for the new age

BI and Analytics

• making Big Data available to developers

• API / scripting abilities for writing rich analytic applications (Precog, Continuity, Infochimps)

• analyzing user behaviors, network monitoring, log processing, recommenders, AI ..

Page 45: Big Data - architectural concerns for the new age

Machine Learning

• personalization

• social network analysis

• pattern discovery - click patterns, recommendations, ratings

• apps that rely on machine learning - Prismatic, Trifacta, Google, Twitter ..

Page 46: Big Data - architectural concerns for the new age

Summary

• Big Data will grow bigger - we need to embrace the changes in architecture

• An RDBMS is NOT the panacea - pick the data model that’s closest to your domain

• It’s economical to limit data movement - process data in place and utilize the multiple cores of your hardware

Page 47: Big Data - architectural concerns for the new age

Summary

• Go for decentralized architectures, avoid SPOFs

• With the big volumes of data, streaming is your friend
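One way to read the streaming advice: compute results incrementally as records arrive instead of storing everything and scanning it later. The sketch below keeps a running mean in constant memory; `running_mean` is a hypothetical name and the input list stands in for an unbounded feed.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, one result per input value."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental update, no history kept
        yield mean


latencies_ms = [2, 4, 6]  # stand-in for an endless feed of measurements
for current in running_mean(latencies_ms):
    print(current)
```

Because the generator holds only a count and the current mean, it works the same on three values or three billion.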

Page 48: Big Data - architectural concerns for the new age

Thank You!
