big data - architectural concerns for the new age


DESCRIPTION

A brief introduction to Big Data and why you should care about polyglot storage

TRANSCRIPT

Big Data: architectural concerns for the new age

2 December 2012

Debasish Ghosh, CTO

(a Nomura Research Institute group company)

@debasishg on Twitter

code @ http://github.com/debasishg

blog @ Ruminations of a Programmer http://debasishg.blogspot.com


some numbers ..


Facebook reaches 1 billion active users


some more numbers ..


• Walmart handles 1M transactions per hour

• Google processes 24PB of data per day

• AT&T transfers 30PB of data per day

• 90 trillion emails are sent every year

• World of Warcraft uses 1.3PB of storage


Big Data - the positive feedback cycle

a positive feedback loop: (1) new technologies make using big data efficient → (2) more adoption of big data → (3) generation of more big data → back to (1)

new technologies

.. new architectural concerns


new ways to store data


new techniques to retrieve data


new ways to scale reads & writes


transparent to the application


new ways to consume data


new techniques to analyze data


new ways to visualize data


at Web scale


The Database Landscape so far ..

• relational database - the bedrock of enterprise data

• irrespective of application development paradigm

• object-relational mapping considered to be the panacea for the impedance mismatch


“Object Relational Mapping is the Vietnam of Computer Science”

- Ted Neward (2006)

blogger, big geek and architectural consultant


RDBMS & Big Data

• once the data volume crosses the limit of a single server, you shard / partition

• sharding implies a lookup node for the hash code => SPOF

• cross shard joins, transactions don’t scale
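
To make the SPOF concrete, here is a minimal sketch of naive hash sharding (plain Python, illustrative names, not any specific product): every client routes through one shared shard map, and a key's home comes from its hash code.

```python
import hashlib

# A minimal sketch of naive hash sharding. Every client consults the same
# shard map, so whoever owns that lookup table is a single point of failure.

NUM_SHARDS = 4
shard_map = {i: f"db-node-{i}" for i in range(NUM_SHARDS)}  # the lookup table

def shard_for(key: str) -> str:
    # route a key to a shard via its hash code
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return shard_map[h % NUM_SHARDS]

print(shard_for("user:42"))
# A join across keys living on different shards cannot run inside one
# database; the application must fetch from several nodes and merge by hand.
```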


RDBMS & Big Data

• Cost of distributed transactions

• synchronization overhead

• 2-phase commit is a blocking protocol (can block indefinitely)

• as slow as the slowest DB node + network latency
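
A schematic sketch of why 2-phase commit blocks (the Participant interface below is a hypothetical stand-in, not any real driver API): the coordinator cannot leave phase 1 until every participant has voted, so its latency is that of the slowest node.

```python
# Schematic two-phase commit coordinator, a sketch only.

class Participant:
    def prepare(self) -> bool: ...   # phase 1 vote; may hang indefinitely
    def commit(self) -> None: ...
    def rollback(self) -> None: ...

def two_phase_commit(participants):
    # Phase 1: collect votes -- blocks until *every* node answers,
    # i.e. as slow as the slowest DB node plus network latency.
    votes = [p.prepare() for p in participants]
    # Phase 2: unanimous yes -> commit everywhere, else roll back.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```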


RDBMS & Big Data

• Master/Slave replication

• synchronous replication => slow

• asynchronous replication => can lose data

• writing to master is a bottleneck and SPOF
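
A toy model of the asynchronous case (plain Python, names illustrative): the ack races ahead of replication, and that gap is exactly the data-loss window.

```python
import queue

# Toy asynchronous master/slave replication. The master acknowledges the
# write before the slave has applied it, so a master crash inside that
# window loses data the client believes is safe.

replication_log = queue.Queue()
master, slave = {}, {}

def write(key, value):
    master[key] = value
    replication_log.put((key, value))  # shipped to the slave later, not now
    return "ack"                       # client sees success immediately

def replicate_one():
    key, value = replication_log.get()
    slave[key] = value

write("k", "v")
# ... if the master crashes here, before replicate_one() runs, the client
# holds an ack for a write the slave never received.
```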


Need Distributed Databases

• data is automatically partitioned

• transparent to the application

• add capacity without downtime

• failure tolerant


2 famous papers ..

• Bigtable: A distributed storage system for structured data, 2006

• Dynamo: Amazon’s highly available key-value store, 2007


2 Approaches

• Bigtable: “how can we build a distributed database on top of GFS?”

• Dynamo: “how can we build a distributed hash table appropriate for the data center?”


Big Data recommendations

• reduce accidental complexity in processing data

• be less rigid (no rigid schema)

• store data in a format closer to the domain model

• hence no universal data model ..


Polyglot Storage

• unfortunately came to be known as NoSQL databases

• document oriented (MongoDB, CouchDB)

• key/value (Dynamo, Riak, Voldemort)

• column family (Bigtable, Cassandra)

• data structure based (Redis)

• graph based (Neo4J)
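
As a flavor of what “closer to the domain model” means, here is a sketch of an order aggregate as a single MongoDB-style document (a plain Python dict; collection and field names are made up): the whole aggregate lives in one place, with no joins and no mapping layer.

```python
# One order, one document -- nested structure mirrors the domain model.
order = {
    "_id": "order-1001",
    "customer": {"name": "Alice", "email": "alice@example.com"},
    "items": [
        {"sku": "A-17", "qty": 2, "price": 10.0},
        {"sku": "B-03", "qty": 1, "price": 24.5},
    ],
}

# The same aggregate in a relational schema would be spread over several
# tables (orders, customers, line_items) and reassembled with joins.
total = sum(i["qty"] * i["price"] for i in order["items"])
print(total)  # 44.5
```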


richer modeling capabilities

closer to domain model

reduced impedance mismatch


Asynchronous Replication to RDBMS using Message Oriented Middleware


Hybrid Oracle/MongoDB storage over a messaging backbone
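
A sketch of that hybrid pattern (a plain in-process queue stands in for the real messaging backbone; save_to_mongo and apply_to_oracle are hypothetical stand-ins, not driver calls): the document store takes the synchronous write, and the relational copy is applied asynchronously by a consumer.

```python
import json, queue

backbone = queue.Queue()   # stand-in for a JMS/AMQP topic

def save_to_mongo(order: dict) -> None: ...    # hypothetical document-store insert
def apply_to_oracle(order: dict) -> None: ...  # hypothetical relational insert

def save_order(order: dict) -> None:
    save_to_mongo(order)              # fast path: write the document store
    backbone.put(json.dumps(order))   # fire-and-forget onto the backbone

def rdbms_consumer() -> None:
    # runs elsewhere; drains the backbone and updates Oracle asynchronously,
    # so the relational copy is eventually consistent with the documents
    while True:
        apply_to_oracle(json.loads(backbone.get()))
```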


A relational database is just another option, not the only option, when the data set is BIG and semantically rich


10 things never to do with a Relational Database

• Search

• Recommendation

• High Frequency Trading

• Product Cataloging

• User groups / ACLs

• Log Analysis

• Media Repository

• Email

• Classified ads

• Time Series / Forecasting

Source: http://www.infoworld.com/d/application-development/10-things-never-do-relational-database-206944?page=0,0


Scalability, Availability ..

• ACID => BASE

• CAP Theorem & Eventual Consistency

• Consistent Hashing

• Vector Clocks

• Hinted Hand-off & Read repair

• Anti-entropy

• Gossip Protocol
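
Of the techniques listed above, vector clocks are the easiest to sketch: each replica keeps a per-node counter, and comparing two clocks tells you whether one update supersedes the other or the two conflict. A minimal version (plain Python, illustrative node names):

```python
# Minimal vector clocks: a dict of per-node counters.

def increment(clock: dict, node: str) -> dict:
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen everything clock b has."""
    return all(a.get(n, 0) >= c for n, c in b.items())

v1 = increment({}, "node-A")   # {'node-A': 1}
v2 = increment(v1, "node-B")   # {'node-A': 1, 'node-B': 1}
v3 = increment(v1, "node-C")   # a sibling of v2, not its descendant

print(descends(v2, v1))                     # True: v2 supersedes v1
print(descends(v2, v3), descends(v3, v2))   # False False: conflict, reconcile
```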


CAP Theorem

• Consistency, Availability & Partition Tolerance

• You can have only 2 of these in a distributed system

• Eric Brewer conjectured this in 2000; Gilbert and Lynch later formalized and proved it


ACID => BASE

• Basically Available, Soft state, Eventually consistent

• Rather than requiring consistency after every transaction, it’s enough for the database to eventually be in a consistent state.

• It’s ok to use stale data and it’s ok to give approximate answers


Consistent Hashing
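
The ring diagram on this slide is not reproduced here; in its place, a minimal sketch of the idea (plain Python, no virtual nodes): servers and keys hash onto the same ring, a key belongs to the first server clockwise from it, and adding or removing one server only remaps the keys in a single arc.

```python
import hashlib
from bisect import bisect_right

def ring_hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

nodes = ["node-A", "node-B", "node-C"]
ring = sorted((ring_hash(n), n) for n in nodes)   # positions on the ring
points = [p for p, _ in ring]

def node_for(key: str) -> str:
    i = bisect_right(points, ring_hash(key)) % len(ring)  # wrap around
    return ring[i][1]

print(node_for("user:42"))
# Removing one of N nodes here remaps only that node's arc of keys;
# with plain modulo sharding almost every key would move.
```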


Big Data in the wild

• Hadoop

• started as a batch processing engine (HDFS & Map/Reduce)

• with bigger and bigger data, you need to make it available to users in near real time

• stream processing, CEP ..
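
For flavor, the Map/Reduce model Hadoop started with, in a few lines of plain Python (not the Hadoop API): map each record to (word, 1) pairs, then reduce by summing counts per key.

```python
from collections import Counter
from itertools import chain

records = ["big data big deal", "data at web scale"]

def mapper(line):
    # map phase: emit a (word, 1) pair per word
    return [(word, 1) for word in line.split()]

def reducer(pairs):
    # reduce phase: sum the counts per key
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

print(reducer(chain.from_iterable(mapper(r) for r in records)))
# -> word counts: big=2, data=2, deal=1, at=1, web=1, scale=1
```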


Hive: a data warehouse system for Hadoop for easy data summarization, ad hoc queries & analysis of large datasets stored in Hadoop-compatible file systems

Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs

Cloudera Impala: real-time ad hoc query capability for Hadoop, complementing traditional Map/Reduce batch processing

Real time queries in Hadoop

• currently people use Hadoop connectors to massively parallel databases to do real-time queries in Hadoop

• expensive and may need lots of data movement between the database & the Hadoop clusters


.. and the Hadoop ecosystem continues to grow, with lots of real-time tools under active development that stay compatible with the existing base ..


Shark from UC Berkeley

• a large scale data warehouse system for Spark, compatible with Hive

• supports HiveQL, Hive data formats and user-defined functions; in addition, Shark can be used to query data in HDFS, HBase and Amazon S3


BI and Analytics

• making Big Data available to developers

• API / scripting abilities for writing rich analytic applications (Precog, Continuity, Infochimps)

• analyzing user behaviors, network monitoring, log processing, recommenders, AI ..


Machine Learning

• personalization

• social network analysis

• pattern discovery - click patterns, recommendations, ratings

• apps that rely on machine learning - Prismatic, Trifacta, Google, Twitter ..
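
A toy taste of the pattern discovery mentioned above: item-to-item co-occurrence over click histories, the simplest form of “people who viewed X also viewed Y” (data and names are made up).

```python
from collections import defaultdict
from itertools import combinations

histories = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["laptop", "charger"],
]

# count how often each pair of items appears in the same history
cooccur = defaultdict(int)
for basket in histories:
    for a, b in combinations(sorted(set(basket)), 2):
        cooccur[(a, b)] += 1

def recommend(item):
    # rank items by how often they co-occur with the given one
    scores = {b if a == item else a: n
              for (a, b), n in cooccur.items() if item in (a, b)}
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("phone"))   # ['case', 'charger']
```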


Summary

• Big Data will grow bigger - we need to embrace the changes in architecture

• An RDBMS is NOT the panacea - pick the data model that’s closest to your domain

• It’s economical to limit data movement - process data in place and utilize the multiple cores of your hardware


Summary

• Go for decentralized architectures, avoid SPOFs

• With the big volumes of data, streaming is your friend


Thank You!
