big data strategy for the relational world

Big Data Strategy for the Relational World Embracing Disruption, Avoiding Regression Andrew J. Brust Founder & CEO, Blue Badge Insights Big Data correspondent, ZDNet Big Data Analyst, GigaOM Research

Upload: andrew-brust

Post on 19-Jan-2015


TRANSCRIPT

Page 1: Big Data Strategy for the Relational World

Big Data Strategy for the Relational World

Embracing Disruption, Avoiding Regression

Andrew J. Brust

Founder & CEO, Blue Badge Insights

Big Data correspondent, ZDNet

Big Data Analyst, GigaOM Research

Page 2: Big Data Strategy for the Relational World

Bio

• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair, Visual Studio Live! and 18 years as a speaker
• Founder, Microsoft BI User Group of NYC
  – http://www.msbinyc.com
• Co-moderator, NYC .NET Developers Group
  – http://www.nycdotnetdev.com
• “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
• Twitter: @andrewbrust

Page 3: Big Data Strategy for the Relational World

Andrew on ZDNet (bit.ly/bigondata)

Page 4: Big Data Strategy for the Relational World

Read all about it!

Page 5: Big Data Strategy for the Relational World

Big Data: Why Should You Care?

• Because analytics (i.e. BI) has always been important, but it was expensive and obscure

• Because the economics of processing and storage make Big Data feasible

Page 6: Big Data Strategy for the Relational World

Big Data: Why Should You be Cautious?

• Too many vendors; too much churn
• Designed for the lab, not for mainstream business
• Immature technology and tooling
  – Results in serious recruiting and dev costs
• So, you can’t ignore Big Data, but you can’t just pursue it with abandon, either.
  – That’s hard!

Page 7: Big Data Strategy for the Relational World

Agenda

• Trends
• Technologies
  – NoSQL
  – Hadoop
  – SQL Convergence
  – NewSQL
  – In-Memory
• Forecasts
• Risks
• Recommendations

Page 8: Big Data Strategy for the Relational World

Database Trends

• NoSQL
  – Mongo and Cassandra, primarily
• Late-bound schema
  – aka “unstructured data”
• File-based table handling
  – Especially HDFS
• Columnar storage
  – And Massively Parallel Processing
• Co-existence with RDBMS, OLAP databases
  – Very few throwing them away
• Little change in tools/clients
  – Still expect tables or cubes

Page 9: Big Data Strategy for the Relational World

NoSQL

• Key-Value Store
  – Couchbase
  – Riak
  – Redis
  – Voldemort
  – DynamoDB
  – Azure tables
• Document Store
  – MongoDB
  – CouchDB
  – Cloudant
  – Couchbase
• Wide Column Store
  – HBase
  – Cassandra
• Graph Database
  – Neo4J

SQLSQL

Page 10: Big Data Strategy for the Relational World

Consistency

• CAP Theorem
  – Databases may only excel at two of the following three attributes: consistency, availability and partition tolerance
• NoSQL does not offer “ACID” guarantees
  – Atomicity, consistency, isolation and durability
• Instead offers “eventual consistency”
  – Similar to DNS propagation
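
As a rough illustration of the “eventual consistency” bullet, the following toy sketch models two replicas that accept writes locally and converge only after a synchronization pass, much as a DNS change takes time to propagate. The `Replica` class, the `sync` function, and the last-write-wins rule based on a version number are all assumptions made for this example; real NoSQL products use their own, more elaborate mechanisms.

```python
# Toy model of eventual consistency. Every name here is hypothetical and
# exists only for illustration; it is not any product's actual protocol.


class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (version, value)

    def write(self, key, value, version):
        # Last-write-wins: keep the entry with the highest version number.
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def sync(replicas):
    """Push every replica's entries to every other replica (one full round)."""
    for source in replicas:
        for key, (version, value) in list(source.store.items()):
            for target in replicas:
                target.write(key, value, version)


a, b = Replica("a"), Replica("b")
a.write("www.example.com", "10.0.0.1", version=1)
stale = b.read("www.example.com")  # b has not seen the write yet
sync([a, b])
fresh = b.read("www.example.com")  # b now agrees with a
```

Between the write and the `sync` call, a reader hitting replica `b` gets an out-of-date answer. That window is the price paid for keeping both replicas available for writes.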

Page 11: Big Data Strategy for the Relational World

CAP Theorem

[Diagram: the three CAP attributes (Consistency, Availability, Partition Tolerance), with Relational and NoSQL databases positioned among them]

Page 12: Big Data Strategy for the Relational World

NoSQL Upside

• Distributed by default
• Open source lets you peg costs to personnel, more than to customers
• Developer enthusiasm

Page 13: Big Data Strategy for the Relational World

Hadoop

• Open source, petabyte-scale data analysis and processing framework
• Runs on commodity hardware
• Lots of ecosystem
• Two main components:
  – Hadoop Distributed File System (HDFS)
  – MapReduce engine


Page 15: Big Data Strategy for the Relational World

Why MapReduce is Cool

• Extremely flexible – full power of a procedural programming language
• Map step, essentially, allows ad hoc ETL
• With Reduce step, aggregation is a first-class concept
• Growing ecosystem of tools that generate MapReduce code
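
To make the Map and Reduce bullets concrete, here is a minimal single-process sketch of the pattern in plain Python. It is not Hadoop code: the function names `map_step`, `reduce_step`, and `run_job`, and the comma-separated "region, product, amount" input format, are assumptions chosen for illustration. The map step cleans and reshapes raw lines (the ad hoc ETL), and the reduce step aggregates the values collected for each key.

```python
from collections import defaultdict


def map_step(line):
    """Ad hoc ETL: parse one raw line and emit (key, value) pairs."""
    parts = line.split(",")
    if len(parts) != 3:
        return  # skip malformed input instead of failing the job
    region, _product, amount = parts
    try:
        value = float(amount)
    except ValueError:
        return  # skip lines whose amount is not numeric
    yield region.strip().lower(), value


def reduce_step(key, values):
    """Aggregation: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    # "Shuffle" phase: group the mapped values by key before reducing.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_step(line):
            groups[key].append(value)
    return dict(reduce_step(k, v) for k, v in sorted(groups.items()))


raw = ["East, widget, 10.5", "west,gadget,4", "EAST,gadget,2.5", "bad line"]
totals = run_job(raw)
```

In a real cluster the map calls run in parallel across many machines and the framework handles the grouping. The sketch only shows the shape of the code a developer writes.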

Page 16: Big Data Strategy for the Relational World

Why MapReduce Sucks

• It’s a batch mode technology
• It’s not declarative
• Most BI products don’t work with MR natively
  – They connect via Hive instead (by and large)
• It’s good for a group of use cases, but it’s not a good general framework
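
For contrast with the “not declarative” point, the sketch below expresses a per-region total as a single SQL statement. It uses Python's built-in `sqlite3` module purely as a stand-in for any SQL engine; the `sales` table, its columns, and the `sales_by_region` helper are hypothetical names invented for this example.

```python
import sqlite3


def sales_by_region(rows):
    """Return per-region totals using one declarative GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()


rows = [("east", "widget", 10.5), ("west", "gadget", 4.0), ("east", "gadget", 2.5)]
result = sales_by_region(rows)
```

The query states what is wanted and leaves the grouping strategy to the engine. A hand-written MapReduce job has to spell out the parsing, grouping, and summing steps itself.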

Page 17: Big Data Strategy for the Relational World

The Google DNA

• Hadoop and HBase derive from Google’s published designs
  – Hadoop: MapReduce, GFS
  – HBase: BigTable
• Hadoop was built for their use cases, and they don’t use it as extensively now
• So why is the world going Hadoop-crazy?

Page 18: Big Data Strategy for the Relational World

Benefits of Schema-Free

• Variable schema is accommodated
  – Great for product catalogs, content management and the like
• Simple for archival storage
• For analysis:
  – Avoids politics of achieving consensus on structure
  – Allows different schema for different applications
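
A small sketch may help show what “different schema for different applications” can look like in practice. The product documents, field names, and the two view functions below are all invented for illustration. The point is only that the stored records differ in shape and each consumer applies its own structure at read time.

```python
import json

# Three catalog entries with different fields; nothing forces a common schema.
raw_documents = [
    '{"sku": "B-100", "name": "Paperback", "pages": 320}',
    '{"sku": "S-200", "name": "Running shoe", "sizes": [9, 10, 11], "color": "red"}',
    '{"sku": "D-300", "name": "E-book", "format": "epub"}',
]


def catalog_view(doc):
    """Schema used by a hypothetical storefront: only sku and name matter."""
    return {"sku": doc["sku"], "name": doc["name"]}


def apparel_view(doc):
    """Schema used by a hypothetical apparel report: needs sizes and color."""
    if "sizes" not in doc:
        return None  # this document is not relevant to the apparel report
    return {
        "sku": doc["sku"],
        "sizes": doc["sizes"],
        "color": doc.get("color", "unknown"),
    }


documents = [json.loads(raw) for raw in raw_documents]
catalog = [catalog_view(d) for d in documents]
apparel = [v for v in (apparel_view(d) for d in documents) if v is not None]
```

Neither application had to agree with the other on a table definition before the data was stored, which is the “avoids politics” benefit named above.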

Page 19: Big Data Strategy for the Relational World

Cloud Effect

• Database as a service and SaaS BI/Analytics get companies excited
  – Cloudant
  – Amazon: DynamoDB, RDS, Redshift, Jaspersoft
• Elastic capabilities of cloud provide small customers with access to huge clusters
  – Amazon EMR, Microsoft Windows Azure HDInsight now
  – Google Compute Engine, Rackspace/Hortonworks to come
• Cloud-borne reference data adds value
• But casualties emerging: e.g. Xeround

Page 20: Big Data Strategy for the Relational World

SQL Skillset and Ecosystem

• DBAs, most devs know it
  – Making recruiting faster and cheaper
• ORMs expect it
• Reporting/analysis tools are premised on it
  – Even if they also talk to MDX and NoSQL sources
• Companies are invested in it
• Abandoning it is naive

Page 21: Big Data Strategy for the Relational World

MPP is Big Data (via acquisition)

• Teradata
  – Acquired Aster Data
• Netezza
  – IBM
• Vertica
  – HP
• Pivotal/Greenplum
  – EMC
• ParAccel
  – Actian
• SQL Server Parallel Data Warehouse
  – Microsoft-DATAllegro acquisition

Page 22: Big Data Strategy for the Relational World

SQL – BD Convergence

• Brings the SQL language and data warehouse products, on one side, together with Hadoop, on the other
• Goal is to make Hadoop interactive, non-batch
• May involve Hive and its APIs
• May involve direct access to HDFS
  – Bypassing MapReduce
• Think of the “database” as HDFS, and MapReduce as merely an access method.

Page 23: Big Data Strategy for the Relational World

One Repository, Multiple Access Methods

HCatalog

Page 24: Big Data Strategy for the Relational World

SQL – BD Convergence

• Cloudera Impala (v1.0 shipped April 30)
• Hortonworks “Stinger” initiative
  – Make Hive 100x faster
• EMC Pivotal
• Microsoft PolyBase, Data Explorer
• Teradata Aster SQL-H
• ParAccel (Actian) ODI

Page 25: Big Data Strategy for the Relational World
Page 26: Big Data Strategy for the Relational World

NewSQL Entrants

• NuoDB
• VoltDB
• Clustrix
• TransLattice

Page 27: Big Data Strategy for the Relational World

Dremel and Drill

• Dremel is Google’s column store analytical database
  – Proprietary; available publicly as BigQuery
• Hierarchical/nested too
  – Allows schema variance without anarchy
• “…scales to thousands of CPUs and petabytes of data, and has thousands of users at Google.”
• Uses SQL, has growing BI tool support
• Petabyte scale
• Drill is to Dremel as Hadoop is to MapReduce + GFS
• And then there’s Spanner

Page 28: Big Data Strategy for the Relational World

In-Memory

• SAP HANA
  – And Sybase IQ
• Data Warehouse Appliances
• VoltDB
• Oracle TimesTen
• IBM solidDB
  – Also TM1 (in-memory OLAP)
• Coming: SQL Server’s “Hekaton” engine

Page 29: Big Data Strategy for the Relational World

The Truth About In-Memory

• Judicious use of in-memory database technology can speed analytical queries
  – Combine with columnar technology, rinse, repeat
• Can also eliminate need for deferred writes
• A RAM-only strategy like HANA’s seems impractical
• Keep in mind:
  – SSD is memory too. It’s slower, but it’s memory.
  – Conversely, L1, L2 and L3 cache is faster than RAM. Single Instruction, Multiple Data (SIMD) makes things faster still.
• Hybrid approaches are most sensible
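
The pairing of in-memory and columnar storage can be sketched in a few lines. The example below is only a layout illustration with made-up order data and hypothetical function names. It keeps the same records once as a list of rows and once as one contiguous array per column, then computes the same aggregate from each. It makes no performance claim; it shows why an analytical query that touches one column needs to read far less data in the columnar layout.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "east", "amount": 10.5},
    {"order_id": 2, "region": "west", "amount": 4.0},
    {"order_id": 3, "region": "east", "amount": 2.5},
]


def to_columns(records):
    """Pivot rows into one sequence per column (column-oriented layout)."""
    return {
        "order_id": array("q", (r["order_id"] for r in records)),
        "region": [r["region"] for r in records],
        # Packed 64-bit floats: the whole column sits in one contiguous buffer.
        "amount": array("d", (r["amount"] for r in records)),
    }


def total_amount_row_store(records):
    # Visits every record and every record's dictionary to reach one field.
    return sum(r["amount"] for r in records)


def total_amount_column_store(columns):
    # Scans a single contiguous column and ignores the other columns.
    return sum(columns["amount"])


columns = to_columns(rows)
row_total = total_amount_row_store(rows)
column_total = total_amount_column_store(columns)
```

Contiguous, same-typed columns are also the kind of data that CPU caches and SIMD instructions handle well, which connects to the cache and SIMD reminder in the slide.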

Page 30: Big Data Strategy for the Relational World

What’s Ahead?

• Consolidation! We can’t have this many vendors:
  – Some will go out of business
  – Some will get acquired
  – A few will stay independent (but may merge with each other)
• Hadoop recedes into the service layer
• NoSQL shakes out, matures, coexists
• NewSQL gets adopted or acquired
• In-memory becomes a standard option

Page 31: Big Data Strategy for the Relational World

Risks and Considerations

• Pick an esoteric database now and you may be forced to migrate later
• SQL Server and Oracle could add features that make the specialty products superfluous
  – Or new products
• Conversely, NoSQL products may acquire ACID-like features themselves
• More convergence

Page 32: Big Data Strategy for the Relational World

Recommendations

• NoSQL has its use cases. But it also has its abuses.

• Look carefully at the number of customers
• Look also at how widely deployed the product is within those customer companies

Page 33: Big Data Strategy for the Relational World

Recommendations

• If you haven’t looked seriously at Hadoop, do so. But remember, it’s infrastructure.

• You can reach out to Big Data now, or you can wait for it to reach out to you
  – Cost/benefit of earlier adoption vs. late following
• For repeatable big problems, MapReduce works well; for iterative query, “SQL” technologies are much better
  – akin to standard reports versus ad hoc queries

Page 34: Big Data Strategy for the Relational World

Parting Thoughts

• NoSQL and Big Data are disruptive
• You ignore them at your peril
• But if they can’t, ultimately, blend into current technology environments then they’re destined to fail
• You can embrace the change without being sacrificed. Just watch your back.

Page 35: Big Data Strategy for the Relational World

Thank You!

• Email: [email protected]
• Blog: http://www.zdnet.com/blog/big-data
• Twitter: @andrewbrust