Big Data Anti-Patterns: Lessons from the Front Lines

Post on 01-Jul-2015

DESCRIPTION

October 2014 Strata NYC presentations

TRANSCRIPT

Big Data Anti-Patterns: Lessons from the Front Lines

Strata NYC

October 17, 2014

Douglas Moore

| 2

About Douglas Moore

Think Big – 3 Years

- Delivery

• BDW, Search, Streaming

- Roadmaps

- Tech Assessments

Before Big Data

- Data Warehousing

- OLTP

- Systems Architecture

- Electricity

- High End Graphics

- Supercomputers

- Numerical Analysis

Contact me at: @douglas_ma

| 3

Think Big

4-Year-Old “Big Data” Professional Services Firm

- Roadmaps

- Engineering

- Data Science

- Hands on Training

Recently acquired by Teradata

• Maintaining Independence

| 4

Content Drawn From Vast Amounts of Experience

50+ Clients

Leading security software vendor

Leading discount retailer

| 5

Introduction

I started out with just 3 topics…

Then while on the road to Strata, I met 7 big data architects

- Who had 7 clients

• Who had 7 projects

• That demonstrated 7 Anti-Patterns

Big Data Anti-pattern: “Commonly applied but bad solution”

I95 Wikipedia

| 6

Three Focus Areas

• Hardware and Infrastructure

• Tooling

• Big Data Warehousing

| 7

Hardware & Infrastructure

Reference Architecture Driven

- 90’s & 00’s data center patterns

- Servers MUST NOT FAIL

- Standard Server Config

• $35,000/node

• Dual Power supply

• RAID

• SAS 15K RPM

• SAN

• VMs for Production

• Flat Network

Automated provisioning is a good thing!

[Image source: HP: The transformation to HP Converged Infrastructure]

| 8

#1 Locality

Locality Locality Locality

- Bring Computation to Data

Co-locate data and compute

Locally Attached Storage

Localize & isolate network traffic

Rack Awareness

[Figure: Hadoop Cluster vs. VM Cluster. The Hadoop cluster is drawn as a grid of units that each pair a disk with a CPU core; the VM cluster is drawn as a pool of disks separate from a pool of CPU cores.]
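Rack awareness is usually configured by giving Hadoop a topology script (the `net.topology.script.file.name` property) that receives host names or IP addresses as arguments and prints one rack path per argument. The following is a minimal sketch of such a script. The subnet prefixes and rack names are hypothetical and would normally come from an inventory system.

```python
import sys

# Hypothetical subnet-to-rack mapping; replace with real inventory data.
RACKS = {
    "10.1.1.": "/dc1/rack1",
    "10.1.2.": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def rack_for(host):
    """Return the rack path for one host, or the default rack if unknown."""
    for prefix, rack in RACKS.items():
        if host.startswith(prefix):
            return rack
    return DEFAULT_RACK


def resolve(hosts):
    """Resolve a list of hosts to rack paths, preserving order."""
    return [rack_for(h) for h in hosts]


if __name__ == "__main__":
    # Hadoop passes one or more hosts as arguments and reads one rack per line.
    print("\n".join(resolve(sys.argv[1:])))
```

With rack paths in place, the scheduler and the block placement policy can prefer node-local and rack-local work, which is what keeps most traffic off the cross-rack links.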

| 9

#2 Sequential IO

Sequential IO >> Random Access

Large block IO

Append only writes

JBOD

http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html

Image credit: Wikipedia.org
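A rough back-of-the-envelope model shows why large block IO matters on spinning disks. The figures used here (10 ms per seek, 100 MB/s streaming throughput, 1 GB of data) are illustrative assumptions, not measurements from the talk.

```python
MB = 1_000_000


def sequential_scan_seconds(total_bytes, throughput):
    """One seek-free pass over the data at full streaming throughput."""
    return total_bytes / throughput


def chunked_read_seconds(total_bytes, read_size, seek_seconds, throughput):
    """Same data read in chunks, paying one seek before every chunk."""
    reads = total_bytes / read_size
    return reads * (seek_seconds + read_size / throughput)


if __name__ == "__main__":
    total = 1_000 * MB          # assumed data size
    seek = 0.010                # assumed seek time in seconds
    rate = 100 * MB             # assumed streaming throughput in bytes/second
    for label, size in [("4 KB random reads", 4_096), ("128 MB blocks", 128 * MB)]:
        secs = chunked_read_seconds(total, size, seek, rate)
        print(f"{label}: {secs:,.0f} s")
    print(f"pure sequential: {sequential_scan_seconds(total, rate):,.0f} s")
```

Under these assumptions small random reads are dominated by seek time, while reads in blocks of around a hundred megabytes cost almost the same as one uninterrupted scan.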

| 10

#3 Increase parallelism

Increase # parallel components

- Reduce component cost

Data block replication

- Availability

- Performance

Commodity++ (2014)

- High density data nodes

- $8-12,000

- ~12 drives

- ~12-16 cores

- Buy 4-5 servers for the cost of 1

• 4-5x spindles

• 4-5x cores

[Figure: cluster drawn as a grid of units that each pair a disk with a CPU core]
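The cost comparison can be checked with the figures quoted on the slides: $35,000 for the legacy node, $8-12,000 for a commodity data node with about 12 drives and 12-16 cores. The sketch below only does that arithmetic; the function and field names are made up for illustration.

```python
LEGACY_NODE_COST = 35_000   # from the standard server config slide
DRIVES_PER_NODE = 12        # "~12 drives"
CORES_PER_NODE = (12, 16)   # "~12-16 cores"


def commodity_equivalent(node_cost, budget=LEGACY_NODE_COST):
    """What one legacy node's budget buys in commodity data nodes."""
    whole_nodes = budget // node_cost
    return {
        "ratio": budget / node_cost,
        "whole_nodes": whole_nodes,
        "drives": whole_nodes * DRIVES_PER_NODE,
        "min_cores": whole_nodes * CORES_PER_NODE[0],
        "max_cores": whole_nodes * CORES_PER_NODE[1],
    }


if __name__ == "__main__":
    for cost in (8_000, 12_000):
        print(cost, commodity_equivalent(cost))
```

With these list prices the ratio works out to roughly 2.9 at $12,000 per node and 4.4 at $8,000 per node, so the "4-5 servers for the cost of 1" figure corresponds to the cheaper end of the quoted range.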

| 11

#4 Failure

Expect Failure

Rack Awareness

Data Block Replication

Task Retry

Node Black Listing

Monitor Everything

Name Node HA

[Figure: cluster drawn as a grid of units that each pair a disk with a CPU core]
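Task retry and node blacklisting can be illustrated with a toy scheduler. This is a simplified sketch of the idea, not Hadoop's implementation; the class name, the `max_attempts` and `blacklist_after` thresholds, and the node names are all assumptions.

```python
from collections import Counter


class Scheduler:
    """Toy scheduler: retry a failed task elsewhere, blacklist flaky nodes."""

    def __init__(self, nodes, max_attempts=4, blacklist_after=2):
        self.nodes = list(nodes)
        self.max_attempts = max_attempts
        self.blacklist_after = blacklist_after
        self.failures = Counter()  # failure count per node

    def healthy_nodes(self):
        """Nodes that have not yet reached the blacklist threshold."""
        return [n for n in self.nodes if self.failures[n] < self.blacklist_after]

    def run(self, task):
        """Run task(node) on healthy nodes until it succeeds or attempts run out."""
        attempts = 0
        for node in self.healthy_nodes():
            if attempts == self.max_attempts:
                break
            attempts += 1
            try:
                return node, task(node)
            except RuntimeError:
                self.failures[node] += 1  # record the failure, try the next node
        raise RuntimeError("task failed on every attempted node")
```

A node that keeps failing stops receiving work after `blacklist_after` failures, so later tasks no longer pay for a doomed first attempt.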

| 12

Tooling

Hadoop Ecosystem Tools

| 13

Tooling: Just looking inside the box

“If it came in the box then I should use it”

Example

- Oozie for scheduling

Best Practice:

• Use your current enterprise scheduler

| 14

Tooling: NoSQL

• “Now I have all of my log data in NoSQL, let’s do analytics over it”

Example

- Streaming data into MongoDB

• Running aggregates

• Running MR jobs

| 15

Best Practice:

• Split the stream

• Real-time access in NoSQL

• Batch analytics in Hadoop
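Splitting the stream means every incoming event is written twice: once to a store optimized for real-time lookups and once to an append-only log that batch jobs scan later. The sketch below uses in-memory stand-ins for both sinks; the class names and the `user` key are hypothetical.

```python
class KeyValueStore:
    """In-memory stand-in for a NoSQL store serving real-time lookups."""

    def __init__(self):
        self.latest = {}

    def put(self, key, event):
        self.latest[key] = event  # keep only the newest event per key


class BatchLog:
    """Append-only stand-in for files landed in Hadoop for batch analytics."""

    def __init__(self):
        self.records = []

    def append(self, event):
        self.records.append(event)


def split_stream(events, store, log, key="user"):
    """Fan each event out to the real-time store and the batch log."""
    for event in events:
        store.put(event[key], event)
        log.append(event)
```

The NoSQL side answers "what is the latest state for this key" quickly, while aggregates and MapReduce jobs run over the full history in the batch log instead of loading the operational store.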

| 18

Right Framework, Right Need…

Key Purpose

- Integrate legacy code

- Integrate analytic tools

• Data science libs

Hadoop supports integrating any type of application tooling

- Hadoop Streaming

• Python

• R

• C, C++

• Fortran

• Cobol

• Ruby
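Hadoop Streaming runs any executable as a mapper or reducer: it feeds input lines on standard input and reads tab-separated key/value lines from standard output, with reducer input arriving sorted by key. A minimal word-count sketch in Python follows; the `map` and `reduce` command-line arguments are a convention chosen for this example, not part of Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Sum counts per word; expects input sorted by key, as Streaming provides."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"


if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    stage = map_lines if sys.argv[1] == "map" else reduce_lines
    sys.stdout.writelines(out + "\n" for out in stage(sys.stdin))
```

The same script can be passed to the streaming jar's `-mapper` and `-reducer` options, and it can be tested locally by piping a file through the map stage, `sort`, and the reduce stage.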

| 19

Right Use Case – ETL, Wrong Framework

Got to love Ruby

- Very Cool (or it was)

- Dynamic Language

- Expressive

- Compact

- Fast Iteration

Got to Hate Ruby

- Slow

- Hard to follow & debug

- Does not play well with threading

“It’s much faster to develop in, developer time is valuable, just throw a couple more boxes at it”

Bench tested at 5,000 records / second

| 20

Right Use Case – ETL, Wrong Framework…

Best Practice:

• Write new code in the fastest execution framework

• For high-value legacy code and analytic tools, use Hadoop Streaming

DO THE MATH:

Storm Java: ~1MM+ events / second / server

Storm Ruby: 5,000 * 12 cores = 60,000 events / second / server

1,000,000 / 60,000 = 16.67 times more servers

“Test and Learn!”
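The arithmetic on this slide is easy to reproduce and to rerun with your own benchmark numbers. The sketch below uses only the figures quoted above; the 1,000,000 events/second target in the example is taken from the Java figure and is otherwise arbitrary.

```python
import math

JAVA_EVENTS_PER_SERVER = 1_000_000   # "~1MM+ events / second / server"
RUBY_EVENTS_PER_CORE = 5_000         # bench-tested records / second
CORES_PER_SERVER = 12

RUBY_EVENTS_PER_SERVER = RUBY_EVENTS_PER_CORE * CORES_PER_SERVER


def servers_needed(target_events_per_second, events_per_server):
    """Whole servers required to sustain the target rate."""
    return math.ceil(target_events_per_second / events_per_server)


if __name__ == "__main__":
    ratio = JAVA_EVENTS_PER_SERVER / RUBY_EVENTS_PER_SERVER
    print(f"Ruby needs {ratio:.2f}x the servers of Java")
    print("Servers for 1MM events/s in Ruby:",
          servers_needed(1_000_000, RUBY_EVENTS_PER_SERVER))
```

Because servers come in whole units, a load that one Java server handles needs 17 Ruby servers in this model, not 16.67.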

| 21

Big Data Warehousing

#1 ETL Offload

#2 Data Warehousing

| 22

Right Schema

Platforms shown: Hadoop, Data Warehouse, OLTP

3NF - Transactional Source System Schema: order, order line, customer, product, contract, sales_person

Dimensional Schema: order, order line, customer, product, contract, sales_person

De-normalized schema: order line, order, contract, customer, product and sales_person combined into one record
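De-normalizing means resolving the joins once, at load time, so that each order line carries the order, customer and product attributes it needs. The sketch below shows the idea with plain dictionaries; all field names and sample structures are hypothetical, and contract and sales_person are left out for brevity.

```python
def denormalize(order_lines, orders, customers, products):
    """Join each order line with its parent order, customer and product."""
    wide = []
    for line in order_lines:
        order = orders[line["order_id"]]
        customer = customers[order["customer_id"]]
        product = products[line["product_id"]]
        wide.append({
            "order_id": line["order_id"],
            "order_date": order["order_date"],
            "customer_name": customer["name"],
            "product_name": product["name"],
            "quantity": line["quantity"],
        })
    return wide
```

The wide records repeat customer and product values on every line, which costs storage but lets full-scan jobs read one file sequentially instead of joining several tables at query time.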

| 23

Right Workload, Right Tool

Platforms compared: Hadoop; NoSQL; MPP, Reporting DBs, Mainframe

Workloads:

- ETL

- Business Intelligence

- Cross business reporting

- Sub-set analytics

- Full scan analytics

- Decision Support (TBs-PBs, GB-TBs)

- Operational Reports

- Complex security requirements

- Search

- Fast Lookup

| 24

Summary

Understand strengths & weaknesses of each choice

- Get help if needed

Deploy the right tool for the right workload

Test and Learn

| 25

Thank You

Work with the best on a wide variety of cool projects:

• recruiting@thinkbiganalytics.com

@douglas_ma

Douglas Moore

DATA SCIENTISTS

DATA ARCHITECTS

DATA SOLUTIONS

Think Big Start Smart Scale Fast

Work with the Leading Innovator in Big Data
