![Page 1: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/1.jpg)
Big Data Anti-Patterns:
Lessons from the Front Lines
Strata NYC
October 17, 2014
Douglas Moore
![Page 2: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/2.jpg)
About Douglas Moore
Think Big – 3 Years
- Delivery
• BDW, Search, Streaming
- Roadmaps
- Tech Assessments
Before Big Data
- Data Warehousing
- OLTP
- Systems Architecture
- Electricity
- High End Graphics
- Supercomputers
- Numerical Analysis
Contact me at: @douglas_ma
![Page 3: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/3.jpg)
Think Big
4yr Old “Big Data” Professional Services Firm
- Roadmaps
- Engineering
- Data Science
- Hands on Training
Recently acquired by Teradata
• Maintaining Independence
![Page 4: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/4.jpg)
Content Drawn From Vast Amounts of Experience
…
50+ Clients
Leading security software vendor
Leading discount retailer
![Page 5: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/5.jpg)
Introduction
I started out with just 3 topics…
Then while on the road to Strata,
I met 7 big data architects
- Who had 7 clients
• Who had 7 projects
• That demonstrated 7 Anti-Patterns
Big Data Anti-pattern:
“Commonly applied but bad solution”
I95 Wikipedia
![Page 6: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/6.jpg)
Three Focus Areas
• Hardware and Infrastructure
• Tooling
• Big Data Warehousing
![Page 7: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/7.jpg)
Hardware & Infrastructure
Reference Architecture Driven
- 90’s & 00’s data center patterns
- Servers MUST NOT FAIL
- Standard Server Config
• $35,000/node
• Dual Power supply
• RAID
• SAS 15K RPM
• SAN
• VMs for Production
• Flat Network
Automated provisioning is a good thing!
[Image source: HP: The transformation to HP Converged Infrastructure]
![Page 8: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/8.jpg)
#1 Locality
Locality Locality Locality
- Bring Computation to Data
Co-locate data and compute
Locally Attached Storage
Localize & isolate network traffic
Rack Awareness
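Rack awareness is usually supplied to Hadoop through a topology script: a program that receives host names or IP addresses as arguments and prints a rack path for each. The following is a minimal sketch in Python, not a production script. The host-to-rack table and rack names are hypothetical, and wiring the script in through the `net.topology.script.file.name` property should be checked against the documentation for your Hadoop version.

```python
#!/usr/bin/env python3
"""Minimal sketch of a Hadoop rack-topology script (illustrative only)."""
import sys

# Hypothetical host-to-rack mapping; a real cluster would load this from a file or inventory system.
RACK_BY_HOST = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts):
    """Return one rack path per host, falling back to a default for unknown hosts."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hosts arrive as command-line arguments; print one rack path per host.
    print("\n".join(resolve_racks(sys.argv[1:])))
```

With rack paths available, the scheduler and block placement can prefer node-local, then rack-local work, which keeps traffic off the shared network links.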
[Figure: Hadoop Cluster (each disk co-located with a CPU core) vs. VM Cluster (disks pooled separately from the CPU cores)]
![Page 9: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/9.jpg)
#2 Sequential IO
Sequential IO >> Random Access
http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
Image credit: Wikipedia.org
Large block IO
Append only writes
JBOD
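A back-of-envelope model makes the gap between sequential and random access concrete. The sketch below assumes a spinning disk with a 10 ms average seek and 100 MB/s sustained transfer. Both figures are round-number assumptions for illustration, not measurements from the talk.

```python
# Back-of-envelope model: time to read 1 GiB as large sequential blocks vs. small random reads.
# The seek time and throughput below are assumed, round-number values for a spinning disk.
SEEK_SECONDS = 0.010              # assumed average seek + rotational delay
THROUGHPUT_BYTES_PER_S = 100e6    # assumed sustained transfer rate


def read_seconds(total_bytes, block_bytes):
    """Model each block read as one seek followed by a sequential transfer."""
    blocks = total_bytes / block_bytes
    return blocks * (SEEK_SECONDS + block_bytes / THROUGHPUT_BYTES_PER_S)


GIB = 1024 ** 3
large_block = read_seconds(GIB, 128 * 1024 ** 2)   # large blocks, on the order of an HDFS block
small_random = read_seconds(GIB, 4 * 1024)         # 4 KiB random reads

print(f"128 MiB blocks: {large_block:8.1f} s")
print(f"4 KiB random:   {small_random:8.1f} s")
print(f"slowdown:       {small_random / large_block:8.0f}x")
```

Under these assumptions the seek cost dominates small reads, which is why large block IO and append-only writes suit spinning disks.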
![Page 10: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/10.jpg)
#3 Increase parallelism
Increase # parallel components
- Reduce component cost
Data block replication
- Availability
- Performance
Commodity++ (2014)
- High density data nodes
- $8,000-12,000
- ~12 drives
- ~12-16 cores
- Buy 4-5 servers for the cost of 1
• 4-5x spindles
• 4-5x cores
[Figure: grid of nodes, each pairing a disk with a CPU core]
![Page 11: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/11.jpg)
#4 Failure
Expect Failure [1,2]
Rack Awareness
Data Block Replication
Task Retry
Node Black Listing
Monitor Everything
Name Node HA
[Figure: grid of nodes, each pairing a disk with a CPU core]
![Page 12: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/12.jpg)
Tooling
Hadoop Ecosystem Tools
![Page 13: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/13.jpg)
Tooling: Just looking inside the box
“If it came in the box then I should use it”
Example
- Oozie for scheduling
Best Practice:
• Use your current enterprise scheduler
![Page 14: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/14.jpg)
Tooling: NoSQL
• “Now I have all of my log data in NoSQL, let’s do analytics over it”
Example
- Streaming data into Mongo DB
• Running aggregates
• Running MR jobs
![Page 15: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/15.jpg)
Best Practice:
• Split the stream
• Real-time access in NoSQL
• Batch analytics in Hadoop
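A minimal sketch of splitting the stream, assuming each incoming event is fanned out to two sinks: a key-value store for real-time lookups and an append-only log for batch analytics. Both sinks are in-memory stand-ins with hypothetical names (`RealTimeStore`, `BatchLog`). In practice they would be a NoSQL client and a writer producing files bound for Hadoop.

```python
import json


class RealTimeStore:
    """Stand-in for a NoSQL store: keeps only the latest event per key for fast lookup."""

    def __init__(self):
        self.latest = {}

    def write(self, event):
        self.latest[event["id"]] = event


class BatchLog:
    """Stand-in for an append-only file destined for Hadoop: keeps every event, in order."""

    def __init__(self):
        self.lines = []

    def write(self, event):
        self.lines.append(json.dumps(event, sort_keys=True))


def split_stream(events, sinks):
    """Send every event to every sink so each system serves the workload it is good at."""
    count = 0
    for event in events:
        for sink in sinks:
            sink.write(event)
        count += 1
    return count


store, log = RealTimeStore(), BatchLog()
events = [
    {"id": "u1", "action": "login"},
    {"id": "u2", "action": "login"},
    {"id": "u1", "action": "logout"},
]
processed = split_stream(events, [store, log])
```

The store answers "what is the current state of u1?" cheaply, while the log retains the full history for aggregates and MapReduce jobs, so neither system is pushed into the other's workload.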
![Page 16: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/16.jpg)
Right Framework, Right Need…
Key Purpose
- Integrate legacy code
- Integrate analytic tools
• Data science libs
Hadoop supports integrating any type of application tooling
- Hadoop Streaming
• Python
• R
• C, C++
• Fortran
• Cobol
• Ruby
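Hadoop Streaming runs any executable that reads records from standard input and writes tab-separated key/value lines to standard output. The sketch below shows a mapper and a reducer in Python that count log lines per level. The log format (level as the first whitespace-separated field) and the sample data are assumptions made for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'level<TAB>1' for each non-empty log line; the level is assumed to be the first field."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[0]}\t1"


def reducer(lines):
    """Sum counts per key; the reducer relies on its input arriving sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Local dry run: sorted() stands in for the shuffle/sort between map and reduce.
sample = ["ERROR disk full", "INFO started", "ERROR timeout", ""]
result = list(reducer(sorted(mapper(sample))))
print(result)
```

In a real job each function would live in its own script that iterates over `sys.stdin` and prints its output, and the two scripts would be handed to the streaming jar through its `-mapper` and `-reducer` options.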
![Page 17: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/17.jpg)
Right Use Case – ETL, Wrong Framework
Got to love Ruby
- Very Cool (or it was)
- Dynamic Language
- Expressive
- Compact
- Fast Iteration
Got to Hate Ruby
- Slow
- Hard to follow & debug
- Does not play well with threading
“It’s much faster to develop in,
developer time is valuable,
just throw a couple more boxes at it”
Bench tested at 5,000 records / second
![Page 18: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/18.jpg)
Right Use Case – ETL, Wrong Framework…
Best Practice:
• Write new code in the fastest execution framework
• For high-value legacy code and analytic tools, use Hadoop Streaming
DO THE MATH:
Storm Java: ~ 1MM+ events / second / Server
Storm Ruby: 5000 * 12 cores = 60,000 events / second / Server
= 16.67 times more servers
“Test and Learn!”
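The arithmetic can be replayed in a few lines. The sketch below uses the slide's figures and, as the slide does, treats the 5,000 records/second bench result as a per-core rate. The 3,000,000 events/second target is a hypothetical workload added only to turn the ratio into server counts.

```python
import math

JAVA_EVENTS_PER_SERVER = 1_000_000   # "~1MM+ events / second / server" from the slide
RUBY_EVENTS_PER_CORE = 5_000         # bench-tested rate from the slide, taken as per core
CORES_PER_SERVER = 12

ruby_events_per_server = RUBY_EVENTS_PER_CORE * CORES_PER_SERVER
server_ratio = JAVA_EVENTS_PER_SERVER / ruby_events_per_server


def servers_needed(target_events_per_s, events_per_server):
    """Whole servers required to sustain a target event rate."""
    return math.ceil(target_events_per_s / events_per_server)


TARGET = 3_000_000  # hypothetical workload, events per second
java_servers = servers_needed(TARGET, JAVA_EVENTS_PER_SERVER)
ruby_servers = servers_needed(TARGET, ruby_events_per_server)

print(f"Ruby throughput per server: {ruby_events_per_server:,} events/s")
print(f"Server ratio: {server_ratio:.2f}x")
print(f"{TARGET:,} events/s -> Java: {java_servers} servers, Ruby: {ruby_servers} servers")
```

Plugging in your own benchmark numbers is the "test and learn" step: the ratio, not the language preference, decides whether "a couple more boxes" is realistic.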
![Page 19: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/19.jpg)
Big Data Warehousing
#1 ETL Offload
#2 Data Warehousing
![Page 20: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/20.jpg)
Right Schema
OLTP: 3NF - Transactional Source System Schema
- order, order line, customer, product, contract, sales_person
Data Warehouse: Dimensional Schema
- order, order line, contract, customer, product, sales_person
Hadoop: De-normalized schema
- order line, order, contract, customer, product, sales_person combined in one wide record
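The de-normalized form can be illustrated by joining each order line with its parent order and the related customer and product records into one wide row. The sketch below uses tiny in-memory tables. All field names and values are hypothetical, and contract and sales_person would be folded in the same way.

```python
# Tiny in-memory stand-ins for the normalized source tables (hypothetical data).
orders = {100: {"customer_id": 1, "order_date": "2014-10-17"}}
customers = {1: {"customer_name": "Acme"}}
products = {7: {"product_name": "Widget"}, 8: {"product_name": "Gadget"}}
order_lines = [
    {"order_id": 100, "line_no": 1, "product_id": 7, "qty": 2},
    {"order_id": 100, "line_no": 2, "product_id": 8, "qty": 1},
]


def denormalize(line):
    """Flatten one order line plus its order, customer and product into a single wide record."""
    order = orders[line["order_id"]]
    return {
        **line,
        **order,
        **customers[order["customer_id"]],
        **products[line["product_id"]],
    }


wide_rows = [denormalize(line) for line in order_lines]
for row in wide_rows:
    print(row)
```

Wide rows repeat order and customer attributes on every line, trading storage for scans that need no joins at query time.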
![Page 21: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/21.jpg)
Right Workload, Right Tool
Tools compared: Hadoop | NoSQL | MPP, Reporting DBs, Mainframe
Workloads:
- ETL
- Business Intelligence
- Cross business reporting
- Sub-set analytics
- Full scan analytics
- Decision Support: TBs-PBs, GB-TBs
- Operational Reports
- Complex security requirements
- Search
- Fast Lookup
[Table: per-tool fit marks for each workload not recoverable from the extracted text]
![Page 22: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/22.jpg)
Summary
Understand strengths & weaknesses of each choice
- Get help if needed
Deploy the right tool for the right workload
Test and Learn
![Page 23: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/23.jpg)
Thank You
Work with the best on a wide variety of cool projects:
@douglas_ma
Douglas Moore
![Page 24: Big Data Anti-Patterns: Lessons From the Front LIne](https://reader033.vdocuments.net/reader033/viewer/2022052907/55941c201a28ab032c8b472d/html5/thumbnails/24.jpg)
DATA SCIENTISTS
DATA ARCHITECTS
DATA SOLUTIONS
Think Big Start Smart Scale Fast
Work with the Leading Innovator in Big Data