Solving Performance Problems on Hadoop
Solving performance problems on Hadoop
Moving analytic workloads into production
Tyler Mitchell
Sr. Software Engineer
Actian Center of Excellence
Topics
How we got (stuck) here
Performance best practices
Sample business cases
Benchmarking results
Actian’s Lineage
Ingres – 1970s
Pervasive – 1982
Versant – 1988
Vectorwise – 2003
ParAccel – 2006
Actian
Actian at a Glance
10,000+ Customers
400+ Employees
8 Countries; 7 US Cities; HQ Palo Alto
3 Businesses: Data Management, Data Integration, Big Data Analytics
Banking, Insurance, Telecom and Media
How We Got (Stuck) Here
Accidental Hadoop Tourist – Brief History
[Diagram: Business and Data; Data Capture, Data Management & Integration, Analytics (Query & Analyze), Solutions; Problem Solved]
Accidental Hadoop Tourist – Brief History
[Diagram: Business and Data; Data Capture, Data Management & Integration, Analytics, Solutions; ??????]
Accidental Hadoop Tourist – Brief History
[Diagram: Business and Data; Data Capture, Data Management & Integration, Analytics (???), Solutions (???)]
Modern, best-in-class analytic database technology provides:
Measurable business impact: monetize Big Data to grow revenue, reduce cost, mitigate risk, enable new business
The ability to make data-driven business decisions using a massively scalable platform
Decisive reduction in the cost of high-performance analytics at scale
Performance that can meet all SLAs
Full leverage of existing SQL skills while deploying a modern analytic infrastructure
Grow Revenue
Reduce Cost
Mitigate Risk
Create New Business
Business Solution Architecture Challenges
Wide Range of Use Cases
Financial Services: Advanced credit risk analytics across billions of data points
Internet Scale Application: Predictive analytics across hundreds of millions of customers
Media: Data science and discovery across trillions of IoT events
Dept of Defense: Cyber-security, network intrusion models every second
Credit Card Processing: Fraud detection every millisecond
Performance Best Practices
3 Essential Big Data Concepts
0. Take nothing for granted
1. Partitioning vs data skew
2. Data types matter
3. Maximize memory / minimize bottlenecks
4. Take nothing for granted
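To make "partitioning vs data skew" concrete, here is a minimal Python sketch. It is a generic illustration, not how Hive or VectorH assign rows: the helper names (`partition_counts`, `skew_ratio`), the sample data and the choice of 8 partitions are all assumptions made for this example. It hashes a high-cardinality key and a low-cardinality key into the same number of partitions and compares how evenly the rows land.

```python
import zlib
from collections import Counter

def partition_counts(keys, num_partitions):
    """Hash each key to a partition and return the row count per partition."""
    counts = Counter()
    for key in keys:
        # crc32 is deterministic across runs (unlike built-in hash() on str)
        counts[zlib.crc32(str(key).encode("utf-8")) % num_partitions] += 1
    return [counts.get(p, 0) for p in range(num_partitions)]

def skew_ratio(counts):
    """Largest partition divided by the mean; 1.0 means perfectly balanced."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean

# Hypothetical data set of 10,000 rows.
customer_ids = [f"cust-{i}" for i in range(10_000)]            # many distinct values
countries = ["US"] * 8_000 + ["CA"] * 1_500 + ["MX"] * 500     # three values, one dominant

even = partition_counts(customer_ids, 8)
skewed = partition_counts(countries, 8)

print("partition by customer_id:", even, "skew", round(skew_ratio(even), 2))
print("partition by country:    ", skewed, "skew", round(skew_ratio(skewed), 2))
```

With the country key, at least 8,000 rows always hash to one partition, so a single worker ends up with most of the data while the others sit idle. This is the kind of assumption worth checking before choosing a partition key.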
6 Game Changing Database Innovations
1. Use the CPU! – Vector Processing
2. Minimize bottlenecks – Exploiting Chip Cache
3. Got columnar?
4. Smarter compression
5. Smarter indexing
6. Multi-core matters
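Items 1 and 3 can be illustrated with a small sketch of a columnar layout processed one batch ("vector") at a time. This is a generic illustration and not Actian's implementation: `VECTOR_SIZE`, `to_columns`, `vectors` and `sum_where_greater` are hypothetical names, and the batch size is arbitrary. Pure Python does not use SIMD instructions, so the sketch shows only the data layout and the batch-at-a-time control flow, not the hardware speedup.

```python
from array import array

VECTOR_SIZE = 1024  # hypothetical batch size, chosen only for illustration

def to_columns(rows, names):
    """Pivot row tuples into one contiguous integer array per column."""
    return {name: array("q", (row[i] for row in rows)) for i, name in enumerate(names)}

def vectors(column, size=VECTOR_SIZE):
    """Yield fixed-size slices so an operator handles many values per call."""
    for start in range(0, len(column), size):
        yield column[start:start + size]

def sum_where_greater(columns, value_col, filter_col, threshold):
    """Sum value_col where filter_col > threshold, touching only those two columns."""
    total = 0
    for values, filters in zip(vectors(columns[value_col]), vectors(columns[filter_col])):
        total += sum(v for v, f in zip(values, filters) if f > threshold)
    return total

# Hypothetical table with columns (id, score, amount).
rows = [(i, i % 100, i * 2) for i in range(5000)]
cols = to_columns(rows, ["id", "score", "amount"])

result = sum_where_greater(cols, "amount", "score", 89)
print(result)
```

The query never reads the `id` column, which is the point of a columnar layout: a scan reads only the columns it needs, and each column is a dense run of same-typed values that suits batch processing.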
Actian VectorH Innovations
Big Data Business Use Cases
Customer 360: Understanding Experience, Driving Revenue
Telecom Challenge
Vast and growing repository of proprietary click data, customer records, service call records, smartphone and device data, GPS location, web server, telephone, and network usage.
Queries took minutes or hours, and sometimes never returned at all.
Critical business analysis on a consolidated customer 360 data lake was grinding to a halt. The ability to gain deeper market insights, visualization, and the desired data management and operational optimization was at risk.
Customer 360: Initial Architecture
Development System
• 300+ node cluster
• Hive access
• SQL based BI / Data Science
• Pre-processed as performance was unacceptable
• Views taking days to return snapshot views
Customer 360: Technical Improvements
Production Prototype
• 30 node cluster (10% of Hive)
• Actian Vector on Hadoop solution
• SQL based BI / Data Science
• No materialized view building required
• Join on demand faster than aggregate tables in Hive
• Reduced storage requirements
• 91TB – two years data, 1100 columns when joined
Customer 360: Understanding Experience, Driving Revenue
Results
Customer 360 across prior data silos
Leveraged for customer retention strategies
Predict and take proactive, tailored responses
Enables next gen data-driven troubleshooting, impact analysis and root cause analysis
Impact
• Accelerated operations intelligence
• Improved customer experience
• Reduced customer churn
Financial Risk: Upgrading Legacy to Meet SLA
Challenge
Legacy single-purpose risk application took 3 hours to generate the end-of-day risk report and failed to meet changing SLAs for reporting risk.
In deciding to replace the risk application, the bank opted to build a multi-purpose risk application addressing multiple business requirements.
Financial Risk: Upgrading Legacy to Meet SLA
Legacy System
• Single server architecture, MS SSAS, Oracle - ~30 applications
• Pre-processing of desired measures exploding data volumes
• Cube and Analysis engines being maxed out as they exceed 1.5TB range
• Unable to scale to the desired range of > 200GB/day new data
• Impala attempt failed
• Highly invested in apps built on Analysis service
Financial Risk: Upgrading Legacy to Meet SLA
New Possibilities
• Clustered solution – Hadoop 5 and 10 node
• No pre-processing cubes, SSAS partly kept
• Tested solutions 1TB -> 20TB at a time
• Produced interactive queries across large datasets
• Focused query results in 2s or less
• Processing all data in the database 6s – 80s
• 2x nodes ~ 200% speed improvement
Financial Risk: Upgrading Legacy to Meet SLA
Results
Increased data analyzed by 100X
2–200B rows / 1-20TB
Report run in 28 seconds vs. 3 hours
Use of application for:
• Intra-day reporting (surveillance)
• End of day reporting (compliance)
• Overnight float investment options
• Annual CCAR Analysis
[Chart legend: Actual, Goal]
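The report-time result (28 seconds vs. 3 hours) can be restated as a speedup factor. The short calculation below uses only those two figures from the slide; the variable names are illustrative.

```python
# Figures from the slide: legacy report took 3 hours, new report takes 28 seconds.
legacy_seconds = 3 * 60 * 60   # 10,800 seconds
new_seconds = 28

speedup = legacy_seconds / new_seconds                      # how many times faster
reduction_pct = (1 - new_seconds / legacy_seconds) * 100    # share of run time removed

print(f"speedup: about {speedup:.0f}x")
print(f"run time reduced by {reduction_pct:.2f}%")
```

By this arithmetic the report runs roughly 386 times faster, a run-time reduction of about 99.7%.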
Delivering the Results With Better Engineering
Technical Benchmarks
Technical Benchmarks – Single Machine
Technical Benchmarks: VectorH - SQL on Hadoop
TPC-H SF1000 *
VectorH vs other platforms, faster by how much?
Tuned platforms
Identical hardware **
* Not an official TPC result
** 10 nodes, each 2 x Intel 3.0GHz E5-2690v2 CPUs, 256GB RAM, 24x600GB HDD, 10Gb Ethernet, Hadoop 2.6.0
Actian VectorH Delivers More Efficient File Format
Better compression & functionality
Vector advantages:
• skip blocks via MinMax indexes
• sophisticated query processing
• efficient block format, esp. 64-bit int
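The block-skipping idea behind a MinMax index can be sketched in a few lines of Python. This is a generic illustration of the technique, not VectorH's file format: the `Block` class, the helper names and the block size of 1,000 are assumptions for the example. Each block records the minimum and maximum of its values, and a range scan reads only blocks whose range can overlap the predicate.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Block:
    values: List[int]
    min_value: int   # smallest value stored in this block
    max_value: int   # largest value stored in this block

def build_blocks(values: List[int], block_size: int) -> List[Block]:
    """Split a column into blocks and record each block's min and max."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks

def range_scan(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return values in [low, high] and how many blocks were actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if block.max_value < low or block.min_value > high:
            continue  # the block cannot contain a match, so skip it unread
        blocks_read += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, blocks_read

# Hypothetical sorted column, e.g. an increasing event id.
values = list(range(10_000))
blocks = build_blocks(values, 1_000)

matches, blocks_read = range_scan(blocks, 2_500, 3_499)
print(len(matches), "matching rows;", blocks_read, "of", len(blocks), "blocks read")
```

Skipping works best when the column is stored in roughly sorted order. On randomly ordered data most blocks span the whole value range and few can be skipped.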
Summary
Conscientious data handling & next gen engineering take SQL in Hadoop to new levels.
All Hadoop users can move from development into production while delivering compelling business results.
Delivering the Results With Better Engineering
VectorH v5 – Spark integration, external table support, and more
SIGMOD 2016 Paper
Thank you
[email protected] - @1tylermitchell
Blogs at Actian.com - MakeDataUseful.com
Visit us in booth 503