october 2013 hug: gridgain-in memory accelaration

11
In-Memory Accelerator for Hadoop™ www.gridgain.com #gridgain

Upload: yahoo-developer-network

Post on 26-Jan-2015

105 views

Category:

Technology


0 download

DESCRIPTION

As Apache Hadoop adoption continues to advance, customers are depending more and more on Hadoop for critical tasks, and deploying Hadoop for use cases with more real-time requirements. In this session, we will discuss the desired performance characteristics of such a deployment and the corresponding challenges. Leave with an understanding of how performance-sensitive deployments can be accelerated using In-Memory technologies that merge the Big Data capabilities of Hadoop with the unmatched performance of In-Memory data management.

TRANSCRIPT

Page 1: October 2013 HUG: GridGain-In memory accelaration

In-Memory Accelerator for Hadoop™

www.gridgain.com #gridgain

Page 2: October 2013 HUG: GridGain-In memory accelaration

Slide www.gridgain.com

Hadoop: Pros and Cons

2

What is Hadoop?> Hadoop is a batch system> HDFS - Hadoop Distributed File System> Data must be ETL-ed into HDFS> Parallel processing over data in HDFS> Hive, Pig, HBase, Mahout...> Most popular data warehouse

Pros:> Scales very well> Fault tolerant and resilient> Very active and rich eco-system> Process TBs/PBs in parallel fashion

Cons:> Batch oriented - real time not possible> Complex deployment> Significant execution overhead> HDFS is IO and network bound

Page 3: October 2013 HUG: GridGain-In memory accelaration

Slide www.gridgain.com

In-Memory Accelerator For Hadoop: Overview

3

Up To 100x Faster:1. In-Memory File System

100% compatible with HDFSBoost HDFS performance by removing IO overheadDual-mode: standalone or cachingBlend into Hadoop ecosystem

2. In-Memory MapReduceEliminate Hadoop MapReduce overheadAllow for embedded executionRecord-based

Page 4: October 2013 HUG: GridGain-In memory accelaration

Slide www.gridgain.com

GridGain: In-Memory Computing Platform

4

Page 5: October 2013 HUG: GridGain-In memory accelaration

Slide www.gridgain.com

In-Memory Accelerator For Hadoop: Details

> PnP IntegrationMinimal or zero code change

> Any Hadoop distroHadoop v1 and v2

> In-Memory File System100% compatible with HDFSDual-mode: no ETL needed, read/write-throughBlock-level caching & smart evictionAutomatic pre-fetchingBackground fragmentizerOn-heap and off-heap memory utilization

> In-Memory MapReduceIn-process co-located computations - access GGFS in-processEliminate unnecessary IPCEliminate long task startup timeEliminate mandatory sorting and re-shuffling on reduction

5

Page 6: October 2013 HUG: GridGain-In memory accelaration

Slide www.gridgain.com

GridGain Visor: Unified DevOps

6

HDFS Profiler

File Manager

Page 7: October 2013 HUG: GridGain-In memory accelaration

Slide www.gridgain.com

Benchmarks: GGFS vs HDFS

7

10 nodes cluster of Dell R610> Each has dual 8-core CPU> Ubuntu 12.4, Java 7> 10 GBE network> Stock Apache Hadoop 2.x

Page 8: October 2013 HUG: GridGain-In memory accelaration

Slide www.gridgain.com

Comparison: Hadoop Accelerator vs. Spark

8

> No ETL requiredAutomatic HDFS read-through and write-throughData is loaded on demand

> Per-block file cachingOnly hot data blocks are in memory

> Strong management capabilitiesGridGain Visor - Unified DevOps

> Requires data ETL-ed into SparkChanges to data do not get propagated to HDFSExplicit ETL step consumes time

> Needs to have full file loadedIf does not fit - gets offloaded to disk

> No management capabilities

Page 9: October 2013 HUG: GridGain-In memory accelaration

Slide www.gridgain.com

Customer Use Case: Task & Challenge

9

Task:> Real time search with MapReduce> Dataset size is 5TB > Writes 80%, reads 20%> Perceptual real time SLA (few seconds)

Challenge:> Hadoop MapReduce too slow (> 30 sec)> Data scanning slow due to constant IO> Overall job takes > 1 minute

Page 10: October 2013 HUG: GridGain-In memory accelaration

Slide www.gridgain.com

Customer Use Case: Solution

10

> Utilize existing serversStart GridGain data node on every server

> Only put highly utilized files in GGFSUser controlled caching

> In-Memory MapReduce over GGFSEmbedded processing

> Results under 3 seconds

Page 11: October 2013 HUG: GridGain-In memory accelaration

GridGain Systemswww.gridgain.com

1065 East Hillsdale Blvd, Suite 230Foster City, CA 94404

@gridgain