october 2013 hug: gridgain-in memory accelaration
DESCRIPTION
As Apache Hadoop adoption continues to advance, customers are depending more and more on Hadoop for critical tasks, and deploying Hadoop for use cases with more real-time requirements. In this session, we will discuss the desired performance characteristics of such a deployment and the corresponding challenges. Leave with an understanding of how performance-sensitive deployments can be accelerated using In-Memory technologies that merge the Big Data capabilities of Hadoop with the unmatched performance of In-Memory data management.TRANSCRIPT
In-Memory Accelerator for Hadoop™
www.gridgain.com #gridgain
Slide www.gridgain.com
Hadoop: Pros and Cons
2
What is Hadoop?> Hadoop is a batch system> HDFS - Hadoop Distributed File System> Data must be ETL-ed into HDFS> Parallel processing over data in HDFS> Hive, Pig, HBase, Mahout...> Most popular data warehouse
Pros:> Scales very well> Fault tolerant and resilient> Very active and rich eco-system> Process TBs/PBs in parallel fashion
Cons:> Batch oriented - real time not possible> Complex deployment> Significant execution overhead> HDFS is IO and network bound
Slide www.gridgain.com
In-Memory Accelerator For Hadoop: Overview
3
Up To 100x Faster:1. In-Memory File System
100% compatible with HDFSBoost HDFS performance by removing IO overheadDual-mode: standalone or cachingBlend into Hadoop ecosystem
2. In-Memory MapReduceEliminate Hadoop MapReduce overheadAllow for embedded executionRecord-based
Slide www.gridgain.com
GridGain: In-Memory Computing Platform
4
Slide www.gridgain.com
In-Memory Accelerator For Hadoop: Details
> PnP IntegrationMinimal or zero code change
> Any Hadoop distroHadoop v1 and v2
> In-Memory File System100% compatible with HDFSDual-mode: no ETL needed, read/write-throughBlock-level caching & smart evictionAutomatic pre-fetchingBackground fragmentizerOn-heap and off-heap memory utilization
> In-Memory MapReduceIn-process co-located computations - access GGFS in-processEliminate unnecessary IPCEliminate long task startup timeEliminate mandatory sorting and re-shuffling on reduction
5
Slide www.gridgain.com
GridGain Visor: Unified DevOps
6
HDFS Profiler
File Manager
Slide www.gridgain.com
Benchmarks: GGFS vs HDFS
7
10 nodes cluster of Dell R610> Each has dual 8-core CPU> Ubuntu 12.4, Java 7> 10 GBE network> Stock Apache Hadoop 2.x
Slide www.gridgain.com
Comparison: Hadoop Accelerator vs. Spark
8
> No ETL requiredAutomatic HDFS read-through and write-throughData is loaded on demand
> Per-block file cachingOnly hot data blocks are in memory
> Strong management capabilitiesGridGain Visor - Unified DevOps
> Requires data ETL-ed into SparkChanges to data do not get propagated to HDFSExplicit ETL step consumes time
> Needs to have full file loadedIf does not fit - gets offloaded to disk
> No management capabilities
Slide www.gridgain.com
Customer Use Case: Task & Challenge
9
Task:> Real time search with MapReduce> Dataset size is 5TB > Writes 80%, reads 20%> Perceptual real time SLA (few seconds)
Challenge:> Hadoop MapReduce too slow (> 30 sec)> Data scanning slow due to constant IO> Overall job takes > 1 minute
Slide www.gridgain.com
Customer Use Case: Solution
10
> Utilize existing serversStart GridGain data node on every server
> Only put highly utilized files in GGFSUser controlled caching
> In-Memory MapReduce over GGFSEmbedded processing
> Results under 3 seconds
GridGain Systemswww.gridgain.com
1065 East Hillsdale Blvd, Suite 230Foster City, CA 94404
@gridgain