hpcc systems vs hadoop

Post on 09-Jul-2015


DESCRIPTION

A detailed comparison of the best-known Big Data solutions: HPCC Systems and Hadoop.

TRANSCRIPT

Big Data Architectural Overview: HPCC Systems vs Hadoop

By Fujio Turner

@FujioTurner

LexisNexis is a provider of legal, tax, regulatory, news, and business information and analysis to legal, corporate, government, accounting, and academic markets.

LexisNexis has been in business since 1977 and has over 30,000 employees worldwide.

Who is LexisNexis? What is HPCC Systems?

LexisNexis Risk is the division of LexisNexis that focuses on data, Big Data processing, linking, and vertical expertise. It supports HPCC Systems as an open source project under the Apache 2.0 License.

http://hpccsystems.com/

Comparison

Hadoop: Java. HPCC Systems: C++.

Petabytes

1-80,000 Jobs/day

Since 2005

Exabytes

Since 2000

Indexed: 2K-3K Jobs/sec*


Thor Roxie

Hadoop: block based. HPCC Systems: file based.

In-Memory: 30 - 40 Jobs/min*

Non-Indexed: 4-1,040,000 Jobs/day

 *based on job (size / result set / complexity)

BusinessDevelopmentCustomers1 20

Non-Indexed Full Data Set

http://hpccsystems.com/why-hpcc/benchmarks

“I’m sub-second fast.”

“I can query all or part of your data.”

Cluster Architecture

Thor: single-threaded; hard disk; index (optional).
Roxie: multi-threaded; hard disk or SSD; index (optional), in-memory.
A cluster can use either or both.

Example: How do the platforms handle the same data?

Customer Data, May 2010 (300GB file)

Name   State  Age
Kevin  CA     45
Mark   MI     27
Sara   FL     64

Store Data

Data is stored in random blocks.

[Diagram: a Name Node and Data Nodes; the customer file (Kevin CA 45, Mark MI 27, Sara FL 64) is split into big blocks a, b, and c.]

Store Data

Block locations are stored in memory.

[Diagram: the Name Node records block a = server 1, block b = server 2, block c = server 3 for the Data Nodes.]
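The block lookup described above can be pictured as a small in-memory table. This is only a sketch: the dictionary and the `locate` function are hypothetical names for illustration and are not part of Hadoop's API.

```python
# Hypothetical in-memory block map, mirroring the slide:
# block a = server 1, block b = server 2, block c = server 3.
block_locations = {"a": "server 1", "b": "server 2", "c": "server 3"}


def locate(block_id):
    """Return the server that holds the given block (lookup happens in memory)."""
    return block_locations[block_id]


print(locate("b"))
```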

Store Data

Data is distributed evenly across the cluster with replica copies and is seen as a single file, for example:

File Name ~/customers_2010-05

[Diagram: a Thor Master and Thor Slaves; the records Kevin CA 45, Mark MI 27, and Sara FL 64 are spread across the slaves.]
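A rough sketch of what "distributed evenly with replica copies" could look like. The round-robin placement and the "copy each part to the next node" replica rule are assumptions made for illustration; the slides do not say how HPCC Systems actually places records or replicas.

```python
def distribute(records, num_slaves):
    """Spread records round-robin so part sizes differ by at most one."""
    parts = [[] for _ in range(num_slaves)]
    for i, record in enumerate(records):
        parts[i % num_slaves].append(record)
    return parts


records = [("Kevin", "CA", 45), ("Mark", "MI", 27), ("Sara", "FL", 64)]
parts = distribute(records, 3)

# Assumed replica rule: node i also keeps a copy of the part owned by node i - 1.
replicas = [parts[(i - 1) % len(parts)] for i in range(len(parts))]

for node, (own, copy) in enumerate(zip(parts, replicas)):
    print(node, own, copy)
```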

Store Data

Dali: File Location & Job Scheduler

File locations are stored on disk.

File Name ~/customers_2010-05

[Diagram: Dali alongside the Thor Master and Thor Slaves.]

What state do most people live in?

Blocks are scanned for wanted data.

[Diagram: Name Node and Data Nodes holding blocks a, b, and c.]

Found data is sent to Mapper(s) in key/value pairs and stored.

[Diagram: the Data Nodes send pairs such as CA 1, FL 1, MI 1, FL 1, CA 1, MI 1 to the Mapper.]

Stored data is sent to Reducer(s) to be aggregated.

[Diagram: the Reducer aggregates the Mapper output into CA 120, MI 500, FL 7.]
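The map and reduce steps for "What state do most people live in?" can be sketched in a few lines. This is a single-process illustration of the idea, not Hadoop code; the `mapper` and `reducer` names and the fourth sample row ("Ann") are hypothetical.

```python
from collections import defaultdict

# Sample rows (name, state, age); "Ann" is an added hypothetical row.
rows = [("Kevin", "CA", 45), ("Mark", "MI", 27), ("Sara", "FL", 64), ("Ann", "MI", 30)]


def mapper(row):
    """Emit a (state, 1) key/value pair for one record."""
    _name, state, _age = row
    return (state, 1)


def reducer(pairs):
    """Sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


pairs = [mapper(row) for row in rows]   # the stored intermediate key/value pairs
counts = reducer(pairs)                 # the aggregated result
top_state = max(counts, key=counts.get)
print(counts, top_state)
```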

Cannot use SSD in Mapper or Reducer due to too many writes.

What state do most people live in?

1a. A pre-compiled query is triggered (mostly used in Roxie).
1b. Ad-hoc query.
2. The query is sent to Dali to get file locations.

[Diagram: ESP, Dali (File Location & Job Scheduler), Thor Master, and Thor Slaves.]

3. The job is placed in a queue to be sent to the Thor Master. The Thor Master coordinates job execution on the Thor Slave nodes.

Jobs are done locally on the slaves and/or coordinated globally by the master.

4. The job result is returned, optionally grouped and sorted at run time.

[Diagram: the result MI 500, CA 120, FL 7 is returned through ESP.]
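A sketch of steps 3 and 4 under simple assumptions: each slave counts its own records locally, the master merges the partial counts, and the result is sorted before being returned. The data layout and variable names are hypothetical, and this is plain Python rather than ECL.

```python
from collections import Counter

# Hypothetical records held by three slaves; "Ann" is an added sample row.
slave_parts = [
    [("Kevin", "CA", 45), ("Ann", "MI", 30)],
    [("Mark", "MI", 27)],
    [("Sara", "FL", 64)],
]

# Local work: every slave counts the states in its own part.
local_counts = [Counter(state for _name, state, _age in part) for part in slave_parts]

# Global coordination: the master merges the partial counts.
global_counts = sum(local_counts, Counter())

# Grouped and sorted at run time: highest count first, ties broken by state.
result = sorted(global_counts.items(), key=lambda item: (-item[1], item[0]))
print(result)
```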

Multiple other actions can be done on the data in a single job:

SORT, GROUP, DEDUP, JOIN, MERGE, BETWEEN, LENGTH, REGEX, ROUND, SUM, COUNT, TRIM, WHEN, AVE, CASE, NORMALIZE, DENORMALIZE, K-MEANS, and more.

Closer Look at Finding Data

The full block is scanned to find your data. Blocks can be many terabytes in size.

[Diagram: blocks a, b, and c; the record K CA 45 sits inside block a.]

When data is found, it is sent to the mapper.

[Diagram: the record K CA 45 in block a produces the pair CA, 1.]

Closer Look at Finding Data

Data location is known. “Apply Schema on Read” happens at query time. Data is processed locally.

[Diagram: the schema Name, State, Age is applied to the record K CA 45.]
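"Apply Schema on Read" means the raw data is stored as-is and the Name / State / Age layout is applied only when a query runs. A minimal sketch, assuming whitespace-separated text lines; the `Customer` type and `apply_schema` function are hypothetical names.

```python
from typing import NamedTuple


class Customer(NamedTuple):
    """The schema applied at query time: Name, State, Age."""
    name: str
    state: str
    age: int


def apply_schema(raw_line):
    """Parse one stored raw line into a typed record."""
    name, state, age = raw_line.split()
    return Customer(name, state, int(age))


# The stored data stays as raw text.
raw_lines = ["Kevin CA 45", "Mark MI 27", "Sara FL 64"]

# The schema is applied only while the query runs.
customers = [apply_schema(line) for line in raw_lines]
names_in_ca = [c.name for c in customers if c.state == "CA"]
print(names_in_ca)
```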

File size can be a few bytes to 4 exabytes, with no limit on the total number of files that can be stored.

128GB - 1TB
8TB - 16TB or more

Only 1.5 - 12.5% of the data is in memory, and only recently used data is kept there.
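The percentages can be checked with a little arithmetic, assuming the first range (128GB - 1TB) is memory per node, the second (8TB - 16TB) is disk per node, and 1TB = 1024GB. Those readings are assumptions; the slide does not label the two ranges.

```python
GB_PER_TB = 1024


def memory_share(ram_gb, disk_tb):
    """Percentage of on-disk data that fits in memory at one time."""
    return 100 * ram_gb / (disk_tb * GB_PER_TB)


low = memory_share(128, 8)     # smallest memory against 8TB of disk
high = memory_share(1024, 8)   # 1TB of memory against 8TB of disk
print(low, high)
```

Under these assumptions the quoted 1.5 - 12.5% range matches 8TB of disk; with 16TB the share is half as large.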

Speed

2013

Speed - Part 1: Indexing

• index per file
• customize by field(s)

File Name ~/customers_2010-05
File Name ~/customers_2010-05_index

[Diagram: a Thor Master and Thor Slaves, each slave holding an index with entries such as CA row #3, MI row #17, MI row #4, FL row #5.]

Non-Indexed: 1 to 40
Indexed: 1 to 200+

Example Index: CA row #3, MI row #17, MI row #4, FL row #5
Example Index: male row #345, female row #4, male row #97, female row #267
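The example indexes map a field value to row numbers so a query can jump straight to matching rows instead of scanning the whole file. A minimal sketch of that idea; `build_index` is a hypothetical helper, row numbers here are 0-based list positions, and the fourth sample row is added for illustration.

```python
from collections import defaultdict


def build_index(rows, field):
    """Map each value of `field` to the row numbers where it appears."""
    index = defaultdict(list)
    for row_number, row in enumerate(rows):
        index[row[field]].append(row_number)
    return dict(index)


rows = [
    {"name": "Kevin", "state": "CA", "age": 45},
    {"name": "Mark", "state": "MI", "age": 27},
    {"name": "Sara", "state": "FL", "age": 64},
    {"name": "Ann", "state": "MI", "age": 30},  # hypothetical extra row
]

state_index = build_index(rows, "state")

# Indexed lookup: read only the rows the index points to.
mi_rows = [rows[i] for i in state_index.get("MI", [])]
print(state_index, [r["name"] for r in mi_rows])
```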

Speed - Part 2: Roxie

Roxie keeps the index in memory, or the index plus part or all of the data.
Roxie is multi-threaded.
SSDs are OK: write few / read many.

[Diagram: a Roxie Master and Roxie Slaves, each slave holding an index.]

2004

Common Cluster

Data is mostly unstructured. Use Thor to do ETL and create indexes. Send results to Roxie for user queries.

[Diagram: Thor Master, Thor Slaves, Dali, ESP, Roxie Master, and Roxie Slaves.]

High Speed Cluster

Data is mostly structured. The main goal is to have fast queries all the time.

[Diagram: Dali, ESP, Roxie Master, and Roxie Slaves.]

Storage Cluster

Data is structured or unstructured. The main goal is to store lots of data and query it using indexes on all or part of the data in the cluster.

[Diagram: Thor Master, Thor Slaves, Dali, and ESP.]

Complex or Multi-Step Queries

The Job Tracker coordinates multi-step jobs.

[Diagram: Name Node, Data Nodes / Task Trackers, Mapper, Reducer, and Job Tracker. Intermediate results (CA 120, MI 500, FL 7; Food 31, Water 99, Candy 84; Wed 80, Fri 73, Sun 96; Sum 80, Count 73) pass through steps 1 to 9, with stages taking 3 hours, 1 hour, 1 hour, 6 hours, and 1 hour.]

How do I Query HPCC Systems?

ECL (Enterprise Control Language) is a C++ based query language for use with the HPCC Systems Big Data platform. ECL’s syntax and format are simple and easy to learn.

Note: ECL is similar to Hadoop’s Pig, but more expressive and feature rich.

Simple to Complex Queries

ECL (Enterprise Control Language), a C++ based query language, covers Map/Reduce, SQL w/ JOINS, GraphDB, and Machine Learning style queries, including Sort, Count, Group, Join, and Classification.

Query times range from 0.27 seconds (ROXIE) to a few hours (THOR).

The query is completed in a single job, asynchronously.

[Diagram: a Join over ~/facebook_2013, the index of ~/facebook_2013, and ~/twitter_2013, with Country = ‘US’ filters.]

SORT, GROUP, DEDUP, JOIN, MERGE, BETWEEN, LENGTH, REGEX, ROUND, SUM, COUNT, TRIM, WHEN, AVE, CASE, NORMALIZE, DENORMALIZE, K-MEANS, and more, plus Machine Learning Built-in:

Regression: Linear Regression
Classification: Naive Bayes, Perceptron, Decision Trees, Logistic Regression
Clustering: K-Means, KD Trees, Agglomerative/Hierarchical
Association Analysis: AprioriN, EclatN, Rules

http://hpccsystems.com/ml

Michael Payne, of Clemson University, on high speed machine learning with PB-BLAS in HPCC Systems.

http://youtu.be/s_HWlMwi6iI

Un-Structured Data? Example

300GB file with lots of text, such as: Lorem Ipsum is simply dummy text of the printing

Un-Structured Data

HPCC Systems: Regular Expression in C++ or Pattern Match in ECL.
Hadoop: Regular Expression in Java.

Reg Ex + meta data stored only; Filtered Data + Indexes
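A sketch of the "Reg Ex + meta data stored only" idea: scan the unstructured text with a regular expression and keep just the matches and their positions rather than the full text. The pattern (capitalized words) and the metadata fields are hypothetical choices for illustration, written in Python rather than C++, Java, or ECL.

```python
import re

text = "Lorem Ipsum is simply dummy text of the printing"

# Hypothetical pattern: words that start with a capital letter.
pattern = re.compile(r"\b[A-Z][a-z]+\b")

# Keep only metadata about each match, not the surrounding text.
metadata = [
    {"match": m.group(), "start": m.start(), "end": m.end()}
    for m in pattern.finditer(text)
]
print(metadata)
```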

Lorem Ipsum is simply dummy text of the printing

Full Text Search

Pattern Match in ECL and

+Reg Ex or

Management & Administration

More Moving Parts = More Downtime

“I want sub-second speed but made an investment in HDFS.”

Hadoop / HPCC Transport Plug-in

http://hpccsystems.com/products-and-services/products/modules/hadoop-integration

[Diagram: the Name Node and Data Nodes / Task Trackers alongside a Roxie Master and Roxie Slaves, each slave holding an index.]

Migrating from Hadoop to HPCC Systems

Slowly replace Hadoop with Thor.

[Diagram: the Name Node and Data Nodes / Task Trackers next to a Thor Master and Thor Slaves, with a Roxie Master and Roxie Slaves holding the indexes.]

Alternative Query Methods

HPCC Systems Security

• Encrypt data on disk (optional)
• Third-party authentication (Kerberos OK)
• User / group authentication

For more HPCC “How To’s”, go to SlideShare:

http://www.slideshare.net/FujioTurner/

Watch how to install HPCC Systems in 5 minutes:

http://www.youtube.com/watch?v=8SV43DCUqJg

Download HPCC Systems Open Source Community Edition:

http://hpccsystems.com/download/

or get the Source Code:

https://github.com/hpcc-systems
