big data - in-memory index / sub second query engine - roxie - hpcc systems

29
HPCC Systems - Big Data Roxie - In-Memory Data & Index , Sub-Second Query Cluster By Fujio Turner @FujioTurner

Upload: fujio-turner

Post on 15-Jan-2015

344 views

Category:

Technology


0 download

DESCRIPTION

Roxie , the best kept Big Data secret for high performance. Leverage the multi-threaded processing of Roxie and use tools like In-memory indexes, In-memory data ,SSD and more to do sub-second querying.

TRANSCRIPT

Page 1: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

HPCC Systems - Big Data!!

Roxie - In-Memory Data & Index , Sub-Second Query Cluster

By Fujio Turner

@FujioTurner

Page 2: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

LexisNexis is a provider of legal, tax, regulatory, news, business information, and analysis to legal, corporate, government,!

accounting and academic markets. !!

LexisNexis has been in business since 1977 with over 30,000 employees worldwide. 

What is HPCC Systems?Who is ?

LexisNexis Risk is the division of the LexisNexis which focuses on data, Big Data processing, linking and vertical expertise and supports HPCC Systems as an open source project under Apache 2.0 License.

Page 3: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

Comparison

JAVA C++

Petabytes

1-80,000 Jobs/day

Since 2005

Exabytes

Non-Indexed 4X-13X

Since 2000

Indexed: 2K-3K Jobs/sec

? ? ? ? ? ?

Thor Roxie

Block Based File Based

Page 4: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

BusinessDevelopmentCustomers1 20

Non-Indexed Full Data Set

http://hpccsystems.com/why-hpcc/benchmarks

Page 5: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

How do I Query HPCC Systems?ECL (Enterprise Control Language) is a C++ based query language for use with HPCC Systems Big Data platform. ECLs syntax and format is very simple and easy to learn.!!

Note - ECL is very similar to Hadoop’s pig ,but!more expressive and feature rich.

Page 6: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

Map/Reduce

SQL w/ JOINS

GraphDB

Machine Learning

Simple to Complex Queries

ECL (Enterprise Control Language) C++ based query language

Page 7: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

“I’m sub-second fast.”

“I can query all or part of your

data.”

Thor RoxieHard Disk

Index(optional)Hard Disk

Index(optional) In-memory Index

SSD

Either/Both

Cluster Architecture

Rapid Online XML Inquiry Engine

Page 8: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

Roxie

Date 2004

Query Languages ECL

In-Memory Index w/ Part or All Data

Data Type Normalized

Index Only

Query Methods REST Direct TCP

DeNormalizedor

or

SOAP

or Unstructured

Page 9: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

Load Into Thor QueryFile

VS

Example 1

Index PublishLoad Into Roxie QueryFile

Page 10: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

HPCC Systems Sample Data for Examples 1

Sample Data

http://hpccsystems.com/download/docs/learning-ecl

Page 11: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

Administrator Web GUI!on

Port 8010IP / Url of HPCC install

Page 12: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

4.

5.

1. Upload file*!2. Distribute to cluster!3. Name of file in cluster!4. Size of each row!5. Push to cluster

*2GB file size limit through web No limit if uploaded via SOAP

Load Data

Page 13: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

In Thor Cluster

Loaded

Page 14: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

Query !Example 1

Data

allPeople := DATASET(‘~test::originalperson’,Layout_Person,THOR);

Layout_People := RECORD STRING15 FirstName; STRING25 LastName; STRING15 MiddleName; STRING5 Zip; STRING42 Street; STRING20 City; STRING2 State; END;

smith; //Output

smith := allPeople(LastName= ‘Smith’);Query

Schema

WHERE `LastName` = ‘Smith’

File Location,!“FROM Table”“USE DATABASE;”

“SELECT * ….”

Page 15: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

Copy Data !From Thor to Roxie

Page 16: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

1.Select Thor File(s) 2.Copy Tab!!

3. Pick Roxie cluster!!

4. Click Copy!!

Page 17: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

Indexing!In-Memory

Make Index

File Position Number!pseudo recordID!

“Alter Table”(new column) Index Filename

allPeople := DATASET(‘~test::originalperson_copy’, {Layout_People, UNSIGNED8 RecPtr {virtual(fileposition)}}, FLAT);

rx := INDEX(allPeople,{LastName,RecPtr},’~test::key_person_copy’,PRELOAD);

BUILDINDEX(rx);

Ex. Creating an index by “LastName”

In-Memory

Page 18: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

rx := INDEX(allPeople,{LastName,RecPtr},’~test::key_person_copy’,PRELOAD);

Indexing!In-Memory with Luggage

Index Only

rx := INDEX(allPeople,{FirstName,MiddleName},{LastName,RecPtr},’~test::key_person_copy’,PRELOAD);Index + Part or All Data

+Store Data!In-Memory!with Index

Page 19: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

Query

filterdata; //Output

w/ IndexData

Queryfilterdata:= FETCH(allPeople,datax(LastName=‘Smith’),RIGHT. RecPtr);

datax:= INDEX(allPeople,{LastName,RecPtr},’~test::key_person_copy’);

WHERE `LastName` = ‘Smith’ from Index

allPeople := DATASET(‘~test::originalperson_copy’, {Layout_People, UNSIGNED8 RecPtr {virtual(fileposition)}},FLAT);

Page 20: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

“I’m sub-second fast.” Publish Your Code

What Is publishing your code?

Can I do ad-hoc queries without publishing?

ECL is built from C++. So your ECL needs to be compiled before it runs.When you publish a query, you pre-compile your ECL and send it to the ESP server where it will be stored. ESP, on port 8002 , will listen to any requests and execute the published query.

Yes, but it will not be sub-second fast as it is not pre-compiled.

Page 21: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

How Querying ESP Works

Page 22: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

How to Publish Your ECL

1.Select your ad-hoc “lastname” query from before

2.Name you query “lastname”

3. Publish

Page 23: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

ESP on port 8002Enterprise Service Platform

1.Select “lastname” query2.Select your data output format ”Output JSON” 3.Click Submit button

Page 24: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

JSON Format

1.Send Request = Query or hit this Url

Page 25: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

JSON Format

Results - in less then a second

Page 26: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

2013-06-06 Twitter

2013-06-07 Twitter

2013-06-08 Twitter

2013-06 Twitter

2013-06-06 ……….. -07 ……….. -08

Logical File

Real File

SuperFile!Organizing Your Files

+ Append new real files on the fly

Use this file name to get all the real files

Page 27: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

2013-06-06 Twitter

2013-06-07 Twitter

2013-06-08 Twitter

2013 Twitter

SuperKeys No Sub-Super Files or Keys

in RoxieOrganizing Your Indexes

Page 28: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

http://www.slideshare.net/FujioTurner/

For More HPCC!“How To’s”!

Go to SlideShare

Page 29: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems

http://www.youtube.com/watch?v=8SV43DCUqJg

Watch how to install HPCC Systems

in 5 Minutes

Download HPCC Systems Open Source

Community Edition

or

Source Codehttps://github.com/hpcc-systems

http://hpccsystems.com/download/