accumulo summit 2015: using d4m for rapid prototyping of analytics for apache accumulo [frameworks]

42
D4M and Apache Accumulo Vijay Gadepally, Lauren Edwards, Dylan Hutchison, Jeremy Kepner Accumulo Summit College Park, MD April 29, 2015 This work is sponsored by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government

Upload: accumulo-summit

Post on 15-Jul-2015

387 views

Category:

Technology


8 download

TRANSCRIPT

Page 1: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

D4M and Apache Accumulo

Vijay Gadepally, Lauren Edwards,

Dylan Hutchison, Jeremy Kepner

Accumulo SummitCollege Park, MD

April 29, 2015

This work is sponsored by the Assistant Secretary of Defense for Research

and Engineering under Air Force Contract #FA8721-05-C-0002. Opinions,

interpretations, recommendations and conclusions are those of the authors

and are not necessarily endorsed by the United States Government

Page 2: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 2

Giving away the punch line

• D4M is a popular open source software tool that

connects scientists with Big Data technologies

• D4M-Accumulo binding provides high performance

connectivity to Apache Accumulo for quick analytic

prototyping

• Graphulo: Implement GraphBLAS server-side

iterators and operators on Accumulo tables

Page 3: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 3

Outline

• Introduction

• D4M Overview

• D4M Details

• Demonstration

• Conclusions

Page 4: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 4

Common Big Data Challenge

CommandersOperators Analysts

Users

MaritimeGround SpaceC2 CyberOSINT

<html>

Data

AirHUMINTWeather

Data

Users

Gap

2000 2005 2010 2015 & Beyond

Rapidly increasing

- Data volume

- Data velocity

- Data variety

- Data veracity (security)

Page 5: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 5

Common Big Data Architecture

WarfightersOperators Analysts

Users

MaritimeGround SpaceC2 CyberOSINT

<html>

Data

AirHUMINTWeather

Analytics

A

C

D E

B

Computing

Web

Files

Scheduler

Ingest &

EnrichmentIngest &

EnrichmentIngestDatabases

Page 6: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 6

Common Big Data Architecture- Data Volume: Cloud Computing -

WarfightersOperators Analysts

Users

MaritimeGround SpaceC2 CyberOSINT

<html>

Data

AirHUMINTWeather

Analytics

A

C

D E

B

Computing

Web

Files

Scheduler

Ingest &

EnrichmentIngest &

EnrichmentIngestDatabases

Operators

MIT

SuperCloud

Enterprise Cloud

Big Data Cloud Database Cloud

Compute Cloud

MIT SuperCloud merges four clouds

Page 7: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 7

WarfightersOperators Analysts

Users

MaritimeGround SpaceC2 CyberOSINT

<html>

Data

AirHUMINTWeather

Analytics

A

C

D E

B

Computing

Web

Files

Scheduler

Ingest &

EnrichmentIngest &

EnrichmentIngestDatabases

Lincoln benchmarking

validated Accumulo performance

Common Big Data Architecture- Data Velocity: Accumulo Database -

Page 8: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 8

WarfightersOperators Analysts

Users

MaritimeGround SpaceC2 CyberOSINT

<html>

Data

AirHUMINTWeather

Analytics

A

C

D E

B

Computing

Web

Files

Scheduler

Ingest &

EnrichmentIngest &

EnrichmentIngestDatabases

D4M demonstrated a

universal approach to diverse data

columnsro

ws

Σ

raw

Common Big Data Architecture- Data Variety: D4M Schema -

intel reports, DNA, health records, publication

citations, web logs, social media, building alarms,

cyber, … all handled by a common 4 table schema

Page 9: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 9

Common Big Data Architecture- Data Veracity: Security Tools-

WarfightersOperators Analysts

Users

MaritimeGround SpaceC2 CyberOSINT

<html>

Data

AirHUMINTWeather

Analytics

A

C

D E

B

Computing

Web

Files

Scheduler

Ingest &

EnrichmentIngest &

EnrichmentIngestDatabases

Using cryptography to protect

sensitive data-Verifiable Query Results-

-Computing on Masked Data-

Big Data Cloud

Masked

Query

Plaintext

Query

Encrypt

CMD

Masked

Analytic

Result

Decrypt

Plaintext

Analytic

Result

Page 10: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 10

Outline

• Introduction

• D4M Overview

• D4M Details

• Demonstration

• Conclusions

Page 11: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 11

High Level Language: D4Mhttp://d4m.mit.edu

Accumulo

Distributed Database

Query:Alice

Bob

Cathy

David

Earl

Associative ArraysNumerical Computing Environment

D4MDynamic

Distributed

Dimensional

Data ModelA

C

DE

B

A D4M query returns a sparse

matrix or a graph…

…for statistical signal processing

or graph analysis in MATLAB

D4M binds associative arrays to databases, enabling rapid

prototyping of data-intensive cloud analytics and visualization

Page 12: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 12

What is D4M?

• The Dynamic Distributed Dimensional Data Model:

– Support for mathematical foundation – associative arrays

– Schema to represent most unstructured data as associative arrays

– Software tools to connect with variety of databases such as Apache Accumulo, SciDB, mySQL, PostgreSQL, …

• Software tools currently implemented in MATLAB/Octave, and Julia (v1)

• Connect to databases via JDBC (relational), SHIM (SciDB) or custom Java API (Accumulo)

Page 13: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 13

• Key innovation: mathematical closure

– All associative array operations return associative arrays

• Enables composable mathematical operations

A + B A - B A & B A|B A*B

• Enables composable query operations via array indexing

A('alice bob ',:) A('alice ',:) A('al* ',:)

A('alice : bob ',:) A(1:2,:) A == 47.0

• Simple to implement in a library in programming environments

with: 1st class support of 2D arrays, operator overloading,

sparse linear algebra

Mathematical Foundation: Associative Arrays

• Complex queries with ~50x less effort than Java/SQL

• Naturally leads to high performance parallel implementation

• Need a schema to convert arbitrary data to associative array

Page 14: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 14

D4M Data Schema

• A structure described in a language supported by the database management system.

• Use D4M schema to represent heterogeneous data types in common data format

– Schema converts structured or unstructured raw text to a tuple representation supported by Accumulo:

• Usually use a 4 table representation

– The Edge Table, the Transpose Table, Degree Table, Raw Table

33659254179712 2013-05-20 21:21:42 20798128

kiefpief web 3b77caf94bfc81fe I am

sending love to Oklahoma. And actually -- to everyone who

may need it. You are loved. And you are not alone.

Promise. #PrayforOklahoma

33660010027264 2013-05-20 21:54:56 35.99894978 -

78.90660222 -8783842.7781526 4300476.86376416

22435220 RyanBLeslie Twitter for iPad348803787

bced47a0c99c71d0 @HaydenBigCntry RT @jiminhofe:

The devastation in Oklahoma is

D4M

Schema

(33659254179712, time|2013-05-20 21:21:42, 1)

(33659254179712, user|kiefpief, 1)

(33659254179712, text, Sending love to OK #PrayforOklahoma)

(33659254179712, word|Sending, 1)

(33660010027264, time|2013-05-20 21:54:56, 1)

(33660010027264, lat|-78.90660222, 1 )

(33660010027264, lon|35.99894978, 1)

(33660010027264, user|RyanBLeslie, 1)

(33660010027264, RT|@HaydenBigCntry , 1)

(33660010027264, word|Oklahoma, 1)

Page 15: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 15

4 Table D4M Schema

row_num col1 col2 col3

001 row1col1 row1col2 word1 word2 word3

002 row2col1 row2col2 word2 word3

003 … … word1 word3

col1|row1col1 col1|row2col1 col2|row1col2 col2|row2col2 col3|word1 col3|word2 col3|word3

row_num|001 1 1 1 1 1

row_num|002 1 1 1 1

row_num|003 1 1

col1|row1col1 col1|row2col1 col2|row1col2 col2|row2col2 col3|word1 col3|word2 col3|word3

Degree 1 1 1 1 2 2 3

row_num|001 row_num|002 row_num|003

col1|row1col1 1

col1|row2col1

col2|row1col2 1 1

col2|row2col2 1

col3|word1 1 1

col3|word2 1 1

col3|word3 1 1

Tedge

TedgeDeg

TedgeT

text

row_num|001 word1 word2 word3

row_num|002 word2 word3

row_num|003 word1 word3

TedgeTxt

Page 16: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 16

Outline

• Introduction

• D4M Overview

• D4M Details

• Demonstration

• Conclusions

Page 17: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 17

D4M Software Library

• Associative Array representation works very well as an interface among databases.

• D4M currently implemented in languages with first class support of sparse matrices:

– MATLAB

– GNU Octave

– Julia (in progress)

• Implemented in ~2000 lines of MATLAB code

Download D4M

Source from

d4m.mit.edu

d4m_api.zip

matlab_src/

d4m_api_java.jar

libext.zip

dependency JARs

Page 18: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 18

D4M: What a user sees

(row, col, val)

Matlab strings

d4m

Matlab API

d4m_api_java

Java API

Accumulo

Java APIAccumulo

Table

% D4M Associative Array APIrow = 'r1,r2,'; col = 'c1,c1,'; val = '7,3,';

A = Assoc(row,col,val,@min);

% D4M Accumulo APIDB = DBserver(’zoohost.edu:2181', 'Accumulo',

'instance', 'user', 'password');

T = DB('Table'); % Create table if doesn't exist.

put(T,A); % Put associative array in T.

Aret = T(:,:); % Scan all of T.

Page 19: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 19

D4M: What a developer sees

Type Matlab/Julia File Java Class Use

Table

management

DBcreate.m D4mDbTableOperations Create table

@DBserver/ls.m D4mDbInfo List tables

@DBtable/nnz.m D4mDbTableOperationsNumber of entries in table,

summed from table's tablets

DBdelete.m D4mDbTableOperations Delete table

Write DBinsert.m D4mDbInsert Insert

Scan@DBtable/DBtable.m D4mDataSearch Create query holder

@DBtable/subsref.m D4mDataSearch Do query, possibly holding batches

@DBtable/close.m D4mDataSearch Reset query

Delete@DBtable/deleteTriple.m AccumuloDelete Delete entries

@DBtable/deleteAssoc.m AccumuloDelete Delete entries

Iterators@DBtable/ColCombiner.m D4mDbTableOperations List table iterators

@DBtable/addColCombiner.m D4mDbTableOperations Add all-scope table iterator

@DBtable/deleteColCombiner.m D4mDbTableOperations Remove iterator

Splits

@DBtable/Splits.m D4mDbTableOperationsReturn splits, number of entries in each

tablet, tablet server addresses

@DBtable/addSplits.m D4mDbTableOperations Add new table split

@DBtable/putSplits.m D4mDbTableOperations Replace table splits, merging old splits

@DBtable/mergeSplits.m D4mDbTableOperations Remove splits by merging tablets

• Source code released and available!

Page 20: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 20

D4M Write

More details on Batched Insert – 500 kB by default

• putNumBytes() controls #entries to insert in one batch, on MATLAB side

• Independent batches: each creates, flushes and closes separate

BatchWriters

• Guarantee BatchWriters correctly closed

• No need to maintain BatchWriter lifecycle in MATLAB

• 30 ms maximum latency before flushing

• 50 Write threads

• 1 MB maximum memory on BatchWriter, plenty for default batch size

KeyValue

Assoc

Val

Row ID

Assoc

Row

Column Timestamp

FamilyputColumn

Family()

QualifierAssoc

Col

Visibilityput

Security()

Page 21: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 21

D4M Scan Example

1. Translate Matlab queries into ranges for BatchScanner

T(:,:) %Scan all

T('r1;r5;:;r7;', :) %Scan given row ranges

T(:, 'c1;') %Use fetchColumn(), or row scan

Transpose table

T('r5;:;r9;', 'c1;:;c3;') %Complicated; break into simpler

queries

2. Hold state of Scanner iterator as state of MATLAB object

T_it = Iterator(T, 'elements', 1e5); % 100k entry batch size

A = T_it(:,:); % Initial query

while nnz(A) % While there is another batch

handleBatch(A);

A = T_it(); % Get next batch

end

Page 22: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 22

Parallel Accumulo Access

Sample script writing files to Accumulo in parallel:

T = DB('Tedge','TedgeT');

myFiles = global_ind(zeros(Nfile,1,map([Np 1],{},0:Np-1)));

for i = myFiles

fname = ['data/' num2str(i)]; % Create filename.

load([fname '.A.mat']); % Load file data.

put(T,num2str(A)); % Insert to Accumulo.

end

Run on 4 local processors: eval(pRUN('Script',4,{}));

• D4M + pMATLAB gives rise to high performance

Page 23: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 23

Accumulo Scaling on MIT SuperCloud

• Scales linearly with ingest processes, server nodes, and data sizeS

erv

er n

od

es

Page 24: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 24

115,000,000 inserts per second

• Using supercomputing techniques allows peak insert to be achieve

within seconds of launch

1M edge

Graph500

graph

43K

43B edges in

5 minutes

Page 25: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 25

Outline

• Introduction

• D4M Overview

• D4M Details

• Demonstration

• Conclusions

Page 26: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 26

D4M Twitter Demo

• August 24, 2014: Earthquake in Northern California

• Tweets from August 24-25

• Using D4M for:

– Exploration

– Analytics

– Visualization

Page 27: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 27

Set Table Bindings

Page 28: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 28

Query Tweets

Page 29: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 29

Find Common Locations

Page 30: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 30

Filter Tweets

Page 31: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 31

Query for Full Tweets

Page 32: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 32

Load Stopwords

Page 33: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 33

Remove Stopwords

Page 34: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 34

Find Co-Occurring Words

Page 35: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 35

Remove Diagonal

Page 36: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 36

See Words Most Used Together

Page 37: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 37

Display on Map

Page 38: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 38

Outline

• Introduction

• D4M Overview

• D4M Details

• Demonstration

• Conclusions

Page 39: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 39

Summary

• D4M is a popular software tool that connects

scientists with Big Data technologies

• D4M-Accumulo binding provides high performance

connectivity to Apache Accumulo for quick analytic

prototyping

• Current research expands this connection to support

high performance graph analytics

Page 40: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 40

• Graphulo: Implement GraphBLAS server-side iterators and operators on

Accumulo tables

• Use case: Queued analytics = Localized within a neighborhood

• Aim for Accumulo Contrib

• Released:

– Design Document

• Upcoming:

– Beta version of tools in

late May/early June

• Future:

– Scalability

– Schemas

– More example algorithms

G R A P H U L O

http://graphulo.mit.edu

Graphulo:

Contact Dylan Hutchison if you have any thoughts!

[email protected]

Page 41: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 41

Acknowledgements

• Bill Arcand

• Bill Bergeron

• David Bestor

• Chansup Byun

• Matt Hubbell

• Jeremy Kepner

• Jake Bolewski

• Pete Michaleas

• Julie Mullen

• Andy Prout

• Albert Reuther

• Tony Rosa

• Charles Yee

• Dylan Hutchison

And many more …

Page 42: Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache Accumulo [Frameworks]

Accumulo Summit

VNG - 42

Thank you!

• Contact:

– Vijay Gadepally ([email protected])

– Lauren Edwards ([email protected])

– Jeremy Kepner ([email protected])