IBM Research - Australia, 2014-03-24

Big Challenges Building a Real-World Big Data System
Or, what I did with eight years of my life

Andy Frenkiel
Big Data Talk, RMIT, 16 May 2014

IBM RESEARCH

[Figure: three eras of IBM Research]
► Eras: ’50s – ’80s Hardware; ’80s – ’00s + Software & Services; ’00s … + Smarter Planet
► Collaboration: Isolated Research; Joint Projects; IBM Divisions, Clients, Universities; Radical Collaboration, The World is Our Lab
► IP: IP in IBM Products, Proprietary; Leverage IP for Income; Cross Licensing for Freedom of Action; Income; Influence

2014-03-24 IBM Research - Australia GTO 2014

IBM Research Overview
Famous for its science and vital to IBM

► 5 Nobel Laureates
► 3000 Researchers
► $6B R&D Budget
► Innovation that Matters
► 6 Turing Awards
► 21 years of Patent Leadership
► 12 Labs around the World

[Map: labs in China, Watson, Almaden, Austin, Tokyo, Haifa, Zurich, India, Brazil, Melbourne (2011), Ireland and Kenya (2012). Other year labels on the map: 1945, 1955, 1956, 1972, 1982, 1995, 1995, 1998, 2011, 2011]

IBM Research – Australia: The Lab

► IBM Research Australia is located at Level 5, 204 Lygon Street, Carlton, VIC 3053

► ~85 FTE now, growing to 150 by May 2016

IBM Research – Australia: Mission

“Evidence-based Decision Making at the Speed of Thought”

► Natural Resource Management, Disaster Management, Life Sciences*, Healthcare
■ *IBM Life Sciences Collaboratory not in scope
► High Performance Computing, Stream Computing, Cloud Computing
► Visualisation, Modelling, Analytics, Optimisation, Human Computing Interfaces
► Assimilation, Decision support, Capturing, Command & control, Modeling and predictive analysis

IBM Research – Australia: Projects

We want to be famous for: Cognitive decision making (consequence management) delivered as a service

[Figure: project map]
► Areas: Disaster Management, Natural Resources, Healthcare, Life Science
► Technologies and services: Nano Tech, Cloud Services, Information Composition, BAMS, Multimedia Analytics, Visualisation, Mobile, Social, Cognitive Computing, Industry Solutions, Computing as a Service, Science & Technology
► Projects and partners: AusCrisis Tracker, Modelling, Froth Flotation, Medical Sieve, Surveillance, RCH, GenesisCare, Counties Manukau, ADMP, Evacuation Planning, BioNano Sensors, Genomics, CX Lab / THINK Lab, Smarter Financial Life, Cardiac Modelling, Arterial Stent Modelling, VINE

A few of the big challenges we encountered implementing IBM InfoSphere Streams, a big data product ...

Big Challenge #1

How do you write code to process streaming data?

► Transform
► Filter / Sample
► Classify
► Correlate
► Annotate

Big Challenge #1

How do you write code to process streaming data?

[Diagram of an operator. Labels: Input Streams, Ports, Windows, Operator, Type, Output Stream]

Big Challenge #1, continued

How do you write code to process streaming data?

Anatomy of an Operator Invocation

    stream<stream-type> stream-name = MyOperator(input-stream; …) {
        logic  logic ;
        window windowspec ;
        param  parameters ;
        output output ;
        config configuration ;
    }

► Operators share a common structure
■ italics are sections to fill in

► Reading an operator invocation
■ Declare a stream stream-name
■ With attributes from stream-type
■ that is produced by MyOperator
■ from the input(s) input-stream
■ MyOperator behavior defined by logic, parameters, windowspec, and configuration; output attribute assignments are specified in output

Example

    stream<rstring item> Sale = Join(Bid; Ask) {
        window Bid: sliding, time(30);
               Ask: sliding, count(50);
        param  match : Bid.item == Ask.item && Bid.price >= Ask.price;
        output Sale: item = Bid.item;
    }

► For the example:
■ Declare the stream Sale with the attribute item, which is a raw (ASCII) string
■ Join the Bid and Ask streams with
■ sliding windows of 30 seconds on Bid, and 50 tuples of Ask
■ When items are equal, and Bid price is greater than or equal to Ask price
■ Output the item value on the Sale stream
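
The behaviour of the Sale example can be sketched outside SPL. The Python below is a minimal illustration, not how Streams implements Join: it keeps a 30-second window of bids and a 50-tuple window of asks, and each arriving tuple is matched against the opposite window. The class name WindowedJoin, its method names, and the caller-supplied timestamps are assumptions made for this sketch.

```python
from collections import deque


class WindowedJoin:
    """Toy two-input join: time-based window on bids, count-based window on asks."""

    def __init__(self, bid_seconds=30.0, ask_count=50):
        self.bid_seconds = bid_seconds
        self.bids = deque()                  # (timestamp, item, price), oldest first
        self.asks = deque(maxlen=ask_count)  # (item, price); the deque drops the oldest

    def _evict_bids(self, now):
        # Drop bids that have aged out of the time window.
        while self.bids and now - self.bids[0][0] > self.bid_seconds:
            self.bids.popleft()

    @staticmethod
    def _match(bid_item, bid_price, ask_item, ask_price):
        # Same condition as the `match` parameter in the SPL example.
        return bid_item == ask_item and bid_price >= ask_price

    def on_bid(self, now, item, price):
        """Store a bid, then probe the ask window; returns the Sale items emitted."""
        self._evict_bids(now)
        self.bids.append((now, item, price))
        return [item for (a_item, a_price) in self.asks
                if self._match(item, price, a_item, a_price)]

    def on_ask(self, now, item, price):
        """Store an ask, then probe the bid window; returns the Sale items emitted."""
        self._evict_bids(now)
        self.asks.append((item, price))
        return [b_item for (_, b_item, b_price) in self.bids
                if self._match(b_item, b_price, item, price)]
```

For instance, a bid for "widget" at 10.0 followed five seconds later by an ask at 9.5 yields one Sale, while the same ask arriving after 40 seconds yields none because the bid has left its window.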

Interesting Challenge #1, continued

How do you write code to process streaming data?

[Diagram labels: Source Operator, Sink Operator, Streams Application]

graph

    stream<TQRecT> TradeQuote = FileSource() {
        param file        : "TradesAndQuotes.csv.gz";
              format      : csv;
              compression : gzip;
    }

    stream<TradeFilterT> TradeFilter = Functor(TradeQuote) {
        param  filter : isTrade(ttype) && (ticker in $monitoredTickers);
        output TradeFilter : ts = timeStringToTimestamp(date, time, false);
    }

    stream<QuoteFilterT> QuoteFilter = Functor(TradeQuote) {
        param  filter : isQuote(ttype) && (ticker in $monitoredTickers);
        output QuoteFilter : ts = timeStringToTimestamp(date, time, false);
    }

    stream<VwapT, tuple<decimal64 sumvolume>> PreVwap = Aggregate(TradeFilter) {
        window TradeFilter : sliding, count(4), count(1), partitioned;
        param  partitionBy : ticker;
        output PreVwap : ticker    = Any(ticker),
                         vwap      = Sum(price * volume),
                         minprice  = Min(price),
                         maxprice  = Max(price),
                         avgprice  = Average(price),
                         sumvolume = Sum(volume);
    }

    stream<VwapT> Vwap = Functor(PreVwap) {
        output Vwap : vwap = vwap / sumvolume;
    }

    stream<BargainIndexT> BargainIndex = Join(Vwap as V; QuoteFilter as Q) {
        window V : sliding, count(1), partitioned;
               Q : sliding, count(0);
        param  partitionByLHS : V.ticker;
               equalityLHS    : V.ticker;
               equalityRHS    : Q.ticker;
        output BargainIndex : index = vwap > askprice ? asksize * exp(vwap - askprice) : 0d,
                              ts    = (rstring) ctime(ts);
    }

    () as SinkOp = FileSink(BargainIndex) {
        param file   : "out";
              format : txt;
    }
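
The PreVwap and Vwap steps can be checked by hand with a small Python sketch. It is an illustration only: the class name VwapAggregator and the key pre_vwap are invented here, and holding back output until four trades have arrived is an assumption chosen to agree with the debugger session later in the talk, where PreVwap first stops after the fourth trade.

```python
from collections import defaultdict, deque
from decimal import Decimal


class VwapAggregator:
    """Per-ticker sliding window of the last `size` trades, evaluated on each trade."""

    def __init__(self, size=4):
        # One bounded window per ticker, mirroring `partitionBy : ticker`.
        self.windows = defaultdict(lambda: deque(maxlen=size))

    def on_trade(self, ticker, price, volume):
        window = self.windows[ticker]
        window.append((Decimal(price), Decimal(volume)))
        if len(window) < window.maxlen:
            return None  # assumption: nothing is emitted until the window first fills
        prices = [p for p, _ in window]
        pre_vwap = sum(p * v for p, v in window)   # Sum(price * volume)
        sum_volume = sum(v for _, v in window)     # Sum(volume)
        return {
            "ticker": ticker,
            "minprice": min(prices),
            "maxprice": max(prices),
            "avgprice": sum(prices) / len(prices),
            "sumvolume": sum_volume,
            "pre_vwap": pre_vwap,
            "vwap": pre_vwap / sum_volume,         # the Vwap Functor's division
        }
```

Feeding it the four IBM trades from the debugger session (87.00 x 74200, then 83.45 x 200, 300 and 300) gives a price-volume sum of 6522160.00, a total volume of 75000 and an average price of 84.3375, which agree with the PreVwap tuple shown there.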

Big Challenge #2

How do you implement an application consisting of 1000s of operators and data streams?

► 4 wells: 10s of operators and streams
► 20 wells: 100s of operators and streams
► 100 wells: 1000s of operators and streams

Big Challenge #2, continued

How do you implement an application consisting of 1000s of operators and data streams?

… Use hierarchy

...

Big Challenge #2, continued

How do you implement an application consisting of 1000s of operators and data streams?

… Use code to build code

Perl generates SPL
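
The idea of using code to build code can be sketched as follows. The talk used Perl; this sketch uses Python purely for illustration, and the operator and type names it emits (WellSource, WellMonitor, AlertSink, ReadingT) are hypothetical, not taken from the talk or from any Streams toolkit.

```python
def generate_well_graph(num_wells):
    """Return SPL-like source text with one monitoring pipeline per well.

    WellSource, WellMonitor, AlertSink and ReadingT are hypothetical names
    used only for this illustration.
    """
    lines = ["composite Main {", "    graph"]
    for well in range(num_wells):
        # Three operator invocations per well: source, monitor, sink.
        lines += [
            f"        stream<ReadingT> Raw{well} = WellSource() {{ param wellId : {well}; }}",
            f"        stream<ReadingT> Checked{well} = WellMonitor(Raw{well}) {{}}",
            f"        () as Alert{well} = AlertSink(Checked{well}) {{}}",
        ]
    lines.append("}")
    return "\n".join(lines)
```

Calling generate_well_graph(100) produces 300 operator invocations from a dozen lines of generator code, which is why generation scales where hand-written graphs do not.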

Big Challenge #2, continued

How do you implement an application consisting of 1000s of operators and data streams?

    composite Main {
        graph
            stream<Type> Src = Source() {}

            @parallel(width=2)
            stream<Type> Res = AB(Src) {}

            () as Snk = Sink(Res) {}
    }

    composite AB(input In; output B) {
        graph
            stream<Type> A = Functor(In) {}
            stream<Type> B = Functor(A) {}
    }

► Logical: Src → A → B → Snk
► Physical: Src → A[0] → B[0] → Snk and Src → A[1] → B[1] → Snk

… Automatically generate operators for sub graphs that can be parallelised
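
The step from the logical graph to the physical graph can be sketched in Python. This is a toy expansion of a linear chain under stated assumptions, not the Streams compiler's algorithm: it ignores any splitter or merger logic the runtime may add, and the function name expand_parallel is invented for the example.

```python
def expand_parallel(chain, parallel_ops, width):
    """Expand a linear logical chain into physical edges.

    chain        -- operator names in dataflow order, e.g. ["Src", "A", "B", "Snk"]
    parallel_ops -- the contiguous operators inside the parallel region
    width        -- number of channels the region is replicated into
    """
    def physical_names(op):
        # Operators in the region get one replica per channel: A -> A[0], A[1], ...
        if op in parallel_ops:
            return [f"{op}[{ch}]" for ch in range(width)]
        return [op]

    edges = []
    for upstream, downstream in zip(chain, chain[1:]):
        ups, downs = physical_names(upstream), physical_names(downstream)
        if upstream in parallel_ops and downstream in parallel_ops:
            # Inside the region each channel stays on its own replicas.
            edges += list(zip(ups, downs))
        else:
            # Entering or leaving the region: fan out or fan in.
            edges += [(u, d) for u in ups for d in downs]
    return edges
```

With chain ["Src", "A", "B", "Snk"], parallel region {"A", "B"} and width 2, the result contains the two channels A[0] → B[0] and A[1] → B[1], with Src feeding both and both feeding Snk.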

Big Challenge #3

What should run where?

… Streams Runtime Supports Placement Criteria

► Host pools can force operators to be on hosts with SolidDB installed
► Operator placement constraints allow for co-location, ex-location, and isolation of operators
► SolidDB could be wrapped as a custom operator for dynamic deployment and relocation

[Diagram: operators placed across x86 hosts. Operators: Meters, Company Filter, Usage Model, Usage Contract, Text Extract, Degree History, Compare History, Store History, Temp Action, Season Adjust, Daily Adjust]
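
What the three kinds of placement constraint mean can be made concrete with a small checker. This Python sketch is an assumption-laden illustration at host level only: it does not model host pools or processing elements, and the function name check_placement and its argument shapes are invented here rather than taken from the Streams runtime.

```python
def check_placement(assignment, colocate=(), exlocate=(), isolate=()):
    """Return a list of human-readable constraint violations.

    assignment -- dict mapping operator name to host name
    colocate   -- groups of operators that must share one host
    exlocate   -- groups of operators that must all be on different hosts
    isolate    -- operators that must not share a host with any other operator
    """
    problems = []
    for group in colocate:
        # Co-location: every member of the group maps to the same host.
        if len({assignment[op] for op in group}) > 1:
            problems.append(f"co-location violated: {sorted(group)}")
    for group in exlocate:
        # Ex-location: no two members of the group map to the same host.
        hosts = [assignment[op] for op in group]
        if len(set(hosts)) < len(hosts):
            problems.append(f"ex-location violated: {sorted(group)}")
    for op in isolate:
        # Isolation: the operator's host carries nothing else.
        sharing = [o for o, h in assignment.items() if h == assignment[op] and o != op]
        if sharing:
            problems.append(f"isolation violated: {op} shares a host with {sorted(sharing)}")
    return problems
```

An empty result means the proposed assignment satisfies every constraint; a scheduler could use such a check to reject candidate placements before deployment.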

Big Challenge #3, continued

What should run where?

… Streams runtime optimises at deploy time

► Optimising scheduler assigns PEs (processing elements) to hosts, and continually manages resource allocation
► Dynamically add hosts and jobs
► New jobs work with existing jobs

[Diagram: operators spread across five x86 hosts. Operators: Meters, Company Filter, Usage Model, Usage Contract, Text Extract, Degree History, Compare History, Store History, Temp Action, Season Adjust, Daily Adjust]

Big Challenge #4

How do you debug the streaming data flow?

    [streamsadmin@streams bin]$ ./standalone
    IBM Stream Debugger (SDB), pid: 3982
    Standalone application execution is suspended.
    Set initial probe points, then run "g" command to continue execution.
    (sdb) o
    #in #out Operator     Class
    1   1    QuoteFilter  QuoteFilter
    1   0    SinkOp       SinkOp
    1   1    TradeFilter  TradeFilter
    0   1    TradeQuote   TradeQuote
    1   1    PreVwap      PreVwap
    2   1    BargainIndex BargainIndex
    1   1    Vwap         Vwap
    (sdb) b Tra
    TradeFilter TradeQuote
    (sdb) b TradeFilter o 0
    Set + 0 Breakpoint TradeFilter o 0 stopped:false
    (sdb) b PreVwap o 0
    Set + 1 Breakpoint PreVwap o 0 stopped:false
    (sdb) g
    (sdb) + 0 Breakpoint TradeFilter o 0 dropped:false stopped:true
        price, 83.48, decimal64
        volume, 74200, decimal64
        ts, (1135654207,521000000,0), timestamp
        ticker, "IBM", rstring
    (sdb) u 0 price 87.00
        price, 87.00, decimal64
        volume, 74200, decimal64
        ts, (1135654207,521000000,0), timestamp
        ticker, "IBM", rstring
    (sdb) c
    (sdb) + 0 Breakpoint TradeFilter o 0 dropped:false stopped:true
        price, 83.45, decimal64
        volume, 200, decimal64
        ts, (1135654213,518000000,0), timestamp
        ticker, "IBM", rstring
    (sdb) c
    (sdb) + 0 Breakpoint TradeFilter o 0 dropped:false stopped:true
        price, 83.45, decimal64
        volume, 300, decimal64
        ts, (1135654217,627000000,0), timestamp
        ticker, "IBM", rstring
    (sdb) c
    (sdb) + 0 Breakpoint TradeFilter o 0 dropped:false stopped:true
        price, 83.45, decimal64
        volume, 300, decimal64
        ts, (1135654219,132000000,0), timestamp
        ticker, "IBM", rstring
    (sdb) c
    (sdb) + 1 Breakpoint PreVwap o 0 dropped:false stopped:true
        ticker, "IBM", rstring
        minprice, 83.45, decimal64
        maxprice, 87.00, decimal64
        avgprice, 84.33750000000000, decimal64
        vwap, 6522160.00, decimal64
        sumvolume, 75000, decimal64
    (sdb)

Set Breakpoints

… Use a Streams Debugger

© 2013 IBM Corporation

Other Big Challenges

How do we support user-defined type-generic Streams operators and functions?

How can the runtime adapt dynamically to changes in workload?

How do we extend running applications?

How do we ensure that no tuples are dropped?

How do we make developing Streams programs easy?

How do we document Streams code, and enable understanding and sharing?

How do we enable Streams to readily integrate with other systems?

Challenge #5

How can I learn more?

► http://www-01.ibm.com/software/data/infosphere/streams/quick-start/
■ Install and go downloads for Streams

► http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/
■ Install and go downloads for BigInsights

► http://www-01.ibm.com/support/knowledgecenter/SSCRJU_3.2.1
■ Product documentation

► http://www.ibm.com/developerworks/bigdata/stream.html
■ Forums, help, more downloads

► https://github.com/IBMStreams
■ Useful Streams toolkits

► http://bigdatauniversity.com/
► http://www.ibmbigdatahub.com/
■ More reference information, videos, blogs, tools ...
