IBM Research
2014-03-24 IBM Research - Australia
Big Challenges Building a Real-World Big Data System
Or, what I did with eight years of my life
Andy Frenkiel
Big Data Talk, RMIT, 16 May 2014
Isolated Research
Joint Projects
IBM Divisions, Clients, Universities
Radical Collaboration
The World is Our Lab
’50s – ’80s: Hardware
’80s – ’00s: + Software & Services
IP in IBM Products
Proprietary
Leverage IP for Income
Income
’00s …: + Smarter Planet
Cross Licensing for Freedom of Action
Influence
2014-03-24 IBM Research - Australia GTO 2014
IBM Research Overview
Famous for its science and vital to IBM
5 Nobel Laureates
3000 Researchers
$6B R&D Budget
Innovation that Matters
6 Turing Awards
21 years of Patent Leadership
12 Labs around the World
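[Figure: world map of the 12 labs with founding years. Labs: Watson, Almaden, Austin, Zurich, Haifa, Tokyo, China, India, Brazil, Ireland, Melbourne (2011), Kenya (2012). Other years shown: 1945, 1955, 1956, 1972, 1982, 1995, 1995, 1998, 2011, 2011.]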
IBM Research – Australia: The Lab
► IBM Research Australia is located at Level 5, 204 Lygon Street, Carlton, VIC 3053
► ~85 FTE now, growing to 150 by May 2016
IBM Research – Australia: Mission
“Evidence-based Decision Making at the Speed of Thought”
Natural Resource Management
Disaster Management
Life Sciences*
Healthcare
*IBM Life Sciences Collaboratory not in scope
High Performance Computing, Stream Computing, Cloud Computing
Visualisation, Modelling, Analytics, Optimisation, Human Computing Interfaces
Assimilation Decision support
Capturing Command & control
Modeling and predictive analysis
IBM Research – Australia: Projects
[Figure: project map. Domain labels: Disaster Management, Natural Resources, Healthcare, Life Science, Nano Tech. Capability labels: Cloud Services; Information Composition; BAMS; Multimedia Analytics; Visualisation, Mobile, Social; Cognitive Computing; Industry Solutions; Computing as a Service; Science & Technology.]
Projects shown: AusCrisis Tracker, Modelling, Froth Flotation, Medical Sieve, Surveillance, RCH, GenesisCare, Counties Manukau, ADMP, Evacuation Planning, BioNano Sensors, Genomics, CX Lab / THINK Lab, Smarter Financial Life, Cardiac Modelling, Arterial Stent Modelling, VINE
We want to be famous for: Cognitive decision making (consequence management) delivered as a service
A few of the big challenges we encountered implementing IBM InfoSphere Streams, a big data product ...
Big Challenge #1
How do you write code to process streaming data?
Transform
Filter / Sample
Classify
Correlate
Annotate
Big Challenge #1
How do you write code to process streaming data?
[Figure: anatomy of an operator. Labels: Input Streams, Ports, Windows, Operator, Output Stream, Type]
Big Challenge #1, continued
How do you write code to process streaming data?
Anatomy of an Operator Invocation
stream<stream-type> stream-name = MyOperator(input-stream; …) {
    logic  logic ;
    window windowspec ;
    param  parameters ;
    output output ;
    config configuration ;
}
Example
stream<rstring item> Sale = Join(Bid; Ask) {
    window Bid : sliding, time(30);
           Ask : sliding, count(50);
    param  match : Bid.item == Ask.item && Bid.price >= Ask.price;
    output Sale : item = Bid.item;
}
► Operators share a common structure
  ■ italics are sections to fill in
► Reading an operator invocation
  ■ Declare a stream stream-name
  ■ With attributes from stream-type
  ■ that is produced by MyOperator
  ■ from the input(s) input-stream
  ■ MyOperator behavior defined by logic, parameters, windowspec, and configuration; output attribute assignments are specified in output
► For the example:
  ■ Declare the stream Sale with the attribute item, which is a raw (ASCII) string
  ■ Join the Bid and Ask streams with
  ■ sliding windows of 30 seconds on Bid, and 50 tuples of Ask
  ■ When items are equal, and Bid price is greater than or equal to Ask price
  ■ Output the item value on the Sale stream
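The window semantics of the Sale example can be approximated in plain Python. This is only a sketch of the idea, not how Streams implements Join: the class SlidingJoin, its method names, the explicit timestamps, and the exact eviction boundary are all assumptions made for illustration. Bids are kept for 30 seconds, the last 50 Asks are kept, and each arriving tuple is matched against the opposite window.

```python
from collections import deque


class SlidingJoin:
    """Toy two-input join: the Bid window is time-based, the Ask window is count-based."""

    def __init__(self, bid_seconds=30, ask_count=50):
        self.bid_seconds = bid_seconds
        self.bids = deque()                   # (timestamp, bid) pairs, oldest first
        self.asks = deque(maxlen=ask_count)   # deque discards the oldest ask itself

    def _evict_bids(self, now):
        # Drop bids that have slid out of the time window.
        while self.bids and now - self.bids[0][0] > self.bid_seconds:
            self.bids.popleft()

    @staticmethod
    def _match(bid, ask):
        # Same condition as the "match" parameter in the example.
        return bid["item"] == ask["item"] and bid["price"] >= ask["price"]

    def on_bid(self, now, bid):
        self._evict_bids(now)
        self.bids.append((now, bid))
        return [{"item": bid["item"]} for ask in self.asks if self._match(bid, ask)]

    def on_ask(self, now, ask):
        self._evict_bids(now)
        self.asks.append(ask)
        return [{"item": b["item"]} for _, b in self.bids if self._match(b, ask)]


join = SlidingJoin()
join.on_bid(0, {"item": "widget", "price": 10.0})
sales_early = join.on_ask(5, {"item": "widget", "price": 9.5})   # bid still in window
sales_late = join.on_ask(40, {"item": "widget", "price": 9.5})   # bid has expired
print(sales_early, sales_late)
```

The first Ask arrives while the Bid is still inside its 30 second window and produces a Sale tuple; the second arrives after the Bid has been evicted and produces nothing.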
Big Challenge #1, continued
How do you write code to process streaming data?
[Figure: Streams application graph. Labels: Source Operator, Sink Operator]
stream<TQRecT> TradeQuote = FileSource() {
    param file        : "TradesAndQuotes.csv.gz";
          format      : csv;
          compression : gzip;
}
stream<TradeFilterT> TradeFilter = Functor(TradeQuote) {
    param  filter : isTrade(ttype) && (ticker in $monitoredTickers);
    output TradeFilter : ts = timeStringToTimestamp(date, time, false);
}
stream<QuoteFilterT> QuoteFilter = Functor(TradeQuote) {
    param  filter : isQuote(ttype) && (ticker in $monitoredTickers);
    output QuoteFilter : ts = timeStringToTimestamp(date, time, false);
}
stream<VwapT, tuple<decimal64 sumvolume>> PreVwap = Aggregate(TradeFilter) {
    window TradeFilter : sliding, count(4), count(1), partitioned;
    param  partitionBy : ticker;
    output PreVwap : ticker = Any(ticker),
                     vwap = Sum(price * volume),
                     minprice = Min(price),
                     maxprice = Max(price),
                     avgprice = Average(price),
                     sumvolume = Sum(volume);
}
stream<VwapT> Vwap = Functor(PreVwap) {
    output Vwap : vwap = vwap / sumvolume;
}
stream<BargainIndexT> BargainIndex = Join(Vwap as V; QuoteFilter as Q) {
    window V : sliding, count(1), partitioned;
           Q : sliding, count(0);
    param  partitionByLHS : V.ticker;
           equalityLHS : V.ticker;
           equalityRHS : Q.ticker;
    output BargainIndex : index = vwap > askprice ? asksize * exp(vwap - askprice) : 0d,
                          ts = (rstring) ctime(ts);
}
() as SinkOp = FileSink(BargainIndex) {
    param file   : "out";
          format : txt;
}
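The arithmetic in the PreVwap, Vwap and BargainIndex steps can be traced with a short Python sketch. It is an illustration only, not the SPL runtime: the function names on_trade and bargain_index and the sample trades are made up, and the choice to emit nothing until a ticker's window holds four trades is an assumption about how the sliding window behaves.

```python
from collections import defaultdict, deque
from decimal import Decimal
import math

WINDOW = 4  # mirrors "sliding, count(4), count(1)": keep 4 trades, fire on each new trade

# One window per ticker, mirroring "partitioned" with partitionBy : ticker.
windows = defaultdict(lambda: deque(maxlen=WINDOW))


def on_trade(ticker, price, volume):
    """PreVwap followed by Vwap: aggregate the ticker's window, then divide."""
    win = windows[ticker]
    win.append((Decimal(price), Decimal(volume)))
    if len(win) < WINDOW:
        return None  # assumption: incomplete windows are not aggregated
    sum_pv = sum(p * v for p, v in win)      # Sum(price * volume)
    sum_volume = sum(v for _, v in win)      # Sum(volume)
    return {"ticker": ticker,
            "vwap": sum_pv / sum_volume,     # the division done by the Vwap Functor
            "minprice": min(p for p, _ in win),
            "maxprice": max(p for p, _ in win)}


def bargain_index(vwap, askprice, asksize):
    """Same expression as the BargainIndex output clause."""
    if vwap > askprice:
        return asksize * math.exp(float(vwap - askprice))
    return 0.0


sample = [("10.00", "100"), ("12.00", "300"), ("10.00", "100"), ("12.00", "300")]
results = [on_trade("IBM", p, v) for p, v in sample]
vwap = results[-1]["vwap"]
index = bargain_index(vwap, Decimal("11.00"), 10)
print(vwap, index)
```

A quote whose ask price is below the current VWAP gets a positive index that grows exponentially with the gap; any other quote gets zero.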
Big Challenge #2
How do you implement an application consisting of 1000s of operators and data streams?
4 wells: 10s of operators and streams
20 wells: 100s of operators and streams
100 wells: 1000s of operators and streams
Big Challenge #2, continued
How do you implement an application consisting of 1000s of operators and data streams?
… Use hierarchy
...
Big Challenge #2, continued
How do you implement an application consisting of 1000s of operators and data streams?
… Use code to build code
Perl generates SPL
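The generator in the talk was written in Perl. As an illustration of the same idea only, here is a small Python sketch that emits SPL-like text with one pair of operators per well. The type WellT, the operator WellSource, its well parameter, and the stream names are placeholders invented for this sketch, not part of any real application.

```python
def generate_well_graph(num_wells):
    """Return SPL-like source text with one Source/Functor pair per well.

    WellT, WellSource and the stream names are hypothetical placeholders.
    """
    lines = ["composite Main {", "  graph"]
    for i in range(num_wells):
        lines.append(f"    stream<WellT> Raw{i} = WellSource() {{ param well : {i}; }}")
        lines.append(f"    stream<WellT> Clean{i} = Functor(Raw{i}) {{ }}")
    lines.append("}")
    return "\n".join(lines)


spl_text = generate_well_graph(4)
print(spl_text)
```

Changing the argument from 4 to 100 grows the generated graph without anyone editing operator invocations by hand, which is the point of generating the code.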
Big Challenge #2, continued
How do you implement an application consisting of 1000s of operators and data streams?
composite Main {
  graph
    stream<Type> Src = Source() {}
    // Run two copies of the AB subgraph in parallel channels
    @parallel(width=2)
    stream<Type> Res = AB(Src) {}
    () as Snk = Sink(Res) {}
}
composite AB(input In; output B) {
  graph
    stream<Type> A = Functor(In) {}
    stream<Type> B = Functor(A) {}
}
Logical:  Src -> A -> B -> Snk
Physical: Src -> A[0] -> B[0] -> Snk
          Src -> A[1] -> B[1] -> Snk
… Automatically generate operators for subgraphs that can be parallelised
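The logical-to-physical expansion can be sketched as a graph rewrite in Python. This is a toy model under stated assumptions, not the Streams compiler: the function expand_parallel and its edge-list representation are invented here, and splitting and merging at the region boundary are shown as plain fan-out and fan-in edges.

```python
def expand_parallel(edges, region, width):
    """Replicate the operators in `region` into `width` channels.

    edges  : list of (upstream, downstream) operator names in the logical graph
    region : set of operator names inside the parallel region
    """
    def copies(op):
        return [f"{op}[{c}]" for c in range(width)] if op in region else [op]

    physical = []
    for src, dst in edges:
        if src in region and dst in region:
            # Inside the region each channel stays separate.
            physical += [(f"{src}[{c}]", f"{dst}[{c}]") for c in range(width)]
        else:
            # Region boundary (fan-out or fan-in), or an edge outside the region.
            physical += [(s, d) for s in copies(src) for d in copies(dst)]
    return physical


logical = [("Src", "A"), ("A", "B"), ("B", "Snk")]
physical = expand_parallel(logical, region={"A", "B"}, width=2)
for edge in physical:
    print(edge)
```

With width=2 the three logical edges become six physical edges: Src feeds A[0] and A[1], each A[c] feeds its own B[c], and both B copies feed Snk.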
Big Challenge #3
What should run where?
… Streams Runtime Supports Placement Criteria
[Figure: operators placed across x86 hosts. Operators shown: Meters, Company Filter, Usage Model, Usage Contract, Text Extract, Degree History, Compare History, Store History, Temp Action, Season Adjust, Daily Adjust]
Host pools can force operators to be on hosts with SolidDB installed
Operator placement constraints allow for co-location, ex-location, and isolation of operators
SolidDB could be wrapped as a custom operator for dynamic deployment and relocation
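To make the three kinds of constraint concrete, here is a small Python checker for a proposed operator-to-host placement. It is a sketch under assumptions, not the Streams scheduler: check_placement, the host names, the "soliddb" tag, and the choice of Store History as the operator that needs SolidDB are all invented for the example, and isolation is simplified to "shares its host with no other operator".

```python
def check_placement(placement, colocate=(), exlocate=(), isolate=(),
                    required_tag=None, host_tags=None):
    """Return the list of violated constraints for a {operator: host} placement."""
    problems = []
    for group in colocate:                    # all members on the same host
        if len({placement[op] for op in group}) > 1:
            problems.append(("colocate", tuple(group)))
    for group in exlocate:                    # all members on different hosts
        hosts = [placement[op] for op in group]
        if len(set(hosts)) < len(hosts):
            problems.append(("exlocate", tuple(group)))
    for op in isolate:                        # no other operator on its host
        if any(h == placement[op] for o, h in placement.items() if o != op):
            problems.append(("isolate", op))
    for op, tag in (required_tag or {}).items():   # host pool modelled as a host tag
        if tag not in (host_tags or {}).get(placement[op], set()):
            problems.append(("pool", op))
    return problems


host_tags = {"host1": {"soliddb"}, "host2": set()}
constraints = dict(
    colocate=[("Store History", "Compare History")],
    exlocate=[("Store History", "Text Extract")],
    required_tag={"Store History": "soliddb"},
    host_tags=host_tags,
)
placement = {"Store History": "host1", "Compare History": "host1", "Text Extract": "host2"}
ok = check_placement(placement, **constraints)
bad = check_placement({**placement, "Store History": "host2"}, **constraints)
print(ok, bad)
```

Moving Store History to host2 breaks all three of its constraints at once, which is the kind of placement a scheduler has to rule out.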
Big Challenge #3, continued
What should run where?
… Streams runtime optimises at deploy time
[Figure: operators spread across five x86 hosts. Operators shown: Meters, Company Filter, Usage Model, Usage Contract, Temp Action, Text Extract, Degree History, Compare History, Store History, Season Adjust, Daily Adjust]
Optimising scheduler assigns PEs to hosts, and continually manages resource allocation
Dynamically add hosts and jobs
New jobs work with existing jobs
Big Challenge #4
How do you debug the streaming data flow?
© 2013 IBM Corporation
[streamsadmin@streams bin]$ ./standalone
IBM Stream Debugger (SDB), pid: 3982
Standalone application execution is suspended.
Set initial probe points, then run "g" command to continue execution.
(sdb) o
#in #out Operator     Class
1   1    QuoteFilter  QuoteFilter
1   0    SinkOp       SinkOp
1   1    TradeFilter  TradeFilter
0   1    TradeQuote   TradeQuote
1   1    PreVwap      PreVwap
2   1    BargainIndex BargainIndex
1   1    Vwap         Vwap
(sdb) b Tra
TradeFilter TradeQuote
(sdb) b TradeFilter o 0
Set + 0 Breakpoint TradeFilter o 0 stopped:false
(sdb) b PreVwap o 0
Set + 1 Breakpoint PreVwap o 0 stopped:false
(sdb) g
(sdb) + 0 Breakpoint TradeFilter o 0 dropped:false stopped:true
  price, 83.48, decimal64
  volume, 74200, decimal64
  ts, (1135654207,521000000,0), timestamp
  ticker, "IBM", rstring
(sdb) u 0 price 87.00
  price, 87.00, decimal64
  volume, 74200, decimal64
  ts, (1135654207,521000000,0), timestamp
  ticker, "IBM", rstring
(sdb) c
(sdb) + 0 Breakpoint TradeFilter o 0 dropped:false stopped:true
  price, 83.45, decimal64
  volume, 200, decimal64
  ts, (1135654213,518000000,0), timestamp
  ticker, "IBM", rstring
(sdb) c
(sdb) + 0 Breakpoint TradeFilter o 0 dropped:false stopped:true
  price, 83.45, decimal64
  volume, 300, decimal64
  ts, (1135654217,627000000,0), timestamp
  ticker, "IBM", rstring
(sdb) c
(sdb) + 0 Breakpoint TradeFilter o 0 dropped:false stopped:true
  price, 83.45, decimal64
  volume, 300, decimal64
  ts, (1135654219,132000000,0), timestamp
  ticker, "IBM", rstring
(sdb) c
(sdb) + 1 Breakpoint PreVwap o 0 dropped:false stopped:true
  ticker, "IBM", rstring
  minprice, 83.45, decimal64
  maxprice, 87.00, decimal64
  avgprice, 84.33750000000000, decimal64
  vwap, 6522160.00, decimal64
  sumvolume, 75000, decimal64
(sdb)
Set Breakpoints
… Use a Streams Debugger
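The PreVwap tuple in the debugger session can be cross-checked by hand. The Python below recomputes the aggregate from the four TradeFilter tuples shown in the session, using the first price as changed by the "u 0 price 87.00" command. It is a worked check of the arithmetic only; the variable names are chosen here for readability.

```python
from decimal import Decimal

# (price, volume) for the four TradeFilter tuples in the session,
# with the first price already updated from 83.48 to 87.00.
trades = [
    (Decimal("87.00"), Decimal("74200")),
    (Decimal("83.45"), Decimal("200")),
    (Decimal("83.45"), Decimal("300")),
    (Decimal("83.45"), Decimal("300")),
]

prices = [p for p, _ in trades]
minprice = min(prices)
maxprice = max(prices)
avgprice = sum(prices) / len(prices)
vwap = sum(p * v for p, v in trades)    # at this stage vwap holds Sum(price * volume)
sumvolume = sum(v for _, v in trades)
print(minprice, maxprice, avgprice, vwap, sumvolume)
```

The values agree with the PreVwap breakpoint output (minprice 83.45, maxprice 87.00, avgprice 84.3375, vwap 6522160.00, sumvolume 75000), which confirms that the edited price of 87.00, not the original 83.48, flowed downstream.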
Other Big Challenges
How do we support user-defined type-generic Streams operators and functions?
How can the runtime adapt dynamically to changes in workload?
How do we extend running applications?
How do we ensure that no tuples are dropped?
How do we make developing Streams programs easy?
How do we document Streams code, and enable understanding and sharing?
How do we enable Streams to readily integrate with other systems?
Challenge #5
How can I learn more?
http://www-01.ibm.com/software/data/infosphere/streams/quick-start/ Install and go downloads for Streams
http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/ Install and go downloads for BigInsights
http://www-01.ibm.com/support/knowledgecenter/SSCRJU_3.2.1 Product documentation
http://www.ibm.com/developerworks/bigdata/stream.html Forums, help, more downloads
https://github.com/IBMStreams Useful Streams toolkits
http://bigdatauniversity.com/
http://www.ibmbigdatahub.com/ More reference information, videos, blogs, tools ...