

IBM InfoSphere Streams Technical Overview

Jerome Chailloux

Europe IOT - Sr. Technical Field Specialist - Big Data, Linux Advocate

[email protected]

© 2013 IBM Corporation

February 21, 2013


IBM InfoSphere Streams v3.0

A platform for real-time analytics on BIG data

• Volume

Just-in-time decisions


Categories of Problems Solved by Streams

• Applications that require on-the-fly processing, filtering and analysis of streaming data
  – Sensors: environmental, industrial, surveillance video, GPS, …
  – “Data exhaust”: network/system/web server/app server log files
  – High-rate transaction data: financial transactions, call detail records

• Criteria: two or more of the following
  – Messages are processed in isolation or in limited data windows
  – Sources include non-traditional data (spatial, imagery, text, …)
  – Sources vary in connection methods, data rates, and processing requirements, presenting integration challenges
  – Data rates/volumes require the resources of multiple processing nodes
  – Analysis and response are needed with sub-millisecond latency
  – Data rates and volumes are too great for store-and-mine approaches


The Big Data Ecosystem: Interoperability is Key


Use Case: Law Enforcement and Security


Predictive Analytics in a Neonatal ICU

• Real-time analytics and correlations on physiological data streams
  – Blood pressure, temperature, EKG, blood oxygen saturation, etc.

• Early detection of the onset of potentially life-threatening conditions
  – Up to 24 hours earlier than current medical practices
  – Early intervention leads to lower patient morbidity and better long-term outcomes

• Technology also enables physicians to verify new clinical hypotheses


Smarter Faster Cheaper CDR Processing

[Figure: InfoSphere Streams, xDR Hub]

Key Requirements: Price/Performance and Scaling

• 6 billion CDRs per day
• Deduplication over 15 days
• Processing latency reduced from 12 hours to a few seconds
• 6 machines (using ½ processor capacity)
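The deduplication requirement above can be illustrated with a minimal sketch in plain Python. It is not the Streams implementation: the class name, the record keys, and the assumption that records arrive in non-decreasing timestamp order are all hypothetical and chosen only for illustration.

```python
from collections import OrderedDict

class TimeWindowDeduplicator:
    """Drop records whose key was already seen within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        # key -> timestamp of first sighting, oldest entry first
        self._first_seen = OrderedDict()

    def is_new(self, key, timestamp):
        # Evict keys whose window has expired. Insertion order equals
        # timestamp order because input is assumed to be time-ordered.
        while self._first_seen:
            oldest_key, oldest_ts = next(iter(self._first_seen.items()))
            if timestamp - oldest_ts < self.window_seconds:
                break
            self._first_seen.popitem(last=False)
        if key in self._first_seen:
            return False
        self._first_seen[key] = timestamp
        return True

FIFTEEN_DAYS = 15 * 24 * 3600  # window length in seconds
dedup = TimeWindowDeduplicator(FIFTEEN_DAYS)
cdrs = [
    ("call-1", 0),
    ("call-2", 10),
    ("call-1", 3600),              # duplicate inside the window: dropped
    ("call-1", FIFTEEN_DAYS + 1),  # window has expired: kept again
]
unique = [(key, ts) for key, ts in cdrs if dedup.is_new(key, ts)]
```

Memory grows with the number of distinct keys inside the window, which is one reason a workload of this size would need the resources of several machines.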


Surveillance and Physical Security: TerraEchos (Business Partner)

• Use scenario
  – State-of-the-art covert surveillance system based on the Streams platform
  – Acoustic signals from buried fiber optic cables are monitored, analyzed and reported in real time for necessary action
  – Currently designed to scale up to 1600 streams of raw binary data

• Requirement
  – Real-time processing of multi-modal signals (acoustics, video, etc.)
  – Easy to expand, dynamic
  – 3.5M data elements per second

• Winner 2010 IBM CTO Innovation Award


Use Cases: Video Processing (Contour Detection)

[Figure: Original Picture]

How Streams Works


Scalable Stream Processing

• Streams programming model: construct a graph
  – Mathematical concept
    • Not a line, bar, or pie chart!
    • Also called a network
    • Familiar: for example, a tree structure is a graph
  – Consisting of operators and the streams that connect them
    • The vertices (or nodes) and edges of the mathematical graph
    • A directed graph: the edges have a direction (arrows)

• Streams runtime model: distributed processes
  – Single or multiple operators form a Processing Element (PE)
  – Compiler and runtime services make it easy to deploy PEs
    • On one machine
    • Across multiple hosts in a cluster when scaled-up processing is required
  – All links and data transport are handled by runtime services
    • Automatically
    • With manual placement directives where required
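The graph model above can be sketched in a few lines of Python (3.9 or later, for `graphlib`). This is only an illustration of operators as vertices and streams as directed edges. The operator names and the batch-style `run_graph` helper are hypothetical, and a real Streams job processes tuples continuously rather than in one pass.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical operators: each maps a list of input tuples to output tuples.
operators = {
    "source": lambda _: [1, 2, 3, 4, 5, 6],
    "evens": lambda xs: [x for x in xs if x % 2 == 0],
    "odds": lambda xs: [x for x in xs if x % 2 == 1],
    "merge": lambda xs: sorted(xs),
}

# Streams as directed edges: each operator lists the operators it reads from.
upstream = {
    "source": set(),
    "evens": {"source"},
    "odds": {"source"},
    "merge": {"evens", "odds"},
}

def run_graph(operators, upstream):
    """Evaluate every operator after all of its upstream operators."""
    outputs = {}
    for name in TopologicalSorter(upstream).static_order():
        # Concatenate the outputs of all upstream operators, in name order.
        inputs = [t for up in sorted(upstream[name]) for t in outputs[up]]
        outputs[name] = operators[name](inputs)
    return outputs

outputs = run_graph(operators, upstream)
```

Because the graph only names which operator feeds which, the same description could be deployed on one machine or spread over several, which is the point the runtime model makes.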

[Figure: operators (OP) connected by streams]

From Operators to Running Jobs

• Jobs are deployed to a Streams runtime environment, known as a Streams Instance (or simply, an instance)

• An instance can include a single processing node (hardware)

• Or multiple processing nodes


InfoSphere Streams Objects: Runtime View

• Instance
  – Runtime instantiation of InfoSphere Streams executing across one or more hosts
  – Collection of components and services

• Processing Element (PE)
  – Fundamental execution unit that is run by the Streams instance
  – Can encapsulate a single operator or many “fused” operators

• Job
  – A deployed Streams application executing in an instance
  – Consists of one or more PEs

[Figure: an Instance containing a Job; PEs hosting operators run on two Nodes and are connected by Streams 1 to 5]

Competition: Complex-Event Processing (CEP)

[Figure labels: “real time”, “ultra-low” latency, event stream processing]

Streams

• Analytics on continuous data streams
• Simple to extremely complex analytics
• Scale for computational intensity
• Supports a whole range of relational and non-relational data types

CEP

• Analysis on discrete business events
• Rules-based (if-then-else) with correlation across event types
• Only structured data types are supported
• Modest data rates

IBM InfoSphere Streams 3.0


What is Streams Processing Language?

• Designed for stream computing
  – Define a streaming-data flow graph
  – Rich set of data types to define tuple attributes

• Declarative
  – Operator invocations name the input and output streams
  – Referring to streams by name is enough to connect the graph

• Procedural support
  – Full-featured imperative language
  – Custom logic in operator invocations
  – Expressions in attribute assignments and parameter definitions

• Extensible
  – User-defined data types
  – Custom functions written in SPL or a native language (C++ or Java)
  – Custom operators written in SPL
  – User-defined operators written in a native language (C++ or Java)


Streams Standard Toolkit: Relational Operators

• Relational operators
  – Functor     Perform tuple-level manipulations
  – Filter      Remove some tuples from a stream
  – Aggregate   Group and summarize incoming tuples
  – Sort        Impose an order on incoming tuples in a stream
  – Join        Correlate two streams
  – Punctor     Insert window punctuation markers into a stream
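To make the roles of Filter, Functor and Aggregate concrete, here is a rough Python analogy built from generators. It is not SPL and does not reproduce the operators' real parameters. The function names, the sample readings, and the choice of a count-based tumbling window are assumptions made for this sketch.

```python
from itertools import islice

def filter_op(stream, predicate):
    """Filter: pass through only the tuples that satisfy the predicate."""
    return (t for t in stream if predicate(t))

def functor_op(stream, transform):
    """Functor: tuple-level manipulation, one output tuple per input tuple."""
    return (transform(t) for t in stream)

def aggregate_op(stream, size, summarize):
    """Aggregate: summarize each tumbling window of `size` tuples."""
    it = iter(stream)
    while True:
        window = list(islice(it, size))
        if len(window) < size:
            return  # an incomplete trailing window is discarded in this sketch
        yield summarize(window)

# Hypothetical sensor readings; -999.0 stands for a faulty sample.
readings = [{"sensor": "a", "celsius": c}
            for c in (20.0, 21.0, -999.0, 22.0, 23.0, 24.0, 25.0)]

valid = filter_op(readings, lambda t: t["celsius"] > -100.0)
fahrenheit = functor_op(valid, lambda t: {**t, "fahrenheit": t["celsius"] * 9 / 5 + 32})
averages = list(aggregate_op(
    fahrenheit, 3, lambda w: sum(t["fahrenheit"] for t in w) / len(w)))
```

Each stage pulls tuples lazily from the previous one, so nothing is buffered beyond the current window.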


Streams Standard Toolkit: Adapter Operators

• Adapter operators
  – FileSource      Read data from files in formats such as csv, line, or binary
  – FileSink        Write data to files in formats such as csv, line, or binary
  – DirectoryScan   Detect files to be read as they appear in a given directory
  – TCPSource       Read data from TCP sockets in various formats
  – TCPSink         Write data to TCP sockets in various formats
  – UDPSource       Read data from UDP sockets in various formats
  – UDPSink         Write data to UDP sockets in various formats
  – Export          Make a stream available to other jobs in the same instance
  – Import          Connect to streams exported by other jobs
  – MetricsSink     Create displayable metrics from numeric expressions


Streams Standard Toolkit: Utility Operators

• Workload generation and custom logic
  – Beacon     Emit generated values; timing and number configurable
  – Custom     Apply arbitrary SPL logic to produce tuples

• Coordination and timing
  – Throttle   Make a stream flow at a specified rate
  – Delay      Time-shift an entire stream relative to other streams
  – Gate       Wait for acknowledgement from downstream operator
  – Switch     Block or allow the flow of tuples based on control input


Streams Standard Toolkit: Utility Operators (cont’d)

• Parallel flows
  – Barrier         Synchronize tuples from sequence-correlated streams
  – Pair            Group tuples from multiple streams of same type
  – Split           Forward tuples to output streams based on a predicate
  – ThreadedSplit   Distribute tuples over output streams by availability
  – Union           Construct an output tuple from each input tuple
  – DeDuplicate     Suppress duplicate tuples seen within a given time period

• Miscellaneous
  – DynamicFilter        Filter tuples based on criteria that can change while it runs
  – JavaOp               General-purpose operator for encapsulating Java code
  – Parse                Parse blob data for use with user-defined adapters
  – Format               Format blob data for use with user-defined adapters
  – Compress             Compress blob data
  – Decompress           Decompress blob data
  – CharacterTransform   Convert blob data from one encoding to another
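The parallel-flow pattern listed above (Split a stream, process the branches separately, then resynchronize with Barrier) can be approximated as follows. This is a simplified Python sketch under stated assumptions: the helper names and the "first matching predicate wins" routing rule are illustrative choices, not the documented behavior of the SPL operators.

```python
def split(stream, predicates):
    """Split: route each tuple to the first output whose predicate matches."""
    outputs = [[] for _ in predicates]
    for t in stream:
        for out, pred in zip(outputs, predicates):
            if pred(t):
                out.append(t)
                break  # a tuple that matches no predicate is dropped
    return outputs

def barrier(*streams):
    """Barrier: emit one combined tuple once every input has supplied its next tuple."""
    return list(zip(*streams))

numbers = [1, 2, 3, 4, 5, 6]
small, large = split(numbers, [lambda x: x < 4, lambda x: x >= 4])

# Each branch is processed independently (here: trivially) ...
doubled = [x * 2 for x in small]
negated = [-x for x in large]

# ... and the results are paired up again in arrival order.
synchronized = barrier(doubled, negated)
```

`zip` stops at the shortest input, which mirrors the idea that a barrier only fires when all of its inputs have a tuple available.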


XML Support: Built Into SPL

• XML type
  – With XPath expressions and functions to parse and manipulate XML data
  – Convert XML to tuples

• XQuery functions
  – Use XML data on the fly

• Adapters support XML format
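As an illustration of converting XML to tuples with XPath-style expressions, the following sketch uses Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The document structure and field names are invented for the example, and SPL's own XML functions are not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML payload; element and attribute names are made up.
doc = """<trades>
  <trade symbol="IBM"><price>200.5</price><qty>100</qty></trade>
  <trade symbol="XYZ"><price>10.0</price><qty>5</qty></trade>
</trades>"""

root = ET.fromstring(doc)

# Convert each <trade> element into a (symbol, price, quantity) tuple.
tuples = [
    (t.get("symbol"), float(t.findtext("price")), int(t.findtext("qty")))
    for t in root.findall("./trade")
]

# An attribute predicate selects a subset of elements.
ibm_only = root.findall("./trade[@symbol='IBM']")
```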


Streams Extensibility: Toolkits

• Like packages, plugins, addons, extenders, etc.
  – Reusable sets of operators, types, and functions
  – What can be included in a toolkit?
    • Primitive and composite operators
    • User-defined types
    • Native and SPL functions
    • Sample applications, utilities
    • Tools, documentation, data, etc.
  – Versioning is supported
  – Define dependencies on other versioned assets (toolkits, Streams itself)


Integration: the Internet and Database Toolkits

• Integration with traditional sources and consumers

• Internet Toolkit
  – InetSource              Periodically retrieve data from HTTP, FTP, RSS, and files

• Database Toolkit
    • ODBC databases: DB2, Informix, Oracle, solidDB, MySQL, SQL Server, Netezza
  – ODBCAppend              Insert rows into an SQL database table
  – ODBCEnrich              Extend streaming data based on lookups in database tables
  – ODBCRun                 Perform SQL queries with parameters from input tuples
  – ODBCSource              Read data from an SQL database
  – SolidDBEnrich           Perform table lookups in an in-memory database
  – DB2SplitDB              Split a stream by DB2 partition key
  – DB2PartitionedAppend    Write data to table in specified DB2 partition
  – NetezzaLoad             Perform high-speed loads into a Netezza database
  – NetezzaPrepareLoad      Convert tuple to delimited string for Netezza loads
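The enrichment pattern (extending each streaming tuple with values looked up in a database table) can be sketched as below. Python's built-in `sqlite3` with an in-memory database stands in for an ODBC connection. The table, columns, and the `enrich` function are hypothetical and are not the ODBCEnrich operator's interface.

```python
import sqlite3

# In-memory reference table standing in for an ODBC data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, tier TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "gold"), (2, "Bob", "silver")],
)

def enrich(stream, conn):
    """Add name and tier to each tuple; unknown customers get None values."""
    for t in stream:
        row = conn.execute(
            "SELECT name, tier FROM customers WHERE id = ?", (t["customer_id"],)
        ).fetchone()
        name, tier = row if row else (None, None)
        yield {**t, "name": name, "tier": tier}

events = [{"customer_id": 1, "amount": 30}, {"customer_id": 3, "amount": 5}]
enriched = list(enrich(events, conn))
```

A per-tuple query is the simplest form of the idea; at high data rates a cache or an in-memory database, as the SolidDBEnrich entry suggests, would be the more likely choice.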


Integration: the Big Data Toolkit

• Integration with IBM’s Big Data Platform

• Data Explorer
  – DataExplorerPush     Insert records into a Data Explorer index

• BigInsights: Hadoop Distributed File System
  – HDFSDirectoryScan    Like DirectoryScan, only for HDFS
  – HDFSFileSource       Like FileSource, only for HDFS
  – HDFSFileSink         Like FileSink, only for HDFS
  – HDFSSplit            Write batches of data in parallel to HDFS


Integration: the DataStage Integration Toolkit

• Streams real-time analytics and DataStage information integration
  – Perform deeper analysis on the data as part of the information integration flow
  – Get more timely results and offload some analytics load from the warehouse

• Operators and tooling
  – Adapters to exchange data between Streams and DataStage
    • DSSource
    • DSSink
  – Tooling to generate integration code
  – DataStage provides Streams connectors


Integration: the Messaging Toolkit

• Integrate with IBM WebSphere MQ
  – Create a stream from a subscription to an MQ topic or queue
  – Publish a stream to an MQ topic or queue

• Operators
  – XMSSource   Read data from an MQ queue or topic
  – XMSSink     Send messages to applications that use WebSphere MQ


Integration and Analytics: the Text Toolkit

• Derive structured information from unstructured text
  – Apply extractors
    • Programs that encode rules for extracting information from text
    • Written in AQL (Annotation Query Language)
    • Developed against a static repository of text data
    • AQL files can be combined into modules (directories)
    • Modules can be compiled into TAM files
    • NOTE: Old-style AOG files not supported by BigInsights 2.0 and this toolkit
      – Can be used in Streams 3.0 with the Deprecated Text Toolkit

• Streams Studio plugins for developing AQL queries
  – Same tooling as in BigInsights

• Operator and utility
  – TextExtract   Run AQL queries (module or TAM) over text documents
  – createtypes script
    • Create stream type definitions that match the output views of the extractors

• Plays a major role in accelerators
  – SDA: Social Data Analytics
  – MDA: Machine Data Analytics


Analytics: The Complex Event Processing Toolkit

• MatchRegex   Use patterns to detect composite events
  – In streams of simple events (tuples)
  – Easy-to-use regex-style pattern match of user-defined predicates

• Integration in Streams allows CEP-style processing with high performance and rich analytics
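One way to picture a regex-style match over user-defined predicates is to map each simple event to a letter and run an ordinary regular expression over the resulting string. The sketch below does this in Python for a price series. The predicates, letters, and pattern are assumptions for illustration and do not reflect the MatchRegex operator's actual syntax.

```python
import re

def classify(prev, curr):
    """Hypothetical predicates: U = price rose, D = price fell, F = flat."""
    if curr > prev:
        return "U"
    if curr < prev:
        return "D"
    return "F"

def find_composite_events(prices, pattern):
    """Return (first_index, last_index) price ranges whose moves match the pattern."""
    # Symbol i describes the move from prices[i] to prices[i + 1].
    symbols = "".join(classify(p, c) for p, c in zip(prices, prices[1:]))
    return [(m.start(), m.end()) for m in re.finditer(pattern, symbols)]

prices = [10, 11, 12, 11, 10, 10, 11, 12, 13, 12]

# Composite event: at least two rises followed by at least one fall.
peaks = find_composite_events(prices, r"U{2,}D+")
```

Here the moves encode to `UUDDFUUUD`, so two composite events are found: one over prices 0 to 4 and one over prices 5 to 9.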


Analytics: The Geospatial Toolkit

• High-performance analysis and processing of geospatial data

• Enables Location Based Services (LBS)
  – Smarter Transportation: monitor traffic speed and density
  – Geofencing: detect when objects enter or leave a specified area

• Geospatial data types
  – e.g. Point, LineString, Polygon

• Geospatial functions
  – e.g. Distance, Map point to LineString, IsContained, etc.
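The geofencing idea and two of the function types named above (a distance and a containment test) can be sketched with textbook formulas. This is plain Python, not the toolkit's API: the function names are hypothetical, the distance uses the haversine formula on a spherical Earth, and the containment test is planar ray casting on (x, y) coordinates.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points, assuming a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def is_contained(point, polygon):
    """Ray casting: count polygon edges crossed by a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def geofence_events(track, polygon):
    """Emit (index, 'enter' | 'leave') whenever the track crosses the fence."""
    events = []
    was_inside = None
    for i, p in enumerate(track):
        now_inside = is_contained(p, polygon)
        if was_inside is not None and now_inside != was_inside:
            events.append((i, "enter" if now_inside else "leave"))
        was_inside = now_inside
    return events

fence = [(0, 0), (4, 0), (4, 4), (0, 4)]
track = [(-1, 2), (1, 2), (3, 2), (5, 2)]
events = geofence_events(track, fence)
```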


Analytics: The TimeSeries Toolkit

• Apply Digital Signal Processing techniques
  – Find patterns and anomalies in real time
  – Predict future values in real time

• A rich set of functionality for working with time series data
  – Generation
    • Synthesize specific wave forms (e.g., sawtooth, sine)
  – Preprocessing
    • Preparation and conditioning (ReSample, TSWindowing)
  – Analysis
    • Statistics, correlations, decomposition and transformation
  – Modeling
    • Prediction, regression and tracking (e.g. Holt-Winters, GAMLearner)
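To give a feel for the prediction step, the sketch below implements Holt's linear (double exponential) smoothing, which is the level-and-trend core of the Holt-Winters family without the seasonal component. It is a simplified stand-in written for this overview, not the toolkit's operator. The function name, the initialization from the first two points, and the smoothing parameters are assumptions.

```python
def holt_linear_forecast(series, alpha, beta, horizon):
    """Forecast `horizon` future values with Holt's linear trend method.

    alpha smooths the level, beta smooths the trend (both in 0..1).
    No seasonal component is modeled in this sketch.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        # New level: blend the observation with the previous one-step forecast.
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        # New trend: blend the latest level change with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]

# A perfectly linear series is continued along the same line.
forecast = holt_linear_forecast([1.0, 3.0, 5.0, 7.0, 9.0], alpha=0.5, beta=0.3, horizon=3)
```

Because the state is just a level and a trend, the update can run once per incoming tuple, which is what makes this family of models suitable for real-time use.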


Analytics: the Mining Toolkit

• Enables scoring of real-time data in a Streams application
  – Scoring is performed against a predefined model (in PMML file)
  – Supports a variety of model types and scoring algorithms

• Predictive Model Markup Language (PMML)
  – Standard for statistical and data mining models
  – Produced by Analytics tools: SPSS, ISAS, etc.
  – XML representation defined by http://www.dmg.org/

• Scoring operators
  – Classification   Assign tuple to a class and report confidence
  – Clustering       Assign tuple to a cluster and compute score
  – Regression       Calculate predicted value and standard deviation
  – Associations     Identify the applicable rule and report consequent (rule head), support, and confidence

• Supports dynamic replacement of the PMML model used by an operator


Analytics: Streams + SPSS Real Time Scoring Service

• Included with SPSS, not with Streams

Analytics: the Financial Services Toolkit

• Adapters
  – Financial Information Exchange (FIX)
  – WebSphere Front Office for Financial Markets (WFO)
  – WebSphere MQ Low-Latency Messaging (LLM) Adapters

• Types and Functions
  – OptionType (put, call), TxType (buy, sell), Trade, Quote, OptionQuote, Order
  – Coefficient of Correlation
  – “The Greeks” (put/call values, delta, theta, etc.)

• Operators
  – Based on QuantLib financial analytics open source package
  – Compute theoretical value of an option (European, American)

• Example applications
  – Equities Trading
  – Options Trading
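As an example of what computing the theoretical value of an option involves, the following sketch prices a European option with the standard Black-Scholes formula and also returns the call delta, one of "the Greeks". It is written directly from the textbook formula and uses neither QuantLib nor the toolkit's operators. Function and parameter names are hypothetical, dividends are ignored, and American options would need a different method.

```python
import math

def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def _d1_d2(spot, strike, rate, volatility, years):
    d1 = (math.log(spot / strike) + (rate + 0.5 * volatility ** 2) * years) / (
        volatility * math.sqrt(years))
    return d1, d1 - volatility * math.sqrt(years)

def black_scholes(spot, strike, rate, volatility, years, option_type="call"):
    """Theoretical value of a European option on a non-dividend-paying asset."""
    d1, d2 = _d1_d2(spot, strike, rate, volatility, years)
    discounted_strike = strike * math.exp(-rate * years)
    if option_type == "call":
        return spot * norm_cdf(d1) - discounted_strike * norm_cdf(d2)
    if option_type == "put":
        return discounted_strike * norm_cdf(-d2) - spot * norm_cdf(-d1)
    raise ValueError("option_type must be 'call' or 'put'")

def call_delta(spot, strike, rate, volatility, years):
    """Sensitivity of the call value to the spot price."""
    d1, _ = _d1_d2(spot, strike, rate, volatility, years)
    return norm_cdf(d1)

# At-the-money example: spot 100, strike 100, 5% rate, 20% volatility, 1 year.
call = black_scholes(100.0, 100.0, 0.05, 0.2, 1.0, "call")  # about 10.45
put = black_scholes(100.0, 100.0, 0.05, 0.2, 1.0, "put")    # about 5.57
delta = call_delta(100.0, 100.0, 0.05, 0.2, 1.0)
```

In a streaming setting a function like this would be applied to each incoming quote tuple, so that theoretical values are refreshed as market data arrives.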


RESOURCE LINKS


Where to get Streams

• Software available via
  – Passport Advantage site
  – PartnerWorld
  – Academic Initiative
  – IBM Internal
  – 90-day trial code
    • IBM Trials and Demos portal
    • ibm.com -> Support & downloads -> Download -> Trials and demos


More Streams links

• Streams in the Information Management Acceleration Zone

• Streams in the Media Library (recorded sessions)
  – Search on “streams”

• The Streams Sales Kit
  – Software Sellers Workplace (the old eXtreme Leverage)
  – PartnerWorld

• InfoSphere Streams Wiki on developerWorks
  – Discussion forum, Streams Exchange
