Streaming Visualization
Guido Schmutz DOAG Big Data 2018 – 20.9.2018
@gschmutz guidoschmutz.wordpress.com
Guido Schmutz
Working at Trivadis for more than 21 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: [email protected]
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
Agenda
1. Visualization in Big Data Reference Architecture
2. How to implement „Data-in-Motion“?
3. Blueprints for Streaming Visualization
4. Blueprints for Stream Visualization – Implementation
Visualization in Big Data Reference Architecture
Data Value Chain
• Milliseconds: Place Trade, Serve ad, Enrich Stream, Approve Trans
• Hundredths of Seconds: Calculate Risk, Leaderboard, Aggregate, Count
• Second(s): Retrieve Click Stream, Show orders
• Minutes: Backtest algo, BI, Daily Reports
• Hours: Algo discovery, Log analysis, Fraud pattern match
Architectures of Big Data Applications
Traditional BI Infrastructures
[Diagram: Bulk Source (DB, File, Extract) → ETL / Stored Procedures → Enterprise Data Warehouse → BI Tools, Search / Explore, Enterprise Apps (Logic, API); high latency]
Big Data solves Volume and Variety – not Velocity
[Diagram: Bulk Source (DB, File, Extract) → File Import / SQL Import → Big Data Platform on a Hadoop cluster (Storage: Raw and Refined; Parallel Processing; Results) → SQL → Enterprise Data Warehouse, BI Tools, Search / Explore, Enterprise Apps (Logic, API); high latency]
Introduction to Stream Processing
Big Data solves Volume and Variety – not Velocity (continued)
[Diagram builds: an Event Source (Location, Telemetry, IoT Data, Mobile Apps, Social) is added next to the Bulk Source; its Event Stream reaches the Big Data Platform through Event Hubs; the platform additionally offers Machine Learning, Graph Algorithms and Natural Language Processing; results still arrive with high latency]
"Data at Rest" vs. "Data in Motion"
[Diagram: two columns, Data at Rest and Data in Motion, each showing the steps Store, Analyze and Act; the bit stream 111010101010110 represents the data]
Stream Processing Architecture solves Velocity
Low(est) latency, no history
[Diagram: Bulk Source (DB, File, Extract) and Event Source (Location, Telemetry, IoT Data, Mobile Apps, Social) send Event Streams through Event Hubs to a Stream Analytics Platform (Stream Analytics, Reference / Models, Results, Search), which serves a Dashboard, Enterprise Apps (Logic, API), BI Tools, the Enterprise Data Warehouse and Search / Explore]
Big Data for all historical data analysis
[Diagram: the Stream Analytics Platform (Stream Analytics, Reference / Models, Results, Search, Dashboard) is combined with a Big Data Platform on a Hadoop cluster (Storage: Raw and Refined; Parallel Processing; Results). The Event Source (Location, Telemetry, IoT Data, Mobile Apps, Social) reaches both via Event Hub and Event Streams; the Bulk Source (DB, File, Extract) arrives via Data Flow and File Import / SQL Import. Consumers: BI Tools, Enterprise Data Warehouse, Search / Explore, Enterprise Apps (Logic, API)]
Integrate existing systems through CDC
• Capture changes directly on the database
• Change Data Capture (CDC): think of it as a global database trigger
• Turns existing systems into event producers
[Diagram: Traditional silo-based system (User Interface, Logic, State / Data Store) → CDC Connector → Event Hub (Data Integration) → Event Streams → Consuming Systems (Logic, State)]
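The "global database trigger" idea can be illustrated with a small in-memory sketch. It is not a real CDC connector: `Table`, `ChangeEvent`, `apply_event` and the plain list standing in for the Event Hub are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One captured change (hypothetical event shape for this sketch)."""
    op: str                               # "insert", "update" or "delete"
    key: str
    before: Optional[Dict[str, Any]]
    after: Optional[Dict[str, Any]]


class Table:
    """Toy table that reports every change, like a global trigger would."""

    def __init__(self, on_change: Callable[[ChangeEvent], None]) -> None:
        self._rows: Dict[str, Dict[str, Any]] = {}
        self._on_change = on_change

    def upsert(self, key: str, row: Dict[str, Any]) -> None:
        before = self._rows.get(key)
        self._rows[key] = dict(row)
        op = "update" if before is not None else "insert"
        self._on_change(ChangeEvent(op, key, before, dict(row)))

    def delete(self, key: str) -> None:
        before = self._rows.pop(key, None)
        if before is not None:
            self._on_change(ChangeEvent("delete", key, before, None))


def apply_event(state: Dict[str, Dict[str, Any]], event: ChangeEvent) -> None:
    """Consuming system: rebuild its own view from the event stream."""
    if event.op == "delete":
        state.pop(event.key, None)
    else:
        state[event.key] = dict(event.after)


event_hub: List[ChangeEvent] = []         # stands in for the Event Hub
orders = Table(on_change=event_hub.append)

orders.upsert("o1", {"status": "new"})
orders.upsert("o1", {"status": "shipped"})
orders.upsert("o2", {"status": "new"})
orders.delete("o2")

replica: Dict[str, Dict[str, Any]] = {}
for ev in event_hub:
    apply_event(replica, ev)
```

The existing system keeps writing to its table as before; the consuming system never queries that table and only sees the stream of change events.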
Integrate existing systems with lower latency through CDC
[Diagram: combined architecture of Stream Analytics Platform (Stream Analytics, Reference / Models, Results, Search, Dashboard) and Big Data Platform on a Hadoop cluster (Storage: Raw and Refined; Parallel Processing; Results). The Bulk Source (DB, File, Extract) is connected through Data Flow, Event Hub and Event Streams in addition to File Import / SQL Import; the Event Source (Location, Telemetry, IoT Data, Mobile Apps, Social) delivers Event Streams. Consumers: BI Tools, Enterprise Data Warehouse, Search / Explore, Enterprise Apps (Logic, API)]
Unified Architecture for Modern Data Analytics Solutions
[Diagram: Bulk Source (DB, File, Extract) and Event Source (Location, Telemetry, IoT Data, Mobile Apps, Social), optionally via an Edge Node (Rules, Event Hub, Storage), feed an Event Hub through Event Streams and File Import / SQL Import. Behind the Event Hub: Big Data on a Hadoop cluster (Storage: Raw and Refined; Parallel Processing; Results; SQL; Search), Stream Analytics (Stream Processor with State and API) and Microservices (Microservice with State and API). Consumers: BI Tools, Enterprise Data Warehouse, Search / Explore, Service, Enterprise Apps (Logic, API)]
Two Types of Stream Processing (from Gartner)
Stream Data Integration
• primarily focuses on the ingestion and processing of data sources, targeting real-time extract-transform-load (ETL) and data integration use cases
• filters and enriches the data
• optionally calculates time-windowed aggregations before storing the results in a database or file system
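The three steps named under Stream Data Integration (filter, enrich, time-windowed aggregation) can be sketched in plain Python. This is only an illustration under stated assumptions: the event fields (`sensor`, `ts`, `value`), the `SENSOR_LOCATION` reference data and the function names are hypothetical, and a real pipeline would run on a streaming engine rather than over a list.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Dict[str, object]

# Hypothetical reference data used for enrichment.
SENSOR_LOCATION = {"s1": "Bern", "s2": "Basel"}


def filter_valid(events: Iterable[Event]) -> Iterator[Event]:
    """Filter: drop events without a numeric reading."""
    for e in events:
        if isinstance(e.get("value"), (int, float)):
            yield e


def enrich(events: Iterable[Event]) -> Iterator[Event]:
    """Enrich: add the sensor location from the reference data."""
    for e in events:
        yield {**e, "location": SENSOR_LOCATION.get(e["sensor"], "unknown")}


def tumbling_avg(events: Iterable[Event], window_s: int) -> Dict[Tuple[str, int], float]:
    """Average per (location, window start) over fixed, non-overlapping windows."""
    sums = defaultdict(lambda: [0.0, 0])
    for e in events:
        start = (e["ts"] // window_s) * window_s
        acc = sums[(e["location"], start)]
        acc[0] += e["value"]
        acc[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


raw = [
    {"sensor": "s1", "ts": 0, "value": 10.0},
    {"sensor": "s1", "ts": 30, "value": 20.0},
    {"sensor": "s2", "ts": 45, "value": None},    # filtered out
    {"sensor": "s1", "ts": 70, "value": 40.0},
    {"sensor": "s2", "ts": 80, "value": 5.0},
]

result = tumbling_avg(enrich(filter_valid(raw)), window_s=60)
```

The resulting dictionary is what would be written to the database or file system, one row per location and 60-second window.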
Stream Analytics
• targets analytics use cases
• calculates aggregates and detects patterns to generate higher-level, more relevant summary information (complex events)
• complex events may signify threats or opportunities that require a response from the business through real-time dashboards, alerts or decision automation
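As a rough illustration of pattern detection, the sketch below turns low-level events into one higher-level complex event. The scenario (three failed logins by the same user within 60 seconds), the tuple layout and the name `detect_bursts` are assumptions made for this example; events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, Tuple


def detect_bursts(
    events: Iterable[Tuple[int, str, str]],    # (timestamp_s, user, kind)
    threshold: int = 3,
    window_s: int = 60,
) -> Iterator[Dict[str, object]]:
    """Emit a complex event when a user has `threshold` failures within `window_s` seconds."""
    recent: Dict[str, Deque[int]] = defaultdict(deque)
    for ts, user, kind in events:
        if kind != "login_failed":
            continue
        times = recent[user]
        times.append(ts)
        # Slide the window: forget failures that are window_s or more seconds old.
        while times and ts - times[0] >= window_s:
            times.popleft()
        if len(times) >= threshold:
            yield {"type": "suspicious_login", "user": user, "at": ts, "count": len(times)}
            times.clear()                      # start counting again after the alert


stream = [
    (0, "alice", "login_failed"),
    (10, "bob", "login_failed"),
    (20, "alice", "login_failed"),
    (25, "alice", "login_ok"),
    (50, "alice", "login_failed"),             # third failure within 60 s
    (200, "bob", "login_failed"),
    (210, "bob", "login_failed"),              # only two failures in the window
]

alerts = list(detect_bursts(stream))
```

Each emitted dictionary is the kind of summary information that a real-time dashboard, an alert or an automated decision could act on.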
How to implement „Data-in-Motion“?
"Data-in-Motion" Ecosystem
[Diagram: landscape of products grouped into Edge, Event Hub, Stream Data Integration and Stream Analytics, each split into Open Source and Closed Source. Source: adapted from Tibco]
Apache Kafka – A Streaming Platform
High-Level Architecture
Distributed Log at the Core
Scale-Out Architecture
Logs do not (necessarily) forget
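The log idea behind these slide titles can be sketched as a toy, single-process model. This is not the Kafka API: `Log` and `Consumer` are hypothetical classes for illustration, and a real Kafka log is additionally partitioned, replicated across brokers and subject to a configurable retention policy.

```python
from typing import List, Tuple


class Log:
    """Append-only log: records get increasing offsets and are never changed."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def append(self, record: str) -> int:
        self._records.append(record)
        return len(self._records) - 1           # offset of the new record

    def read(self, offset: int, max_records: int = 10) -> List[Tuple[int, str]]:
        chunk = self._records[offset:offset + max_records]
        return [(offset + i, rec) for i, rec in enumerate(chunk)]


class Consumer:
    """Each consumer tracks its own offset, so readers do not affect each other."""

    def __init__(self, log: Log, offset: int = 0) -> None:
        self.log = log
        self.offset = offset

    def poll(self) -> List[str]:
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return [rec for _, rec in batch]


log = Log()
for reading in ["t=20.1", "t=20.4", "t=21.0"]:
    log.append(reading)

dashboard = Consumer(log)
first = dashboard.poll()           # reads all three records
log.append("t=21.7")
second = dashboard.poll()          # reads only the new record

late_joiner = Consumer(log)        # starts at offset 0 and replays the history
replayed = late_joiner.poll()
```

Because reading does not remove anything, a visualization that connects later can still replay everything the log has retained.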
Blueprints for Stream Visualization
1) Direct Streaming to the Consumer
[Diagram components: Data Sources, Data Flow, "Data in Motion" (Event Hub, Stream Analytics, Integration), Channel, Streaming Visualization, Consumer]
2) Use a fast datastore and do regular polling from consumer
[Diagram components: Data Sources, Data Flow, "Data in Motion" (Event Hub, Stream Analytics, Integration), Data Store, API, Streaming Visualization, Consumer]
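A minimal sketch of blueprint 2, assuming the stream side writes results into a store and the consumer periodically asks for rows it has not yet seen. `DataStore`, `PollingDashboard` and the sequence-number cursor are hypothetical constructs for this example; in practice the store would be a real database behind an API and `refresh` would run on a timer.

```python
from typing import Dict, List, Tuple


class DataStore:
    """Stand-in for a fast data store that keeps results with a sequence number."""

    def __init__(self) -> None:
        self._rows: List[Tuple[int, Dict[str, float]]] = []

    def write(self, row: Dict[str, float]) -> None:
        self._rows.append((len(self._rows) + 1, row))

    def query_since(self, last_seq: int) -> List[Tuple[int, Dict[str, float]]]:
        return [(seq, row) for seq, row in self._rows if seq > last_seq]


class PollingDashboard:
    """Consumer that asks the store for new rows on every refresh."""

    def __init__(self, store: DataStore) -> None:
        self.store = store
        self.last_seq = 0
        self.points: List[Dict[str, float]] = []

    def refresh(self) -> int:
        new_rows = self.store.query_since(self.last_seq)
        for seq, row in new_rows:
            self.points.append(row)
            self.last_seq = seq
        return len(new_rows)


store = DataStore()
dashboard = PollingDashboard(store)

store.write({"avg_temp": 20.5})        # written by the stream processing side
store.write({"avg_temp": 20.9})
first_refresh = dashboard.refresh()    # two new rows
idle_refresh = dashboard.refresh()     # nothing new
store.write({"avg_temp": 21.3})
third_refresh = dashboard.refresh()    # one new row
```

The trade-off is visible in the idle refresh: the consumer pays for a query even when nothing changed, and the freshness of the chart is bounded by the polling interval.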
3) Use stateful Stream Analytics and query the store directly
[Diagram components: Data Sources, "Data in Motion" (Event Hub, Stream Analytics, Integration), API, Streaming Visualization, Consumer]
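Blueprint 3 can be pictured as a stream processor that keeps its aggregates as local state and answers queries from that state, without a separate data store. The sketch below is a simplification under that assumption; `StatefulCounter` and the page-view events are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


class StatefulCounter:
    """Stream processor that keeps a running count per key as local state."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def process(self, events: Iterable[Dict[str, str]]) -> None:
        for event in events:
            self._counts[event["page"]] += 1

    # Query side: what an API in front of the state would call.
    def get(self, page: str) -> int:
        return self._counts[page]

    def top(self, n: int) -> Dict[str, int]:
        return dict(self._counts.most_common(n))


processor = StatefulCounter()
processor.process([{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}])
processor.process([{"page": "/home"}, {"page": "/cart"}, {"page": "/pay"}])

home_views = processor.get("/home")
top_two = processor.top(2)
```

The visualization calls the query methods through an API; the state is always as current as the last processed event.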
Blueprints for Stream Visualization – Implementation
Visualization: many, many options! But do they support Streaming Data?
Oracle Stream Analytics
[Diagram components: Data Sources, Data Flow, "Data in Motion" (Event Hub, Stream Analytics, Integration), Channel, Streaming Visualization, Consumer]
• Stream Analytics and Visualization in one
• offers real-time actionable business insight on streaming data
• automates action to drive today’s agile businesses (business user)
• runs on top of Spark Streaming
• Cloud and on-premises
• Data Sources: Kafka, JMS, GoldenGate, File
Web Sockets / SSE / Custom JavaScript Application
[Diagram components: "Data in Motion" (Event Hub, Stream Analytics, Integration), Data Flow, Channel, Server-Sent Events (SSE), Streaming Visualization, Consumer]
Slack / WhatsApp / Twitter / …
[Diagram components: "Data in Motion" (Event Hub, Stream Analytics, Integration), Data Flow, Channel, Streaming Visualization, Consumer]
WebSockets vs. Server-Sent Events (SSE)
WebSockets
• provide a richer protocol for bi-directional, full-duplex communication
• require full-duplex connections and new WebSocket servers to handle the protocol
• a two-way channel is more attractive for things like games, messaging apps, and cases where you need near real-time updates in both directions
SSE
• are sent over traditional HTTP
• do not require a special protocol or server implementation to get working
• have been designed from the ground up to be efficient when only one direction (server to client) is necessary
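Since SSE messages are plain text sent over HTTP, the wire format is easy to sketch. The example below serializes messages in the `text/event-stream` format (`id:`, `event:` and `data:` fields, terminated by a blank line); the helper names `format_sse` and `sse_stream` and the temperature payload are assumptions for illustration, and the HTTP server around it is left out.

```python
from typing import Iterable, Iterator, Optional


def format_sse(data: str, event: Optional[str] = None, event_id: Optional[str] = None) -> str:
    """Serialize one message in the text/event-stream format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # Multi-line payloads need one "data:" field per line.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"      # a blank line ends the message


def sse_stream(readings: Iterable[float]) -> Iterator[str]:
    """Turn a stream of readings into SSE messages (one direction: server to client)."""
    for i, value in enumerate(readings, start=1):
        yield format_sse(f'{{"temperature": {value}}}', event="reading", event_id=str(i))


messages = list(sse_stream([20.5, 21.0]))
```

A server would write these strings to a response with content type `text/event-stream`; in the browser, the standard `EventSource` API receives them without any additional protocol handling.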
KSQL / REST API / Custom App
[Diagram components: Data Sources, "Data in Motion" (Event Hub, Stream Analytics, Integration), API, Streaming Visualization, Consumer]
KSQL & Arcadia Data
[Diagram components: Data Sources, "Data in Motion" (Event Hub, Stream Analytics, Integration), API, Streaming Visualization, Consumer]
Arcadia Data
• Combines Batch and Streaming Visualization in one
• Streaming Visualizations based on Confluent KSQL (Kafka)
• Arcadia Instant and Arcadia Enterprise
Druid & Superset / Imply
[Diagram components: Data Sources, Data Flow, "Data in Motion" (Event Hub, Stream Analytics, Integration), Data Store, API, Streaming Visualization, Consumer]
What is Druid?
• Open Source Time Series DB by Metamarkets
• Apache Incubating
• Column-Oriented Storage
• Streaming and Batch Ingest
• Time-optimized partitioning
• SQL Support
• Deep Storage can be HDFS / S3
Imply
• Commercial offering of Druid
• Built around Apache Druid
• Analytics, search and intelligence for event-driven data
Superset
• Open source data visualization tool by Airbnb
• Apache Incubator
• Supports 30 types of visualizations
• Easy-to-use interface for exploring and visualizing data
• Create and share dashboards
• Deep integration with Druid
• Integration with most SQL-speaking RDBMS through SQLAlchemy
Elasticsearch / Kibana
[Diagram components: Data Sources, Data Flow, "Data in Motion" (Event Hub, Stream Analytics, Integration), Data Store, API, Streaming Visualization, Consumer]
Elasticsearch
• NoSQL store
• a distributed, RESTful search and analytics engine
• centrally stores your data so you can discover the expected and uncover the unexpected
• lets you perform and combine many types of searches: structured, unstructured, geo, metric
• aggregations let you zoom out to explore trends and patterns in your data
Kibana
• Window into Elasticsearch
• Enables visual exploration and analysis of data stored in Elasticsearch
InfluxDB / Grafana or Chronograf
[Diagram components: Data Sources, Data Flow, "Data in Motion" (Event Hub, Stream Analytics, Integration), Data Store, API, Streaming Visualization, Consumer]
InfluxDB
• Popular Time Series Database
• Open source as well as commercial offering
Chronograf
Grafana
• lets you query, visualize, alert on and understand metrics independent of where they are stored
• supports various data sources: Elasticsearch, InfluxDB, Prometheus, OpenTSDB, MySQL, …
Technology on its own won't help you. You need to know how to use it properly.