© SZTAKI 2016 SZTAKI 2015.
New challenges for Big Data Analytics
Ericsson – SZTAKI collaboration
Institute for Computer Science and Control
Hungarian Academy of Sciences
Andras Benczur, head, Informatics Laboratory
"Big Data – Momentum" research group
Challenges addressed in Ericsson-SZTAKI cooperation
Immediate implementation challenges
• Access to Distributed Storage
– Measuring systems for real time querying
– Dynamic repartitioning during distributed execution
– Data locality considerations
• System configuration
– HDFS write partitioning:
• optimize block size for data
• optimal replication and data distribution policies
– Hadoop/YARN cluster: use all resources through appropriate worker configuration
– Simplify the data processing pipeline
– Caching (largest gain)
– Keep Spark warm and reuse it for faster process start
Methods that do not really exist yet, although many claim to have them
• Stream mining
– Predict on the fly, by using batch trained models?
– Train models on the fly?
• Architectures
– Lambda: Combine a batch layer (e.g. Hadoop) and a streaming layer (e.g. Storm)?
– Kappa: Or just use Spark Streaming, no batch at all?
– Or use the Flink unified dataflow engine, which can do both batch and streaming?
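The Lambda option can be illustrated with a minimal sketch, assuming a toy per-key event count as the query. A batch view is recomputed from all stored events, and a speed view covers only the events that arrived since the last batch run; a query merges the two. The class and method names are hypothetical, and no Hadoop or Storm API is used.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda-style serving layer: batch view plus speed view (illustrative only)."""

    def __init__(self):
        self.master_log = []         # append-only store of all events
        self.batch_view = Counter()  # recomputed from scratch by the batch layer
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, key):
        """Streaming path: store the event and update the speed view at once."""
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Batch path: recompute the view over all data, then reset the speed view."""
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, key):
        """Serving: merge the (possibly stale) batch view with the recent increments."""
        return self.batch_view[key] + self.speed_view[key]


counter = LambdaCounter()
for event in ["a", "b", "a"]:
    counter.ingest(event)
before_batch = counter.query("a")   # answered from the speed view only
counter.run_batch()
counter.ingest("a")
after_batch = counter.query("a")    # batch view plus one recent event
```

A Kappa-style design would drop `run_batch` and keep only the incremental path, replaying the log through the same streaming code when a recomputation is needed.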
Big Data Machine Learning
Session Drop (2014)
• Periodic radio channel physical parameter measurements
• Predict abnormal termination
QoE time series in SSL traffic (WIP)
• Predict events and errors on the terminal from IP traffic
Sample traffic records:
Record 1
  periods_bytes: [363.0, 0.0, 227.0, 0.0, 33.0, 0.0, 131.0] ...
  periods_bytes_upload: [226.0, 0.0, 227.0, 0.0, 0.0, 0.0, 98.0] ...
  periods_end: [1458300345.7, 1458300345.71, ...
  periods_flow_cache: [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0] ...
  periods_start: [1458300345.59, 1458300345.7, ...
  protocol / service_provider: Facebook
  start: 1458300345.59
Record 2
  periods_bytes: [52.0, 0.0, 33.0, 0.0, 52.0, 0.0, 33.0] ...
  periods_bytes_upload: [52.0, 0.0, 0.0, 0.0, 52.0, 0.0, 0.0] ...
  periods_end: [1458300355.23, 1458300355.36, ...
  periods_flow_cache: [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0] ...
  periods_start: [1458300355.21, 1458300355.23, ...
  protocol / service_provider: Facebook
  start: 1458300355.21
Record 3
  periods_bytes: [1272.0]
  periods_bytes_upload: [861.0]
  periods_end: [1458300440.44]
  periods_flow_cache: [4.0]
  periods_start: [1458300440.07]
  protocol / service_provider: Android C2DM/GCM / Google
  start: 1458300440.07
Timestamp   Delay  Event type                        Errors                 Service provider
1458301297  -      open app                          -                      YouTube
1458301302  12     start new video (advertisement)   -                      YouTube
1458301320  1      skip ad to video                  -                      YouTube
1458301362  30     video freeze                      connection lost error  YouTube
1458301392  30     retry                             connection lost error  YouTube
1458301588  2      open app                          -                      Facebook
1458301599  -      scroll                            -                      Facebook
1458301683  3      click on link                     -                      index.hu
1458301686  10     page rendering started            -                      index.hu
1458301707  0      scroll down                       -                      index.hu
1458301884  10     click on link                     -                      index.hu
1458301894  -      page rendering started            not fully completed    index.hu
[Figure legend: Drop / No Drop]
STREAMLINE H2020
New initiative on top of Apache Flink
A "use-case complete" framework to unify batch and stream processing
• DFKI (DE)
• SICS (SE)
• Portugal Telecom (PT)
• Internet Memory (FR)
• Rovio (FI)
• NMusic (PT)
• SZTAKI (HU)
Machine learning: Batch or Online or Updateable?
Currently, first experiments with recommender systems
Batch
• Repeatedly read all training data multiple times
• Stochastic gradient: random order
Online
• Read training data only once, no chance to store
• Update immediately with large learning rate
Batch + Online
• Linear combination
• Challenge to implement in a software stack
Sampling from the past
• Partially reuse recent data in a streaming solution
• Can be implemented by using sliding windows
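The four training modes can be contrasted in a minimal sketch, assuming a one-parameter linear model y ≈ w·x trained by stochastic gradient descent on squared error. The function names, learning rates and window sizes are illustrative assumptions, not the setup used in the recommender experiments.

```python
import random
from collections import deque


def sgd_step(w, x, y, lr):
    """One stochastic gradient step for squared error on the model y ≈ w * x."""
    return w - lr * 2.0 * (w * x - y) * x


def train_batch(data, epochs=20, lr=0.01, seed=0):
    """Batch: read all training data several times, in random order each pass."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            w = sgd_step(w, x, y, lr)
    return w


def train_online(stream, lr=0.05, window=0, resample=0, seed=0):
    """Online: see each record once and update immediately with a larger rate.

    With window > 0, a sliding window of recent records is kept and `resample`
    of them are replayed after every new record (sampling from the past).
    """
    rng = random.Random(seed)
    recent = deque(maxlen=window) if window > 0 else None
    w = 0.0
    for x, y in stream:
        w = sgd_step(w, x, y, lr)
        if recent is not None:
            for _ in range(min(resample, len(recent))):
                px, py = recent[rng.randrange(len(recent))]
                w = sgd_step(w, px, py, lr)
            recent.append((x, y))
    return w


def combine(w_batch, w_online, alpha=0.5):
    """Batch + online: linear combination of the two models' parameters."""
    return alpha * w_batch + (1.0 - alpha) * w_online


# Noise-free toy data generated from y = 3 * x.
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)] * 10
w_batch = train_batch(data)
w_online = train_online(data)
w_window = train_online(data, window=8, resample=2)
w_mixed = combine(w_batch, w_online, alpha=0.5)
```

On this toy data all variants approach w = 3; the difference that matters in practice is how much data each mode has to store and how quickly it reacts to change.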
Tracing under Spark by piggybacking
• Replacement for checkpointing records and building the traces offline with an external service
• We attach traces on-the-fly to records and pre-aggregate tracing-graphs for a fast, convenient and real-time result
Goal: enable Spark to integrate with the internal tracing infrastructure of Ericsson’s Ark.
Despite the difficulty of the problem, the results exceeded expectations: full support for Spark’s Scala API.
SZTAKI came up with two solutions to the problem in Spark.
To be presented at Spark Summit next week
Sample graph piggybacked to a record in Spark
• 𝑇𝑖 can be any Spark operator
• We attach useful metrics to edges and vertices
(current load, time spent at task, …)
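The piggybacking idea can be sketched in plain Python, assuming each record carries the list of operators it has passed through and that the traces are pre-aggregated into edge counts. `Traced`, `traced` and `aggregate_edges` are hypothetical names; this is not the Spark integration itself, only an illustration of attaching a trace to a record and aggregating a tracing graph from it.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Traced:
    """A record together with the trace piggybacked onto it."""
    value: object
    path: tuple = ()  # names of the operators visited so far


def traced(op_name, fn):
    """Wrap a per-record function so that its output keeps and extends the trace."""
    def wrapper(rec):
        return Traced(fn(rec.value), rec.path + (op_name,))
    return wrapper


def aggregate_edges(records):
    """Pre-aggregate the piggybacked paths into edge counts of a tracing graph."""
    edges = Counter()
    for rec in records:
        for src, dst in zip(rec.path, rec.path[1:]):
            edges[(src, dst)] += 1
    return edges


t1 = traced("T1", lambda x: x + 1)
t2 = traced("T2", lambda x: x * 2)
outputs = [t2(t1(Traced(x))) for x in range(4)]
edge_counts = aggregate_edges(outputs)
```

Edge counts are only one possible metric; timings or load figures could be carried in the same way by storing them next to each operator name.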
[Figure: sample tracing graph with operator nodes 𝑇𝑖 and numeric metrics on the edges]
Spark execution visualizer
• Understand the execution mechanism of Spark (helps new users learn Spark faster)
• Discover issues of executors and physical tasks in detail
• Highlight bottlenecks of certain workflows
• Provide insight for advanced, online optimization strategies
Demonstrated at ITU Telecom World, Budapest, 2015
Databricks, the company behind Spark, showed interest in the visualizer at Spark Summit EU, Amsterdam, 2015
Presented at Hadoop Summit, Dublin, 2016
Dynamic repartitioning in Spark
[Figure: even partitioning vs. skewed data across a stage boundary & shuffle; the skewed case produces slow tasks]
Goal: Tackle the problem of highly skewed and dynamically changing distribution of keys
• Avoid data skew dynamically on-the-fly in any workflow
• Mitigate the problem of slow tasks
• Sophisticated optimization algorithm
• Major improvement
Heavy keys create heavy partitions (slow tasks)
Adaptively repartition data during execution at shuffle-phase with a better partition function to approximate even partitioning
[Figure: one master and three slaves]
1. Slaves: approximate local key distribution & statistics
2. Collect to master
3. Master: global key distribution & statistics
4. Redistribute new hash
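The four steps can be sketched as follows, assuming top-k counts as the local approximation and a greedy placement of the heaviest keys on the least-loaded partition. Both choices are illustrative assumptions and not necessarily the optimization algorithm used in the project; all function names are hypothetical.

```python
import zlib
from collections import Counter


def local_histogram(keys, top_k=10):
    """Step 1 (worker): approximate the local key distribution by its top-k keys."""
    return Counter(keys).most_common(top_k)


def global_histogram(local_hists):
    """Steps 2-3 (master): merge the collected local histograms."""
    total = Counter()
    for hist in local_hists:
        for key, count in hist:
            total[key] += count
    return total


def stable_hash(key):
    """Deterministic hash, so every worker maps a key to the same partition."""
    return zlib.crc32(repr(key).encode("utf-8"))


def build_partitioner(global_hist, num_partitions, num_heavy=10):
    """Step 4: place the heaviest keys explicitly and hash all other keys."""
    load = [0] * num_partitions
    explicit = {}
    for key, count in global_hist.most_common(num_heavy):
        target = min(range(num_partitions), key=load.__getitem__)
        explicit[key] = target
        load[target] += count

    def partition(key):
        if key in explicit:
            return explicit[key]
        return stable_hash(key) % num_partitions

    return partition


worker_1 = ["a"] * 50 + ["b"] * 30 + ["c"] * 5
worker_2 = ["a"] * 40 + ["b"] * 35 + ["d"] * 5
merged = global_histogram([local_histogram(worker_1), local_histogram(worker_2)])
partition = build_partitioner(merged, num_partitions=2, num_heavy=2)
```

In this toy run the two heavy keys "a" and "b" end up on different partitions, whereas a plain hash could place both on the same one.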
QoS scheduling: Spark (2015) and Yarn (2016)
1. It is hard to predict the resource usage of a production job, so users usually over-estimate consumption by 5x to meet the contract, which hurts utilization. This is a problem in both Spark and Yarn.
2. Peaks can slow down certain nodes or cause out-of-memory errors. This is a problem in Spark.
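The trade-off between the two problems can be made concrete with a small arithmetic sketch. The consumption trace is made up, and the 5x over-estimate is assumed to be relative to the average consumption; both are assumptions for illustration only.

```python
def utilization(consumption, allocated):
    """Average fraction of the allocated resource that is actually used."""
    return sum(consumption) / (allocated * len(consumption))


def peak_violations(consumption, allocated):
    """Time steps at which consumption exceeds the allocation (out-of-memory risk)."""
    return [t for t, used in enumerate(consumption) if used > allocated]


consumption = [2, 3, 4, 3, 2, 8, 3, 3]  # hypothetical memory use in GB per time step
mean_use = sum(consumption) / len(consumption)

over_allocated = 5 * mean_use            # allocation over-estimated by 5x
tight_allocated = 4                      # allocation just above typical use

wasteful = utilization(consumption, over_allocated)
risky_steps = peak_violations(consumption, tight_allocated)
```

With the 5x allocation only one fifth of the reserved resource is used on average, while the tight allocation is exceeded at the peak; a scheduler has to balance these two failure modes.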
[Figure: resources over time, comparing the resource allocated by the job with current consumption. Annotations: out-of-memory error; under-utilization, waste of resources; close to optimal allocation]
Hops.io (Stockholm) EIT Digital Activity Proposal
[Figure: Size (GB), axis from 0 to 1000, by Year, 2015 to 2019]
Hops.io: Interactive Analytics with Zeppelin
IoT Data Hub for Ericsson Business Unit Global Services (BUGS)
Open source NoSQL systems test results
Selected system is confidential
Last bar is Cassandra
In combination with Spark, tested with BUGS in 2015
Spark vs Flink Stateful stream processing – “A/B testing”
Spark
• Better equipped for interactive, exploratory data analysis due to simpler runtime design and more mature third-party tools.
• Stronger in community support and in currently available integration, monitoring and testing
Flink
• Competitive runtime, especially for heavy workloads
• Sort, join, iterative dataflows (e.g. ML)
• Transitioning to either of the two systems is fairly smooth: similar programming models
WIP: testing in Ericsson Aggregator
[Figure: Streaming A/B Test throughput. x-axis: Number of CPU cores (5, 10, 20); y-axis: 1000 records/sec (0 to 1000); series: Flink, Spark; data labels: 465, 644, 877 and 232, 344, 512]
Achievements
• High visibility of SZTAKI researchers within the open source community, as well as among committers at other companies (e.g. Databricks and Cloudera in San Francisco)
• Presentations at Hadoop and Spark Summits, Flink Forward
• A list of solved bugs, minor and major improvements submitted and accepted to the Apache Spark project in 2015:
– 20+ issues covered
– Major improvement of deployment to YARN
– Major improvement in repartitioning efficiency
• Built expertise in Apache Spark, Flink, Yarn
• Training session and consultancy support for Ericsson Hungary
Commercial break: it is still not too late to join our contests
• RecSys Challenge 2016, Boston, September 15-19
• ECML/PKDD Discovery Challenge 2016, Riva del Garda, September 19-23
• RecSys and Personalization Meetup: Budapest, June 9