Data Streaming (in a Nutshell) ... and Spark’s window operations

Vincenzo Gulisano, Ph.D.
Chalmers University of Technology

Agenda
• Who am I?
• Introduction
  – Motivation
  – System Model
• Spark’s window operations
• References

Assistant Professor
Distributed Computing and Systems Research Group
Department of Computer Science and Engineering
Chalmers University of Technology
https://vincenzogulisano.com/

At our research team: Research expertise & projects
• Cybersecurity
• Efficient parallel & stream computing
• Distributed systems
• IoT & Sensor Networks



Motivation
• Since the year 2000, applications such as:
  – Sensor networks
  – Network traffic analysis
  – Financial tickers
  – Transaction log analysis
  – Fraud detection
• Require:
  – Continuous processing of data streams
  – In a real-time fashion

Motivation
• Relying 100% on store-then-process (i.e., DBs) is not feasible
  – High-speed networks: nanoseconds to handle a packet
  – ISP router: gigabytes of headers every hour, …
• Data Streaming:
  – In memory
  – Bounded resources
  – Efficient one-pass analysis
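As a minimal sketch of what one-pass analysis with bounded resources can look like (this is not taken from the slides; the class name `RunningMean` and the sample values are illustrative assumptions), the following Python keeps only a count and a sum, so memory use stays constant no matter how many values arrive:

```python
class RunningMean:
    """One-pass mean: stores two numbers, never the stream itself."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        """Consume one value and return the mean of everything seen so far."""
        self.count += 1
        self.total += value
        return self.total / self.count


# Hypothetical packet sizes (bytes) arriving one at a time.
mean = RunningMean()
means = [mean.update(size) for size in (100, 300, 200)]
```

Each value is inspected once and then discarded, which is the property that makes the approach viable when storing everything first is not.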

Motivation
• DBMS vs. DSMS

[Figure: DBMS (Disk, Main Memory, Query Processing): 1. Data, 2. Query, 3. Query results. DSMS (Main Memory, Query Processing): Continuous Query, Data, Query results.]

What about ?


Stonebraker, Michael, Uǧur Çetintemel, and Stan Zdonik. The 8 requirements of real-time stream processing (2005).
1. Keep the data moving
2. Query interface, e.g., extended SQL
3. Handle imperfections
4. Generate predictable outcomes
5. Integrate stored and streaming data
6. Guarantee data safety and availability
7. Partition and scale applications automatically
8. Process and respond instantaneously

System Model
• Data Stream: unbounded sequence of tuples
  – Example: Call Description Record (CDR)

Field         Type
Caller        text
Callee        text
Time (secs)   int
Price (€)     double

Example tuples along the time axis: ⟨A, B, 8:00, 3⟩, ⟨C, D, 8:20, 7⟩, ⟨A, E, 8:35, 6⟩

System Model
• Operators:
  – Stateless: 1 input tuple → 1 output tuple
  – Stateful: 1+ input tuple(s) → 1 output tuple

System Model: Stateless Operators
• Map: transform the tuples' schema
  – Example: convert price € → $
• Filter: discard / route tuples
  – Example: route depending on price
• Union: merge multiple streams (sharing the same schema)
  – Example: merge CDRs from different sources
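The three stateless operators can be sketched as Python generators. This is an illustration under stated assumptions, not the implementation of any particular engine: the `CDR` type, the operator function names, the exchange rate `EUR_TO_USD`, the round-robin merge order, and the reading of the example tuples as (caller, callee, time, price) are all assumptions.

```python
from typing import NamedTuple


class CDR(NamedTuple):
    caller: str
    callee: str
    time: str     # kept as the "8:00"-style label used in the example
    price: float  # currency depends on the stage of the query


EUR_TO_USD = 1.1  # assumed exchange rate, for illustration only


def map_op(stream, fn):
    """Map: produce exactly one output tuple per input tuple."""
    for t in stream:
        yield fn(t)


def filter_op(stream, predicate):
    """Filter: forward a tuple only if the predicate holds."""
    for t in stream:
        if predicate(t):
            yield t


def union_op(*streams):
    """Union: merge same-schema streams, here in simple round-robin order."""
    iterators = [iter(s) for s in streams]
    while iterators:
        for it in list(iterators):
            try:
                yield next(it)
            except StopIteration:
                iterators.remove(it)


source_a = [CDR("A", "B", "8:00", 3.0), CDR("A", "E", "8:35", 6.0)]
source_b = [CDR("C", "D", "8:20", 7.0)]

merged = union_op(source_a, source_b)
in_usd = map_op(merged, lambda t: t._replace(price=t.price * EUR_TO_USD))
expensive = list(filter_op(in_usd, lambda t: t.price > 5.0))
```

None of the operators keeps anything between tuples, which is what makes them stateless: each output depends only on the tuple currently being processed.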

System Model: Stateful Operators
• Aggregate: compute aggregate functions (group-by)
  – Example: compute avg. call duration
• Join: match tuples from 2 streams (equality predicate)
  – Example: match CDRs with prices in the same range
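A rough Python sketch of the two stateful operators follows. It works on finite lists that stand in for the tuples an operator currently holds. The function names, the dictionary-based tuples, the `duration` field, and the choice of 5-unit price ranges are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def aggregate_avg(tuples, key, value):
    """Aggregate: group tuples by key(t) and average value(t) per group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t in tuples:
        sums[key(t)] += value(t)
        counts[key(t)] += 1
    return {k: sums[k] / counts[k] for k in sums}


def equi_join(left, right, left_key, right_key):
    """Join: pair up tuples from two inputs whose keys are equal."""
    index = defaultdict(list)
    for left_tuple in left:
        index[left_key(left_tuple)].append(left_tuple)
    return [
        (left_tuple, right_tuple)
        for right_tuple in right
        for left_tuple in index.get(right_key(right_tuple), [])
    ]


# Hypothetical calls with a duration in seconds.
calls = [
    {"caller": "A", "duration": 60},
    {"caller": "C", "duration": 30},
    {"caller": "A", "duration": 120},
]
avg_duration = aggregate_avg(calls, lambda c: c["caller"], lambda c: c["duration"])

# Equality predicate on the price range (assumed ranges of 5 units each).
def price_range(cdr):
    return int(cdr["price"] // 5)

stream_1 = [{"caller": "A", "price": 3}, {"caller": "C", "price": 7}]
stream_2 = [{"caller": "X", "price": 4}, {"caller": "Y", "price": 12}]
matches = equi_join(stream_1, stream_2, price_range, price_range)
```

Both operators need to remember earlier tuples (the per-group sums and the join index) before they can produce an output, which is the state the slide refers to.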

System Model
• Continuous Query: graph of operators/streams
• Example: Map → Filter → Agg
  – Map: convert € → $
  – Filter: only > 10 $
  – Agg: count calls made by each Caller number

Schema of each stream:
  – Input: Caller, Callee, Time (secs), Price (€)
  – After Map: Caller, Callee, Time (secs), Price ($)
  – After Filter: Caller, Callee, Time (secs), Price ($)
  – After Agg: Caller, Calls, Time (secs)
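The Map → Filter → Agg query can be sketched in Python as a chain of generators feeding a counter. This is a simplified illustration: it ignores time and counts over one finite batch, and the exchange rate, the input prices, and the function names are assumptions rather than values from the slides.

```python
from collections import Counter

EUR_TO_USD = 1.1  # assumed exchange rate, for illustration only


def to_usd(cdr):
    """Map step: same tuple, with the price converted from EUR to USD."""
    return {**cdr, "price": cdr["price"] * EUR_TO_USD}


def run_query(cdrs):
    """Chain Map -> Filter -> Agg over one finite batch of CDRs."""
    converted = (to_usd(c) for c in cdrs)                    # Map
    expensive = (c for c in converted if c["price"] > 10.0)  # Filter
    return Counter(c["caller"] for c in expensive)           # Agg


# Hypothetical input prices in EUR.
batch = [
    {"caller": "A", "callee": "B", "price": 12.0},
    {"caller": "C", "callee": "D", "price": 7.0},
    {"caller": "A", "callee": "E", "price": 10.0},
    {"caller": "C", "callee": "F", "price": 20.0},
]
calls_per_caller = run_query(batch)
```

Because the first two stages are generators, tuples flow through the graph one at a time; only the final aggregate accumulates state.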

System Model
• Infinite sequence of tuples / bounded memory → windows
• Example: 1 hour windows
  – [8:00,9:00)
  – [8:20,9:20)
  – [8:40,9:40)
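The 1 hour windows in the example start 20 minutes apart, so each tuple falls into several windows. As a small sketch (assuming timestamps are minutes since midnight and window starts are aligned to multiples of the advance; the function name `windows_for` is hypothetical), the windows covering a given timestamp can be computed as:

```python
def windows_for(ts, size=60, advance=20):
    """Return [start, end) of every window that contains timestamp ts."""
    # Earliest aligned start s with s + size > ts.
    first_start = ((ts - size) // advance + 1) * advance
    # Latest aligned start s with s <= ts.
    last_start = (ts // advance) * advance
    return [(s, s + size) for s in range(first_start, last_start + 1, advance)]


def fmt(minutes):
    """Format minutes since midnight as H:MM."""
    return f"{minutes // 60}:{minutes % 60:02d}"


# A tuple at 8:22 belongs to three windows, including [8:00,9:00) and [8:20,9:20).
covering = [f"[{fmt(a)},{fmt(b)})" for a, b in windows_for(8 * 60 + 22)]
```

Under this alignment assumption a window starting at 7:40 also covers 8:22; the slide simply begins its illustration at 8:00.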

System Model
• Infinite sequence of tuples / bounded memory → windows
• Example: count tuples - 1 hour windows
  – Tuples at 8:05, 8:15, 8:22, 8:45, 9:05
  – Window [8:00,9:00): Output: 4
  – Next window: [8:20,9:20)
• What about out-of-order tuples?



Spark’s window operations
(source: http://spark.apache.org/docs/latest/streaming-programming-guide.html)



Java:

// Reduce function adding two integers, defined separately for clarity
Function2<Integer, Integer, Integer> reduceFunc = new Function2<Integer, Integer, Integer>() {
  @Override
  public Integer call(Integer i1, Integer i2) {
    return i1 + i2;
  }
};

// Reduce last 30 seconds of data, every 10 seconds
JavaPairDStream<String, Integer> windowedWordCounts = pairs.reduceByKeyAndWindow(reduceFunc, Durations.seconds(30), Durations.seconds(10));

Python:

# Reduce last 30 seconds of data, every 10 seconds
windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 30, 10)



countByWindow(windowLength, slideInterval)
  Return a sliding window count of elements in the stream.

reduceByWindow(func, windowLength, slideInterval)
  Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel.

reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
  When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window [...]

reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
  A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions” [...]
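To make the idea of "inverse reducing" concrete, here is a plain-Python simulation. It is not Spark code and does not claim to mirror Spark's internals: the function `windowed_reduce`, the representation of the stream as a list of per-slide dictionaries, and the sample data are assumptions made for illustration.

```python
def windowed_reduce(batches, window_batches, func, inv_func):
    """Yield, after each batch, the per-key reduction over the last
    window_batches batches, updating the previous result incrementally.

    batches must be a list of dicts (key -> value already reduced per batch).
    """
    current = {}
    for i, batch in enumerate(batches):
        # Reduce the new data that enters the window.
        for key, value in batch.items():
            current[key] = func(current[key], value) if key in current else value
        # "Inverse reduce" the old data that leaves the window.
        if i >= window_batches:
            for key, value in batches[i - window_batches].items():
                current[key] = inv_func(current[key], value)
        yield dict(current)


# A 30-second window sliding every 10 seconds spans 3 batches of 10 seconds.
batches = [{"a": 1, "b": 2}, {"a": 3}, {"b": 1}, {"a": 2}]
results = list(
    windowed_reduce(batches, 3, lambda x, y: x + y, lambda x, y: x - y)
)
```

Each step touches only the batch entering and the batch leaving the window instead of recomputing over all three batches. In this sketch a key whose value drops to zero stays in the result with value 0.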

Maintaining tuples or windows?

Example: tuples at 8:05, 8:15, 8:22, 8:45, 9:05; windows [8:00,9:00) and [8:20,9:20)

Maintain tuples
When the window shifts:
1. Remove contribution of stale tuples
2. Go on adding new incoming tuples

• Need to maintain a single window instance
• Need to maintain all the tuples (how many?)
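The "maintain tuples" strategy can be sketched as follows for the count example. The class name `TupleMaintainingCounter`, the use of minutes since midnight as timestamps, the first window starting at 8:00, and the assumption that tuples arrive in timestamp order are all choices made for this illustration.

```python
from collections import deque


class TupleMaintainingCounter:
    """Sliding count that keeps the tuples and a single window instance."""

    def __init__(self, size=60, advance=20, first_start=8 * 60):
        self.size = size
        self.advance = advance
        self.start = first_start  # start of the single window instance
        self.tuples = deque()     # timestamps currently contributing
        self.count = 0

    def insert(self, ts):
        """Add an in-order tuple; return (window_start, count) per closed window."""
        closed = []
        while ts >= self.start + self.size:
            closed.append((self.start, self.count))
            self.start += self.advance  # the window shifts
            # 1. Remove the contribution of stale tuples.
            while self.tuples and self.tuples[0] < self.start:
                self.tuples.popleft()
                self.count -= 1
        # 2. Go on adding new incoming tuples.
        self.tuples.append(ts)
        self.count += 1
        return closed


counter = TupleMaintainingCounter()
arrivals = (8 * 60 + 5, 8 * 60 + 15, 8 * 60 + 22, 8 * 60 + 45, 9 * 60 + 5)
outputs = [counter.insert(ts) for ts in arrivals]
```

The tuple at 9:05 closes [8:00,9:00) with a count of 4, after which 8:05 and 8:15 are evicted and the same instance continues as [8:20,9:20). The deque grows with the number of tuples per window, which is the cost the slide asks about.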

Maintaining tuples or windows?

Example: tuples at 8:05, 8:15, 8:22, 8:45, 9:05
  – [8:00,9:00): 3 (so far...)
  – [8:20,9:20): 1 (so far...)

Maintain windows
When a tuple arrives:
1. Add its contribution to all the windows it falls in

• No need to maintain tuples
• Need to maintain all the windows each tuple contributes to
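The "maintain windows" strategy for the same count example might look like the sketch below. Again this is an illustration under assumptions: the class name `WindowMaintainingCounter` is hypothetical, timestamps are minutes since midnight, tuples arrive in order, and windows starting before 8:00 are ignored to mirror the example.

```python
class WindowMaintainingCounter:
    """Sliding count that keeps one partial result per open window, no tuples."""

    def __init__(self, size=60, advance=20, first_start=8 * 60):
        self.size = size
        self.advance = advance
        self.first_start = first_start
        self.counts = {}  # window start -> count so far

    def insert(self, ts):
        """Add an in-order tuple; return (window_start, count) per closed window."""
        # Windows ending at or before ts can no longer receive tuples.
        closed = sorted(
            (s, c) for s, c in self.counts.items() if s + self.size <= ts
        )
        for start, _ in closed:
            del self.counts[start]
        # Add the tuple's contribution to all the windows it falls in.
        start = max(
            self.first_start,
            ((ts - self.size) // self.advance + 1) * self.advance,
        )
        while start <= ts:
            self.counts[start] = self.counts.get(start, 0) + 1
            start += self.advance
        return closed


counter = WindowMaintainingCounter()
for ts in (8 * 60 + 5, 8 * 60 + 15, 8 * 60 + 22):
    counter.insert(ts)
after_three = dict(counter.counts)  # {480: 3, 500: 1}, as in the example

closed = [w for ts in (8 * 60 + 45, 9 * 60 + 5) for w in counter.insert(ts)]
```

Memory now depends on the number of overlapping windows (size divided by advance) rather than on the number of tuples, at the price of updating several windows per tuple.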



References (non-exhaustive list)

Bed time reading about Data Streaming
1. Gulisano, Vincenzo. StreamCloud: An Elastic Parallel-Distributed Stream Processing Engine. Ph.D. Thesis. Polytechnic University Madrid, 2012.

Shared-nothing parallelism / Elasticity
2. StreamCloud: A Large Scale Data Streaming System. Vincenzo Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Patrick Valduriez. 30th International Conference on Distributed Computing Systems (ICDCS) 2010
3. StreamCloud: An Elastic and Scalable Data Streaming System. Vincenzo Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Claudio Soriente, Patrick Valduriez. IEEE Transactions on Parallel and Distributed Systems (TPDS)


Shared-memory parallelism / fine-grained synchronization
1. ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join. Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas. IEEE International Conference on Big Data (IEEE Big Data 2015)
2. DEBS Grand Challenge: Deterministic Real-Time Analytics of Geospatial Data Streams through ScaleGate Objects. Vincenzo Gulisano, Yiannis Nikolakopoulos, Ivan Walulya, Marina Papatriantafilou, Philippas Tsigas. The 9th ACM International Conference on Distributed Event-Based Systems (DEBS 2015)
3. Concurrent Data Structures for Efficient Streaming Aggregation (brief announcement). Daniel Cederman, Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas. The 26th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) 2014

Streaming + Security / Privacy / Cyber-physical systems
4. Understanding the Data-Processing Challenges in Intelligent Vehicular Systems. Stefania Costache, Vincenzo Gulisano, Marina Papatriantafilou. 2016 IEEE Intelligent Vehicles Symposium (IV16)
5. BES – Differentially Private and Distributed Event Aggregation in Advanced Metering Infrastructures. Vincenzo Gulisano, Valentin Tudor, Magnus Almgren and Marina Papatriantafilou. 2nd ACM Cyber-Physical System Security Workshop (CPSS 2016) [held in conjunction with ACM AsiaCCS’16], 2016.
6. METIS: a Two-Tier Intrusion Detection System for Advanced Metering Infrastructures. Vincenzo Gulisano, Magnus Almgren, Marina Papatriantafilou. 10th International Conference on Security and Privacy in Communication Networks (SecureComm) 2014


• Motivation / System Model
1. Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS ’02, New York, NY, USA, 2002. ACM.
2. Michael Stonebraker, Uǧur Çetintemel, and Stan Zdonik. The 8 requirements of real-time stream processing. SIGMOD Rec., 34(4), December 2005.
3. Nesime Tatbul. QoS-Driven load shedding on data streams. In Proceedings of the Workshops XMLDM, MDDE, and YRWS on XML-Based Data Management and Multimedia Engineering-Revised Papers, EDBT ’02, London, UK, 2002. Springer-Verlag.


• Centralized Stream Processing Engines

1. Arvind Arasu, Brian Babcock, Shivnath Babu, John Cieslewicz, Keith Ito, Rajeev Motwani, Utkarsh Srivastava, and Jennifer Widom. Stream: The Stanford data stream management system. Springer, 2004.

2. Arvind Arasu, Shivnath Babu, and Jennifer Widom. The CQL continuous query language: semantic foundations and query execution. The VLDB Journal, 15(2), June 2006.

3. Daniel J. Abadi, Don Carney, Uǧur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Aurora: a new model and architecture for data stream management. The VLDB Journal, 12(2), August 2003.

4. Nesime Tatbul and Stan Zdonik. Window-aware load shedding for aggregation queries over data streams. In Proceedings of the 32nd international conference on Very large data bases, VLDB ’06. VLDB Endowment, 2006.



• Distributed Stream Processing Engines
1. Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Uǧur Çetintemel, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, and Stanley B. Zdonik. The design of the Borealis stream processing engine. In CIDR, pages 277–289, 2005.
2. Magdalena Balazinska, Hari Balakrishnan, Samuel R Madden, and Michael Stonebraker. Fault-tolerance in the Borealis distributed stream processing system. ACM Trans. Database Syst., 33(1), March 2008. ACM ID: 1331907.
3. Philippe Bonnet, Johannes Gehrke, and Praveen Seshadri. Towards sensor database systems. In Proceedings of the Second International Conference on Mobile Data Management, MDM ’01, London, UK, 2001. Springer-Verlag.
4. Jeong-Hyon Hwang, Magdalena Balazinska, Alexander Rasin, Uǧur Çetintemel, Michael Stonebraker, and Stan Zdonik. A comparison of stream-oriented high availability algorithms. Technical report, Brown CS, 2003.
5. Jeong-Hyon Hwang, Magdalena Balazinska, Alexander Rasin, Uǧur Çetintemel, Michael Stonebraker, and Stan Zdonik. High-Availability algorithms for distributed stream processing. In Data Engineering, International Conference on, volume 0, Los Alamitos, CA, USA, 2005. IEEE Computer Society.


• Parallel Stream Processing Engines
1. Vincenzo Gulisano, Ricardo Jiménez-Peris, Marta Patiño-Martínez, and Patrick Valduriez. StreamCloud: A large scale data streaming system. In ICDCS 2010: International Conference on Distributed Computing Systems, pages 126–137, June 2010.
2. Mehul A. Shah, Joseph M. Hellerstein, Sirish Chandrasekaran, and Michael J. Franklin. Flux: An adaptive partitioning operator for continuous query systems. In ICDE, 2002.


• Elastic Stream Processing Engines
1. Vincenzo Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Claudio Soriente, and Patrick Valduriez. StreamCloud: An elastic and scalable data streaming system. IEEE Transactions on Parallel and Distributed Systems, 99(PrePrints), 2012.
2. Thomas Heinze. Elastic complex event processing. In Proceedings of the 8th Middleware Doctoral Symposium, MDS ’11, New York, NY, USA, 2011. ACM.
3. Simon Loesing, Martin Hentschel, Tim Kraska, and Donald Kossmann. Stormy: an elastic and highly available streaming service in the cloud. In Proceedings of the 2012 Joint EDBT/ICDT Workshops, EDBT-ICDT ’12, New York, NY, USA, 2012. ACM.
4. Scott Schneider, Henrique Andrade, Bugra Gedik, Alain Biem, and Kun-Lung Wu. Elastic scaling of data parallel operators in stream processing. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS ’09, Washington, DC, USA, 2009. IEEE Computer Society.