


DEIM Forum 2015 E4-4

Efficient Execution of Designated Event-driven Stream Processing

Yan WANG†, Hiroyuki KITAGAWA††, Salman AHMED SHAIKH†††, and Yousuke WATANABE††††

† Graduate School of Systems and Information Engineering, University of Tsukuba
†† Faculty of Engineering, Information and Systems, University of Tsukuba
††† Center for Computational Sciences, University of Tsukuba
1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8573, Japan
†††† Institute of Innovation for Future Society, Nagoya University
Furo-cho, Chikusa-ku, Nagoya, 464-8603, Japan

E-mail: †{wangyan,salman}@kde.cs.tsukuba.ac.jp, ††[email protected], †††[email protected]

Abstract Stream processing has become an important research issue as the number of stream data sources increases. To date, several stream processing engines have been developed, and one thing most of them have in common is that they process the stream data and generate query results as soon as new data arrives from any stream. However, users are sometimes not interested in all the results and would rather obtain the continuous query result for a short duration after the arrival of data from a particular stream. We call this processing scheme the designated event-driven stream processing scheme. We propose a smart approach for executing the designated event-driven stream processing scheme, and we performed extensive experiments to show its advantage.

Key words Stream processing, Event-driven processing, Smart approach

1. Introduction

With the advancement of technology, the amount of data is increasing. Many of the devices around us generate data continuously, for example cell-phone GPS receivers, car trackers, and weather and traffic sensors. Such data is called a data stream in the database community. In order to process and query such continuously evolving data, many stream processing engines (SPEs) have been developed in the past. STREAM [1], S4 [2], Discretized Streams [3], Borealis [8], Aurora [6], TelegraphCQ [7] and Storm [4] are a few examples of well-known and commonly used SPEs. One thing common to most available SPEs is that, whenever new data arrives from any stream source, they process it and generate query results. In this work we call such a data processing scheme the traditional stream processing scheme.

However, users are sometimes not interested in all the results and would rather obtain the continuous query result for a short duration after the arrival of data from a particular stream or after some particular event. For example, an administrator of a data center may want to know the network condition when some failure occurs, and may also be interested in monitoring the network condition for the 30 minutes following the failure. Suppose there are two streams being read by an SPE, a network connection stream and a failure stream. Then the SPE is supposed to generate query results for 30 minutes only when data arrives from the failure stream. We call such a processing scheme the designated event-driven stream processing scheme. Here the designated stream is the one whose data arrival triggers the query and causes the results to be generated. The designated event-driven scheme differs from the traditional stream processing scheme, where continuous queries are triggered by data from all streams. In this work, we use the terms master stream and activation duration for the designated stream and the triggering duration, respectively.

Specifically, in this work we introduce the idea of designated event-driven stream processing and propose an efficient execution approach (the smart approach) for it, extending the incremental computation scheme employed in some SPEs. We performed extensive experiments to show that the proposed smart approach can improve the system’s throughput significantly when the input rates of the master streams are lower than those of the non-master streams.

The rest of the paper is organized as follows. Section 2 discusses the related work. Section 3 presents the basics of the continuous query language and the traditional stream processing scheme. In Section 4 we introduce the designated event-driven stream processing scheme. Section 5 presents the naive approach and compares it with the proposed smart approach. In Section 6, an extensive experimental study is presented, while Section 7 concludes this paper and discusses some future directions.

2. Related Work

Many stream processing engines have been developed re-

cently due to the increase in demand to efficiently process

fast evolving streams. In this section, we will review some of

the famous and commonly used SPEs.

STREAM: The Stanford Data Stream Management Sys-

tem [1], is a data management and query processing en-

gine for the applications requiring long-running or contin-

uous queries over continuous unbounded streams of data.

STREAM supports declarative continuous queries written in

CQL [5] over continuous streams and traditional stored data

sets. The STREAM prototype targets environments where

streams may be rapid, stream characteristics and query loads

may vary over time, and system resources may be limited.

Discretized stream [3] is a stream programming model for

computer clusters that provides consistency, fault recovery,

and integration with batch systems. The key idea in their

work is to treat data streams as a series of short batch jobs to

bring down the latency of these jobs. They also developed an

SPE to let users intermix streaming, batch and interactive

queries.

Storm [4] is a distributed realtime computation system for processing unbounded streams of data. Storm is quite flexible and can be used with any programming language. A Storm cluster consists of a master node, which accepts queries called topologies and assigns tasks to workers; a group of worker nodes, which process the topologies; and a ZooKeeper cluster, which coordinates the master and worker nodes and also keeps their states.

Besides these, there are several other stream processing engines. One thing common to most of the SPEs discussed above is that they process the stream data as soon as new data arrives. Some of the SPEs also provide event-based stream processing capabilities. However, none of the works discussed above provides a designated event-driven stream processing capability, which is the main contribution of this work. Moreover, we propose an efficient query execution approach (the smart approach) for our designated event-driven stream processing scheme, extending the incremental computation scheme employed in some SPEs.

3. Traditional Stream Processing

In the traditional stream processing engines, when a user

registers a continuous query, it is executed continuously on

the incoming stream tuples and generates continuous out-

put. An example of such a query is shown in Query 1, which

performs continuous join operation with respect to attribute

A of Stream1 and Stream2.

SELECT *
FROM Stream1 [Rows 2], Stream2 [Rows 2]
WHERE Stream1.A = Stream2.A

Query 1: An example of CQL query

Query 1 is written in CQL [5], which is an SQL-

based declarative language for registering continuous queries

against streams and updatable relations. The authors in [5]

proposed abstract semantics and a concrete language, which

is more general than many other continuous query languages

and is therefore adopted by many SPEs. CQL makes use of

the window specifications and constructs for mixing streams

and relations, and the power of any relational query lan-

guage. Since our stream engine also makes use of a slightly

modified form of CQL for querying data streams, we sum-

marize its abstract semantics and briefly discuss the three

classes of operators in CQL [5].

3.1 Abstract Semantics of CQL

The abstract semantics for continuous queries is based on

two data types and three classes of operators, which are de-

fined using a discrete, ordered time domain Γ.

a) Stream S:

A Stream is an unbounded bag (multiset) of pairs <s, t>, where s is a tuple and t ∈ Γ is the timestamp that denotes the logical arrival time of tuple s on stream S.

b) Relation R:

A Relation is a time-varying bag of tuples. The bag of tuples at time t ∈ Γ is denoted by R(t), where R(t) is an instantaneous relation. Note that this definition of a relation differs from the traditional one, which has no built-in notion of time.

There are three classes of operators over streams and re-

lations, which are stream-to-relation operators, relation-to-

relation operators and relation-to-stream operators.

Whenever a new stream tuple arrives at t, a CQL query

composes a new input relation ri(t) using a window operator,

and generates an output relation ro(t) evaluating relational

operators involved in the CQL query. Finally, a relation-

to-stream operator is applied to the output relation ro(t) to

convert it into a stream. For example, Istream outputs tu-

ples in ro(t) − ro(t′) downstream, where t′ is the previous

query evaluation timestamp.
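
As an illustration of the Istream step described above, the following minimal Python sketch treats ro(t) and ro(t′) as bags and emits their difference. The function name istream and the sample tuples are assumptions made for this illustration; they are not taken from any SPE implementation.

```python
from collections import Counter


def istream(current, previous):
    """Return the bag difference current - previous as a sorted list.

    Both arguments are lists of tuples that stand for the output
    relations ro(t) and ro(t') at two consecutive evaluation times.
    """
    # Counter subtraction keeps only positive multiplicities,
    # which is exactly the bag (multiset) difference.
    diff = Counter(current) - Counter(previous)
    return sorted(diff.elements())


# Hypothetical output relations at t' and t.
prev_out = [(5, 2), (5, 3)]
curr_out = [(5, 3), (6, 1), (6, 1)]
new_tuples = istream(curr_out, prev_out)
print(new_tuples)
```

Because the difference is taken over bags rather than sets, a tuple that appears twice in ro(t) and not at all in ro(t′) is emitted twice.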

Figure 1: A simple query plan tree (streams S1 and S2 each pass through a window operator and then a join, followed by Istream; the operators are connected by queues q1 to q6 and use synopses WindowSyn1, WindowSyn2, RelationSyn3, RelationSyn4, LineageSyn5 and RelationSyn6)

3.2 CQL Query Plan

Each CQL [5] query is compiled into a query plan tree. The

query plan tree for Query 1 is shown in Fig. 1. Each CQL

query plan tree runs continuously and is composed of three

different types of components: operators, queues and syn-

opses. The details of these components can be found in [9].

The query plan tree in Fig. 1 consists of three different

operators. The window operators are responsible for gener-

ating finite relations from infinite input streams. The tuples

in these finite relations are then processed by the algebraic

operators like join, aggregate, etc. In Fig. 1, binary join

operator receives the tuples of two windows through queues

q3 and q4. The last operator in the query plan is a relation-

to-stream operator Istream, which calculates increments of

consecutive join results.

3.3 Incremental Computation

In this subsection, we discuss the incremental computation scheme employed in SPEs like STREAM [1]. As mentioned above, a CQL query logically outputs tuples based on ro(t) and ro(t′), but the computations required for ro(t) and ro(t′) often overlap considerably. The incremental computation scheme is used to eliminate this redundant computation. Assume that ri(t) and ri(t′) are the input relations used to generate ro(t) and ro(t′), respectively. Intuitively, the incremental computation scheme calculates ro(t) incrementally from ri(t) − ri(t′), ri(t′) − ri(t), and ro(t′). For this purpose, operators are provided with synopses to maintain the current outputs, and tuples flowing through the query plan tree are associated with plus/minus tags.
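
To make the role of synopses and plus/minus tags more concrete, here is a small Python sketch of an equi-join that is driven only by tagged tuples. The class name IncrementalJoin, its methods, and the choice of joining on the first field are assumptions for illustration; the sketch is not the implementation of STREAM or of any other SPE.

```python
class IncrementalJoin:
    """Equi-join on the first field, updated by '+' and '-' tagged tuples."""

    def __init__(self):
        # One synopsis per input, holding the tuples currently in its window.
        self.left = []
        self.right = []

    def on_left(self, tag, tup):
        return self._process(tag, tup, self.left, self.right, True)

    def on_right(self, tag, tup):
        return self._process(tag, tup, self.right, self.left, False)

    def _process(self, tag, tup, own, other, from_left):
        # Maintain this input's synopsis according to the tag.
        if tag == "+":
            own.append(tup)
        else:
            own.remove(tup)
        # Only the change is joined against the other side, so the
        # output is the increment (or decrement) of the join result.
        matches = [o for o in other if o[0] == tup[0]]
        if from_left:
            return [(tag, tup + o) for o in matches]
        return [(tag, o + tup) for o in matches]


join = IncrementalJoin()
r1 = join.on_left("+", (5,))      # nothing to join with yet
r2 = join.on_right("+", (5, 3))   # produces a tagged join result
r3 = join.on_right("-", (5, 3))   # retracts that join result
print(r1, r2, r3)
```

The join never recomputes the full result: a plus-tagged input yields plus-tagged results and a minus-tagged input yields the matching retractions.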

Whenever a new tuple arrives at the window operator, it is saved in the window synopsis. The window operator converts a stream into a relation using a sliding window mechanism. In order to implement the incremental computation scheme, newly arriving tuples are tagged with plus tags. (We call them p-tuples.) The p-tuples are sent to the downstream operators, which calculate the increments caused by the p-tuples. At the same time, the oldest tuples become obsolete (fall outside the window), and minus tags are attached to them. (We call them m-tuples.) The m-tuples are also sent to all the downstream operators to convey the expiration of the tuples so that the operators can remove them from their synopses.
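
The behaviour of a row-based window operator can be sketched as follows. This is a simplified illustration under stated assumptions: the class name RowWindow is hypothetical, tuples carry only their data contents, and the p-tuple is emitted before the m-tuple, whereas the text only says the two happen at the same time.

```python
from collections import deque


class RowWindow:
    """Row-based sliding window that emits tagged tuples."""

    def __init__(self, rows):
        self.rows = rows
        self.synopsis = deque()  # tuples currently inside the window

    def insert(self, tup):
        # Every arriving tuple is stored and announced as a p-tuple.
        self.synopsis.append(tup)
        emitted = [("+", tup)]
        # If the window overflows, the oldest tuple expires and is
        # announced downstream as an m-tuple.
        if len(self.synopsis) > self.rows:
            emitted.append(("-", self.synopsis.popleft()))
        return emitted


window = RowWindow(rows=2)
log = [window.insert(t) for t in [(5, 3), (5, 9), (6, 1)]]
print(log)
```

With a window of two rows, the third insertion is the first one that also produces an m-tuple, for the oldest tuple (5, 3).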

4. Designated Event-driven Stream Processing

In this section we introduce the proposed designated event-driven stream processing scheme (designated scheme for short). In this scheme, the continuous query is executed for a fixed time duration τ after the arrival of a tuple from the designated streams. The duration τ must be specified by the user when the query is registered. More than one data stream may be designated. In contrast, in the traditional stream processing scheme, the query is executed continuously on all incoming data streams. The traditional stream processing scheme is a special case of the proposed designated scheme: if we designate all the source streams as master streams and set the activation duration to 0, then the designated scheme behaves like the traditional scheme.

We now formally define the designated scheme. Let T be a time domain consisting of discrete, ordered timestamps t. The SPE assigns a timestamp t from the domain T to each incoming stream tuple. Tuples may arrive at the SPE from the master or non-master streams. Let T′ be the time domain of the master streams, such that T′ ⊆ T and t′ ∈ T′.

Suppose a query Q is registered on the SPE by a user. When a tuple arrives from a master stream at t′, it triggers Q, and we call this state of Q active. The active state of Q lasts for a time duration τ. This time duration is referred to as the activation duration. If a new tuple arrives from any incoming data stream while the query is still in the active state, the query is triggered again. The active time duration of a query which was activated at t′ is given by [t′, t′ + τ]. Note that if a new tuple arrives from a master stream at t′′ ∈ [t′, t′ + τ], the activation duration is updated to [t′′, t′′ + τ].

Consider a join query, Query 2, using the proposed designated scheme. In contrast to Query 1 for the traditional scheme, Query 2 contains two additional clauses: a MASTER clause and an ACTIVATE DURATION clause. This query performs a simple binary join operation with respect to attribute X of Stream1 and Stream2. Stream1 is designated as the master stream, while Stream2 is by default a non-master stream. The activation duration τ is set to 1 second, which means that the query will remain active for 1 second after the arrival of each tuple from Stream1.

MASTER Stream1
SELECT *
FROM Stream1 [Rows 2], Stream2 [Rows 2]
WHERE Stream1.X = Stream2.X
ACTIVATE DURATION 1 second

Query 2: A designated event-driven query

Figure 2: Input data streams (timeline from Time 1 to Time 8 showing the tuples of the master stream Stream1, with attribute x, and of Stream2, with attributes x and y)

Figure 3: Output stream (result tuples produced at Time 3 and Time 7)

The proposed designated scheme has many applications. For example, in the case of the network streams mentioned in Section 1, the failure stream can be the master stream. Another example is a scenario where a user has two streams, a news stream and a Twitter tweet stream, and wants to analyse the related tweets when a piece of news of interest arrives. In this case, the user can designate the news stream as the master to avoid processing all the tuples in the tweet stream, which saves computation resources and increases the query throughput. For queries with only one input stream S, an external trigger or event stream can be specified as the master stream to activate the query, or S itself can be designated as the master stream.

Consider Query 2 once again with the data streams shown in Fig. 2, where the streams’ tuples arrive in time order. For the sake of simplicity we assume that each stream is capable of generating one tuple per second. As stated earlier, the arrival of a tuple from the master stream activates the query and produces the result. In Fig. 2, the master stream activates the query at t = 2 and t = 7. Since the activation duration is set to 1 second, the query is activated by the arrival of each master stream tuple and remains active for 1 second, so the query is active at t = 2, 3, 7 and 8. The results generated by this execution are shown in Fig. 3.
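
The activation rule can be checked with a short Python sketch. The class name ActivationTracker and its methods are hypothetical; the sketch only assumes what the text states, namely that a master tuple arriving at t′ makes the query active during the closed interval [t′, t′ + τ] and that a later master tuple resets this interval.

```python
class ActivationTracker:
    """Tracks whether a designated event-driven query is active."""

    def __init__(self, tau):
        self.tau = tau            # activation duration
        self.activated_at = None  # timestamp of the latest master tuple

    def on_master_tuple(self, t):
        # A master tuple (re)starts the activation interval at t.
        self.activated_at = t

    def is_active(self, t):
        if self.activated_at is None:
            return False
        return self.activated_at <= t <= self.activated_at + self.tau


# Master tuples arrive at t = 2 and t = 7, and tau is 1 second.
tracker = ActivationTracker(tau=1)
master_arrivals = {2, 7}
active_times = []
for t in range(1, 9):
    if t in master_arrivals:
        tracker.on_master_tuple(t)
    if tracker.is_active(t):
        active_times.append(t)
print(active_times)  # [2, 3, 7, 8]
```

The printed list agrees with the active timestamps t = 2, 3, 7 and 8 given above.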

5. Query Execution Approaches for the Designated Event-driven Stream Processing Scheme

We present two query execution approaches to the des-

ignated event-driven stream processing scheme: the naive

approach and the proposed smart approach, which can sig-

nificantly increase the query throughput.

Namely, this section discusses how incoming stream tu-

ples, including master and non-master stream tuples, are

processed in the designated event-driven stream processing

engine under the naive and smart approaches.

5.1 Naive Approach

The simplest way to handle incoming stream tuples is to

process the tuples only when the query is active. However

this does not work, as we must maintain the incoming tuples

from non-master streams even when the query is inactive to

guarantee the correctness of the window-based query result.

In the naive approach, the system works almost like the

traditional incremental stream processing scheme as ex-

plained in Section 3. The only difference is that we set the

master flag of all the incoming tuples which arrive during

the query activation duration. Master flags are conveyed to

intermediate tuples constructed by operators involved in the

query. More precisely, if an input tuple which triggers the

operator has the master flag, the master flags of the corre-

sponding output tuples generated in the incremental com-

putation scheme are set. Using the master flag, the outer-

most operator, which is a relation-to-stream operator such

as Istream operator for generating the output, identifies the

tuples which should be output as the query results.

When a leaf operator (an operator responsible for accept-

ing a stream’s input) receives a tuple, it checks whether the

tuple is from a master stream or from a non-master stream.

If the tuple is from a master stream, then the master flag of

the tuple is set, the query is activated, and the query activa-

tion time is updated. On the other hand, if the tuple is from

a non-master stream, then we check the query state. If the

query is active, the tuple master flag is set, which means that

the tuple arrived while the query was activated and must be

considered for the output.
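
The flag-setting rule of the leaf operators can be sketched as follows. This is a simplification under stated assumptions: a single hypothetical NaiveLeaf object stands in for all leaf operators and holds the shared query state, and the names Tagged and receive are introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Tagged:
    ts: int
    stream: str
    data: tuple
    master_flag: bool = False


class NaiveLeaf:
    """Sets the master flag on incoming tuples (naive approach)."""

    def __init__(self, master_streams, tau):
        self.master_streams = set(master_streams)
        self.tau = tau
        self.activated_at = None  # query activation time

    def receive(self, ts, stream, data):
        if stream in self.master_streams:
            # Master tuple: activate the query and update the activation time.
            self.activated_at = ts
            flag = True
        else:
            # Non-master tuple: flagged only if the query is currently active.
            flag = (self.activated_at is not None
                    and ts <= self.activated_at + self.tau)
        # The tuple is forwarded downstream in either case.
        return Tagged(ts, stream, data, flag)


leaf = NaiveLeaf(master_streams=["Stream1"], tau=1)
arrivals = [(2, "Stream1", (5,)), (3, "Stream2", (5, 3)), (4, "Stream2", (5, 9))]
flags = [leaf.receive(*a).master_flag for a in arrivals]
print(flags)
```

Note that the tuple arriving at t = 4 is still forwarded, only without the master flag, which is the source of the wasted work discussed next.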

The problem with the naive approach is that, during the

query inactive states, a lot of tuples arriving from non-master

streams get processed by all the operators and generate in-

termediate query results without master flags, which are ac-

cumulated in the outermost operator synopsis and deleted

using m-tuples. This wastes the computational and storage

resources of the SPE.

Example: Assume that Query 2 is provided with the in-

put data streams shown in Fig. 2. The master flag is set by

the leaf operator for all tuples which arrive while the query is

active. Here, we assume that a tuple has the timestamp, the

unique identifier, a plus or minus tag, and the data contents.

In the stream shown in Fig. 2, due to the arrival of a

tuple from Stream1 at t = 2, the query is active at t = 2

and t = 3. Thus the master flag of the tuple arriving at

t = 3 from Stream2 is set. The resultant tuple looks like this:

< 3, 3,+, T, 5, 3 >, which consists of the timestamp, the tu-

ple identifier, the plus/minus tag, the master flag (T:master

flag on, F: master flag off), and the data contents.

Since the tuple from Stream1 at t = 2 and the tuple from Stream2 at t = 3 satisfy the join condition, the join operator generates a new p-tuple < 3, 4,+, T, 5, 3 >. When this p-tuple arrives at the Istream operator, it outputs it as the query result as shown in Fig. 3, because the arriving tuple’s master flag is on.

Figure 4: Smart Window Operator (a smart window [Rows 2] whose window synopsis, with columns ts, x, y, id, T and M, is divided into an output part and a suspended part)

At t = 4, a p-tuple < 4, 6,+, F, 5, 9 > is generated as a result of the join of the tuple of t = 2 from Stream1 and the tuple of t = 4 from Stream2. Since the query is inactive at t = 4, its master flag is set to F. When this tuple arrives at the Istream operator, it is saved in the synopsis. Since the window size for Stream2 is 2, the corresponding window synopsis maintains the latest 2 tuples from Stream2. At t = 5, a new tuple arrives from Stream2, and it is processed in a similar way. At t = 6, a new tuple arrives from Stream2, and it pushes the tuple of t = 4 out of the window synopsis. An m-tuple < 6, 6,−, F, 5, 9 > is generated by the window operator and sent downstream. Then the Istream operator deletes the tuple < 4, 6,+, F, 5, 9 > from the Istream synopsis. Finally, at t = 7, another tuple arrives from Stream1, which activates the query and generates another result as shown in Fig. 3. In this example, the join result of t = 4 does not contribute to any query results.

5.2 Smart Approach

The naive approach may generate many useless intermediate query results. The smart execution scheme addresses this problem by changing the behaviour of the window operator. We introduce the smart window operator for non-master streams.

In the smart approach, when a tuple arrives from a non-

master stream, the system checks whether the query is active

or not. If the query is inactive, the smart window operator

buffers this tuple inside the window operator and does not

output any p-tuple. If the query is active, the smart window

operator outputs p-tuples corresponding to this tuple as well

as all the buffered tuples. While buffering tuples in the inac-

tive state, some buffered tuples reach beyond the window size

because of arrivals of succeeding tuples and become obsolete.

These obsolete tuples can be deleted directly. In the naive approach, p-tuples are also output for such tuples, and processed by the downstream operators. When they get expired, their results are canceled by the corresponding m-tuples.

Figure 5: Query plan tree using smart window operator (S1 passes through a window operator and S2 through a smart window operator; both feed a join followed by Istream, connected by queues q1 to q6, with synopses WindowSyn1, WindowSyn2, RelationSyn3, RelationSyn4, LineageSyn5 and RelationSyn6)

The synopsis of the smart window operator is divided into

two parts: output and suspended parts as shown in Fig. 4.

Both the output and suspended parts keep recent incoming

tuples within the window. A new tuple is first put into the

suspended part. When the query is activated, p-tuples corre-

sponding to the tuples inside the suspended part are output

to the downstream, and they are moved to the output part.

If tuples in the output part get expired, corresponding m-

tuples are output to the downstream. But for tuples in the

suspended part, no p-tuples or m-tuples are sent to the down-

stream, which contributes to the reduction of computation

cost.
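
The two-part synopsis can be sketched in Python as follows. This is a minimal sketch, not the prototype's code: the class name SmartWindow, the flush method and the active argument are assumptions, tuples carry only their data contents, and expiration is handled before buffered tuples are released.

```python
from collections import deque


class SmartWindow:
    """Row-based window whose synopsis has an output and a suspended part."""

    def __init__(self, rows):
        self.rows = rows
        self.output = deque()     # tuples already announced downstream
        self.suspended = deque()  # buffered tuples, not yet announced

    def flush(self):
        # Called when the query is active: announce all buffered tuples
        # as p-tuples and move them to the output part.
        emitted = [("+", t) for t in self.suspended]
        self.output.extend(self.suspended)
        self.suspended.clear()
        return emitted

    def insert(self, tup, active):
        emitted = []
        self.suspended.append(tup)
        # Output-part tuples are always older than suspended ones,
        # so they expire first and need an m-tuple.
        while len(self.output) + len(self.suspended) > self.rows:
            if self.output:
                emitted.append(("-", self.output.popleft()))
            else:
                self.suspended.popleft()  # dropped without any m-tuple
        if active:
            emitted.extend(self.flush())
        return emitted


w = SmartWindow(rows=2)
events = [
    w.insert((5, 3), active=True),   # t = 3, query active
    w.insert((5, 9), active=False),  # t = 4, buffered
    w.insert((6, 1), active=False),  # t = 5, expires (5, 3)
    w.insert((7, 8), active=False),  # t = 6, silently drops (5, 9)
]
on_activation = w.flush()            # t = 7, master tuple arrives
print(events, on_activation)
```

In this run the tuple (5, 9) never leaves the suspended part, so neither a p-tuple nor an m-tuple is generated for it.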

Example: Consider Query 2, whose query plan tree is

shown in Fig. 5. We assume the input data streams shown

in Fig. 2.

At t = 4, the query is inactive and the tuple is therefore

buffered in the suspended part of the smart window’s synop-

sis instead of sending a p-tuple like the naive approach. A

tuple of t = 5 from Stream2 is also buffered in the suspended

part. The snapshot of the smart window operator synopsis

is shown in Fig. 4. The window size of the smart window op-

erator is 2, so it always maintains the latest 2 tuples. When

the tuple < 6, 7, F, 7, 8 > arrives at t = 6, the oldest tuple

in the suspended part, i.e., < 4, 5,+, F, 5, 9 > gets expired.

Since this tuple stays in the suspended part of the synopsis, it never contributed to generating intermediate results downstream. Thus, in contrast to the naive approach, the tuple can be deleted directly from the window synopsis without the need to generate any m-tuple. As a result, the generation of useless intermediate tuples can be avoided. The final result of Query 2 is exactly the same as the one shown in Fig. 3.

6. Experiments

In this section we present a detailed experimental study to evaluate the effectiveness of the proposed designated event-driven stream processing scheme and the smart approach. For the experiments, we developed a prototype SPE which enables users to register CQL-style queries in their algebraic expressions. The source program consists of about 13,000 lines of C++ code. The query is translated into a query plan tree consisting of operators, queues and synopses. Our prototype SPE also supports multiple queries. The prototype SPE supports designated event-driven stream processing and both the smart and naive execution approaches.

We performed the experiments on a Lenovo ThinkPad L412 with a 2.26 GHz Intel i3 processor and 6 GB of RAM running Ubuntu 14.10.

We used two synthetic data streams, each consisting of two

attributes, an integer and a string. Both the streams have

an integer attribute A and a string attribute named B and

C for Stream1 and Stream2, respectively.

For the purpose of performance measurements, different

queries are executed on the prototype SPE. Unless stated

otherwise, each query is executed for the duration of 60 sec-

onds. Each experiment is performed 5 times and the average

values are taken. In the following, the query activation du-

ration is expressed by τ and the join selectivity by δ.

a ) Join Query:

First, a join query Query 3 is evaluated. Stream1 is the

master stream with input rate Rm and window size Wm,

while Stream2 is the non-master stream with input rate Rs

and window size Ws.

MASTER Stream1
SELECT *
FROM Stream1 [Rows Wm], Stream2 [Rows Ws]
WHERE Stream1.a = Stream2.a
ACTIVATE DURATION = τ

Query 3: Join query

As discussed in Section 5, the smart approach generates fewer intermediate tuples than the naive approach when tuples in the smart window synopsis can be deleted directly. Tuples stay in the suspended part of the smart window synopsis only while the query is inactive. Since the interval between the arrivals of two master stream tuples is 1/Rm, if τ is larger than 1/Rm, the query remains active all the time. Therefore, the smart approach can be advantageous when τ < 1/Rm.

The query is inactive for the duration 1/Rm − τ. During this interval, (1/Rm − τ)Rs tuples arrive from the non-master stream. These tuples are buffered in the suspended part of the smart window synopsis. If the number of these tuples grows larger than the window size Ws, the oldest tuples are deleted directly from the smart window. This suggests that the smart approach outperforms the naive approach if the following inequality holds.

(1/Rm − τ)Rs > Ws    (1)

Fig. 6 compares the throughput of the naive and the proposed approaches. For this experiment we set Wm = 1, Ws = 100, Rm = Rs/1000, τ = 0 and δ = 1. Rs was varied from 100,000 tuples/s to 1,000,000 tuples/s. Query 3 was executed for 5 minutes and we observed the processing rate as well as the average processing time of an incoming tuple in the naive and smart approaches.

Fig. 6b shows that the average processing time of the smart approach is far less than that of the naive approach. Fig. 6a shows that the smart approach can deal with more input tuples than the naive approach when the input rate is more than 400,000 tuples/s. The smart approach graph in Fig. 6a becomes flat after the input rate reaches 800,000 tuples/s because the prototype system can process up to 800,000 tuples/s.

Figure 6: Smart approach throughput. Panels: (a) Processing Rate, processing rate (tuples/s) vs. input rate (tuples/s); (b) Processing Time, average processing time (us) vs. input rate (tuples/s). Series: naive, smart.

Next we performed experiments to verify Eq. 1. We used the average processing times to compare the naive and smart approaches. The experiments were performed with join selectivities δ = 0.1 and δ = 1. For these experiments, the following default values were used: Ws = 1, Wm = 1, Rm = 100, Rs = 1000 and τ = 0 s.

All the graphs in Figs. 7 and 8 show the average processing time of the naive and smart query processing approaches. Fig. 7a shows that the case of Ws = 1 satisfies Eq. 1, and therefore the average processing time of the smart approach is far less than that of the naive approach. However, with Ws = 10, Eq. 1 does not hold, and therefore there is no significant difference between the processing times of the two approaches. Fig. 7b shows the measurements for different Rm values. The cases of Rm = 10 and 100 satisfy Eq. 1. Similarly, in Fig. 7c, the cases of Rs = 1,000, 10,000 and 100,000 satisfy Eq. 1. In Fig. 7d, the cases of τ = 0 and 0.001 also satisfy Eq. 1.
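The cases listed above can be re-derived by evaluating Eq. 1 for each tested value while keeping the other parameters at their defaults. The following Python sketch does this; the names `eq1_holds`, `DEFAULTS`, `SWEEPS` and `sweep` are illustrative, the swept values are those on the x-axes of Figs. 7 and 8, and exact rational arithmetic is used only to avoid rounding at the boundary cases.

```python
from fractions import Fraction


def eq1_holds(rm, rs, ws, tau):
    """Return True when (1/Rm - tau) * Rs > Ws, i.e. when Eq. 1 holds."""
    tau = Fraction(str(tau))  # exact decimal value, e.g. "0.001" -> 1/1000
    return (Fraction(1, rm) - tau) * rs > ws


# Default values stated for the Eq. 1 experiments.
DEFAULTS = {"rm": 100, "rs": 1000, "ws": 1, "tau": 0.0}

# Values swept on the x-axes of Figs. 7 and 8.
SWEEPS = {
    "ws": [1, 10],
    "rm": [10, 100, 1000, 10000],
    "rs": [100, 1000, 10000, 100000],
    "tau": [0.0, 0.001, 0.01, 0.1],
}


def sweep():
    """Vary one parameter at a time and record whether Eq. 1 holds."""
    table = {}
    for name, values in SWEEPS.items():
        table[name] = {v: eq1_holds(**{**DEFAULTS, name: v}) for v in values}
    return table
```

The boundary cases (Ws = 10, Rm = 1000, Rs = 100 and τ = 0.01) make the two sides of Eq. 1 equal or reverse the inequality, so they are reported as not satisfying it.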

Figure 7: Average query processing time with join selectivity of 1. Panels (y-axis: average processing time (s); series: naive, smart): (a) Varying Ws (1, 10); (b) Varying Rm (10, 100, 1000, 10000); (c) Varying Rs (100, 1000, 10000, 100000); (d) Varying τ (0, 0.001, 0.01, 0.1).

Fig. 8 shows the measurements for the join processing with the join selectivity δ = 0.1. The graphs in Fig. 8 show trends very similar to those in Fig. 7. However, the overall processing costs of both approaches in Fig. 8 are smaller than those in Fig. 7 due to the reduction in the join selectivity.

Figure 8: Average query processing time with join selectivity of 0.1. Panels (y-axis: average processing time (s); series: naive, smart): (a) Varying Ws (1, 10); (b) Varying Rm (10, 100, 1000, 10000); (c) Varying Rs (100, 1000, 10000, 100000); (d) Varying τ (0, 0.001, 0.01, 0.1).

b) Single Stream Query:

The smart approach can be applied to queries with only

one input stream. Query 4 is an example of a single stream

query. We used this query, and the results are shown in Fig.

9a.

MASTER Stream1

SELECT a

FROM Stream1 [Row 1 ]

WHERE b = 1

ACTIVATE DURATION τ = 0

Query 4: Single stream query

From the figure we can observe that the average processing time of the smart approach is almost the same as that of the naive approach. This query deals with only one stream, and the same stream is designated as the master and is responsible for generating query results. There is no smart window operator in the query plan tree, and therefore no tuples need to be buffered. This is why the smart approach is not advantageous here.

Figure 9: Single Stream Analysis. Panels (y-axis: average processing time (s); bars: naïve, smart): (a) Single Stream Query; (b) Single Stream Query with External Event Stream.

c) Single Stream Query with Event Stream:

We also evaluated a query over one stream with an additional event stream that triggers the query. We performed a join between an event stream and a normal stream using Query 5. The average processing times of both query processing approaches are shown in Fig. 9b, which shows that the smart approach has a much lower average processing time than the naive approach.

MASTER Event

SELECT Stream2 .∗

FROM Event [Row 1 ] , Stream2 [Row 1 ]

ACTIVATE DURATION τ = 0

Query 5: Single stream query with event stream
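To make the behaviour of such an event-triggered query concrete, here is a minimal Python sketch of one possible reading of Query 5 with τ = 0. It is an assumption-laden illustration rather than the prototype's plan: the function `run_query5`, the `(source, payload)` input format, and the one-slot buffer standing in for `Stream2 [Row 1]` are all hypothetical.

```python
def run_query5(arrivals):
    """Evaluate a Query 5-like plan over (source, payload) pairs in arrival order.

    Stream2 tuples only overwrite a one-slot buffer; a result is produced
    only when a tuple arrives from the master stream "Event".
    Returns (results, overwrites).
    """
    latest_stream2 = None  # one-slot buffer standing in for Stream2 [Row 1]
    overwrites = 0         # Stream2 tuples replaced without any join work
    results = []
    for source, payload in arrivals:
        if source == "Stream2":
            if latest_stream2 is not None:
                overwrites += 1
            latest_stream2 = payload
        elif source == "Event":
            # The master tuple activates the query; with tau = 0 the result
            # is produced only at this instant.
            if latest_stream2 is not None:
                results.append(latest_stream2)
        else:
            raise ValueError(f"unknown source: {source!r}")
    return results, overwrites
```

For the arrival sequence Stream2(1), Stream2(2), Event, Stream2(3), Stream2(4), Stream2(5), Event, the sketch returns the results [2, 5] and counts 4 overwritten tuples. Under this reading, the work per Stream2 tuple is a single assignment, which is consistent with the advantage observed for the smart approach when events are rare.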

7. Conclusion and Future Work

In this work, we have proposed the designated event-driven stream processing scheme to enable complex event-driven querying of data streams. We have also proposed an efficient query execution approach extending the incremental computation scheme employed in some SPEs. We have developed a prototype stream processing engine implementing the proposed event-driven stream processing scheme and the smart query execution approach. Detailed experimental evaluations using the prototype have been presented to show the advantage of the proposed scheme. The experiments show that the proposed smart approach is less expensive than the naive approach when the input rates of the master streams are lower than those of the non-master streams. Future work includes sophisticated query optimization techniques incorporating the proposed smart approach and complex event-driven parallel data processing.

Acknowledgement

This research was partly supported by the program "Research and Development on Real World Big Data Integration and Analysis" of the Ministry of Education, Culture, Sports, Science and Technology, Japan.

References

[1] R. Motwani et al. Query Processing, Resource Management, and Approximation in a Data Stream Management System. In Proc. of CIDR, January 2003.

[2] L. Neumeyer et al. S4: distributed stream computing platform. In Proc. of KDCloud, December 2010.

[3] M. Zaharia et al. Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters. In Proc. of HotCloud, June 2012.

[4] http://storm-project.net/

[5] A. Arasu et al. CQL: A Language for Continuous Queries over Streams and Relations. In Proc. of the Intl. Conf. on Database Programming Languages, September 2003.

[6] D. Abadi et al. Aurora: A New Model and Architecture for Data Stream Management. The VLDB Journal 12(2), pp. 120-139, August 2003.

[7] S. Chandrasekaran et al. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. In Proc. of CIDR, January 2003.

[8] D. J. Abadi et al. The design of the borealis stream processing engine. In Proc. of CIDR, January 2005.

[9] A. Arasu, S. Babu, and Jennifer Widom. The CQL continuous query language: semantic foundations and query execution. The VLDB Journal 15(2), pp. 121-142, June 2006.