
Stream Processing with BigData: SSS-MapReduce

Hidemoto Nakada, Hirotaka Ogawa and Tomohiro Kudoh
National Institute of Advanced Industrial Science and Technology,
1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, JAPAN

Outline
1. Introduction
2. Implementation
   - Overview of SSS-MapReduce
   - Stream Processing in SSS
   - Sliding Window Management
3. Preliminary Evaluation
4. Discussion
5. Related Work
6. Conclusion

1. Introduction

Existing stream processing systems mainly target low-latency data processing and work only on relatively small on-memory data sets.

This kind of system is very effective for specific classes of applications, such as algorithmic trading, but its applicable area is not so large.

We propose SSS, which can process streamed data along with stored large data. SSS is basically a KVS-based MapReduce system, but it can handle streamed data with continuous Mapper and Reducer processes that are periodically invoked by the system.


2. Implementation

Overview of SSS-MapReduce

Server Configuration:

(Figure: SSS-MapReduce server configuration. Each SSS server hosts Mappers, Reducers, and a unitary KVS.)

Implementation of the Distributed KVS:

When an SSS server puts a key-value pair into the distributed KVS, it determines the unitary KVS to write to from a hash of the key.

All the SSS servers share the same hash function, which guarantees that key-value pairs with the same key go to the same unitary KVS.

SSS writes key-value pairs in bulk. The pairs are sorted by key beforehand to reduce the burden on Tokyo Cabinet. SSS also reads key-value pairs in bulk, specifying the beginning and ending keys of the desired range.
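As a rough Python sketch of these access patterns (the names and the list-based stand-in for a unitary KVS are ours, not the actual SSS API):

```python
import hashlib

# Toy model: each "unitary KVS" is a list of (key, value) pairs,
# standing in for one Tokyo Cabinet instance.
NUM_KVS = 4
unitary_kvs = [[] for _ in range(NUM_KVS)]

def kvs_index(key: str) -> int:
    # Every SSS server shares this hash function, so pairs with the
    # same key always land on the same unitary KVS.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_KVS

def bulk_put(pairs):
    # Sort by key before writing, to reduce the burden on the store.
    for key, value in sorted(pairs):
        unitary_kvs[kvs_index(key)].append((key, value))

def bulk_range_read(kvs, begin_key, end_key):
    # Bulk read of every pair whose key lies in [begin_key, end_key].
    return [(k, v) for k, v in kvs if begin_key <= k <= end_key]
```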

We also implemented a network service layer, called the Data Server, that wraps Tokyo Cabinet so that it can be accessed from remote SSS servers.

The protocol is specially designed to leverage the specific access patterns described above and to enable pipeline processing in the SSS servers.

Tuple Group:

In SSS, the data space is divided into several sub-namespaces called Tuple Groups. Mappers and Reducers read input from tuple group(s) and write their output into tuple group(s).

The Data Server allocates one data file to each Tuple Group. This design allows us to remove a whole file when we want to remove a Tuple Group.
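In pseudocode, the one-file-per-group design might look like this (the paths and file naming are our assumptions):

```python
import os

DATA_DIR = "/var/sss/data"  # hypothetical data directory

def file_for_tuple_group(group: str) -> str:
    # One Tokyo Cabinet data file per Tuple Group.
    return os.path.join(DATA_DIR, group + ".tcb")

def remove_tuple_group(group: str) -> None:
    # Dropping a Tuple Group is a single file removal; there is no
    # need to delete its key-value pairs one by one.
    os.remove(file_for_tuple_group(group))
```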


Stream Processing in SSS

Stream Input and Output:

In SSS, stream input is represented as continuous writes to a specific tuple group. The tuple group works as the input buffer for the input stream. The processing Mapper / Reducer will read from that tuple group.

Periodic Mapper / Reducer:

We implemented streamed data processing by invoking Mappers and Reducers continuously and periodically. The Mappers and Reducers read and delete key-value pairs from the specified tuple group, to ensure that no key-value pair is processed more than once.
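A minimal sketch of this loop, with a toy TupleGroup standing in for the real SSS data structure and an assumed invocation interval:

```python
import threading
import time

class TupleGroup:
    # Toy stand-in for an SSS tuple group acting as a stream buffer.
    def __init__(self):
        self._pairs, self._lock = [], threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._pairs.append((key, value))

    def read_and_delete_all(self):
        # Read and delete in one step, so a pair is never processed twice.
        with self._lock:
            pairs, self._pairs = self._pairs, []
        return pairs

def run_periodic(group: TupleGroup, mapper, interval_sec=1.0):
    # Invoke the continuous Mapper (or Reducer) body periodically.
    while True:
        for key, value in group.read_and_delete_all():
            mapper(key, value)
        time.sleep(interval_sec)
```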

When a periodic Mapper or Reducer kicks in on a Tuple Group, the Data Server creates a new database file and redirects subsequent write operations to the new file, while serving the old file for read operations from the Mapper or Reducer.
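The rotation could be sketched as follows (file naming by generation number is our assumption):

```python
class RotatingStore:
    # Sketch of the Data Server's file rotation for one Tuple Group.
    def __init__(self, base_path: str):
        self.base_path, self.generation = base_path, 0

    def _path(self, gen: int) -> str:
        return "%s.%d" % (self.base_path, gen)

    def write_path(self) -> str:
        # Writes always go to the current-generation file.
        return self._path(self.generation)

    def rotate_for_processing(self) -> str:
        # Redirect subsequent writes to a fresh file, and hand the old
        # file to the Mapper/Reducer for reading.
        read_path = self._path(self.generation)
        self.generation += 1
        open(self._path(self.generation), "a").close()  # create new file
        return read_path
```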

MergeReducer:

The MergeReducer is a special Reducer that can handle input from more than one tuple group.

The input for a MergeReducer is one key and two or more value lists.

In the SSS server, there are multiple threads, each reading tuples from one of the input Tuple Groups.


Note that the data in a Tuple Group are always sorted by key.

The SSS server controls the read threads so that they feed the MergeReducer tuples that have the same key.
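Because each input is already sorted by key, this alignment amounts to a multi-way merge join; a sketch (the function names are ours):

```python
def merge_reduce(sorted_inputs, merge_reducer):
    # sorted_inputs: one iterable of (key, value_list) per Tuple Group,
    # each already sorted by key, as SSS guarantees.
    iters = [iter(inp) for inp in sorted_inputs]
    heads = [next(it, None) for it in iters]
    while any(h is not None for h in heads):
        # Smallest key currently at the head of any input.
        key = min(h[0] for h in heads if h is not None)
        value_lists = []
        for i, head in enumerate(heads):
            if head is not None and head[0] == key:
                value_lists.append(head[1])
                heads[i] = next(iters[i], None)  # advance this input
            else:
                value_lists.append([])           # this group lacks the key
        merge_reducer(key, value_lists)          # one key, N value lists

# Example: two tuple groups feeding one MergeReducer.
merge_reduce([[("a", [1]), ("b", [2])], [("a", [3])]],
             lambda k, vs: print(k, vs))  # a [[1], [3]] / b [[2], []]
```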


Sliding Window Management

A sliding window is specified by two parameters, a window length and an interval. It is realized by the Reducer and the PostReducer, with the window divided into subwindows of subwindowLength.

The PostReducer has a small amount of persistent in-memory storage organized as a ring buffer. The length of the buffer is length/subwindowLength, for each key.

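A minimal sketch of such a per-key ring buffer, assuming the window aggregate is a sum and using illustrative parameter values:

```python
from collections import defaultdict, deque

LENGTH, SUBWINDOW_LENGTH = 60, 10    # e.g. a 60 s window of 10 s subwindows
SLOTS = LENGTH // SUBWINDOW_LENGTH   # ring buffer slots per key

# deque(maxlen=SLOTS) drops the oldest subwindow automatically.
buffers = defaultdict(lambda: deque(maxlen=SLOTS))

def post_reduce(key, subwindow_value):
    # Push the latest subwindow result and return the aggregate over
    # the whole window (a sum here, for illustration).
    buf = buffers[key]
    buf.append(subwindow_value)
    return sum(buf)
```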

3. Preliminary Evaluation

We performed a preliminary evaluation to measure the data stream handling throughput of SSS on one node.

The input data were randomly generated so that they mimic Apache web server log records.

The record size was about 300 bytes. We repeatedly put 10,000 records at 10 ms intervals. The input stream was fed directly into the Data Server, without going through the SSS server.
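A sketch of such a load generator (the log line format and the put interface are our guesses):

```python
import random
import time

def fake_log_record() -> str:
    # Randomly generated line mimicking an Apache access log entry,
    # padded to roughly 300 bytes.
    line = ('192.0.2.%d - - [14/Feb/2012:10:00:00 +0900] '
            '"GET /page/%d HTTP/1.1" 200 %d'
            % (random.randint(1, 254), random.randint(1, 9999),
               random.randint(100, 99999)))
    return line.ljust(300)

def feed(put, n=10000, interval_sec=0.01):
    # Put n records directly into the Data Server at 10 ms intervals.
    for i in range(n):
        put(str(i), fake_log_record())
        time.sleep(interval_sec)
```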

(Figure: Mapper throughput results from the preliminary evaluation.)

4. Discussion

The event (key-value pair) stream will be distributed to Mappers and Reducers on several nodes, and there is a shuffle phase between Map and Reduce, where events from the nodes are shuffled with each other. This means that Reducers will receive events in an out-of-order fashion.


5. Related Work

Hadoop Online Prototype:

HOP (Hadoop Online Prototype) is a Hadoop variant that enables pipeline processing by directly connecting Mappers with Reducers, and even Reducers with the Mappers of the next iteration, using sockets, aiming at quick iteration of MapReduce operations. Although it can handle BigData, since it is based on Hadoop, its continuous query only works on stream data from outside and cannot handle static data stored in HDFS.

C-MR (Continuous MapReduce):

C-MR is a stream processing system that targets a single node with multi-core processors. C-MR adopts MapReduce as the programming interface and supports strict sliding window management. Since it is meant for a single node, it cannot scale out by increasing the number of nodes.

S4:

S4 is a distributed processing framework for stream data, originally implemented by Yahoo and contributed to Apache. The basic data structure in S4 is called an Event, which is composed of a key and a value. Operations are performed in modules called PEs (Processing Elements). The basic concept of S4 is somewhat similar to SSS; the main difference is that SSS can handle off-memory big data, while S4 only supports on-memory data.


6. Conclusion

SSS handles stream data with continuous MapReduce that is periodically invoked by the system and performs operations in the storage.

With the continuous MapReduce and MergeReducer, SSS can perform stream processing based on static BigData stored in the system.
