ADVANCED DATABASES
TERM PROJECT
STREAM DATA PLATFORMS – APACHE KAFKA
KANATOV, KARİM
ZEYBEK, ALİ BAHADIR
TABLE OF CONTENTS
• Abstract
• Introduction
• Stream Processing, Messaging, Data Warehouse
• Use Cases
• Complex Event Processing
• Apache Kafka
o Quick Start
o How to put data into Kafka
o How to get data from Kafka
o Design
• Application – Converting a traditional RDBMS to a Stream of Changes using Kafka
• Application – Development & Running
• Conclusion
ABSTRACT
Throughout this paper, one of the emerging concepts in the database world – Stream Data Platforms – and what they are useful for will be discussed. Moreover, Apache Kafka – publish/subscribe messaging rethought – will be demonstrated in order to better understand specific use cases of Stream Data Platforms.
INTRODUCTION
60 seconds. Let us think for a while about what might happen on the Internet within that short period of time. According to DOMO1, each minute of every day the following happens on the Internet:
● 347,222 tweets are sent!
● More than 300 hours of content are uploaded to YouTube!
● Instagram users like more than 1.7 million photos.
● Facebook users like 4.1 million posts!
● Amazon receives 4,310 unique visitors.
● Apple users download 51,000 apps.
All of this happens within just one minute.
According to Gartner, Inc. forecasts from 20142, around 5 billion connected things (the Internet of Things) will be in use in 2015. In other words, those things will generate more data, and that data will be streamed in real time. More interestingly, the majority of the data used on the Internet today is created by individual users via social media. It is a never-ending feed of information.
Globally, 3.2 billion people were using the Internet by the end of 2015, of which 2 billion are from developing countries3; that is about 45% of the worldwide population. It means there will be even more data in the coming years. That data will grow exponentially and will need to be processed in real time.
1 https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/
2 https://www.gartner.com/newsroom/id/2905717
3 https://www.itu.int/en/ITU-D/Statistics/Documents/facts/ICTFactsFigures2015.pdf
Now let us switch to a different perspective. In most traditional business applications, there are certain changes of events that need to be captured and acted upon. Even a simple shopping site has different events such as user login – add to the cart – payment – shipment – etc. The question here is: how many of those event changes can be captured?
Normally, what most applications using traditional RDBMSs do is change the state of the corresponding entry in the relation to the most up-to-date value. However, the underlying sequence of changes that builds up to the final state is usually overwritten.
A Stream Data Platform is basically good for two main concepts:
• Data Integration: The stream data platform captures streams of events or data changes and feeds these to other data systems such as relational databases, key-value stores, Hadoop, or the data warehouse.
• Stream Processing: It enables continuous, real-time processing and transformation of these streams and makes the results available system-wide.
The demand for stream processing is increasing rapidly these days. Often, processing huge volumes of data is not enough. Data has to be processed fast, so that a firm can react to changing business conditions in real time. This is required for trading, fraud detection, system monitoring, and many other examples.
Generally, gathering the data depends on the data sources. Events coming from databases, machine-generated logs, and even sensors need to be cleaned, schematized and forwarded to a central place.
Secondly, one of the main goals in the development of Kafka was collecting the streams in one central place. So, brokers collect streams of data, log and buffer them in a fault-tolerant manner, and distribute them among the various consumers that are interested in those streams. Finally, there is analysing the streams of data, which can be performed well using other Apache solutions like Samza or Storm.
This paper consists of a basic understanding of stream processing and the role of Kafka in it. We will not only discuss the theory, but also have a look at the practical use of an existing solution.
Further, we will discuss the use of Kafka in detail. We will look at the configuration, design choices and implementation of Kafka. Also, the use of Kafka in stream processing will be discussed in detail.
STREAM PROCESSING, MESSAGING, DATA WAREHOUSE
Data warehouses are well known for storing and analysing structured data. Running a simple query on top of terabytes of data is enough to get any historical data within seconds.
However, the ETL processes often take too long. As a business grows, stakeholders want to query up-to-date information, since information from the day before is not sufficient for making precise decisions. This is where stream processing comes in and feeds all new data into the data warehouse immediately.
Figure 1. Stream Data Platform architecture built around Kafka
The problem of most big companies is the use of many different data systems: relational OLTP databases, Hadoop, Teradata, a search system, monitoring systems, OLAP stores and so on. Each of these systems needs reliable feeds of data.
One of the main problems that Kafka solves is the transportation of data between those systems. Once the data is in Kafka, consumers can do whatever is needed. Reliability and low latency are the key advantages of using Kafka.
USE CASES
One of the first commercial uses of stream processing was found in the finance industry, as stock exchanges moved from floor-based trading to electronic trading. Today, we can confidently say it is used in almost every industry – anywhere streams of data are generated. The Internet of Things will increase the volume, variety and velocity of data, leading to a dramatic increase in the applications for stream processing technologies.
Kafka was developed by LinkedIn's engineering team, so that was the first place where the concept and the real abilities of the system were tested. As Kafka's creator, Jay Kreps, states, the existing monolithic, centralized database began facing its limits and there was a need to start the transition to a portfolio of specialized distributed systems4. So, Kafka was built to serve as a central repository of data streams. Most interestingly, engineers tend to underestimate the power of the log; most of the time we do not see logs as an innovation even though they have been around as long as the computer itself.
Kafka can serve as a message broker, a web activity tracker, or a log aggregation solution. Nowadays, stream processing solutions such as Apache Storm and Samza go together with Kafka as the messaging layer and integrate well with it.
Kafka lets us see data not as data itself but as a stream of events. For instance, retail has events such as logistics, sales, orders, refunds and so on. Web sites have streams of clicks, impressions, and searches. Big software systems have streams of requests, errors, machine metrics, and logs. So, the idea behind all of that is collecting, storing and using all those events for a better understanding of users' and customers' behaviour.
Most e-commerce websites do not only want to know the final result of a customer purchase; they would also like to see the logical path of user choices, which can be perceived as a stream of events. Those streams of events help a company to improve its service and price policy, and they are also good for marketing purposes.
4 https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Thus, much of what people refer to when they talk about "big data" is really the act of capturing these events that previously weren't recorded anywhere and putting them to use for analysis, optimization, and decision making. In some sense these events are the other half of the story the database tables don't tell: they are the story of what the business did.
COMPLEX EVENT PROCESSING
Kafka is well known as a system for high-throughput, reliable event streams. On the processing side, developers most of the time choose stream processing frameworks like Samza, Storm or Spark Streaming.
On the other hand, there is also some work on high-level query languages for stream processing, and Complex Event Processing (CEP) is especially worth mentioning. It originated in 1990s research on event-driven simulation, and most CEP products are commercial, expensive enterprise software. Esper is the only free and open source solution, but it is limited to running on a single machine.
With CEP, you write queries or rules that match certain patterns in the events. In other words, CEP is about detecting and selecting the interesting events (and only them) from multiple sources, finding their relationships and deducing new data from them and their relationships. The goal of complex event processing is to identify meaningful events (such as opportunities or threats) and respond to them as quickly as possible.
These rules are comparable to SQL queries (which describe what results you want to return from a database), except that the CEP engine continually searches the stream for sets of events that match the query and notifies you (generates a "complex event") whenever a match is found. This is useful for fraud detection or monitoring business processes, for example.
For use cases that can be easily described in terms of a CEP query language, such a high-level language is much more convenient than a low-level event processing API. On the other hand, a low-level API gives you more freedom, allowing you to do a wider range of things than a query language would let you do. Also, by focussing their efforts on scalability and fault tolerance, stream processing frameworks provide a solid foundation upon which query languages can be built.
Nowadays, as CEP is supposed to be one of the main topics in the IT industry, many vendors offer CEP solutions. Many of the CEP tools on the market allow the creation of real-time, event-driven applications. These applications might consume data from multiple external sources, but they can also consume data from traditional database sources.
Most of these products include a graphical event flow language and support SQL. Key commercial vendors in this space are IBM with IBM Operational Decision Manager, Informatica with RulePoint, Oracle with its Complex Event Processing Solution, Microsoft with StreamInsight, SAS with the DataFlux Event Stream Processing Engine, and TIBCO with its CEP offering. Numerous start-ups are emerging in this market.
APACHE KAFKA
A Kafka cluster can accept tens of millions of writes per second and persist them for days, months, or indefinitely. Each write is replicated over multiple nodes for fault tolerance, and a Kafka cluster can support thousands of concurrent readers and writers. This journal acts as the subscription mechanism for other systems to feed off of. The Kafka cluster can expand dynamically to handle greater load or retain more data.
Kafka is a distributed, partitioned, replicated commit log service5. In other words, it is publish-subscribe messaging implemented as a distributed commit log, suitable not only for online but also for offline message consumption. In general, Kafka is made of three levels:
● Producers – the processes that publish messages to a Kafka topic.
● Broker – the Kafka cluster, consisting of one or more servers.
● Consumers – the processes that subscribe to topics and process the feed of published messages.
Kafka maintains feeds of messages in categories called topics. For each topic, Kafka maintains at least one partition. A partition contains messages in an unchangeable sequence, and each message is continually appended to a commit log. Each message is identified by a unique sequential ID called the offset. All published messages are retained in the Kafka cluster for a configured period of time, after which those messages are discarded to free up space.
Partitions contain messages that are replicated over multiple servers, which in turn form the Kafka broker. This is done for fault tolerance. Every partition has one server acting as the leader and the rest as followers. The leader handles all read/write requests for the partition, while the followers passively replicate the leader. If the leader server fails, one of the follower servers becomes the leader, making the broker more resilient against host failures.
5 https://kafka.apache.org/documentation.html#messages
On the consumer side, each message can be broadcast to all consumers, or each message can be delivered to just one of them.
QUICK START
• Download the Code & Un-Tar it
Code Source:
https://www.apache.org/dyn/closer.cgi?path=/kafka/0.9.0.0/kafka_2.11-0.9.0.0.tgz
Un-Tar:
> tar -xzf kafka_2.11-0.9.0.0.tgz
> cd kafka_2.11-0.9.0.0
• Start the Server
Kafka uses ZooKeeper, so you need to first start a ZooKeeper server if you don't already have one. You can use the convenience script packaged with Kafka to get a quick-and-dirty single-node ZooKeeper instance.
> bin/zookeeper-server-start.sh config/zookeeper.properties
Now start the Kafka server:
> bin/kafka-server-start.sh config/server.properties
• Create a Topic
Create a topic named "test" with a single partition and only one replica:
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
We can now see that topic if we run the list topic command:
> bin/kafka-topics.sh --list --zookeeper localhost:2181
test
Alternatively, instead of manually creating topics you can also configure your brokers to auto-create topics when a non-existent topic is published to.
• Send Messages
Kafka comes with a command line client that will take input from a file or from standard input and send it out as messages to the Kafka cluster. By default each line will be sent as a separate message.
Run the producer and then type a few messages into the console to send to the server.
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Hello Kafka
This is an event that is streamed…
• Consume Messages
Kafka also has a command line consumer that will dump out messages to standard output.
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
Hello Kafka
This is an event that is streamed…
If you have each of the above commands running in a different terminal then you should now be able to type messages into the producer terminal and see them appear in the consumer terminal.
All of the command line tools have additional options; running a command with no arguments will display usage information documenting them in more detail.
• Setting Up a Multi-Broker Cluster
So far we have been running against a single broker, but that's no fun. For Kafka, a single broker is just a cluster of size one, so nothing much changes other than starting a few more broker instances. But just to get a feel for it, let's expand our cluster to three nodes (still all on our local machine).
First we make a config file for each of the brokers:
> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties
Now edit these new files and set the following properties:
config/server-1.properties:
broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1
config/server-2.properties:
broker.id=2
port=9094
log.dir=/tmp/kafka-logs-2
The broker.id property is the unique and permanent name of each node in the cluster. We have to override the port and log directory only because we are running these all on the same machine, and we want to keep the brokers from all trying to register on the same port or overwrite each other's data.
We already have ZooKeeper and our single node started, so we just need to start the two new nodes:
> bin/kafka-server-start.sh config/server-1.properties &
> bin/kafka-server-start.sh config/server-2.properties &
Now create a new topic with a replication factor of three:
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic
Okay, but now that we have a cluster, how can we know which broker is doing what? To see that, run the "describe topics" command:
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic: my-replicated-topic  PartitionCount: 1  ReplicationFactor: 3  Configs:
Topic: my-replicated-topic  Partition: 0  Leader: 1  Replicas: 1,2,0  Isr: 1,2,0
Here is an explanation of the output. The first line gives a summary of all the partitions; each additional line gives information about one partition. Since we have only one partition for this topic, there is only one line.
• "leader" is the node responsible for all reads and writes for the given partition. Each node will be the leader for a randomly selected portion of the partitions.
• "replicas" is the list of nodes that replicate the log for this partition regardless of whether they are the leader or even if they are currently alive.
• "isr" is the set of "in-sync" replicas. This is the subset of the replicas list that is currently alive and caught-up to the leader.
Note that in this example node 1 is the leader for the only partition of the topic.
We can run the same command on the original topic we created to see where it is:
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
Topic: test  PartitionCount: 1  ReplicationFactor: 1  Configs:
Topic: test  Partition: 0  Leader: 0  Replicas: 0  Isr: 0
So there is no surprise there — the original topic has no replicas and is on server 0, the only server in our cluster when we created it.
Let's publish a few messages to our new topic:
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-replicated-topic
Who is listening to this
May the force be with you
Now let's consume these messages:
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic my-replicated-topic
Who is listening to this
May the force be with you
Now let's test out fault-tolerance. Broker 1 was acting as the leader, so let's kill it:
> ps | grep server-1.properties
7564 ttys002 0:15.91
> kill -9 7564
Leadership has switched to one of the followers, and node 1 is no longer in the in-sync replica set:
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic: my-replicated-topic  PartitionCount: 1  ReplicationFactor: 3  Configs:
Topic: my-replicated-topic  Partition: 0  Leader: 2  Replicas: 1,2,0  Isr: 2,0
But the messages are still available for consumption even though the leader that took the writes originally is down:
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic my-replicated-topic
Who is listening to this
May the force be with you
HOW TO PUT DATA INTO KAFKA
• Producer API in the org.apache.kafka.clients package
This client is production tested and generally both fast and fully featured. You can use this client by adding a dependency on the client jar using the following example Maven coordinates (you can change the version numbers with new releases). A minimal usage sketch is given after this list.
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.9.0.0</version>
</dependency>
• Kafka Connect in the org.apache.kafka.connect package
Kafka Connect is a tool included with Kafka that imports and exports data to Kafka. It is an extensible tool that runs connectors, which implement the custom logic for interacting with an external system. Kafka Connect allows developers to write custom producer (source) procedures in Java and run them with Kafka Connect. How this is done is explained in more depth in the example application.
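As a minimal sketch of the Producer API mentioned above (the broker address and the "test" topic come from the quick start; the key and value are arbitrary):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address from the quick start; adjust for a real cluster.
        props.put("bootstrap.servers", "localhost:9092");
        // Serializers turn keys and values into bytes on the wire.
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // Publish one message to the "test" topic created in the quick start.
        producer.send(new ProducerRecord<>("test", "key-1", "Hello Kafka"));
        producer.close(); // flushes any buffered messages
    }
}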
HOW TO GET DATA FROM KAFKA
• Consumer API in org.apache.kafka.clients
This client enables developers to write personalized consumers in Java that connect to the Kafka cluster, listen to a topic and retrieve messages. However, both the Producer and Consumer APIs seem more complicated and confusing than Kafka Connect, which is the better way of importing/exporting data. A minimal usage sketch is given after this list.
• Kafka Clients
There are various Kafka clients for consuming data, written in different programming languages as open-source projects by others. The Kafka project lists them on its clients web page (https://cwiki.apache.org/confluence/display/KAFKA/Clients). Some of them include:
Python
.NET
Node.js
HTTP REST, etc.
Among them, the Node.js client seems interesting, since Node.js is designed to be a data-driven web framework and it would be nice to integrate Kafka as a communication middleware between Node.js applications.
• Kafka Connect
Kafka Connect is a tool included with Kafka that imports and exports data to Kafka. It is an extensible tool that runs connectors, which implement the custom logic for interacting with an external system. Kafka Connect allows developers to write custom consumer (sink) procedures in Java and run them with Kafka Connect. How this is done is explained in more depth in the example application.
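Returning to the Consumer API option above, here is a minimal sketch (again assuming the quick-start broker and "test" topic; the group id is an arbitrary name of our choosing):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Consumers sharing a group.id split the partitions of a topic between
        // them; separate groups each receive every message (broadcast).
        props.put("group.id", "demo-group");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("test"));
        while (true) {
            // Fetch whatever has been published since the last poll.
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n",
                        record.offset(), record.value());
            }
        }
    }
}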
DESIGN
• Persistency
Kafka stores and caches its messages mostly in the filesystem. Despite the common-sense view that disk operations are very slow compared to memory operations – which is actually not false – it still depends on how those operations are done and how the disk structure is designed. Random accesses are indeed very slow. On the other hand, the throughput of hard drives has been diverging from the latency of a disk seek for the last decade.
As a result, the performance of linear writes on a JBOD configuration with a six-drive 7200rpm SATA RAID-5 array is about 600 MB/sec, but the performance of random writes is only about 100 KB/sec — a difference of over 6000X. These linear reads and writes are the most predictable of all usage patterns and are heavily optimized by the operating system. A modern operating system provides read-ahead and write-behind techniques that prefetch data in large block multiples and group smaller logical writes into large physical writes.
This suggests a very simple design: rather than maintaining as much as possible in memory and flushing it all out to the filesystem in a panic when running out of space, Kafka inverts that. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk. In effect this just means that it is transferred into the kernel's page cache.
Most messaging systems use a BTree or some other general-purpose random-access data structure as their persistent data structure. Kafka instead builds its log on simple file appends and sequential reads, which has the advantage that all operations are O(1) and reads do not block writes or each other.
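As a toy illustration of this design choice – not Kafka's actual code – the sketch below shows why an append-only log is cheap: every write is a sequential append to the end of a file, and a record is later addressed by the byte offset the append returned:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Toy append-only log: every write is a sequential append, and a read is a
// single positioned read at a known byte offset; readers never block writers.
public class ToyLog {
    private final FileChannel channel;

    public ToyLog(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    // Appends a record and returns its byte offset, which acts as its "id".
    public long append(byte[] record) throws IOException {
        long offset = channel.size();
        channel.write(ByteBuffer.wrap(record), offset);
        return offset;
    }

    // Reads length bytes starting at a previously returned offset.
    public byte[] read(long offset, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        channel.read(buf, offset);
        return buf.array();
    }

    public static void main(String[] args) throws IOException {
        ToyLog log = new ToyLog(Paths.get("/tmp/toy-log"));
        long offset = log.append("hello".getBytes());
        System.out.println(new String(log.read(offset, 5))); // prints "hello"
    }
}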
• Efficiency
Since one of the primary use cases of Kafka is to handle web activity data such as clickstreams – which come in huge volumes – each page view may produce dozens of writes. By being very fast, Kafka helps ensure that it does not tip over under load before the application does. This is a pretty important property, since one of Kafka's aims is to become a centralized hub serving dozens or hundreds of applications on a centralized cluster, where changes in usage patterns are close to a daily occurrence.
In general, other than disk access cost, there are two common causes of inefficiency in this type of system: too many small I/O operations and excessive byte copying.
To avoid lots of small I/O operations, Kafka groups messages around a "message set" abstraction, allowing network requests to group messages together and amortize the overhead of the network roundtrip. Even this little optimization results in an undeniable speedup of the whole system.
Moreover, most of the data is served from the page cache of the OS, which is very fast and eliminates the cost of first copying data to memory. Secondly, the transfer of data from the page cache to sockets, which are then published to the outside world, is done by the sendfile system call, a highly optimized code path for transferring data to sockets.
Taking all of this into consideration, Kafka does a lot of improvements under the hood, which is what makes it stand out among the messaging systems available on the market.
• Log Compaction
Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition. It addresses use cases and scenarios such as restoring state after application crashes or system failure, or reloading caches after application restarts during operational maintenance.
Log compaction gives us a more granular retention mechanism so that we are guaranteed to retain at least the last update for each primary key. By doing this we guarantee that the log contains a full snapshot of the final value for every key, not just keys that changed recently. This means downstream consumers can restore their own state off this topic without us having to retain a complete log of all changes.
Log compaction is a mechanism to give finer-grained per-record retention, rather than the coarser-grained time-based retention. The idea is to selectively remove records where we have a more recent update with the same primary key. This way the log is guaranteed to have at least the last state for each key.
This retention policy can be set per-topic, so a single cluster can have some topics where retention is enforced by size or time and other topics where retention is enforced by compaction.
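Compaction is only meaningful for keyed messages. As a small sketch (the topic name "customer-state" is hypothetical, and we assume it was created with the per-topic setting cleanup.policy=compact):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CompactedTopicProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // Three updates for the same key. Once the log cleaner has run, only
        // the most recent value ("suspended") is guaranteed to be retained.
        producer.send(new ProducerRecord<>("customer-state", "customer-42", "active"));
        producer.send(new ProducerRecord<>("customer-state", "customer-42", "overdue"));
        producer.send(new ProducerRecord<>("customer-state", "customer-42", "suspended"));
        producer.close();
    }
}

After compaction, a consumer reading this topic from the beginning is still guaranteed to see "suspended" as the value for "customer-42", even if the earlier records have been removed.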
APPLICATION – Converting a traditional RDBMS to a Stream of Changes Using Kafka
In order to demonstrate the use of Apache Kafka, we have taken into account an OLTP system, specifically the one given as an assignment in the Data Warehouses course – ING Bank. This database contains Customer and Account information that is versioned when loaded into the Data Warehouse. Traditionally, a snapshot of the database is captured periodically and given to the ETL procedure. Customer and Account changes are captured through a 'lastupdated' field which represents when the last change happened to a specific entry. What is missing in this scenario is that only the latest update during the snapshot period is taken into consideration. Moreover, for every snapshot, all of the database's content is copied and the changed entries are then extracted from it.
Instead, we have developed a Kafka Connect SourceTask that periodically – and the period can be very small – checks for changed entries and puts them into Apache Kafka as messages for consumption. By doing so, we guarantee that only the changed entries are processed while loading into the target database, with a more granular change history. Running this Kafka Connect SourceTask with specific configurations eases the whole ETL process.
MORE ON KAFKA CONNECT
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency. An export job can deliver data from Kafka topics into secondary storage and query systems or into batch systems for offline analysis. Kafka Connect features include:
• A common framework for Kafka connectors – Kafka Connect standardizes integration of other data systems with Kafka, simplifying connector development, deployment, and management
• Distributed and standalone modes – scale up to a large, centrally managed service supporting an entire organization or scale down to development, testing, and small production deployments
• REST interface – submit and manage connectors to your Kafka Connect cluster via an easy to use REST API
• Automatic offset management – with just a little information from connectors, Kafka Connect can manage the offset commit process automatically so connector developers do not need to worry about this error-prone part of connector development
• Distributed and scalable by default – Kafka Connect builds on the existing group management protocol, so more workers can be added to scale up a cluster
• Streaming/batch integration – leveraging Kafka's existing capabilities, Kafka Connect is an ideal solution for bridging streaming and batch data systems
Running Kafka Connect
Kafka Connect currently supports two modes of execution: standalone (single process) and distributed. In standalone mode all work is performed in a single process. This configuration is simpler to set up and get started with, and may be useful in situations where only one worker makes sense (e.g. collecting log files), but it does not benefit from some of the features of Kafka Connect such as fault tolerance. You can start a standalone process with the following command:
> bin/connect-standalone.sh config/connect-standalone.properties connector1.properties [connector2.properties ...]
The first parameter is the configuration for the worker. This includes settings such as the Kafka connection parameters, serialization format, and how frequently to commit offsets. The provided example should work well with a local cluster running with the default configuration provided by config/server.properties. It will require tweaking to use with a different configuration or production deployment. The remaining parameters are connector configuration files. You may include as many as you want, but all will execute within the same process (on different threads).
Connectors and Tasks
To copy data between Kafka and another system, users create a Connector for the system they want to pull data from or push data to. Connectors come in two flavors: SourceConnectors import data from another system (e.g. a JDBC source connector would import a relational database into Kafka) and SinkConnectors export data (e.g. an HDFS sink connector would export the contents of a Kafka topic to an HDFS file). Connectors do not perform any data copying themselves: their configuration describes the data to be copied, and the Connector is responsible for breaking that job into a set of Tasks that can be distributed to workers.
These Tasks also come in two corresponding flavors: SourceTask and SinkTask. With an assignment in hand, each Task must copy its subset of the data to or from Kafka. In Kafka Connect, it should always be possible to frame these assignments as a set of input and output streams consisting of records with consistent schemas. Sometimes this mapping is obvious: each file in a set of log files can be considered a stream, with each parsed line forming a record using the same schema and offsets stored as byte offsets in the file. In other cases it may require more effort to map to this model: a JDBC connector can map each table to a stream, but the offset is less clear. One possible mapping uses a timestamp column to generate queries incrementally returning new data, and the last queried timestamp can be used as the offset.
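The full source of our classes ships with this report; below is only a condensed, hypothetical sketch of a timestamp-based SourceTask in the spirit of the mapping just described (table, column and topic names are illustrative, and the JDBC querying itself is elided):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Hypothetical timestamp-based source task: each poll() looks for rows whose
// 'lastupdated' column is newer than the last timestamp we emitted as an offset.
public class TimestampSourceTask extends SourceTask {
    private long lastSeenTimestamp; // the "offset" of this logical stream

    @Override
    public void start(Map<String, String> props) {
        // A real task would restore this from the offsets Kafka Connect stores.
        lastSeenTimestamp = 0L;
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        Thread.sleep(1000); // the check interval
        List<SourceRecord> records = new ArrayList<>();
        // A real implementation would bind lastSeenTimestamp into a query like
        //   SELECT * FROM customer WHERE lastupdated > ? ORDER BY lastupdated
        // and build one record per changed row; here we emit a dummy change.
        long now = System.currentTimeMillis();
        records.add(new SourceRecord(
                Collections.singletonMap("table", "customer"), // source partition
                Collections.singletonMap("lastupdated", now),  // source offset
                "customer-pipe",                               // target topic
                Schema.STRING_SCHEMA,
                "customer changed at " + now));
        lastSeenTimestamp = now;
        return records;
    }

    @Override
    public void stop() {}

    @Override
    public String version() {
        return "0.1";
    }
}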
Application Development & Running
While developing the application, we turned only the Customer relation into a stream, for testing purposes. For this task, two Java classes have been implemented, namely 'SourceJDBCConnectorCustomer.java' and 'SourceJDBCTaskCustomer.java', which implement the 'SourceConnector' and 'SourceTask' interfaces of the Kafka Connect package. As per the nature of the SourceConnector interface, the 'SourceJDBCConnectorCustomer.java' class holds the corresponding task configurations and deals with task partitioning for multiple threads.
In our application, only one SourceTask thread is run. In 'SourceJDBCTaskCustomer.java', the poll() function continuously checks for customer changes by inspecting the 'lastupdated' column within a check interval. For every run of poll(), the check interval is advanced as desired. Then, all the changes in the customer relation are put into the Kafka topic 'customer-pipe'. One thing to note here is that, while putting data into Kafka through a custom Kafka Connect SourceConnector and SourceTask, there are only a few primitive schema options such as Boolean, String, and Integer. In order to ease consumption of the messages – database events, customer changes in this case – a custom Schema corresponding to the Customer relation schema has been generated. Currently, every message in this topic is of this Schema type. Below is the screenshot of a Kafka consumer client listening to the topic 'customer-pipe'.
In order to run the custom Kafka Connect Source Connector and Source Task, the project should be built and its '.jar' file should be put into Kafka's libs folder, with the dependency folders included. Moreover, in order to run the Connector, a configuration file is placed into Kafka's 'config' folder and specified while running Kafka Connect through the terminal.
The configuration file and its content are as follows:
• name=etl-source-branch --- Name of the configuration class
• connector.class=kafkaconnectetl.SourceJDBCConnectorCustomer --- Name of the SourceConnector class
• tasks.max=1 --- Maximum number of task threads that should be run in parallel
• driver=org.postgresql.Driver --- Name of the driver
• db=jdbc:postgresql://localhost/ING_OLTP --- Database address
• dbuser=postgres --- Database user name
• dbpassword=postgrespassword --- Database password
• topic=customer-pipe --- Which topic to publish to
Among those configurations, name, connector.class, tasks.max and topic are necessary. The other configurations are for ease of readability of the implementation and are used internally by the SourceConnector and SourceTask classes, not by Kafka.
By doing this, we have turned Customer table changes into a stream in a Kafka topic. These messages can easily be used by any custom Kafka Connect classes that implement Sink Connector and Sink Task. Moreover, the changes can be put into different targets such as HDFS, an OLAP system, key-value storage, etc. Furthermore, since the Schema used by the SourceTask class can be modified to hold some extra values, it is possible to precalculate some ETL procedures before putting the messages into Kafka, or any scenario can be implemented internally. The source code for 'SourceJDBCConnectorCustomer.java' and 'SourceJDBCTaskCustomer.java' is included with the report.
CONCLUSION
Stream processing is required when data has to be processed fast and/or continuously, i.e. reactions have to be computed and initiated in real time. This requirement is appearing in more and more verticals. Many different frameworks and products are available on the market already; however, the number of mature solutions with good tools and commercial support is small today. Apache Storm is a good, open source framework; however, custom coding is required due to a lack of development tools, and there is no commercial support right now.
Products such as IBM InfoSphere Streams or TIBCO StreamBase offer complete products which close this gap. You definitely have to try out the different products, as the websites do not show you how they differ regarding ease of use, rapid development and debugging, and real-time streaming analytics and monitoring. Stream processing complements other technologies such as a DWH and Hadoop in a big data architecture – this is not an "either/or" question.
Stream processing has a great future and will become very important for most companies. Big Data and the Internet of Things are huge drivers of change.