ADVANCED DATABASES
TERM PROJECT
STREAM DATA PLATFORMS – APACHE KAFKA
KANATOV, KARİM
ZEYBEK, ALİ BAHADIR
TABLE OF CONTENTS
• Abstract
• Introduction
• Stream Processing, Messaging, Data Warehouse
• Use Cases
• Complex Event Processing
• Apache Kafka
o Quick Start
o How to put data into Kafka
o How to get data from Kafka
o Design
• Application – Converting a traditional RDBMS to a Stream of Changes using Kafka
• Application – Development & Running
• Conclusion
ABSTRACT
Throughout this paper, one of the emerging concepts in the database world – Stream Data Platforms – and what they are useful for will be discussed. Moreover, Apache Kafka – publish/subscribe messaging rethought – will be demonstrated in order to better understand specific use cases of Stream Data Platforms.
INTRODUCTION
60 seconds. Let us think for a while about what might happen on the Internet within that short period of time. According to DOMO1, each minute of every day the following happens on the Internet:
● 347,222 tweets are sent!
● More than 300 hours of content are uploaded to YouTube!
● Instagram users like more than 1.7 million photos.
● Facebook users like 4.1 million posts!
● Amazon receives 4,310 unique visitors.
● Apple users download 51,000 apps.
All of this happens within just one minute.
According to Gartner, Inc. forecasts from 20142, around 5 billion connected things (the Internet of Things) will be in use in 2015. In other words, those things will generate more data, and that data will be streamed in real time. More interestingly, the majority of the data used on the Internet today is created by individual users via social media. It is a never-ending feed of information.
Globally, 3.2 billion people were using the Internet by the end of 2015, of which 2 billion are from developing countries3; that is about 45% of the worldwide population. It means there will be even more data in the coming years. That data will grow exponentially and will need to be processed in real time.
1 https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/
2 https://www.gartner.com/newsroom/id/2905717
3 https://www.itu.int/en/ITU-D/Statistics/Documents/facts/ICTFactsFigures2015.pdf
Now let us switch to a different perspective. In most traditional business applications, there are certain changes of events that need to be captured and acted upon. Even a simple shopping site has different events such as user login – add to the cart – payment – shipment – etc. The question here is: how many of those event changes can be captured?
Normally, what most applications using traditional RDBMSs do is change the state of the corresponding entry in the relation to the most up-to-date value. However, the underlying sequence of changes that builds up to the final state is usually overwritten.
A Stream Data Platform is basically good for two main concepts:
• Data Integration: The stream data platform captures streams of events or data changes and feeds these to other data systems such as relational databases, key-value stores, Hadoop, or the data warehouse.
• Stream Processing: It enables continuous, real-time processing and transformation of these streams and makes the results available system-wide.
The demand for stream processing is increasing rapidly these days. Often, processing huge volumes of data is not enough. Data has to be processed fast, so that a firm can react to changing business conditions in real time. This is required for trading, fraud detection, system monitoring, and many other examples.
Generally, gathering the data depends on the data sources. Events coming from databases, machine-generated logs, and even sensors need to be cleaned, schematized and forwarded to a central place.
Secondly, one of the main goals in the development of Kafka was collecting the streams in one central place. So, brokers collect streams of data, log and buffer them in a fault-tolerant manner, and distribute them among the various consumers that are interested in those streams. Finally, there is analysing the streams of data, which can be performed well using other Apache solutions like Samza or Storm.
This paper consists of a basic understanding of stream processing and the role of Kafka in it. We will not only discuss the theory, but also have a look at the practical use of an existing solution.
Further, we will discuss the use of Kafka in detail. We will look at the configuration, design choices and implementation of Kafka. Also, the use of Kafka in stream processing will be discussed in detail.
STREAM PROCESSING, MESSAGING, DATA WAREHOUSE
Data warehouses are well known for storing and analysing structured data. Running a simple query on top of terabytes of data is enough to get any historical data within seconds.
However, the ETL processes often take too long. As a business grows, stakeholders want to query up-to-date information, since information from the day before is not sufficient for making precise decisions. This is where stream processing comes in and feeds all new data into the data warehouse immediately.
Figure 1. Stream Data Platform architecture built around Kafka
The problem of most big companies is the use of many different data systems: relational OLTP databases, Hadoop, Teradata, a search system, monitoring systems, OLAP stores and so on. Each of these systems needs reliable feeds of data.
One of the main problems that Kafka solves is the transportation of data between those systems. Once the data is in Kafka, consumers can do whatever is needed. Reliability and low latency are the key advantages of using Kafka.
USE CASES
One of the first commercial uses of stream processing was found in the finance industry, as stock exchanges moved from floor-based trading to electronic trading. Today, we can confidently say it is used in almost every industry – anywhere streams of data are generated. The Internet of Things will increase the volume, variety and velocity of data, leading to a dramatic increase in the applications for stream processing technologies.
Kafka was developed by LinkedIn's engineering team, so that was the first place where the concept and the real abilities of the system were tested. As Kafka's creator, Jay Kreps, states, the existing monolithic, centralized database began facing its limits and there was a need to start the transition to a portfolio of specialized distributed systems4. So, Kafka was built to serve as a central repository of data streams. Most interestingly, engineers tend to underestimate the power of the log; most of the time we do not see logs as an innovation even though they have been around as long as the computer itself.
Kafka can serve as a message broker, a web activity tracker, or a log aggregation solution. Nowadays, stream processing solutions such as Apache Storm and Samza go together with Kafka as the messaging layer and integrate well with it.
Kafka lets us see data not as data itself but as a stream of events. For instance, retail has events such as logistics, sales, orders, refunds and so on. Web sites have streams of clicks, impressions, and searches. Big software systems have streams of requests, errors, machine metrics, and logs. So, the idea behind all of that is collecting, storing and using all those events for a better understanding of users' and customers' behaviour.
Most e-commerce websites do not only want to know the final result of a customer purchase; they would also like to see the logical path of user choices, which can be perceived as a stream of events. Those streams of events help a company to improve its service and price policy, and they are also good for marketing purposes.
4 https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Thus, much of what people refer to when they talk about "big data" is really the act of capturing these events that previously weren't recorded anywhere and putting them to use for analysis, optimization, and decision making. In some sense these events are the other half of the story the database tables don't tell: they are the story of what the business did.
COMPLEX EVENT PROCESSING
Kafka is well known as a system for high-throughput, reliable event streams. On the processing side, developers most of the time choose stream processing frameworks like Samza, Storm or Spark Streaming.
On the other hand, there is also some work on high-level query languages for stream processing, and Complex Event Processing (CEP) is especially worth mentioning. It originated in 1990s research on event-driven simulation, and most CEP products are commercial, expensive enterprise software. Esper is the only free and open source solution, but it is limited to running on a single machine.
With CEP, you write queries or rules that match certain patterns in the events. In other words, CEP is about detecting and selecting the interesting events (and only them) from multiple sources, finding their relationships and deducing new data from them and their relationships. The goal of complex event processing is to identify meaningful events (such as opportunities or threats) and respond to them as quickly as possible.
These rules are comparable to SQL queries (which describe what results you want to return from a database), except that the CEP engine continually searches the stream for sets of events that match the query and notifies you (generates a "complex event") whenever a match is found. This is useful for fraud detection or monitoring business processes, for example.
For use cases that can be easily described in terms of a CEP query language, such a high-level language is much more convenient than a low-level event processing API. On the other hand, a low-level API gives you more freedom, allowing you to do a wider range of things than a query language would let you do. Also, by focussing their efforts on scalability and fault tolerance, stream processing frameworks provide a solid foundation upon which query languages can be built.
Nowadays, as CEP is supposed to be one of the main topics in the IT industry, many vendors offer CEP solutions. Many of the CEP tools on the market allow the creation of real-time, event-driven applications. These applications might consume data from multiple external sources, but they can also consume data from traditional database sources.
Most of these products include a graphical event flow language and support SQL. Key commercial vendors in this space are IBM with IBM Operational Decision Manager, Informatica with RulePoint, Oracle with its Complex Event Processing Solution, Microsoft with StreamInsight, SAS with the DataFlux Event Stream Processing Engine, and TIBCO with its CEP offering. Numerous start-ups are emerging in this market.
APACHE KAFKA
A Kafka cluster can accept tens of millions of writes per second and persist them for days, months, or indefinitely. Each write is replicated over multiple nodes for fault tolerance, and a Kafka cluster can support thousands of concurrent readers and writers. This journal acts as the subscription mechanism for other systems to feed off of. The Kafka cluster can expand dynamically to handle greater load or retain more data.
Kafka is a distributed, partitioned, replicated commit log service5. In other words, it is publish-subscribe messaging implemented as a distributed commit log, suitable not only for online but also for offline message consumption. In general, Kafka is made of three levels:
● Producers – the processes that publish messages to a Kafka topic.
● Broker – the Kafka cluster, consisting of one or more servers.
● Consumers – the processes that subscribe to topics and process the feed of published messages.
Kafka maintains feeds of messages in categories called topics. For each topic, Kafka maintains at least one partition. A partition contains messages in an unchangeable sequence, and each message is continually appended to a commit log. Each message is identified by a unique sequential ID called the offset. All published messages are retained in the Kafka cluster for a configured period of time, after which those messages are discarded to free up space.
Partitions contain messages that are replicated over multiple servers, which in turn form the Kafka broker. This is done for fault tolerance. Every partition has one server acting as the leader and the rest as followers. The leader handles all read/write requests for the partition, while the followers passively replicate the leader. If the leader server fails, one of the follower servers becomes the leader, making the broker more resilient against host failures.
5 https://kafka.apache.org/documentation.html#messages
On the consumer side, each message can be broadcast to all consumers, or each message can be delivered to just one of them.
QUICK START
• Download the Code & Un-Tar it
Code Source:
https://www.apache.org/dyn/closer.cgi?path=/kafka/0.9.0.0/kafka_2.11-0.9.0.0.tgz
Un-Tar:
> tar -xzf kafka_2.11-0.9.0.0.tgz
> cd kafka_2.11-0.9.0.0
• Start the Server
Kafka uses ZooKeeper, so you need to first start a ZooKeeper server if you don't already have one. You can use the convenience script packaged with Kafka to get a quick-and-dirty single-node ZooKeeper instance.
> bin/zookeeper-server-start.sh config/zookeeper.properties
Now start the Kafka server:
> bin/kafka-server-start.sh config/server.properties
• Create a Topic
Create a topic named "test" with a single partition and only one replica:
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
We can now see that topic if we run the list topic command:
> bin/kafka-topics.sh --list --zookeeper localhost:2181
test
Alternatively, instead of manually creating topics you can also configure your brokers to auto-create topics when a non-existent topic is published to.
• Send Messages
Kafka comes with a command line client that will take input from a file or from standard input and send it out as messages to the Kafka cluster. By default each line will be sent as a separate message.
Run the producer and then type a few messages into the console to send to the server.
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Hello Kafka
This is an event that is streamed…
• Consume Messages
Kafka also has a command line consumer that will dump out messages to standard output.
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
Hello Kafka
This is an event that is streamed…
If you have each of the above commands running in a different terminal then you should now be able to type messages into the producer terminal and see them appear in the consumer terminal.
All of the command line tools have additional options; running a command with no arguments will display usage information documenting them in more detail.
• Setting Up a Multi-Broker Cluster
So far we have been running against a single broker, but that's no fun. For Kafka, a single broker is just a cluster of size one, so nothing much changes other than starting a few more broker instances. But just to get a feel for it, let's expand our cluster to three nodes (still all on our local machine).
First we make a config file for each of the brokers:
> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties
Now edit these new files and set the following properties:
config/server-1.properties:
broker.id=1
port=9093
log.dir=/tmp/kafka-logs-1
config/server-2.properties:
broker.id=2
port=9094
log.dir=/tmp/kafka-logs-2
The broker.id property is the unique and permanent name of each node in the cluster. We have to override the port and log directory only because we are running these all on the same machine, and we want to keep the brokers from all trying to register on the same port or overwrite each other's data.
We already have ZooKeeper and our single node started, so we just need to start the two new nodes:
> bin/kafka-server-start.sh config/server-1.properties &
> bin/kafka-server-start.sh config/server-2.properties &
Now create a new topic with a replication factor of three:
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic
Okay, but now that we have a cluster, how can we know which broker is doing what? To see that, run the "describe topics" command:
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic: my-replicated-topic  PartitionCount: 1  ReplicationFactor: 3  Configs:
Topic: my-replicated-topic  Partition: 0  Leader: 1  Replicas: 1,2,0  Isr: 1,2,0
Here is an explanation of the output. The first line gives a summary of all the partitions; each additional line gives information about one partition. Since we have only one partition for this topic, there is only one line.
• "leader" is the node responsible for all reads and writes for the given partition. Each node will be the leader for a randomly selected portion of the partitions.
• "replicas" is the list of nodes that replicate the log for this partition regardless of whether they are the leader or even if they are currently alive.
• "isr" is the set of "in-sync" replicas. This is the subset of the replicas list that is currently alive and caught-up to the leader.
Note that in this example node 1 is the leader for the only partition of the topic.
We can run the same command on the original topic we created to see where it is:
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
Topic: test  PartitionCount: 1  ReplicationFactor: 1  Configs:
Topic: test  Partition: 0  Leader: 0  Replicas: 0  Isr: 0
So there is no surprise there — the original topic has no replicas and is on server 0, the only server in our cluster when we created it.
Let's publish a few messages to our new topic:
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-replicated-topic
Who is listening to this
May the force be with you
Now let's consume these messages:
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic my-replicated-topic
Who is listening to this
May the force be with you
Now let's test out fault-tolerance. Broker 1 was acting as the leader, so let's kill it:
> ps | grep server-1.properties
7564 ttys002 0:15.91
> kill -9 7564
Leadership has switched to one of the followers, and node 1 is no longer in the in-sync replica set:
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic: my-replicated-topic  PartitionCount: 1  ReplicationFactor: 3  Configs:
Topic: my-replicated-topic  Partition: 0  Leader: 2  Replicas: 1,2,0  Isr: 2,0
But the messages are still available for consumption even though the leader that took the writes originally is down:
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic my-replicated-topic
Who is listening to this
May the force be with you
HOW TO PUT DATA INTO KAFKA
• Producer API in the org.apache.kafka.clients package
This client is production tested and generally both fast and fully featured. You can use this client by adding a dependency on the client jar using the following example Maven coordinates (you can change the version numbers with new releases). A minimal usage sketch is given after this list.
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.9.0.0</version>
</dependency>
• Kafka Connect in the org.apache.kafka.connect package
Kafka Connect is a tool included with Kafka that imports and exports data to Kafka. It is an extensible tool that runs connectors, which implement the custom logic for interacting with an external system. Kafka Connect allows developers to write custom producer (source) procedures in Java and run them with Kafka Connect. How this is done is explained in more depth in the example application.
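As a minimal sketch of the Producer API mentioned above (the broker address and the "test" topic come from the quick start; the key and value are arbitrary):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address from the quick start; adjust for a real cluster.
        props.put("bootstrap.servers", "localhost:9092");
        // Serializers turn keys and values into bytes on the wire.
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // Publish one message to the "test" topic created in the quick start.
        producer.send(new ProducerRecord<>("test", "key-1", "Hello Kafka"));
        producer.close(); // flushes any buffered messages
    }
}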
HOW TO GET DATA FROM KAFKA
• Consumer API in org.apache.kafka.clients
This client enables developers to write personalized consumers in Java that connect to the Kafka cluster, listen to a topic and retrieve messages. However, both the Producer and Consumer APIs seem more complicated and confusing than Kafka Connect, which is the better way of importing/exporting data. A minimal usage sketch is given after this list.
• Kafka Clients
There are various Kafka clients for consuming data, written in different programming languages as open-source projects by others. The Kafka project lists them on its clients web page (https://cwiki.apache.org/confluence/display/KAFKA/Clients). Some of them include:
Python
.NET
Node.js
HTTP REST, etc.
Among them, the Node.js client seems interesting, since Node.js is designed to be a data-driven web framework and it would be nice to integrate Kafka as a communication middleware between Node.js applications.
• Kafka Connect
Kafka Connect is a tool included with Kafka that imports and exports data to Kafka. It is an extensible tool that runs connectors, which implement the custom logic for interacting with an external system. Kafka Connect allows developers to write custom consumer (sink) procedures in Java and run them with Kafka Connect. How this is done is explained in more depth in the example application.
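Returning to the Consumer API option above, here is a minimal sketch (again assuming the quick-start broker and "test" topic; the group id is an arbitrary name of our choosing):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Consumers sharing a group.id split the partitions of a topic between
        // them; separate groups each receive every message (broadcast).
        props.put("group.id", "demo-group");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("test"));
        while (true) {
            // Fetch whatever has been published since the last poll.
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n",
                        record.offset(), record.value());
            }
        }
    }
}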
DESIGN
• Persistency
Kafka stores and caches its messages mostly in the filesystem. Despite the common-sense view that disk operations are very slow compared to memory operations – which is actually not false – it still depends on how those operations are done and how the disk structure is designed. Random accesses are indeed very slow. On the other hand, the throughput of hard drives has been diverging from the latency of a disk seek for the last decade.
As a result, the performance of linear writes on a JBOD configuration with a six-drive 7200rpm SATA RAID-5 array is about 600 MB/sec, but the performance of random writes is only about 100 KB/sec — a difference of over 6000X. These linear reads and writes are the most predictable of all usage patterns and are heavily optimized by the operating system. A modern operating system provides read-ahead and write-behind techniques that prefetch data in large block multiples and group smaller logical writes into large physical writes.
This suggests a very simple design: rather than maintaining as much as possible in memory and flushing it all out to the filesystem in a panic when running out of space, Kafka inverts that. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk. In effect this just means that it is transferred into the kernel's page cache.
Most messaging systems use a BTree or some other general-purpose random-access data structure as their persistent data structure. Kafka instead builds its log on simple file appends and sequential reads, which has the advantage that all operations are O(1) and reads do not block writes or each other.
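As a toy illustration of this design choice – not Kafka's actual code – the sketch below shows why an append-only log is cheap: every write is a sequential append to the end of a file, and a record is later addressed by the byte offset the append returned:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Toy append-only log: every write is a sequential append, and a read is a
// single positioned read at a known byte offset; readers never block writers.
public class ToyLog {
    private final FileChannel channel;

    public ToyLog(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    // Appends a record and returns its byte offset, which acts as its "id".
    public long append(byte[] record) throws IOException {
        long offset = channel.size();
        channel.write(ByteBuffer.wrap(record), offset);
        return offset;
    }

    // Reads length bytes starting at a previously returned offset.
    public byte[] read(long offset, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        channel.read(buf, offset);
        return buf.array();
    }

    public static void main(String[] args) throws IOException {
        ToyLog log = new ToyLog(Paths.get("/tmp/toy-log"));
        long offset = log.append("hello".getBytes());
        System.out.println(new String(log.read(offset, 5))); // prints "hello"
    }
}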
• Efficiency
Since one of the primary use cases of Kafka is to handle web activity data such as clickstreams – which come in huge volumes – each page view may produce dozens of writes. By being very fast, Kafka helps ensure that it does not tip over under load before the application does. This is a pretty important property, since one of Kafka's aims is to become a centralized hub serving dozens or hundreds of applications on a centralized cluster, where changes in usage patterns are close to a daily occurrence.
In general, other than disk access cost, there are two common causes of inefficiency in this type of system: too many small I/O operations and excessive byte copying.
To avoid lots of small I/O operations, Kafka groups messages around a "message set" abstraction, allowing network requests to group messages together and amortize the overhead of the network roundtrip. Even this little optimization results in an undeniable speedup of the whole system.
Moreover, most of the data is served from the page cache of the OS, which is very fast and eliminates the cost of first copying data to memory. Secondly, the transfer of data from the page cache to sockets, which are then published to the outside world, is done by the sendfile system call, a highly optimized code path for transferring data to sockets.
Taking all of this into consideration, Kafka does a lot of improvements under the hood, which is what makes it stand out among the messaging systems available on the market.
• Log Compaction
Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition. It addresses use cases and scenarios such as restoring state after application crashes or system failure, or reloading caches after application restarts during operational maintenance.
Log compaction gives us a more granular retention mechanism so that we are guaranteed to retain at least the last update for each primary key. By doing this we guarantee that the log contains a full snapshot of the final value for every key, not just keys that changed recently. This means downstream consumers can restore their own state off this topic without us having to retain a complete log of all changes.
Log compaction is a mechanism to give finer-grained per-record retention, rather than the coarser-grained time-based retention. The idea is to selectively remove records where we have a more recent update with the same primary key. This way the log is guaranteed to have at least the last state for each key.
This retention policy can be set per-topic, so a single cluster can have some topics where retention is enforced by size or time and other topics where retention is enforced by compaction.
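Compaction is only meaningful for keyed messages. As a small sketch (the topic name "customer-state" is hypothetical, and we assume it was created with the per-topic setting cleanup.policy=compact):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CompactedTopicProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // Three updates for the same key. Once the log cleaner has run, only
        // the most recent value ("suspended") is guaranteed to be retained.
        producer.send(new ProducerRecord<>("customer-state", "customer-42", "active"));
        producer.send(new ProducerRecord<>("customer-state", "customer-42", "overdue"));
        producer.send(new ProducerRecord<>("customer-state", "customer-42", "suspended"));
        producer.close();
    }
}

After compaction, a consumer reading this topic from the beginning is still guaranteed to see "suspended" as the value for "customer-42", even if the earlier records have been removed.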
APPLICATION – Converting a traditional RDBMS to a Stream of Changes Using Kafka
In order to demonstrate the use of Apache Kafka, we have taken into account an OLTP system, specifically the one given as an assignment in the Data Warehouses course – ING Bank. This database contains Customer and Account information that is versioned when loaded into the Data Warehouse. Traditionally, a snapshot of the database is captured periodically and given to the ETL procedure. Customer and Account changes are captured through a 'lastupdated' field which represents when the last change happened to a specific entry. What is missing in this scenario is that only the latest update during the snapshot period is taken into consideration. Moreover, for every snapshot, all of the database's content is copied and the changed entries are then extracted from it.
Instead, we have developed a Kafka Connect SourceTask that periodically – and the period can be very small – checks for changed entries and puts them into Apache Kafka as messages for consumption. By doing so, we guarantee that only the changed entries are processed while loading into the target database, with a more granular change history. Running this Kafka Connect SourceTask with specific configurations eases the whole ETL process.
MORE ON KAFKA CONNECT
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems. It makes it simple to quickly define connectors that move large collections of data into and out of Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency. An export job can deliver data from Kafka topics into secondary storage and query systems or into batch systems for offline analysis. Kafka Connect features include:
• A common framework for Kafka connectors – Kafka Connect standardizes integration of other data systems with Kafka, simplifying connector development, deployment, and management
• Distributed and standalone modes – scale up to a large, centrally managed service supporting an entire organization or scale down to development, testing, and small production deployments
• REST interface – submit and manage connectors to your Kafka Connect cluster via an easy to use REST API
• Automatic offset management – with just a little information from connectors, Kafka Connect can manage the offset commit process automatically so connector developers do not need to worry about this error-prone part of connector development
• Distributed and scalable by default – Kafka Connect builds on the existing group management protocol, so more workers can be added to scale up a cluster
• Streaming/batch integration – leveraging Kafka's existing capabilities, Kafka Connect is an ideal solution for bridging streaming and batch data systems
Running Kafka Connect
Kafka Connect currently supports two modes of execution: standalone (single process) and distributed. In standalone mode all work is performed in a single process. This configuration is simpler to set up and get started with, and may be useful in situations where only one worker makes sense (e.g. collecting log files), but it does not benefit from some of the features of Kafka Connect such as fault tolerance. You can start a standalone process with the following command:
> bin/connect-standalone.sh config/connect-standalone.properties connector1.properties [connector2.properties ...]
The first parameter is the configuration for the worker. This includes settings such as the Kafka connection parameters, serialization format, and how frequently to commit offsets. The provided example should work well with a local cluster running with the default configuration provided by config/server.properties. It will require tweaking to use with a different configuration or production deployment. The remaining parameters are connector configuration files. You may include as many as you want, but all will execute within the same process (on different threads).
Connectors and Tasks
To copy data between Kafka and another system, users create a Connector for the system they want to pull data from or push data to. Connectors come in two flavors: SourceConnectors import data from another system (e.g. a JDBC source connector would import a relational database into Kafka) and SinkConnectors export data (e.g. an HDFS sink connector would export the contents of a Kafka topic to an HDFS file). Connectors do not perform any data copying themselves: their configuration describes the data to be copied, and the Connector is responsible for breaking that job into a set of Tasks that can be distributed to workers.
These Tasks also come in two corresponding flavors: SourceTask and SinkTask. With an assignment in hand, each Task must copy its subset of the data to or from Kafka. In Kafka Connect, it should always be possible to frame these assignments as a set of input and output streams consisting of records with consistent schemas. Sometimes this mapping is obvious: each file in a set of log files can be considered a stream, with each parsed line forming a record using the same schema and offsets stored as byte offsets in the file. In other cases it may require more effort to map to this model: a JDBC connector can map each table to a stream, but the offset is less clear. One possible mapping uses a timestamp column to generate queries incrementally returning new data, and the last queried timestamp can be used as the offset.
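The full source of our classes ships with this report; below is only a condensed, hypothetical sketch of a timestamp-based SourceTask in the spirit of the mapping just described (table, column and topic names are illustrative, and the JDBC querying itself is elided):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Hypothetical timestamp-based source task: each poll() looks for rows whose
// 'lastupdated' column is newer than the last timestamp we emitted as an offset.
public class TimestampSourceTask extends SourceTask {
    private long lastSeenTimestamp; // the "offset" of this logical stream

    @Override
    public void start(Map<String, String> props) {
        // A real task would restore this from the offsets Kafka Connect stores.
        lastSeenTimestamp = 0L;
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        Thread.sleep(1000); // the check interval
        List<SourceRecord> records = new ArrayList<>();
        // A real implementation would bind lastSeenTimestamp into a query like
        //   SELECT * FROM customer WHERE lastupdated > ? ORDER BY lastupdated
        // and build one record per changed row; here we emit a dummy change.
        long now = System.currentTimeMillis();
        records.add(new SourceRecord(
                Collections.singletonMap("table", "customer"), // source partition
                Collections.singletonMap("lastupdated", now),  // source offset
                "customer-pipe",                               // target topic
                Schema.STRING_SCHEMA,
                "customer changed at " + now));
        lastSeenTimestamp = now;
        return records;
    }

    @Override
    public void stop() {}

    @Override
    public String version() {
        return "0.1";
    }
}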
Application Development & Running
While developing the application, we turned only the Customer relation into a stream, for testing purposes. For this task, two Java classes have been implemented, namely 'SourceJDBCConnectorCustomer.java' and 'SourceJDBCTaskCustomer.java', which implement the 'SourceConnector' and 'SourceTask' interfaces of the Kafka Connect package. As per the nature of the SourceConnector interface, the 'SourceJDBCConnectorCustomer.java' class holds the corresponding task configurations and deals with task partitioning for multiple threads.
In our application, only one SourceTask thread is run. In 'SourceJDBCTaskCustomer.java', the poll() function continuously checks for customer changes by inspecting the 'lastupdated' column within a check interval. For every run of poll(), the check interval is advanced as desired. Then, all the changes in the customer relation are put into the Kafka topic 'customer-pipe'. One thing to note here is that, while putting data into Kafka through a custom Kafka Connect SourceConnector and SourceTask, there are only a few primitive schema options such as Boolean, String, and Integer. In order to ease consumption of the messages – database events, customer changes in this case – a custom Schema corresponding to the Customer relation schema has been generated. Currently, every message in this topic is of this Schema type. Below is the screenshot of a Kafka consumer client listening to the topic 'customer-pipe'.
In order to run the custom Kafka Connect Source Connector and Source Task, the project should be built and its '.jar' file should be put into Kafka's libs folder, with the dependency folders included. Moreover, in order to run the Connector, a configuration file is placed into Kafka's 'config' folder and specified while running Kafka Connect through the terminal.
The configuration file and its content are as follows:
• name=etl-source-branch --- Name of the configuration class
• connector.class=kafkaconnectetl.SourceJDBCConnectorCustomer --- Name of the SourceConnector class
• tasks.max=1 --- Maximum number of task threads that should be run in parallel
• driver=org.postgresql.Driver --- Name of the driver
• db=jdbc:postgresql://localhost/ING_OLTP --- Database address
• dbuser=postgres --- Database user name
• dbpassword=postgrespassword --- Database password
• topic=customer-pipe --- Which topic to publish to
Among those configurations, name, connector.class, tasks.max and topic are necessary. The other configurations are for ease of readability of the implementation and are used internally by the SourceConnector and SourceTask classes, not by Kafka.
By doing this, we have turned Customer table changes into a stream in a Kafka topic. These messages can easily be used by any custom Kafka Connect classes that implement Sink Connector and Sink Task. Moreover, the changes can be put into different targets such as HDFS, an OLAP system, key-value storage, etc. Furthermore, since the Schema used by the SourceTask class can be modified to hold some extra values, it is possible to precalculate some ETL procedures before putting the messages into Kafka, or any scenario can be implemented internally. The source code for 'SourceJDBCConnectorCustomer.java' and 'SourceJDBCTaskCustomer.java' is included with the report.
CONCLUSION
Stream processing is required when data has to be processed fast and/or continuously, i.e. reactions have to be computed and initiated in real time. This requirement is appearing in more and more verticals. Many different frameworks and products are available on the market already; however, the number of mature solutions with good tools and commercial support is small today. Apache Storm is a good, open source framework; however, custom coding is required due to a lack of development tools, and there is no commercial support right now.
Products such as IBM InfoSphere Streams or TIBCO StreamBase offer complete products which close this gap. You definitely have to try out the different products, as the websites do not show you how they differ regarding ease of use, rapid development and debugging, and real-time streaming analytics and monitoring. Stream processing complements other technologies such as a DWH and Hadoop in a big data architecture – this is not an "either/or" question.
Stream processing has a great future and will become very important for most companies. Big Data and the Internet of Things are huge drivers of change.