BDX 2016 - Tal Sliwowicz @ Taboola
TRANSCRIPT
Taboola’s Road to Scale
The Data Perspective
Tal Sliwowicz
Copyright ©2016 The Nielsen Company. Confidential and Proprietary.
Tal Sliwowicz
Director, R&D
[email protected]
Who am I?
You’ve Seen Us Before!
Enabling people to discover information at that moment when they’re likely to engage
Entertainment | Lifestyle
Tech
Our Clients are All Around the Globe
750M monthly unique users
100K+ requests/sec
10B+ recommendations/day
5TB+ daily data
52% mobile traffic
48% desktop traffic
REACH   PROPERTY
95.5%   Google Ad Network
87.8%   Taboola
86.2%   Google Sites
61.5%   Facebook
60.3%   Yahoo Sites
56.6%   Outbrain
US desktop users reached, 12/2015
Taboola in Numbers
Context
Metadata Region-based
Location
Recommendations
User Behavior
Cookie Data
Collaborative Filtering
Bucketed Consumption Groups
CONTENT RECOMMENDATION ENGINE
Social
Facebook / Twitter API
The Recommendation Engine
Taboola’s Discovery Platform
Traffic Acquisition
Business Dev.
Sponsored Content
Editorial
Newsroom
Sales
Native Ads
Audience Dev.
Product
Personalization
Data & Insights
• Events and logs (raw data) written directly to DB
• Recs are read from DB
• Crashed when CNN launched
Taboola 2007
Frontend
FE Server
• Same as before, but without direct write to DB
• Switching to bulk load
• But: very basic reporting, not scalable
Taboola 2007.5
Frontend
Bulk Load
FE Server
• Introduced semi-real-time event parsing services: SessionParser and SessionAnalyzer
• Divided analysis work by unit (session)
• Files were pushed from Rec Server(s) to Backend processing
• Files are gzipped textual INSERT statements
• But: not real-time enough
Taboola 2008
Frontend
NFS
Backend
FE Server
SessionParser
SessionAnalyzer
Write summarized data
Write raw data
Read session files
Read raw data
Write session files
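The hand-off on this slide, where files of gzipped textual INSERT statements move from the recommendation servers to backend processing, might look roughly like the sketch below. It is an illustration only, not Taboola's code: SQLite stands in for the real database, and the `events` table, its columns, and all helper names are assumptions.

```python
import gzip
import os
import sqlite3
import tempfile


def sql_literal(value):
    """Render a Python value as a SQL literal (strings are quoted and escaped)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def write_insert_file(path, rows):
    """Producer side: dump rows as one textual INSERT per line, gzipped."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for row in rows:
            values = ", ".join(sql_literal(v) for v in row)
            f.write(f"INSERT INTO events (session_id, event, ts) VALUES ({values});\n")


def load_insert_file(conn, path):
    """Consumer side: replay the whole file inside a single transaction."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        script = f.read()
    conn.executescript("BEGIN;\n" + script + "COMMIT;")


def demo():
    """Write a small file, load it, and return the rows that reached the DB."""
    rows = [("s1", "view", 100), ("s1", "click", 105), ("s2", "view", 110)]
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (session_id TEXT, event TEXT, ts INTEGER)")
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "events.sql.gz")
        write_insert_file(path, rows)
        load_insert_file(conn, path)
    return conn.execute(
        "SELECT session_id, event, ts FROM events ORDER BY ts"
    ).fetchall()
```

Shipping statements as files decouples the serving path from the database, at the cost of latency: data is only visible once a file has been pushed and replayed, which fits the slide's "not real-time enough" remark.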
• Made a leap towards real-time stream processing
• Unified SessionParser and SessionAnalyzer into an in-memory service (without going through disk)
• Made dramatic optimizations to memory allocation and data models
• Failure-safe architecture - can endure data delays and front-end server malfunctions
• No direct DB access - key for performance; bulk loading is used only for loading hourly data
Taboola 2010
Frontend
NFS
Backend
FE Server
SessionParser + Analyzer
Write hourly data (bulk loading)
Write raw data
Read raw data
• Multi DC
• Roughly the same architecture
• Increasing backend growth by scaling in (monster machines)
• Introduced real-time analyzers
• Introduced sharding
• Moved to lsync-based file sync
• Introduced Top Reports capabilities
Taboola 2011-2013
Frontend
Lsync
Backend
FE Server
SessionParser + Analyzer
Write hourly data (bulk loading)
Write raw data
Read raw data
Taboola 2014-
• Lots of incoming traffic (100K requests/sec)
• Data (5+ TB/day):
  • Personalized served recommendations - per user, per page view
  • Events - what the user actually read and what they did
• The data needs to be joined and processed in real time
  • Campaigns Management
  • Recommendations
  • Billing
  • Reports
  • Etc.
• The data needs to be available for offline research
Our Data Requirements
Data Model
Users
Sessions
Views
Requests
Items
Events
• We care about sessions - chain of page views and events for a specific user
  • Length can be hours or even days
• We care about users - chain of sessions across sites
  • Length can be days or even months
• Stateless application - single user data is sent from multiple data centers and multiple servers
  • No deterministic affinity to a server or DC
  • Order isn’t guaranteed
  • Must be robust and automatically deal with late arrivals
  • “Exactly once” semantics
Challenges
• Many streams of data that need to be joined (user, session, page view, widgets, recommendations, events, actions)
• 5+ TB of daily data
• Research purposes require looking at full user activity across time
Challenges Cont.
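One way to picture the join problem on these two slides: records from several streams arrive interleaved, from any server, in any order, and still have to end up as one ordered view per session. The sketch below is a simplified in-memory illustration, not the production pipeline; the record fields (`stream`, `user_id`, `session_id`, `ts`, `what`) and function names are assumptions.

```python
from collections import defaultdict


def assemble_sessions(records):
    """Group records from any stream by (user_id, session_id).

    Arrival order is irrelevant: records are sorted by their own timestamp
    only when the session is read back, so late arrivals slot into place.
    """
    sessions = defaultdict(list)
    for rec in records:
        sessions[(rec["user_id"], rec["session_id"])].append(rec)
    return {
        key: sorted(recs, key=lambda r: (r["ts"], r["stream"]))
        for key, recs in sessions.items()
    }


def demo_records():
    """Three hypothetical streams, delivered interleaved and out of order."""
    return [
        {"stream": "event", "user_id": "u1", "session_id": "s1", "ts": 30, "what": "click"},
        {"stream": "pageview", "user_id": "u1", "session_id": "s1", "ts": 10, "what": "/home"},
        {"stream": "recommendation", "user_id": "u1", "session_id": "s1", "ts": 20, "what": "item-7"},
        {"stream": "pageview", "user_id": "u2", "session_id": "s9", "ts": 15, "what": "/news"},
    ]
```

Calling `assemble_sessions(demo_records())` and `assemble_sessions(list(reversed(demo_records())))` yields the same per-session ordering, which is the property a stateless, multi-DC front end needs from whatever performs the join.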
Data Flow
FE Servers
Kafka
FE Consumer (Spark)
C* Sessions
• Partition key - session start hour + user bucket (0-9,999)
• Clustering key - publisher_id, user_id, session_id, view_id, data_type, data_hash
• Data Type - MULTI_REQUEST, USER_EVENT, ACTION_CONVERSION, …
• Data - blobs of protobuf
• Results:
  • All the data of a single session is in one place, regardless of time of arrival
  • Idempotent process - if the same message is received twice, it overwrites the previous arrival because it has the same hash id
  • Sampling is built into the model
Table Model in C*
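The key layout above can be mimicked with plain dictionaries to see why duplicate deliveries are harmless and why sampling comes for free. This is a toy model, not Cassandra and not Taboola's schema code: how a user is mapped to a bucket (a SHA-256 hash modulo 10,000 here), the hash used for `data_hash`, and the class and method names are all assumptions.

```python
import hashlib
from collections import defaultdict

NUM_BUCKETS = 10_000  # user buckets 0-9,999, as on the slide


def user_bucket(user_id):
    """Assumed mapping of a user to a bucket: stable hash modulo bucket count."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


class SessionsTable:
    """Toy stand-in for the C* table: partition key -> {clustering key -> blob}."""

    def __init__(self):
        self.partitions = defaultdict(dict)

    def write(self, start_hour, publisher_id, user_id, session_id,
              view_id, data_type, blob):
        data_hash = hashlib.sha256(blob).hexdigest()
        partition_key = (start_hour, user_bucket(user_id))
        clustering_key = (publisher_id, user_id, session_id,
                          view_id, data_type, data_hash)
        # Same message -> same clustering key -> plain overwrite (idempotent).
        self.partitions[partition_key][clustering_key] = blob

    def row_count(self):
        return sum(len(rows) for rows in self.partitions.values())

    def sample(self, start_hour, buckets):
        """Read only the chosen user buckets, e.g. range(100) for ~1% of users."""
        rows = []
        for bucket in buckets:
            rows.extend(self.partitions.get((start_hour, bucket), {}).items())
        return rows
```

Writing the same blob twice leaves `row_count()` unchanged, while a different blob for the same view adds a row. Because the bucket is part of the partition key, a reader that wants a user sample simply touches fewer partitions.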
Traffic Processor (Spark)
Manual runner
Next Gen. Reports
Next Gen. Counters (Spark)
Zeppelin
BigQuery
Data Flow Cont.
C* Sessions
Hadoop
Vertica
• Raw data - real-time full access to the raw data, not just aggregated data
• Week of data (~35 TB) - 2 hours to analyze and report
  • 10 physical nodes, 320 cores, 2.5 TB memory, SSDs
  • Analyzing a 1% sample of the users reduces this linearly (partition key)
  • Analyzing a single publisher which is 1% of the data reduces this almost linearly (clustering key)
• Reporting - minutes for availability of full reporting vs. hours
• Supporting our growth - Spark as a distributed computing engine is very strong, easy to scale and extend
Before vs. After
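As a back-of-the-envelope reading of the numbers on this slide, the helpers below compute what "reduces this linearly" would mean for a 1% sample, plus the average per-core throughput implied by 35 TB in 2 hours on 320 cores. The function names are made up for illustration, and the model is idealized: it assumes perfectly linear scaling with no fixed job overhead, and decimal terabytes.

```python
def scaled_hours(full_run_hours, fraction_of_data):
    """Idealized run time when only a fraction of the data is scanned."""
    return full_run_hours * fraction_of_data


def per_core_throughput_mb_s(total_tb, hours, cores):
    """Average MB/s each core must sustain to scan total_tb in the given time."""
    total_bytes = total_tb * 1e12  # decimal TB, an assumption
    return total_bytes / (hours * 3600 * cores) / 1e6


# Figures taken from the slide: one week (~35 TB) in 2 hours on 320 cores.
sample_seconds = scaled_hours(2, 0.01) * 3600
core_mb_s = per_core_throughput_mb_s(35, 2, 320)
```

Under these assumptions a 1% user sample would take on the order of 72 seconds instead of 2 hours, and each core averages roughly 15 MB/s over the full-week run. Real jobs carry scheduling and startup costs, which is consistent with the slide saying "almost linearly" for the single-publisher case.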
• Long-term data access - Hadoop, Cassandra and BigQuery provide a solution we did not have before
• Analytics engine - the move from MySQL to Vertica (as an MPP engine) allows us to support complex queries over very large data sets
• Algorithmic research and modeling - we are now capable of in-depth analysis on multiple dimensions across long time periods
Before vs. After - Cont.
Thank You!