BDX 2016 - Tal Sliwowicz @ Taboola

Taboola’s Road to Scale: The Data Perspective. Tal Sliwowicz

Upload: ido-shilon

Post on 14-Apr-2017

TRANSCRIPT

Page 1: BDX 2016 - Tal sliwowicz @ taboola

Taboola’s Road to Scale: The Data Perspective

Tal Sliwowicz

Page 2: BDX 2016 - Tal sliwowicz @ taboola

Copyright ©2016 The Nielsen Company. Confidential and Proprietary.

Who am I?

Tal Sliwowicz, Director, R&[email protected]

Page 3: BDX 2016 - Tal sliwowicz @ taboola

You’ve Seen Us Before!

Enabling people to discover information at that moment when they’re likely to engage

Page 4: BDX 2016 - Tal sliwowicz @ taboola

Entertainment | Lifestyle

Tech

Our Clients are All Around the Globe

Page 5: BDX 2016 - Tal sliwowicz @ taboola

Taboola in Numbers

750M monthly unique users
100K+ requests/sec
10B+ recommendations/day
5TB+ daily data
52% mobile traffic, 48% desktop traffic

US desktop users reached, 12/2015:

REACH    PROPERTY
95.5%    Google Ad Network
87.8%    Taboola
86.2%    Google Sites
61.5%    Facebook
60.3%    Yahoo Sites
56.6%    Outbrain

Page 6: BDX 2016 - Tal sliwowicz @ taboola

The Recommendation Engine

CONTENT RECOMMENDATION ENGINE

Context: Metadata
Location: Region-based Recommendations
User Behavior: Cookie Data
Collaborative Filtering: Bucketed Consumption Groups
Social: Facebook / Twitter API

Page 7: BDX 2016 - Tal sliwowicz @ taboola

Taboola’s Discovery Platform

Diagram labels: Traffic Acquisition, Business Dev., Sponsored Content, Editorial, Newsroom, Sales, Native Ads, Audience Dev., Product, Personalization, Data & Insights

Page 8: BDX 2016 - Tal sliwowicz @ taboola

Taboola 2007

•  Events and logs (raw data) written directly to DB
•  Recs are read from DB
•  Crashed when CNN launched

Diagram labels: Frontend, FE Server

Page 9: BDX 2016 - Tal sliwowicz @ taboola

Taboola 2007.5

•  Same as before, but without direct write to DB
•  Switching to bulk load
•  But: very basic reporting, not scalable

Diagram labels: Frontend, FE Server, Bulk Load

Page 10: BDX 2016 - Tal sliwowicz @ taboola

Taboola 2008

•  Introduced semi-real-time event parsing services: SessionParser and SessionAnalyzer
•  Divided analysis work by unit (session)
•  Files were pushed from RecServer(s) to Backend processing
•  Files are gzip textual INSERT statements
•  But: not real-time enough

Diagram: Frontend (FE Server) -> NFS -> Backend (SessionParser, SessionAnalyzer). Arrow labels: Write raw data, Read raw data, Write session files, Read session files, Write summarized data.
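
To make the "gzip textual INSERT statements" bullet concrete, here is a minimal Python sketch that writes a batch of events as one INSERT statement per line into a gzip file and reads it back. The table name `raw_events`, its columns, and the helper names are hypothetical; the slides do not show the real schema or file layout.

```python
import gzip
import os
import tempfile


def sql_literal(value):
    """Render a Python value as a simple SQL literal (illustration only)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    # Double any single quotes so the text stays a valid SQL string literal.
    return "'" + str(value).replace("'", "''") + "'"


def write_insert_file(path, table, columns, rows):
    """Write one textual INSERT statement per row into a gzip-compressed file."""
    column_list = ", ".join(columns)
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for row in rows:
            values = ", ".join(sql_literal(v) for v in row)
            out.write(f"INSERT INTO {table} ({column_list}) VALUES ({values});\n")


def read_insert_file(path):
    """Read the statements back, as a backend loader might before executing them."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        return [line.rstrip("\n") for line in src]


# Hypothetical events; a real frontend would emit many more fields.
rows = [(1, "u-17", "click"), (2, "u-42", "it's a view")]
path = os.path.join(tempfile.mkdtemp(), "events.sql.gz")
write_insert_file(path, "raw_events", ["id", "user_id", "event"], rows)
statements = read_insert_file(path)
```

Shipping statements as text keeps the frontend decoupled from the database, but every row is still parsed and executed individually on load, which fits the slide's remark that this was not real-time enough.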

Page 11: BDX 2016 - Tal sliwowicz @ taboola

Taboola 2010

•  Made a leap towards real-time stream processing
•  Unified SessionParser and SessionAnalyzer into an in-memory service (without going through disk)
•  Made dramatic optimization to memory allocation and data models
•  Failure-safe architecture: can endure data delays and front-end servers’ malfunction
•  No direct DB access (key for performance): only bulk loading is used, for loading hourly data

Diagram: Frontend (FE Server) -> NFS -> Backend (SessionParser + Analyzer). Arrow labels: Write raw data, Read raw data, Write hourly data (bulk loading).
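
One way to picture "in-memory processing, then bulk loading hourly data" is the small Python sketch below: events are counted in memory per hour and emitted as rows ready for a bulk load. The event fields (`ts`, `publisher`, `type`) and the row layout are assumptions made for illustration, not the actual Taboola data model.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_of(ts):
    """Truncate a Unix timestamp (seconds) to its UTC hour, as a string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:00")


def aggregate_hourly(events):
    """Count events in memory per (hour, publisher, type); return bulk-load rows."""
    counts = Counter()
    for ev in events:
        counts[(hour_of(ev["ts"]), ev["publisher"], ev["type"])] += 1
    return sorted((hour, pub, typ, n) for (hour, pub, typ), n in counts.items())


# 1262304000 is 2010-01-01 00:00:00 UTC; publisher and type values are made up.
events = [
    {"ts": 1262304010, "publisher": "pub-a", "type": "view"},
    {"ts": 1262304020, "publisher": "pub-a", "type": "view"},
    {"ts": 1262307605, "publisher": "pub-a", "type": "view"},
]
hourly_rows = aggregate_hourly(events)
```

Only the aggregated rows would reach the database, once per hour, which is the sense in which the serving path has no direct DB access.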

Page 12: BDX 2016 - Tal sliwowicz @ taboola

Taboola 2011-2013

•  Multi DC
•  Roughly same architecture
•  Increasing backend growth by scaling in (monster machines)
•  Introduced real-time analyzers
•  Introduced sharding
•  Moved to lsync based file sync
•  Introduced Top Reports capabilities

Diagram: Frontend (FE Server) -> Lsync -> Backend (SessionParser + Analyzer). Arrow labels: Write raw data, Read raw data, Write hourly data (bulk loading).

Page 13: BDX 2016 - Tal sliwowicz @ taboola

Taboola 2014-

Page 14: BDX 2016 - Tal sliwowicz @ taboola

Our Data Requirements

•  Lots of incoming traffic (100K requests/sec)
•  Data (5+ TB/day):
   •  Personalized served recommendations: per user, per page view
   •  Events: what the user actually read and what he did
•  The data needs to be joined and processed in real time
   •  Campaigns Management
   •  Recommendations
   •  Billing
   •  Reports
   •  Etc.
•  The data needs to be available for offline research

Page 15: BDX 2016 - Tal sliwowicz @ taboola

Data Model

Users

Sessions

Views

Requests

Items

Events

Page 16: BDX 2016 - Tal sliwowicz @ taboola

Challenges

•  We care about sessions: chain of page views and events for a specific user
   •  Length can be hours or even days
•  We care about users: chain of sessions across sites
   •  Length can be days or even months
•  Stateless application: single user data is sent from multiple data centers and multiple servers
   •  No deterministic affinity to a server or DC
   •  Order isn’t guaranteed
   •  Must be robust and automatically deal with late arrivals
   •  “Exactly once” semantics
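
The ordering and late-arrival constraints above can be illustrated with a small Python sketch: batches from different data centers are merged into per-session state keyed by an event id, so the final session view is the same whichever batch arrives first and a repeated delivery is absorbed. The record fields (`user_id`, `session_id`, `event_id`, `ts`, `kind`) are hypothetical names chosen for the example.

```python
def merge_batch(sessions, batch):
    """Merge a batch into per-session state; a repeated event_id replaces itself."""
    for rec in batch:
        key = (rec["user_id"], rec["session_id"])
        sessions.setdefault(key, {})[rec["event_id"]] = rec
    return sessions


def ordered_view(sessions):
    """Order each session by event time at read time, not by arrival order."""
    return {
        key: sorted(events.values(), key=lambda r: (r["ts"], r["event_id"]))
        for key, events in sessions.items()
    }


def event(event_id, ts, kind):
    # All records belong to one made-up user and session.
    return {"user_id": "u1", "session_id": "s1", "event_id": event_id, "ts": ts, "kind": kind}


# Two data centers deliver overlapping, unordered batches for the same session.
batch_dc1 = [event("e2", 20, "click"), event("e1", 10, "view")]
batch_dc2 = [event("e3", 30, "view"), event("e2", 20, "click")]  # e2 arrives twice

dc1_first = ordered_view(merge_batch(merge_batch({}, batch_dc1), batch_dc2))
dc2_first = ordered_view(merge_batch(merge_batch({}, batch_dc2), batch_dc1))
```

Because merging is keyed rather than appended, a late batch simply fills in missing events, and sorting is deferred until the session is read.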

Page 17: BDX 2016 - Tal sliwowicz @ taboola

Challenges Cont.

•  Many streams of data that need to be joined (user, session, page view, widgets, recommendations, events, actions)
•  5+ TB of daily data
•  Research purposes require looking at full user activity across time

Page 18: BDX 2016 - Tal sliwowicz @ taboola

Data Flow

FE Servers -> Kafka -> FE Consumer (Spark) -> C* Sessions

Page 19: BDX 2016 - Tal sliwowicz @ taboola

Table Model in C*

•  Partition key: session start hour + user bucket (0-9,999)
•  Clustering key: publisher_id, user_id, session_id, view_id, data_type, data_hash
•  Data type: MULTI_REQUEST, USER_EVENT, ACTION_CONVERSION, …
•  Data: blobs of protobuf
•  Results:
   •  All the data of a single session is in one place, regardless of time of arrival
   •  Idempotent process: if the same message is received twice, it overwrites the previous arrival because it has the same hash id
   •  Sampling is built into the model
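
The key layout above can be sketched in Python with an in-memory stand-in for the C* table. The hash functions (MD5 for bucketing, SHA-1 for `data_hash`), the hour format, and the class and function names are assumptions made for illustration; the slides do not say how buckets or hashes are actually derived.

```python
import hashlib
from datetime import datetime, timezone

NUM_BUCKETS = 10_000  # from the slide: user bucket 0-9,999


def user_bucket(user_id):
    """Stable bucket for a user (MD5 is an assumption, not the documented method)."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def partition_key(session_start, user_id):
    """Session start hour + user bucket; expects a timezone-aware datetime."""
    hour = session_start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")
    return (hour, user_bucket(user_id))


def clustering_key(publisher_id, user_id, session_id, view_id, data_type, payload):
    """Clustering columns in the slide's order; data_hash is derived from the blob."""
    data_hash = hashlib.sha1(payload).hexdigest()
    return (publisher_id, user_id, session_id, view_id, data_type, data_hash)


class SessionTable:
    """In-memory stand-in for the table: partition -> {clustering key -> blob}."""

    def __init__(self):
        self.partitions = {}

    def upsert(self, session_start, publisher_id, user_id, session_id,
               view_id, data_type, payload):
        pk = partition_key(session_start, user_id)
        ck = clustering_key(publisher_id, user_id, session_id, view_id,
                            data_type, payload)
        # Same message -> same keys -> the earlier copy is simply overwritten.
        self.partitions.setdefault(pk, {})[ck] = payload

    def row_count(self):
        return sum(len(rows) for rows in self.partitions.values())


start = datetime(2016, 3, 1, 10, 5, tzinfo=timezone.utc)  # made-up session start
table = SessionTable()
table.upsert(start, 7, "u1", "s1", "v1", "USER_EVENT", b"blob-a")
table.upsert(start, 7, "u1", "s1", "v1", "USER_EVENT", b"blob-a")  # redelivery
table.upsert(start, 7, "u1", "s1", "v2", "MULTI_REQUEST", b"blob-b")  # late arrival
```

Since the partition is chosen by the session's start hour rather than by arrival time, the late message lands next to the rest of its session, and the redelivered message does not add a row.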

Page 20: BDX 2016 - Tal sliwowicz @ taboola

Data Flow Cont.

Diagram labels: C* Sessions, Traffic Processor (Spark), Manual runner, Next Gen. Reports, Next Gen. Counters (Spark), Zeppelin, BigQuery, Hadoop, Vertica

Page 21: BDX 2016 - Tal sliwowicz @ taboola

Before vs. After

•  Raw data: real-time full access to the raw data, not just aggregated data
•  Week of data (~35 TB): 2 hours to analyze and report
   •  10 physical nodes, 320 cores, 2.5 TB memory, SSDs
•  Analyzing a 1% sample of the users reduces this linearly (partition key)
•  Analyzing a single publisher which is 1% of the data reduces this almost linearly (clustering key)
•  Reporting: minutes for availability of full reporting vs. hours
•  Supporting our growth: Spark as a distributed computing engine is very strong, easy to scale and extend
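
As a back-of-the-envelope check on these figures, the Python sketch below derives the implied scan throughput and shows how a 1% user sample maps onto the 10,000 user buckets. Taking the first buckets as the sample and treating 1 TB as 1,000 GB are assumptions made here for illustration; the linear-scaling estimate just restates the slide's claim.

```python
NUM_BUCKETS = 10_000  # user buckets 0-9,999 in the partition key


def sample_buckets(fraction):
    """Bucket ids to scan for a user sample (taking the first N is an assumption)."""
    return range(round(NUM_BUCKETS * fraction))


# Figures quoted on the slide.
week_tb = 35.0
full_scan_hours = 2.0
nodes, cores = 10, 320

# Implied cluster-wide scan rate, using 1 TB = 1,000 GB.
throughput_gb_per_s = week_tb * 1000 / (full_scan_hours * 3600)
per_node_gb_per_s = throughput_gb_per_s / nodes
cores_per_node = cores // nodes

# If cost scales linearly with the sampled fraction, as the slide states:
sample_fraction = 0.01
sample_minutes = full_scan_hours * 60 * sample_fraction

print(f"{throughput_gb_per_s:.2f} GB/s cluster-wide, {per_node_gb_per_s:.2f} GB/s per node")
print(f"{cores_per_node} cores per node, {len(sample_buckets(sample_fraction))} buckets in a 1% sample")
print(f"about {sample_minutes:.1f} minutes for the 1% sample")
```

A week at roughly 35 TB is also consistent with the 5+ TB of daily data quoted earlier in the deck.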

Page 22: BDX 2016 - Tal sliwowicz @ taboola

Before vs. After - Cont.

•  Long term data access: Hadoop, Cassandra and BigQuery provide a solution we did not have before
•  Analytics engine: the move from MySQL to Vertica (as an MPP engine) allows us to support complex queries over very large data sets
•  Algorithmic Research and Modeling: we are now capable of in depth analysis on multiple dimensions across long time periods

Page 23: BDX 2016 - Tal sliwowicz @ taboola

Thank You!

[email protected]