A Real-Time Data Ingestion System
Or: How I Learned to Stop Worrying and Love Avro
Maciej Arciuch
Allegro.pl
● biggest online auction website in Poland
● sites in other countries
● “Polish eBay” (but better!)
Clickstream at Allegro.pl
● how do our users behave?
● ~400 M raw clickstream events daily
● collected at the front-end
● web and mobile devices
● valuable source of information
Legacy system
● HDFS, Flume and MapReduce
● Main issues:
  ○ batch processing - per hour or day
  ○ data formats
  ○ how to make data more accessible for others?
How to do it better? (1)
● stream processing: Spark Streaming and Kafka - data available “almost” instantly
● new applications:
  ○ security
  ○ recommendations & search
How to do it better? (2)
Use Avro
● mature software, good support in Hadoop ecosystem
● space-efficient
● schema: structure + doc placeholder
● the same format for stream and batch processing
● backward/forward compatibility control
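To make the "structure + doc placeholder" and compatibility points concrete, here is a minimal sketch of what such a schema could look like. The record and field names are hypothetical, not taken from the talk; the schema is written as a plain Python dict and serialized to the JSON form Avro schemas use.

```python
import json

# Hypothetical clickstream event schema; all names are illustrative only.
PAGE_VIEW_V1 = {
    "type": "record",
    "name": "PageView",
    "namespace": "example.clickstream",
    "doc": "A single page view collected at the front-end.",
    "fields": [
        {"name": "timestamp", "type": "long", "doc": "Event time, ms since epoch."},
        {"name": "url", "type": "string", "doc": "Requested URL."},
    ],
}

# Version 2 adds an optional field with a default, so a reader using v2
# can still decode records that were written with v1.
PAGE_VIEW_V2 = {
    **PAGE_VIEW_V1,
    "fields": PAGE_VIEW_V1["fields"] + [
        {"name": "device", "type": ["null", "string"], "default": None,
         "doc": "Device class, if known."},
    ],
}


def field_docs(schema):
    """Collect the per-field documentation strings carried by the schema."""
    return {field["name"]: field.get("doc", "") for field in schema["fields"]}


if __name__ == "__main__":
    print(json.dumps(PAGE_VIEW_V2, indent=2))
    print(field_docs(PAGE_VIEW_V2))
```

Because the documentation travels inside the schema, the same file can drive both serialization and generated docs.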
How to do it better? (3)
Create a central schema repository:
● single source of truth
● all the elements of the system refer to the latest version
● validate backward/forward compatibility on commit
● immutable schemas
● propagate info to Hive metastore, files, HTMLs
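A toy sketch of the "validate on commit" and "immutable schemas" ideas, assuming a deliberately simplified compatibility rule: a reader can decode a writer's data if every reader field is either present in the writer with the same type or has a default. Real Avro schema resolution is richer (type promotions, aliases, unions), and `SchemaRepository` and `can_read` are hypothetical names, not the repository used at Allegro.

```python
import copy


def can_read(reader, writer):
    """List the reasons why `reader` cannot decode data written with `writer`."""
    writer_fields = {field["name"]: field for field in writer["fields"]}
    problems = []
    for field in reader["fields"]:
        written = writer_fields.get(field["name"])
        if written is None:
            if "default" not in field:
                problems.append(f"'{field['name']}' is new and has no default")
        elif written["type"] != field["type"]:
            problems.append(f"'{field['name']}' changed type")
    return problems


class SchemaRepository:
    """In-memory, append-only store: committed versions are never modified."""

    def __init__(self):
        self._versions = {}  # subject -> list of schema dicts

    def latest(self, subject):
        versions = self._versions.get(subject, [])
        return copy.deepcopy(versions[-1]) if versions else None

    def commit(self, subject, schema):
        previous = self.latest(subject)
        if previous is not None:
            # Backward: the new schema reads old data. Forward: the reverse.
            problems = can_read(schema, previous) + can_read(previous, schema)
            if problems:
                raise ValueError("incompatible schema: " + "; ".join(problems))
        self._versions.setdefault(subject, []).append(copy.deepcopy(schema))
        return len(self._versions[subject])  # 1-based version number


if __name__ == "__main__":
    repo = SchemaRepository()
    v1 = {"type": "record", "name": "PageView",
          "fields": [{"name": "url", "type": "string"}]}
    v2 = {"type": "record", "name": "PageView",
          "fields": v1["fields"] + [
              {"name": "device", "type": ["null", "string"], "default": None}]}
    print(repo.commit("pageview", v1))
    print(repo.commit("pageview", v2))
```

Rejecting the commit, rather than the data, means incompatibilities surface before any producer or consumer picks up the new version.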
How to do it better? (4)
How to do it better? (5)
New system:
● two separate Kafka instances (buffer and destination)
● if your infrastructure is down – you still collect data
● collectors – only save HTTP requests, no logic
● logic in Spark Streaming
● dead letter queue – you can reprocess failed messages
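The buffer / destination / dead letter queue flow can be sketched in a few lines. This is only an illustration: plain Python lists stand in for Kafka topics, and `parse_request` is a hypothetical placeholder for the logic that runs in Spark Streaming in the real system.

```python
import json


def parse_request(raw):
    """Hypothetical processing step: turn a raw collected request into an event."""
    event = json.loads(raw)
    if "url" not in event:
        raise ValueError("missing url")
    return event


def process(buffer, destination, dead_letters, handler=parse_request):
    """Move messages from the buffer to the destination; park failures."""
    for raw in buffer:
        try:
            destination.append(handler(raw))
        except Exception as error:
            # Keep the untouched raw message so it can be replayed later.
            dead_letters.append({"raw": raw, "error": str(error)})


if __name__ == "__main__":
    buffer = ['{"url": "/a"}', "not json", '{"ts": 1}']
    destination, dead_letters = [], []
    process(buffer, destination, dead_letters)
    print(destination)
    print(dead_letters)

    # After fixing the logic, replay only the failed messages.
    def lenient(raw):
        try:
            return json.loads(raw)
        except ValueError:
            return {"unparsed": raw}

    still_dead = []
    process([d["raw"] for d in dead_letters], destination, still_dead, handler=lenient)
    print(destination)
```

Because collectors store the raw request without applying logic, a bug in processing never loses data: the failed message can always be replayed.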
How to do it better? (6)
New system:
● data saved to HDFS in hourly batches using LinkedIn’s Camus (now obsolete, but a good tool)
● Hive tables and partitions created automatically (look for camus2hive on GitHub)
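A rough sketch of what "hourly batches" plus automatically created partitions amounts to: derive an hourly directory from the event time and emit the matching Hive DDL. The directory layout, table name, and partition columns (`dt`, `hour`) are assumptions for illustration and are not taken from Camus or camus2hive.

```python
from datetime import datetime, timezone


def hourly_path(base, topic, ts_ms):
    """Directory holding all events of the UTC hour that contains ts_ms."""
    hour = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return f"{base}/{topic}/hourly/{hour:%Y/%m/%d/%H}"


def add_partition_sql(table, base, topic, ts_ms):
    """Hive statement registering that hourly directory as a partition."""
    hour = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{hour:%Y-%m-%d}', hour='{hour:%H}') "
        f"LOCATION '{hourly_path(base, topic, ts_ms)}'"
    )


if __name__ == "__main__":
    ts_ms = int(datetime(2016, 3, 1, 13, 30, tzinfo=timezone.utc).timestamp() * 1000)
    print(hourly_path("/data/clickstream", "pageview", ts_ms))
    print(add_partition_sql("pageview", "/data/clickstream", "pageview", ts_ms))
```

`ADD IF NOT EXISTS` keeps the statement idempotent, so it can be re-run safely after every hourly load.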
Why Spark Streaming?
● pros:
  ○ momentum
  ○ good integration with YARN - better resource utilization, easy scaling
  ○ good integration with Kafka
  ○ reuse batch Spark code
● cons:
  ○ micro-batching
  ○ as complex as Spark
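To illustrate why micro-batching is listed as a con, here is a small pure-Python sketch (not the Spark API) that groups events into fixed-width batches. An event only becomes visible when its batch closes, so latency is bounded below by the batch interval. The function name and the event shape are hypothetical.

```python
from collections import defaultdict


def micro_batches(events, interval_ms):
    """Group (timestamp_ms, payload) pairs into fixed-width batches.

    Returns a dict mapping each batch start time to its payloads,
    ordered by batch start.
    """
    batches = defaultdict(list)
    for ts_ms, payload in events:
        batch_start = ts_ms - ts_ms % interval_ms
        batches[batch_start].append(payload)
    return dict(sorted(batches.items()))


if __name__ == "__main__":
    events = [(0, "a"), (1500, "b"), (2100, "c")]
    for start, payloads in micro_batches(events, 2000).items():
        # Payloads are handed to processing only once the batch has ended.
        print(f"batch [{start}, {start + 2000}) ms -> {payloads}")
```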
Key take-aways
● Kafka, Avro, Spark - solid building blocks
● Use a central schema repository
Q/A?
Thank you!
http://github.com/allegro
http://allegro.tech