a real-time data ingestion system or: how i learned to stop worrying and love avro by maciej arciuch...

A Real-Time Data Ingestion System

Or: How I Learned to Stop Worrying and Love Avro

Maciej Arciuch

Allegro.pl

● biggest online auction website in Poland
● sites in other countries
● “Polish eBay” (but better!)

Clickstream at Allegro.pl

● how do our users behave?
● ~400 M raw clickstream events daily
● collected at the front-end
● web and mobile devices
● valuable source of information

Legacy system

● HDFS, Flume and MapReduce
● Main issues:
  ○ batch processing - per hour or day
  ○ data formats
  ○ how to make data more accessible for others?

How to do it better? (1)

stream processing: Spark Streaming and Kafka - data available “almost” instantly

new applications: security, recommendations & search


Use Avro:
● mature software, good support in Hadoop ecosystem
● space-efficient
● schema: structure + doc placeholder
● the same format for stream and batch processing
● backward/forward compatibility control
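
To make the "schema: structure + doc placeholder" and compatibility points concrete, here is a minimal sketch in plain Python. The `ClickEvent` schema, its field names and the `read_with_schema` helper are hypothetical illustrations, not the schemas used at Allegro, and the helper is a simplified stand-in for Avro's schema resolution on flat records (a real Avro library also handles nested types, unions and type promotion).

```python
import json

# Hypothetical clickstream schema, version 1. Each field carries a "doc" entry,
# so the schema describes both structure and meaning.
SCHEMA_V1 = json.loads("""
{
  "type": "record",
  "name": "ClickEvent",
  "namespace": "example.clickstream",
  "doc": "A single raw clickstream event.",
  "fields": [
    {"name": "user_id", "type": "string", "doc": "Anonymous user identifier."},
    {"name": "url", "type": "string", "doc": "Requested URL."},
    {"name": "timestamp", "type": "long", "doc": "Event time, ms since epoch."}
  ]
}
""")

# Version 2 adds a field WITH a default, which keeps old and new data readable
# in both directions.
SCHEMA_V2 = dict(
    SCHEMA_V1,
    fields=SCHEMA_V1["fields"] + [
        {"name": "device", "type": "string", "default": "unknown",
         "doc": "Device category, e.g. web or mobile."}
    ],
)


def read_with_schema(record, reader_schema):
    """Simplified schema resolution for flat records.

    Fields the reader does not know are dropped; fields missing from the
    record are filled from the reader's default, or rejected if none exists.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved
```

Reading a record written with `SCHEMA_V1` through `SCHEMA_V2` fills `device` with `"unknown"` (backward compatibility), while reading a version 2 record through `SCHEMA_V1` simply drops `device` (forward compatibility).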


Create a central schema repository:
● single source of truth
● all elements of the system refer to the latest version
● validate backward/forward compatibility on commit
● immutable schemas
● propagate info to Hive metastore, files, HTMLs
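
The repository rules listed above (immutable versions, compatibility validated on commit, everyone reads the latest version) could be sketched roughly as follows. `SchemaRepository`, `can_read` and `IncompatibleSchemaError` are hypothetical names, the store is an in-memory dict rather than a real repository, and the compatibility check is deliberately stricter and simpler than Avro's own rules: it only looks at top-level record fields and requires identical types.

```python
import copy


class IncompatibleSchemaError(Exception):
    """Raised when a new schema version would break readers or writers."""


def can_read(reader, writer):
    """True if data written with `writer` can be read with `reader`.

    Simplified rule: every reader field must either exist in the writer
    schema with the same type, or declare a default.
    """
    writer_fields = {f["name"]: f for f in writer["fields"]}
    for field in reader["fields"]:
        written = writer_fields.get(field["name"])
        if written is None:
            if "default" not in field:
                return False
        elif written["type"] != field["type"]:
            return False
    return True


class SchemaRepository:
    """In-memory stand-in for a central, append-only schema repository."""

    def __init__(self):
        self._versions = {}  # schema name -> list of committed versions

    def commit(self, name, schema):
        """Validate against the latest version, then append a new version."""
        versions = self._versions.setdefault(name, [])
        if versions:
            latest = versions[-1]
            if not can_read(schema, latest):
                raise IncompatibleSchemaError("not backward compatible")
            if not can_read(latest, schema):
                raise IncompatibleSchemaError("not forward compatible")
        # Store a copy so later mutation by the caller cannot alter history.
        versions.append(copy.deepcopy(schema))
        return len(versions)  # 1-based version number

    def latest(self, name):
        """Return a copy of the newest version (the single source of truth)."""
        return copy.deepcopy(self._versions[name][-1])
```

In this sketch a rejected commit leaves the stored versions untouched, so consumers that call `latest` only ever see schemas that passed both checks.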


New system:
● two separate Kafka instances (buffer and destination)
● if your infrastructure is down – you still collect data
● collectors – only save HTTP requests, no logic
● logic in Spark Streaming
● dead letter queue – you can reprocess failed messages
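
The dead letter queue idea can be illustrated with a small sketch. Plain Python lists stand in for the buffer topic, the destination topic and the dead letter queue, and `parse_event` is a hypothetical processing step; the actual system runs this logic in Spark Streaming against Kafka, which is not shown here.

```python
import json


def parse_event(raw_request):
    """Hypothetical processing step: parse a JSON body and require user_id."""
    event = json.loads(raw_request)
    if "user_id" not in event:
        raise ValueError("missing user_id")
    return event


def process_batch(buffer_messages, handler):
    """Apply `handler` to each raw message.

    Successful results go to the destination list; failures are kept, together
    with the original payload and the error, so they can be replayed later.
    """
    destination, dead_letters = [], []
    for raw in buffer_messages:
        try:
            destination.append(handler(raw))
        except Exception as error:  # any failure sends the message to the DLQ
            dead_letters.append({"raw": raw, "error": str(error)})
    return destination, dead_letters
```

Because the dead letter entries keep the untouched raw payload, reprocessing after a bug fix amounts to calling `process_batch([d["raw"] for d in dead_letters], fixed_handler)`, where `fixed_handler` is whatever corrected logic replaces the failing step.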


New system:
● data saved to HDFS in hourly batches using LinkedIn’s Camus (now obsolete, but a good tool)
● Hive tables and partitions created automatically (look for camus2hive on GitHub)
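
As a rough illustration of hourly batches and automatic partition creation, the sketch below maps an event timestamp to an hourly directory and builds the matching Hive `ALTER TABLE ... ADD PARTITION` statement. The directory layout, the `dt`/`hour` partition columns, the base path and both function names are assumptions made for this example; they are not taken from Camus or camus2hive.

```python
from datetime import datetime, timezone


def hourly_partition(topic, timestamp_ms, base_dir="/data/clickstream"):
    """Map an event timestamp (ms since epoch, UTC) to an hourly partition."""
    ts = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return {
        "dt": ts.strftime("%Y-%m-%d"),
        "hour": ts.strftime("%H"),
        # Hypothetical layout: <base>/<topic>/hourly/YYYY/MM/DD/HH
        "location": f"{base_dir}/{topic}/hourly/{ts:%Y/%m/%d/%H}",
    }


def add_partition_statement(table, partition):
    """Build the HiveQL that registers one hourly directory as a partition."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{partition['dt']}', hour='{partition['hour']}') "
        f"LOCATION '{partition['location']}'"
    )
```

`IF NOT EXISTS` makes the statement safe to run again after every hourly load, which is what allows partition creation to be fully automated.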

Why Spark Streaming?

● pros:
  ○ momentum
  ○ good integration with YARN - better resource utilization, easy scaling
  ○ good integration with Kafka
  ○ reuse batch Spark code
● cons:
  ○ micro-batching
  ○ as complex as Spark

Key take-aways

● Kafka, Avro, Spark - solid building blocks

● Use a central schema repository

Thank you!

http://github.com/allegro
http://allegro.tech
