![Page 1: A real-time data ingestion system or: How I learned to stop worrying and love Avro by Maciej Arciuch at Big Data Spain 2015](https://reader035.vdocuments.net/reader035/viewer/2022081605/58f9b9671a28abe4508b4589/html5/thumbnails/1.jpg)
A Real-Time Data Ingestion System
Or: How I Learned to Stop Worrying and Love Avro
Maciej Arciuch
Allegro.pl
● biggest online auction website in Poland
● sites in other countries
● “Polish eBay” (but better!)
Clickstream at Allegro.pl
● how do our users behave?
● ~400 M raw clickstream events daily
● collected at the front-end
● web and mobile devices
● valuable source of information
Legacy system
● HDFS, Flume and MapReduce
● Main issues:
  ○ batch processing - per hour or day
  ○ data formats
  ○ how to make data more accessible for others?
How to do it better? (1)
● stream processing: Spark Streaming and Kafka - data available “almost” instantly
● new applications:
  ○ security
  ○ recommendations & search
How to do it better? (2)
Use Avro:
● mature software, good support in the Hadoop ecosystem
● space-efficient
● schema: structure + doc placeholder
● the same format for stream and batch processing
● backward/forward compatibility control
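To make the "schema: structure + doc placeholder" and "backward/forward compatibility" points concrete, here is a minimal sketch in plain Python. The `ClickEvent` schema and its field names are hypothetical (the talk does not show Allegro's real schemas), and `resolve` is a simplified stand-in for Avro's schema resolution on flat records, not the Avro library itself.

```python
# Hypothetical clickstream schema, version 1. Avro schemas are JSON documents;
# Python dicts are used here so the example needs no external library.
SCHEMA_V1 = {
    "type": "record",
    "name": "ClickEvent",
    "doc": "One raw clickstream event.",
    "fields": [
        {"name": "user_id", "type": "string", "doc": "Anonymous user identifier."},
        {"name": "url", "type": "string", "doc": "Requested page."},
    ],
}

# Version 2 adds a field WITH a default, so data written with v1 stays readable.
SCHEMA_V2 = {
    "type": "record",
    "name": "ClickEvent",
    "doc": "One raw clickstream event.",
    "fields": SCHEMA_V1["fields"] + [
        {"name": "device", "type": "string", "default": "unknown",
         "doc": "Device class, e.g. web or mobile."},
    ],
}


def resolve(record, reader_schema):
    """Simplified schema resolution for flat records (illustration only).

    Fields missing from the data are filled from the reader's defaults;
    fields unknown to the reader are dropped.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved


old_event = {"user_id": "u1", "url": "/item/42"}                  # written with v1
new_event = {"user_id": "u2", "url": "/cart", "device": "mobile"}  # written with v2

upgraded = resolve(old_event, SCHEMA_V2)    # backward: new reader, old data
downgraded = resolve(new_event, SCHEMA_V1)  # forward: old reader, new data
```

Here `upgraded` gains `device` from the default and `downgraded` simply loses the field the old reader does not know about, which is why adding fields with defaults is usually the safe way to evolve a schema.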
How to do it better? (3)
Create a central schema repository:
● single source of truth
● all the elements of the system refer to the latest version
● validate backward/forward compatibility on commit
● immutable schemas
● propagate info to Hive metastore, files, HTMLs
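The repository rules above can be sketched as a small in-memory class. This is an illustration under stated assumptions, not the repository used at Allegro: `SchemaRepository`, `incompatibilities`, and the subject name are hypothetical, and the compatibility check covers only flat records with added or removed fields (real Avro resolution also handles type promotion, unions, and nesting).

```python
import copy


def incompatibilities(old, new):
    """Simplified backward/forward check between two flat record schemas."""
    old_fields = {f["name"]: f for f in old["fields"]}
    new_fields = {f["name"]: f for f in new["fields"]}
    problems = []
    for name, field in new_fields.items():
        # A new reader must be able to fill the field when reading old data.
        if name not in old_fields and "default" not in field:
            problems.append(f"added field {name!r} has no default (breaks backward)")
    for name, field in old_fields.items():
        # An old reader must be able to fill the field when reading new data.
        if name not in new_fields and "default" not in field:
            problems.append(f"removed field {name!r} had no default (breaks forward)")
    for name in sorted(old_fields.keys() & new_fields.keys()):
        if old_fields[name]["type"] != new_fields[name]["type"]:
            problems.append(f"field {name!r} changed type")
    return problems


class SchemaRepository:
    """Hypothetical single source of truth: append-only, validated on commit."""

    def __init__(self):
        self._versions = {}  # subject -> list of committed schemas

    def commit(self, subject, schema):
        versions = self._versions.setdefault(subject, [])
        if versions:
            problems = incompatibilities(versions[-1], schema)
            if problems:
                raise ValueError("; ".join(problems))
        # Store a private copy so later changes by the caller cannot alter it.
        versions.append(copy.deepcopy(schema))
        return len(versions)  # 1-based version number

    def latest(self, subject):
        # Hand out a copy: committed versions are never modified in place.
        return copy.deepcopy(self._versions[subject][-1])
```

A commit that adds a field without a default raises `ValueError` and leaves the stored versions untouched, so every consumer that asks for `latest(...)` keeps seeing a schema that can read all existing data.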
![Page 8: A real-time data ingestion system or: How I learned to stop worrying and love Avro by Maciej Arciuch at Big Data Spain 2015](https://reader035.vdocuments.net/reader035/viewer/2022081605/58f9b9671a28abe4508b4589/html5/thumbnails/8.jpg)
How to do it better? (4)
How to do it better? (5)
New system:
● two separate Kafka instances (buffer and destination)
● if your infrastructure is down - you still collect data
● collectors - only save HTTP requests, no logic
● logic in Spark Streaming
● dead letter queue - you can reprocess failed messages
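The buffer, destination, and dead letter queue flow can be sketched without Kafka or Spark. In this minimal illustration plain Python lists stand in for the topics, and `process_batch`, `parse_request`, and the message layout are hypothetical names chosen for the example.

```python
import json


def parse_request(raw):
    """Hypothetical processing step: turn a raw collected request into an event."""
    payload = json.loads(raw)
    return {"user_id": payload["user_id"], "url": payload["url"]}


def process_batch(raw_messages, parse):
    """Route each raw message to the destination or to the dead letter queue."""
    destination, dead_letters = [], []
    for raw in raw_messages:
        try:
            destination.append(parse(raw))
        except (ValueError, KeyError, TypeError) as err:
            # Keep the untouched raw message so it can be reprocessed later.
            dead_letters.append({"raw": raw, "error": repr(err)})
    return destination, dead_letters


# The "buffer" holds whatever the collectors saved, valid or not.
buffer = [
    '{"user_id": "u1", "url": "/item/42"}',
    "not json at all",
    '{"user_id": "u2"}',
]
events, dead = process_batch(buffer, parse_request)


def lenient_parse(raw):
    """A fixed parser deployed later: tolerate a missing url."""
    payload = json.loads(raw)
    return {"user_id": payload["user_id"], "url": payload.get("url", "")}


# Reprocessing: feed the stored raw messages through the fixed logic.
recovered, still_dead = process_batch([d["raw"] for d in dead], lenient_parse)
```

Because the collectors store requests without applying any logic, a bug in the processing step costs nothing permanently: the raw message is still available and can be replayed once the logic is fixed.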
How to do it better? (6)
New system:
● data saved to HDFS in hourly batches using LinkedIn’s Camus (now obsolete, but a good tool)
● Hive tables and partitions created automatically (look for camus2hive on GitHub)
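As a rough illustration of what "hourly batches" plus "partitions created automatically" involves, the sketch below maps an event timestamp to an hourly directory and builds the matching Hive statement. The directory layout, the base path `/data/camus`, the UTC time zone, and the partition column names are assumptions made for this example; this is not the code of Camus or camus2hive.

```python
from datetime import datetime, timezone


def hourly_partition(topic, epoch_millis, base_dir="/data/camus"):
    """Return (partition values, HDFS location) for the hour an event falls in.

    Assumes a <base>/<topic>/hourly/YYYY/MM/DD/HH layout and UTC timestamps.
    """
    ts = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    parts = {
        "year": f"{ts:%Y}",
        "month": f"{ts:%m}",
        "day": f"{ts:%d}",
        "hour": f"{ts:%H}",
    }
    location = f"{base_dir}/{topic}/hourly/{ts:%Y/%m/%d/%H}"
    return parts, location


def add_partition_ddl(table, parts, location):
    """Build an idempotent Hive statement registering one hourly partition."""
    spec = ", ".join(f"{key}='{value}'" for key, value in parts.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


# Example: an event from 2015-10-15 13:45 UTC on a hypothetical topic.
event_ms = int(datetime(2015, 10, 15, 13, 45, tzinfo=timezone.utc).timestamp() * 1000)
parts, location = hourly_partition("clickstream", event_ms)
ddl = add_partition_ddl("clickstream", parts, location)
```

`ADD IF NOT EXISTS` keeps the statement safe to run after every hourly job, which is what makes fully automatic partition creation practical.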
Why Spark Streaming?
● pros:
  ○ momentum
  ○ good integration with YARN - better resource utilization, easy scaling
  ○ good integration with Kafka
  ○ reuse batch Spark code
● cons:
  ○ micro-batching
  ○ as complex as Spark
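The "micro-batching" item is easier to judge with a toy model. The sketch below is plain Python rather than Spark code: it groups events by arrival time into fixed intervals, the way a micro-batch engine cuts a stream into small batches. The function name and the event tuples are hypothetical.

```python
from collections import defaultdict


def micro_batches(events, batch_interval_ms):
    """Group (arrival_ms, payload) pairs into fixed-length batches.

    Each batch is processed only after its interval closes, so an event can
    wait up to one full interval before any logic sees it.
    """
    batches = defaultdict(list)
    for arrival_ms, payload in events:
        batch_start = arrival_ms - arrival_ms % batch_interval_ms
        batches[batch_start].append(payload)
    return [(start, batches[start]) for start in sorted(batches)]


events = [(0, "a"), (400, "b"), (1000, "c"), (2500, "d")]
batches = micro_batches(events, batch_interval_ms=1000)
# batches == [(0, ["a", "b"]), (1000, ["c"]), (2000, ["d"])]
```

The batch interval sets a floor on latency, which is why the data is available “almost” instantly rather than per event; in exchange, each batch can be processed with ordinary batch-style code.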
Key take-aways
● Kafka, Avro, Spark - solid building blocks
● Use a central schema repository
Q/A?
Thank you!
http://github.com/allegro
http://allegro.tech