applications on spark prof. harold liu beijing institute of technology december 2015

Applications on Spark

Prof. Harold Liu

Beijing Institute of Technology

December 2015

Who Is Using Spark These Days?

https://spark-summit.org/2

According to the figure above, over 1,000 companies have put the Spark platform into production, including well-known traditional manufacturers such as Toyota and O2O companies such as Uber and Airbnb.

This indicates that Spark's user base has expanded beyond Internet-based industries into traditional ones.

Many big data platform distributors, including established Hadoop distributors such as Hortonworks and Cloudera, have begun to deploy Spark, which will further widen its adoption.


Open Source Spark Community

The figure shows that the number of contributors increased rapidly from 2010 to 2014. Among these contributors, many Chinese organizations and developers have shown enthusiasm for Spark. The largest Spark cluster, with over 8,000 nodes, is at Tencent, and the largest single job processed 1 PB of data, a record set by Alibaba and Databricks.


Architecture of Spark


Entertainment: Tencent

Company Background: The biggest social service provider in China.

Data Background: By the end of 2015, QQ had over 800 million monthly active users and WeChat had over 600 million. Together they generate over 200 TB of data every day.

Business Requirement: Over 90% of the data needs to be processed online.


Tencent Distributed Data Warehouse


The Tencent Distributed Data Warehouse (TDW) collects all product-level data and provides data storage and analysis services.

TDW supports PB-level data storage and computing. It has two parts: offline computing with M/R and online computing with Storm.

Hadoop vs. Spark on M/R

Running Mode   Compute Resource       Running Time (min)   Cost (Slot*s)
MapReduce      200 Map + 100 Reduce   120                  693872
Spark          200 Executor           33                   396000
Spark          400 Executor           21                   504000


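Spark runs much faster than Hadoop M/R. With 200 executors, its running time (33 min) is roughly a quarter of that of MapReduce (120 min). Adding executors shortens the running time further (21 min with 400 executors), although the total cost in slot-seconds rises from 396,000 to 504,000.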
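The numbers in the table can be checked with a few lines of arithmetic. The sketch below assumes that the Spark cost column is simply executors multiplied by running time in seconds; the slides do not state this, but both Spark rows agree with it. The function name `slot_seconds` is made up for illustration.

```python
# Figures copied from the Hadoop vs. Spark table.
MAPREDUCE_MINUTES = 120
MAPREDUCE_COST = 693872  # slot-seconds, as reported

# (executors, running time in minutes, reported cost in slot-seconds)
spark_runs = [(200, 33, 396000), (400, 21, 504000)]


def slot_seconds(executors, minutes):
    """Assumed cost model: every executor is held for the whole job."""
    return executors * minutes * 60


for executors, minutes, reported in spark_runs:
    computed = slot_seconds(executors, minutes)
    time_ratio = minutes / MAPREDUCE_MINUTES
    cost_ratio = reported / MAPREDUCE_COST
    print(
        f"{executors} executors: computed {computed} slot*s "
        f"(reported {reported}), "
        f"time is {time_ratio:.1%} of MapReduce, "
        f"cost is {cost_ratio:.1%} of MapReduce"
    )
```

Under this assumption, doubling the executors buys a shorter running time at a higher total cost, so the choice between the two Spark configurations is a latency versus resource trade-off.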

Overall, for data mining workloads the traditional Hadoop M/R framework has serious performance problems, while Spark handles them well thanks to its iterative, in-memory computing model.

E-commerce: Taobao

Company Background: The biggest C2C e-commerce company in China and a pioneering Spark user (since 2012).

Data Background: As of 2014, Taobao had over 500 million registered members and 120 million active members. Its turnover on November 11, 2014 exceeded 90 billion. Its various businesses generate TB-level data every day.

Business Requirement: For the past few years, Taobao has been using Yun Ti, which is based on Hadoop. However, Hadoop runs into many problems with iterative computing, which is why Taobao turned to Spark.


Spark in Taobao

The figure shows the history of Spark at Taobao. Taobao has been using Spark since 2012, when Spark was still very young.


[Figure labels: Yarn version 0.23.7; 10-node cluster; 200-node Yarn cluster]

Spark Development Process in Taobao

Before a job is put on the production servers, it is tested on the test servers.

The code is then merged into the local repository or pushed to the open source community.


Recommender System in Taobao

The recommender system combines the Spark, Spark MLlib and Spark Streaming frameworks.

It can perform both offline and online analysis that covers most parts of business requests in Taobao.


Test of K-Means Algorithm

Increasing each worker's memory cuts the running time, and increasing the number of workers also improves performance.
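To see why K-Means is sensitive to memory, it helps to look at its access pattern: every iteration re-reads the whole dataset. The following is a minimal single-machine sketch in plain Python, not the Spark MLlib implementation and not the code used in Taobao's test; the sample points and the function name `kmeans` are made up for illustration.

```python
import random


def kmeans(points, k, iterations=10, seed=0):
    """Plain k-means on 2-D points; returns the final list of centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each iteration scans the full dataset again,
        # which is why keeping the data in memory pays off.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(
                range(k),
                key=lambda i: (x - centroids[i][0]) ** 2
                + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean_x = sum(p[0] for p in cluster) / len(cluster)
                mean_y = sum(p[1] for p in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                # Keep the old centroid if no point was assigned to it.
                new_centroids.append(centroids[i])
        centroids = new_centroids
    return centroids


# Hypothetical sample data: two well-separated groups of points.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
centroids = sorted(kmeans(points, k=2))
print(centroids)
```

In a disk-based M/R job each of these passes would reload the input, whereas an in-memory engine can keep `points` cached between iterations.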


Telecom: Telefonica

Company Background: Telefonica is a Spanish telecommunications company that provides comprehensive services including mobile phone, Internet, data and wired television services.

Data Background: Telefonica is the biggest multinational enterprise in Spain and serves customers in over 40 countries. Its various businesses generate huge amounts of data.

Business Requirement: As the volume of data grows rapidly, network security problems such as DDoS attacks, SQL injection attacks and account theft have come to the fore. Using big data analysis technology to prevent cyber crime has become urgent for the company.

Why Spark?

• Spark provides a full stack of components (i.e., SQL, Streaming, MLlib, GraphX).

• Spark makes it easy to analyze both historical data and streaming data.

• It supports various applications and data sources in order to deal with complex application scenarios.

• The SQL language can be used to leverage the power of Spark.

• The number of components in Spark is much smaller than that in Hadoop.

Components of Spark and Hadoop

According to the figure above, Spark has about half as many components as Hadoop. With fewer components, a Spark deployment can potentially produce far fewer errors.


Spark Production Architecture in Telefonica

It uses a distributed message queue system called Kafka to collect data from various sources.

Then, the data is consumed by Storm for pre-processing.

Finally, the data is processed by Spark or saved in Cassandra.


Data collection: Kafka

Data pre-processing: Storm

Batch processing: Cassandra+Spark
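The three stages above can be sketched on a single machine. This is only an illustration of the data flow, not Telefonica's system: a `queue.Queue` stands in for Kafka, a plain function stands in for the Storm pre-processing step, and a `Counter` stands in for the Spark batch job. The event format, the function names and the threshold are all assumptions.

```python
import queue
from collections import Counter

# Stage 1 (stand-in for Kafka): a queue that collects raw events.
raw_events = queue.Queue()
for line in [
    "10.0.0.1 GET /login",
    "10.0.0.1 GET /login",
    "10.0.0.1 GET /login",
    "10.0.0.2 GET /home",
    "malformed",
]:
    raw_events.put(line)


def preprocess(line):
    """Stage 2 (stand-in for Storm): parse one event, drop malformed ones."""
    parts = line.split()
    if len(parts) != 3:
        return None
    ip, method, path = parts
    return {"ip": ip, "method": method, "path": path}


def batch_analyze(events, threshold):
    """Stage 3 (stand-in for Spark): flag IPs with many requests."""
    counts = Counter(event["ip"] for event in events)
    return sorted(ip for ip, n in counts.items() if n >= threshold)


# Drain the queue through the pre-processing step.
cleaned = []
while not raw_events.empty():
    event = preprocess(raw_events.get())
    if event is not None:
        cleaned.append(event)

# Hypothetical rule: three or more requests from one IP is suspicious.
suspects = batch_analyze(cleaned, threshold=3)
print(suspects)
```

Separating collection, pre-processing and batch analysis in this way lets each stage be scaled or replaced independently.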

Retail: Euclid

Company Background: Euclid Analytics is a geo-data analysis company that provides solutions to customers based on offline positional information.

Data Background: Euclid mainly relies on WiFi devices to collect data from the physical world.

Business Requirement: Euclid's main job is to support location-based analysis services for its customers. By collecting shopper behavior data, it tries to understand shoppers' behavior and shopping habits and to predict their future behavior.


Retail Customer Features

• Based on the data collected from WiFi devices, shoppers can be divided into three groups: frequent customers, pass-by customers and quick-leave customers.

• Some of them like to buy products, some spend a lot of time in the store, and some like to wander around within a zone.

Analysis Procedure with Spark

First, mobile data is collected by WiFi devices from pinged signals, which include the device MAC address, the signal strength and other information.

Then, the data is sent to the cloud and processed on a Spark cluster.

Finally, customers can view the analysis results on the web.

Other Area: PubMatic

Company Background: PubMatic is an advertising company. It developed the first real-time advertising analysis system in the marketing field worldwide.

Data Background: PubMatic has 6 geographically distributed data centers with 6 PB of data to manage. Every day it serves 12 billion ads and handles 1,000 billion bids. Its system now produces 22 TB of data.

Business Requirement: Because its ad data is complex and varied, PubMatic needs to process the data in real time.

System Architecture in PubMatic

As the figure above shows, various streaming data flows are fed into memory and processed by Spark. Finally, the data is saved in HDFS and Amazon S3.

Spark vs. Hive on Query Performance

When the data volume is 192 GB, the query takes 550 seconds on Spark, while Hive needs 850 seconds for the same query.

As the data volume increases, Spark's running time is on average 40% less than Hive's.
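The reduction at the 192 GB data point can be worked out directly from the two running times quoted above. The short sketch below computes it; the function name `reduction` is made up for illustration.

```python
def reduction(spark_seconds, hive_seconds):
    """Fraction by which Spark's running time is lower than Hive's."""
    return (hive_seconds - spark_seconds) / hive_seconds


# Running times at 192 GB, taken from the slide.
r = reduction(550, 850)
speedup = 850 / 550
print(f"reduction: {r:.1%}, speedup: {speedup:.2f}x")
```

At this single data point the reduction comes out somewhat below the 40% average, which is consistent with the slide's statement that 40% is an average over several data volumes.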

Effect of Using Spark in PubMatic

• Spark supports both offline and online data processing.

• It has active community support and is compatible with the Hadoop ecosystem.

• By using Spark Streaming, Spark SQL and Spark MLlib together, PubMatic can provide real-time ad services and business analysis reports to customers faster than ever before.
