Scalable Tools - Part I Introduction to Scalable Tools Adisak Sukul, Ph.D., Lecturer, Department of Computer Science, [email protected] http://web.cs.iastate.edu/~adisak/MBDS2018/



Scalable Tools session

• Before we begin:

• Do you have VirtualBox and an Ubuntu VM set up?

• You can copy the VM from a USB disk

• Option 2: Run in the cloud (if you can't run it locally):

• Set up Google Cloud or Amazon EC2 with Python and Spark

• We will be using Spark, Python, and PySpark.


What is Big Data?


Reference: Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Hadoop Training | Edureka

https://www.youtube.com/watch?v=mafw2-CVYnA&t=1995s


Why scale?

• In the early 2000s, every company had to pay more and more to DBMS vendors.


Scalable tools for Big Data

• MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.


What are the problems with Big Data in traditional systems?


Traditional scenario

• Manageable workload


When data increases, traditional systems fail

• Data comes in too fast (high velocity)

• Data comes in unstructured (high variety)


How to solve this problem?

• Issue 1: Too many orders per hour?

• Answer: Hire more cooks! (distributed workers)


• The same thing happened with servers and storage


• Issue 2: The food shelf becomes a bottleneck

• Now, how do we solve it? With a distributed and parallel approach

• Data locality concept in Hadoop: data is locally available for each processing unit


• Sounds good?

• How do we solve Big Data problems (storing and processing Big Data) with a distributed and parallel approach like that?

• Yes, we can use Hadoop!

• Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion


Who came up with the MapReduce concept?


Hadoop Master/Slave Architecture


Hadoop Master/Slave Architecture (cont. 1)


Hadoop Master/Slave Architecture (cont. 2)

• There is a backup worker for every project


How it translates to the actual architecture


Let’s play a game

• Split into four groups

• Assign 1 manager and 1 assistant

• The assistant collects the results and times the process

• Group A: everybody reads the whole paper (5 pages); the manager combines (averages) the results

• Group B: each person reads one page; the manager combines the results

• Group A: everybody reads the whole paper (5 pages); the manager combines (averages) the results

• The Page 2 result is missing

• Group B: each person reads one page; the manager combines the results

• The Page 2 result is missing

• Task for team members:

• Read the paper

• Count these words (not case-sensitive):

• Year

• Dream

• Will

• Describe

• Soul
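
The game can be simulated in a few lines of Python. This is only a sketch under stated assumptions: the five "pages" below are made-up stand-ins for the real paper, and the function names are hypothetical. Group A has every reader count the whole paper and averages the answers; Group B gives each reader one page and adds the partial counts.

```python
import re
from collections import Counter

TARGET_WORDS = {"year", "dream", "will", "describe", "soul"}

def count_targets(text):
    """Count the target words in one piece of text, ignoring case."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w in TARGET_WORDS)

def group_a(pages, readers):
    """Every reader counts the whole paper; the manager averages the results."""
    whole_paper = " ".join(pages)
    results = [count_targets(whole_paper) for _ in range(readers)]
    return {w: sum(r[w] for r in results) / readers for w in TARGET_WORDS}

def group_b(pages):
    """Each reader counts one page; the manager adds the partial counts."""
    total = Counter()
    for page in pages:
        total += count_targets(page)
    return dict(total)

# Hypothetical stand-in for the five-page paper
pages = [
    "This year I will describe a dream.",
    "The soul will dream every year.",
    "Describe the dream of the soul.",
    "Will the year end?",
    "A dream, a soul, a year.",
]

print(group_a(pages, readers=5))          # same answer, five times the reading
print(group_b(pages))                     # same answer, one page per reader
print(group_b(pages[:1] + pages[2:]))     # Page 2 result missing: undercount
```

In this sketch Group A still has the full answer if one reader drops out, because every reader holds the whole paper, while Group B loses everything that was on the missing page. That trade-off between speed and redundancy is what replication in a cluster is meant to address.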


HDFS Data Block


Fault tolerance


Fault tolerance: Replication Factor
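
A rough illustration of blocks and replication in Python. The 128 MB block size and replication factor of 3 are the HDFS defaults in Hadoop 2.x and later; everything else is an assumption made for the sketch: the node names are hypothetical, and the round-robin placement is a simplification, not the rack-aware policy HDFS actually uses.

```python
BLOCK_SIZE_MB = 128      # HDFS default block size (Hadoop 2.x and later)
REPLICATION_FACTOR = 3   # HDFS default replication factor

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block; only the last block may be smaller."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])

def place_replicas(num_blocks, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes (simplified round-robin).

    Assumes replication <= len(nodes) so the chosen nodes are distinct.
    """
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }

def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a live node."""
    return [b for b, holders in placement.items()
            if any(n not in failed_nodes for n in holders)]

nodes = ["node1", "node2", "node3", "node4"]   # hypothetical DataNodes
blocks = split_into_blocks(500)                # a 500 MB file
placement = place_replicas(len(blocks), nodes)

print(blocks)
print(placement)
print(readable_blocks(placement, failed_nodes={"node2"}))
```

With three copies of each block on different nodes, the file in this sketch stays fully readable even when one or two nodes fail.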


Example: MapReduce for the word count process
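
The phases of a MapReduce word count can be sketched in plain Python. This is not Hadoop code: it runs on one machine and only mimics the map, shuffle, and reduce steps. The input lines are hypothetical, and each line stands in for one input split.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the values for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Hypothetical input: each line plays the role of one input split
splits = ["deer bear river", "car car river", "deer car bear"]

mapped = [pair for line in splits for pair in map_phase(line)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

On a real cluster the map calls would run on different nodes, one per split, and the shuffle would move the pairs across the network so that every occurrence of a word ends up at the same reducer.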


Apache Spark


Apache Spark

• is a lightning-fast real-time processing framework.

• It performs in-memory computations to analyze data in real time.

• It came into the picture because Apache Hadoop MapReduce performed batch processing only and lacked a real-time processing feature.

• Hence, Apache Spark was introduced: it can perform stream processing in real time and can also handle batch processing.


Apache Spark

• It can leverage Apache Hadoop for both storage and processing.

• It can use HDFS (Hadoop Distributed File System) for storage, and Spark applications can run on Hadoop YARN.


Spark is fast!


But it could cost more, depending on the cost of memory


PySpark

• With PySpark, you can also work with RDDs in the Python programming language. This is made possible by a library called Py4J.

• PySpark offers the PySpark shell, which links the Python API to the Spark core and initializes the Spark context. The majority of data scientists and analytics experts today use Python because of its rich library set, so integrating Python with Spark is a boon to them.
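
A minimal sketch of working with an RDD from Python, assuming PySpark is installed and run in local mode. The function names, the application name, and the sample lines are hypothetical. Inside the PySpark shell the session and context already exist as `spark` and `sc`, so the builder step would not be needed there.

```python
from operator import add

def tokenize(line):
    """Split a line into lowercase words."""
    return line.lower().split()

def spark_word_count(lines):
    """Count words with the RDD API; requires a local PySpark installation."""
    # Imported here so the helper above stays usable without PySpark installed
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")            # run on all local cores
             .appName("WordCountSketch")    # hypothetical application name
             .getOrCreate())
    try:
        rdd = spark.sparkContext.parallelize(lines)
        counts = (rdd.flatMap(tokenize)              # one record per word
                     .map(lambda word: (word, 1))    # (word, 1) pairs
                     .reduceByKey(add)               # sum counts per word
                     .collect())                     # bring results to the driver
        return dict(counts)
    finally:
        spark.stop()

# Example call (needs PySpark):
# spark_word_count(["deer bear river", "car car river", "deer car bear"])
```

The chain of `flatMap`, `map`, and `reduceByKey` follows the same map and reduce idea as Hadoop MapReduce, but Spark keeps the intermediate data in memory where it can.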


Spark benchmark (PySpark and Pandas)

• https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html

• Benchmarking Apache Spark on a Single Node Machine: the benchmark involves running SQL queries over the table “store_sales” (scale 10 to 260) in Parquet file format.


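• What do we learn from this?

def new_data_project(dataset_is_large: bool) -> str:
    """Rule of thumb from the benchmark: pick the tool by data size."""
    if dataset_is_large:
        return "Spark or Hadoop"
    else:
        return "Python Pandas"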