TRANSCRIPT
Scalable Tools - Part I
Introduction to Scalable Tools
Adisak Sukul, Ph.D.,
Lecturer,
Department of Computer Science,
http://web.cs.iastate.edu/~adisak/MBDS2018/
Scalable Tools session
• Before we begin:
• Do you have VirtualBox and an Ubuntu VM created?
• You can copy it from a USB disk
• Option 2: run in the cloud (if you can't run it locally):
• Set up Google Cloud or Amazon EC2 with Python and Spark
• We will be using Spark, Python, and PySpark.
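If you want to confirm your setup, here is a minimal sketch, assuming PySpark has already been installed (for example with pip install pyspark), that checks Python can start a local Spark session:

    # Quick check that PySpark is installed and a local Spark session starts.
    # Assumes PySpark was installed beforehand, e.g. with: pip install pyspark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("setup-check").getOrCreate()
    print("Spark version:", spark.version)
    print(spark.range(5).collect())   # expect five Row(id=...) objects
    spark.stop()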
What is Big Data?
Reference (for the figures in this deck): Apache Hadoop Tutorial | Hadoop Tutorial For Beginners | Big Data Hadoop | Hadoop Training | Edureka
https://www.youtube.com/watch?v=mafw2-CVYnA&t=1995s
Why scale?
• In the early 2000s, every company had to pay more and more to DBMS companies as their data grew.
Scalable tools for Big Data
• MapReduce is a programming model and an
associated implementation for processing and
generating big data sets with
a parallel, distributed algorithm on a cluster.
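As a rough sketch of the model in plain Python (a single-machine toy, not the Hadoop implementation): the map phase transforms every record independently, and the reduce phase combines the mapped results.

    # Toy illustration of the map and reduce phases on one machine.
    # In a real cluster the map calls run in parallel across many nodes.
    from functools import reduce

    records = [1, 2, 3, 4, 5]
    mapped = map(lambda x: x * x, records)        # map phase: transform each record
    total = reduce(lambda a, b: a + b, mapped)    # reduce phase: combine the results
    print(total)                                  # 55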
What are the problems with Big Data in a traditional system?
Traditional scenario
• Manageable workload
When data increases, traditional systems fail
• Data comes in too fast (high velocity)
• Data comes in unstructured (high variety)
How do we solve this problem?
• Issue 1: too many orders per hour
• Answer?
Hire more cooks! (distributed workers)
• The same thing happens with servers and storage.
• Issue 2: the food shelf becomes the bottleneck
• Now, how do we solve it?
Distributed and Parallel Approach
• Data locality concept in Hadoop: data is locally available to each processing unit
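As a single-machine analogy of the "more cooks" idea, here is a small sketch using Python's multiprocessing; each worker handles only the chunk of data assigned to it (the worker count and chunking below are purely illustrative):

    # Toy analogy: distribute the work across several workers, each processing
    # only its own local chunk, then combine the partial results.
    from multiprocessing import Pool

    def process_chunk(chunk):
        return sum(chunk)   # each "cook" works only on its own portion

    if __name__ == "__main__":
        data = list(range(1_000_000))
        chunks = [data[i::4] for i in range(4)]          # four workers, four chunks
        with Pool(processes=4) as pool:
            partials = pool.map(process_chunk, chunks)   # workers run in parallel
        print(sum(partials) == sum(data))                # True: same answer, split work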
• Sounds good?
• How do we solve Big Data problems (storing and processing Big Data) using a distributed and parallel approach like that?
• Yes, we can use Hadoop!
• Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion.
Who came up with the MapReduce concept?
(MapReduce was introduced by Jeffrey Dean and Sanjay Ghemawat at Google, in their 2004 OSDI paper.)
Hadoop Master/Slave Architecture
Hadoop Master/Slave Architecture (cont. 1)
Hadoop Master/Slave Architecture (cont. 2)
• There are backup workers for all projects.
How this translates to the actual architecture
Let’s play a game
• Split into four groups
• Assign 1 manager and 1 assistant per group
• The assistant collects the results and times the process
• Group A: everybody reads the whole paper (5 pages); the manager combines (averages) the results
• Group B: each person reads one page; the manager combines the results
• Second run:
• Group A: everybody reads the whole paper (5 pages); the manager combines (averages) the results
• Missing the page 2 result
• Group B: each person reads one page; the manager combines the results
• Missing the page 2 result
• Task for each team member:
• Read the paper
• Count these words (not case-sensitive):
• Year
• Dream
• Will
• Describe
• Soul
HDFS Data Block
Fault tolerance
Fault tolerance: Replication Factor
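As a back-of-the-envelope sketch (assuming the common HDFS defaults of a 128 MB block size and a replication factor of 3; both are configurable per cluster):

    # How many blocks and replicas a file occupies in HDFS, assuming defaults.
    import math

    block_size_mb = 128        # common default block size (assumption)
    replication_factor = 3     # common default replication factor (assumption)

    file_size_mb = 500                                  # hypothetical file
    blocks = math.ceil(file_size_mb / block_size_mb)    # 4 blocks (3 full + 1 partial)
    replicas = blocks * replication_factor              # 12 stored block copies in total
    print(blocks, replicas)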
Example: MapReduce for the word count process
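A rough single-machine simulation of this flow (map each split to (word, 1) pairs, group by word, then reduce by summing); the sample lines are toy data, and this only spells out the stages rather than running Hadoop itself:

    # Simulate the MapReduce word count stages on one machine.
    from collections import defaultdict

    splits = ["deer bear river", "car car river", "deer car bear"]

    # Map phase: each split independently emits (word, 1) pairs.
    mapped = [(word, 1) for split in splits for word in split.split()]

    # Shuffle phase: group the pairs by key (the word).
    grouped = defaultdict(list)
    for word, one in mapped:
        grouped[word].append(one)

    # Reduce phase: sum the counts for each word.
    counts = {word: sum(ones) for word, ones in grouped.items()}
    print(counts)   # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}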
Apache Spark
• Apache Spark is a lightning-fast real-time processing framework.
• It performs in-memory computations to analyze data in real time.
• It came into the picture because Apache Hadoop MapReduce performed only batch processing and lacked a real-time processing capability.
• Hence, Apache Spark was introduced: it can perform stream processing in real time and can also handle batch processing.
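A small PySpark sketch of the in-memory idea: cache a dataset once, then reuse it for several computations without recomputing or re-reading it (toy data, local mode):

    # In-memory computation: cache a dataset and reuse it across several jobs.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()
    rdd = spark.sparkContext.parallelize(range(1_000_000)).cache()   # keep it in memory

    print(rdd.count())                                  # first action fills the cache
    print(rdd.filter(lambda x: x % 2 == 0).count())     # later jobs reuse the cached data
    spark.stop()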
Apache Spark (cont.)
• It leverages Apache Hadoop for both storage and processing.
• It uses HDFS (the Hadoop Distributed File System) for storage.
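For example, a sketch of reading a file stored in HDFS from PySpark; the namenode host, port, and path below are hypothetical placeholders for your own cluster:

    # Read a text file from HDFS into Spark (host, port, and path are placeholders).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-read").getOrCreate()
    df = spark.read.text("hdfs://namenode:9000/user/demo/input.txt")   # hypothetical path
    print(df.count())
    spark.stop()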
Spark is fast!
But it can cost more, depending on the cost of memory.
PySpark
• With PySpark, you can work with RDDs in the Python programming language as well. This is possible because of a library called Py4J.
• PySpark offers the PySpark shell, which links the Python API to the Spark core and initializes the Spark context. The majority of data scientists and analytics experts today use Python because of its rich library set, so integrating Python with Spark is a boon to them.
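Putting the pieces together, here is a minimal PySpark RDD version of the earlier word count; in the PySpark shell the SparkContext sc already exists, otherwise it can be created as below (the sample lines are toy data):

    # Word count with PySpark RDDs: flatMap -> map -> reduceByKey.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount")    # the PySpark shell provides sc for you
    lines = sc.parallelize(["deer bear river", "car car river", "deer car bear"])

    counts = (lines.flatMap(lambda line: line.split())    # emit every word
                   .map(lambda word: (word.lower(), 1))   # (word, 1) pairs, case-insensitive
                   .reduceByKey(lambda a, b: a + b))      # sum the counts per word
    print(counts.collect())   # e.g. [('deer', 2), ('bear', 2), ('river', 2), ('car', 3)]
    sc.stop()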
Spark benchmark (PySpark and Pandas)
• https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html
• "Benchmarking Apache Spark on a Single Node Machine": the benchmark involves running SQL queries over the table “store_sales” (scale 10 to 260) in Parquet file format.
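As a hedged sketch of what such a comparison looks like in code (the store_sales.parquet path and the ss_quantity column are illustrative assumptions based on the TPC-DS store_sales schema, not the benchmark's actual scripts):

    # The same aggregation over a Parquet file with pandas and with PySpark.
    # "store_sales.parquet" and "ss_quantity" are illustrative assumptions.
    import pandas as pd
    from pyspark.sql import SparkSession

    # pandas: convenient while the data fits in one machine's memory
    pdf = pd.read_parquet("store_sales.parquet")
    print(pdf["ss_quantity"].sum())

    # PySpark: the same query, able to scale beyond a single machine's memory
    spark = SparkSession.builder.master("local[*]").appName("parquet-demo").getOrCreate()
    sdf = spark.read.parquet("store_sales.parquet")
    sdf.selectExpr("sum(ss_quantity)").show()
    spark.stop()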
• What do we learn from this?

def NewDataProject(dataset_size_gb):
    # Rule of thumb, not a hard rule: reach for distributed tools only when
    # the data no longer fits comfortably on a single machine. The 100 GB
    # threshold below is only an illustrative cutoff.
    if dataset_size_gb > 100:
        return "use Spark or Hadoop"
    else:
        return "use Python Pandas"