The Open Source Stack of Big Data Technology Muhammad Rifqi Ma'arif [email protected] | [email protected] openSUSE Asia Summit 2016

Upload: muhammad-rifqi

Post on 14-Feb-2017

Presentation Outline

• Big Data – Formal Introduction

• The Technological Stack

• Implementing Big Data Tech.

• Beyond Hadoop

Big Data – Formal Introduction

The World is Changing

• Old world → A few companies generate data and the rest of the world consumes it.

• Current and future world → All of us generate data and all of us consume it.

The Four Elements

The Technological Stack

Hadoop Ecosystem – The Ancestor

Hadoop in a Nutshell

• Hadoop was created by Doug Cutting and Mike Cafarella in 2005

• A data processing framework

• Built for large-scale data

• Processes data in a distributed manner

• Scales horizontally on commodity hardware

HDFS (Hadoop Distributed File System)

• Scalable distributed filesystem

• Distributes data across the local disks of several nodes built from low-cost commodity hardware.

• HDFS design goals:

‒ Data replication – helps handle hardware failures

‒ Move computation close to data

• A single name node acts as the master (the boss), and multiple data nodes listen to the master and manage their own logical storage
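To make the replication goal concrete, here is a minimal sketch in Python. It is not HDFS code: the tiny block size, the round-robin placement and the node names are assumptions made for illustration (real HDFS uses much larger blocks and rack-aware placement). The name-node role is reduced to a dictionary that records which data nodes hold each block.

```python
BLOCK_SIZE = 4    # bytes per block; tiny on purpose (assumption, real blocks are far larger)
REPLICATION = 3   # number of copies kept of every block


def place_blocks(data, nodes):
    """Name-node role: split data into blocks and record which nodes hold each block."""
    block_count = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE
    placement = {}
    for block_id in range(block_count):
        # Round-robin placement (assumption): block i lands on nodes i, i+1, i+2.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(REPLICATION)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a surviving node."""
    return all(
        any(node != failed_node for node in holders)
        for holders in placement.values()
    )


# Hypothetical data nodes.
nodes = ["datanode1", "datanode2", "datanode3", "datanode4"]
placement = place_blocks(b"hello big data world", nodes)
print(placement)
print(readable_after_failure(placement, "datanode2"))
```

Because every block lives on three different nodes, losing any single data node leaves the whole file readable.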

Moving computation to data...

• The old way:

‒ Separated data → integration → computation → information

• The HDFS way:

‒ Separated data → computation → integration → information
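The difference between the two orderings can be sketched in a few lines of Python. The node names and the sales figures are made up for illustration. The point is only how much data has to travel before a total can be computed.

```python
# Hypothetical cluster: each node holds its own slice of sales records locally.
node_data = {
    "node1": [120, 80, 45],
    "node2": [200, 15],
    "node3": [60, 60, 30, 10],
}


def integrate_then_compute(node_data):
    """Old way: ship every record to one place, then compute."""
    integrated = [x for records in node_data.values() for x in records]
    records_moved = len(integrated)
    return sum(integrated), records_moved


def compute_then_integrate(node_data):
    """HDFS way: each node sums its own data, only the partial sums travel."""
    partials = [sum(records) for records in node_data.values()]
    values_moved = len(partials)
    return sum(partials), values_moved


print(integrate_then_compute(node_data))  # (620, 9)
print(compute_then_integrate(node_data))  # (620, 3)
```

Both orderings give the same total, but the second one moves one value per node instead of every record.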

MapReduce – The Programming Framework

• Originated at Google, which describes it as a simple programming model for processing large-scale data in a parallel and distributed way

https://www.tutorialspoint.com/map_reduce/

MapReduce – The Hello World

https://www.tutorialspoint.com/map_reduce/

MapReduce – The Hello World Scripts

mapper

reducer
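The original scripts were shown as images and are not part of this transcript. As a stand-in, here is a minimal word count sketch in Python with a mapper and a reducer. It is not the author's code: the function names and the sample text are assumptions, and the shuffle-and-sort step that Hadoop performs between the two phases is simulated with `sorted()`.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Sum the counts of each word; expects pairs sorted by word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    # sorted() stands in for the shuffle-and-sort step between map and reduce.
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    text = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
    print(word_count(text))
```

On a real cluster the mapper and the reducer would run as separate tasks on different nodes; here they are chained in one process so the data flow is easy to follow.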

We need to make the elephant faster

Inside The Zoo

• Coordination of configuration, data naming and synchronization for Hadoop projects

• Monitoring and management of Hadoop clusters and nodes

• A tool for data ingestion from various data sources into Hadoop

• A workflow scheduler tool to manage MapReduce jobs

• A tool for managing data transfer between Hadoop and a Relational Database Management System (RDBMS)

• A data mining/machine learning library that works directly with Hadoop data. You can also use R with its RHadoop library

• A scripting language for analyzing large datasets, compiled to MapReduce jobs

• Facilitates easy ad-hoc queries and summarization of data stored in HDFS through a SQL-like interface named HiveQL

• A non-relational, distributed database system that runs on top of HDFS

Implementing Big Data

The Lambda Architecture

http://jameskinley.tumblr.com

Lambda Architecture Workflow

• All data entering the system is dispatched to both the batch layer and the speed layer for processing.

• The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.

• The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.

• The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.

• Any incoming query can be answered by merging results from batch views and real-time views.
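The workflow above can be sketched as a toy page-view counter in Python. This is a single-process illustration, not a real deployment: the class name, the method names and the URLs are assumptions, and each layer is reduced to a plain in-memory structure.

```python
from collections import Counter


class LambdaCounter:
    """Toy lambda architecture that counts page views per URL."""

    def __init__(self):
        self.master_dataset = []        # batch layer: immutable, append-only raw events
        self.batch_view = Counter()     # serving layer: precomputed from the master dataset
        self.realtime_view = Counter()  # speed layer: events since the last batch run

    def ingest(self, url):
        # Every event is dispatched to both the batch layer and the speed layer.
        self.master_dataset.append(url)
        self.realtime_view[url] += 1

    def run_batch(self):
        # Recompute the batch view from all raw data, then drop the real-time
        # view, because the batch view now covers those events.
        self.batch_view = Counter(self.master_dataset)
        self.realtime_view.clear()

    def query(self, url):
        # Answer by merging the batch view with the real-time view.
        return self.batch_view[url] + self.realtime_view[url]


counter = LambdaCounter()
counter.ingest("/home")
counter.ingest("/home")
counter.run_batch()          # batch view now holds 2 views of /home
counter.ingest("/home")      # only the speed layer has seen this one so far
counter.ingest("/about")
print(counter.query("/home"), counter.query("/about"))  # 3 1
```

The query stays correct between batch runs because the speed layer fills the gap left by the slow batch recomputation.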

Mapping the Arsenal onto the Lambda Architecture

Batch Layer

Speed Layer

Serving Layer

Query

Simpler Implementation – Data Warehouse

Data Warehouse Using the Hadoop Framework

Schema

Sqoop (SQL to Hadoop)

• Importing MySQL tables into HDFS is straightforward with Sqoop
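As an illustration, the Python sketch below assembles a `sqoop import` command line for one MySQL table. It only builds and prints the command and does not run Sqoop. The host, database, user, table and HDFS directory are hypothetical values chosen for the example.

```python
import shlex


def sqoop_import_command(jdbc_url, username, table, target_dir, mappers=1):
    """Build the argument list for a `sqoop import` of one table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,          # JDBC connection string of the source database
        "--username", username,
        "-P",                           # prompt for the password at run time
        "--table", table,               # source table to import
        "--target-dir", target_dir,     # destination directory in HDFS
        "--num-mappers", str(mappers),  # number of parallel map tasks
    ]


# Hypothetical connection details, for illustration only.
cmd = sqoop_import_command(
    "jdbc:mysql://dbhost/shop", "etl_user", "orders", "/user/etl/orders"
)
print(shlex.join(cmd))
```

Keeping the arguments in a list makes it easy to hand them to `subprocess.run` later without quoting problems.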

Apache Pig

• Pig provides an engine for executing data flows in parallel on Hadoop and makes use of HDFS and MapReduce

• Pig philosophy → Pigs eat anything: input data can come in any format, including the popular ones.

• Pig includes a language called Pig Latin for expressing data flows

‒ Pig Latin includes operators for many of the traditional data operations, so they need not be re-implemented as in plain Hadoop: JOIN, SORT, FILTER, FOREACH, GROUP, LOAD and STORE.

‒ Expresses data transformation tasks in just a few lines of code

‒ 10 lines of Pig Latin ≈ 200 lines of Java

‒ Simplifies the process of writing MapReduce programs

• Most importantly, you can create user-defined functions (UDFs) in Pig!
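To give a feel for a data flow built from these operators, the sketch below walks through LOAD, FILTER, GROUP, FOREACH and ORDER in plain Python, with a roughly equivalent Pig Latin statement in the comment above each step. The dataset, the file name and the relation names are invented for the example, and the Python code only mimics what Pig would do on a cluster.

```python
from collections import defaultdict

# Hypothetical input rows: (user, amount), as if read from a delimited file.
# Pig Latin: orders = LOAD 'orders.csv' USING PigStorage(',') AS (user:chararray, amount:int);
orders = [("ana", 30), ("budi", 5), ("ana", 20), ("citra", 60), ("budi", 40)]

# Pig Latin: big = FILTER orders BY amount >= 10;
big = [row for row in orders if row[1] >= 10]

# Pig Latin: by_user = GROUP big BY user;
by_user = defaultdict(list)
for user, amount in big:
    by_user[user].append(amount)

# Pig Latin: totals = FOREACH by_user GENERATE group, SUM(big.amount);
totals = [(user, sum(amounts)) for user, amounts in by_user.items()]

# Pig Latin: ranked = ORDER totals BY $1 DESC;
ranked = sorted(totals, key=lambda row: row[1], reverse=True)

print(ranked)  # [('citra', 60), ('ana', 50), ('budi', 40)]
```

Each Pig Latin statement names an intermediate relation, which is why a whole pipeline fits in a handful of lines.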

How Pig Works in a Nutshell

Pig Latin Example – The Wordcount

Apache Hive

• Hive is an open source, petabyte-scale data warehousing framework based on Hadoop that was developed by the Data Infrastructure Team at Facebook

• MapReduce is powerful, but writing MapReduce programs is like asking application developers to spell out the database's physical execution plan in their own code!

• Combines the simplicity of SQL and the power of MapReduce

‒ Efficient implementations of SQL statements on top of MapReduce via the Hive Query Language (HiveQL)
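The contrast between declaring a query and hand-writing its execution can be sketched in Python. Note the assumptions: SQLite from the standard library stands in for Hive here, the query is generic SQL rather than anything HiveQL-specific, and the table and data are invented. The same aggregation is then written out as explicit map, shuffle and reduce steps.

```python
import sqlite3
from itertools import groupby
from operator import itemgetter

# Hypothetical (country, visits) rows.
rows = [("id", 3), ("sg", 5), ("id", 4), ("my", 1), ("sg", 2)]

# Declarative style: state the result you want, the engine plans the execution.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE visits (country TEXT, visits INTEGER)")
db.executemany("INSERT INTO visits VALUES (?, ?)", rows)
sql_result = db.execute(
    "SELECT country, SUM(visits) FROM visits GROUP BY country ORDER BY country"
).fetchall()

# MapReduce style: map, shuffle/sort and reduce are spelled out by hand.
mapped = [(country, visits) for country, visits in rows]   # map
shuffled = sorted(mapped, key=itemgetter(0))               # shuffle and sort
mr_result = [
    (country, sum(v for _, v in group))                    # reduce
    for country, group in groupby(shuffled, key=itemgetter(0))
]

print(sql_result)  # [('id', 7), ('my', 1), ('sg', 7)]
print(mr_result)   # [('id', 7), ('my', 1), ('sg', 7)]
```

Both paths produce the same answer. The SQL version leaves the execution plan to the engine, which is the convenience Hive brings to data stored in HDFS.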

Hive Architecture

http://www.hadooptpoint.com/hadoop-hive-architecture/

Beyond Hadoop

Lambda Architecture

Batch Layer

Speed Layer

Serving Layer

Query/Viz

Another Data Processing Framework

SMACK Stack

Spark, Mesos, Akka, Cassandra, Kafka

http://www.natalinobusa.com

Pick your own weapon

References

• Thomas Bernardz, 2015, Big Data Workshop – Queensland University of Technology

• Guido Schmutz, 2014, Big Data and Fast Data, http://www.slideshare.net/gschmutz/big-data-and-fast-data-lambda-architecture-in-action

• James Kinley, 2015, The Lambda architecture: principles for architecting realtime Big Data systems, http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for

• https://www.tutorialspoint.com

• http://hadoopoints.com

• http://hortonworks.com

Thank you.

Join the conversation, contribute & have a lot of fun!

www.opensuse.org

General Disclaimer

This document is not to be construed as a promise by any participating organisation to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. openSUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for openSUSE products remains at the sole discretion of openSUSE. Further, openSUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All openSUSE marks referenced in this presentation are trademarks or registered trademarks of SUSE LLC, in the United States and other countries. All third-party trademarks are the property of their respective owners.

License

This slide deck is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license. It can be shared and adapted for any purpose (even commercially) as long as Attribution is given and any derivative work is distributed under the same license.

Details can be found at https://creativecommons.org/licenses/by-sa/4.0/

Credits

Template

Richard Brown [email protected]

Design & Inspiration

openSUSE Design Team

http://opensuse.github.io/branding-guidelines/