The Open Source Stack of Big Data Technology

Muhammad Rifqi Ma'arif

openSUSE Asia Summit 2016

Presentation Outline

• Big Data – Formal Introduction

• The Technological Stack

• Implementing Big Data Tech.

• Beyond Hadoop

Big Data – Formal Introduction

The World is Changing

• Old world → A few companies generate data, and the rest of the world consumes it.

• Current and future world → All of us generate data, and all of us consume it.

The Four Elements

The Technological Stack

Hadoop Ecosystem – The Ancestor

Hadoop in a Nutshell

• Hadoop was created by Doug Cutting and Mike Cafarella in 2005

• A data processing framework

• Large-scale data

• Distributed manner

• Horizontal scaling on commodity hardware

HDFS (Hadoop Distributed File System)

• Scalable distributed filesystem

• Distributes data across the local disks of several nodes built from low-cost commodity hardware

• HDFS design goals:

‒ Data replication – helps handle hardware failures

‒ Move computation close to the data

• A single name node acts as the master (the boss); multiple data nodes listen to the master and manage their own logical storage
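
A toy sketch of the master/data-node split and of replication, written in plain Python rather than against the real HDFS API. The class `NameNode`, its methods, the round-robin placement, and the 4-byte block size are all assumptions made for illustration; real HDFS uses far larger blocks and rack-aware placement.

```python
from itertools import cycle


class NameNode:
    """Toy master: decides and records where each block is stored."""

    def __init__(self, data_nodes, replication=3, block_size=4):
        self.data_nodes = list(data_nodes)
        self.replication = min(replication, len(self.data_nodes))
        self.block_size = block_size  # bytes per block (tiny, for the demo)
        self.block_map = {}           # (path, block index) -> [node names]
        # Simulated data-node disks, kept in this object only for brevity.
        self.storage = {name: {} for name in self.data_nodes}
        self._next = cycle(range(len(self.data_nodes)))

    def put(self, path, data):
        """Split data into blocks and store each block on several nodes."""
        for offset in range(0, len(data), self.block_size):
            block_id = (path, offset // self.block_size)
            start = next(self._next)
            targets = [self.data_nodes[(start + k) % len(self.data_nodes)]
                       for k in range(self.replication)]
            for node in targets:
                self.storage[node][block_id] = data[offset:offset + self.block_size]
            self.block_map[block_id] = targets

    def read(self, path, failed=()):
        """Reassemble a file from any surviving replica of each block."""
        parts = []
        for block_id in sorted(b for b in self.block_map if b[0] == path):
            live = [n for n in self.block_map[block_id] if n not in failed]
            if not live:
                raise IOError(f"block {block_id} lost")
            parts.append(self.storage[live[0]][block_id])
        return b"".join(parts)


nn = NameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
nn.put("/logs/a.txt", b"hello hadoop")
# Two of the four nodes fail, yet every block still has a live replica.
recovered = nn.read("/logs/a.txt", failed={"dn1", "dn2"})
print(recovered)
```

With three replicas spread over four nodes, any two node failures still leave one copy of every block, which is the point of the replication design goal.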

Moving computation to data...

• Old fashion:

‒ Separated data → integration → computation → information

• HDFS fashion:

‒ Separated data → computation → integration → information
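
The difference between the two orderings can be sketched in a few lines of Python. The node names, the numbers, and the two function names are made up for illustration; the sketch only counts how many values have to leave their node in each approach.

```python
# Three "nodes", each holding its own partition of the data on local disk.
partitions = {
    "node1": [3, 5, 8],
    "node2": [13, 21],
    "node3": [34, 55, 89],
}


def old_fashion(parts):
    """Integrate first: ship every record to one place, then compute."""
    integrated = [x for records in parts.values() for x in records]
    records_moved = len(integrated)
    return sum(integrated), records_moved


def hdfs_fashion(parts):
    """Compute first: each node reduces its own data, then results are merged."""
    partial_sums = [sum(records) for records in parts.values()]
    values_moved = len(partial_sums)
    return sum(partial_sums), values_moved


total_old, moved_old = old_fashion(partitions)
total_new, moved_new = hdfs_fashion(partitions)
print(total_old, moved_old)
print(total_new, moved_new)
```

Both approaches produce the same total, but the second moves one partial result per node instead of every record.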

MapReduce – The Programming Framework

• Originated at Google, which describes it as a simple programming model for processing large-scale data in a parallel and distributed way

https://www.tutorialspoint.com/map_reduce/

MapReduce – The Hello World

https://www.tutorialspoint.com/map_reduce/

MapReduce – The Hello World Scripts

mapper

reducer
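
The original mapper and reducer scripts did not survive in this text, so the following is only a minimal word-count sketch in Python, not the author's code. The function names and the in-process `sorted` call, which stands in for Hadoop's shuffle/sort step, are assumptions; on a real cluster the two phases would run as separate tasks.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(pairs):
    """Reduce phase: sum the counts of each word.

    Expects pairs sorted by key, which the shuffle/sort step between
    map and reduce normally guarantees.
    """
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, a simulated shuffle/sort, then reduce, in one process."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


counts = word_count(["Deer Bear River", "Car Car River", "Deer Car Bear"])
print(counts)
```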

We need to make the elephant faster

Inside The Zoo

• Coordination of configuration, data naming, and synchronization of Hadoop projects

• Monitoring and management of Hadoop clusters and nodes

• A tool for data ingestion from various data sources into Hadoop

• A workflow scheduler tool to manage MapReduce jobs

• A tool for managing data transfer between Hadoop and a Relational Database Management System (RDBMS)

• A data mining/machine learning library that works directly with Hadoop data. You can also use R with its RHadoop library

• A scripting language for analyzing large datasets, compiled to MapReduce jobs

• Facilitates easy ad-hoc queries and summarization of data stored in HDFS, through an SQL-like interface named HiveQL

• A non-relational, distributed database system that runs on top of the HDFS file system

Implementing Big Data

The Lambda Architecture

http://jameskinley.tumblr.com

Lambda Architecture Workflow

• All data entering the system is dispatched to both the batch layer and the speed layer for processing.

• The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.

• The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.

• The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.

• Any incoming query can be answered by merging results from batch views and real-time views.
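
The workflow above can be sketched as a single-process Python toy. The class `LambdaSketch` and its method names are hypothetical, a page-hit counter stands in for the views, and clearing the real-time view right after a batch run is a simplification of how a real system expires recent data.

```python
from collections import Counter


class LambdaSketch:
    def __init__(self):
        self.master = []                # batch layer: immutable, append-only raw data
        self.batch_view = Counter()     # serving layer: indexed, pre-computed view
        self.realtime_view = Counter()  # speed layer: recent data only

    def ingest(self, event):
        """Dispatch each incoming event to both the batch and the speed layer."""
        self.master.append(event)
        self.realtime_view[event] += 1

    def run_batch(self):
        """Recompute the batch view from the whole master dataset.

        Once the batch view has caught up, the speed layer's data is
        redundant and is discarded.
        """
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()

    def query(self, key):
        """Answer a query by merging the batch view and the real-time view."""
        return self.batch_view[key] + self.realtime_view[key]


system = LambdaSketch()
for page in ["home", "about", "home"]:
    system.ingest(page)
system.run_batch()          # slow, high-latency recomputation
system.ingest("home")       # arrives after the batch run
hits = system.query("home")  # merges batch and real-time counts
print(hits)
```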

Plotting the Arsenal onto the Lambda Architecture

Batch Layer

Speed Layer

Serving Layer Query

Simpler Implementation – Data Warehouse

Data Warehouse Using the Hadoop Framework

Schema

Sqoop (SQL to Hadoop)

• Sqoop makes importing MySQL table data into HDFS straightforward
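
As a rough illustration, the sketch below assembles a `sqoop import` command line in Python without running it. The JDBC URL, user name, table, and target directory are hypothetical placeholders, and the helper `sqoop_import_command` is not part of Sqoop; check the options against the Sqoop version you have installed.

```python
import shlex


def sqoop_import_command(jdbc_url, username, table, target_dir, mappers=1):
    """Build (but do not run) a `sqoop import` command line."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                          # prompt for the password at run time
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]


cmd = sqoop_import_command(
    jdbc_url="jdbc:mysql://dbhost/sales",  # hypothetical database
    username="etl_user",                   # hypothetical account
    table="orders",                        # hypothetical table
    target_dir="/warehouse/orders",        # hypothetical HDFS directory
)
print(shlex.join(cmd))
```

On a machine where Sqoop is installed, the list could be passed to `subprocess.run(cmd, check=True)` to start the import.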

Apache Pig

• Pig provides an engine for executing data flows in parallel on Hadoop and makes use of HDFS and MapReduce

• Pig philosophy → "Pigs eat anything": input data can come in any format, including the popular ones.

• Pig includes a language called Pig Latin for expressing data flows

‒ Pig Latin includes operators for many of the traditional data operations (so they need not be re-invented as in Hadoop): JOIN, ORDER BY, FILTER, FOREACH, GROUP, LOAD, and STORE.

‒ Express data transformation tasks in just a few lines of code

‒ 10 lines of Pig Latin = ~200 lines of Java

‒ Simplifies the process of writing MapReduce programs

• Most importantly, you can create user-defined functions (UDFs) in Pig!
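
To give a feel for what a data flow looks like, here is a plain-Python imitation of a short pipeline, with each step labelled by the Pig Latin operator it stands in for. This is not Pig Latin and does not run on Hadoop; the visit records and the 5-second threshold are invented for the example.

```python
from collections import defaultdict

# LOAD: rows as (user, page, seconds) tuples; the data is made up.
visits = [
    ("ana", "home", 12),
    ("budi", "home", 3),
    ("ana", "docs", 40),
    ("citra", "docs", 25),
    ("budi", "docs", 1),
]

# FILTER: keep visits that lasted at least 5 seconds.
long_visits = [row for row in visits if row[2] >= 5]

# GROUP ... BY page.
by_page = defaultdict(list)
for user, page, seconds in long_visits:
    by_page[page].append(seconds)

# FOREACH ... GENERATE page, COUNT, AVG.
summary = {page: (len(secs), sum(secs) / len(secs))
           for page, secs in by_page.items()}

# ORDER ... BY visit count, descending.
ranked = sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True)
print(ranked)
```

Each step consumes the relation produced by the previous one, which is the shape a Pig script has as well.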

How Pig Works in a Nutshell

Pig Latin Example – The Wordcount

Apache Hive

• Hive is an open source, petabyte-scale data warehousing framework based on Hadoop that was developed by the Data Infrastructure Team at Facebook

• MapReduce is powerful, but writing MapReduce programs is like asking application developers to specify the database's physical execution plan in their own code!

• Combines the simplicity of SQL with the power of MapReduce

‒ Efficient implementations of SQL statements on top of MapReduce via the Hive Query Language (HiveQL)

Hive Architecture

http://www.hadooptpoint.com/hadoop-hive-architecture/

Data Warehouse Using the Hadoop Framework

Schema

Beyond Hadoop

Lambda Architecture

Batch Layer

Speed Layer

Serving Layer Query/Viz

Another Data Processing Framework

SMACK Stack

Spark Mesos Akka Cassandra Kafka

http://www.natalinobusa.com

Pick up your own weapon

References

• Thomas Bernardz, 2015, Big Data Workshop – Queensland University of Technology

• Guido Schmutz, 2014, Big Data and Fast Data, http://www.slideshare.net/gschmutz/big-data-and-fast-data-lambda-architecture-in-action

• James Kinley, 2015, The Lambda architecture: principles for architecting realtime Big Data systems, http://jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for

• https://www.tutorialspoint.com

• http://hadoopoints.com

• http://hortonworks.com

Thank you.

Join the conversation, contribute & have a lot of fun!

www.opensuse.org

General Disclaimer

This document is not to be construed as a promise by any participating organisation to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. openSUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for openSUSE products remains at the sole discretion of openSUSE. Further, openSUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All openSUSE marks referenced in this presentation are trademarks or registered trademarks of SUSE LLC, in the United States and other countries. All third-party trademarks are the property of their respective owners.

License

This slide deck is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license. It can be shared and adapted for any purpose (even commercially) as long as Attribution is given and any derivative work is distributed under the same license.

Details can be found at https://creativecommons.org/licenses/by-sa/4.0/

Credits

Template

Richard Brown

Design & Inspiration

openSUSE Design Team

http://opensuse.github.io/branding-guidelines/
