Big Data, Hadoop, NoSQL DB - Introduction
DESCRIPTION

The presentation is about a new approach to storing and processing distributed data.

TRANSCRIPT

Page 1: Big data, Hadoop, NoSQL DB - introduction

Big Data, Hadoop, NoSQL DB - Introduction

Ing. Ľuboš Takáč, PhD.

University of Žilina

November, 2013

Page 2: Big data, Hadoop, NoSQL DB - introduction

Overview

• Big Data

• Hadoop

– HDFS

– Map Reduce Paradigm

• NoSQL Databases

Page 3: Big data, Hadoop, NoSQL DB - introduction

Big Data

• the origin of the term “BIG DATA” is unclear

• there are a lot of definitions, e.g.:

“Big data is now almost universally understood to refer to the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored due to the limitations of traditional data management technologies.” (Matt Aslett)

Page 4: Big data, Hadoop, NoSQL DB - introduction

Big Data

• can be defined by the (original) 3 Vs

– Volume (a lot of data)

– Variety (data in various structures)

– Velocity (fast processing)

– other Vs

• Veracity (IBM)

• Value (Oracle)

• Etc.

Page 5: Big data, Hadoop, NoSQL DB - introduction

Where Are Big Data Generated

Page 6: Big data, Hadoop, NoSQL DB - introduction

Sample of Big Data Use Cases Today

Page 7: Big data, Hadoop, NoSQL DB - introduction

Hadoop

• a new approach to storing and processing distributed data

• open source project based on Google's GFS (Google File System) and the Map Reduce paradigm

– Google published papers on GFS and Map Reduce in 2003-2004

• an open source community led by Doug Cutting applied these tools to the open source search engine Nutch

• in 2006 it became a project of its own, named Hadoop

Page 8: Big data, Hadoop, NoSQL DB - introduction

A Different Approach to Data Processing

• powerful hardware (traditional scale-up) vs. commodity hardware (Hadoop's scale-out)

Page 9: Big data, Hadoop, NoSQL DB - introduction

HDFS (Hadoop Distributed File System)

• the core part of Hadoop

• open source implementation of Google's GFS (Google File System)

• designed for commodity hardware

• responsible for distributing files throughout the cluster (the machines connected in a Hadoop installation)

• designed for high throughput rather than low latency

• typical files are gigabytes in size

• files are broken down into blocks (64 MB or 128 MB)

• blocks are replicated (typically 3 replicas)

• rack aware; write once, append only

• fault tolerant

Page 10: Big data, Hadoop, NoSQL DB - introduction

HDFS – example of using

• $ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hadoop/gutenberg

– (the target path lives in the distributed file system; after copying, every node in the cluster can access the files)

• $ bin/hadoop dfs -ls /user/hadoop

– (the HDFS path can be browsed with familiar shell-style commands; the same operations are also available programmatically, as sketched below)
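The same two operations can also be done from Java through Hadoop's FileSystem API. A minimal sketch, assuming the cluster settings (core-site.xml, hdfs-site.xml) are on the classpath; the class name HdfsCopyExample is only for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Load the cluster configuration and connect to the default file system (HDFS).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hadoop/gutenberg
        fs.copyFromLocalFile(new Path("/tmp/gutenberg"), new Path("/user/hadoop/gutenberg"));

        // Equivalent of: bin/hadoop dfs -ls /user/hadoop
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
    }
}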

Page 11: Big data, Hadoop, NoSQL DB - introduction

Map Reduce Paradigm

• processing of data stored in HDFS

• map task – works locally on one part (split) of the overall data

• reduce task – collects and processes the results of the map tasks

Page 12: Big data, Hadoop, NoSQL DB - introduction

Map Reduce Example “Hello World”

• text files stored in HDFS

• word count – counting the frequency of words

Page 13: Big data, Hadoop, NoSQL DB - introduction

Map Reduce Example (Code)

[Code listing shown as an image; the source is split into a Map phase and a Reduce phase]
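The code itself did not survive the transcript. Below is a minimal sketch of the classic WordCount job, closely following the standard Apache Hadoop tutorial example (the class names TokenizerMapper and IntSumReducer come from that tutorial, not from the slide): the mapper emits (word, 1) pairs, the reducer sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the local input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // pre-aggregate locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as WordCount.jar, a job like this is what the execution command on the next slide runs.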

Page 14: Big data, Hadoop, NoSQL DB - introduction

Map Reduce Example (How it works)

Page 15: Big data, Hadoop, NoSQL DB - introduction

Map Reduce Task (Execution)

• $ bin/hadoop jar WordCount.jar /user/hadoop/input_dir /user/hadoop/output_dir

• $ bin/hadoop dfs -cat /user/hadoop/output_dir/part-r-00000

Page 16: Big data, Hadoop, NoSQL DB - introduction

Map Reduce Task – Monitoring & Debugging

• Hadoop provides an interactive web interface for monitoring jobs and the cluster

• log files

Page 17: Big data, Hadoop, NoSQL DB - introduction
Page 18: Big data, Hadoop, NoSQL DB - introduction
Page 19: Big data, Hadoop, NoSQL DB - introduction

Hadoop Ecosystem

• the other tools that can be used with Hadoop (or were made for Hadoop)

Page 20: Big data, Hadoop, NoSQL DB - introduction

Hadoop Ecosystem

• Hadoop (HDFS, Map Reduce framework)

• Avro (data serialization)

• Chukwa (monitoring large clustered systems)

• Flume (data collection and aggregation)

• HBase (real-time read and write database)

• Hive (data summarization and querying)

• Lucene (text search)

• Pig (programming and query language)

• Sqoop (data transfer between Hadoop and relational databases)

• Oozie (workflow and job orchestration)

• etc.

Page 21: Big data, Hadoop, NoSQL DB - introduction

Hadoop Distributions

• open source Apache distribution (hard to configure), http://hadoop.apache.org/

• commercial solutions

– debugged, ready-made solutions with support

– include proprietary software and hardware

– user-friendly interfaces, also available in the cloud

– IBM

• InfoSphere BigInsights

– Cloudera

– ORACLE

• Exadata

• Exalytics

Page 22: Big data, Hadoop, NoSQL DB - introduction

NoSQL Databases

• SQL – Traditional relational DBMS

• not every data management/analysis problem is best solved exclusively using a traditional relational DBMS

• NoSQL = No SQL = not using traditional relational DBMS

• NoSQL = not only SQL

• NoSQL databases are not a substitute for SQL DBMS, and they do not even try to replace them

• often used for Big Data

Page 23: Big data, Hadoop, NoSQL DB - introduction

NoSQL Databases

• designed for fast retrieval and appending operations

– no fixed data schema (schema-less)

• types

– document store

– graph databases

– key-value store

– etc.

• a key-value store is like a relational table with just two columns, key and value (a minimal illustration follows below)
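To make the key-value model concrete, here is a sketch of the tiny API such a store exposes, backed by an in-memory map purely for illustration. TinyKeyValueStore is a made-up name; real stores (Dynamo, HBase, Cassandra, ...) add persistence, partitioning, and replication behind the same get/put interface.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: the essential API of a key-value store is just put/get/delete.
public class TinyKeyValueStore {
    private final Map<String, String> data = new ConcurrentHashMap<>();

    public void put(String key, String value) { data.put(key, value); }

    public String get(String key) { return data.get(key); }

    public void delete(String key) { data.remove(key); }

    public static void main(String[] args) {
        TinyKeyValueStore store = new TinyKeyValueStore();
        store.put("user:42:name", "Alice");            // keys are often structured strings
        store.put("user:42:city", "Zilina");
        System.out.println(store.get("user:42:name")); // prints "Alice"
    }
}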

Page 24: Big data, Hadoop, NoSQL DB - introduction

NoSQL Databases

• advantages

– low latency, high throughput

– highly parallelizable, massive scalability

– simplicity of design, easy to set up

– relaxed consistency => higher performance and availability

• disadvantages

– no declarative query language => more programming

– relaxed consistency => fewer guarantees

– absence of model => data model is inside the application (a big step back)

• examples: MongoDB, Neo4j, Dynamo, HBase, AllegroGraph, Cassandra, etc.

Page 25: Big data, Hadoop, NoSQL DB - introduction

Summary

• Big Data

– typically unstructured data, generated e.g. by sensors and applications, with untapped potential

– often left unused before

– volume, variety, velocity => hard to process it by traditional technologies

• Hadoop

– open source technology for storing and processing distributed data

– processing Big Data on commodity hardware cluster

– HDFS, Map Reduce (and the other components of Hadoop Ecosystem)

• NoSQL Databases

– not using traditional relational DBMS

– typically key-value stores, simple to set up and use

– designed for fast retrieval and appending operations

– highly parallelizable

Page 26: Big data, Hadoop, NoSQL DB - introduction

References • [1] JP. Dijcks, Oracle: Big Data for the Enterprise, Jan. 2012.

• [2] Ľ. Takáč, Data Processing over Very Large Databases, PhD thesis, 2013.

• [3] O. Dolák, Big Data, http://www.systemonline.cz, 2012.

• [4] P. Zikopoulos, D. Deroos, K. Parasuraman, T. Deutsch, D. Corrigan, J. Giles, Harness the Power of Big Data, ISBN 978-0-07-180817-0, 2013.

• [5] http://www.go-globe.com, 2013.

• [6] T. Kanik, M. Kováč, NOSQL - Non-Relational Database Systems as the New Generation of DBMS, OSSConf, 2012.

• [7] http://wiki.apache.org/hadoop, 2013.

• [8] http://hadoop.apache.org, 2013.

• [9] L22: SC Report, Map Reduce, The University of Utah

• [10] http://bigdatauniversity.com, 2013.

• [11] http://en.wikipedia.org/wiki/NoSQL

Page 27: Big data, Hadoop, NoSQL DB - introduction

Thank you for your attention!

[email protected]