Big Data, Hadoop, NoSQL DB - Introduction
DESCRIPTION

The presentation is about a new approach to storing and processing distributed data.

TRANSCRIPT

Page 1: Big data, Hadoop, NoSQL DB - introduction

Big Data, Hadoop, NoSQL DB - Introduction

Ing. Ľuboš Takáč, PhD.

University of Žilina

November, 2013

Page 2: Big data, Hadoop, NoSQL DB - introduction

Overview

• Big Data

• Hadoop

– HDFS

– Map Reduce Paradigm

• NoSQL Databases

Page 3: Big data, Hadoop, NoSQL DB - introduction

Big Data

• the origin of the term “BIG DATA” is unclear

• there are a lot of definitions, e.g.:

“Big data is now almost universally understood to refer to the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored due to the limitations of traditional data management technologies.” (Matt Aslett)

Page 4: Big data, Hadoop, NoSQL DB - introduction

Big Data

• can be defined by the (original) 3 Vs

– Volume (a lot of data)

– Variety (data in various structures)

– Velocity (fast processing)

– other Vs

• Veracity (IBM)

• Value (Oracle)

• Etc.

Page 5: Big data, Hadoop, NoSQL DB - introduction

Where Are Big Data Generated

Page 6: Big data, Hadoop, NoSQL DB - introduction

Sample of Big Data Use Cases Today

Page 7: Big data, Hadoop, NoSQL DB - introduction

Hadoop

• a new approach to storing and processing distributed data

• open source project based on Google's GFS (Google File System) and the Map Reduce paradigm

– Google published papers on GFS and Map Reduce in 2003-2004

• an open source community led by Doug Cutting applied these tools to the open source search engine Nutch

• in 2006 it became a project of its own, named Hadoop

Page 8: Big data, Hadoop, NoSQL DB - introduction

A Different Approach to Data Processing

• powerful hardware (traditional scale-up) vs. commodity hardware (Hadoop's scale-out)

Page 9: Big data, Hadoop, NoSQL DB - introduction

HDFS (Hadoop Distributed File System)

• the core part of Hadoop

• open source implementation of Google's GFS (Google File System)

• designed for commodity hardware

• responsible for distributing files throughout the cluster (the machines connected in a Hadoop installation)

• designed for high throughput rather than low latency

• typical files are gigabytes in size

• files are broken down into blocks (64 MB or 128 MB)

• blocks are replicated (typically 3 replicas)

• rack aware; write once, append only

• fault tolerant

Page 10: Big data, Hadoop, NoSQL DB - introduction

HDFS – example of using

• $ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hadoop/gutenberg

– (the target path lives in the distributed file system; after copying, every node in the cluster can access the files)

• $ bin/hadoop dfs -ls /user/hadoop

– (the HDFS path can be browsed with familiar shell-style commands; the same operations are also available programmatically, as sketched below)
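The same two operations can also be done from Java through Hadoop's FileSystem API. A minimal sketch, assuming the cluster settings (core-site.xml, hdfs-site.xml) are on the classpath; the class name HdfsCopyExample is only for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Load the cluster configuration and connect to the default file system (HDFS).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hadoop/gutenberg
        fs.copyFromLocalFile(new Path("/tmp/gutenberg"), new Path("/user/hadoop/gutenberg"));

        // Equivalent of: bin/hadoop dfs -ls /user/hadoop
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
    }
}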

Page 11: Big data, Hadoop, NoSQL DB - introduction

Map Reduce Paradigm

• processing of data stored in HDFS

• map task – works locally on one part (split) of the overall data

• reduce task – collects and processes the results of the map tasks

Page 12: Big data, Hadoop, NoSQL DB - introduction

Map Reduce Example “Hello World”

• text files stored in HDFS

• word count – counting the frequency of words

Page 13: Big data, Hadoop, NoSQL DB - introduction

Map Reduce Example (Code)

[Code listing shown as an image; the source is split into a Map phase and a Reduce phase]
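The code itself did not survive the transcript. Below is a minimal sketch of the classic WordCount job, closely following the standard Apache Hadoop tutorial example (the class names TokenizerMapper and IntSumReducer come from that tutorial, not from the slide): the mapper emits (word, 1) pairs, the reducer sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the local input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // pre-aggregate locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as WordCount.jar, a job like this is what the execution command on the next slide runs.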

Page 14: Big data, Hadoop, NoSQL DB - introduction

Map Reduce Example (How it works)

Page 15: Big data, Hadoop, NoSQL DB - introduction

Map Reduce Task (Execution)

• $ bin/hadoop jar WordCount.jar /user/hadoop/input_dir /user/hadoop/output_dir

• $ bin/hadoop dfs -cat /user/hadoop/output_dir/part-r-00000

Page 16: Big data, Hadoop, NoSQL DB - introduction

Map Reduce Task – Monitoring & Debugging

• Hadoop provides an interactive web interface for monitoring jobs and the cluster

• log files

Page 17: Big data, Hadoop, NoSQL DB - introduction
Page 18: Big data, Hadoop, NoSQL DB - introduction
Page 19: Big data, Hadoop, NoSQL DB - introduction

Hadoop Ecosystem

• the other tools that can be used with Hadoop (or were made for Hadoop)

Page 20: Big data, Hadoop, NoSQL DB - introduction

Hadoop Ecosystem

• Hadoop (HDFS, Map Reduce framework)

• Avro (data serialization)

• Chukwa (monitoring large clustered systems)

• Flume (data collection and aggregation)

• HBase (real-time read and write database)

• Hive (data summarization and querying)

• Lucene (text search)

• Pig (programming and query language)

• Sqoop (data transfer between Hadoop and relational databases)

• Oozie (workflow and job orchestration)

• etc.

Page 21: Big data, Hadoop, NoSQL DB - introduction

Hadoop Distributions

• open source Apache distribution (hard to configure), http://hadoop.apache.org/

• commercial solutions

– debugged, ready-made solutions with support

– include proprietary software and hardware

– user-friendly interfaces, also available in the cloud

– IBM

• InfoSphere BigInsights

– Cloudera

– ORACLE

• Exadata

• Exalytics

Page 22: Big data, Hadoop, NoSQL DB - introduction

NoSQL Databases

• SQL – Traditional relational DBMS

• not every data management/analysis problem is best solved exclusively using a traditional relational DBMS

• NoSQL = No SQL = not using traditional relational DBMS

• NoSQL = not only SQL

• NoSQL databases are not a substitute for SQL DBMS, and they do not even try to replace them

• often used for Big Data

Page 23: Big data, Hadoop, NoSQL DB - introduction

NoSQL Databases

• designed for fast retrieval and appending operations

– no fixed data schema (schema-less)

• types

– document store

– graph databases

– key-value store

– etc.

• a key-value store is like a relational table with just two columns, key and value (a minimal illustration follows below)
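To make the key-value model concrete, here is a sketch of the tiny API such a store exposes, backed by an in-memory map purely for illustration. TinyKeyValueStore is a made-up name; real stores (Dynamo, HBase, Cassandra, ...) add persistence, partitioning, and replication behind the same get/put interface.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: the essential API of a key-value store is just put/get/delete.
public class TinyKeyValueStore {
    private final Map<String, String> data = new ConcurrentHashMap<>();

    public void put(String key, String value) { data.put(key, value); }

    public String get(String key) { return data.get(key); }

    public void delete(String key) { data.remove(key); }

    public static void main(String[] args) {
        TinyKeyValueStore store = new TinyKeyValueStore();
        store.put("user:42:name", "Alice");            // keys are often structured strings
        store.put("user:42:city", "Zilina");
        System.out.println(store.get("user:42:name")); // prints "Alice"
    }
}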

Page 24: Big data, Hadoop, NoSQL DB - introduction

NoSQL Databases

• advantages

– low latency, high throughput

– highly parallelizable, massive scalability

– simplicity of design, easy to set up

– relaxed consistency => higher performance and availability

• disadvantages

– no declarative query language => more programming

– relaxed consistency => fewer guarantees

– absence of model => data model is inside the application (a big step back)

• examples: MongoDB, Neo4j, Dynamo, HBase, AllegroGraph, Cassandra, etc.

Page 25: Big data, Hadoop, NoSQL DB - introduction

Summary

• Big Data

– typically unstructured data, generated e.g. by sensors and applications, with untapped potential

– often left unused before

– volume, variety, velocity => hard to process it by traditional technologies

• Hadoop

– open source technology for storing and processing distributed data

– processing Big Data on commodity hardware cluster

– HDFS, Map Reduce (and the other components of Hadoop Ecosystem)

• NoSQL Databases

– not using traditional relational DBMS

– typically key-value stores, simple to set up and use

– designed for fast retrieval and appending operations

– highly parallelizable

Page 26: Big data, Hadoop, NoSQL DB - introduction

References • [1] JP. Dijcks, Oracle: Big Data for the Enterprise, Jan. 2012.

• [2] Ľ. Takáč, Data Processing over Very Large Databases, PhD thesis, 2013.

• [3] O. Dolák, Big Data, http://www.systemonline.cz, 2012.

• [4] P. Zikopoulos, D. Deroos, K. Parasuraman, T. Deutsch, D. Corrigan, J. Giles, Harness the Power of Big Data, ISBN 978-0-07-180817-0, 2013.

• [5] http://www.go-globe.com, 2013.

• [6] T. Kanik, M. Kováč, NOSQL - Non-Relational Database Systems as the New Generation of DBMS, OSSConf, 2012.

• [7] http://wiki.apache.org/hadoop, 2013.

• [8] http://hadoop.apache.org, 2013.

• [9] L22: SC Report, Map Reduce, The University of Utah

• [10] http://bigdatauniversity.com, 2013.

• [11] http://en.wikipedia.org/wiki/NoSQL

Page 27: Big data, Hadoop, NoSQL DB - introduction

Thank you for your attention!

[email protected]