big data

IN THE NAME OF GOD BIG DATA ANALYTICS HADOOP AND CASSANDRA Author: Samira Riki

Upload: samira-riki

Post on 17-Aug-2015

Category:

Engineering


TRANSCRIPT

An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time.

NYSE generates about one terabyte of new trade data per day, which feeds stock trading analytics that determine trends for optimal trades.
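
To put the two figures above on a common footing, the short sketch below converts them to sustained data rates. It assumes decimal (SI) terabytes and a constant rate over the stated period; the helper name `bytes_per_second` is purely illustrative.

```python
TB = 10**12  # assumption: decimal terabyte (SI), in bytes


def bytes_per_second(total_bytes: float, seconds: float) -> float:
    """Average rate needed to produce total_bytes within the given time."""
    return total_bytes / seconds


# 10 TB for every 30 minutes of flying time
jet_rate = bytes_per_second(10 * TB, 30 * 60)

# about 1 TB of new trade data per day
nyse_rate = bytes_per_second(1 * TB, 24 * 60 * 60)

print(f"Jet sensors: {jet_rate / 10**9:.2f} GB/s")
print(f"NYSE trades: {nyse_rate / 10**6:.2f} MB/s")
```

Under these assumptions the jet produces several gigabytes per second, while the exchange averages roughly ten megabytes per second over a full day.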


Twitter has over 500 million registered users.

79% of US Twitter users are more likely to recommend brands they follow.

67% of US Twitter users are more likely to buy from brands they follow.

57% of all companies that use social media for business use

Twitter.

“Big Data is the frontier of a firm's ability to

store, process, and access (SPA) all the data

it needs to operate effectively, make

decisions, reduce risks, and serve

customers.”

... How big is BIG?

Let’s look at

Big Data

in a different way…

Byte : one grain of rice

Kilobyte : a cup of rice

Megabyte : 8 bags of rice

Gigabyte : 3 semi trucks

Terabyte : 2 container ships

Petabyte : blankets Manhattan

Exabyte : blankets the West Coast states

Zettabyte : fills the Pacific Ocean

Yottabyte : an Earth-sized rice ball!

Era labels shown alongside the scale: Hobbyist, Desktop, Internet, Big Data, The Future?
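
Each step on the scale above is a factor of 1,000 over the previous one when the SI decimal prefixes are used. The sketch below makes that arithmetic explicit; the names `UNITS`, `unit_size`, and `humanize` are illustrative helpers, not part of any library.

```python
# Unit names in the order used on the slide; each is 1000x the previous one
# (assumption: SI decimal prefixes rather than binary 1024-based units).
UNITS = [
    "Byte", "Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
    "Petabyte", "Exabyte", "Zettabyte", "Yottabyte",
]


def unit_size(name: str) -> int:
    """Number of bytes in one unit of the given name."""
    return 1000 ** UNITS.index(name)


def humanize(n_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(n_bytes)
    index = 0
    while value >= 1000 and index < len(UNITS) - 1:
        value /= 1000
        index += 1
    return f"{value:g} {UNITS[index]}"


print(unit_size("Yottabyte"))                 # bytes in one yottabyte
print(humanize(10 * unit_size("Terabyte")))   # the jet example, as a label
```

A yottabyte is therefore 1000 to the 8th power bytes, which is why the analogy jumps from a grain of rice to a planet-sized ball.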

Process data in parallel? Not simple.

An idea: parallelism

A problem: Parallelism is Hard

Synchronization

Deadlock

Limited bandwidth

Timing issues and co-ordination

Split and Aggregation

Computers are complicated

Driver failure

Data availability
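
Two of the items above, synchronization and split-and-aggregate, can be seen even on a single machine. The sketch below is a minimal illustration using Python's standard `threading` and `concurrent.futures` modules; `split`, `SafeCounter`, and `parallel_sum` are hypothetical names chosen for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Split data into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


class SafeCounter:
    """Shared total guarded by a lock so concurrent updates are not lost."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def add(self, amount):
        with self._lock:  # synchronization: one writer at a time
            self.value += amount


def parallel_sum(data, workers=4):
    counter = SafeCounter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every chunk and re-raises any worker error
        list(pool.map(lambda chunk: counter.add(sum(chunk)), split(data, workers)))
    return counter.value  # aggregation of the partial sums


numbers = list(range(1, 1001))
total = parallel_sum(numbers)
print(total)
```

Even in this toy case the shared total needs a lock; across many machines, the same concerns are joined by network limits and hardware failures.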

Hey! We have Distributed computing!!!

Yes, we have distributed computing, and it comes with challenges of its own:

Resource sharing

Concurrency

Fault tolerance

Heterogeneity

Transparency

Hadoop comes in to address most (but not all) of these challenges.

Hadoop origin


• An elephant can’t jump, but it can carry a heavy load!

• Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each providing computation and storage.

• Hadoop is an open-source implementation of Google's MapReduce and GFS (the Google File System, a distributed file system).

• Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library.

Hadoop Architecture


Hadoop is designed and built on two independent frameworks.

Hadoop = HDFS + MapReduce

HDFS (storage and file system): HDFS is a reliable distributed file system that provides high-throughput access to data.

MapReduce (processing): MapReduce is a framework for performing high-performance distributed data processing using the divide-and-aggregate programming paradigm.
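
The divide-and-aggregate idea can be sketched in a few lines. The example below is a single-process word count that mimics the map, shuffle, and reduce stages; it is not the Hadoop API, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

In a real cluster the map calls run on different machines near the data, and the shuffle moves intermediate pairs across the network; here everything runs in one process purely to show the data flow.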

Hadoop has a master/slave architecture for both storage and

processing.

Hadoop Master and Slave Architecture


The components of HDFS are:

NameNode

DataNode

Secondary NameNode
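
In HDFS the NameNode keeps the file system metadata (which blocks make up a file and where they are stored), the DataNodes hold the block contents, and the Secondary NameNode periodically checkpoints the NameNode's metadata rather than acting as a live standby. The toy model below sketches only the first two roles. The class names mirror the slide, but the methods `allocate` and `locate`, the helpers `write_file` and `read_file`, the 4-byte block size, and the round-robin placement are all assumptions made for illustration, not HDFS behaviour.

```python
class DataNode:
    """Stores block contents; knows nothing about file names."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class NameNode:
    """Holds only metadata: the blocks of each file and their locations."""

    def __init__(self, datanode_names, replication=2):
        self.datanode_names = list(datanode_names)
        self.replication = replication
        self.files = {}    # filename -> [(block_id, [datanode names])]
        self._cursor = 0   # toy round-robin placement (an assumption)

    def allocate(self, filename, block_count):
        placements = []
        for index in range(block_count):
            holders = [
                self.datanode_names[(self._cursor + r) % len(self.datanode_names)]
                for r in range(self.replication)
            ]
            self._cursor += 1
            placements.append((f"{filename}#{index}", holders))
        self.files[filename] = placements
        return placements

    def locate(self, filename):
        return self.files[filename]


def write_file(namenode, datanodes, filename, data, block_size=4):
    """Client role: get placements from the NameNode, send blocks to DataNodes."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placements = namenode.allocate(filename, len(chunks))
    for (block_id, holders), chunk in zip(placements, chunks):
        for name in holders:
            datanodes[name].blocks[block_id] = chunk


def read_file(namenode, datanodes, filename, failed=()):
    """Client role: read each block from the first replica that is still up."""
    parts = []
    for block_id, holders in namenode.locate(filename):
        alive = [name for name in holders if name not in failed]
        if not alive:
            raise IOError(f"all replicas of {block_id} are unavailable")
        parts.append(datanodes[alive[0]].blocks[block_id])
    return b"".join(parts)


datanodes = {name: DataNode(name) for name in ("dn0", "dn1", "dn2")}
namenode = NameNode(datanodes)
write_file(namenode, datanodes, "log.txt", b"hello hdfs!")
recovered = read_file(namenode, datanodes, "log.txt", failed={"dn0"})
print(recovered)
```

With two replicas per block on three nodes, the file survives the loss of any single DataNode, which is the essence of the fault tolerance mentioned earlier.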


The components of MapReduce are:

JobTracker

TaskTrackers
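
In this master/slave arrangement the JobTracker hands tasks to TaskTrackers and reschedules work when a tracker fails. The sketch below is a sequential toy model of that division of labour; the class names follow the slide, while `TrackerDown`, `run`, and `run_job` are hypothetical names, and a real cluster would run the tasks concurrently on separate machines.

```python
class TrackerDown(Exception):
    """Raised by a TaskTracker that cannot accept work."""


class TaskTracker:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def run(self, task, func):
        if not self.healthy:
            raise TrackerDown(self.name)
        return func(task)


class JobTracker:
    def __init__(self, trackers):
        self.trackers = trackers

    def run_job(self, tasks, func):
        """Run func on every task, retrying on the next tracker after a failure."""
        results, assignments = [], []
        for index, task in enumerate(tasks):
            for attempt in range(len(self.trackers)):
                tracker = self.trackers[(index + attempt) % len(self.trackers)]
                try:
                    results.append(tracker.run(task, func))
                    assignments.append((index, tracker.name))
                    break
                except TrackerDown:
                    continue  # reschedule the task on another tracker
            else:
                raise RuntimeError(f"task {index} failed on every tracker")
        return results, assignments


trackers = [TaskTracker("tt0"), TaskTracker("tt1", healthy=False), TaskTracker("tt2")]
results, assignments = JobTracker(trackers).run_job([[1, 2], [3, 4], [5, 6]], sum)
print(results)
print(assignments)
```

The second task is first offered to the failed tracker and then lands on the next healthy one, so the job still completes.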

Who uses Hadoop?


Amazon/A9

Facebook

Google

IBM

Joost

Last.fm

New York Times

PowerSet

Yahoo!

Twitter

LinkedIn

Cassandra


• Apache Cassandra is an open source distributed database

management system designed to handle large amounts of data

across many commodity servers, providing high availability with no

single point of failure. Cassandra offers robust support for clusters

spanning multiple datacenters.

Main features


Cassandra places a high value on performance.

In 2012, University of Toronto researchers studying NoSQL systems named Cassandra the winner on scalability: "In terms of scalability, there is a clear winner throughout our experiments."

Decentralized

Supports replication, including multi-datacenter replication

Scalability

Fault-tolerant

Query language

MapReduce support
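
The decentralized and replicated design listed above rests on every node being able to work out, without a master, which nodes own a given key. The sketch below illustrates that idea with a simple token ring. It is a simplified model, not Cassandra's implementation: the MD5-based `token` function, the `Ring` class, and the "next nodes clockwise" replica placement are assumptions chosen for clarity, and virtual nodes and datacenter awareness are left out.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key or node name to a position on the ring (MD5 as a stand-in)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        self.entries = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, key):
        """First node at or after the key's token, plus the next ones clockwise."""
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.entries)
        return [
            self.entries[(start + i) % len(self.entries)][1]
            for i in range(self.replication_factor)
        ]


nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = Ring(nodes)
owners = ring.replicas("user:42")
print(owners)
```

Because placement depends only on the hash function and the list of nodes, any node given the same membership list computes the same owners, so there is no single point of failure for routing.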

The data model


New use cases


• Geographic data

• Weather data

• RFID

• Travel schedules

• Hotel reservations

Big Data isn’t big,

if you know how to

use it.

Q?