Leveraging open source for big data stack

Upload: flytxt-bv

Post on 10-Dec-2014

Category:

Technology


DESCRIPTION

Harnessing Hadoop for Big Data - Series III - Leveraging Open Source for Big Data Stack

TRANSCRIPT

Page 1: Leveraging open source for big data stack

04/10/2023

Leveraging Open Source Big Data Stack

Copyright © 2011 Flytxt B.V. All rights reserved

Prasanth M Sasidharan

Page 2: Leveraging open source for big data stack

What is data?

Data is information in raw or unorganized form, such as letters, numbers, or symbols.

What is Big Data?

Big Data refers to large datasets that are difficult to store, manage, and analyze.

Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.

Page 3: Leveraging open source for big data stack

Data Explosion!

Page 4: Leveraging open source for big data stack

Global Data Trends

Page 5: Leveraging open source for big data stack

Big Data & Distributed Computing

Multiple servers each work on part of a job, and each performs the same task.

Key Challenges:
• Work distribution and orchestration
• Error recovery
• Scalability and management
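
The idea on this slide can be illustrated with a minimal single-machine sketch. It is not how Hadoop itself is implemented; threads stand in for servers, and the names split_into_chunks, count_words, and run_job are hypothetical helpers invented for this illustration. The sketch shows work distribution (splitting the input), every worker running the same task, and a simple form of error recovery (resubmitting a failed chunk).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def split_into_chunks(items, num_chunks):
    """Work distribution: divide the input into roughly equal parts."""
    size = max(1, -(-len(items) // num_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(lines):
    """The task that every worker runs on its own part of the job."""
    return sum(len(line.split()) for line in lines)


def run_job(lines, num_workers=4, max_attempts=3, task=count_words):
    """Hand chunks to workers, retry failed chunks, and combine the results."""
    chunks = split_into_chunks(lines, num_workers)
    attempts = {i: 0 for i in range(len(chunks))}
    results = {}
    pending = set(attempts)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while pending:
            # Submit every chunk that has not produced a result yet.
            futures = {pool.submit(task, chunks[i]): i for i in pending}
            for future in as_completed(futures):
                i = futures[future]
                attempts[i] += 1
                try:
                    results[i] = future.result()
                    pending.discard(i)
                except Exception:
                    # Error recovery: leave the chunk pending so it is
                    # resubmitted, unless it has failed too many times.
                    if attempts[i] >= max_attempts:
                        raise
    return sum(results.values())


if __name__ == "__main__":
    sample = ["big data stack", "open source", "hadoop"]
    print(run_job(sample, num_workers=2))
```

A real cluster also has to handle the third challenge on the slide, scalability and management, which this sketch leaves out entirely.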

Page 6: Leveraging open source for big data stack

FOSS in Aadhaar

Aadhaar is a 12-digit unique number that the Unique Identification Authority of India (UIDAI) will issue to all residents of India.

The number will be stored in a centralized database and linked to the basic demographic and biometric information (photograph, ten fingerprints, and iris) of each individual.

It is unique and robust enough to eliminate the large number of duplicate and fake identities in government and private databases.

Page 7: Leveraging open source for big data stack

Let's Meet a Stack!

Infrastructure Layer

Application Layer

Page 8: Leveraging open source for big data stack

Infrastructure for Big Data Analysis

What's Virtualization?

Virtualization allows multiple operating system instances to run concurrently on a single computer; it is a means of separating hardware from a single operating system.

Page 9: Leveraging open source for big data stack

What's a Hypervisor?

◦ A hypervisor, also called a virtual machine manager (VMM), is one of many hardware virtualization techniques that allow multiple operating systems, termed guests, to run concurrently on a host computer

◦ Originally developed in the 1960s for the IBM System/360

Xen® hypervisor

Page 10: Leveraging open source for big data stack

Advantages of FOSS

Flexibility and Freedom

Reliability

Auditability

Fast Deployment

Cost

Page 11: Leveraging open source for big data stack

Cost For Reproducing YouTube

                                              Capital Expenditures ($M)      Annual Expenses, ex. HW Support ($M)
System                                        Hardware   Software   Total    Staff   Support   Total
Oracle Exadata                                $147.4     $442.0     $589.4   $1.6    $97.4     $99.0
Open-source alternative, commodity hardware   $104.2     $0.0       $104.2   $2.2    $12.9     $15.1
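
The arithmetic behind this comparison can be checked with a short sketch. The figures are copied from the slide; the dictionary keys and the helper names capex, annual_expenses, and cost_ratio are assumptions made for this illustration only.

```python
# All figures are in $M and are copied from the slide's table.
COSTS = {
    "Oracle Exadata": {
        "hardware": 147.4, "software": 442.0, "staff": 1.6, "support": 97.4,
    },
    "Open source, commodity hardware": {
        "hardware": 104.2, "software": 0.0, "staff": 2.2, "support": 12.9,
    },
}


def capex(row):
    """Capital expenditure: hardware plus software."""
    return round(row["hardware"] + row["software"], 1)


def annual_expenses(row):
    """Annual expenses: staff plus support."""
    return round(row["staff"] + row["support"], 1)


def cost_ratio(metric, expensive, cheap):
    """How many times more the first system costs than the second."""
    return round(metric(COSTS[expensive]) / metric(COSTS[cheap]), 1)


if __name__ == "__main__":
    for name, row in COSTS.items():
        print(f"{name}: capex ${capex(row)}M, annual ${annual_expenses(row)}M")
    print("Capex ratio:",
          cost_ratio(capex, "Oracle Exadata", "Open source, commodity hardware"))
    print("Annual expense ratio:",
          cost_ratio(annual_expenses, "Oracle Exadata",
                     "Open source, commodity hardware"))
```

On the slide's own numbers, the proprietary option costs roughly 5.7 times as much up front and roughly 6.6 times as much per year.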

Page 12: Leveraging open source for big data stack

Get Involved!

Find out about Apache projects (http://projects.apache.org/)

Join mailing lists

Pick up a bug

Suggest ideas or fixes

Check out the latest code / download releases

Change the source files to incorporate your change or addition

Provide appropriate source code documentation and follow the project's coding conventions

Check whether the software still compiles and runs correctly

Run any unit or regression tests the software may have

Send the patch for review and committing

Page 13: Leveraging open source for big data stack

Notable Users of Hadoop (Source: http://en.wikipedia.org/wiki/Hadoop)

• Adobe
• Amazon
• AOL
• eBay
• Facebook
• Fox Interactive Media
• IBM
• Last.fm
• LinkedIn
• Meebo
• The New York Times
• Rackspace
• StumbleUpon
• Twitter
• Yahoo

References

• Hadoop: The Definitive Guide - MapReduce for the Cloud

• HBase: The Definitive Guide

• Hive Wiki (http://wiki.apache.org/hadoop/Hive)

• Pig Wiki (http://wiki.apache.org/pig/)

Page 14: Leveraging open source for big data stack

Open Source Initiatives @ Flytxt

Customization specific to our business lines

Mahout enhancements for additional machine learning algorithms

Hive customization

Oozie enhancements

Hadoop enhancements

We won the IEEE cloud computing challenge

Page 15: Leveraging open source for big data stack

THANK YOU

Page 16: Leveraging open source for big data stack

Extra Slides

Page 17: Leveraging open source for big data stack

Major Contributors to Hadoop

Page 18: Leveraging open source for big data stack

Page 19: Leveraging open source for big data stack

Quantity of Global Data (exabytes)

2005: 130
2012: 2,720
2015*: 7,910
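
To make the scale of this growth concrete, the sketch below derives growth factors and a compound annual growth rate from the three figures on the slide. The helper names growth_factor and annual_growth_rate are hypothetical, and the slide's asterisk on 2015 is carried over as a comment because the transcript does not say what it denotes.

```python
# Global data volume in exabytes, copied from the slide.
# The slide marks 2015 with an asterisk; its meaning is not stated.
DATA_EXABYTES = {2005: 130, 2012: 2720, 2015: 7910}


def growth_factor(start_year, end_year):
    """How many times larger the data volume is at end_year."""
    return DATA_EXABYTES[end_year] / DATA_EXABYTES[start_year]


def annual_growth_rate(start_year, end_year):
    """Compound annual growth rate between the two years, as a fraction."""
    years = end_year - start_year
    return growth_factor(start_year, end_year) ** (1 / years) - 1


if __name__ == "__main__":
    for start, end in [(2005, 2012), (2012, 2015), (2005, 2015)]:
        print(f"{start}-{end}: x{growth_factor(start, end):.1f} overall, "
              f"{annual_growth_rate(start, end):.1%} per year")
```

By these figures, the volume in 2012 is roughly 21 times the 2005 volume, which corresponds to growth of a little over 50% per year.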

Page 20: Leveraging open source for big data stack

Numbers behind the News!!

Twitter produces over 230 million tweets per day

Wal-Mart is logging one million transactions per hour

Facebook creates over 30 billion pieces of content ranging from web links, news, blogs, photo

India's mobile subscription base stands at 873.61 million users

India has a population of 1.21 billion

Page 21: Leveraging open source for big data stack

Let's Meet the Big Data Stack

• Oozie: open-source workflow/coordination service to manage data processing jobs for Apache Hadoop™; developed at Yahoo!

• HBase: column-store database based on Google's Bigtable; holds extremely large data sets (petabytes)

• Hive: SQL-based data warehousing application with features for analyzing very large data sets; developed at Facebook

• ZooKeeper: distributed consensus engine providing leader election, service discovery, and distributed locking / mutual exclusion

• Pig: platform for analyzing large data sets that consists of a high-level language for expressing data analysis steps

• Ganglia: a scalable distributed monitoring system for high-performance computing systems such as clusters and grids

• Apache Mahout: free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform

Page 22: Leveraging open source for big data stack
