big data and analytics - ada...

Big Data and Analytics: Hadoop Ecosystem

Dr. Abzetdin Adamov, School of Information Technology and Engineering

ADA University http://site.ada.qu.edu.az/~aadamov

Previously Covered Topics

• Key differences of Traditional and Big Data Architecture
• Transferring Computation Power against Transferring Data
• Schema on Read vs Schema on Write
• Hadoop Core – Storage: HDFS Architecture
• Hadoop Core – Processing: MapReduce Architecture

Objectives

• Vagrant + Provisioning + VirtualBox = Repeatable Multi VMs
• Hadoop 2.0 vs Hadoop 1.0
• Hadoop Ecosystem Components Classification
• Hadoop Ecosystem Components Key Features

Hadoop Ecosystem Components

Companies building on top of Hadoop

• Amazon Web Services
• Cloudera
• Hortonworks
• IBM
• Intel
• MapR Technologies
• Microsoft
• Pivotal Software
• Teradata

Powered by Apache Hadoop

• https://wiki.apache.org/hadoop/PoweredBy

• Thousands of companies and organizations run Hadoop clusters, ranging in size from several nodes to hundreds of thousands of nodes (40,000 at Yahoo)

Hadoop Core = Storage + Compute

Compute (CPU, RAM): Yet Another Resource Negotiator (YARN)

Storage: Hadoop Distributed File System (HDFS)

Hadoop 2.0 vs Hadoop 1.0

Hadoop 1.0 Bottlenecks: HDFS / MapReduce

Hadoop 2.0 Architecture

YARN / MRv2 vs MRv1 Architecture

Hadoop 2.0 vs Hadoop 1.0 – Processing

The Hadoop Ecosystem

Hadoop

Hortonworks Hadoop Distribution

Classification of Hadoop Ecosystem Components

Administration and Server Coordination Hue

Distributed Storage

Resource Management

Processing Framework

Analytics

Ambari ZooKeeper

Data Management Flume Sqoop

Workflow Engine Oozie

Workflow Engine Avro

MapReduce

Mahout

MapReduce v2

MapReduce Pig HBase

Tez Hoya

Classification of Hadoop Ecosystem Components

Hadoop Ecosystem Components

Data Management Frameworks

Framework: Description

Hadoop Distributed File System (HDFS): A Java-based, distributed file system that provides scalable, reliable, high-throughput access to application data stored across commodity servers

Yet Another Resource Negotiator (YARN): A framework for cluster resource management and job scheduling

Operations Frameworks

Framework: Description

Ambari: A Web-based framework for provisioning, managing, and monitoring Hadoop clusters

ZooKeeper: A high-performance coordination service for distributed applications

Cloudbreak: A tool for provisioning and managing Hadoop clusters in the cloud

Oozie: A server-based workflow engine used to execute Hadoop jobs

Ambari Web UI (REST)
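The REST interface behind the Ambari web UI can also be called directly. The following is a minimal sketch, not a complete client: the host name ambari.example.com and the admin/admin credentials are placeholders, and 8080 is assumed as the server port. It only builds the request that lists the managed clusters and does not send it.

```python
import base64
import urllib.request

# Hypothetical host and placeholder credentials; replace with real values.
AMBARI_BASE = "http://ambari.example.com:8080/api/v1"


def build_cluster_list_request(user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the list of managed clusters."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(f"{AMBARI_BASE}/clusters", method="GET")
    # Ambari REST calls are authenticated with HTTP Basic authentication.
    request.add_header("Authorization", f"Basic {token}")
    return request


req = build_cluster_list_request("admin", "admin")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would actually contact the server; that step is left out here so the sketch runs without a cluster.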

Data Access Frameworks

Framework: Description

Pig: A high-level platform for extracting, transforming, or analyzing large datasets

Hive: A data warehouse infrastructure that supports ad hoc SQL queries

HCatalog: A table information, schema, and metadata management layer supporting Hive, Pig, MapReduce, and Tez processing

Cascading: An application development framework for building data applications, abstracting the details of complex MapReduce programming

HBase: A scalable, distributed NoSQL database that supports structured data storage for large tables

Phoenix: A client-side SQL layer over HBase that provides low-latency access to HBase data

Accumulo: A low-latency, large table data storage and retrieval system with cell-level security

Storm: A distributed computation system for processing continuous streams of real-time data

Solr: A distributed search platform capable of indexing petabytes of data

Spark: A fast, general-purpose processing engine used to build and run sophisticated SQL, streaming, machine learning, or graph applications

Governance and Integration Frameworks

Framework: Description

Falcon: A data governance tool providing workflow orchestration, data lifecycle management, and data replication services.

WebHDFS: A REST API that uses the standard HTTP verbs to access, operate, and manage HDFS

HDFS NFS Gateway: A gateway that enables access to HDFS as an NFS mounted file system

Flume: A distributed, reliable, and highly-available service that efficiently collects, aggregates, and moves streaming data

Sqoop: A set of tools for importing and exporting data between Hadoop and RDBM systems

Kafka: A fast, scalable, durable, and fault-tolerant publish-subscribe messaging system

Atlas: A scalable and extensible set of core governance services enabling enterprises to meet compliance and data integration requirements
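To make the WebHDFS entry above concrete, the sketch below shows how common HDFS operations map onto HTTP verbs and URLs of the form /webhdfs/v1/<path>?op=<OPERATION>. The NameNode address is a hypothetical placeholder (50070 was the default NameNode HTTP port in Hadoop 2.x), and the helper name webhdfs_request is made up for this illustration. Nothing is sent over the network.

```python
from urllib.parse import quote, urlencode

# Hypothetical NameNode address; the HTTP port differs between Hadoop versions.
NAMENODE = "http://namenode.example.com:50070"

# WebHDFS operation -> HTTP verb it is issued with.
OPERATIONS = {
    "LISTSTATUS": "GET",   # list a directory
    "OPEN": "GET",         # read a file
    "MKDIRS": "PUT",       # create a directory
    "CREATE": "PUT",       # create or overwrite a file
    "APPEND": "POST",      # append to an existing file
    "DELETE": "DELETE",    # delete a file or directory
}


def webhdfs_request(op: str, path: str, **params: str) -> tuple[str, str]:
    """Return the (HTTP verb, URL) pair for a WebHDFS operation on an HDFS path."""
    verb = OPERATIONS[op]
    query = urlencode({"op": op, **params})
    return verb, f"{NAMENODE}/webhdfs/v1{quote(path)}?{query}"


for op, path in [("LISTSTATUS", "/user/alice"), ("MKDIRS", "/user/alice/logs")]:
    print(*webhdfs_request(op, path))
```

Extra query parameters are passed as keyword arguments, for example webhdfs_request("DELETE", "/tmp/old", recursive="true").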

Security Frameworks

Framework: Description

HDFS: A storage management service providing file and directory permissions, even more granular file and directory access control lists, and transparent data encryption

YARN: A resource management service with access control lists controlling access to compute resources and YARN administrative functions

Hive: A data warehouse infrastructure service providing granular access controls to table columns and rows

Falcon: A data governance tool providing access control lists that limit who may submit Hadoop jobs

Knox: A gateway providing perimeter security to a Hadoop cluster

Ranger: A centralized security framework offering fine-grained policy controls for HDFS, Hive, HBase, Knox, Storm, Kafka, and Solr

Ecosystem Component Versions

Hadoop Ecosystem Components’ Key Features

HADOOP ECOSYSTEM COMPONENTS

It's important to understand the components of the Hadoop Ecosystem in order to build the right solution for a given business problem.

Classification of the Hadoop Ecosystem Components

Hadoop is a direct answer to the problem of processing Big Data.

The Hadoop Ecosystem combines technologies that together give a clear advantage in solving data-oriented business problems.

CORE HADOOP

Hadoop Distributed File System (HDFS)
Stands for: managing big data sets with High Volume, Velocity and Variety.

MapReduce
Stands for: processing high volume distributed data

Yet Another Resource Negotiator (YARN)
Stands for: resource management, job scheduling and monitoring
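The MapReduce processing model listed above can be illustrated with the classic word count. This is a single-process sketch in plain Python, not Hadoop's Java API: the function names are chosen for this example, and the shuffle step, which Hadoop performs across the cluster between the two phases, is simulated with an in-memory dictionary.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data", "big clusters process big data"])
print(counts)
```

On a real cluster each input split is mapped on the node that stores it, and each reducer receives only the keys assigned to it; the logic per record stays the same.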

DATA ACCESS

Apache Pig
Stands for: high-level language built on top of MapReduce for analyzing large data sets and for Data Flow.

Apache Hive
Stands for: high-level query language and data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis.

DATA STORAGE

Apache HBase
Stands for: NoSQL database built for hosting large tables with billions of rows and millions of columns on top of Hadoop.

Cassandra
Stands for: NoSQL database based on key-value model designed for linear scalability and high availability.
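How a table with billions of rows and millions of columns stays manageable is easier to see in code: HBase stores only the cells that actually hold a value and keeps rows ordered by row key. The sketch below is a toy in-memory model under those two assumptions; the class SparseTable and its methods are hypothetical and are not the HBase client API.

```python
from collections import defaultdict
from typing import Any, Iterator


class SparseTable:
    """Toy model of a sparse, row-key-ordered table with family:qualifier columns."""

    def __init__(self) -> None:
        # Only cells that were written exist; absent columns cost nothing.
        self._rows: dict[str, dict[str, Any]] = defaultdict(dict)

    def put(self, row_key: str, column: str, value: Any) -> None:
        """Write one cell, addressed by row key and 'family:qualifier' column."""
        self._rows[row_key][column] = value

    def get(self, row_key: str, column: str, default: Any = None) -> Any:
        """Read one cell, returning the default when the cell was never written."""
        return self._rows.get(row_key, {}).get(column, default)

    def scan(self, start: str, stop: str) -> Iterator[tuple[str, dict[str, Any]]]:
        """Yield rows with start <= row key < stop, in row-key order."""
        for key in sorted(self._rows):
            if start <= key < stop:
                yield key, dict(self._rows[key])


table = SparseTable()
table.put("user#001", "info:name", "Aylin")
table.put("user#001", "stats:logins", 3)
table.put("user#002", "info:city", "Baku")
print(list(table.scan("user#001", "user#002")))
```

Each row here carries a different set of columns, which is the property that lets the column count grow without a fixed schema.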

INTERACTION-VISUALIZATION-DEVELOPMENT

HCatalog
Stands for: providing integration of Hive metadata for other Hadoop applications like Pig, MapReduce and others.

Lucene
Stands for: high-performance, full-featured text search engine library written entirely in Java.

Hama
Stands for: distributed framework based on Bulk Synchronous Parallel (BSP) computing for massive scientific computations like matrix, graph and network algorithms.

Crunch
Stands for: writing, testing and running MapReduce pipelines.

DATA INTELLIGENCE

Apache Drill
Stands for: low-latency SQL query engine for Hadoop and NoSQL.

Apache Mahout
Stands for: scalable machine learning library designed for building predictive analytics on Big Data. Mahout now has implementations on Apache Spark for faster in-memory computing.

DATA INTEGRATION

Apache Sqoop
Stands for: set of tools for importing and exporting data between Hadoop and relational database systems.

Apache Flume
Stands for: distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Apache Chukwa
Stands for: scalable log collector used for monitoring large distributed file systems.

MANAGEMENT, MONITORING and ORCHESTRATION

Apache Ambari
Stands for: simplifying Hadoop management by providing an interface for provisioning, managing and monitoring Apache Hadoop clusters.

Apache ZooKeeper
Stands for: maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Apache Oozie
Stands for: scheduling workflows to manage Apache Hadoop jobs.
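The idea behind workflow scheduling as listed for Oozie can be sketched as ordering jobs so that each one runs only after the jobs it depends on. Oozie itself describes workflows in XML; the example below is only a plain-Python illustration of the dependency ordering, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "train_model": {"clean"},
    "report": {"aggregate", "train_model"},
}


def execution_order(dag: dict[str, set[str]]) -> list[str]:
    """Return one valid run order in which every job follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

Jobs with no dependency between them, such as aggregate and train_model here, may appear in either order and could run in parallel on a cluster.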

Where Can We Use Machine Learning (Data Science)

Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates

Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens

Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security

Retail
• Product recommendation
• Inventory management
• Price optimization

Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis

Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels

YARN as a Data Operating System

Applications Run Natively IN Hadoop

HDFS2 (Redundant, Reliable Storage)

YARN (Cluster Resource Management)

BATCH (MapReduce)

INTERACTIVE (Tez)

STREAMING (Storm)

GRAPH (Giraph)

IN-MEMORY (Spark)

HPC MPI (OpenMPI)

EXISTING (Slider)

SEARCH (Solr)

Applications now run “in” Hadoop, instead of “on” Hadoop.

Modern Data Applications approach to Insights

Traditional Analytics: Structured & Repeatable
• Structure built to store data
• Start with hypothesis
• Test against selected data
• Analyze after landing…

Next Generation Analytics: Iterative & Exploratory
• Data is the structure
• Data leads the way
• Explore all data, identify correlations
• Analyze in motion…

Q&A?

Abzetdin Adamov, Assoc. Prof.
Email me at: aadamov@ada.edu.az
Follow me at: @
Link to me at: www.linkedin.com/in/adamov
Visit my blog at: aadamov.wordpress.com
