big data and analytics - ada...
Post on 15-Mar-2020
Big Data and Analytics: Hadoop Ecosystem
Dr. Abzetdin Adamov, School of Information Technology and Engineering
ADA University, http://site.ada.qu.edu.az/~aadamov
Previously Covered Topics
• Key differences of Traditional and Big Data Architecture
• Transferring Computation Power against Transferring Data
• Schema on Read vs Schema on Write
• Hadoop Core – Storage: HDFS Architecture
• Hadoop Core – Processing: MapReduce Architecture
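The schema-on-read versus schema-on-write topic listed above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the sample records, the `SCHEMA` definition, and the function names are hypothetical and do not come from the lecture or from any Hadoop API.

```python
import csv
import io

# Hypothetical raw input: the second record has a malformed age field.
RAW = "1,Alice,34\n2,Bob,not-a-number\n"

# Hypothetical schema: column name and the type each value is cast to.
SCHEMA = [("id", int), ("name", str), ("age", int)]

def parse_row(row):
    """Apply the schema to one record; raises ValueError on a bad value."""
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, row)}

def schema_on_write(raw):
    """Validate every record at load time; one bad record rejects the load."""
    return [parse_row(row) for row in csv.reader(io.StringIO(raw))]

def schema_on_read(raw):
    """Keep the raw text as-is and apply the schema only when it is read."""
    for row in csv.reader(io.StringIO(raw)):
        try:
            yield parse_row(row)
        except ValueError:
            yield None  # the bad record only surfaces at query time

if __name__ == "__main__":
    print(list(schema_on_read(RAW)))
```

In this sketch the schema-on-write path fails as soon as the malformed record is loaded, while the schema-on-read path accepts the raw data and reports the problem only for the record that cannot be interpreted.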
Objectives
• Vagrant + Provisioning + VirtualBox = Repeatable Multi VMs
• Hadoop 2.0 vs Hadoop 1.0
• Hadoop Ecosystem Components Classification
• Hadoop Ecosystem Components Key Features
Hadoop Ecosystem Components
Companies building on top of Hadoop
• Amazon Web Services
• Cloudera
• Hortonworks
• IBM
• Intel
• MapR Technologies
• Microsoft
• Pivotal Software
• Teradata
Powered by Apache Hadoop
• https://wiki.apache.org/hadoop/PoweredBy
• Thousands of companies and organizations run Hadoop clusters ranging in size from several to hundreds of thousands of nodes (40,000 at Yahoo)
Hadoop Core = Storage + Compute
[Figure: cluster nodes with storage, CPU, and RAM; labels: Yet Another Resource Negotiator (YARN), Hadoop Distributed File System (HDFS)]
Hadoop 2.0 vs Hadoop 1.0
Hadoop 1.0 Bottlenecks: HDFS / MapReduce
Hadoop 2.0 Architecture
YARN / MRv2 vs MRv1 Architecture
Hadoop 2.0 vs Hadoop 1.0 – Processing
The Hadoop Ecosystem
Hadoop
Hortonworks Hadoop Distribution
Classification of Hadoop Ecosystem Components
[Figure: layered diagram of the ecosystem. Layer labels: Administration and Server Coordination, Data Management, Workflow Engine, Distributed Storage, Resource Management, Processing Framework, API, Analytics. Components shown: Hue, Ambari, Zookeeper, Flume, Sqoop, Oozie, Avro, HDFS, YARN, MapReduce, MapReduce v2, Tez, Hoya, Pig, HBase, Hive, Mahout]
Hadoop Ecosystem Components
Data Management Frameworks
Framework: Description
• Hadoop Distributed File System (HDFS): A Java-based, distributed file system that provides scalable, reliable, high-throughput access to application data stored across commodity servers
• Yet Another Resource Negotiator (YARN): A framework for cluster resource management and job scheduling
Operations Frameworks
Framework: Description
• Ambari: A Web-based framework for provisioning, managing, and monitoring Hadoop clusters
• ZooKeeper: A high-performance coordination service for distributed applications
• Cloudbreak: A tool for provisioning and managing Hadoop clusters in the cloud
• Oozie: A server-based workflow engine used to execute Hadoop jobs
Ambari Web UI (REST)
Data Access Frameworks
Framework: Description
• Pig: A high-level platform for extracting, transforming, or analyzing large datasets
• Hive: A data warehouse infrastructure that supports ad hoc SQL queries
• HCatalog: A table information, schema, and metadata management layer supporting Hive, Pig, MapReduce, and Tez processing
• Cascading: An application development framework for building data applications, abstracting the details of complex MapReduce programming
• HBase: A scalable, distributed NoSQL database that supports structured data storage for large tables
• Phoenix: A client-side SQL layer over HBase that provides low-latency access to HBase data
• Accumulo: A low-latency, large table data storage and retrieval system with cell-level security
• Storm: A distributed computation system for processing continuous streams of real-time data
• Solr: A distributed search platform capable of indexing petabytes of data
• Spark: A fast, general-purpose processing engine used to build and run sophisticated SQL, streaming, machine learning, or graph applications
Governance and Integration Frameworks
Framework: Description
• Falcon: A data governance tool providing workflow orchestration, data lifecycle management, and data replication services
• WebHDFS: A REST API that uses the standard HTTP verbs to access, operate, and manage HDFS
• HDFS NFS Gateway: A gateway that enables access to HDFS as an NFS-mounted file system
• Flume: A distributed, reliable, and highly available service that efficiently collects, aggregates, and moves streaming data
• Sqoop: A set of tools for importing and exporting data between Hadoop and RDBMS systems
• Kafka: A fast, scalable, durable, and fault-tolerant publish-subscribe messaging system
• Atlas: A scalable and extensible set of core governance services enabling enterprises to meet compliance and data integration requirements
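As a rough illustration of how WebHDFS maps HDFS operations onto standard HTTP verbs, the Python sketch below builds the verb and URL for a few common operations without sending anything over the network. The `/webhdfs/v1` path prefix and the `op` query parameter follow the WebHDFS REST API; the host name, the port 9870, the paths, and the helper name `webhdfs_request` are assumptions chosen for the example.

```python
from urllib.parse import quote, urlencode

# A few WebHDFS operations and the HTTP verb each one uses.
WEBHDFS_VERBS = {
    "OPEN": "GET",            # read a file
    "LISTSTATUS": "GET",      # list a directory
    "GETFILESTATUS": "GET",   # status of a file or directory
    "MKDIRS": "PUT",          # create a directory
    "CREATE": "PUT",          # create and write a file
    "RENAME": "PUT",          # rename a file or directory
    "APPEND": "POST",         # append to a file
    "DELETE": "DELETE",       # delete a file or directory
}

def webhdfs_request(host, port, path, op, **params):
    """Return the (verb, url) pair for one operation; nothing is sent."""
    op = op.upper()
    verb = WEBHDFS_VERBS[op]
    query = urlencode({"op": op, **params})
    url = f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"
    return verb, url

if __name__ == "__main__":
    # Hypothetical NameNode address and HDFS paths.
    print(webhdfs_request("namenode.example.com", 9870, "/user/demo", "LISTSTATUS"))
    print(webhdfs_request("namenode.example.com", 9870, "/user/demo/old", "DELETE",
                          recursive="true"))
```

The returned pair can be handed to any HTTP client; the NameNode HTTP port differs between Hadoop versions and installations, so it should be taken from the cluster configuration rather than from this sketch.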
Security Frameworks
Framework: Description
• HDFS: A storage management service providing file and directory permissions, even more granular file and directory access control lists, and transparent data encryption
• YARN: A resource management service with access control lists controlling access to compute resources and YARN administrative functions
• Hive: A data warehouse infrastructure service providing granular access controls to table columns and rows
• Falcon: A data governance tool providing access control lists that limit who may submit Hadoop jobs
• Knox: A gateway providing perimeter security to a Hadoop cluster
• Ranger: A centralized security framework offering fine-grained policy controls for HDFS, Hive, HBase, Knox, Storm, Kafka, and Solr
Ecosystem Component Versions
Hadoop Ecosystem Components' Key Features
HADOOP ECOSYSTEM COMPONENTS
It is important to understand the components of the Hadoop ecosystem in order to build the right solution for a given business problem.
Classification of the Hadoop Ecosystem Components
Hadoop is a direct answer to the problem of processing Big Data.
The Hadoop ecosystem combines technologies that offer a clear advantage in solving data-oriented business problems.
CORE HADOOP
Hadoop Distributed File System (HDFS)
Stands for: managing big data sets with high Volume, Velocity, and Variety.
MapReduce
Stands for: processing high-volume distributed data.
Yet Another Resource Negotiator (YARN)
Stands for: resource management, job scheduling, and monitoring.
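To make the MapReduce processing model more concrete, here is a minimal single-process Python sketch of the map, shuffle, and reduce phases for a word count. It is not Hadoop code: the function names and the sample input are assumptions made for illustration, and a real job would run the map and reduce tasks in parallel across the cluster nodes that hold the data.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

if __name__ == "__main__":
    sample = ["big data on hadoop", "hadoop stores big data"]
    print(word_count(sample))
```

Because each call to `map_phase` depends only on its own line and each call to `reduce_phase` depends only on its own key, both phases can be distributed, which is the property the MapReduce framework exploits.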
DATA ACCESS
Apache Pig
Stands for: high-level language built on top of MapReduce for analyzing large data sets and for data flow.
Apache Hive
Stands for: high-level query language and data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
DATA STORAGE
Apache HBase
Stands for: NoSQL database built for hosting large tables with billions of rows and millions of columns on top of Hadoop.
Cassandra
Stands for: NoSQL database based on a key-value model, designed for linear scalability and high availability.
INTERACTION - VISUALIZATION - DEVELOPMENT
HCatalog
Stands for: providing integration of Hive metadata for other Hadoop applications like Pig, MapReduce, and others.
Lucene
Stands for: high-performance, full-featured text search engine library written entirely in Java.
Hama
Stands for: distributed framework based on Bulk Synchronous Parallel (BSP) computing for massive scientific computations like matrix, graph, and network algorithms.
Crunch
Stands for: writing, testing, and running MapReduce pipelines.
DATA INTELLIGENCE
Apache Drill
Stands for: low-latency SQL query engine for Hadoop and NoSQL.
Apache Mahout
Stands for: scalable machine learning library designed for building predictive analytics on Big Data. Mahout now has implementations on Apache Spark for faster in-memory computing.
DATA INTEGRATION
Apache Sqoop
Stands for: importing and exporting data between Hadoop and relational database (RDBMS) systems.
Apache Flume
Stands for: distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Apache Chukwa
Stands for: scalable log collector used for monitoring large distributed file systems.
MANAGEMENT, MONITORING and ORCHESTRATION
Apache Ambari
Stands for: simplifying Hadoop management by providing an interface for provisioning, managing, and monitoring Apache Hadoop clusters.
Apache ZooKeeper
Stands for: maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Apache Oozie
Stands for: scheduling workflows to manage Apache Hadoop jobs.
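The kind of workflow scheduling Oozie performs can be sketched as ordering a directed acyclic graph of actions so that each action runs only after the actions it depends on. The Python sketch below uses the standard-library `graphlib` module (Python 3.9+); it is not Oozie's API, and the action names and dependencies are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
WORKFLOW = {
    "import_with_sqoop": set(),
    "clean_with_pig": {"import_with_sqoop"},
    "aggregate_with_hive": {"clean_with_pig"},
    "train_with_mahout": {"clean_with_pig"},
    "publish_report": {"aggregate_with_hive", "train_with_mahout"},
}

def execution_order(actions):
    """Return one valid run order; graphlib raises CycleError on a cycle."""
    return list(TopologicalSorter(actions).static_order())

if __name__ == "__main__":
    for step, action in enumerate(execution_order(WORKFLOW), start=1):
        print(step, action)
```

The two middle actions have no dependency on each other, so a scheduler is free to run them in either order or in parallel; only the dependency constraints are fixed.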
Where Can We Use Machine Learning (Data Science)
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels
YARN as a Data Operating System
Applications Run Natively IN Hadoop
HDFS2 (Redundant, Reliable Storage)
YARN (Cluster Resource Management)
• BATCH (MapReduce)
• INTERACTIVE (Tez)
• STREAMING (Storm)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• EXISTING (Slider)
• SEARCH (Solr)
Applications now run "in" Hadoop, instead of "on" Hadoop.
Modern Data Applications approach to Insights
Traditional Analytics: Structured & Repeatable
• Structure built to store data
• Start with hypothesis; test against selected data
• Analyze after landing…
Next Generation Analytics: Iterative & Exploratory
• Data is the structure
• Data leads the way; explore all data, identify correlations
• Analyze in motion…
Q&A?
Abzetdin Adamov, Assoc. Prof.
Email me at: aadamov@ada.edu.az
Follow me at: @
Link to me at: www.linkedin.com/in/adamov
Visit my blog at: aadamov.wordpress.com