big data 101 v1

www.geekseat.com.au Agile Software Development

Welcome to “Big Data” Jungle

Welly Tambunan

([email protected])

Solution and Integration Architect Lead

Analytics & Data Warehouse Department

Outline

Big Data Overview and History

Introduction to Hadoop

Hadoop Ecosystem

Hadoop Distribution

  Cloudera

Big Data Architecture

ETL vs ELT

Talend for ETL Tools

Big Data Overview and History

Google Search Engine

Search Engine Architecture

Crawler

Indexer

Search Algorithm / Page Rank
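The three parts above can be sketched as a toy pipeline. This is only an illustration, not how Google's engine works: the `PAGES` dictionary is a hypothetical stand-in for crawler output, `build_index` plays the indexer, and `search` does plain keyword matching. Ranking such as PageRank is deliberately left out.

```python
from collections import defaultdict

# Hypothetical in-memory "web": URL -> page text (stands in for crawler output).
PAGES = {
    "a.example": "hadoop stores big data",
    "b.example": "search engines index big web pages",
    "c.example": "hadoop grew out of a search engine project",
}

def build_index(pages):
    """Indexer: map each word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Search: return URLs containing every query word, sorted by URL."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

INDEX = build_index(PAGES)
print(search(INDEX, "big data"))
```

The word-to-URLs mapping is usually called an inverted index, and it is the kind of structure Apache Lucene maintains at much larger scale.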

Doug Cutting and Search Engine

Apache Lucene

Apache Nutch

Google File System + MapReduce

Hadoop Birth

Hadoop

HDFS (Hadoop Distributed File System)

MapReduce

Hadoop = HDFS + MapReduce

Hadoop = Storage + Processing

Features

Schemaless: no predefined structure, i.e. no rigid schema with tables and columns (and column types and sizes)

Durable: once data is written it should never be lost

Capable of handling component failure without human intervention (e.g. CPU, disk, memory, network, power supply, motherboard)

Automatically rebalanced to even out disk space consumption throughout the cluster
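To make the processing half of "Hadoop = Storage + Processing" concrete, here is a minimal single-machine sketch of the MapReduce pattern as a word count. It does not use Hadoop at all. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data", "big cluster", "data data"])
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, both steps can be distributed without the workers having to coordinate with each other.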

Hadoop Ecosystem

SQL on Hadoop

  Hive

  Impala

HBase

Hue

Kafka

Oozie

Sqoop

Hadoop Ecosystem (continued)

YARN

ZooKeeper

Spark

  Batch

  Streaming

Flink

  Batch

  Streaming

Hadoop Distribution

Cloudera (Danamon choice)

Hortonworks

MapR

IBM

etc.

Cloudera Demo

Cloudera Manager

Hue

File Format

  CSV

  Parquet

  Avro

Compression

  Gzip

  Snappy

  Deflate

Read as Database from

  Hive

  Impala
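As a rough feel for what the compression options do, the sketch below compresses a small CSV payload with Gzip and with DEFLATE using only the Python standard library. The sample rows are hypothetical. Snappy, Parquet and Avro are left out because they need third-party packages, and this is not the code path a Cloudera cluster itself uses.

```python
import csv
import gzip
import io
import zlib

# Hypothetical sample rows; real data would come from files on the cluster.
ROWS = [["id", "city"]] + [[str(i), "Jakarta"] for i in range(1000)]

def to_csv_bytes(rows):
    """Serialise rows to CSV text and encode it as UTF-8 bytes."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue().encode("utf-8")

raw = to_csv_bytes(ROWS)
gz = gzip.compress(raw)         # Gzip container around DEFLATE data
deflated = zlib.compress(raw)   # zlib-wrapped DEFLATE stream

# Both codecs are lossless: decompressing returns the original bytes.
assert gzip.decompress(gz) == raw
assert zlib.decompress(deflated) == raw

print(len(raw), len(gz), len(deflated))
```

Repetitive text such as CSV tends to compress well, which is one reason the choice of format and codec matters for storage and scan cost.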

ETL vs ELT

ETL: Extract, Transform, Load

ELT: Extract, Load, Transform
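The difference is where the transform step runs. The sketch below uses an in-memory SQLite database as a stand-in for the target system. The `SOURCE` rows and the table names `sales` and `raw_sales` are assumptions made for illustration. In `etl` the cleaning happens in application code before loading. In `elt` the raw rows are loaded first and cleaned with SQL inside the database.

```python
import sqlite3

# Hypothetical raw source records: (name, amount as text).
SOURCE = [(" alice ", "10"), ("BOB", "25"), ("alice", "5")]

def etl(rows):
    """ETL: transform in application code, then load the clean rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
    clean = [(name.strip().lower(), int(amount)) for name, amount in rows]
    db.executemany("INSERT INTO sales VALUES (?, ?)", clean)
    return db

def elt(rows):
    """ELT: load raw rows as-is, then transform inside the database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_sales (name TEXT, amount TEXT)")
    db.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)
    db.execute(
        "CREATE TABLE sales AS "
        "SELECT lower(trim(name)) AS name, CAST(amount AS INTEGER) AS amount "
        "FROM raw_sales"
    )
    return db

def totals(db):
    """Total amount per customer name from the cleaned sales table."""
    query = "SELECT name, SUM(amount) FROM sales GROUP BY name ORDER BY name"
    return db.execute(query).fetchall()

print(totals(etl(SOURCE)))
print(totals(elt(SOURCE)))
```

Both paths end with the same cleaned table. ELT additionally keeps the untouched raw data in the target, so a transformation can be rerun or changed later without extracting again.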

Talend for ETL/ELT Tools

Demo for Standard Job with Database

Demo for Batch Job

Demo for Streaming Job

Announcement

https://weltam.wordpress.com/ is back with a Big Data flavor


Questions ?

Rock On !
