big data 101 v1
www.geekseat.com.au Agile Software Development
Welcome to “Big Data” Jungle
Welly Tambunan
Solution and Integration Architect Lead
Analytics & Data Warehouse Department
Outline
Big Data Overview and History
Introduction to Hadoop
Hadoop Ecosystem
Hadoop Distribution
  Cloudera
Big Data Architecture
ETL vs ELT
Talend for ETL Tools
Big Data Overview and History
Google Search Engine
Search Engine Architecture
  Crawler
  Indexer
  Search Algorithm / PageRank
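The indexer and search steps above can be sketched in a few lines of Python. This is only a toy illustration, not how Google or Lucene implement it: the page contents, URLs, and the function names `build_index` and `search` are all made up for this example, crawling is replaced by a hard-coded dictionary, and ranking (PageRank) is left out entirely.

```python
from collections import defaultdict

# Hypothetical "crawled" pages: URL -> page text (made up for illustration).
PAGES = {
    "a.example/hadoop": "hadoop stores big data",
    "b.example/search": "search engines index big pages",
    "c.example/lucene": "lucene builds a search index",
}


def build_index(pages):
    """Indexer step: map each word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Search step: return URLs that contain every query word, sorted by URL."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


index = build_index(PAGES)
print(search(index, "big data"))
print(search(index, "search index"))
```

The structure built here is usually called an inverted index: instead of scanning every page for a query, the search step only looks up the words it needs.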
Doug Cutting and Search Engines
  Apache Lucene
  Apache Nutch
  Google File System + MapReduce
  Birth of Hadoop
Hadoop
HDFS (Hadoop Distributed File System)
MapReduce
Hadoop = HDFS + MapReduce
Hadoop = Storage + Processing
Features
  Schemaless: no predefined structure, i.e. no rigid schema with tables and columns (or column types and sizes)
  Durable: once data is written, it should never be lost
  Capable of handling component failure without human intervention (e.g. CPU, disk, memory, network, power supply, motherboard)
  Automatically rebalanced to even out disk space consumption throughout the cluster
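To make the "Processing" half of Hadoop more concrete, here is a minimal sketch of the MapReduce idea as a word count in plain Python. It does not use the Hadoop API and runs in a single process; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are chosen for this example only. On a real cluster the map and reduce steps would run in parallel on many nodes, reading from and writing to HDFS.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["Hadoop stores data", "Hadoop processes data"]
counts = word_count(lines)
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across machines, which is the property Hadoop relies on.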
Hadoop Ecosystem
SQL on Hadoop
  Hive
  Impala
HBase
Hue
Kafka
Oozie
Sqoop
Hadoop Ecosystem
YARN
ZooKeeper
Spark
  Batch
  Streaming
Flink
  Batch
  Streaming
Hadoop Distribution
Cloudera (Danamon's choice)
Hortonworks
MapR
IBM
etc.
Cloudera Demo
Cloudera Manager
Hue
File Format
  CSV
  Parquet
  Avro
Compression
  Gzip
  Snappy
  Deflate
Read as Database from
  Hive
  Impala
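As a rough illustration of why the compression choice matters for files stored on the cluster, the sketch below compresses a small CSV payload with Gzip and Deflate using only the Python standard library. The sample data and the helper name `make_csv` are invented for this example. Snappy, Parquet, and Avro need third-party libraries and are not shown, and `zlib.compress` produces a zlib-wrapped Deflate stream, which is an assumption about what "Deflate" means here.

```python
import csv
import gzip
import io
import zlib


def make_csv(n_rows):
    """Build a small, repetitive CSV payload in memory (made-up sample data)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "city", "amount"])
    for i in range(n_rows):
        writer.writerow([i, "Jakarta", i % 10])
    return buf.getvalue().encode("utf-8")


raw = make_csv(1000)
gz = gzip.compress(raw)        # Gzip container around a Deflate stream
deflated = zlib.compress(raw)  # zlib-wrapped Deflate stream

sizes = {"raw": len(raw), "gzip": len(gz), "deflate": len(deflated)}
print(sizes)

# Both codecs are lossless: decompressing returns the original bytes.
roundtrip_ok = gzip.decompress(gz) == raw and zlib.decompress(deflated) == raw
print(roundtrip_ok)
```

Exact sizes depend on the data and the library version, so treat the printed numbers as indicative only.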
ETL vs ELT
ETL: Extract, Transform, Load
ELT: Extract, Load, Transform
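The difference in ordering can be sketched with an in-memory SQLite database standing in for the target system. This is a simplified illustration, not a Talend job: the table names (`sales_etl`, `sales_raw`, `sales_elt`), the sample rows, and the helper functions are all hypothetical. In the ETL path the cleaning happens in application code before loading; in the ELT path the raw rows are loaded first and the target engine does the cleaning with SQL.

```python
import sqlite3

# Hypothetical source rows: (customer, amount as text), deliberately messy.
SOURCE_ROWS = [(" alice ", "10.5"), ("BOB", "4"), ("alice", "2.5")]


def etl(conn, rows):
    """ETL: transform in application code, then load only the clean rows."""
    conn.execute("CREATE TABLE sales_etl (customer TEXT, amount REAL)")
    cleaned = [(name.strip().lower(), float(amount)) for name, amount in rows]
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)


def elt(conn, rows):
    """ELT: load raw rows as-is, then transform with SQL inside the target."""
    conn.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT lower(trim(customer)) AS customer, "
        "CAST(amount AS REAL) AS amount "
        "FROM sales_raw"
    )


def totals(conn, table):
    """Sum the amounts per customer for one of the tables created above."""
    query = (
        f"SELECT customer, SUM(amount) FROM {table} "
        "GROUP BY customer ORDER BY customer"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
etl(conn, SOURCE_ROWS)
elt(conn, SOURCE_ROWS)
print(totals(conn, "sales_etl"))
print(totals(conn, "sales_elt"))
```

Both paths end with the same cleaned data. With ELT the raw table stays available in the target, which tends to suit platforms such as Hadoop where storage is cheap and the processing engine scales out.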
Talend for ETL/ELT Tools
Demo for Standard Job with Database
Demo for Batch Job
Demo for Streaming Job
Announcement
https://weltam.wordpress.com/ is back with a Big Data flavor
Questions?
Rock On!