Hadoop World Overview: Trends and Topics


Trends and Topics

Valentyn Kropov, Solutions Architect, SAG, SoftServe

Agenda

1. Conference Overview

2. Bright Future of Hadoop Map-Reduce

3. Apache Spark Data Frames

4. Cloudera Kudu

5. Most Popular Reference Architecture

6. Use Cases

#1 Conference Overview

#2 Bright Future of Hadoop MapReduce

Spark is the Future

Cloudera Announces One Platform Initiative (September 9, 2015)

Spark is the Present

It appeared in 72% of presentations and use cases at the Hadoop World conference.

Spark is Easier to Code

MapReduce (Java) vs. Spark (Scala): side-by-side code comparison (a rough sketch follows below)
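The code shown on this slide is not preserved in the transcript. As a hedged stand-in for the comparison, here is the classic word count written against the Spark RDD API in Python (paths are illustrative); the equivalent Java MapReduce program needs a Mapper class, a Reducer class, and a driver for the same job.

from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

# Classic word count: split lines into words, emit (word, 1), sum per word
counts = (sc.textFile("hdfs:///data/input.txt")           # illustrative input path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///data/wordcount-output")    # illustrative output path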

Spark is Faster

Up to 100x faster than MapReduce for in-memory workloads!

Spark is Interactive

Spark is Real-Time

And They Have Power

• 400 contributors
• From 100+ companies
• Databricks (1 year old, 30 → 100 people, $47 million)
• Cloudera (370 patches, 43k lines of code)

Cloudera One Platform: Read More

http://goo.gl/jSK0h6

#3 Spark Data Frames

Most Data is Still Structured!

• No Sorting?
• No Joins?
• No Aggregations?
• No Filtering?
• No cross-DB connections?

Data Frame is…

• An API
• Like a table (RDBMS)
• Or a data frame (Python/R)
• An abstraction layer over RDD

Construct Data Frame

# Construct a DataFrame from a Hive table
users = context.table("users")

# ... or from JSON files in S3
logs = context.load("s3n://data.json", "json")

Filtering

# Create a new DataFrame that contains “young users” only
young = users.filter(users.age < 21)

Group By

# Count the number of young users by gender
young.groupBy("gender").count()

Joins!

# Join users with another DataFrame called logs
users.join(logs, logs.userId == users.userId, "left_outer")

Spark Languages

Spark Survey 2015

Why Not Python + RDD?

Data Frames and Python

• Compiled into JVM bytecode

• Data Never Leaves the JVM

• Python passes commands only

• Commands are pushed down (see the sketch below)
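To make the last point concrete, here is a minimal sketch reusing the `users` DataFrame constructed earlier: the Python expression only builds a query plan, explain() prints the plan that actually runs inside the JVM, and rows cross into Python only when they are explicitly pulled back.

# Build a query in Python: nothing executes yet, only a logical plan is created
young = users.filter(users.age < 21)

# Print the plan that will run inside the JVM; the filter lives in that plan,
# it is not a Python function applied row by row
young.explain()

# Data crosses into Python only when results are explicitly pulled back
first_rows = young.take(5)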

Data Frames Performance

Data Frames: Read More

http://www.slideshare.net/JonHaddad/enter-the-snake-pit-for-fast-and-easy-spark

#4 Cloudera’s Kudu

What’s Kudu?

• Columnar storage for Hadoop
• Not just a file format
• Supports low-latency random access (ms)
• Good alternative to Impala + Parquet
• Integrates with Spark, Hadoop, Impala (see the sketch below)
• It’s in beta now
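The slide mentions Spark integration without showing code. As a rough sketch only (the connector format string, master address, table, and column below are assumptions, and the kudu-spark connector jar must be on the classpath), an existing Kudu table can be loaded as a DataFrame:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="kudu-sketch")
sqlContext = SQLContext(sc)

# Load an existing Kudu table as a DataFrame through the kudu-spark connector
metrics = (sqlContext.read
           .format("org.apache.kudu.spark.kudu")
           .option("kudu.master", "kudu-master:7051")   # illustrative master address
           .option("kudu.table", "metrics")             # illustrative table name
           .load())

# Low-latency random access combines with ordinary DataFrame operations
metrics.filter(metrics.host == "web01").show()          # illustrative column/value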

Faster than Parquet

Kudu: Architecture

Kudu: use-cases

• Write: newly-arrived data is immediately available to users

• Time-series applications that need to support both random and scattered reads

Kudu: Read More

http://getkudu.io/

#5 Most Popular Reference Architecture

Reference Architecture

YARN (90%), Mesos (10%)

Kafka

• Highly-scalable

• Fault-tolerant (commit-log)

• Partition-based load-balancing

Spark Streaming

• Processes data in micro-batches (DStreams, sliding windows)

• Supports data locality with Cassandra

• Real-time data science (Data Frames, MLlib)

• BI Support (Spark SQL)

Cassandra

• No SPOF

• Masterless (easy operations and scaling)

• Replicates data across data-centers

• Most mature and fast growing

• Evolving toward NewSQL (transactions)

• SQL-like query language (CQL)

Spark

• Is awesome for analytics (both real-time and batch); a minimal wiring sketch of the full pipeline follows below
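A minimal sketch of how these pieces are commonly wired together, assuming the spark-streaming-kafka and spark-cassandra-connector packages are available and that the Kafka topic, broker address, and Cassandra keyspace/table below (all illustrative) already exist:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-spark-cassandra-sketch")
sqlContext = SQLContext(sc)
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Direct stream: each micro-batch reads straight from the Kafka partitions
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker:9092"})

def save_batch(rdd):
    # Turn each non-empty micro-batch into a DataFrame and append it to Cassandra
    if not rdd.isEmpty():
        df = sqlContext.createDataFrame(
            rdd.map(lambda kv: Row(key=str(kv[0]), value=kv[1])))  # key coerced for the sketch
        (df.write
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace="demo", table="events")
           .mode("append")
           .save())

stream.foreachRDD(save_batch)
ssc.start()
ssc.awaitTermination()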

Reference Architecture: Read More

http://www.datastax.com/dev/blog/streaming-big-data-with-spark-spark-streaming-kafka-cassandra-and-akka

#6 Netflix Big Data Platform

Netflix: Size

• 20 PB data warehouse (DW) on S3
• Read ~10% of data daily
• Write ~10% of read data daily
• 500 billion events daily

Netflix: Analyze

• 300 data scientists
• Python, R, Scala, etc.

Netflix: Compute and Storage

• Separate compute and storage (S3), as sketched below
• To allow heterogeneous clusters
• And no-downtime upgrades
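A minimal sketch of what this separation looks like from a job's point of view, assuming the warehouse is stored as Parquet in S3 (bucket, path, and column names are illustrative): the compute cluster holds no data of its own, so it can be resized, swapped, or upgraded without moving the warehouse.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="s3-warehouse-sketch")
sqlContext = SQLContext(sc)

# The warehouse lives in S3; the cluster only provides compute
events = sqlContext.read.parquet("s3n://example-dw-bucket/events/")  # illustrative path
events.groupBy("event_type").count().show()                          # illustrative column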

Netflix: Architecture

Netflix: Read More

http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/43373

#7 Big Data Mission to Mars

Mission Orion

Mission Orion: Size

• 350k measurands
• 2 TB / hour
• 1,200 telemetry sensors
• 3 x 1GB networks busy
• Data retention is 25 years
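For scale: if telemetry arrived continuously at the quoted 2 TB/hour, that would be roughly 17.5 PB of raw data per year, or on the order of 440 PB across the 25-year retention window, before any compression or replication.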

Mach-5 Data Ingest for Orion (architecture diagram)

• Ingest: a data reader/simulator and a C++ client (GDS Splitter + Decom) read telemetry packets and decommutate them into packet measurands, serialized as GPB (protocol buffer) files; each file represents one or more packets and carries per-measurand header metadata (apid : seqctr : time : value)
• Transport: the GPBs travel over a Kafka message bus
• Processing: Spark jobs (mach5-sample) handle deduplication and HBase writing
• Storage: HDFS (HFiles via HBase-RDD) and HBase
• Analytics: Spark jobs for aggregation, alerting, and limit checking
• Web/UI: Tomcat, Glassfish, etc., with Trace / FOSS widgets

Orion: Read More

http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/43181

Thanks!

valentine.kropov@gmail.com
