Hadoop World Overview: Trends and Topics


Trends and Topics

Valentyn Kropov, Solutions Architect, SAG, SoftServe

Agenda

1. Conference Overview

2. Bright Future of Hadoop Map-Reduce

3. Apache Spark Data Frames

4. Cloudera Kudu

5. Most Popular Reference Architecture

6. Use Cases

#1 Conference Overview

#2 Bright Future of Hadoop MapReduce

Spark is the Future

Cloudera Announces One Platform Initiative (September 9, 2015)

Spark is the Present

It appeared in 72% of presentations and use cases at the Hadoop World conference.

Spark is Easier to Code

MapReduce (Java) vs. Spark (Scala): side-by-side code comparison (a rough sketch follows below)
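The code shown on this slide is not preserved in the transcript. As a hedged stand-in for the comparison, here is the classic word count written against the Spark RDD API in Python (paths are illustrative); the equivalent Java MapReduce program needs a Mapper class, a Reducer class, and a driver for the same job.

from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

# Classic word count: split lines into words, emit (word, 1), sum per word
counts = (sc.textFile("hdfs:///data/input.txt")           # illustrative input path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///data/wordcount-output")    # illustrative output path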

Spark is Faster

Up to 100x faster than MapReduce for in-memory workloads!

Spark is Interactive

Spark is Real-Time

And They Have Power

• 400 contributors
• From 100+ companies
• Databricks (1 year old, 30 → 100 people, $47 million)
• Cloudera (370 patches, 43k lines of code)

Cloudera One Platform: Read More

http://goo.gl/jSK0h6

#3 Spark Data Frames

Most Data is Still Structured!

• No Sorting?
• No Joins?
• No Aggregations?
• No Filtering?
• No cross-DB connections?

Data Frame is…

• An API
• Like a table (RDBMS)
• Or a data frame (Python/R)
• An abstraction layer over RDD

Construct Data Frame

# Construct a DataFrame from a Hive table
users = context.table("users")

# ... or from JSON files in S3
logs = context.load("s3n://data.json", "json")

Filtering

# Create a new DataFrame that contains “young users” only
young = users.filter(users.age < 21)

Group By

# Count the number of young users by gender
young.groupBy("gender").count()

Joins!

# Join users with another DataFrame called logs
users.join(logs, logs.userId == users.userId, "left_outer")

Spark Languages

Spark Survey 2015

Why Not Python + RDD?

Data Frames and Python

• Compiled into JVM bytecode

• Data Never Leaves the JVM

• Python passes commands only

• Commands are pushed down (see the sketch below)
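To make the last point concrete, here is a minimal sketch reusing the `users` DataFrame constructed earlier: the Python expression only builds a query plan, explain() prints the plan that actually runs inside the JVM, and rows cross into Python only when they are explicitly pulled back.

# Build a query in Python: nothing executes yet, only a logical plan is created
young = users.filter(users.age < 21)

# Print the plan that will run inside the JVM; the filter lives in that plan,
# it is not a Python function applied row by row
young.explain()

# Data crosses into Python only when results are explicitly pulled back
first_rows = young.take(5)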

Data Frames Performance

Data Frames: Read More

http://www.slideshare.net/JonHaddad/enter-the-snake-pit-for-fast-and-easy-spark

#4 Cloudera’s Kudu

What’s Kudu?

• Columnar storage for Hadoop
• Not just a file format
• Supports low-latency random access (ms)
• Good alternative to Impala + Parquet
• Integrates with Spark, Hadoop, Impala (see the sketch below)
• It’s in beta now
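The slide mentions Spark integration without showing code. As a rough sketch only (the connector format string, master address, table, and column below are assumptions, and the kudu-spark connector jar must be on the classpath), an existing Kudu table can be loaded as a DataFrame:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="kudu-sketch")
sqlContext = SQLContext(sc)

# Load an existing Kudu table as a DataFrame through the kudu-spark connector
metrics = (sqlContext.read
           .format("org.apache.kudu.spark.kudu")
           .option("kudu.master", "kudu-master:7051")   # illustrative master address
           .option("kudu.table", "metrics")             # illustrative table name
           .load())

# Low-latency random access combines with ordinary DataFrame operations
metrics.filter(metrics.host == "web01").show()          # illustrative column/value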

Faster than Parquet

Kudu: Architecture

Kudu: use-cases

• Write: newly-arrived data is immediately available to users

• Time-series applications that need to support both random and scattered reads

Kudu: Read More

http://getkudu.io/

#5 Most Popular Reference Architecture

Reference Architecture

YARN (90%), Mesos (10%)

Kafka

• Highly-scalable

• Fault-tolerant (commit-log)

• Partition-based load-balancing

Spark Streaming

• Processes data in micro-batches (DStreams, sliding windows)

• Supports data locality with Cassandra

• Real-time data science (Data Frames, MLlib)

• BI Support (Spark SQL)

Cassandra

• No SPOF

• Masterless (easy operations and scaling)

• Replicates data across data-centers

• Most mature and fast growing

• Evolving toward NewSQL (transactions)

• SQL-like query language (CQL)

Spark

• Is awesome for analytics (both real-time and batch); a minimal wiring sketch of the full pipeline follows below
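A minimal sketch of how these pieces are commonly wired together, assuming the spark-streaming-kafka and spark-cassandra-connector packages are available and that the Kafka topic, broker address, and Cassandra keyspace/table below (all illustrative) already exist:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-spark-cassandra-sketch")
sqlContext = SQLContext(sc)
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Direct stream: each micro-batch reads straight from the Kafka partitions
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker:9092"})

def save_batch(rdd):
    # Turn each non-empty micro-batch into a DataFrame and append it to Cassandra
    if not rdd.isEmpty():
        df = sqlContext.createDataFrame(
            rdd.map(lambda kv: Row(key=str(kv[0]), value=kv[1])))  # key coerced for the sketch
        (df.write
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace="demo", table="events")
           .mode("append")
           .save())

stream.foreachRDD(save_batch)
ssc.start()
ssc.awaitTermination()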

Reference Architecture: Read More

http://www.datastax.com/dev/blog/streaming-big-data-with-spark-spark-streaming-kafka-cassandra-and-akka

#6 Netflix Big Data Platform

Netflix: Size

• 20 PB data warehouse (DW) on S3
• Read ~10% of data daily
• Write ~10% of read data daily
• 500 billion events daily

Netflix: Analyze

• 300 data scientists
• Python, R, Scala, etc.

Netflix: Compute and Storage

• Separate compute and storage (S3), as sketched below
• To allow heterogeneous clusters
• And no-downtime upgrades
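A minimal sketch of what this separation looks like from a job's point of view, assuming the warehouse is stored as Parquet in S3 (bucket, path, and column names are illustrative): the compute cluster holds no data of its own, so it can be resized, swapped, or upgraded without moving the warehouse.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="s3-warehouse-sketch")
sqlContext = SQLContext(sc)

# The warehouse lives in S3; the cluster only provides compute
events = sqlContext.read.parquet("s3n://example-dw-bucket/events/")  # illustrative path
events.groupBy("event_type").count().show()                          # illustrative column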

Netflix: Architecture

Netflix: Read More

http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/43373

#7 Big Data Mission to Mars

Mission Orion

Mission Orion: Size

• 350k measurands
• 2 TB / hour
• 1,200 telemetry sensors
• 3 x 1GB networks busy
• Data retention is 25 years
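For scale: if telemetry arrived continuously at the quoted 2 TB/hour, that would be roughly 17.5 PB of raw data per year, or on the order of 440 PB across the 25-year retention window, before any compression or replication.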

Mach-5 Data Ingest for Orion (architecture diagram)

• Ingest: a data reader/simulator and a C++ client (GDS Splitter + Decom) read telemetry packets and decommutate them into packet measurands, serialized as GPB (protocol buffer) files; each file represents one or more packets and carries per-measurand header metadata (apid : seqctr : time : value)
• Transport: the GPBs travel over a Kafka message bus
• Processing: Spark jobs (mach5-sample) handle deduplication and HBase writing
• Storage: HDFS (HFiles via HBase-RDD) and HBase
• Analytics: Spark jobs for aggregation, alerting, and limit checking
• Web/UI: Tomcat, Glassfish, etc., with Trace / FOSS widgets

Orion: Read More

http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/43181

Thanks!

valentine.kropov@gmail.com
