TRANSCRIPT
BI 2.0 – Hadoop Everywhere: What New Skills Are Required?
Dmitry Tolpeko, EPAM Systems - December 2014
What is Big Data?
• Not just large data volume
• Approach to perform data intensive computation in a scalable way
What was Before
• Expensive Commercial MPP Systems (Teradata, Oracle, Netezza etc.)
• Open source products (MySQL, PostgreSQL etc.)
• NoSQL products
• Cloud
Scalable Applications
• Threads and Concurrency
• Distributed processing
• Scheduling
• Sharding
• Fault-Tolerance
Infrastructure code takes almost all development time
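The point above can be sketched in a few lines: before Hadoop, even a toy scalable application had to hand-roll sharding, concurrency, and retry-based fault tolerance itself. The function names and workload here are illustrative, not a real framework API.

```python
# Sketch of the infrastructure code a pre-Hadoop application hand-rolled:
# shard the input, run shards concurrently, retry on failure.
from concurrent.futures import ThreadPoolExecutor

def shard(records, num_shards):
    """Hash-partition records so each worker sees a disjoint subset."""
    shards = [[] for _ in range(num_shards)]
    for r in records:
        shards[hash(r) % num_shards].append(r)
    return shards

def process(shard_records, retries=3):
    """Per-shard computation with naive retry-based fault tolerance."""
    for attempt in range(retries):
        try:
            return sum(len(r) for r in shard_records)  # toy workload
        except Exception:
            continue
    raise RuntimeError("shard failed after retries")

records = ["alpha", "beta", "gamma", "delta"]
with ThreadPoolExecutor(max_workers=2) as pool:
    total = sum(pool.map(process, shard(records, 2)))
print(total)  # 19 characters across all shards
```

None of this boilerplate is the business logic (the one-line `sum` inside `process`), which is exactly the problem the slide is pointing at.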
Hadoop
• Distributed File System – Any data, any format
• Generic Framework for Scalable Applications (not only SQL):
• SQL-on-Hadoop
• Map Reduce, Spark, Tez
• Graph
• Machine Learning, Neural Networks etc.
• Ideal for Cloud – Scalable and Elastic
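The generic framework the slide refers to is easiest to see in the classic word-count example: map emits (word, 1) pairs, the framework shuffles them by key, and reduce sums each group. This is a single-process Python sketch of the model, not Hadoop itself.

```python
# Word-count in the MapReduce style that Hadoop popularised.
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key."""
    return (key, sum(values))

lines = ["Hadoop everywhere", "hadoop is an ecosystem"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["hadoop"])  # 2
```

The developer writes only `map_phase` and `reduce_phase`; the distribution, shuffling, and fault tolerance are the framework's job.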
Must Have
• Universal Data Platform for Enterprise – Landing Zone (Active HDD)
• Oracle Big Data Appliance
• IBM BigInsights
• Microsoft HDInsight
• Teradata
• HP
Forrester: “Hadoop is no longer optional”
BI Trends
• Moving workload to Hadoop
• Analyzing unstructured data
• Streaming and real time – Proactive and Predictive analytics
New Skills for Software Developers
Distributed processing, machine learning
Traditional Data warehousing and BI concepts:
• Kimball, Inmon
• Facts and Dimensions (e.g. slowly changing dimensions)
Data Science
Gradual Transformation from traditional Data Analysis to Machine Learning
• Analysts
• BI/Database Developers
• QA
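The gradual transformation above can be sketched on one tiny dataset: traditional analysis stops at a descriptive aggregate, while even the simplest machine learning fits a model and predicts. The data and column meanings are made up for illustration.

```python
# From descriptive analysis (an average) to a predictive model
# (a least-squares trend line), in plain Python.
xs = [1, 2, 3, 4]              # e.g. week number
ys = [10.0, 12.0, 14.0, 16.0]  # e.g. weekly sales

# Traditional BI: a descriptive aggregate.
avg = sum(ys) / len(ys)

# Machine learning, simplest case: fit y = a*x + b, predict week 5.
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x
prediction = a * 5 + b
print(avg, prediction)  # 13.0 18.0
```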
A role like the Architect in Software Development
SQL-on-Hadoop
Dozens of SQL-on-Hadoop tools:
• Schemaless SQL – e.g. Apache Drill
• SQL on top of NoSQL
• SQL for Graphs, Machine Learning etc.
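What these engines have in common is compiling a declarative query, e.g. `SELECT dept, COUNT(*) FROM emp GROUP BY dept`, into distributed map/shuffle/reduce stages over files. A single-process sketch of that translation (table and column names are invented):

```python
# How a SQL GROUP BY maps onto the MapReduce model:
# map emits the grouping key, shuffle sorts by key, reduce counts.
from itertools import groupby

emp = [("sales", "ann"), ("sales", "bob"), ("it", "carl")]

mapped = sorted(dept for dept, _name in emp)   # map + shuffle
result = {dept: len(list(rows)) for dept, rows in groupby(mapped)}
print(result)  # {'it': 1, 'sales': 2}
```

A real engine adds a parser, an optimizer, and distributed execution, but the GROUP BY itself is just this key-based shuffle.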
Everywhere
It is an ecosystem, not a set of tools
Yesterday: Map Reduce
Today: YARN, Spark etc.
Tomorrow: ?
Must have skill set
Like Linux