TRANSCRIPT
BI 2.0 – Hadoop Everywhere: What New Skills Are Required?
Dmitry Tolpeko, EPAM Systems - December 2014
What is Big Data?
• Not just large data volume
• Approach to perform data intensive computation in a scalable way
What was Before
• Expensive Commercial MPP Systems (Teradata, Oracle, Netezza etc.)
• Open source products (MySQL, PostgreSQL etc.)
• NoSQL products
• Cloud
Scalable Applications
• Threads and Concurrency
• Distributed processing
• Scheduling
• Sharding
• Fault-Tolerance
Infrastructure code takes almost all development time
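The point above can be sketched in a few lines: before Hadoop, even a toy scalable application had to hand-roll sharding, concurrency, and retry-based fault tolerance itself. The function names and workload here are illustrative, not a real framework API.

```python
# Sketch of the infrastructure code a pre-Hadoop application hand-rolled:
# shard the input, run shards concurrently, retry on failure.
from concurrent.futures import ThreadPoolExecutor

def shard(records, num_shards):
    """Hash-partition records so each worker sees a disjoint subset."""
    shards = [[] for _ in range(num_shards)]
    for r in records:
        shards[hash(r) % num_shards].append(r)
    return shards

def process(shard_records, retries=3):
    """Per-shard computation with naive retry-based fault tolerance."""
    for attempt in range(retries):
        try:
            return sum(len(r) for r in shard_records)  # toy workload
        except Exception:
            continue
    raise RuntimeError("shard failed after retries")

records = ["alpha", "beta", "gamma", "delta"]
with ThreadPoolExecutor(max_workers=2) as pool:
    total = sum(pool.map(process, shard(records, 2)))
print(total)  # 19 characters across all shards
```

None of this boilerplate is the business logic (the one-line `sum` inside `process`), which is exactly the problem the slide is pointing at.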
Hadoop
• Distributed File System – Any data, any format
• Generic Framework for Scalable Applications (not only SQL):
• SQL-on-Hadoop
• Map Reduce, Spark, Tez
• Graph
• Machine Learning, Neural Networks etc.
• Ideal for Cloud – Scalable and Elastic
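The generic framework the slide refers to is easiest to see in the classic word-count example: map emits (word, 1) pairs, the framework shuffles them by key, and reduce sums each group. This is a single-process Python sketch of the model, not Hadoop itself.

```python
# Word-count in the MapReduce style that Hadoop popularised.
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key."""
    return (key, sum(values))

lines = ["Hadoop everywhere", "hadoop is an ecosystem"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["hadoop"])  # 2
```

The developer writes only `map_phase` and `reduce_phase`; the distribution, shuffling, and fault tolerance are the framework's job.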
Must Have
• Universal Data Platform for Enterprise – Landing Zone (Active HDD)
• Oracle Big Data Appliance
• IBM BigInsights
• Microsoft HDInsight
• Teradata
• HP
Forrester: “Hadoop is no longer optional”
BI Trends
• Moving workload to Hadoop
• Analyzing unstructured data
• Streaming and real time – Proactive and Predictive analytics
New Skills for Software Developers
Distributed processing, machine learning
Traditional Data warehousing and BI concepts:
• Kimball, Inmon
• Facts and Dimensions (e.g. slowly changing dimensions)
Data Science
Gradual Transformation from traditional Data Analysis to Machine Learning
• Analysts
• BI/Database Developers
• QA
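The gradual transformation above can be sketched on one tiny dataset: traditional analysis stops at a descriptive aggregate, while even the simplest machine learning fits a model and predicts. The data and column meanings are made up for illustration.

```python
# From descriptive analysis (an average) to a predictive model
# (a least-squares trend line), in plain Python.
xs = [1, 2, 3, 4]              # e.g. week number
ys = [10.0, 12.0, 14.0, 16.0]  # e.g. weekly sales

# Traditional BI: a descriptive aggregate.
avg = sum(ys) / len(ys)

# Machine learning, simplest case: fit y = a*x + b, predict week 5.
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x
prediction = a * 5 + b
print(avg, prediction)  # 13.0 18.0
```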
A role like the Architect in Software Development
SQL-on-Hadoop
Dozens of SQL-on-Hadoop tools:
• Schemaless SQL – e.g. Apache Drill
• SQL on top of NoSQL
• SQL for Graphs, Machine Learning etc.
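What these engines have in common is compiling a declarative query, e.g. `SELECT dept, COUNT(*) FROM emp GROUP BY dept`, into distributed map/shuffle/reduce stages over files. A single-process sketch of that translation (table and column names are invented):

```python
# How a SQL GROUP BY maps onto the MapReduce model:
# map emits the grouping key, shuffle sorts by key, reduce counts.
from itertools import groupby

emp = [("sales", "ann"), ("sales", "bob"), ("it", "carl")]

mapped = sorted(dept for dept, _name in emp)   # map + shuffle
result = {dept: len(list(rows)) for dept, rows in groupby(mapped)}
print(result)  # {'it': 1, 'sales': 2}
```

A real engine adds a parser, an optimizer, and distributed execution, but the GROUP BY itself is just this key-based shuffle.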
Everywhere
It is an ecosystem, not a set of tools
Yesterday: Map Reduce
Today: YARN, Spark etc.
Tomorrow: ?
Must have skill set
Like Linux