bds14 big data analytics to the masses

Big Data Analytics to the masses: Why it has failed and how we can fix it. Jose Luis Lopez Pino (@jllopezpino)

Post on 03-Jul-2015

DESCRIPTION

Slides from my talk at Big Data Spain 2014 in Madrid. In this talk, we discuss our approach to bringing large-scale deep analytics to the masses. R is an extremely popular numerical computing environment, but scientific data processing frequently hits its memory limits. On the other hand, systems that execute data-intensive tasks, such as Hadoop or Stratosphere, are not popular among R users because writing programs using these paradigms is cumbersome. We present an innovative approach to overcoming these limitations using the Stratosphere/Apache Flink big data platform by means of an R package and ready-to-use distributed algorithms. This solution allows the user, with small modifications to the R code, to easily execute distributed scenarios using popular machine learning techniques. We will cover the implementation details of the proposed solution, including the architecture of the system, the functionality implemented, and working examples. In addition, we will cover the differences between our approach and other solutions that integrate R with Hadoop or other large-scale analytics systems. Finally, the results of the performance tests show that this solution is competitive with existing R implementations for small amounts of data and is able to scale up to the gigabyte level.

TRANSCRIPT

Page 1: BDS14 Big Data Analytics to the masses

Big Data Analytics to the masses

Why it has failed and how we can fix it

Jose Luis Lopez Pino @jllopezpino

Page 2: BDS14 Big Data Analytics to the masses

Who am I?

BI Consultant

Large-Scale & Distributed

Founding

Data Engineer

Page 3: BDS14 Big Data Analytics to the masses

Big Data is like Tourism

It seems easy to do

But if you aren’t an expert, you can’t make the most of it

Page 4: BDS14 Big Data Analytics to the masses

Struggle to analyze Big Data

Harlan Harris, Sean Murphy, and Marck Vaisman. Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work. O’Reilly Media, Inc., 2013

Also: Sean Kandel, Andreas Paepcke, Joseph M Hellerstein, and Jeffrey Heer. Enterprise data analysis and visualization: An interview study. Visualization and Computer Graphics, IEEE Transactions

Page 5: BDS14 Big Data Analytics to the masses

Tools

Volker Markl. Breaking the chains: On declarative data analysis and data independence in the big data era. Proceedings of the VLDB Endowment, 7(13), 2014

Page 6: BDS14 Big Data Analytics to the masses

Tools (Now)

Original: Volker Markl. Breaking the chains: On declarative data analysis and data independence in the big data era. Proceedings of the VLDB Endowment, 7(13), 2014

Page 7: BDS14 Big Data Analytics to the masses

Deep analytics

Page 8: BDS14 Big Data Analytics to the masses

Libraries!

We need libraries...

Query languages

Write your own MR/RDD/Transformations

Page 9: BDS14 Big Data Analytics to the masses

… comprehensive ones!

Page 10: BDS14 Big Data Analytics to the masses

Say it with memes!

When you do deep analytics in small data using R and CRAN packages

When you do deep analytics in BIG data using R and CRAN packages

Page 11: BDS14 Big Data Analytics to the masses

When you try to program it using MapReduce

When you try to program it using Apache Spark / Apache Flink

When you try to use a library scalable to large data sets

Page 12: BDS14 Big Data Analytics to the masses

Can’t we do it better?

- Make it similar to normal R programs.
- Hide complexity.
- Make file manipulation easier.
- Part of the computing in the cluster and part of the computing in the client.
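The last goal, splitting the work between cluster and client, can be illustrated with a minimal sketch. It is not the R package presented in the talk and does not use Flink: it is plain Python, and the names `PartialStats`, `summarize_partition`, and `fit_line` are hypothetical. The assumption is that each partition of the data is reduced to a few sums on the "cluster" side, and only those small summaries travel to the "client", which finishes a simple linear regression.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, List, Tuple


@dataclass
class PartialStats:
    """Sufficient statistics for fitting y = slope * x + intercept (hypothetical helper)."""
    n: int = 0
    sum_x: float = 0.0
    sum_y: float = 0.0
    sum_xx: float = 0.0
    sum_xy: float = 0.0

    def merge(self, other: "PartialStats") -> "PartialStats":
        # Summaries from different partitions combine by simple addition.
        return PartialStats(
            self.n + other.n,
            self.sum_x + other.sum_x,
            self.sum_y + other.sum_y,
            self.sum_xx + other.sum_xx,
            self.sum_xy + other.sum_xy,
        )


def summarize_partition(rows: Iterable[Tuple[float, float]]) -> PartialStats:
    """'Cluster' side: one pass over a partition, returning only five numbers."""
    stats = PartialStats()
    for x, y in rows:
        stats = stats.merge(PartialStats(1, x, y, x * x, x * y))
    return stats


def fit_line(stats: PartialStats) -> Tuple[float, float]:
    """'Client' side: closed-form least squares from the merged summaries."""
    denom = stats.n * stats.sum_xx - stats.sum_x ** 2
    slope = (stats.n * stats.sum_xy - stats.sum_x * stats.sum_y) / denom
    intercept = (stats.sum_y - slope * stats.sum_x) / stats.n
    return slope, intercept


# Two toy partitions standing in for data spread across a cluster.
partitions: List[List[Tuple[float, float]]] = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
merged = reduce(PartialStats.merge, map(summarize_partition, partitions))
slope, intercept = fit_line(merged)
print(slope, intercept)
```

In this sketch the data never leaves its partition; only the merged summary reaches the client, which is one way to keep the client-side code looking like an ordinary single-machine program.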

Page 13: BDS14 Big Data Analytics to the masses

Our approach

Page 14: BDS14 Big Data Analytics to the masses

Our approach

Page 15: BDS14 Big Data Analytics to the masses

Behind the scenes: Before

Page 16: BDS14 Big Data Analytics to the masses

Behind the scenes: After

Page 17: BDS14 Big Data Analytics to the masses

Without writing significantly different code

Page 18: BDS14 Big Data Analytics to the masses

Competitive with, or even faster than, native R code on small data

Page 19: BDS14 Big Data Analytics to the masses

Competitive even in highly iterative programs in small data

Page 20: BDS14 Big Data Analytics to the masses

And it scales

Page 21: BDS14 Big Data Analytics to the masses

Some relevant findings

- Transmission time was not significant.
- Stratosphere/Flink was competitive even in small datasets.
- Changes in the code were required.
- Ensemble scenarios are the most exciting ones.

Page 22: BDS14 Big Data Analytics to the masses

4 Takeaways from this talk

- We still need to bring Big Data to the right people in the right place.
- We need comprehensive libraries.
- We need to move data back and forth.
- Use a syntax that the users are familiar with.

Page 23: BDS14 Big Data Analytics to the masses

That’s all!

- Have you found this talk interesting?
- Follow me: @jllopezpino
- Looking for a job? (SEM Data Analyst, Senior Analyst)
- GYG is hiring:
- Are you interested in Data + Energy?
- Keep in touch: