java as a fundamental working tool of the data scientist · 2015-02-28 · all your personal data...
![Page 1: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/1.jpg)
Speaker: Alexey Zinoviev
Java as a fundamental working tool of the Data Scientist
![Page 2: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/2.jpg)
About
● I am a <graph theory, machine learning, traffic jam prediction,
Big Data algorithms> scientist
● But I'm a <Java, JavaScript, Android, NoSQL, Hadoop, Spark>
programmer
![Page 3: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/3.jpg)
![Page 4: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/4.jpg)
One of these fine days...
![Page 5: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/5.jpg)
We need a Python dev 'cause of Data Mining
![Page 6: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/6.jpg)
You're a programmer, not an analyst
![Page 7: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/7.jpg)
Write your backends!
![Page 8: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/8.jpg)
Let’s talk about it, Java-boy...
![Page 9: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/9.jpg)
Data mining
Mining coal in your data
![Page 10: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/10.jpg)
Hey, man, predict me something!
![Page 11: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/11.jpg)
Man or sofa?
![Page 12: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/12.jpg)
● Which loan applicants are high-risk?
● How do we detect phone card fraud?
● Which customers prefer product A over product B?
● What is the revenue prediction for next year?
Typical questions for DM
![Page 13: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/13.jpg)
What is Data Mining?
![Page 14: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/14.jpg)
Statistics?
![Page 15: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/15.jpg)
Tag cloud?
![Page 16: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/16.jpg)
Data visualization?
![Page 17: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/17.jpg)
Not OLAP, 100%
![Page 18: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/18.jpg)
1. Selection
2. Pre-processing
3. Transformation
4. Data Mining
5. Interpretation/Evaluation
Magic part of KDD (Knowledge Discovery in Databases)
![Page 19: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/19.jpg)
1. Share your data with us
2. Our magic manipulations
3. Building an answering machine
4. PROFIT!!!
How it really works
![Page 20: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/20.jpg)
Data
![Page 21: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/21.jpg)
● Facebook users, tweets
● Weather
● Sea routes
● Trade transactions
● Government
● Medicine (genomic data)
● Telecommunications (phone call records)
Data examples
![Page 22: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/22.jpg)
● Relational Databases
● Data warehouses (Historical data)
● Files in CSV or in binary format
● Internet or e-mail
● Scientific, research (R, Octave,
Matlab)
Data sources
![Page 23: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/23.jpg)
Target Data & Personal Data
![Page 24: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/24.jpg)
![Page 25: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/25.jpg)
● All your personal data (PD) are being
deeply mined
● The industry of collecting, aggregating,
and brokering PD is “database marketing.”
● Acxiom holds 1.1 billion browser cookies, 200 million
mobile profiles, and an average of 1,500
pieces of data per consumer
Pay with your personal data
![Page 26: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/26.jpg)
Preprocessing
![Page 27: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/27.jpg)
● Select small pieces
● Define default values for missing data
● Remove strange signals (noise, outliers) from the data
● Merge several tables into one if required
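Two of the steps above, filling in missing values and dropping strange signals, can be sketched in plain Java. This is only an illustration under stated assumptions: missing values are encoded as `NaN`, a "strange signal" is anything outside a hand-picked range, and the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Illustrative preprocessing helpers; class and method names are hypothetical. */
final class PreprocessingSketch {

    /** Drops readings outside [lower, upper]; NaN (missing) entries are kept for later filling. */
    static double[] removeOutOfRange(double[] values, double lower, double upper) {
        return Arrays.stream(values)
                .filter(v -> Double.isNaN(v) || (v >= lower && v <= upper))
                .toArray();
    }

    /** Replaces missing values (NaN) with the mean of the values that are present. */
    static double[] fillMissingWithMean(double[] values) {
        double sum = 0;
        int present = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                present++;
            }
        }
        double mean = present == 0 ? 0.0 : sum / present;
        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = Double.isNaN(values[i]) ? mean : values[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Body temperatures with one missing reading and one obviously bad one.
        double[] raw = {36.6, Double.NaN, 36.8, 120.0, 36.7};
        double[] cleaned = fillMissingWithMean(removeOutOfRange(raw, 30.0, 45.0));
        System.out.println(Arrays.toString(cleaned));
    }
}
```

Filtering before filling matters here: otherwise the bad reading would leak into the mean used as the default value.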
![Page 28: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/28.jpg)
Pattern mining
![Page 29: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/29.jpg)
Association rule learning
![Page 30: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/30.jpg)
It is the process of finding a model or function that describes and
distinguishes data classes, in order to predict the class of objects whose class
label is unknown.
What is Cluster Analysis?
![Page 31: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/31.jpg)
● Statistical process for estimating the
relationships among variables
● The estimation target is a function (it can
be a probability distribution)
● Can be linear, polynomial, nonlinear,
etc.
Regression
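As a minimal sketch of the linear case, the following plain-Java class fits a straight line by ordinary least squares. The class name, method names, and the sample numbers are assumptions made for illustration, not part of the talk.

```java
/** Simple linear regression by ordinary least squares; names are hypothetical. */
final class LinearRegressionSketch {

    /** Returns {intercept, slope} of the line minimizing the squared error. */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs");
        }
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            sxy += dx * (y[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0) {
            throw new IllegalArgumentException("x has zero variance");
        }
        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return new double[]{intercept, slope};
    }

    /** Evaluates the fitted line at x. */
    static double predict(double[] model, double x) {
        return model[0] + model[1] * x;
    }

    public static void main(String[] args) {
        // Made-up yearly revenue figures that grow linearly.
        double[] year = {1, 2, 3, 4};
        double[] revenue = {3, 5, 7, 9};
        double[] model = fit(year, revenue);
        System.out.println("Prediction for year 5: " + predict(model, 5));
    }
}
```

This is the kind of estimate behind a question like "What is the revenue prediction for next year?", assuming the trend really is linear.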
![Page 32: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/32.jpg)
● Training set of classified examples (supervised learning)
● Test set of non-classified items
● Main goal: find a function (classifier) that maps input data to a category
● Applications: computer vision, drug discovery, speech recognition, biometric identification, credit scoring
Classification
![Page 33: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/33.jpg)
Decision trees
![Page 34: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/34.jpg)
Decision trees
![Page 35: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/35.jpg)
● There are two classes of objects, A and B
● Define the class of a new object based on
information about its neighbors
● By changing the boundaries of the area around the new
object, we form a set of neighbors
● The new object is B because the majority of its
neighbors are B
kNN (k-nearest neighbor)
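The voting procedure can be sketched in a few lines of plain Java. This is a minimal illustration, assuming Euclidean distance and a simple majority vote among the k closest training samples; the class names and sample points are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

/** Minimal k-nearest-neighbor classifier; names are hypothetical. */
final class KnnSketch {

    /** A labeled training point. */
    static final class Sample {
        final double[] features;
        final String label;

        Sample(double[] features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns the label held by the majority of the k nearest training samples. */
    static String classify(Sample[] training, double[] query, int k) {
        Sample[] sorted = training.clone();
        Arrays.sort(sorted, Comparator.comparingDouble((Sample s) -> distance(s.features, query)));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (int i = 0; i < Math.min(k, sorted.length); i++) {
            int count = votes.merge(sorted[i].label, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = sorted[i].label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Sample[] training = {
            new Sample(new double[]{1, 1}, "A"), new Sample(new double[]{1, 2}, "A"),
            new Sample(new double[]{2, 1}, "A"), new Sample(new double[]{6, 6}, "B"),
            new Sample(new double[]{7, 6}, "B"), new Sample(new double[]{6, 7}, "B")
        };
        System.out.println(classify(training, new double[]{6, 5}, 3));
    }
}
```

Sorting the whole training set is fine for a toy example; for large data sets a spatial index or a partial selection would be the usual choice.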
![Page 36: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/36.jpg)
Skills & Tools
![Page 37: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/37.jpg)
![Page 38: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/38.jpg)
![Page 39: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/39.jpg)
Fashion Languages
![Page 40: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/40.jpg)
![Page 41: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/41.jpg)
● It’s free
● Not a fully implemented stack of ML
algorithms
● All your matrix are belong to us!
● Single-threaded model
● Java support
Why not Octave?
![Page 42: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/42.jpg)
![Page 43: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/43.jpg)
● 25% of R packages are written in Java
● Syntax is too sweet
● You should read 1000 lines in docs to
write 1 line of code
● Single-threaded model for 95% of
algorithms
Why not R?
![Page 44: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/44.jpg)
● Now Python is an idol for young scientists
due to the low barrier to entry
● We are not Python developers
● High-level language
● Have you ever heard of Jython?
● A long, long way to real high-load
production
Why not Python?
![Page 45: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/45.jpg)
DM libraries in Python
![Page 46: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/46.jpg)
Java ecosystem
![Page 47: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/47.jpg)
● Java API for Data mining, JSR 73 and JSR 247
● javax.datamining.supervised defines the supervised
function-related interfaces
● javax.datamining.algorithm contains all mining algorithm
subclass packages
● JDM 2.0 adds text mining, time series, and more
JDM
![Page 48: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/48.jpg)
● Connectors to R, Octave, Matlab, Hadoop,
NoSQL/SQL databases
● Source code of all algorithms in Java
● Preprocessing tools: discretization,
normalization, resampling, attribute
selection, transforming and combining
Weka
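To make two of the listed preprocessing steps concrete, here is a plain-Java sketch of min-max normalization and equal-width discretization. It does not use the Weka API; the class and method names are hypothetical and only show what such filters compute.

```java
import java.util.Arrays;

/** Plain-Java illustration of normalization and discretization; not the Weka API. */
final class ScalingSketch {

    /** Rescales values linearly to the range [0, 1]. */
    static double[] normalize(double[] values) {
        double min = Arrays.stream(values).min().orElse(0);
        double max = Arrays.stream(values).max().orElse(0);
        double[] result = new double[values.length];
        if (max == min) {
            return result; // all values equal: map everything to 0
        }
        for (int i = 0; i < values.length; i++) {
            result[i] = (values[i] - min) / (max - min);
        }
        return result;
    }

    /** Assigns each value to one of `bins` equal-width intervals, numbered from 0. */
    static int[] discretize(double[] values, int bins) {
        double min = Arrays.stream(values).min().orElse(0);
        double max = Arrays.stream(values).max().orElse(0);
        double width = (max - min) / bins;
        int[] result = new int[values.length];
        if (width == 0) {
            return result; // all values equal: everything lands in bin 0
        }
        for (int i = 0; i < values.length; i++) {
            // The maximum would fall just past the last bin, so clamp it.
            result[i] = Math.min((int) ((values[i] - min) / width), bins - 1);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] ages = {10, 20, 30};
        System.out.println(Arrays.toString(normalize(ages)));
        System.out.println(Arrays.toString(discretize(ages, 3)));
    }
}
```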
![Page 49: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/49.jpg)
![Page 50: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/50.jpg)
![Page 51: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/51.jpg)
Weka + Hadoop
![Page 52: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/52.jpg)
● It’s a codebase of algorithms in the pattern
mining field
● It has cool examples and implementations
of 78 algorithms
● Cool performance results in its specific area
● Codebase grows very fast
SPMF
![Page 53: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/53.jpg)
● Driven by Ng et al.’s paper “MapReduce for Machine Learning
on Multicore”
● The following algorithms were adapted: Locally Weighted Linear
Regression (LWLR), Naive Bayes (NB), k-means, Logistic
Regression, Neural Network (NN), Principal Components Analysis
(PCA), Support Vector Machine (SVM), and so on
● Running time drops about n-fold on n processors.
Mahout
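The reason these algorithms parallelize well is that their core computations can be written as sums over the data, so each worker can sum its own chunk and the partial results are then added together. The sketch below shows this for simple linear regression using Java 8 parallel streams. It is an illustration of the idea only, not Mahout code, and all names in it are hypothetical.

```java
import java.util.stream.IntStream;

/** "Map" partial sums per chunk, "reduce" by adding them; not the Mahout API. */
final class SummationFormSketch {

    /** Map step: partial sums {n, sumX, sumY, sumXY, sumXX} over indices [from, to). */
    static double[] partialSums(double[] x, double[] y, int from, int to) {
        double[] s = new double[5];
        for (int i = from; i < to; i++) {
            s[0] += 1;
            s[1] += x[i];
            s[2] += y[i];
            s[3] += x[i] * y[i];
            s[4] += x[i] * x[i];
        }
        return s;
    }

    /** Reduce step: element-wise sum into a fresh array (inputs are not modified). */
    static double[] combine(double[] a, double[] b) {
        double[] s = new double[5];
        for (int i = 0; i < 5; i++) {
            s[i] = a[i] + b[i];
        }
        return s;
    }

    /** Fits y = intercept + slope * x; returns {intercept, slope}. */
    static double[] fit(double[] x, double[] y, int chunks) {
        int n = x.length;
        double[] s = IntStream.range(0, chunks)
                .parallel()
                .mapToObj(c -> partialSums(x, y,
                        (int) ((long) c * n / chunks),
                        (int) ((long) (c + 1) * n / chunks)))
                .reduce(new double[5], SummationFormSketch::combine);
        double slope = (s[0] * s[3] - s[1] * s[2]) / (s[0] * s[4] - s[1] * s[1]);
        double intercept = (s[2] - slope * s[1]) / s[0];
        return new double[]{intercept, slope};
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5, 6, 7, 8};
        double[] y = {3, 5, 7, 9, 11, 13, 15, 17};
        double[] model = fit(x, y, 4);
        System.out.println("intercept=" + model[0] + ", slope=" + model[1]);
    }
}
```

The chunks play the role of mappers. On a cluster the same shape applies, with each partial-sum array travelling over the network instead of between threads.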
![Page 54: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/54.jpg)
● DataModel (File, MySQL, PostgreSQL,
Mongo, Cassandra)
● UserSimilarity
● ItemSimilarity
● UserNeighborhood
● Recommender
Mahout
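How these pieces fit together can be shown without the library. The following plain-Java sketch mirrors the four roles (data model, user similarity, neighborhood, recommender) for user-based collaborative filtering. It is not the Mahout API: the class, its methods, and the choice of cosine similarity are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** User-based collaborative filtering in miniature; not the Mahout API. */
final class RecommenderSketch {

    // "DataModel": user -> (item -> rating)
    private final Map<String, Map<String, Double>> ratings = new HashMap<>();

    void rate(String user, String item, double value) {
        ratings.computeIfAbsent(user, u -> new HashMap<>()).put(item, value);
    }

    // "UserSimilarity": cosine similarity, treating unrated items as 0.
    double similarity(String a, String b) {
        Map<String, Double> ra = ratings.get(a);
        Map<String, Double> rb = ratings.get(b);
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : ra.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = rb.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : rb.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // "UserNeighborhood": the n users most similar to the given (known) user.
    List<String> neighborhood(String user, int n) {
        List<String> others = new ArrayList<>(ratings.keySet());
        others.remove(user);
        others.sort(Comparator.comparingDouble((String o) -> similarity(user, o)).reversed());
        return others.subList(0, Math.min(n, others.size()));
    }

    // "Recommender": unseen item with the best similarity-weighted average rating, or null.
    String recommend(String user, int n) {
        Map<String, Double> weightedSum = new HashMap<>();
        Map<String, Double> weightTotal = new HashMap<>();
        for (String neighbor : neighborhood(user, n)) {
            double sim = similarity(user, neighbor);
            if (sim <= 0) {
                continue;
            }
            for (Map.Entry<String, Double> e : ratings.get(neighbor).entrySet()) {
                if (!ratings.get(user).containsKey(e.getKey())) {
                    weightedSum.merge(e.getKey(), sim * e.getValue(), Double::sum);
                    weightTotal.merge(e.getKey(), sim, Double::sum);
                }
            }
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String item : weightedSum.keySet()) {
            double score = weightedSum.get(item) / weightTotal.get(item);
            if (score > bestScore) {
                bestScore = score;
                best = item;
            }
        }
        return best;
    }
}
```

A usage example: after `rate("alice", "A", 5)`, `rate("alice", "B", 4)` and a few ratings from other users, `recommend("alice", 1)` returns the unseen item that her single nearest neighbor rated highest.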
![Page 55: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/55.jpg)
● Advanced Implementations of Java’s Collections Framework for
better Performance.
● Very close to Apache Giraph
● New algorithms will be built on the Spark platform
● Spark shell
● Spring + Mahout demo
● Collaborative Filtering, Classification, Clustering, Dimensionality
Reduction, Miscellaneous are supported
Mahout
![Page 56: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/56.jpg)
![Page 57: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/57.jpg)
● MapReduce in memory
● Up to 50x faster than Hadoop
● Support for Shark (like Hive), MLlib
(Machine learning), GraphX (graph
processing)
● RDD is a basic building block (immutable
distributed collections of objects)
Spark
![Page 58: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/58.jpg)
Spark
![Page 59: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/59.jpg)
Java 7 search example:

    JavaRDD<String> lines = sc.textFile("hdfs://log.txt").filter(
        new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains("Tomcat");
            }
        });
    long numErrors = lines.count();

Java 8 search example:

    JavaRDD<String> lines = sc.textFile("hdfs://log.txt")
        .filter(s -> s.contains("Tomcat"));
    long numErrors = lines.count();
Spark + Java 8
![Page 60: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/60.jpg)
Mahout’s killer
![Page 61: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/61.jpg)
● Classification and regression, collaborative filtering and
clustering, dimensionality reduction and optimization are
supported
● It draws on scikit-learn (Python lib) and Mahout and runs on
Spark
● Well documented and integrated with many Java solutions
MLlib
![Page 62: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/62.jpg)
| Size | Classification | Tools |
| --- | --- | --- |
| Lines | Sample Data: Analysis and Visualization | Whiteboard, bash |
| KBs - low MBs | Prototype Data: Analysis and Visualization | Matlab, Octave, R |
| MBs - low GBs | Online Data: Storage | MySQL (DBs) |
| MBs - low GBs | Online Data: Analysis | NumPy, SciPy, Weka, BLAS/LAPACK |
| GBs - TBs - PBs | Big Data: Storage | HDFS, HBase, Cassandra |
| GBs - TBs - PBs | Big Data: Analysis | Hive, Mahout, Hama, Giraph, MLlib |
![Page 63: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/63.jpg)
Large graph processing tools
![Page 64: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/64.jpg)
![Page 65: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/65.jpg)
![Page 66: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/66.jpg)
| Graph | Number of vertices | Number of edges | Volume | Data per day |
| --- | --- | --- | --- | --- |
| Web graph | 1.5 * 10^12 | 1.2 * 10^13 | 100 PB | 300 TB |
| Facebook (friends graph) | 1.1 * 10^9 | 160 * 10^9 | 1 PB | 15 TB |
| Road graph of EU | 18 * 10^6 | 42 * 10^6 | 20 GB | 50 MB |
| Road graph of this city | 250 000 | 460 000 | 500 MB | 100 KB |
![Page 67: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/67.jpg)
● Reducing graph problems to the key-value model is highly complex
● Iterative algorithms turn into multiple chained M/R jobs, each saving and re-reading the full state
Think like a vertex…
MapReduce for iterative calculations
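"Think like a vertex" means the algorithm is written from the point of view of one vertex: read incoming messages, update local state, send messages to neighbors, and repeat in synchronized rounds (supersteps) until no messages remain. The sketch below runs single-source shortest paths this way in memory. It is a toy, single-machine illustration with hypothetical names, not the API of Giraph or any other framework, and it assumes non-negative edge weights.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Vertex-centric shortest paths in supersteps; a toy in-memory illustration. */
final class VertexCentricSketch {

    /**
     * edges holds {from, to, weight} triples with non-negative weights.
     * Returns the distance from source to every vertex (Integer.MAX_VALUE if unreachable).
     */
    static int[] shortestPaths(int vertexCount, int[][] edges, int source) {
        List<List<int[]>> outgoing = new ArrayList<>();
        for (int v = 0; v < vertexCount; v++) {
            outgoing.add(new ArrayList<>());
        }
        for (int[] e : edges) {
            outgoing.get(e[0]).add(e);
        }

        int[] dist = new int[vertexCount];
        Arrays.fill(dist, Integer.MAX_VALUE);

        // Messages waiting for each vertex; the source starts with distance 0.
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        inbox.put(source, new ArrayList<>(Collections.singletonList(0)));

        while (!inbox.isEmpty()) { // one loop iteration = one superstep
            Map<Integer, List<Integer>> nextInbox = new HashMap<>();
            for (Map.Entry<Integer, List<Integer>> entry : inbox.entrySet()) {
                int v = entry.getKey();
                int best = Collections.min(entry.getValue());
                if (best < dist[v]) {
                    // The vertex improved its value, so it notifies its neighbors.
                    dist[v] = best;
                    for (int[] e : outgoing.get(v)) {
                        nextInbox.computeIfAbsent(e[1], k -> new ArrayList<>()).add(best + e[2]);
                    }
                }
            }
            inbox = nextInbox; // vertices without messages stay idle
        }
        return dist;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1, 4}, {0, 2, 1}, {2, 1, 2}, {1, 3, 5}};
        System.out.println(Arrays.toString(shortestPaths(4, edges, 0)));
    }
}
```

Only the messages move between supersteps, while the vertex state stays where it is. That is the contrast with chained M/R jobs, which write out and re-read the whole graph state on every iteration.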
![Page 68: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/68.jpg)
![Page 69: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/69.jpg)
C++ API
![Page 70: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/70.jpg)
![Page 71: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/71.jpg)
● “Mahout in Action”, Owen et al., Manning Pub.
● “Pattern Recognition and Machine Learning”, Christopher Bishop,
Springer Pub.
● “The Elements of Statistical Learning: Data Mining, Inference, and
Prediction”, Hastie et al., Springer Pub.
● “Collective Intelligence in Action”, Satnam Alag, Manning
Pub.
Books and papers
![Page 72: Java as a fundamental working tool of the Data Scientist · 2015-02-28 · All your personal data (PD) are being deeply mined The industry of collecting, aggregating, and brokering](https://reader033.vdocuments.net/reader033/viewer/2022042119/5e982925198309296c02b95a/html5/thumbnails/72.jpg)
Your questions?