tech 3camp presentation
© 2013 Acxiom Corporation. All Rights Reserved.
Hadoop – a distributed analytical platform
Jakub Wszolek
TECH 3camp - 2015
ETL processes
• Hadoop Streaming
• Hive
• MRJOB
• Data Loading
• Hive Tables (internal/external)
• Data Science
Worth checking out
• MRJOB - https://pythonhosted.org/mrjob/
- Built on Hadoop Streaming
- Keeps all MapReduce code for one job in a single class
- Lets you run your code without Hadoop at all
- Makes debugging much easier
• Snakebite - https://github.com/spotify/snakebite
- Pure Python HDFS client
- Uses protobuf to communicate with the NameNode
- CLI for Hadoop
- Extremely fast!
Still under heavy loading
[Chart: Data Loads [TB] by month, July to November, with an exponential trend line; vertical axis from 0 to 4 TB]
Complex analysis
• RevR + RStudio
• Data Science
• Trend analysis, advanced clustering
• Predictive models
• Classifiers
Apache Mahout
• Library of scalable machine-learning algorithms
• Implemented on top of Apache Hadoop
• Uses the MapReduce paradigm
• Provides data science tools to automatically find meaningful patterns in big data sets
• http://mahout.apache.org/
What Mahout Does
• Mahout supports four main data science use cases:
- Collaborative filtering – mines user behavior and makes product recommendations (e.g. Amazon recommendations)
- Clustering – takes items in a particular class and organizes them into groups of similar items
- Classification – learns from existing categorizations and then assigns unclassified items to the best-fitting category
- Frequent itemset mining – analyzes items in a group and identifies which items typically appear together
Clustering - business use case
• Helps marketers improve their customer base and work on the target areas.
• Groups people according to different criteria (such as willingness, purchasing power, etc.) based on how similar they are with respect to the product under consideration.
• Helps identify groups of houses on the basis of their value, type and geographical location.
Sequences and Vectors
• Hadoop sequence file
- Flat file consisting of binary key/value pairs
- Extensively used in MapReduce as an input/output format
- Each record is a <key,value> pair
- Key and value need to be instances of org.apache.hadoop.io.Text
- KEY = record name/filename/unique ID
- VALUE = content as a UTF-8 encoded String
• Vectors
- Typical vector representation, e.g. Weka, Matlab
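To make the "flat file of binary key/value pairs" idea concrete, here is a minimal sketch in plain Java. It is not the real Hadoop SequenceFile format (which adds a header, sync markers and optional compression); the class name KeyValueFileSketch and the sample records are hypothetical, and the data is kept in memory instead of HDFS so the example runs without Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy binary key/value container; NOT the real Hadoop SequenceFile format. */
public class KeyValueFileSketch {

    /** Serializes every record as a length-prefixed key followed by a length-prefixed value. */
    public static byte[] write(Map<String, String> records) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (Map.Entry<String, String> record : records.entrySet()) {
                // writeUTF stores a 2-byte length followed by modified UTF-8 bytes
                out.writeUTF(record.getKey());   // KEY: record name / filename / unique ID
                out.writeUTF(record.getValue()); // VALUE: the record content
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the records back in the order in which they were written. */
    public static Map<String, String> read(byte[] data) {
        Map<String, String> records = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                String key = in.readUTF();
                String value = in.readUTF();
                records.put(key, value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        Map<String, String> records = new LinkedHashMap<>();
        records.put("doc-1", "first document text");
        records.put("doc-2", "second document text");
        byte[] flatFile = write(records);
        System.out.println(read(flatFile));
    }
}
```

The point of the sketch is only the record layout: one key, one value, repeated until the end of the file.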
HDFS data file to Vector
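import java.util.LinkedList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

// Build a list holding one named vector
List<NamedVector> vector = new LinkedList<NamedVector>();
NamedVector v1;
v1 = new NamedVector(new DenseVector(new double[] {0.1, 0.2, 0.5}), "Item number one");
vector.add(v1);

Configuration config = new Configuration();
FileSystem fs = FileSystem.get(config);
Path path = new Path("datasamples/data");

// Write a SequenceFile from the vectors
SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, path, Text.class, VectorWritable.class);
VectorWritable vec = new VectorWritable();
for (NamedVector v : vector) {
    vec.set(v);
    // Append the VectorWritable wrapper, not the raw vector
    writer.append(new Text(v.getName()), vec);
}
writer.close();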
Kmeans clustering in action
• Place the file on HDFS
• Convert the file into a sequence file of vectors
- mahout arff.vector -d /home/cloudera/Mahout/input_data -o /user/cloudera/mahout/arff/vec_data -t /home/cloudera/Mahout/arff/dict
• Run mahout kmeans
- mahout kmeans --input <hdfs_data_files> --output <kmeans-output> --numClusters 3 --clusters <clusters-0-final> --maxIter 20 --method mapreduce
Kmeans clustering in action
• View the clusters as a text file
- mahout clusterdump -i <hdfs_input> -o <output_file> -p <clusteredPoints>
• View the clusters as a GraphML file
- same command with -of GRAPH_ML
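As a rough illustration of what the kmeans step computes, the sketch below runs the basic k-means loop (assign each point to its nearest centroid, then recompute the centroids) in plain Java, in memory. It is not Mahout's implementation, which distributes the same idea over MapReduce. The class name KMeansSketch, the sample points and the choice of initial centroids are assumptions made for the example; 3 clusters and 20 iterations mirror --numClusters 3 and --maxIter 20 above.

```java
import java.util.Arrays;

/** Minimal in-memory k-means; an illustration only, not Mahout's implementation. */
public class KMeansSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    public static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    public static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs k-means from the given initial centroids for at most maxIter
     * iterations and returns the cluster index assigned to each point.
     */
    public static int[] cluster(double[][] points, double[][] initialCentroids, int maxIter) {
        int k = initialCentroids.length;
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initialCentroids[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: attach every point to its nearest centroid
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = nearest(points[p], centroids);
                if (c != assignment[p]) {
                    assignment[p] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched cluster
            }
            // Update step: move each centroid to the mean of its points
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // leave an empty cluster's centroid where it is
                    for (int d = 0; d < dim; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {
            {0.1, 0.2, 0.5}, {0.2, 0.1, 0.4},
            {5.0, 5.1, 4.9}, {5.2, 4.8, 5.0},
            {9.0, 9.1, 8.8}, {8.9, 9.2, 9.0}
        };
        // Assumption: one point from each visible group serves as an initial centroid
        double[][] initial = {points[0], points[2], points[4]};
        System.out.println(Arrays.toString(cluster(points, initial, 20)));
        // prints [0, 0, 1, 1, 2, 2]
    }
}
```

The returned array plays the role of the clustered points that clusterdump prints: one cluster index per input vector.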
Acxiom DSSH
• Data Science Safe Haven (DSSH)
• Detailed measurements that show how digital marketing is driving purchasing behaviors
• Actionable recommendations on how to adjust your digital marketing to reach your goals
• Insights on how your key customer segments are engaging in digital channels
• http://www.acxiom.com/data-science-safe-haven/