tech 3camp presentation

© 2013 Acxiom Corporation. All Rights Reserved.
Hadoop – a distributed analytical platform
Jakub Wszolek ([email protected])
TECH 3camp 2015

Upload: jwacxiom

Post on 15-Apr-2017



Big Data is not only Hadoop


The Hadoop galaxy


ETL processes

• Hadoop Streaming
• Hive
• MRJOB
• Data Loading
• Hive Tables (internal/external)
• Data Science

Worth checking out

• MRJOB - https://pythonhosted.org/mrjob/
  - Built on Hadoop Streaming
  - Keeps all MapReduce code for one job in a single class
  - Lets you run your code without Hadoop at all
  - Makes debugging much easier

• Snakebite - https://github.com/spotify/snakebite
  - Pure Python HDFS client
  - Uses protobuf to communicate with the NameNode
  - CLI for Hadoop
  - Extremely fast!


Still under heavy loading

(Chart: monthly data loads in TB, July through November, y-axis from 0 to 4, with an exponential trend line.)

Complex analysis

• RevR + RStudio
• Data science
• Trend analysis, advanced clustering
• Predictive models
• Classifiers

Apache Mahout

• Library of scalable machine-learning algorithms
• Implemented on top of Apache Hadoop
• Uses the MapReduce paradigm
• Provides data science tools to automatically find meaningful patterns in big data sets
• http://mahout.apache.org/

What Mahout Does

• Mahout supports four main data science use cases:
  - Collaborative filtering: mines user behavior and makes product recommendations (e.g. Amazon recommendations)
  - Clustering: takes items in a particular class and organizes them into naturally occurring groups of similar items
  - Classification: learns from existing categorizations and then assigns unclassified items to the best category
  - Frequent itemset mining: analyzes items in a group and identifies which items typically appear together

Clustering - business use case

• Helps marketers grow their customer base and focus on target areas.
• Groups people according to different criteria (such as willingness, purchasing power, etc.) based on their similarity in ways related to the product under consideration.
• Helps identify groups of houses on the basis of their value, type, and geographical location.

K-means
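A minimal single-machine sketch of the K-means loop (Lloyd's algorithm) in plain Java may help make the idea concrete: assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. This is not Mahout's implementation; the class name KMeansSketch, its method names, the starting centroids, and the sample points are all assumptions made up for illustration.

```java
import java.util.Arrays;

/** Illustrative single-machine K-means sketch; names and data are hypothetical. */
public class KMeansSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    public static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    public static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs at most maxIter assign/update rounds starting from the given
     * centroids (updated in place). Returns the cluster index of each point.
     */
    public static int[] cluster(double[][] points, double[][] centroids, int maxIter) {
        int[] labels = new int[points.length];
        Arrays.fill(labels, -1);
        int dim = points[0].length;
        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = nearest(points[p], centroids);
                if (c != labels[p]) {
                    labels[p] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched cluster
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[centroids.length][dim];
            int[] counts = new int[centroids.length];
            for (int p = 0; p < points.length; p++) {
                counts[labels[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[labels[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Made-up sample data: two visibly separate groups in 2D.
        double[][] points = {{1.0, 1.0}, {1.5, 2.0}, {1.0, 0.5}, {8.0, 8.0}, {9.0, 9.5}, {8.5, 9.0}};
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}};
        int[] labels = cluster(points, centroids, 20);
        System.out.println("labels    = " + Arrays.toString(labels));
        System.out.println("centroids = " + Arrays.deepToString(centroids));
    }
}
```

The maxIter argument plays the same role as the iteration cap on a clustering run: the loop stops early once no point changes cluster. The sketch takes its starting centroids as input, so results depend on that initial choice.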

Hadoop data preparation

Sequences and Vectors

• Hadoop SequenceFile
  - Flat file consisting of binary key/value pairs
  - Used extensively in MapReduce as an input/output format
  - Each record is a <key, value> pair
  - Key and value must each be of class org.apache.hadoop.io.Text
  - KEY = record name / filename / unique ID
  - VALUE = content as a UTF-8 encoded String

• Vectors
  - Typical vector representation, as in e.g. Weka or Matlab

HDFS data file to Vector

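import java.util.LinkedList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

List<NamedVector> vector = new LinkedList<NamedVector>();
NamedVector v1 = new NamedVector(new DenseVector(new double[] {0.1, 0.2, 0.5}), "Item number one");
vector.add(v1);

Configuration config = new Configuration();
FileSystem fs = FileSystem.get(config);
Path path = new Path("datasamples/data");

// Write a SequenceFile from the vectors (key = vector name)
SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, path, Text.class, VectorWritable.class);
VectorWritable vec = new VectorWritable();
for (NamedVector v : vector) {
    vec.set(v);
    // The value must be the VectorWritable wrapper, not the raw vector
    writer.append(new Text(v.getName()), vec);
}
writer.close();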


K-means clustering in action

• Place the file on HDFS
• Convert the file into sequence and vector format
  mahout arff.vector -d /home/cloudera/Mahout/input_data -o /user/cloudera/mahout/arff/vec_data -t /home/cloudera/Mahout/arff/dict
• Run mahout kmeans
  mahout kmeans --input <hdfs_data_files> --output <kmeans-output> --numClusters 3 --clusters <clusters-0-final> --maxIter 20 --method mapreduce

K-means clustering in action (continued)

• See the cluster as a text file
  mahout clusterdump -i <hdfs_input> -o <output_file> -p <clusteredPoints>
• See the cluster as a GraphML file
  add -of GRAPH_ML to the clusterdump command

Results

Acxiom DSSH

• Data Science Safe Haven (DSSH)
• Detailed measurements that show how digital marketing is driving purchasing behaviors
• Actionable recommendations on how to adjust your digital marketing to reach your goals
• Insights on how your key customer segments are engaging in digital channels
• http://www.acxiom.com/data-science-safe-haven/

Questions?

Thank you!