recommender.system.presentation.pjug.05.20.2014

Applied Recommender Systems Bob Brehm5/20/2014

Hadoop MapReduce Overview Mahout Overview Hive Overview Review recommender systems Introduction to Spring XD Demonstrations as we go

Presentation Topics

Hadoop Overview

History [7] 2003: Apache Nutch (open-source web

search engine) was created by Doug Cutting and Mike Caferalla.

2004: Google File System and MapReduce papers published.

2005: Hadoop was created in Nutch as an open source inplementation to GFS and MapReduce.

Hadoop Overview

Today Hadoop is an independent Apache Project consisting of 4 modules: [6]

Hadoop common HDFS – distributed, scalable file system YARN (V2) – job scheduling and cluster

resource management MapReduce – system for parallel

processing of large data sets Hadoop market size is over $3 billion!

Hadoop Overview

Other Hadoop Related projects include Hive – data warehouse infrastructure Mahout – Machine learning library

While there are many more projects the rest of the talk will be focused on these two as well as MapReduce and HDFS.

Hadoop Overview

NameNode – keeps track of all DataNodes JobTracker – main scheduler Data Node – individual data clusters TaskTracker – sequences each DataNode

Hadoop Overview

HDFS basic command examples: Put – copies from local to HDFS

hadoop fs -put localfile /user/hadoop/hadoopfile

Mkdir – makes a directory hadoop fs -mkdir /user/hadoop/dir1

/user/hadoop/dir2 Tail – Displays last kilobyte of file

hadoop fs -tail pathname

Very similar to Linux commands

Hadoop Overview

Input data – wrangling can be difficult Mapper – split data into key value pairs Sort – sort values by key Reducer – Combine values by key

Hadoop Overview

Wordcount (HelloWorld) – Counts occurrences of each word in a document

Half of TF-IDF

Hadoop Overview

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); output.collect(word, one); } } }

Hadoop Overviewpublic static void main(String[] args) throws Exception { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(Map.class); conf.setCombinerClass(Reduce.class); conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf);

}

Hadoop Overview

Setup the data:

/usr/joe/wordcount/input - input directory in HDFS

/usr/joe/wordcount/output - output directory in HDFS

$ hadoop fs -ls /usr/joe/wordcount/input/

/usr/joe/wordcount/input/file01

/usr/joe/wordcount/input/file02

$ hadoop fs -cat /usr/joe/wordcount/input/file01

Hello World Bye World

$ hadoop fs -cat /usr/joe/wordcount/input/file02

Hello Hadoop Goodbye Hadoop

Hadoop Overview

Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile WordCount.java and create a jar:

$ mkdir wordcount_classes

$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java

$ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .

Run the application:

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

Output:

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000

Bye 1 Goodbye 1 Hadoop 2 Hello 2 World 2

Hadoop Overview

Interesting facts about MapReduce MapReduce can run on any type of file including

images. Hadoop streaming technology allows other

languages to use MapReduce. Python, R, Ruby. Can include a Combiner method that can

streamline traffic Not required to include a Reducer (image

processing, ETL) Hadoop includes a JobTracker WebUI MRUnit – Junit test framework

Hadoop Overview

Spring for Apache Hadoop project Configure and run MapReduce jobs as

container managed objects Provide template helper classes for

HDFS, Hbase, Pig and Hive. Use standard Spring approach for

Hadoop! Access all Spring goodies – Messaging,

Persistence, Security, Web Services, etc.

Hive

Hive is an alternative to writing MapReduce jobs. Hive compiles to MapReduce.

Hive programs are written in HiveQL. Similar to to SQL.

Examples: Create table: hive> CREATE TABLE pokes (foo

INT, bar STRING); Loading data: hive> LOAD DATA LOCAL

INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

Hive

Examples (cont):

Getting data out of hive: INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

Join: FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

Hive may reduce the amount of code you have to write when you are doing data wrangling.

It's a tool that has it's place and is useful to know.

Mahout

Started as a subproject of Lucene in 2008. Idea behind Mahout is that is provides a

framework for the development and deployment of Machine Learning algorithms.

Currently it has three distinct capabilities: Classification

Clustering

Recommenders

Mahout

Support for recommenders include: Data model – provides connections to data

UserSimilarity – provides similarity to users

ItemSimilarity – provides similarity to items

UserNeighborhood – find a neighborhood (mini cluster) of like-minded users.

Recommender – the producer of recommendations.

Algorithms!

Intro to Recommenders

What is a recommender?

Wikipedia [3]: A subclass of [an] information filtering system that seek to

predict the 'rating' or 'preference' that user would give to an item

My addition: A subclass of machine-learning.

Recommender model [2]: Users

Items

Ratings

Community

What is a recommender? [2]

Recommender types

Non-personalized [2] Content-based filtering (user-item) [2] Hybrid [3] Collaborative filtering (user-user, item-item)

[2]

Collaborative Filtering

We will now look at item-item collaborative filtering as the recommendation algorithm.

Answers the question: what items are similar to the ones you like?

Popularized by Amazon who found that item-item scales better, can be done in real time, and generate high-quality results. [8]

Specifically we will look at Pearson Correlation Coefficient algorithm.


Pearson's correlation coefficient - defined as the covariance of the two variables divided by the product of their standard deviations.


Idea is to examine a log file for user's movie ratings. Data looks like this:

109.170.148.120 - - [06/Jan/1998:01:48:18 -0500] "GET /rate?movie=268&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"




Steps used for the analysis: Run a hive script to extract the user data

from a log file Run Mahout command from the

command line (could be done programmatically as well).

Examine the contents.

Collaborative filtering

<hive-runner id="hiveRunner"> <script> CREATE TABLE MAHOUT_INPUT_A ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' AS SELECT cookie as user, regexp_extract(request, "GET /rate\\?movie=(\\d+) & rating=(\\d) HTTP/1.1", 1) as movie, CAST(regexp_extract(request, "GET /rate\\?movie=(\\d+) & rating=(\\d) HTTP/1.1", 2) as double) as rating from ACCESS_LOGS WHERE regexp_extract(request, "GET /rate\\?movie=(\\d+) & rating=(\\d) HTTP/1.1", 2) != ""; </script> </hive-runner>

Collaborative filteringpublic class HiveApp {

private static final Log log = LogFactory.getLog(HiveApp.class);

public static void main(String[] args) throws Exception { AbstractApplicationContext context = new

ClassPathXmlApplicationContext("/META-INF/spring/hive-context.xml", HiveApp.class);

context.registerShutdownHook();

HiveRunner runner = context.getBean(HiveRunner.class); runner.call();

}}


Hive output looks like this (This is the format that Mahout requires):

UserId, MovieID, relationship strength

943,373,3.0943,391,2.0943,796,3.0943,237,4.0943,840,4.0943,230,1.0943,229,2.0943,449,1.0943,450,1.0943,228,3.0

Collaborative filtering

Rerun Mahout with a different correlation say SIMILARITY_EUCLIDEAN_DISTANCE

Do A/B comparison in production Gather statistics over time See if one algorithm is better than others.

Spring XD

XD - Spring.io project that extends the work that Spring Data team did on Spring for Apache Hadoop project.

High throughput distributed data ingestion into HDFS from a variety of input sources.

Real-time analytics at ingestion time, e.g. gathering metrics and counting values.

Hadoop workflow management via batch jobs that combine interactions with standard enterprise systems (e.g. RDBMS) as well as Hadoop operations (e.g. MapReduce, HDFS, Pig, Hive or Cascading).

High throughput data export, e.g. from HDFS to a RDBMS or NoSQL database.

Spring XD

Configure a stream using XD. Simple case:

Spring XD

More typical Corporate Use Case Stream:

Spring XD

Admin UI

Thanks!

References

[1] Introduction to recommender systems. Joseph Konstan.

[2] Intro to recommendations. Coursera.

[3] Recommender system. Wikipedia.

[4] An Algorithmic Framework for Performing Collaborative Filtering.

[5] Hybrid Web Recommender Systems.

[6] Hadoop web site.

[7] Apache Hadoop. Wikipedia

[8] Amazon.com Recommendations paper. cs.umd.edu.

[9] Cloudera Data Science Training. Cloudera.

recommender.system.presentation.pjug.05.20.2014

Documents

etl hadoop

hadoop overview spring

hadoop overview namenode

hadoop overview history

hadoop overview setup

hdfs hadoop fs

hadoop common hdfs

apache hadoop project