recommender.system.presentation.pjug.05.20.2014
Applied Recommender Systems
Bob Brehm
5/20/2014
Presentation Topics
Hadoop MapReduce Overview
Mahout Overview
Hive Overview
Review recommender systems
Introduction to Spring XD
Demonstrations as we go
Hadoop Overview
History [7]
2003: Apache Nutch (open-source web search engine) was created by Doug Cutting and Mike Cafarella.
2003 and 2004: the Google File System and MapReduce papers, respectively, were published.
2005: Hadoop was created in Nutch as an open-source implementation of GFS and MapReduce.
Hadoop Overview
Today Hadoop is an independent Apache project consisting of 4 modules: [6]
Hadoop Common
HDFS – distributed, scalable file system
YARN (V2) – job scheduling and cluster resource management
MapReduce – system for parallel processing of large data sets
Hadoop market size is over $3 billion!
Hadoop Overview
Other Hadoop-related projects include:
Hive – data warehouse infrastructure
Mahout – machine learning library
While there are many more projects, the rest of the talk will focus on these two as well as MapReduce and HDFS.
Hadoop Overview
NameNode – keeps track of all DataNodes and of where each file's blocks are stored
JobTracker – main scheduler for MapReduce jobs
DataNode – stores data blocks on an individual cluster node
TaskTracker – runs the tasks assigned to its node
Hadoop Overview
HDFS basic command examples:
Put – copies from local to HDFS
hadoop fs -put localfile /user/hadoop/hadoopfile
Mkdir – makes directories
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
Tail – displays last kilobyte of file
hadoop fs -tail pathname
Very similar to Linux commands
Hadoop Overview
Input data – wrangling can be difficult
Mapper – splits data into key-value pairs
Sort – sorts values by key
Reducer – combines values by key
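The phases listed above can be imitated in plain Java without a cluster. This is only a minimal in-memory sketch: the class name MapSortReduceSketch and its methods are made up for illustration and are not part of the Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map -> sort -> reduce phases; no Hadoop involved.
final class MapSortReduceSketch {

    // Map phase: split one line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Sort phase: group the values by key; TreeMap keeps the keys sorted.
    static TreeMap<String, List<Integer>> sort(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: combine all values of one key into a single count.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases over all input lines.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : sort(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
        System.out.println(run(Arrays.asList("Hello World Bye World", "Hello Hadoop Goodbye Hadoop")));
    }
}
```

In real Hadoop the sort step is done by the framework between the map and reduce tasks; here it is an explicit method only to make the phase visible.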
Hadoop Overview
Wordcount (HelloWorld) – Counts occurrences of each word in a document
Half of TF-IDF
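To see why word count is "half" of TF-IDF: the per-document count is the term frequency, and the other half is the inverse document frequency. The sketch below assumes one common variant, tf * ln(N / df); other weightings exist, and the class TfIdfSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of TF-IDF, assuming the variant tf * ln(N / df).
final class TfIdfSketch {

    // Term frequency: the per-document word count, i.e. what WordCount produces.
    static Map<String, Integer> termFrequency(String document) {
        Map<String, Integer> tf = new HashMap<>();
        for (String word : document.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tf.merge(word, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Document frequency: in how many documents each word appears at least once.
    static Map<String, Integer> documentFrequency(List<String> documents) {
        Map<String, Integer> df = new HashMap<>();
        for (String document : documents) {
            for (String word : termFrequency(document).keySet()) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }

    // Weight of a word in one document relative to the whole collection.
    static double tfIdf(String word, String document, List<String> documents) {
        int tf = termFrequency(document).getOrDefault(word, 0);
        int df = documentFrequency(documents).getOrDefault(word, 0);
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        return tf * Math.log((double) documents.size() / df);
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Hello World Bye World", "Hello Hadoop Goodbye Hadoop");
        // "Hello" occurs in both documents, so its weight is zero.
        System.out.println("Hello: " + tfIdf("Hello", docs.get(0), docs));
        // "World" occurs twice in the first document only, so it scores higher.
        System.out.println("World: " + tfIdf("World", docs.get(0), docs));
    }
}
```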
Hadoop Overview
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits (word, 1) for every whitespace-separated token in the input line.
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
Hadoop Overview
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    // Types of the (key, value) pairs the job writes.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    // The Reduce class is also used as the combiner.
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    // args[0] is the input directory, args[1] the output directory.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}
Hadoop Overview
Set up the data:
/usr/joe/wordcount/input - input directory in HDFS
/usr/joe/wordcount/output - output directory in HDFS
$ hadoop fs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02
$ hadoop fs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
$ hadoop fs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
Hadoop Overview
Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile WordCount.java and create a jar:
$ mkdir wordcount_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .
Run the application:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
Output:
$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
Hadoop Overview
Interesting facts about MapReduce
MapReduce can run on any type of file, including images.
Hadoop Streaming allows other languages to use MapReduce: Python, R, Ruby.
Can include a Combiner that pre-aggregates map output and so reduces network traffic.
Not required to include a Reducer (image processing, ETL).
Hadoop includes a JobTracker web UI.
MRUnit – JUnit test framework for MapReduce.
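The effect of a combiner can be shown with a small in-memory sketch. It assumes each input line is handled by its own mapper and simply counts how many pairs would have to be shuffled with and without local pre-aggregation. The class CombinerSketch and its methods are hypothetical and do not use the Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of what a combiner saves; no Hadoop involved.
final class CombinerSketch {

    // Map output for one input split: one (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> mapOutput(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Summing step shared by combiner and reducer: adds up the counts per word.
    static TreeMap<String, Integer> sum(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Pairs that would cross the network: raw map output, or locally combined output.
    static List<Map.Entry<String, Integer>> shuffle(List<String> splits, boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            List<Map.Entry<String, Integer>> pairs = mapOutput(split);
            if (useCombiner) {
                shuffled.addAll(sum(pairs).entrySet());
            } else {
                shuffled.addAll(pairs);
            }
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("Hello World Bye World", "Hello Hadoop Goodbye Hadoop");
        System.out.println("pairs shuffled without combiner: " + shuffle(splits, false).size());
        System.out.println("pairs shuffled with combiner:    " + shuffle(splits, true).size());
        System.out.println("reduced result: " + sum(shuffle(splits, true)));
    }
}
```

The same summing step can serve as combiner and reducer only because addition is associative and commutative; an operation such as averaging would need a different combiner.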
Hadoop Overview
Spring for Apache Hadoop project
Configure and run MapReduce jobs as container-managed objects.
Provide template helper classes for HDFS, HBase, Pig and Hive.
Use standard Spring approach for Hadoop!
Access all Spring goodies – Messaging, Persistence, Security, Web Services, etc.
Hive
Hive is an alternative to writing MapReduce jobs. Hive compiles to MapReduce.
Hive programs are written in HiveQL, which is similar to SQL.
Examples:
Create table:
hive> CREATE TABLE pokes (foo INT, bar STRING);
Loading data:
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Hive
Examples (cont):
Getting data out of Hive:
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
Join:
FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
Hive may reduce the amount of code you have to write when you are doing data wrangling.
It's a tool that has its place and is useful to know.
Mahout
Started as a subproject of Lucene in 2008.
The idea behind Mahout is that it provides a framework for the development and deployment of machine learning algorithms.
Currently it has three distinct capabilities:
Classification
Clustering
Recommenders
Mahout
Support for recommenders includes:
DataModel – provides connections to data
UserSimilarity – provides similarity between users
ItemSimilarity – provides similarity between items
UserNeighborhood – finds a neighborhood (mini cluster) of like-minded users
Recommender – the producer of recommendations
Algorithms!
Intro to Recommenders
What is a recommender?
Wikipedia [3]: A subclass of [an] information filtering system that seeks to predict the 'rating' or 'preference' that a user would give to an item.
My addition: A subclass of machine learning.
Recommender model [2]:
Users
Items
Ratings
Community
What is a recommender? [2]
Recommender types
Non-personalized [2]
Content-based filtering (user-item) [2]
Hybrid [3]
Collaborative filtering (user-user, item-item) [2]
Collaborative Filtering
We will now look at item-item collaborative filtering as the recommendation algorithm.
Answers the question: what items are similar to the ones you like?
Popularized by Amazon, who found that item-item filtering scales better, can be done in real time, and generates high-quality results. [8]
Specifically, we will look at the Pearson correlation coefficient as the similarity measure.
Collaborative Filtering
Pearson's correlation coefficient - defined as the covariance of the two variables divided by the product of their standard deviations.
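The definition above translates directly into code. This is a minimal sketch, assuming both inputs are equal-length arrays of ratings; the class PearsonSketch is hypothetical and is not Mahout's implementation. The 1/n factors of the covariance and of the standard deviations cancel, so they are left out.

```java
// Plain-Java illustration of the Pearson correlation coefficient.
final class PearsonSketch {

    // Returns covariance(x, y) / (stddev(x) * stddev(y)), a value in [-1, 1].
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // The coefficient is undefined when one of the variables is constant.
        if (varX == 0.0 || varY == 0.0) {
            return Double.NaN;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Ratings that rise together correlate positively.
        System.out.println(pearson(new double[] {1, 2, 3, 4, 5}, new double[] {2, 4, 6, 8, 10}));
        // Ratings that move in opposite directions correlate negatively.
        System.out.println(pearson(new double[] {1, 2, 3}, new double[] {3, 2, 1}));
    }
}
```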
Collaborative Filtering
The idea is to examine a log file for users' movie ratings. The data looks like this:
109.170.148.120 - - [06/Jan/1998:01:48:18 -0500] "GET /rate?movie=268&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"
109.170.148.120 - - [05/Jan/1998:22:48:57 -0800] "GET /rate?movie=345&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"
109.170.148.120 - - [05/Jan/1998:22:50:15 -0800] "GET /rate?movie=312&rating=4 HTTP/1.1" 200 7 "http://clouderamovies.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0" "USER=286"
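Each log line above carries three useful fields: the user in the trailing cookie, the movie id, and the rating. The following sketch pulls them out with java.util.regex. The class RatingLogParser is hypothetical, and the patterns are only assumptions based on the three sample lines shown.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts "user,movie,rating" from one access-log line; patterns assume the sample format.
final class RatingLogParser {

    // Request part: GET /rate?movie=<id>&rating=<digit> HTTP/1.1
    private static final Pattern REQUEST =
            Pattern.compile("GET /rate\\?movie=(\\d+)&rating=(\\d) HTTP/1\\.1");

    // Trailing cookie field: "USER=<id>"
    private static final Pattern USER = Pattern.compile("\"USER=(\\d+)\"");

    // Returns the CSV triple, or an empty Optional when the line is not a rating request.
    static Optional<String> toCsv(String logLine) {
        Matcher request = REQUEST.matcher(logLine);
        Matcher user = USER.matcher(logLine);
        if (!request.find() || !user.find()) {
            return Optional.empty();
        }
        double rating = Double.parseDouble(request.group(2));
        return Optional.of(user.group(1) + "," + request.group(1) + "," + rating);
    }

    public static void main(String[] args) {
        String line = "109.170.148.120 - - [06/Jan/1998:01:48:18 -0500] "
                + "\"GET /rate?movie=268&rating=4 HTTP/1.1\" 200 7 "
                + "\"http://clouderamovies.com/\" "
                + "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0\" "
                + "\"USER=286\"";
        // prints 286,268,4.0
        System.out.println(toCsv(line).orElse("no match"));
    }
}
```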
Collaborative Filtering
Steps used for the analysis:
Run a Hive script to extract the user data from a log file.
Run the Mahout command from the command line (could be done programmatically as well).
Examine the contents.
Collaborative filtering
<hive-runner id="hiveRunner">
  <script>
    CREATE TABLE MAHOUT_INPUT_A
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    AS SELECT
      cookie as user,
      regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 1) as movie,
      CAST(regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 2) as double) as rating
    FROM ACCESS_LOGS
    WHERE regexp_extract(request, "GET /rate\\?movie=(\\d+)&amp;rating=(\\d) HTTP/1.1", 2) != "";
  </script>
</hive-runner>
Collaborative filtering
public class HiveApp {
    private static final Log log = LogFactory.getLog(HiveApp.class);

    public static void main(String[] args) throws Exception {
        AbstractApplicationContext context = new ClassPathXmlApplicationContext(
                "/META-INF/spring/hive-context.xml", HiveApp.class);
        context.registerShutdownHook();
        HiveRunner runner = context.getBean(HiveRunner.class);
        runner.call();
    }
}
Collaborative Filtering
Hive output looks like this (This is the format that Mahout requires):
UserId, MovieID, relationship strength
943,373,3.0
943,391,2.0
943,796,3.0
943,237,4.0
943,840,4.0
943,230,1.0
943,229,2.0
943,449,1.0
943,450,1.0
943,228,3.0
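To make the item-item idea concrete, the sketch below reads triples in the user,item,rating format shown above and computes the Pearson correlation between two items over the users who rated both. It is a plain-Java illustration under stated assumptions, not what Mahout runs internally. The class ItemSimilaritySketch and the ratings in main are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Item-item similarity over "user,item,rating" triples; data in main is made up.
final class ItemSimilaritySketch {

    // Parses the CSV lines into item -> (user -> rating).
    static Map<String, Map<String, Double>> ratingsByItem(List<String> csvLines) {
        Map<String, Map<String, Double>> byItem = new HashMap<>();
        for (String line : csvLines) {
            String[] fields = line.split(",");
            byItem.computeIfAbsent(fields[1].trim(), k -> new HashMap<>())
                    .put(fields[0].trim(), Double.parseDouble(fields[2].trim()));
        }
        return byItem;
    }

    // Pearson correlation between two items, using only users who rated both.
    static double similarity(Map<String, Double> itemA, Map<String, Double> itemB) {
        List<double[]> pairs = new ArrayList<>();
        for (Map.Entry<String, Double> rating : itemA.entrySet()) {
            Double other = itemB.get(rating.getKey());
            if (other != null) {
                pairs.add(new double[] {rating.getValue(), other});
            }
        }
        int n = pairs.size();
        if (n < 2) {
            return Double.NaN; // not enough co-ratings to correlate
        }
        double meanA = 0.0;
        double meanB = 0.0;
        for (double[] pair : pairs) {
            meanA += pair[0];
            meanB += pair[1];
        }
        meanA /= n;
        meanB /= n;
        double cov = 0.0;
        double varA = 0.0;
        double varB = 0.0;
        for (double[] pair : pairs) {
            double da = pair[0] - meanA;
            double db = pair[1] - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        if (varA == 0.0 || varB == 0.0) {
            return Double.NaN; // undefined when all co-ratings of an item are equal
        }
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        // Hypothetical ratings: three users, three items.
        Map<String, Map<String, Double>> items = ratingsByItem(Arrays.asList(
                "1,10,5.0", "1,20,4.0", "1,30,1.0",
                "2,10,3.0", "2,20,3.0", "2,30,3.0",
                "3,10,1.0", "3,20,2.0", "3,30,5.0"));
        System.out.println("sim(10, 20) = " + similarity(items.get("10"), items.get("20")));
        System.out.println("sim(10, 30) = " + similarity(items.get("10"), items.get("30")));
    }
}
```

With these made-up ratings, items 10 and 20 are liked and disliked by the same users, while items 10 and 30 appeal to opposite users, so the first similarity is positive and the second negative.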
Collaborative filtering
Rerun Mahout with a different similarity measure, say SIMILARITY_EUCLIDEAN_DISTANCE.
Do A/B comparison in production.
Gather statistics over time.
See if one algorithm is better than others.
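Why the choice of measure matters can be seen on a toy example. The sketch below compares Pearson correlation with a Euclidean-distance similarity, here assumed to be 1 / (1 + d), which is one common convention and not necessarily the exact formula Mahout uses. The class SimilarityComparisonSketch and the rating vectors are invented for illustration.

```java
// Compares two similarity measures on made-up rating vectors.
final class SimilarityComparisonSketch {

    // Pearson correlation: sensitive to the shape of the ratings, not their level.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    // Euclidean distance mapped into (0, 1]; identical vectors score 1.
    static double euclideanSimilarity(double[] x, double[] y) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sumOfSquares += d * d;
        }
        return 1.0 / (1.0 + Math.sqrt(sumOfSquares));
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3};
        double[] b = {3, 4, 5}; // same shape as a, but every rating is 2 higher
        double[] c = {1, 2, 2}; // close to a in value, slightly different shape
        System.out.println("pearson(a,b)   = " + pearson(a, b));
        System.out.println("pearson(a,c)   = " + pearson(a, c));
        System.out.println("euclidean(a,b) = " + euclideanSimilarity(a, b));
        System.out.println("euclidean(a,c) = " + euclideanSimilarity(a, c));
    }
}
```

On these vectors Pearson ranks b as more similar to a than c is, while the Euclidean measure ranks c first, so the two measures can yield different recommendations from the same data.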
Spring XD
Spring XD is a Spring.io project that extends the work the Spring Data team did on the Spring for Apache Hadoop project.
High throughput distributed data ingestion into HDFS from a variety of input sources.
Real-time analytics at ingestion time, e.g. gathering metrics and counting values.
Hadoop workflow management via batch jobs that combine interactions with standard enterprise systems (e.g. RDBMS) as well as Hadoop operations (e.g. MapReduce, HDFS, Pig, Hive or Cascading).
High throughput data export, e.g. from HDFS to an RDBMS or NoSQL database.
Spring XD
Configure a stream using XD. Simple case:
Spring XD
More typical Corporate Use Case Stream:
Spring XD
Admin UI
??
Thanks!
References
[1] Introduction to recommender systems. Joseph Konstan.
[2] Intro to recommendations. Coursera.
[3] Recommender system. Wikipedia.
[4] An Algorithmic Framework for Performing Collaborative Filtering.
[5] Hybrid Web Recommender Systems.
[6] Hadoop web site.
[7] Apache Hadoop. Wikipedia.
[8] Amazon.com Recommendations paper. cs.umd.edu.
[9] Cloudera Data Science Training. Cloudera.