scientific article recommendation with mahout
Post on 28-May-2015
Scientific Article Recommendation with Mahout
Kris Jack, PhD
Senior Data Mining Engineer
Use Case
➔ Good researchers are on top of their game
➔ Large amount of research produced
➔ Takes time to get at what you need
➔ Help researchers by recommending relevant research
1.5 million+ users; the 20 largest user bases:
University of Cambridge
Stanford University
MIT
University of Michigan
Harvard University
University of Oxford
Sao Paulo University
Imperial College London
University of Edinburgh
Cornell University
University of California at Berkeley
RWTH Aachen
Columbia University
Georgia Tech
University of Wisconsin
UC San Diego
University of California at LA
University of Florida
University of North Carolina
50m research articles
We need a recommender that scales up, coping with our data and future growth
Questions
➔ How does Mahout's recommender work?
➔ How well does it perform out of the box?
➔ How well does it perform after some tuning?
Mahout's Recommender
Generating recommendations through matrix multiplication
This is item-based recommendation: similarity is computed between items, not between users
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
Input (all user preferences)
[Figure: preference matrix with research articles as rows (Comp Sci 1, Physics 1, Physics 2, Comp Sci 2) and researchers as columns (Turing, Babbage, Einstein, Newton). At full scale: 50M research articles x 1.5M researchers, 300M prefs.]
item.RecommenderJob
1. Prepare preference matrix (1-3)
2. Generate similarity matrix (4-6)
3. Multiply matrices (7-10)
[Figure, built up over several slides: All User Preferences (item x user, research articles by researchers) are used to generate Item Similarity (item x item, research articles by research articles). Then Item Similarity (item x item) X A User's Preferences (item x user, Turing's column) = Recommendations (item x user, for Turing).]
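The three stages above can be tried out on a toy matrix. The sketch below is a minimal in-memory illustration in plain Java, not Mahout's distributed implementation. It assumes boolean preferences and uses plain co-occurrence counts as the item similarity. The class name, method names and preference values are made up for illustration.

```java
/** Toy, in-memory sketch of item-based recommendation by matrix multiplication. */
final class ItemBasedSketch {

    /** Item x item co-occurrence: sim[i][j] = number of users who have both item i and item j. */
    static int[][] itemSimilarity(int[][] prefs) {
        int items = prefs.length;
        int users = prefs[0].length;
        int[][] sim = new int[items][items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                for (int u = 0; u < users; u++) {
                    sim[i][j] += prefs[i][u] * prefs[j][u];
                }
            }
        }
        return sim;
    }

    /**
     * Multiplies the similarity matrix by one user's preference column and returns
     * the index of the best-scoring item the user does not have yet, or -1 if no
     * unseen item gets a positive score.
     */
    static int recommend(int[][] sim, int[][] prefs, int user) {
        int best = -1;
        int bestScore = 0;
        for (int i = 0; i < sim.length; i++) {
            if (prefs[i][user] != 0) {
                continue; // skip articles the user already has
            }
            int score = 0;
            for (int j = 0; j < sim.length; j++) {
                score += sim[i][j] * prefs[j][user];
            }
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical data. Rows: Comp Sci 1, Physics 1, Physics 2, Comp Sci 2.
        // Columns: Turing, Babbage, Einstein, Newton.
        int[][] prefs = {
            {1, 1, 0, 0},
            {0, 0, 1, 1},
            {0, 0, 0, 1},
            {0, 1, 0, 0},
        };
        int[][] sim = itemSimilarity(prefs);
        System.out.println("Item index recommended to Turing: " + recommend(sim, prefs, 0));
    }
}
```

In this toy data Turing only has Comp Sci 1, which co-occurs with Comp Sci 2 in Babbage's library, so Comp Sci 2 is the only unseen article with a positive score.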
How well does it work?
Mendeley Suggest
Running on Amazon's Elastic Map Reduce
On demand use and easy to cost
Mahout's Performance
[Chart: Normalised Amazon Hours (y-axis, 0 to 7K) against No. Good Recommendations/10 (x-axis, 0 to 3), divided into four quadrants: Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good.]
Orig. item-based ➔ 6.5K, 1.5
Let's tune it!
1. Reduce processing time
2. Improve quality
1. Reduce processing time
➔ Mahout's recommender is already efficient
➔ But your data may have unusual properties
➔ Hadoop may need a helping hand
➔ Let's see what's going on...
Task Allocation
37 hours to complete
1 reducer allocated, despite having 48 available...
Allocating more mappers on a per job basis:
job.getConfiguration().set("mapred.max.split.size", String.valueOf(splitSize));
Allocating more reducers on a per job basis:
job.getConfiguration().setInt("mapred.reduce.tasks", numReducers);
From 1 → 40 reducers: 37 hours to complete → 14 hours
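Why does a smaller maximum split size give more mappers? In the new-API FileInputFormat, the split size is max(minSize, min(maxSize, blockSize)), and the number of map tasks is roughly the input size divided by the split size. The sketch below reproduces only that arithmetic in plain Java. It is not Hadoop code, the class name and byte counts are made up, and real jobs compute splits per file with a small slop factor, so actual task counts can differ slightly.

```java
/** Plain-Java sketch of how a maximum split size changes the approximate number of map tasks. */
final class SplitSizeSketch {

    /** Same formula as FileInputFormat's split size computation. */
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Approximate number of map tasks: input size divided by split size, rounded up. */
    static long approxMapTasks(long inputBytes, long splitSize) {
        return (inputBytes + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;   // hypothetical HDFS block size
        long inputBytes = 1024 * mb; // hypothetical 1 GB input

        long defaultSplit = splitSize(blockSize, 1, Long.MAX_VALUE);
        long smallerSplit = splitSize(blockSize, 1, 16 * mb);

        System.out.println("map tasks with no max split size: " + approxMapTasks(inputBytes, defaultSplit));
        System.out.println("map tasks with a 16 MB max split: " + approxMapTasks(inputBytes, smallerSplit));
    }
}
```

The number of reducers, by contrast, is not derived from the input: it is whatever the job is configured with, which is why a job can end up with a single reducer while 48 sit idle.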
Partitioners
14 hours to complete
[Figure: partition sizes ranging from ~50KB to ~500MB]
InputSampler.Sampler<IntWritable, Text> sampler =
    new InputSampler.RandomSampler<IntWritable, Text>(...);
InputSampler.writePartitionFile(conf, sampler);
conf.setPartitionerClass(TotalOrderPartitioner.class);
http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
Evenly distributed: 14 hours to complete → 2 hours
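The idea behind combining a sampler with a total-order partitioner can be sketched without Hadoop. Sample the keys, take evenly spaced cut points from the sorted sample, and send each key to the range it falls into. The plain-Java sketch below assumes integer keys, a non-empty sample and distinct cut points. The class and method names are hypothetical and are not the Hadoop classes used above.

```java
import java.util.Arrays;

/** Plain-Java sketch of range partitioning driven by a sample of the keys. */
final class RangePartitionSketch {

    /** Picks numPartitions - 1 cut points at evenly spaced positions of the sorted sample. */
    static int[] cutPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] cuts = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            cuts[i - 1] = sorted[(int) ((long) i * sorted.length / numPartitions)];
        }
        return cuts;
    }

    /** Keys below the first cut point go to partition 0, keys from the first cut point up go to 1, and so on. */
    static int partition(int key, int[] cuts) {
        int pos = Arrays.binarySearch(cuts, key);
        // Found: the key equals a cut point and starts the next range.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    /** Counts how many keys land in each partition. */
    static int[] histogram(int[] keys, int[] cuts) {
        int[] counts = new int[cuts.length + 1];
        for (int key : keys) {
            counts[partition(key, cuts)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] keys = {8, 3, 1, 6, 2, 7, 5, 4};
        int[] cuts = cutPoints(keys, 4);
        System.out.println("cut points: " + Arrays.toString(cuts));
        System.out.println("keys per partition: " + Arrays.toString(histogram(keys, cuts)));
    }
}
```

Because the cut points follow the observed key distribution, dense key ranges are split across more partitions than sparse ones, which is what evens out the reducer load.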
Mahout's Performance
[Chart: Normalised Amazon Hours against No. Good Recommendations/10]
Orig. item-based ➔ 6.5K, 1.5
Cust. item-based ➔ 2.4K, 1.5
Orig. item-based → Cust. item-based: -4.1K (63%)
2. Improve quality
➔ Mahout provides item-based CF
➔ We have many more items than users
➔ Typically, user-based is more appropriate in that situation
➔ So let's make one!
[Figure: the item.RecommenderJob pipeline, Item Similarity (item x item) X A User's Preferences (item x user, Turing's column) = Recommendations (item x user), shown alongside a user-based variant that uses a User Similarity (user x user, researchers by researchers) matrix.]
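A user-based variant can be sketched the same way as the item-based one. Compute a user x user similarity matrix, then score each article by summing the similarities of the other users who have it. With far more items than users, the user x user matrix is also much smaller than the item x item one. The plain-Java sketch below assumes boolean preferences and co-occurrence counts as similarity. The names and data are hypothetical, and it is not the implementation behind Mendeley Suggest or MAHOUT-1004.

```java
/** Toy, in-memory sketch of user-based recommendation. */
final class UserBasedSketch {

    /** User x user co-occurrence: sim[u][v] = number of items that users u and v both have. */
    static int[][] userSimilarity(int[][] prefs) {
        int items = prefs.length;
        int users = prefs[0].length;
        int[][] sim = new int[users][users];
        for (int u = 0; u < users; u++) {
            for (int v = 0; v < users; v++) {
                for (int i = 0; i < items; i++) {
                    sim[u][v] += prefs[i][u] * prefs[i][v];
                }
            }
        }
        return sim;
    }

    /**
     * Multiplies the preference matrix (item x user) by the target user's similarity
     * column and returns the best-scoring unseen item, or -1 if none scores above zero.
     */
    static int recommend(int[][] sim, int[][] prefs, int user) {
        int best = -1;
        int bestScore = 0;
        for (int i = 0; i < prefs.length; i++) {
            if (prefs[i][user] != 0) {
                continue; // skip articles the user already has
            }
            int score = 0;
            for (int v = 0; v < sim.length; v++) {
                if (v != user) {
                    score += prefs[i][v] * sim[v][user];
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical data. Rows: Comp Sci 1, Physics 1, Physics 2, Comp Sci 2.
        // Columns: Turing, Babbage, Einstein, Newton.
        int[][] prefs = {
            {1, 1, 0, 0},
            {0, 0, 1, 1},
            {0, 0, 0, 1},
            {0, 1, 0, 0},
        };
        int[][] sim = userSimilarity(prefs);
        System.out.println("Item index recommended to Einstein: " + recommend(sim, prefs, 2));
    }
}
```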
Mahout's Performance
[Chart: Normalised Amazon Hours against No. Good Recommendations/10]
Orig. item-based ➔ 6.5K, 1.5
Cust. item-based ➔ 2.4K, 1.5
Orig. user-based ➔ 1K, 2.5
Cust. user-based ➔ 0.3K, 2.5
Orig. item-based → Cust. item-based: -4.1K (63%)
Cust. item-based → Orig. user-based: -1.4K (58%), +1 (67%)
Orig. user-based → Cust. user-based: -0.7K (70%)
Orig. item-based → Cust. user-based: -6.2K (95%), +1 (67%)
Conclusions
➔ Mahout is doing a great job of powering Mendeley Suggest
  ➔ Large scale data set
  ➔ Good quality recommendations
➔ Tuning helps
  ➔ Help Hadoop with task allocation if necessary
  ➔ Partition your data appropriately
  ➔ We save 95% of resources
➔ Use an appropriate algorithm
  ➔ Item- vs user-based (MAHOUT-1004)
  ➔ We increase precision by 66.6%
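The headline percentages follow from the chart values (6.5K → 0.3K normalised Amazon hours, 1.5 → 2.5 good recommendations out of 10). A minimal Java sketch of that arithmetic, with a hypothetical class and method name:

```java
/** Recomputes the relative changes quoted in the performance charts. */
final class SavingsSketch {

    /** Relative change from 'before' to 'after', as a percentage of 'before'. */
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // Normalised Amazon hours: orig. item-based vs cust. user-based.
        System.out.printf("hours: %.1f%%%n", percentChange(6500, 300));
        // Good recommendations out of 10: item-based vs user-based.
        System.out.printf("good recommendations: %.1f%%%n", percentChange(1.5, 2.5));
    }
}
```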
http://www.mendeley.com/profiles/kris-jack/