market basket analysis algorithm with map/reduce of cloud computing
Upload: jongwook-woo-big-data-artist-professor-at-california-state-university-los-angeles
Post on 25-Dec-2014
Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing, presented at PDPTA 2011 (http://www.world-academy-of-science.org/worldcomp11/ws/conferences/pdpta11)
Jongwook Woo
HiPIC
CSULA
Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing
2011 PDPTA
Jongwook Woo, PhD
High-Performance Internet Computing Center (HiPIC)
Computer Information Systems Department
California State University, Los Angeles
Contents
Map/Reduce Brief Introduction
Market Basket Analysis
Map/Reduce Algorithm for MBA
Experimental Result
Conclusion
What is Map/Reduce Cloud Computing
[Figure: Cloudera, HortonWorks, AWS, Parallel Computing]
Have you heard about Cloud Computing?
First Impression
In late 2007, the New York Times wanted to make its entire archive of articles available over the web:
– 11 million articles in all, dating back to 1851.
– The archive was a four-terabyte pile of images in TIFF format.
– The Times needed to convert that four-terabyte pile of TIFFs into more web-friendly PDF files.
• Not a particularly complicated computing chore, but a large one, requiring a whole lot of computer processing time.
– Derek Gottfrid, a software programmer at the Times, had been playing around with Amazon Web Services Elastic Compute Cloud (EC2).
– He uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3).
– In less than 24 hours, 11 million PDFs were stored neatly in S3 and ready to be served up to visitors to the Times site.
– The total cost for the computing job? $240
• 10 cents per computer-hour times 100 computers times 24 hours
What is MapReduce
Functions borrowed from functional programming languages (e.g. Lisp)
Provides a restricted parallel programming model
User implements Map() and Reduce()
Libraries (Hadoop) take care of EVERYTHING else
– Parallelization
– Fault Tolerance
– Data Distribution
– Load Balancing
Useful for huge (petabytes or terabytes) but non-complicated data
– New York Times case
– Log files for web companies
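The functional-programming origin of the two functions can be seen in a few lines of Python. This is only a single-machine sketch using the built-in map() and functools.reduce(), not Hadoop; the sample word list is hypothetical and not from the slides.

```python
from functools import reduce

# Hypothetical sample input, not taken from the slides.
words = ["cloud", "map", "reduce"]

# map(): apply the same function to every element independently,
# which is what makes the step easy to run in parallel.
lengths = list(map(len, words))

# reduce(): fold the mapped values into one final value.
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths, total)
```

Because each call to len() depends only on its own element, a framework is free to spread the map step over many machines; only the fold needs to see all the values.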
Map
Convert data to (key, value) pairs
map() functions run in parallel, creating different intermediate values from different input data sets
Reduce
reduce() combines those intermediate values into one or more final values for that same output key
reduce() functions also run in parallel, each working on a different output key
Bottleneck: the reduce phase can't start until the map phase is completely finished.
Example: Sort URLs by hit count, largest first
Map()
Input <logFilename, file text>
Parses the file and emits <url, hit count> pairs
– e.g. <http://hello.com, 1>
Reduce()
Sums all values for the same key and emits <url, TotalCount>
– e.g. <http://hello.com, (3 5 2 7)> => <http://hello.com, 17>
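The URL example can be sketched in memory as follows. This is an illustration under assumptions, not Hadoop code: the log format (one URL per line), the file names, and the helper names map_log, reduce_hits, and url_hits are all hypothetical.

```python
from collections import defaultdict

def map_log(log_filename, file_text):
    """Map(): emit a (url, 1) pair per hit; assumes one URL per log line."""
    for line in file_text.splitlines():
        url = line.strip()
        if url:
            yield (url, 1)

def reduce_hits(url, counts):
    """Reduce(): sum all values collected for the same URL."""
    return (url, sum(counts))

def url_hits(logs):
    """Run map, group by key, reduce, then sort by total hits (largest first)."""
    grouped = defaultdict(list)
    for name, text in logs.items():
        for url, count in map_log(name, text):
            grouped[url].append(count)
    totals = [reduce_hits(url, counts) for url, counts in grouped.items()]
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(totals, key=lambda kv: (-kv[1], kv[0]))

# Hypothetical log files.
logs = {
    "a.log": "http://hello.com\nhttp://world.org\nhttp://hello.com\n",
    "b.log": "http://hello.com\n",
}
ranked = url_hits(logs)
print(ranked)
```

The final sort is done here on one machine after the reduce step; on a real cluster it would typically be a separate pass over the reduced totals.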
Market Basket Analysis (MBA)
Collect the list of pairs of items that most frequently occur together in transactions at one or more stores
Traditional Business Intelligence Analysis
A much better opportunity to make a profit by controlling the order of products and marketing
– control the stocks more intelligently
– arrange items on shelves
– promote items together, etc.
Market Basket Analysis (MBA)
Transactions in Store A: Input data
Transaction 1: cracker, icecream, beer
Transaction 2: chicken, pizza, coke, bread
Transaction 3: baguette, soda, herring, cracker, beer
Transaction 4: bourbon, coke, turkey
Transaction 5: sardines, beer, chicken, coke
Transaction 6: apples, peppers, avocado, steak
Transaction 7: sardines, apples, peppers, avocado, steak
…
Which pairs of items do people frequently buy together at Store A?
Map Algorithm
1: Read each transaction of the input file and generate the data set of items:
(<V1>, <V2>, …, <Vn>) where <Vn>: (vn1, vn2, …, vnm)
2: Sort each data set <Vn> to generate the sorted data set <Un>:
(<U1>, <U2>, …, <Un>) where <Un>: (un1, un2, …, unm)
3: Loop over each item from un1 to unm of <Un>
3.a: generate the data set <Yn>: (yn1, yn2, …, ynl), where each ynl is a pair (unx, uny) with unx ≠ uny
3.b: increment the occurrence of ynl; note: (key, value) = (ynl, number of occurrences)
4: The data set is created as input to the Reducer:
(key, <value>) = (ynl, <number of occurrences>)
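The map steps above can be sketched for a single transaction as follows. This is a minimal illustration, assuming each transaction arrives as one comma-separated line; the function name map_transaction is hypothetical. Instead of incrementing a counter inside the mapper, the sketch emits a 1 per pair and leaves the summing to the reducer.

```python
from itertools import combinations

def map_transaction(line):
    """Turn one transaction line into a list of ((item_a, item_b), 1) pairs."""
    # Steps 1-2: read the items, drop duplicates and blanks, and sort them.
    items = sorted({item.strip() for item in line.split(",") if item.strip()})
    # Step 3: every 2-item combination of the sorted list is a pair whose
    # members differ and are already in key order.
    # Step 4: emit (key, value) = (pair, 1) as input for the reducer.
    return [(pair, 1) for pair in combinations(items, 2)]

for key, value in map_transaction("cracker, icecream, beer"):
    print(key, value)
```

Sorting before pairing is what guarantees that the same two items always produce the same key, whatever order they were scanned in at the register.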
Reduce Algorithm
1: Take (ynl, <number of occurrences>) as input data from multiple Map nodes
2: Add the values for ynl to produce (ynl, total number of occurrences) as output
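The reduce step is small enough to sketch in a few lines. The function name reduce_pair is hypothetical, and the list of values stands in for whatever the framework has collected for one key from all Map nodes.

```python
def reduce_pair(pair, occurrences):
    """Sum the occurrence values gathered for one item pair."""
    return (pair, sum(occurrences))

# Values for the same key may come from several Map nodes.
result = reduce_pair(("beer", "cracker"), [1, 1, 1])
print(result)
```

Because addition is associative, the same function also works if a node has already pre-summed some of its values, for example [2, 1] instead of [1, 1, 1].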
Market Basket Analysis (Cont’d)
1. Transactions in Store A
Transaction 1: cracker, icecream, beer
Transaction 2: chicken, pizza, coke, bread
…
2. Distribute Transaction data to Map nodes
3. Pairs of items restructured in each Map node
Transaction 1: <(cracker, icecream), (cracker, beer), (beer, icecream)>
Transaction 2: <(chicken, pizza), (chicken, coke), (chicken, bread), (coke, pizza), (bread, pizza), (coke, bread)>
…
Market Basket Analysis (Cont’d)
Note: the items within each pair should be sorted, because the pair becomes a key
For example, (cracker, icecream) and (icecream, cracker) should both be (cracker, icecream)
3. Pairs of items sorted in MBA
Transaction 1: <(cracker, icecream), (beer, cracker), (beer, icecream)>
Transaction 2: <(chicken, pizza), (chicken, coke), (bread, chicken), (coke, pizza), (bread, pizza), (bread, coke)>
…
Market Basket Analysis (Cont’d)
4. Output of Map node
Pairs of items in (key, value) structure in each Map node
(key, value): (pair of items, number of occurrences)
((cracker, icecream), 1)
((beer, cracker), 1)
((beer, icecream), 1)
((chicken, pizza), 1)
((chicken, coke), 1)
((bread, chicken), 1)
((coke, pizza), 1)
((bread, pizza), 1)
((bread, coke), 1)
…
Market Basket Analysis (Cont’d)
5. Data Aggregation/Combine
(key, <value>): (pair of items, list of numbers of occurrences)
((cracker, icecream), <1, 1, …, 1>)
((beer, cracker), <1, 1, …, 1>)
((beer, icecream), <1, 1, …, 1>)
((chicken, pizza), <1, 1, …, 1>)
…
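The aggregation step, which Hadoop performs between the map and reduce phases, can be imitated in memory as below. This is only a sketch: the function name group_by_key and the two node outputs are hypothetical.

```python
from collections import defaultdict

def group_by_key(map_outputs):
    """Collect the values emitted by all Map nodes into one list per key."""
    grouped = defaultdict(list)
    for node_output in map_outputs:
        for key, value in node_output:
            grouped[key].append(value)
    return dict(grouped)

# Hypothetical outputs from two Map nodes.
node1 = [(("beer", "cracker"), 1), (("coke", "pizza"), 1)]
node2 = [(("coke", "pizza"), 1)]

grouped = group_by_key([node1, node2])
print(grouped)
```

Each resulting (key, list) entry has the (key, <value>) shape shown on the slide and is what a single reduce() call receives.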
Market Basket Analysis (Cont’d)
6. Reduce nodes
(key, value): (pair of items, total number of occurrences)
((cracker, icecream), 421)
((beer, cracker), 341)
((beer, icecream), 231)
((chicken, pizza), 111)
…
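For a feel of the end result, the seven sample transactions of Store A can be counted on one machine. This is a shortcut sketch, not the distributed algorithm: collections.Counter does the grouping and summing in one step, and the helper name pairs_of is hypothetical. The totals on the slide come from a much larger data set, so the numbers here will not match them.

```python
from collections import Counter
from itertools import combinations

# The seven sample transactions listed for Store A.
TRANSACTIONS = [
    "cracker, icecream, beer",
    "chicken, pizza, coke, bread",
    "baguette, soda, herring, cracker, beer",
    "bourbon, coke, turkey",
    "sardines, beer, chicken, coke",
    "apples, peppers, avocado, steak",
    "sardines, apples, peppers, avocado, steak",
]

def pairs_of(line):
    """Yield every sorted item pair of one transaction line."""
    items = sorted({item.strip() for item in line.split(",")})
    return combinations(items, 2)

# Counting each pair once per transaction replaces map, aggregate, and reduce.
pair_counts = Counter(pair for line in TRANSACTIONS for pair in pairs_of(line))

for pair, count in pair_counts.most_common(5):
    print(pair, count)
```

With so few transactions no pair occurs more than twice, which is why the analysis only becomes informative on data sets large enough to need a cluster.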
Map/Reduce for MBA
[Figure: data flow of Map/Reduce for MBA]
Input transaction data -> Map1(), Map2(), …, Mapm()
Map output: ((coke, pizza), 1) ((beer, corn), 1) … and ((ham, juice), 1) ((coke, pizza), 1) …
Data Aggregation/Combine: ((coke, pizza), <1, 1, …, 1>) ((ham, juice), <1, 1, …, 1>)
Reduce1(), Reduce2(), …, Reducel(): ((coke, pizza), 3,421) ((ham, juice), 2,346)
Experimental Result
5 transaction files for the experiment:
400 MB (6.7M transactions), 800 MB (13M transactions), 1.6 GB (26M transactions).
Run on small instances of AWS EC2
Each node has a 1.0-1.2 GHz 2007 Opteron or Xeon processor, 1.7 GB memory, and 160 GB storage on a 32-bit platform.
The data are processed on 2, 5, 10, 15, and 20 nodes
Experimental Result
Execution time (sec)
No of nodes    6.7M (400MB)    13M (800MB)    26M (1.6GB)
2              9,133           NA             NA
5              5,442           8,717          15,963
10             2,910           5,998          8,845
15             2,792           2,917          5,898
20             2,868           2,911          5,671
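One way to read the table is as speedup relative to a baseline run. The sketch below uses the 5-node run as the baseline, an assumption made here because it is the smallest cluster with results for all three files; the function name speedup is hypothetical, and the times are copied from the table.

```python
# Execution times in seconds, copied from the table; None marks "NA".
TIMES = {
    "400MB": {2: 9133, 5: 5442, 10: 2910, 15: 2792, 20: 2868},
    "800MB": {2: None, 5: 8717, 10: 5998, 15: 2917, 20: 2911},
    "1.6GB": {2: None, 5: 15963, 10: 8845, 15: 5898, 20: 5671},
}

def speedup(times, base_nodes=5):
    """Return base_time / time for every node count that has a measurement."""
    base = times[base_nodes]
    return {n: base / t for n, t in times.items() if t is not None}

for size, times in TIMES.items():
    ratios = speedup(times)
    print(size, {n: round(r, 2) for n, r in sorted(ratios.items())})
```

Going from 5 to 20 nodes quadruples the cluster, yet the 1.6 GB file speeds up by less than a factor of three, and the 400 MB file is slightly slower on 20 nodes than on 15.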
Experimental Result
[Figure: Execution time (sec, 0 to 16,000) versus No of nodes (2, 5, 10, 15, 20) for the 400 MB, 800 MB, and 1,600 MB data sets]
Conclusion
The Market Basket Analysis Algorithm on Map/Reduce is presented
– a data mining analysis to find the most frequently occurring pairs of products in baskets at a store.
The associated items can be paired with the Map/Reduce approach.
Once we have the paired items, they can be used for further studies by statistically analyzing them, even sequentially, which is beyond the scope of this paper.
There is a bottleneck in distributing, aggregating, and reducing the data set among nodes.