market basket analysis algorithm with map/reduce of cloud computing

Jongwook Woo

HiPIC

CSULA

Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing

2011 PDPTA

Jongwook Woo, PhD

[email protected]

High-Performance Internet Computing Center (HiPIC)

Computer Information Systems Department

California State University, Los Angeles


Contents

Map/Reduce Brief Introduction

Market Basket Analysis

Map/Reduce Algorithm for MBA

Experimental Result

Conclusion


What is Map/Reduce Cloud Computing

[Figure: Map/Reduce cloud computing; labels: Cloudera, HortonWorks, AWS, Parallel Computing]


Have you heard about Cloud Computing?

First Impression

In late 2007, the New York Times wanted to make its entire archive of articles available over the web:

– 11 million articles in all, dating back to 1851

– stored as a four-terabyte pile of images in TIFF format

– the four-terabyte pile of TIFFs needed to be converted into more web-friendly PDF files

• not a particularly complicated computing chore, but a large one

• requiring a whole lot of computer processing time

A software programmer at the Times, Derek Gottfrid, was playing around with Amazon Web Services' Elastic Compute Cloud (EC2):

– he uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3)

– in less than 24 hours, 11,000 PDFs were all stored neatly in S3 and ready to be served up to visitors to the Times site

– the total cost for the computing job: $240

• 10 cents per computer-hour × 100 computers × 24 hours


What is MapReduce

Functions borrowed from functional programming languages (e.g., Lisp)

Provides a restricted parallel programming model

User implements Map() and Reduce()

Libraries (Hadoop) take care of EVERYTHING else

– Parallelization

– Fault tolerance

– Data distribution

– Load balancing

Useful for huge (petabytes or terabytes) but uncomplicated data

– New York Times case

– Log files for web companies


Map

Convert data to (key, value) pairs

map() functions run in parallel, creating different intermediate values from different input data sets


Reduce

reduce() combines those intermediate values into one or more final values for that same output key

reduce() functions also run in parallel, each working on a different output key

Bottleneck: the reduce phase can’t start until the map phase is completely finished.


Example: Sort URLs by hit count, largest first

Map()

Input: <logFilename, file text>

Parses the file and emits <url, hit count> pairs

– e.g., <http://hello.com, 1>

Reduce()

Sums all values for the same key and emits <url, TotalCount>

– e.g., <http://hello.com, (3 5 2 7)> => <http://hello.com, 17>
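The URL example can be sketched as a small single-process simulation in Python. This is not Hadoop code: the names `map_log`, `reduce_hits`, and `run_job`, the sample log text, and the assumption that the URL is the first whitespace-separated field of each log line are all illustrative. The slide does not say where the final ordering happens, so the sketch simply sorts the reduced totals in the driver.

```python
from collections import defaultdict

def map_log(log_filename, file_text):
    """Map: emit a (url, 1) pair for every request line in one log file."""
    for line in file_text.splitlines():
        fields = line.split()
        if fields:                      # skip blank lines
            yield fields[0], 1          # assumption: URL is the first field

def reduce_hits(url, counts):
    """Reduce: sum all counts that share the same URL key."""
    return url, sum(counts)

def run_job(logs):
    """In-memory stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for name, text in logs.items():
        for url, count in map_log(name, text):
            grouped[url].append(count)
    totals = [reduce_hits(url, counts) for url, counts in grouped.items()]
    # Order the totals by hit count, largest first
    return sorted(totals, key=lambda pair: pair[1], reverse=True)

# Hypothetical log files, made up for this sketch
logs = {
    "access1.log": "http://hello.com GET\nhttp://world.com GET\nhttp://hello.com GET",
    "access2.log": "http://hello.com GET\nhttp://world.com GET",
}
ranking = run_job(logs)
print(ranking)
```

In a real Map/Reduce job, the grouping inside `run_job` is what the library does between the two user-written functions.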


Market Basket Analysis (MBA)

Collects the pairs of items that most frequently occur together in transactions at one or more stores

A traditional Business Intelligence analysis

Gives a much better opportunity to make a profit by controlling the order of products and marketing

– control the stocks more intelligently

– arrange items on shelves

– promote items together, etc.


Market Basket Analysis (MBA)

Transactions in Store A (input data):

Transaction 1: cracker, icecream, beer

Transaction 2: chicken, pizza, coke, bread

Transaction 3: baguette, soda, herring, cracker, beer

Transaction 4: bourbon, coke, turkey

Transaction 5: sardines, beer, chicken, coke

Transaction 6: apples, peppers, avocado, steak

Transaction 7: sardines, apples, peppers, avocado, steak

…

Which pair of items do people frequently buy together at Store A?
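As a point of reference, the question can be answered on a single machine for the seven transactions listed on this slide. The sketch below is only a sequential baseline in Python, not the Map/Reduce algorithm itself, and it covers only the listed transactions (the slide elides the rest of the data). The variable names are illustrative.

```python
from collections import Counter
from itertools import combinations

# The seven transactions shown on the slide
transactions = [
    ["cracker", "icecream", "beer"],
    ["chicken", "pizza", "coke", "bread"],
    ["baguette", "soda", "herring", "cracker", "beer"],
    ["bourbon", "coke", "turkey"],
    ["sardines", "beer", "chicken", "coke"],
    ["apples", "peppers", "avocado", "steak"],
    ["sardines", "apples", "peppers", "avocado", "steak"],
]

pair_counts = Counter()
for items in transactions:
    # Sorting first makes (beer, cracker) and (cracker, beer) the same pair
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1

top_count = max(pair_counts.values())
top_pairs = sorted(p for p, c in pair_counts.items() if c == top_count)
print(top_count, top_pairs)
```

On this tiny sample several pairs tie for the highest count, which is one reason the analysis only becomes informative on large transaction files.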


Map Algorithm

1: Read each transaction of the input file and generate the data set of the items:

(<V1>, <V2>, …, <Vn>) where <Vn>: (vn1, vn2, …, vnm)

2: Sort each data set <Vn> and generate the sorted data set <Un>:

(<U1>, <U2>, …, <Un>) where <Un>: (un1, un2, …, unm)

3: Loop over each item from un1 to unm of <Un>

3.a: generate the data set <Yn>: (yn1, yn2, …, ynl), where ynl: (unx, uny) and unx ≠ uny

3.b: increment the occurrence count of ynl; note: (key, value) = (ynl, number of occurrences)

4: The data set is created as input to the Reducer:

(key, <value>) = (ynl, <number of occurrences>)
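The map steps can be sketched as a Python generator. This is a minimal illustration rather than the author's Hadoop implementation: the function name `map_transaction`, the comma-separated input format, and the use of `set()` to drop repeated items within one transaction are assumptions.

```python
from itertools import combinations

def map_transaction(line):
    """Map step for one transaction: emit ((item_a, item_b), 1) per item pair."""
    # Step 1: parse the transaction into its items (assumes comma-separated text)
    items = [item.strip() for item in line.split(",") if item.strip()]
    # Step 2: sort the items; set() drops repeats so both items in a pair differ
    sorted_items = sorted(set(items))
    # Step 3: every 2-combination of a sorted list is itself a sorted pair
    for pair in combinations(sorted_items, 2):
        # Step 4: (key, value) = (pair, 1), handed on to the reducer
        yield pair, 1

pairs = list(map_transaction("cracker, icecream, beer"))
print(pairs)
```

Because the items are sorted before pairing, the same two items always produce the same key regardless of the order in which they appear in a transaction.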


Reduce Algorithm

1: Take (ynl, <number of occurrences>) as input data from multiple Map nodes

2: Add the values for ynl to produce (ynl, total number of occurrences) as output
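The reduce step amounts to a sum per key. A minimal Python sketch, with the hypothetical name `reduce_pair` and made-up input values:

```python
def reduce_pair(pair, occurrences):
    """Reduce step: add up every occurrence count received for one item pair."""
    return pair, sum(occurrences)

# Values for one key as they might arrive from several Map nodes
result = reduce_pair(("beer", "cracker"), [1, 1, 1, 1])
print(result)
```

Since each key is reduced independently, different pairs can be summed on different nodes at the same time.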


Market Basket Analysis (Cont’d)

1. Transactions in Store A

Transaction 1: cracker, icecream, beer

Transaction 2: chicken, pizza, coke, bread

…

2. Distribute Transaction data to Map nodes

3. Pair of Items restructured in each Map node

Transaction 1: < (cracker, icecream), (cracker, beer) , (beer, icecream)>

Transaction 2: < (chicken, pizza), (chicken, coke), (chicken, bread) , (coke, pizza), (bread, pizza), (coke , bread)>

…



Note: the items within each pair should be sorted, because the pair becomes a key

For example, (cracker, icecream) and (icecream, cracker) should both become (cracker, icecream)

3. Pair of Items sorted in MBA

Transaction 1: < (cracker, icecream), (beer, cracker) , (beer, icecream)>

Transaction 2: < (chicken, pizza), (chicken, coke), (bread, chicken) , (coke, pizza), (bread, pizza), (bread, coke)>

…



4. Output of Map node

Pairs of items in (key, value) structure in each Map node

(key, value): (pair of items, number of occurrences)

((cracker, icecream), 1)

((beer, cracker), 1)

((beer, icecream), 1)

((chicken, pizza), 1)

((chicken, coke), 1)

((bread, chicken), 1)

((coke, pizza), 1)

((bread, pizza), 1)

((bread, coke), 1)

…



5. Data Aggregation/Combine

(key, <value>): (pair of items, list of occurrence counts)

((cracker, icecream), <1, 1, …, 1>)

((beer, cracker), <1, 1, …, 1>)

((beer, icecream), <1, 1, …, 1>)

((chicken, pizza), <1, 1, …, 1>)

…
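The aggregation step collects all values emitted for the same key into one list. In Hadoop this grouping is handled by the framework between the map and reduce phases; the Python sketch below only imitates the effect in memory, and `group_by_key` and the sample data are hypothetical.

```python
from collections import defaultdict

def group_by_key(map_outputs):
    """Collect the values emitted for each key into one list per key."""
    grouped = defaultdict(list)
    for key, value in map_outputs:
        grouped[key].append(value)
    return dict(grouped)

# Hypothetical (key, value) pairs gathered from two Map nodes
map_outputs = [
    (("cracker", "icecream"), 1),
    (("beer", "cracker"), 1),
    (("beer", "cracker"), 1),
    (("chicken", "pizza"), 1),
]
grouped = group_by_key(map_outputs)
print(grouped)
```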



6. Reduce nodes

(key, value): (pair of items, total number of occurrences)

((cracker, icecream), 421)

((beer, cracker), 341)

((beer, icecream), 231)

((chicken, pizza), 111)

…


Map/Reduce for MBA

[Figure: Map/Reduce data flow for MBA]

Input transaction data → Map1(), Map2(), …, Mapm()

Map output: ((coke, pizza), 1) ((bear, corn), 1) … and ((ham, juice), 1) ((coke, pizza), 1) …

Data Aggregation/Combine: ((coke, pizza), <1, 1, …, 1>) ((ham, juice), <1, 1, …, 1>)

Reduce1(), Reduce2(), …, Reducel() output: ((coke, pizza), 3,421) ((ham, juice), 2,346)


Experimental Result

5 transaction files for the experiment:

400 MB (6.7M transactions), 800 MB (13M transactions), 1.6 GB (26M transactions)

Run on small instances of AWS EC2

Each node has a 1.0-1.2 GHz 2007 Opteron or Xeon processor, 1.7 GB memory, and 160 GB storage on a 32-bit platform

The data are processed on 2, 5, 10, 15, and 20 nodes


Experimental Result

Execution time (sec)

Nodes   6.7M (400MB)   13M (800MB)   26M (1.6GB)
2       9,133          NA            NA
5       5,442          8,717         15,963
10      2,910          5,998         8,845
15      2,792          2,917         5,898
20      2,868          2,911         5,671
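One way to read the table is to compute the speedup of each run relative to a common baseline. The Python sketch below uses the 5-node run as the baseline because it is the smallest node count measured for all three files; that choice, and the names `times`, `BASELINE_NODES`, and `speedups`, are assumptions made for illustration.

```python
# Execution times in seconds, copied from the table; None marks the "NA" cells
times = {
    "6.7M (400MB)": {2: 9133, 5: 5442, 10: 2910, 15: 2792, 20: 2868},
    "13M (800MB)":  {2: None, 5: 8717, 10: 5998, 15: 2917, 20: 2911},
    "26M (1.6GB)":  {2: None, 5: 15963, 10: 8845, 15: 5898, 20: 5671},
}

BASELINE_NODES = 5  # smallest node count with a measurement for every file

def speedups(by_nodes, baseline=BASELINE_NODES):
    """Return {nodes: baseline_time / time} for every measured node count."""
    base = by_nodes[baseline]
    return {n: base / t for n, t in by_nodes.items() if t is not None}

speedup = {name: speedups(by_nodes) for name, by_nodes in times.items()}

for name, row in speedup.items():
    print(name, {n: round(s, 2) for n, s in row.items()})
```

For the 400 MB file, the 20-node run (2,868 sec) is slightly slower than the 15-node run (2,792 sec), so the speedup flattens out rather than growing with the node count.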


Experimental Result

[Figure: Execution time (sec, 0 to 16,000) versus number of nodes (2, 5, 10, 15, 20) for the 400 MB, 800 MB, and 1,600 MB files]


Conclusion

The Market Basket Analysis algorithm on Map/Reduce is presented

– a data mining analysis to find the most frequently occurring pairs of products in baskets at a store

The associated items can be paired with the Map/Reduce approach

Once we have the paired items, they can be used for further studies by statistically analyzing them, even sequentially, which is beyond the scope of this paper

There is a bottleneck in distributing, aggregating, and reducing the data set among nodes

