aka examples

AKA identifies unique coupons given different names in the SnipSnap coupon database using a combination of k-means clustering and "smoking gun" feature-based rule inference. GitHub: https://github.com/snipsnap/aka-service/ Email: [email protected]

Upload: luke-otterblad

Post on 27-May-2015


Category:

Technology



DESCRIPTION

aka, smoking gun rule inference

TRANSCRIPT

Page 1: Aka examples

AKA identifies unique coupons given different names in the SnipSnap coupon database using a combination of k-means clustering and "smoking gun" feature based rule inference.

Github: https://github.com/snipsnap/aka-service/

Email: [email protected]

Page 2: Aka examples

Step 1: Matches – same value, description text and activity dates

Page 3: Aka examples

Matches – pairs are shown, but groups often contain many more than 2 matched items

Page 4: Aka examples

More Examples…Different Barcodes – Same Coupon

The two coupons above were matched into a group. The coupon below was in the same set of American Eagle coupons but was NOT put into the same group, even though it has some similarity…

Page 5: Aka examples

How does it work?

• https://github.com/snipsnap/aka-service

• Run via the command line:

$ python aka.py -db_pswd your_password -store McDonald’s

id face_value offer_details start_date expiriation_date

988767 Free With the purchase of an Egg McMuffin 2013-09-03 2013-10-31

989829 FREE Egg McMuffin with the purchase of an Egg McMuffin 2013-09-03 2013-10-31

997447 Free Egg McMuffin with the purchase of an Egg McMuffin 2013-09-03 2013-10-31

Page 6: Aka examples

Active Coupons for a Store as a Graph

• When the aka-service is started for a particular store, each active coupon is converted to dictionary format. Features based on the face value and offer details are normalized with some language processing and stored in the Python version of a graph.

• Item -> Features

{"CouponA": ["free", "with", "the", "purchase", "of", "an", "egg", "mcmuffin"],
 "CouponB": ["free", "with", "the", "purchase", "of", "an", "egg", "mcmuffin"]}

• Features -> Item

{"egg": ["CouponA", "CouponB"],
 "mcmuffin": ["CouponA", "CouponB"],
 "free": ["CouponA", "CouponB"],
 "with": ["CouponA", "CouponB"],
 "the": ["CouponA", "CouponB"],
 "purchase": ["CouponA", "CouponB"],
 "of": ["CouponA", "CouponB"],
 "an": ["CouponA", "CouponB"]}
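The two dictionaries above can be built in one pass over a store's coupons. The following is a minimal sketch, not the aka-service code: `tokenize` and `build_graph` are hypothetical names, and the lowercase-and-split tokenizer only stands in for whatever language processing AKA actually applies.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into tokens (assumed stand-in for AKA's normalization)."""
    return re.findall(r"[a-z0-9$%]+", text.lower())

def build_graph(coupons):
    """Build both directions of the graph from a {coupon_id: text} mapping."""
    item_to_features = {}
    feature_to_items = defaultdict(list)
    for coupon_id, text in coupons.items():
        features = tokenize(text)
        item_to_features[coupon_id] = features
        # Use a set so a repeated word links the coupon to the feature only once.
        for feature in set(features):
            feature_to_items[feature].append(coupon_id)
    return item_to_features, dict(feature_to_items)

coupons = {
    "CouponA": "Free With the purchase of an Egg McMuffin",
    "CouponB": "FREE with the purchase of an Egg McMuffin",
}
item_to_features, feature_to_items = build_graph(coupons)
print(item_to_features["CouponA"])
print(feature_to_items["egg"])
```

Keeping both directions makes it cheap to go from a coupon to its words and from a word to every coupon that shares it.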

Page 7: Aka examples

Despite the different text, AKA identifies all of these as the same item

id face_value offer_details start_date expiriation_date aka_guid

988767 Free With the purchase of an Egg McMuffin 2013-09-03 2013-10-31 de5086f0-35bc-11e3-8da3-005056c00008

989829 FREE Egg McMuffin with the purchase of an Egg McMuffin 2013-09-03 2013-10-31 de5086f0-35bc-11e3-8da3-005056c00008

997447 Free Egg McMuffin with the purchase of an Egg McMuffin 2013-09-03 2013-10-31 de5086f0-35bc-11e3-8da3-005056c00008

Page 8: Aka examples

Free is treated as a value keyword (along with % and $ descriptions)
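As an illustration of treating "free" alongside dollar and percent amounts, here is a small sketch. The regular expression and the name `value_keywords` are assumptions made for this example; the patterns AKA really uses are not shown in these slides.

```python
import re

# Dollar amounts, percentages, or the word "free" (assumed patterns, for illustration only).
VALUE_PATTERN = re.compile(r"\$\d+(?:\.\d{2})?|\d+(?:\.\d+)?%|\bfree\b", re.IGNORECASE)

def value_keywords(text):
    """Return the value-like tokens found in a coupon's text, lowercased."""
    return [match.lower() for match in VALUE_PATTERN.findall(text)]

print(value_keywords("FREE Egg McMuffin with the purchase of an Egg McMuffin"))
print(value_keywords("Save $5.00 On Your Purchase Of $25.00 Or More"))
print(value_keywords("20% Off 1 Regular Priced Item"))
```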

Page 9: Aka examples

But words and value alone don’t create the match.

The expiry date also matters.

Page 10: Aka examples

Coupon with No Barcode connected to the same offer with a barcode

Same offer value (free mini candle) and same date range (September 9-October 6, 2013)

Page 11: Aka examples

Matching picture and computer images

Page 12: Aka examples

A change in degree…but the same coupon

Page 13: Aka examples

Smoking Gun Features

• A smoking gun feature for a coupon is a piece of information that identifies it, with near certainty, as the same real-world item as another coupon.

• There are two sources of such identification in the database. The first is a barcode_id: multiple coupons that have the same barcode_id are indeed the same physical coupon. The second is a promo_code.

• Two coupons that have the same promo_code are the same coupon 95%+ of the time. (Some stores, like Dunkin’ Donuts, don’t use unique codes… but more on that later.)
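One way to turn shared smoking gun values into groups is a union-find pass over the coupons. This is a sketch under assumptions, not the rule inference in aka-service: `group_by_smoking_gun` is a hypothetical name, the records are made up, and a missing or empty value is assumed to mean "no smoking gun".

```python
from collections import defaultdict

def group_by_smoking_gun(coupons, fields=("barcode_id", "promo_code")):
    """Group coupon ids that share any non-empty value in the given fields."""
    parent = {c["id"]: c["id"] for c in coupons}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}  # (field, value) -> id of the first coupon carrying it
    for c in coupons:
        for field in fields:
            value = c.get(field)
            if not value:
                continue  # assumed: empty means the coupon has no such code
            key = (field, value)
            if key in first_seen:
                union(first_seen[key], c["id"])
            else:
                first_seen[key] = c["id"]

    groups = defaultdict(list)
    for c in coupons:
        groups[find(c["id"])].append(c["id"])
    return sorted(groups.values())

coupons = [
    {"id": 1, "barcode_id": "B1", "promo_code": None},
    {"id": 2, "barcode_id": "B1", "promo_code": "P9"},
    {"id": 3, "barcode_id": None, "promo_code": "P9"},
    {"id": 4, "barcode_id": "B2", "promo_code": None},
]
print(group_by_smoking_gun(coupons))
```

Coupons 1 and 3 share no code directly, but both are linked to coupon 2, so all three end up in one group.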

Page 14: Aka examples

More Matches

The two coupons above are matched with each other, but NOT with the coupon below, despite its extremely similar description and validity:

The code in the upper right-hand corner (9152 versus 9992 – the smoking gun) helps significantly in assigning them different identifications.

Page 15: Aka examples

Two coupons NOT matched, even though they have the same description and similar text (they are valid at different times)

Page 16: Aka examples

Finding smoother images

I experimented with using the number of recorded features as an indicator of picture quality – but the two had little correlation. What did work was using the picture with the highest number of redemptions within an aka group.
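Picking the most-redeemed picture in a group is a one-line selection. The sketch below assumes each coupon record carries hypothetical `image_url` and `redemptions` fields; the real column names are not given in the slides.

```python
def best_image(group):
    """Return the image URL of the most-redeemed coupon in one aka group."""
    return max(group, key=lambda coupon: coupon["redemptions"])["image_url"]

# Made-up records for one aka group.
group = [
    {"id": 1, "image_url": "a.jpg", "redemptions": 3},
    {"id": 2, "image_url": "b.jpg", "redemptions": 17},
    {"id": 3, "image_url": "c.jpg", "redemptions": 5},
]
print(best_image(group))
```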

Page 17: Aka examples

Better images

Page 18: Aka examples

The Dollar Store $1 Off coupon problem – likely to be many of those

These four were originally matched, so I had to introduce the notion of a confidence percentage. This is largely because AKA weights the value of an item more heavily than the detail words describing the offer (most stores have few items at the same price).
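To see why a store full of $1 coupons needs a higher confidence setting, consider this toy score. It is not AKA's scoring (which combines k-means clustering with rule inference); the 0.7 value weight, the Jaccard word overlap, and the function names are all assumptions chosen only to show the effect.

```python
def jaccard(a, b):
    """Share of distinct words the two word lists have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def match_score(c1, c2, value_weight=0.7):
    """Weight an exact value match more heavily than overlap in the detail words."""
    value_sim = 1.0 if c1["value"] == c2["value"] else 0.0
    word_sim = jaccard(c1["words"], c2["words"])
    return value_weight * value_sim + (1 - value_weight) * word_sim

def is_match(c1, c2, confidence=0.8):
    """Declare a match only when the score reaches the store's confidence."""
    return match_score(c1, c2) >= confidence

# Two different $1-off offers (made-up data).
candy = {"value": "$1", "words": ["off", "any", "candy"]}
toy = {"value": "$1", "words": ["off", "any", "toy"]}
print(is_match(candy, toy, confidence=0.8))   # matched: the shared value dominates
print(is_match(candy, toy, confidence=0.99))  # separated once confidence is raised
```

With the value counting for most of the score, two distinct offers at the same price match at a moderate threshold and only separate when the confidence is raised.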

Page 19: Aka examples

More equal prices, but with high confidence set

Page 20: Aka examples

Trouble Spots: AKA identifies the same offer because of an assumed smoking gun, but although the barcode is the same, the expiry is different.

Ignoring PLU for Dunkin Donuts (and other publishers that duplicate promocodes) and going with 99% confidence does the trick.

Page 21: Aka examples

There Are Exceptions to Every Rule

• Coupons are no different.

• In settings.yaml (pictured above) you can define exceptions to global rules.

• What pop_smoking_gun tells aka is that for Dunkin’ Donuts the global rules of promo_code and barcode_id do not apply: Dunkin’ Donuts does not create PLU codes that are unique to an offer.
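The settings.yaml file itself appears only as a picture, so its layout is not reproduced here. The sketch below assumes a hypothetical structure, a per-store mapping with a `pop_smoking_gun` list and a `confidence` value, held in a plain Python dict to show how such an exception could remove global rules for one store.

```python
# Global smoking gun rules named in the slides.
GLOBAL_SMOKING_GUNS = ("barcode_id", "promo_code")

# Hypothetical stand-in for the parsed settings file; the real layout may differ.
SETTINGS = {
    "Dunkin' Donuts": {
        "pop_smoking_gun": ["promo_code", "barcode_id"],
        "confidence": 0.99,
    },
}

def smoking_guns_for(store, settings=SETTINGS):
    """Return the global smoking gun fields minus any the store has popped."""
    popped = settings.get(store, {}).get("pop_smoking_gun", [])
    return tuple(f for f in GLOBAL_SMOKING_GUNS if f not in popped)

print(smoking_guns_for("Dunkin' Donuts"))
print(smoking_guns_for("Bed Bath & Beyond"))
```

A store with no entry keeps every global rule, so exceptions stay opt-in.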

Page 22: Aka examples

Another example

Ignoring PLU for Dunkin Donuts (and other publishers that duplicate promocodes) and going with 99% confidence does the trick.

Page 23: Aka examples

But knowing the store “rules” also helps correct errors (if they stick to unique codes)

http://c346897.r97.cf1.rackcdn.com/cd0faf92-f85e-11e2-9f66-40406c9e1e47.jpg

http://c346897.r97.cf1.rackcdn.com/d32b578e-fd2a-11e2-9be6-40406c9e1e47.jpg

Mechanical Turk expiry: 10/17/2012

Mechanical Turk expiry: 10/7/2012

Since the Bed Bath & Beyond IDs and promo codes indicate the same item, aka can reconcile the mistake.

Page 24: Aka examples

AKA: never misinterpret a store's coupon rules again

ids sharable descrption_text Aka_guid

987120 1 save 20% on your entire purchase bath body works 75926f4f-328f-11e3-a3cd-005056c00008

987271 1 save 20% on your entire purchase bath body works 75926f4f-328f-11e3-a3cd-005056c00008

988484 1 save 20% on your entire purchase bath body works f139439 75926f4f-328f-11e3-a3cd-005056c00008

989519 1 save 20% on your entire purchase bath body works 9522 75926f4f-328f-11e3-a3cd-005056c00008

989774 1 save 20% on your entire purchase bath body works 75926f4f-328f-11e3-a3cd-005056c00008

990040 0 save 20% on your entire purchase bath body works f139492 75926f4f-328f-11e3-a3cd-005056c00008

990943 1 save 20% on your entire purchase bath body works 75926f4f-328f-11e3-a3cd-005056c00008

992970 1 save 20% on your entire purchase bath body works 75926f4f-328f-11e3-a3cd-005056c00008

992998 0 save 20% on your entire purchase bath body works 75926f4f-328f-11e3-a3cd-005056c00008

994314 1 save 20% on your entire purchase bath body works 75926f4f-328f-11e3-a3cd-005056c00008

Ten coupons are all identified as the same item, with some marked sharable and some not. Suppose a publisher had submitted coupon 990040 as not sharable…

Page 25: Aka examples

AKA: never misinterpret a store's coupon rules again

ids sharable descrption_text Aka_guid aka_sharable

987120 1 save 20% on your entire purchase bath body works 75926f4f-328f-11e3-a3cd-005056c00008 0

987271 1 save 20% on your entire purchase bath body works 75926f4f-328f-11e3-a3cd-005056c00008 0

988484 1 save 20% on your entire purchase bath body works f139439 75926f4f-328f-11e3-a3cd-005056c00008 0

989519 1 save 20% on your entire purchase bath body works 9522 75926f4f-328f-11e3-a3cd-005056c00008 0

989774 1 save 20% on your entire purchase bath body works 75926f4f-328f-11e3-a3cd-005056c00008 0

990040 0 save 20% on your entire purchase bath body works f139492 75926f4f-328f-11e3-a3cd-005056c00008 0

990943 1 save 20% on your entire purchase bath body works 75926f4f-328f-11e3-a3cd-005056c00008 0

992970 1 save 20% on your entire purchase bath body works 75926f4f-328f-11e3-a3cd-005056c00008 0

992998 0 save 20% on your entire purchase bath body works 75926f4f-328f-11e3-a3cd-005056c00008 0

994314 1 save 20% on your entire purchase bath body works 75926f4f-328f-11e3-a3cd-005056c00008 0

An easy feature could be to treat a single "not sharable" flag within an aka group as a “presidential” vote and switch the whole group to not sharable. This can also work for items tagged as manufacturer coupons. You’d basically only need 1 tag from Mechanical Turk (or from a classifier).
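The "presidential" vote amounts to letting any single not-sharable coupon decide for its whole group. A minimal sketch, using the `sharable` and `aka_guid` columns from the table above and a hypothetical function name:

```python
def aka_sharable(coupons):
    """Map coupon id -> 0 if any coupon in its aka group is not sharable, else 1."""
    vetoed = {c["aka_guid"] for c in coupons if not c["sharable"]}
    return {c["id"]: 0 if c["aka_guid"] in vetoed else 1 for c in coupons}

GUID = "75926f4f-328f-11e3-a3cd-005056c00008"
coupons = [
    {"id": 987120, "sharable": 1, "aka_guid": GUID},
    {"id": 990040, "sharable": 0, "aka_guid": GUID},
    {"id": 994314, "sharable": 1, "aka_guid": GUID},
]
print(aka_sharable(coupons))
```

The same function would apply to any other per-coupon tag that should spread across a group, such as a manufacturer-coupon flag.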

Page 27: Aka examples

Kroger’s matches

Kroger’s requires the highest confidence of any store, as many of its coupons differ by only a single word. These will match (incorrectly) unless a high confidence is set. Listed below is a sample false match made by AKA:

Page 28: Aka examples

Same item in the database twice for Macy’s

http://c346897.r97.cf1.rackcdn.com/59667340-1588-11e3-a8e3-40406c9e1e47-thumb.jpg

http://c346897.r97.cf1.rackcdn.com/ac4dc266-1588-11e3-a7d0-40406c9e1e47-thumb.jpg

Page 30: Aka examples

Rougher Image Connected with a better version at McDonalds

Page 31: Aka examples

Does a Big Mac by any other name still taste like a Big Mac?

Page 32: Aka examples

Digital and print match

Page 33: Aka examples

More Matches

Page 34: Aka examples
Page 35: Aka examples

Better coupon picture identification

Page 36: Aka examples

Occasional data entry errors can lead to bad reconciliation

aka_guid id barcode_id alt_barcode_id face_value offer_details

2719bf74-40b6-11e3-86dd-22000a91806d 421909 138859 0 $5.00 Off $25.00 Save $5.00 On Your Purchase Of $25.00 Or More

2719bf74-40b6-11e3-86dd-22000a91806d 539197 46299 0 Save $1.00 On Any Aveeno Product

2719bf74-40b6-11e3-86dd-22000a91806d 560927 138859 0 Save $1.00 On any

2719bf74-40b6-11e3-86dd-22000a91806d 595323 138859 0 20% Off 1 Regular Priced Item

Here the 99% reliable barcode_id is identified with 3 different items (for Toys “R” Us)

Page 37: Aka examples

These three items were matched via barcode, which I can only assume is some type of data entry error. For every other Toys “R” Us coupon the smoking gun rules are valid; these items’ barcodes are simply recorded incorrectly.

Page 38: Aka examples

But it is an isolated error

Page 39: Aka examples

Background for entity resolution (aka collective reconciliation, de-duping)

• Chapter 20 of Beautiful Data, “Connecting Data,” by Toby Segaran (who I think likely wrote the chapter while working on the YouTube reconciliation).

• Indrajit Bhattacharya’s PhD dissertation, which you can find at: http://www.lib.umd.edu/drum/handle/1903/4241

• About me: Father of 2 lovely daughters with my wife Emma. Programmer, Statistician, Pot Limit Omaha and Mixed Game poker semi-professional (though I don’t get much time for poker nowadays). I'm located in historic Northfield, MN where I share an office with my Jack Russell Terrier, Kirby.

• Email: [email protected].

Page 40: Aka examples

Questions