farrot - filter amazon review ratings over time
Problem
Amazon doesn't allow filtering review ratings and totals by state or time
http://youtu.be/w78X0IpjI5c
Data set
Stanford SNAP Amazon reviews: 35 GB, 35M reviews
University of Illinois Amazon member info: 142 MB, member location information
joeme 92 5/26 Cleveland, OH United States Joseph M. Kotow B00006HAXWOH
Pipeline
ImportTsv
SNAP REVIEWS in 10 rows per review
UIC MEMBERLOCATION
TSV
HappyBase
B00006HAXW Rock Rhythm & Doo Wop Greatest Early Rock unknown A1RSDE9-N6RSZF Joseph M Kotow 9/9 5.0 1042502400 Pittsburgh – Home of the OLDIES I have all of the doo wop DVD’s and this one is as good or better than the 1st ones. Rem…
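The "10 rows per review" to one-row conversion can be sketched as follows. The deck says Java MapReduce did this step; the Python below is only a stand-in sketch. It assumes the SNAP `key: value` layout with a blank line between reviews, and the field names in `FIELDS` are an assumption inferred from the order of the sample record above. The sample text is abridged.

```python
# Assumed field order, matching the sample record shown in the deck.
FIELDS = [
    "product/productId",
    "product/title",
    "product/price",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text",
]


def records(lines):
    """Yield one dict per review from an iterable of 'key: value' lines."""
    current = {}
    for raw in lines:
        line = raw.strip()
        if not line:  # a blank line closes the current review
            if current:
                yield current
                current = {}
            continue
        key, sep, value = line.partition(": ")
        if sep and key in FIELDS:
            current[key] = value
    if current:  # last review may not be followed by a blank line
        yield current


def to_tsv_row(record):
    """Flatten one review into a single tab-separated line."""
    cells = (record.get(name, "").replace("\t", " ") for name in FIELDS)
    return "\t".join(cells)


SAMPLE = """\
product/productId: B00006HAXW
product/title: Rock Rhythm & Doo Wop Greatest Early Rock
product/price: unknown
review/userId: A1RSDE9-N6RSZF
review/profileName: Joseph M Kotow
review/helpfulness: 9/9
review/score: 5.0
review/time: 1042502400
review/summary: Pittsburgh - Home of the OLDIES
review/text: I have all of the doo wop DVD's and this one is as good or better than the 1st ones.
"""

rows = [to_tsv_row(r) for r in records(SAMPLE.splitlines())]
for row in rows:
    print(row)
```

Each output line has ten tab-separated cells, which is the shape a TSV loader such as ImportTsv expects as input.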
PIG to CLEAN, JOIN and AGGREGATE review ratings and totals
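The clean, join and aggregate logic can be illustrated in a few lines. This is a Python stand-in for the idea, not the Pig script used in the project. The join key (`user_id`), the `member_state` lookup, and the user ids in the example are hypothetical; the deck does not say which field links reviews to member locations.

```python
from collections import defaultdict


def aggregate(reviews, member_state):
    """Return {(product_id, state): (total_reviews, avg_rating)}.

    reviews: iterable of (product_id, user_id, score) tuples.
    member_state: dict of user_id -> state code (hypothetical join key).
    """
    totals = defaultdict(lambda: [0, 0.0])  # [review count, sum of scores]
    for product_id, user_id, score in reviews:
        state = member_state.get(user_id)  # join: look up reviewer location
        if state is None:  # clean: drop reviews with no known location
            continue
        try:
            rating = float(score)
        except (TypeError, ValueError):  # clean: drop unparsable scores
            continue
        bucket = totals[(product_id, state)]
        bucket[0] += 1
        bucket[1] += rating
    return {key: (n, total / n) for key, (n, total) in totals.items()}


reviews = [
    ("B00006HAXW", "u1", "5.0"),
    ("B00006HAXW", "u2", "4.0"),
    ("B00006HAXW", "u3", "3.0"),  # no location on file, dropped
    ("B00006HAXW", "u4", "bad"),  # unparsable score, dropped
]
member_state = {"u1": "OH", "u2": "OH", "u4": "OH"}

result = aggregate(reviews, member_state)
print(result)
```

Grouping by a longer key, such as product, state and time bucket, would give the per-year, per-month and per-day totals in the same way.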
HBase Schema
Table schemas:
PRODUCTID_STATE, TOTAL REVIEWS, AVG RATING
PRODUCTID_STATE_BYYEAR_EPOCH, TOTAL REVIEWS, AVG RATING
PRODUCTID_STATE_BYMONTH_EPOCH, TOTAL REVIEWS, AVG RATING
PRODUCTID_STATE_BYDAY_EPOCH, TOTAL REVIEWS, AVG RATING
• Example: B00003CWT6_CA_BYMONTH_1008115200000
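A row key in the pattern above can be assembled with a small helper. This is a minimal sketch: `row_key` and `bucket_start_ms` are hypothetical names, and truncating to UTC calendar boundaries is an assumption. The deck does not state how the epoch component was derived, so this truncation may not reproduce the timestamp in the example key.

```python
from datetime import datetime, timezone


def bucket_start_ms(epoch_seconds, granularity):
    """Truncate a review time to the start of its bucket, in milliseconds.

    Assumes UTC calendar boundaries; the project's real rule is not stated.
    """
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    if granularity == "BYYEAR":
        start = datetime(dt.year, 1, 1, tzinfo=timezone.utc)
    elif granularity == "BYMONTH":
        start = datetime(dt.year, dt.month, 1, tzinfo=timezone.utc)
    elif granularity == "BYDAY":
        start = datetime(dt.year, dt.month, dt.day, tzinfo=timezone.utc)
    else:
        raise ValueError(f"unknown granularity: {granularity}")
    return int(start.timestamp()) * 1000


def row_key(product_id, state, granularity=None, bucket_ms=None):
    """Build PRODUCTID_STATE or PRODUCTID_STATE_<GRANULARITY>_<EPOCH>."""
    if granularity is None:  # all-time table has no time component
        return f"{product_id}_{state}"
    return f"{product_id}_{state}_{granularity}_{bucket_ms}"


# Rebuild the example key from its parts.
print(row_key("B00003CWT6", "CA", "BYMONTH", 1008115200000))

# Derive a monthly key from a raw review time (seconds since the epoch).
print(row_key("B00006HAXW", "OH", "BYMONTH",
              bucket_start_ms(1042502400, "BYMONTH")))
```

HBase stores rows sorted by key, so keys that share a `PRODUCTID_STATE_BYMONTH_` prefix sit next to each other and can be read with one prefix or range scan.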
Retrospective
Design Considerations
• HBase was chosen because it is optimized for reads and range scans and scales well
• Data was bucketed by state and by time interval (year, month, day) so that queries read precomputed aggregates instead of recalculating them, at the expense of storage
• Java MapReduce was used to convert multi-row reviews to tabular format
Future
• Scrape Amazon for new reviews
• Filter and display reviews