austin bdug 2011_01_27_small_and_big_data
DESCRIPTION
Short overview of data infrastructure at Bazaarvoice. We use a combination of many different data stores such as MySQL, SOLR, Infobright, MongoDB and Hadoop.
TRANSCRIPT
![Page 1: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/1.jpg)
I CAN HAS BIG DATA?
Small and Big Data at Bazaarvoice
Alex Pinkin
@apinkin
![Page 2: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/2.jpg)
whois apinkin
● Alex Pinkin: Software Engineering Lead, Data Infrastructure team, Bazaarvoice
● Loves both SQL and NoSQL. Can't commit to one! :-)
![Page 3: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/3.jpg)
Big Data?
![Page 4: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/4.jpg)
A few facts about Bazaarvoice
● Bazaarvoice is a SaaS company powering user generated content such as ratings and reviews on thousands of web sites
● Over 75 Million reviews
● 280 Billion impressions
● 5 Billion Page Views per month
![Page 5: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/5.jpg)
How Do We Do It?
● Client-side integration
● Code and Servers :)
![Page 6: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/6.jpg)
What Do We Run in Prod?
● SQL
  ○ MySQL
  ○ Infobright
● NoSQL
  ○ SOLR
  ○ ElasticSearch
  ○ MongoDB
  ○ CouchDB
  ○ Hadoop
![Page 7: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/7.jpg)
Four Pillars
![Page 8: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/8.jpg)
MySQL and Big Data?!!
● Yes, MySQL is our Master. Mostly used as K/V store.
● Scaling Reads: Replication
● Scaling Writes: Sharding
● HA: Hot Back-up, Multiple DC
● Pros
  ○ Rock solid
  ○ SQL
● Cons
  ○ Inflexible schema
  ○ Replication lag
  ○ Sharding not built-in
  ○ HA
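
Because sharding is not built into MySQL, the application has to decide which server owns each key. The slides do not say how Bazaarvoice routes keys, so the following is only a minimal sketch of one common approach (hash the key, take it modulo the shard count). The shard names and the `shard_for` helper are hypothetical.

```python
import hashlib

# Hypothetical shard names; the slides do not say how many shards exist.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]


def shard_for(key: str, shards=SHARDS) -> str:
    """Pick a shard by hashing the key, so a given key always maps to the same shard."""
    # md5 is used only as a stable, well-distributed hash, not for security.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


# Every read and write for a key goes to the shard chosen here.
target = shard_for("review:42")
```

A known drawback of plain modulo routing is that changing the number of shards remaps most keys, which is one reason application-level sharding counts as a "con".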
![Page 9: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/9.jpg)
Search: SOLR/Lucene
● Document Store
● Inverted Index

| Term | Document IDs |
|---|---|
| rating:5 | 1, 2 |
| rating:4 | 3 |
| productId:12345 | 1, 2, 3 |
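
The term table above can be reproduced with a few lines of code. This is a toy sketch of the inverted-index idea, not how Lucene is implemented. The three review documents and the `build_inverted_index` and `search` helpers are made up to match the slide.

```python
from collections import defaultdict

# Hypothetical review documents, chosen to match the term table on the slide.
reviews = {
    1: {"rating": 5, "productId": 12345},
    2: {"rating": 5, "productId": 12345},
    3: {"rating": 4, "productId": 12345},
}


def build_inverted_index(docs):
    """Map each 'field:value' term to the sorted list of document IDs containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for field, value in docs[doc_id].items():
            index[f"{field}:{value}"].append(doc_id)
    return dict(index)


def search(index, *terms):
    """Return the IDs of documents that match every term (an AND query)."""
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(reviews)
five_star_ids = search(index, "rating:5", "productId:12345")
```

A query such as "5-star reviews for product 12345" becomes an intersection of two ID lists, so no document has to be scanned.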
![Page 10: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/10.jpg)
Analytics
![Page 11: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/11.jpg)
Analytics - Infobright
● Columnar storage
  ○ Compression (10x+)
  ○ Reduced disk I/O
● Partitioning
  ○ Horizontal: Data Packs
  ○ Vertical: Columns
● Knowledge grid
  ○ MIN(C), MAX(C), SUM(C), AVG(C), COUNT(DISTINCT(C))
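
The point of keeping per-pack statistics is that many packs never need to be read. The sketch below illustrates the general idea under simplifying assumptions: tiny packs, a single integer column, and only min, max, sum and count kept per pack. `PackStats`, `build_packs` and `sum_where_greater` are hypothetical names, not Infobright APIs.

```python
from dataclasses import dataclass


@dataclass
class PackStats:
    lo: int     # MIN(C) within the pack
    hi: int     # MAX(C) within the pack
    total: int  # SUM(C) within the pack
    count: int  # number of values in the pack


def build_packs(column, pack_size):
    """Split a column into fixed-size packs and precompute statistics for each."""
    packs = [column[i:i + pack_size] for i in range(0, len(column), pack_size)]
    stats = [PackStats(min(p), max(p), sum(p), len(p)) for p in packs]
    return packs, stats


def sum_where_greater(packs, stats, threshold):
    """Evaluate SUM(C) WHERE C > threshold; return (result, number of packs scanned)."""
    result, scanned = 0, 0
    for pack, s in zip(packs, stats):
        if s.hi <= threshold:
            continue                # no value qualifies: skip the pack entirely
        if s.lo > threshold:
            result += s.total       # every value qualifies: answer from statistics
        else:
            scanned += 1            # mixed pack: the values themselves must be read
            result += sum(v for v in pack if v > threshold)
    return result, scanned


packs, stats = build_packs(list(range(1, 101)), pack_size=25)
result, scanned = sum_where_greater(packs, stats, threshold=60)
```

Here only one of the four packs is actually scanned; the other three are answered or skipped from statistics alone.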
![Page 12: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/12.jpg)
Infobright - Pros and Cons
● Pros
  ○ 30x faster than MySQL on analytics queries
  ○ Open Source
● Cons
  ○ No DML in OSS version
  ○ No MPP (good for up to 5 TB)
![Page 13: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/13.jpg)
Hadoop Use Case
![Page 14: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/14.jpg)
Bazaarvoice EMR - Phase 1
![Page 15: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/15.jpg)
Bazaarvoice EMR - Phase 2
![Page 16: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/16.jpg)
Summary
● We use the best tool for the job
● NoSQL is maturing quickly. Query languages are still in flux though.
● Hadoop is here to stay
● We are (slowly) moving away from MySQL
![Page 17: Austin bdug 2011_01_27_small_and_big_data](https://reader033.vdocuments.net/reader033/viewer/2022051817/547b68b7b379593f2b8b4d7c/html5/thumbnails/17.jpg)
@apinkin