Crunching Data with Google BigQuery: Jordan Tigani at Big Data Spain 2012
Crunching Data with BigQuery Fast analysis of Big Data
Jordan Tigani, Software Engineer
[Slide of binary ASCII art; it decodes to "Answer to the Ultimate ... 42"]
Big Data at Google
72 hours
100 million gigabytes
SELECT
kick_ass_product_plan AS strategy,
AVG(kicking_factor) AS awesomeness
FROM
lots_of_data
GROUP BY
strategy
+-------------+----------------+
| strategy | awesomeness |
+-------------+----------------+
| "Forty-two" | 1000000.01 |
+-------------+----------------+
1 row in result set (10.2 s)
Scanned 100GB
Regular expressions on 13 billion rows...
13 Billion rows
1 TB of data in 4 tables
FAST!
Google's Internal Technology:
Dremel
MapReduce is Flexible but Heavy
[Diagram: a Master spins up Mapper and Reducer workers, which read from and write to Distributed Storage; a second copy of the diagram shows the cycle repeating across Stage 1 and Stage 2]
• Master constructs the plan and begins spinning up workers
• Mappers read and write to distributed storage
• Map => Shuffle => Reduce
• Reducers read and write to distributed storage
Dremel vs MapReduce
• MapReduce
o Flexible batch processing
o High overall throughput
o High latency
• Dremel
o Optimized for interactive SQL queries
o Very low latency
Dremel Architecture
[Diagram: a serving tree; Mixer 0 at the root fans out to Mixer 1 nodes, which fan out to Leaf nodes reading from Distributed Storage]
• Columnar Storage
• Long lived shared serving tree
• Partial Reduction
• Diskless data flow
Simple Query
SELECT
  state, COUNT(*) count_babies
FROM [publicdata:samples.natality]
WHERE
  year >= 1980 AND year < 1990
GROUP BY state
ORDER BY count_babies DESC
LIMIT 10
[Diagram: the query on the serving tree; leaves run SELECT state, year against Distributed Storage, filtering WHERE year >= 1980 AND year < 1990 over O(Rows ~140M), and each computes a partial COUNT(*) GROUP BY state, O(50 states); Mixer 1 nodes merge the partials, still O(50 states); Mixer 0 applies ORDER BY count_babies DESC and LIMIT 10]
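The partial-reduction step in the diagram can be sketched in plain Python. This is an illustrative in-memory model under assumed toy data, not BigQuery code: each leaf aggregates only its own shard, so the mixer merges small (at most ~50-entry) partial results rather than raw rows.

```python
from collections import Counter

def leaf_partial_count(shard_rows):
    """Leaf: filter + partial COUNT(*) GROUP BY state over one shard."""
    return Counter(row["state"] for row in shard_rows
                   if 1980 <= row["year"] < 1990)

def mixer_merge(partials, limit=10):
    """Mixer: merge partial counts, then ORDER BY count DESC LIMIT n."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total.most_common(limit)

# Two hypothetical shards of natality rows.
shards = [
    [{"state": "CA", "year": 1985}, {"state": "TX", "year": 1983}],
    [{"state": "CA", "year": 1987}, {"state": "NY", "year": 1979}],  # 1979 is filtered out
]
top = mixer_merge([leaf_partial_count(s) for s in shards])
# top == [("CA", 2), ("TX", 1)]
```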
Modeling Data
Example: Daily Weather Station Data
weather_station_data
station | lat   | long   | mean_temp | humidity | timestamp  | year | month | day
9384    | 33.57 | 86.75  | 89.3      | .35      | 1351005129 | 2011 | 04    | 19
2857    | 36.77 | 119.72 | 78.5      | .24      | 1351005135 | 2011 | 04    | 19
3475    | 40.77 | 73.98  | 68        | .35      | 1351015930 | 2011 | 04    | 19
etc...
Example: Daily Weather Station Data
station, lat, long, mean_temp, year, mon, day
999999, 36.624, -116.023, 63.6, 2009, 10, 9
911904, 20.963, -156.675, 83.4, 2009, 10, 9
916890, -18.133, 178.433, 76.9, 2009, 10, 9
943320, -20.678, 139.488, 73.8, 2009, 10, 9
CSV
Organizing BigQuery Tables
Your Source
Data
October 22
October 23
October 24
Modeling Event Data: Social Music Store
logs.oct_24_2012_song_activities
USERNAME | ACTIVITY | COST | SONG          | ARTIST     | TIMESTAMP
Michael  | LISTEN   |      | Too Close     | Alex Clare | 1351065562
Michael  | LISTEN   |      | Gangnam Style | PSY        | 1351105150
Jim      | LISTEN   |      | Complications | Deadmau5   | 1351075720
Michael  | PURCHASE | 0.99 | Gangnam Style | PSY        | 1351115962
Users Who Listened to More than 10 Songs/Day
SELECT
UserId, COUNT(*) as ListenActivities
FROM
[logs.oct_24_2012_song_activities]
GROUP EACH BY
UserId
HAVING
ListenActivities > 10
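What GROUP BY ... HAVING computes here can be mirrored in a few lines of Python. A toy sketch with made-up log rows and a threshold of 2 instead of 10 (the slide's query counts all of a user's rows; this sketch filters to LISTEN events explicitly):

```python
from collections import Counter

# Hypothetical (user, activity) log rows for illustration.
events = [
    ("Michael", "LISTEN"), ("Michael", "LISTEN"), ("Jim", "LISTEN"),
    ("Michael", "PURCHASE"), ("Jim", "LISTEN"), ("Michael", "LISTEN"),
]

# GROUP BY user, COUNT(*) over LISTEN rows...
listens = Counter(user for user, activity in events if activity == "LISTEN")
# ...HAVING count > 2
heavy_listeners = {u: n for u, n in listens.items() if n > 2}
# heavy_listeners == {"Michael": 3}
```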
How Many Songs Listened to Total by Listeners of PSY?
SELECT
UserId, count(*) as ListenActivities
FROM
[logs.oct_24_2012_song_activities]
WHERE UserId IN (
SELECT
UserId
FROM
[logs.oct_24_2012_song_activities]
WHERE artist = 'PSY')
GROUP EACH BY UserId
HAVING
ListenActivities > 10
Modeling Event Data: Nested and Repeated Values
{"UserID": "Michael",
 "Listens": [
   {"TrackId":1234,"Title":"Gangnam Style",
    "Artist":"PSY","Timestamp":1351075700},
   {"TrackId":1234,"Title":"Too Close",
    "Artist":"Alex Clare","Timestamp":1351075700}
 ],
 "Purchases": [
   {"Track":2345,"Title":"Gangnam Style",
    "Artist":"PSY","Timestamp":1351075700,"Cost":0.99}
 ]}
JSON
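A query over a repeated field conceptually produces one output row per repeated element. A minimal sketch of that flattening, using field names from the JSON above (the `flatten` helper is hypothetical, not a BigQuery API):

```python
# One user record with a repeated "Listens" field.
record = {
    "UserID": "Michael",
    "Listens": [
        {"TrackId": 1234, "Title": "Gangnam Style", "Artist": "PSY"},
        {"TrackId": 1234, "Title": "Too Close", "Artist": "Alex Clare"},
    ],
}

def flatten(rec):
    """Yield one flat row per element of the repeated Listens field."""
    for listen in rec["Listens"]:
        yield {"UserID": rec["UserID"], **listen}

rows = list(flatten(record))
# Two flat rows, each carrying the parent UserID alongside the listen fields.
```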
Which Users Have Listened to Beyonce?
SELECT
UserID,
COUNT(ListenActivities.artist) WITHIN RECORD
AS song_count
FROM
[logs.oct_24_2012_songactivities]
WHERE
UserID IN (SELECT UserID
FROM [logs.oct_24_2012_songactivities]
WHERE ListenActivities.artist = 'Beyonce');
What Position are PSY songs in our Users' Daily Playlists?
SELECT
UserID,
POSITION(ListenActivities.artist)
FROM
[sample_music_logs.oct_24_2012_songactivities]
WHERE
ListenActivities.artist = 'PSY';
SELECT
AVG(POSITION(ListenActivities.artist))
FROM
[sample_music_logs.oct_24_2012_songactivities],
[sample_music_logs.oct_23_2012_songactivities],
/* etc... */
WHERE
ListenActivities.artist = 'PSY';
Average Position of Songs by PSY in All Daily Playlists?
Summary: Choosing a BigQuery Data Model
• "Shard" your Data Using Multiple Tables
• Source Data Files
• CSV format
• Newline-delimited JSON
• Using Nested and Repeated Records
• Simplify Some Types of Queries
• Often Matches Document Database Models
Developing with BigQuery
Google Cloud Storage
Upload Your Data
BigQuery
Load your Data into BigQuery
POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
"jobReference":{
"projectId":"605902584318"},
"configuration":{
"load":{
"destinationTable":{
"projectId":"605902584318",
"datasetId":"my_dataset",
"tableId":"widget_sales"},
"sourceUris":[
"gs://widget-sales-data/2012080100.csv"],
"schema":{
"fields":[{
"name":"widget",
"type":"string"},
...
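The same load-job body can be assembled programmatically before POSTing it to the endpoint shown above. A sketch using the slide's example project, dataset, table, and gs:// URI; `make_load_job` is a hypothetical helper, and actually submitting the request would additionally need OAuth credentials:

```python
import json

def make_load_job(project, dataset, table, source_uri, fields):
    """Build the BigQuery load-job request body from the slides."""
    return {
        "jobReference": {"projectId": project},
        "configuration": {
            "load": {
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
                "sourceUris": [source_uri],
                "schema": {"fields": fields},
            }
        },
    }

job = make_load_job("605902584318", "my_dataset", "widget_sales",
                    "gs://widget-sales-data/2012080100.csv",
                    [{"name": "widget", "type": "string"}])
body = json.dumps(job)  # POST this body to .../bigquery/v2/projects/605902584318/jobs
```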
Query Away!
POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
"jobReference":{
"projectId":"605902584318",
"query":"SELECT TOP(widget, 50), COUNT(*) AS sale_count
FROM widget_sales",
"maxResults":100,
"apiVersion":"v2"
}
Libraries
• Python
• Java
• .NET
• Ruby
• JavaScript
• Go
• PHP
• Objective-C
Libraries - Example JavaScript Query
var request = gapi.client.bigquery.jobs.query({
'projectId': project_id,
'timeoutMs': '30000',
'query': 'SELECT state, AVG(mother_age) AS theav
FROM [publicdata:samples.natality]
WHERE year=2000 AND ever_born=1
GROUP BY state
ORDER BY theav DESC;'
});
request.execute(function(response) {
console.log(response);
$.each(response.result.rows, function(i, item) {
...
Custom Code and the Google Chart Tools API
Google Spreadsheets
Commercial Visualization Tools
Demo: Using BigQuery on BigQuery
• Full table scans FAST
• Aggregate Queries on Massive Datasets
• Supports Flat and Nested/Repeated Data Models
• It's an API
BigQuery - Aggregate Big Data Analysis in Seconds
Get started now:
http://developers.google.com/bigquery/
SELECT questions FROM audience
SELECT 'Thank You!'
FROM jordan
http://developers.google.com/bigquery
Schema definition
birth_record
parent_id_mother
parent_id_father
plurality
is_male
race
weight
parents
id
race
age
cigarette_use
state
Schema definition
birth_record
mother_race
mother_age
mother_cigarette_use
mother_state
father_race
father_age
father_cigarette_use
father_state
plurality
is_male
race
weight
Tools to prepare your data
• App Engine MapReduce
• Commercial ETL tools
• Pervasive
• Informatica
• Talend
• UNIX command-line
Schema definition - sharding
birth_record_2011
mother_race
mother_age
mother_cigarette_use
mother_state
father_race
father_age
father_cigarette_use
father_state
plurality
is_male
race
weight
birth_record_2012
mother_race
mother_age
mother_cigarette_use
mother_state
father_race
father_age
father_cigarette_use
father_state
plurality
is_male
race
weight
birth_record_2013
birth_record_2014
birth_record_2015
birth_record_2016
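With per-year shards like these, the table a query reads is chosen by name. A tiny helper to build the shard list for a year range and splice it into a query's FROM clause (the helper and the toy query are illustrative, not a BigQuery API):

```python
def shard_tables(prefix, start_year, end_year):
    """List the per-year shard table names covering [start_year, end_year]."""
    return ["%s_%d" % (prefix, y) for y in range(start_year, end_year + 1)]

tables = shard_tables("birth_record", 2011, 2013)
# tables == ["birth_record_2011", "birth_record_2012", "birth_record_2013"]

# Legacy BigQuery SQL unions tables listed in FROM:
query = "SELECT COUNT(*) FROM %s" % ", ".join("[%s]" % t for t in tables)
```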
Visualizing your Data
BigQuery architecture
“ If you do a table scan over a 1TB table,
you're going to have a bad time. ”
Anonymous
16th century Italian Philosopher-Monk
Goal: Perform a 1 TB table scan in 1 second
• Reading 1 TB/second from disk: 10k+ disks
• Processing 1 TB/second: 5k processors
Parallelize, Parallelize, Parallelize!
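A back-of-the-envelope check of those bullets, assuming ~100 MB/s of sequential read per disk and ~200 MB/s of processing per core (both assumed figures, not from the slides):

```python
TB = 10**12                 # bytes in 1 TB
disk_rate = 100 * 10**6     # assumed bytes/s per disk
cpu_rate = 200 * 10**6      # assumed bytes/s per core

disks_needed = TB // disk_rate   # disks to read 1 TB in 1 second
cores_needed = TB // cpu_rate    # cores to process 1 TB in 1 second
# disks_needed == 10000, cores_needed == 5000, matching "10k+ disks" / "5k processors"
```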
Data access: Column Store
[Diagram: the same records laid out as Record Oriented Storage vs. Column Oriented Storage, both on top of Distributed Storage (e.g. GFS)]
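The payoff of the column layout in miniature: a query touching only one field reads one contiguous array instead of picking that field out of every record. A toy sketch with three made-up records:

```python
# Row-oriented: one dict per record.
rows = [
    {"state": "CA", "year": 1985, "weight": 7.1},
    {"state": "TX", "year": 1983, "weight": 8.0},
    {"state": "NY", "year": 1987, "weight": 6.5},
]

# Column-oriented: one array per field.
columns = {field: [r[field] for r in rows] for field in rows[0]}

# A query over just `year` scans only that column...
years_scanned = len(columns["year"])        # 3 values
# ...while a row store reads every field of every record.
fields_scanned = sum(len(r) for r in rows)  # 9 values
```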
BigQuery Architecture
[Diagram: Mixer 0 at the root; Mixer 1 nodes serving Shards 0-8, 9-16, and 17-24; individual leaf shards (Shard 0, Shard 10, Shard 12, Shard 20, Shard 24) below]
Running your Queries
SELECT COUNT(foo), MAX(foo), STDDEV(foo)
FROM ...
BigQuery SQL Example: Simple aggregates
SELECT ... FROM ....
WHERE REGEXP_MATCH(url, "\.com$")
AND user CONTAINS 'test'
BigQuery SQL Example: Complex Processing
SELECT COUNT(*) FROM
(SELECT foo ..... )
GROUP BY foo
BigQuery SQL Example: Nested SELECT
BigQuery SQL Example: Small JOIN
SELECT huge_table.foo
FROM huge_table
JOIN small_table
ON small_table.foo = huge_table.foo
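One common way a small join like this executes (an assumption about the general technique, not a statement of BigQuery internals) is a broadcast hash join: the small table is sent to every shard as a lookup structure, and each shard probes it while scanning its slice of the big table locally.

```python
# Hypothetical tables: a tiny dimension table and one shard of a huge table.
small_table = [{"foo": 1}, {"foo": 3}]
huge_shard = [{"foo": 1, "bar": "a"},
              {"foo": 2, "bar": "b"},
              {"foo": 3, "bar": "c"}]

# Broadcast: build the small table's join keys once, ship to every leaf.
broadcast = {row["foo"] for row in small_table}

# Probe: each leaf filters its shard against the broadcast set.
joined = [row for row in huge_shard if row["foo"] in broadcast]
# joined keeps the foo == 1 and foo == 3 rows
```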
BigQuery Architecture: Small Join
[Diagram: the same serving tree; Mixer 0, Mixer 1 nodes for Shards 0-8 and 17-24, leaf shards (Shard 0, Shard 20, Shard 24) over Distributed Storage (e.g. GFS)]
Other new features!
Batch queries!
• Don't need interactive queries for some jobs?
• priority: "BATCH"
• API
• Column-based datastore
• Full table scans FAST
• Aggregates
• Commercial tool support
• Use cases
That's it
SELECT questions FROM audience
SELECT 'Thank You!'
FROM ryan
http://developers.google.com/bigquery
@ryguyrg http://profiles.google.com/ryan.boyd
A Little Later ...
Row | wp_namespace | Revs
1   | 0   | 53697002
2   | 1   | 6151228
3   | 3   | 5519859
4   | 4   | 4184389
5   | 2   | 3108562
6   | 10  | 1052044
7   | 6   | 877417
8   | 14  | 838940
9   | 5   | 651749
10  | 11  | 192534
11  | 100 | 148135
Underlying table:
• Wikipedia page revision records
• Rows: 314 million
• Byte size: 35.7 GB
Query Stats:
• Scanned 7 GB of data
• <5 seconds
• ~ 100M rows scanned / second
[Diagram: the Wikipedia query on the serving tree; leaves read from Distributed Storage at ~10 GB/s (SELECT wp_namespace, revision_id WHERE timestamp > CUTOFF) and compute partial COUNT(revision_id) GROUP BY wp_namespace; Mixer 1 nodes merge the partials; Mixer 0 applies ORDER BY Revs DESC]
"Multi-stage" Query
SELECT
  contributor_id,
  INTEGER(LOG10(COUNT(revision_id))) LogEdits
FROM [publicdata:samples.wikipedia]
GROUP EACH BY contributor_id
SELECT
LogEdits, COUNT(contributor_id) Contributors
FROM (
SELECT
contributor_id,
INTEGER(LOG10(COUNT(*))) LogEdits
FROM [publicdata:samples.wikipedia]
GROUP EACH BY contributor_id)
GROUP BY LogEdits
ORDER BY LogEdits DESC
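The two stages of that query map onto two Counter passes in plain Python: count revisions per contributor, then histogram contributors by the order of magnitude of their edit count. A sketch with made-up data:

```python
import math
from collections import Counter

# Hypothetical revision log: one entry per (contributor) revision.
revisions = ["alice"] * 120 + ["bob"] * 15 + ["carol"] * 3

# Inner query: COUNT(*) GROUP EACH BY contributor_id.
edits_per_user = Counter(revisions)

# Outer query: INTEGER(LOG10(count)) buckets, COUNT(contributor_id) per bucket.
log_edits = Counter(int(math.log10(n)) for n in edits_per_user.values())
# alice: 120 edits -> bucket 2; bob: 15 -> bucket 1; carol: 3 -> bucket 0
```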
[Diagram: the multi-stage plan; leaves SELECT contributor_id from Distributed Storage; Shufflers repartition rows by contributor_id; workers compute COUNT(*) GROUP BY contributor_id, then partial COUNT(contributor_id) GROUP BY LogEdits; Mixer 1 nodes merge the partials and Mixer 0 applies ORDER BY LogEdits DESC]
When to use EACH
• Shuffle definitely adds some overhead
• Poor query performance if used incorrectly
• GROUP BY
o Groups << Rows => Unbalanced load
o Example: GROUP BY state
• GROUP EACH BY
o Groups ~ Rows
o Example: GROUP EACH BY user_id
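The shuffle behind GROUP EACH BY can be sketched as hash-partitioning on the grouping key: every row for a given key lands on the same worker, so each worker can group its partition independently. An assumed model with three hypothetical workers, not BigQuery code:

```python
def shuffle(rows, key, n_workers=3):
    """Hash-partition rows by `key` across n_workers partitions."""
    partitions = [[] for _ in range(n_workers)]
    for row in rows:
        partitions[hash(row[key]) % n_workers].append(row)
    return partitions

rows = [{"user_id": u} for u in ["a", "b", "a", "c", "b", "a"]]
parts = shuffle(rows, "user_id")
# All rows for any given user_id share one partition, so each worker can
# run its GROUP BY locally without seeing other partitions.
```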