Crunching Big Data with BigQuery
Ryan Boyd, Developer Advocatehttp://profiles.google.com/ryan.boyd@ryguyrg
XLDBTuesday, September 11th
2012
How BIG is big?
1 million rows?
1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million
10 million rows?
1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million
1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million
1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million
1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million
100 million rows?
1 million
1 million1 million 1 million
1 million1 million1 million1 million1 million1 million1 million1 million1 million
1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million
1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million
1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million
1 million1 million1 million1 million
1 million1 million1 million1 million1 million
1 million1 million1 million1 million1 million1 million1 million1 million1 million
1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million
1 million1 million1 million1 million1 million1 million1 million
1 million
1 million1 million1 million
1 million1 million1 million1 million
1 million
1 million
1 million1 million1 million
1 million1 million1 million1 million
1 million1 million1 million1 million
1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million
1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million
1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million
1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million
1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million
1 million1 million1 million1 million1 million1 million1 million1 million1 million1 million
500 million rows!
1 million
Big Data at Google
60 hours100 million gigabytes425 million users
Google's internal technology: Dremel
Big Data at Google - Finding top installed market apps
SELECT top(appId, 20) AS app, count(*) AS countFROM installlog.2012;ORDER BY count DESC
Result in ~20 seconds!
Big Data at Google - Finding slow servers
SELECT count(*) AS count, source_machine AS machineFROM product.product_log.liveWHERE elapsed_time > 4000GROUP BY source_machineORDER BY count DESC
Result in ~20 seconds!
BigQuery gives you this power
Store data with reliability, redundancy and consistency
Go from data to meaning
Quickly!
At scale ...
How are developers using it?
Game and social media analytics
Advertising campaign optimization
Sensor data analysis
Infrastructure monitoring
● Show the power● Running your queries● Preparing your data● Visualizing your data● Underlying architecture design● Latest and greatest features
Agenda
Let's dive in!
It's an API
Google Cloud Storage
Upload your Data
BigQuery
Your data is loaded
Google Cloud Storage
Load your data into BigQuery"jobReference":{ "projectId":"605902584318"}, "configuration":{ "load":{ "destinationTable":{ "projectId":"605902584318", "datasetId":"my_dataset", "tableId":"widget_sales"}, "sourceUris":[ "gs://widget-sales-data/2012080100.csv"], "schema":{ "fields":[{ "name":"widget", "type":"string"}, ...
POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
Query Away!
"jobReference":{ "projectId":"605902584318"}, "query":"SELECT TOP(widget, 50), COUNT(*) AS sale_count FROM widget_sales", "maxResults":100, "apiVersion":"v2"}
POST https://www.googleapis.com/bigquery/v2/projects/605902584318/jobs
● Java● Python● .NET● PHP● JavaScript● Apps Script● ... more ...
Libraries
Libraries - Example JavaScript query
var request = gapi.client.bigquery.jobs.query({ 'projectId': project_id, 'timeoutMs': '30000', 'query': 'SELECT state, AVG(mother_age) AS theav FROM [publicdata:samples.natality] WHERE year=2000 AND ever_born=1 GROUP BY state ORDER BY theav DESC;'});
request.execute(function(response) { console.log(response); $.each(response.result.rows, function(i, item) { ...
BigQuery UI
bigquery.cloud.google.com
Preparing your Data
Schema definition
birth_record
parent_id_motherparent_id_fatherpluralityis_maleraceweight
parents
idraceagecigarette_usestate
Schema definition
birth_record
mother_racemother_agemother_cigarette_usemother_statefather_racefather_agefather_cigarette_usefather_statepluralityis_maleraceweight
Schema definition - sharding
birth_record_2011
mother_racemother_agemother_cigarette_usemother_statefather_racefather_agefather_cigarette_usefather_statepluralityis_maleraceweight
birth_record_2012
mother_racemother_agemother_cigarette_usemother_statefather_racefather_agefather_cigarette_usefather_statepluralityis_maleraceweight
birth_record_2013
birth_record_2014
birth_record_2015
birth_record_2016
Data format
1969,1969,1,20,,AL,TRUE,1,7.813,AL,1,20,true1971,1971,5,7,,NY,FALSE,1,7.213,MA,5,7,true2001,2001,12,5,,CA,TRUE,2,6.427,CA,12,5,true
CSV
Tools to prepare your data
● App Engine MapReduce● Commercial ETL tools
● Pervasive● Informatica● Talend
● UNIX command-line
Visualizing your Data
Commercial visualization tools
Google Spreadsheets
Custom code and the Google Visualization API
BigQuery architecture
“ If you do a table scan over a 1TB table, you're going to have a bad time. ”
Anonymous16th century Italian Philosopher-Monk
● Reading 1 TB/ second from disk:● 10k+ disks
● Processing 1 TB / sec:● 5k processors
Goal: Perform a 1 TB table scan in 1 second
Parallelize Parallelize Parallelize!
Data access: Column Store
Record Oriented Storage Column Oriented Storage
Distributed Storage (e.g. GFS)
BigQuery Architecture
Mixer 0
Mixer 1Shard 0-8
Mixer 1Shard 17-24
Mixer 1Shard 9-16
Shard 0 Shard 10 Shard 12 Shard 24Shard 20
Running your Queries
SELECT COUNT(foo), MAX(foo), STDDEV(foo) FROM ...
BigQuery SQL Example: Simple aggregates
SELECT ... FROM ....WHERE REGEXP_MATCH(url, "\.com$") AND user CONTAINS 'test'
BigQuery SQL Example: Complex Processing
SELECT COUNT(*) FROM (SELECT foo ..... )GROUP BY foo
BigQuery SQL Example: Nested SELECT
BigQuery SQL Example: Small JOIN
SELECT huge_table.foo FROM huge_tableJOIN small_table ON small_table.foo = huge_table.foo
Distributed Storage (e.g. GFS)
BigQuery Architecture: Small Join
Mixer 0
Mixer 1Shard 0-8
Mixer 1Shard 17-24
Shard 0 Shard 24Shard 20
Other new features!
Batch queries!
● Don't need interactive queries for some jobs?● priority: "BATCH"
● API● Column-based datastore● Full table scans FAST● Aggregates● Commercial tool support● Use cases
That's it
SELECT questions FROM audience
SELECT 'Thank You!' FROM ryan
http://developers.google.com/bigquery
@ryguyrg http://profiles.google.com/ryan.boyd