1
Google Cloud & Data Pipeline Patterns
@LynnLangit
2
Google Cloud in Australia
Data center here in 2017
3
GCP and Patterns
1. Modern Cloud by Example
Developer-first
• Fast, flexible and cheap
• Virtual Machines / GCE
• Storage / GCS
2. GCP Data Pipeline Patterns
Servers ➡ Containers ➡ Functions
• Data Warehouse
• Internet of Things (IoT)
• Bioinformatics
4
Confidential & Proprietary, Google Cloud Platform
Demo – Storage / GCS
5
6
Demo – Virtual Machines / GCE
7
Virtual Machines / GCE
• Fast
  • Spin up in seconds
  • Tools - SSH, gcloud console
• Flexible
  • Custom sizing – slider
  • OS variety – Linux or Windows
• Cheap and Simple
  • Automatic sustained-use discounts
  • Preemptible VMs
Storage / GCS
• Fast
  • Very fast within region
  • Tools included
• Flexible
  • 4 storage classes
  • Simple to use / understand
• Cheap
  • Pricing by storage class
8
Pipeline Architectures
9
Data Warehousing
10
Big Data > Data Warehouse
Reference table
Query / Compute: BigQuery
Customer Lists / Reference Data
Export Ad Data: Cloud Storage
Id matching: Cloud Dataflow
Marketing List
DoubleClick Campaign Manager
Google Analytics
Relevant Users: Cloud Storage
Analysts
Data Studio 360 Dashboards
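The "Id matching" step in this diagram can be pictured as a join between exported ad events and the customer reference data. The following is a minimal local sketch of that idea in plain Python, not Cloud Dataflow code; the function name, field names (`user_id`, `campaign`, `email`) and sample records are all hypothetical.

```python
from typing import Dict, Iterable, List


def match_relevant_users(
    ad_events: Iterable[Dict[str, str]],
    customers: Dict[str, Dict[str, str]],
) -> List[Dict[str, str]]:
    """Join exported ad events to the customer reference table by user id."""
    matched = []
    seen = set()
    for event in ad_events:
        user_id = event["user_id"]
        # Keep each customer once, and only if the id exists in the reference data.
        if user_id in customers and user_id not in seen:
            seen.add(user_id)
            matched.append(
                {"user_id": user_id, **customers[user_id], "campaign": event["campaign"]}
            )
    return matched


# Hypothetical sample data standing in for the exported ad data and customer list.
ad_events = [
    {"user_id": "u1", "campaign": "spring"},
    {"user_id": "u2", "campaign": "spring"},
    {"user_id": "u1", "campaign": "summer"},
]
customers = {"u1": {"email": "a@example.com"}, "u3": {"email": "c@example.com"}}

relevant_users = match_relevant_users(ad_events, customers)
print(relevant_users)
```

In a real pipeline the join would run as a distributed transform and the result would be written back to storage as the "Relevant Users" list.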
11
Demo – BigQuery
12
Big Data > Log Processing
Batch - Log Storage: Cloud Storage
Streaming - Log Streaming: Cloud Pub/Sub
Log Processing: Cloud Dataflow
Log Analytics: BigQuery
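One way to read this diagram is that the batch path and the streaming path feed the same processing step. The sketch below illustrates that idea locally with a single transform applied to both a list (batch) and a generator (streaming). It is an illustration only, not Dataflow code; the "LEVEL message" log format and all function names are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def parse_log_line(line: str) -> Dict[str, str]:
    """Parse a 'LEVEL message' line (a hypothetical log format)."""
    level, _, message = line.partition(" ")
    return {"level": level, "message": message}


def count_by_level(lines: Iterable[str]) -> Counter:
    """The shared processing step: works on any iterable of log lines."""
    return Counter(parse_log_line(line)["level"] for line in lines if line.strip())


def stream_of_lines() -> Iterator[str]:
    """Stand-in for a streaming source that yields lines as they arrive."""
    yield "ERROR disk full"
    yield "INFO retrying"


# Batch: a complete file already in storage, represented here as a list.
batch_lines = ["INFO started", "ERROR timeout", "INFO finished"]
batch_counts = count_by_level(batch_lines)

# Streaming: the same transform applied to lines arriving one at a time.
stream_counts = count_by_level(stream_of_lines())

print(dict(batch_counts), dict(stream_counts))
```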
13
Big Data > Time Series Analysis
Batch
Storage: BigQuery
Storage: Cloud Storage
Time Series Processing: Cloud Dataflow
Analysis: Cloud Datalab
Storage: Cloud Bigtable*
Processing: Cloud Dataproc
Time Series Files: Cloud Storage
ML: Cloud ML
Streaming Time Series
Streaming: Cloud Pub/Sub
*Note: Use Bigtable with NoSQL workloads of 1 TB or more
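A common time series processing step is to group points into fixed, non-overlapping windows and aggregate each window. The sketch below shows that technique in plain Python under stated assumptions: integer timestamps in seconds, a mean as the aggregate, and made-up readings. It does not use any GCP API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def window_averages(
    points: Iterable[Tuple[int, float]], window_seconds: int
) -> Dict[int, float]:
    """Average (timestamp, value) points in fixed, non-overlapping windows."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in points:
        # Align each point to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# Hypothetical readings: (seconds since start, measurement).
readings = [(0, 10.0), (30, 20.0), (60, 30.0), (90, 50.0), (125, 7.0)]
averages = window_averages(readings, window_seconds=60)
print(averages)
```

The keys of the result are window start times, so each entry summarizes one 60-second slice of the series.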
14
Streaming
Big Data > Complex Event Processing
Cloud Apps: Compute Engine
Streaming
Batch
Push to Devices: App Engine
Rules Engine: Cloud Dataflow
Data Analysis: Cloud Datalab
Mobile Devices: Push Notifications
Report & Share: Business Analysis
Cloud Apps: Compute Engine
On-Premises Databases
On-Premises Applications
Processed Events: Cloud Bigtable (Events Time Series)
Data Warehouse: BigQuery (Execution Results)
Streaming: Cloud Pub/Sub (Transactions)
Processing: Cloud Dataflow (Transaction Streams)
Messaging: Cloud Pub/Sub (Rules Actions)
ETL: Cloud Dataflow (Transform Data)
Cloud Data: Cloud Storage
Rules Engine: Cloud Dataproc
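The "Rules Engine" node in this diagram evaluates incoming events against rules and emits actions. A minimal sketch of that pattern follows, with rules modeled as predicates over an event dictionary. The rule names, thresholds, event fields and action strings are hypothetical, and nothing here is specific to Dataflow or Dataproc.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List

Event = Dict[str, Any]


@dataclass
class Rule:
    name: str
    condition: Callable[[Event], bool]  # returns True when the rule fires
    action: str                         # label of the action to emit


def apply_rules(events: Iterable[Event], rules: List[Rule]) -> List[str]:
    """Evaluate every rule against every event and collect the fired actions."""
    actions = []
    for event in events:
        for rule in rules:
            if rule.condition(event):
                actions.append(f"{rule.action}:{rule.name}:{event['id']}")
    return actions


# Hypothetical rules and transactions.
rules = [
    Rule("large_transaction", lambda e: e["amount"] > 1000, "notify"),
    Rule("foreign_transaction", lambda e: e["country"] != "AU", "review"),
]
transactions = [
    {"id": "t1", "amount": 50, "country": "AU"},
    {"id": "t2", "amount": 5000, "country": "AU"},
    {"id": "t3", "amount": 1200, "country": "NZ"},
]

fired = apply_rules(transactions, rules)
print(fired)
```

In the diagram, the emitted actions would be published to a messaging topic and pushed to devices rather than collected in a list.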
15
Core Products for Data Warehousing
Files
• Cloud Storage
Compute
• BigQuery
• Cloud Dataflow
Other
• 3rd party ETL
• 3rd party dashboards
More on BigQuery…
• Interactive or Batch query
• ANSI SQL compliant
• Cost control - Purchase ‘slots’
• NoOps Data Warehouse
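To make the "ANSI SQL compliant" point concrete, the sketch below runs a standard SQL aggregate query. It uses Python's built-in `sqlite3` module purely as a local stand-in so the example is runnable without a cloud account; it is not BigQuery, and the `page_views` table and its rows are hypothetical.

```python
import sqlite3

# In-memory database used only as a local stand-in for a data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("AU", 120), ("NZ", 30), ("AU", 80)],
)

# A plain SELECT / GROUP BY / ORDER BY query in standard SQL.
query = """
    SELECT country, SUM(views) AS total_views
    FROM page_views
    GROUP BY country
    ORDER BY total_views DESC
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

Dialects differ in details such as functions and data types, so a query like this may need small changes when moved between engines.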
16
Internet of Things
17
Internet of Things > MQTT
IoT Warehouse: BigQuery
IoT Application: App Engine
Stream Analytics: Cloud Dataflow
IoT Topic: Cloud Pub/Sub
MQTT Devices
Auto-scaled Broker Tier
Custom MQTT broker
MQTT Broker
Compute Engine
RabbitMQ
Cloud Load Balancing
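In this architecture the broker tier forwards device messages to a Pub/Sub topic. The sketch below shows one possible shape for that hand-off: an MQTT publish is wrapped as a message with a bytes payload plus string attributes, which is the general structure of a Pub/Sub message. The topic layout `devices/<device_id>/<channel>`, the function name, and the in-memory queue standing in for the topic are all assumptions.

```python
import json
from collections import deque
from typing import Any, Deque, Dict


def to_pubsub_message(mqtt_topic: str, payload: bytes) -> Dict[str, Any]:
    """Wrap an MQTT publish as a message with data plus string attributes."""
    # Assumed topic layout: devices/<device_id>/<channel>
    parts = mqtt_topic.split("/")
    if len(parts) != 3 or parts[0] != "devices":
        raise ValueError(f"unexpected topic: {mqtt_topic}")
    _, device_id, channel = parts
    return {
        "data": payload,
        "attributes": {"device_id": device_id, "channel": channel},
    }


# An in-memory queue standing in for the IoT topic.
iot_topic: Deque[Dict[str, Any]] = deque()

reading = json.dumps({"temp_c": 21.5}).encode("utf-8")
iot_topic.append(to_pubsub_message("devices/sensor-7/telemetry", reading))

message = iot_topic.popleft()
print(message["attributes"], json.loads(message["data"]))
```

Carrying the device id as an attribute lets downstream stream analytics filter or route messages without decoding the payload.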
18
Internet of Things > Sensor stream ingest and processing
Ingest Pipelines
Storage
Analytics
Application & Presentation
Standard Devices: HTTPS
Constrained Devices: Non-TCP, e.g. BLE
Gateway
App Engine, Container Engine
Cloud Storage
Cloud Pub/Sub
Cloud Dataflow
Monitoring
Logging
Cloud Dataflow
Cloud Datastore, Cloud Bigtable
BigQuery
Cloud Dataproc, Cloud Datalab
Compute Engine
19
Retail > Beacons and Targeted Marketing
Events: Cloud Bigtable (Proximity Events)
Analytics: BigQuery (Data Warehouse)
Messaging: Cloud Pub/Sub (Proximity Streams)
Processing: Cloud Dataflow (Stream Processing)
Notifications: App Engine (Push to Devices)
Mobile - Push Notifications
Office Business Systems
Beacons: Proximity Notifications
Messaging: Cloud Pub/Sub (Queued Notifications)
20
Core Products for IoT
Files & Storage
• Cloud Storage
• Cloud Bigtable
Compute & Ingest
• Cloud Pub/Sub
• BigQuery
• Cloud Dataflow
21
Demo – Machine Learning
22
Bioinformatics
23
Life Sciences > Patient Monitoring
Patient
Analytics
Analytics Process
Data: Prediction API
Ingest: Cloud Pub/Sub
Storage: Cloud Bigtable
Alerts / Notifications: Cloud Pub/Sub
Health Care Professional
Patient Monitors (pulse, blood sugar, exercise)
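The alerting path in this diagram can be illustrated with a rolling average over monitor readings, where an alert fires only when the recent average crosses a limit rather than on a single spike. This is a local sketch under stated assumptions: the window size, the 120 bpm limit and the sample pulse values are invented for illustration and are not clinical guidance, and the diagram's Prediction API is not used.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


def pulse_alerts(
    readings: Iterable[int], window: int = 3, max_avg: float = 120.0
) -> List[Tuple[int, float]]:
    """Return (reading index, rolling average) whenever the average exceeds max_avg."""
    recent: Deque[int] = deque(maxlen=window)  # keeps only the last `window` readings
    alerts = []
    for index, bpm in enumerate(readings):
        recent.append(bpm)
        if len(recent) == window:
            avg = sum(recent) / window
            if avg > max_avg:
                alerts.append((index, avg))
    return alerts


# Hypothetical pulse readings in beats per minute.
alerts = pulse_alerts([80, 90, 130, 140, 150, 100])
print(alerts)
```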
24
Life Sciences > Variant Analysis
Private Datasets
Public Datasets
MSSNG Autism: Cloud Storage
Scientist
High Throughput Genome Sequencers
1000 Genomes: Cloud Storage
Patient Data: Cloud Storage
Illumina Platform: Cloud Storage
Ref Genomes: Cloud Storage
TCGA: Cloud Storage
Analytics
Online Analytics: BigQuery
Batch Analytics: Cloud Dataflow
Lab Notebooks: Cloud Datalab
Data Ingest: Genomics
BAM, FASTQ
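FASTQ, one of the ingest formats named here, stores each read as four lines: an `@` header, the sequence, a `+` separator, and a quality string. The sketch below parses that layout and computes a mean quality score. It assumes unwrapped four-line records and Phred+33 quality encoding, reads the whole input into memory for clarity, and uses made-up reads; the names `FastqRecord`, `parse_fastq` and `mean_quality` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from unwrapped, four-line-per-read FASTQ text."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("expected 4 lines per record")
    for i in range(0, len(stripped), 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        yield FastqRecord(header[1:], sequence, quality)


def mean_quality(record: FastqRecord) -> float:
    """Mean Phred score, assuming Phred+33 encoding."""
    return sum(ord(ch) - 33 for ch in record.quality) / len(record.quality)


# Two made-up reads.
sample = "@read1\nGATTACA\n+\nIIIIIII\n@read2\nACGT\n+\n!!!!\n"
records = list(parse_fastq(sample.splitlines()))
print([(r.read_id, mean_quality(r)) for r in records])
```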
25
Life Sciences > Genomics, Secondary Analysis
Ingest
Elastic Cluster
Storage
Analytics
Carrier Interconnect
High Throughput Genome Sequencers
Scientist
Raw Datafiles: Cloud Storage
Processed Data: Cloud Storage
Metadata: Cloud SQL
Lab notebooks: Cloud Datalab
HPC Cluster: Compute Engine (10 Nodes)
Ingest Server: Compute Engine
Online Analytics: BigQuery
Cloud Load Balancing
Cloud Network
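With a cluster of 10 nodes, raw data files have to be divided among the workers somehow. The sketch below shows one simple option, round-robin assignment, as an illustration only; the slide does not say how work is scheduled, and the file names and function name are hypothetical.

```python
from typing import Dict, Iterable, List


def assign_to_nodes(files: Iterable[str], node_count: int = 10) -> Dict[int, List[str]]:
    """Spread files across nodes in round-robin order."""
    assignments: Dict[int, List[str]] = {node: [] for node in range(node_count)}
    for index, name in enumerate(sorted(files)):
        assignments[index % node_count].append(name)
    return assignments


# Hypothetical raw data files waiting in storage.
raw_files = [f"sample_{i:02d}.fastq" for i in range(25)]
plan = assign_to_nodes(raw_files)
print(plan[0], plan[9])
```

Round-robin balances file counts but not file sizes, so a real scheduler would likely weigh sizes or use a work queue instead.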
26
Core Products for Bioinformatics
• Cloud Storage
• BigQuery
• Compute Engine
• Cloud Dataflow
• Public datasets on GCP
27
28
“The Future is Functional” - @LynnLangit