

JCConf Dataflow Workshop Labs {Simon Su / 20161015}

Index

Lab 1: Prepare the Dataflow environment and create your first project
    Create a GCP project and install the Eclipse development environment
    Install the Google Cloud SDK
    Enable the Dataflow API
    Create your first Dataflow project
    Run your project

Lab 2: Deploy your first project to Google Cloud Platform
    Preparation
    Deploy the project
    Verify the results
    Implement Input/Output/Transform features

Lab 3: Build a streaming Dataflow pipeline
    Create a Pub/Sub topic and subscription
    Deploy the Dataflow streaming sample
    Streaming example 1
    Streaming example 2
    Monitor the Dataflow streaming task from the Dashboard

After the lab

Lab 1: Prepare the Dataflow environment and create your first project

Create a GCP project and install the Eclipse development environment
Please refer to the JCConf 2016 - Dataflow Workshop pre-workshop setup guide.

Install the Google Cloud SDK

● Install the Cloud SDK by following this URL: https://cloud.google.com/sdk/?hl=en_US#download

Page 2: JCConf 2016 - Dataflow Workshop Labs

● Authenticate the Cloud SDK:

  > gcloud auth login
  > gcloud auth application-default login

● Set the default project:

  > gcloud config set project <your-project-id>

● Verify the installation:

  > gcloud config list
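The listed configuration should include the account you authenticated and the project you just set. The output looks roughly like this (the values below are placeholders):

  [core]
  account = you@example.com
  project = your-project-id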

Enable the Dataflow API
Go to the API Manager of your project:

In the API Manager Dashboard, click Enable API:

Search for the Dataflow entry:

Enable it:
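If you prefer the command line, newer versions of the Cloud SDK can also enable the API directly (whether this command group is available depends on your SDK version):

  > gcloud services enable dataflow.googleapis.com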

Page 3: JCConf 2016 - Dataflow Workshop Labs

Create your first Dataflow project
The Eclipse Dataflow wizard walks you through creating a Dataflow project. The steps are as follows:

Step 1: Choose New > Other...

Step 2: Choose Google Cloud Platform > Cloud Dataflow Java Project


Step 3: Enter your project information

Step 4: Enter your Google Cloud Platform project ID and Cloud Storage settings


Step 5: After the project is created, review its status

The generated sample program is shown below:
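The screenshot of the generated sample is not reproduced here. For reference, a minimal sketch of the wizard-generated StarterPipeline of that SDK generation follows: it takes two hard-coded strings, converts them to upper case, and logs them (the exact template may differ slightly between plugin versions):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

@SuppressWarnings("serial")
public class StarterPipeline {
    private static final Logger LOG = LoggerFactory.getLogger(StarterPipeline.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());

        // Two hard-coded input strings...
        p.apply(Create.of("Hello", "World"))
         // ...converted to upper case...
         .apply(ParDo.of(new DoFn<String, String>() {
             @Override
             public void processElement(ProcessContext c) {
                 c.output(c.element().toUpperCase());
             }
         }))
         // ...and written to the log.
         .apply(ParDo.of(new DoFn<String, Void>() {
             @Override
             public void processElement(ProcessContext c) {
                 LOG.info(c.element());
             }
         }));

        p.run();
    }
}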


Run your project


Click the button in the upper-right corner to create a new Dataflow Run Configuration...


Set the Run Configuration name

Set the runner type:

Watch the deployment log...


Lab 2: Deploy your first project to Google Cloud Platform

Preparation
Before starting Lab 2, first make sure the project from Lab 1 runs correctly. You can then modify the project to suit your needs and experiment with the changes...

Deploy the project
Open the Run Configurations window via "Run As > Run Configurations..."

The configuration window looks like this:

You can click the "New Launch Configuration" button (marked in red in the figure below) to create a new configuration...


In this lab, the new configuration requires two settings (see the reference values after this list):
1. Set the main method
2. Set the pipeline arguments
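For the pipeline arguments, running on the Dataflow service follows the pattern given in the javadoc of the streaming sample later in this document; a sketch, with the staging bucket path as a placeholder:

  --project=<YOUR_PROJECT_ID>
  --stagingLocation=gs://<your-bucket>/staging
  --runner=BlockingDataflowPipelineRunner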

Verify the results
While the job runs, the Console view shows its progress; the output looks roughly like this:


While the job is running, follow the instructions in the IDE console to open the Web Console and check the status of the Dataflow task:

The job's detail page looks like this:


You can follow the "LOGS" link to inspect the execution...

Implement Input/Output/Transform features
Modify your project so that it reads a file from Google Cloud Storage...

@SuppressWarnings("serial")
public class TestMain {

    private static final Logger LOG = LoggerFactory.getLogger(TestMain.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());

        // Read the sample file from Cloud Storage.
        p.apply(TextIO.Read.named("sample-book")
                .from("gs://jcconf2016-dataflow-workshop/sample/book-sample.txt"))
         // Upper-case every line.
         .apply(ParDo.of(new DoFn<String, String>() {
             @Override
             public void processElement(ProcessContext c) {
                 c.output(c.element().toUpperCase());
             }
         }))
         // Log each resulting line.
         .apply(ParDo.of(new DoFn<String, Void>() {
             @Override
             public void processElement(ProcessContext c) {
                 LOG.info(c.element());
             }
         }));

        p.run();
    }
}

Next, modify the program to write its output to Google Cloud Storage...

@SuppressWarnings("serial")
public class TestMain {

    private static final Logger LOG = LoggerFactory.getLogger(TestMain.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());

        // Read the sample file from Cloud Storage.
        p.apply(TextIO.Read.named("sample-book")
                .from("gs://jcconf2016-dataflow-workshop/sample/book-sample.txt"))
         // Upper-case every line.
         .apply(ParDo.of(new DoFn<String, String>() {
             @Override
             public void processElement(ProcessContext c) {
                 c.output(c.element().toUpperCase());
             }
         }))
         // Write the result back to Cloud Storage.
         .apply(TextIO.Write.named("output-book")
                .to("gs://jcconf2016-dataflow-workshop/result/book-sample.txt"));

        p.run();
    }
}

Add a transform function that splits each line into words:


@SuppressWarnings("serial")
public class TestMain {

    private static final Logger LOG = LoggerFactory.getLogger(TestMain.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());

        p.apply(TextIO.Read.named("sample-book")
                .from("gs://jcconf2016-dataflow-workshop/sample/book-sample.txt"))
         .apply(ParDo.of(new DoFn<String, String>() {
             // Track the number of empty lines seen by this transform.
             private final Aggregator<Long, Long> emptyLines =
                     createAggregator("emptyLines", new Sum.SumLongFn());

             @Override
             public void processElement(ProcessContext c) {
                 if (c.element().trim().isEmpty()) {
                     emptyLines.addValue(1L);
                 }

                 // Split the line into words.
                 String[] words = c.element().split("[^a-zA-Z']+");

                 // Output each word encountered into the output PCollection.
                 for (String word : words) {
                     if (!word.isEmpty()) {
                         c.output(word);
                     }
                 }
             }
         }))
         .apply(TextIO.Write.named("output-book")
                .to("gs://jcconf2016-dataflow-workshop/result/book-sample.txt"));

        p.run();
    }
}

Word Count Sample - count how many times each word appears in the document

@SuppressWarnings("serial")
public class TestMain {

    static class MyExtractWordsFn extends DoFn<String, String> {
        private final Aggregator<Long, Long> emptyLines =
                createAggregator("emptyLines", new Sum.SumLongFn());

        @Override
        public void processElement(ProcessContext c) {
            if (c.element().trim().isEmpty()) {
                emptyLines.addValue(1L);
            }

            // Split the line into words.
            String[] words = c.element().split("[^a-zA-Z']+");

            // Output each word encountered into the output PCollection.
            for (String word : words) {
                if (!word.isEmpty()) {
                    c.output(word);
                }
            }
        }
    }

    public static class MyCountWords extends
            PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
        @Override
        public PCollection<KV<String, Long>> apply(PCollection<String> lines) {
            // Convert lines of text into individual words.
            PCollection<String> words = lines.apply(ParDo.of(new MyExtractWordsFn()));

            // Count the number of times each word occurs.
            PCollection<KV<String, Long>> wordCounts =
                    words.apply(Count.<String>perElement());

            return wordCounts;
        }
    }

    public static class MyFormatAsTextFn extends DoFn<KV<String, Long>, String> {
        @Override
        public void processElement(ProcessContext c) {
            c.output(c.element().getKey() + ": " + c.element().getValue());
        }
    }

    private static final Logger LOG = LoggerFactory.getLogger(TestMain.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());

        p.apply(TextIO.Read.named("sample-book")
                .from("gs://jcconf2016-dataflow-workshop/sample/book-sample.txt"))
         .apply(new MyCountWords())
         .apply(ParDo.of(new MyFormatAsTextFn()))
         .apply(TextIO.Write.named("output-book")
                .to("gs://jcconf2016-dataflow-workshop/result/book-sample.txt"));

        p.run();
    }
}


Lab 3: Build a streaming Dataflow pipeline

Create a Pub/Sub topic and subscription
Create the topic:

gcloud beta pubsub topics create jcconf2016

Create a subscription on that topic:

gcloud beta pubsub subscriptions create --topic jcconf2016 jcconf2016-sub001
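To feed test data into the streaming examples below, publish a few messages to the topic. With the beta command group of this era the message was passed as a positional argument; newer SDK versions use a --message flag instead:

gcloud beta pubsub topics publish jcconf2016 "hello dataflow"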

Deploy the Dataflow streaming sample

Streaming example 1
Listen on the subscription as the data input, and write the data to the log...


public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args)
            .withValidation().as(Options.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);

    // Read messages from the Pub/Sub subscription created earlier.
    p.apply(PubsubIO.Read.named("my-pubsub-input")
            .subscription("projects/sunny-573/subscriptions/jcconf2016-sub001"))
     // Upper-case each message...
     .apply(ParDo.of(new DoFn<String, String>() {
         @Override
         public void processElement(ProcessContext c) {
             c.output(c.element().toUpperCase());
         }
     }))
     // ...and write it to the log.
     .apply(ParDo.of(new DoFn<String, Void>() {
         @Override
         public void processElement(ProcessContext c) {
             LOG.info(c.element());
         }
     }));

    p.run();
}
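Note that this pipeline consumes an unbounded Pub/Sub source, so it needs to run on the Dataflow service with a streaming-capable runner such as DataflowPipelineRunner; the 1.x SDK's local DirectPipelineRunner did not support unbounded inputs.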

Streaming example 2
Integrate the Word Count example and write the data into a BigQuery dataset...

/*
 * Copyright (C) 2015 Google Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.jcconf2016.demo;

import java.util.ArrayList;
import java.util.List;

import org.joda.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableReference;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.Default;
import com.google.cloud.dataflow.sdk.options.Description;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.options.StreamingOptions;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

/**
 * A streaming Dataflow example: it reads messages from Cloud Pub/Sub, counts
 * words within fixed windows, and writes the per-window counts to BigQuery.
 *
 * <p>
 * To run this example using managed resources in Google Cloud Platform,
 * you should specify the following command-line options:
 * --project=<YOUR_PROJECT_ID>
 * --stagingLocation=<STAGING_LOCATION_IN_CLOUD_STORAGE>
 * --runner=BlockingDataflowPipelineRunner In Eclipse, you can just modify the
 * existing 'SERVICE' run configuration.
 */
@SuppressWarnings("serial")
public class StreamingPipeline {

    static final int WINDOW_SIZE = 1; // Default window duration in minutes

    public static interface Options extends StreamingOptions {
        @Description("Fixed window duration, in minutes")
        @Default.Integer(WINDOW_SIZE)
        Integer getWindowSize();

        void setWindowSize(Integer value);

        @Description("Whether to run the pipeline with unbounded input")
        boolean isUnbounded();

        void setUnbounded(boolean value);
    }

    private static TableReference getTableReference(Options options) {
        TableReference tableRef = new TableReference();
        tableRef.setProjectId("sunny-573");
        tableRef.setDatasetId("jcconf2016");
        tableRef.setTableId("pubsub");
        return tableRef;
    }

    private static TableSchema getSchema() {
        List<TableFieldSchema> fields = new ArrayList<>();
        fields.add(new TableFieldSchema().setName("word").setType("STRING"));
        fields.add(new TableFieldSchema().setName("count").setType("INTEGER"));
        fields.add(new TableFieldSchema().setName("window_timestamp").setType("TIMESTAMP"));
        TableSchema schema = new TableSchema().setFields(fields);
        return schema;
    }

    static class FormatAsTableRowFn extends DoFn<KV<String, Long>, TableRow> {
        @Override
        public void processElement(ProcessContext c) {
            TableRow row = new TableRow()
                    .set("word", c.element().getKey())
                    .set("count", c.element().getValue())
                    // include a field for the window timestamp
                    .set("window_timestamp", c.timestamp().toString());
            c.output(row);
        }
    }

    private static final Logger LOG = LoggerFactory.getLogger(StreamingPipeline.class);

    public static void main(String[] args) {
        Options options = PipelineOptionsFactory.fromArgs(args)
                .withValidation().as(Options.class);
        options.setStreaming(true);
        Pipeline p = Pipeline.create(options);

        // Read messages directly from the Pub/Sub topic.
        PCollection<String> input = p.apply(
                PubsubIO.Read.topic("projects/sunny-573/topics/jcconf2016"));

        // Group the unbounded input into fixed windows.
        PCollection<String> windowedWords = input.apply(
                Window.<String>into(FixedWindows.of(
                        Duration.standardMinutes(options.getWindowSize()))));

        // Reuse the word-count transform from Lab 2.
        PCollection<KV<String, Long>> wordCounts =
                windowedWords.apply(new TestMain.MyCountWords());

        // Convert the counts to rows and write them to BigQuery.
        wordCounts.apply(ParDo.of(new FormatAsTableRowFn()))
                  .apply(BigQueryIO.Write.to(getTableReference(options))
                          .withSchema(getSchema()));

        p.run();
    }
}
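Once rows start arriving, you can sanity-check the output table defined in getTableReference() and getSchema() above with a quick query, for example (standard SQL form; adjust the dialect to your BigQuery UI settings):

#standardSQL
SELECT word, count, window_timestamp
FROM `sunny-573.jcconf2016.pubsub`
ORDER BY window_timestamp DESC
LIMIT 20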

Monitor the Dataflow streaming task from the Dashboard
Open the GCP Web Console and use the Dataflow Dashboard to check the status of each stage of the pipeline.

You can also inspect the execution logs through Cloud Logging...


After the lab
When the lab is over, remember to cancel the Dataflow job (refer to the log output in your IDE for the job ID). Otherwise the streaming Dataflow job keeps running and its worker machines cannot be shut down...

gcloud alpha dataflow jobs --project=sunny-573 cancel 2016-10-14_08_38_48-17987270960467929246
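If you no longer have the job ID at hand, you can list your jobs first (the dataflow command group was still alpha at the time; in newer SDKs the same command is available as gcloud dataflow jobs list):

gcloud alpha dataflow jobs list --project=sunny-573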