
Data Pipeline Service

User Guide

Issue 05

Date 2018-01-30

HUAWEI TECHNOLOGIES CO., LTD.

Page 2: User Guide · scaling and standard SQL interfaces, UQuery enables you to easily explore and analyze on-cloud data. l Elasticsearch Service Elasticsearch Service (ES) provides a distributed

Copyright © Huawei Technologies Co., Ltd. 2018. All rights reserved.

No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice

The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied.

The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.

Huawei Technologies Co., Ltd.

Address: Huawei Industrial Base, Bantian, Longgang, Shenzhen 518129, People's Republic of China

Website: http://www.huawei.com

Email: [email protected]

Issue 05 (2018-01-30)
Huawei Proprietary and Confidential
Copyright © Huawei Technologies Co., Ltd.


Contents

1 Introduction
1.1 What Is DPS?
1.2 Application Scenarios
1.3 Functions
1.3.1 Pipeline Creation and Management
1.3.2 Pipeline Scheduling
1.3.3 Pipeline Monitoring
1.3.4 Connector Creation and Management
1.3.5 Resource Creation and Management
1.4 Related Services
1.5 Permissions Required for Accessing DPS
1.6 Restrictions
1.7 Basic Concepts

2 Getting Started
2.1 Using MRS and OBS to Process Data
2.2 Using MRS, OBS, and RDS to Process Data

3 Installing DPS Agent
3.1 Overview
3.1.1 Introduction to DPS Agent
3.1.2 Installation Flow
3.2 Installation Preparation
3.2.1 Purchasing Elastic Cloud Server (ECS)
3.2.2 Obtaining an AK/SK Pair
3.2.3 Installing JRE
3.2.4 Configuring hosts File
3.3 Deploying DPS Agent
3.3.1 Installing DPS Agent
3.3.2 Configuring DPS Agent
3.3.3 Starting DPS Agent
3.3.4 Verifying DPS Agent
3.3.5 Stopping DPS Agent
3.4 (Optional) Connecting to DWS Cluster
3.5 Common Operations
3.5.1 Binding EIP
3.5.2 Unbinding EIP
3.5.3 Configuring Security Group
3.5.4 Generating API Gateway Certificate
3.5.5 Using the WCC Tool to Encrypt Passwords
3.5.6 Modifying the Run User and User Group of DPS Agent
3.5.7 Resetting the Password of API Gateway Certificate

4 Working With DPS
4.1 Pipeline Manager
4.1.1 Buying a Pipeline
4.1.2 Editing a Pipeline
4.1.3 Scheduling a Pipeline
4.1.4 Monitoring a Pipeline
4.1.5 Exporting a Pipeline
4.1.6 Stopping a Pipeline
4.1.7 Deleting a Pipeline
4.2 Connector List
4.2.1 Creating a DataSource Connector
4.2.2 Creating a CDM Connector
4.2.3 Creating an ESSource Connector
4.2.4 Editing a Connector
4.2.5 Deleting a Connector
4.3 Resource List
4.3.1 Creating a DIS Resource
4.3.2 Creating an MRS Resource
4.3.3 Creating a CDM Resource
4.3.4 Editing a Resource
4.3.5 Deleting a Resource

5 Configuration Guide
5.1 Data Sources
5.1.1 RDS
5.1.2 HBase
5.1.3 HDFS
5.1.4 OBS
5.1.5 DWS
5.1.6 CDM Source
5.1.7 Dummy
5.1.8 UQuery Table
5.1.9 ES Storage
5.2 Activities
5.2.1 HDFS->HBASE
5.2.2 HDFS<->OBS
5.2.3 Database<->HDFS
5.2.4 UQuery<->OBS
5.2.5 CDM Job
5.2.6 ExecuteCDM
5.2.7 Spark
5.2.8 SparkSQL
5.2.9 Hive
5.2.10 MapReduce
5.2.11 Shell Script
5.2.12 MachineLearning
5.2.13 Elasticsearch
5.2.14 RDS SQL
5.2.15 DWS SQL
5.2.16 UQuery SQL
5.2.17 Create OBS
5.2.18 Delete OBS

6 FAQs
6.1 What Is DPS?
6.2 Which Services Can DPS Schedule?
6.3 How Many Pipelines Can I Create Using the DPS Console?
6.4 What Can DPS Do?
6.5 What Is a Pipeline?
6.6 What Is a Data Source?

A Change History


1 Introduction

1.1 What Is DPS?

Overview

Data Pipeline Service (DPS) is a web service running on the public cloud. It enables you to easily automate the movement and transformation of data between different services.

With DPS, you can define a pipeline to describe data processing tasks, task execution sequences, and task scheduling plans. DPS then schedules and controls the execution of tasks based on the pre-defined scheduling plan and task relationships, achieving inter-service data processing and movement.
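The guide does not expose DPS's internal pipeline representation, but a pipeline of tasks with execution sequences can be pictured as a dependency graph that runs in topological order. The sketch below is a hypothetical illustration of that mental model; the function and task names are not part of DPS.

```python
from collections import deque

# Hypothetical in-memory model of a pipeline: each task lists the tasks
# it depends on, and the scheduler executes tasks in dependency order.
# This is an illustration only, not DPS's actual representation.
def execution_order(tasks):
    """tasks: dict mapping task name -> list of upstream task names."""
    indegree = {t: len(deps) for t, deps in tasks.items()}
    downstream = {t: [] for t in tasks}
    for t, deps in tasks.items():
        for d in deps:
            downstream[d].append(t)
    ready = deque(sorted(t for t, n in indegree.items() if n == 0))
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for nxt in sorted(downstream[t]):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(tasks):
        raise ValueError("pipeline contains a cycle")
    return order

# Example: read from OBS, process on MRS, write results back to OBS.
pipeline = {
    "read_obs": [],
    "mapreduce": ["read_obs"],
    "write_obs": ["mapreduce"],
}
print(execution_order(pipeline))  # ['read_obs', 'mapreduce', 'write_obs']
```

A real scheduler would also apply the per-task scheduling plan; the ordering step shown here is only the dependency-resolution part.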

Highlights

- Visualized Orchestration
  Pipelines are defined in a drag-and-drop manner on a clear GUI, without requiring complex programming. Templates can be imported and exported, and multiple data sources and data processing activities are supported.

- Flexible Scheduling
  Three scheduling modes (periodic, event-driven, and manual); multiple execution policies, including precondition, failure policy, timeout, and retry; automatic operation of pipelines.

- Cost Effectiveness
  Low usage price; dynamic creation and release of compute and storage resources, minimizing DPS expenses.

- Solid Reliability
  Unified console for obtaining pipeline status in real time; automatic retry and recovery of pipeline operation; automatic notification if an exception occurs.
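The execution policies named above (failure policy, timeout, retry) can be illustrated with a small wrapper. DPS configures these through the console; the function below is a hypothetical sketch, not DPS code.

```python
import time

# Hypothetical sketch of retry/timeout execution policies; not DPS code.
def run_with_policy(activity, max_retries=3, timeout_s=60.0, backoff_s=1.0):
    last_err = None
    for attempt in range(1, max_retries + 1):
        start = time.monotonic()
        try:
            result = activity()
            # Timeout is checked after the fact here for simplicity; a
            # real engine would interrupt the running activity instead.
            if time.monotonic() - start > timeout_s:
                raise TimeoutError("activity exceeded timeout")
            return result
        except Exception as err:
            last_err = err
            time.sleep(backoff_s * attempt)  # back off before retrying
    raise RuntimeError(f"activity failed after {max_retries} attempts") from last_err
```

A failed attempt is retried with a growing delay until the retry budget is exhausted, matching the "automatic retry and recovery" behavior described above.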

1.2 Application Scenarios

DPS is applicable to the following scenarios:


- Data movement between services
  For example, you have accumulated a certain amount of data on a service that you purchased and want to transfer data between this service and other services. DPS sets up a data transmission channel between services and provides activities for concurrent data transmission, allowing you to move data between services.

- Scheduled batch task execution
  Deep data analysis often requires a variety of complex tasks. With only a few simple configurations, DPS can schedule and run the pipelines that carry out these tasks.

1.3 Functions

1.3.1 Pipeline Creation and Management

- DPS provides a graphical pipeline editor, which allows you to orchestrate and edit data sources and activities through drag-and-drop operations and build service-based pipelines.

- DPS can integrate with various data sources, such as RDS, OBS, Hadoop Distributed File System (HDFS), and HBase. For details, see Data Sources.

- DPS has a series of pre-packaged activities, enabling you to reliably process or move data. For details, see Activities.

- DPS supports pipeline file import and export. You can export pipeline files to your local PC and import pipeline files to create or edit pipelines.

- DPS provides pre-defined templates, which can be used to create pipelines quickly.

1.3.2 Pipeline Scheduling

- To achieve efficient data processing, DPS supports two scheduling modes:

  - Periodic scheduling: In a given period, DPS automatically runs the pipeline at a specified interval (by month, week, day, hour, or minute).

  - Manual scheduling: You manually trigger the running of a pipeline. In this mode, the pipeline runs only once.

- While a pipeline is running, you can pause it or stop its schedule.
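Periodic scheduling can be pictured as computing a series of run times from a start time and an interval. The helper below is a hypothetical illustration, not a DPS API; month-based intervals need calendar arithmetic and are omitted.

```python
from datetime import datetime, timedelta

# Hypothetical illustration of periodic scheduling: compute the next
# few run times for a fixed interval. DPS sets this up in the console;
# the names below are not a DPS API.
UNIT = {
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
}

def next_runs(start, unit, every, count):
    """Return the next `count` run times after `start`."""
    step = UNIT[unit] * every
    return [start + step * i for i in range(1, count + 1)]

# A pipeline scheduled every 6 hours starting 2018-01-30 00:00:
runs = next_runs(datetime(2018, 1, 30, 0, 0), "hour", 6, 3)
# -> 06:00, 12:00, and 18:00 on 2018-01-30
```

Manual scheduling, by contrast, corresponds to a single run triggered on demand rather than a computed series.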

1.3.3 Pipeline Monitoring

DPS allows you to view:

- Current and historical running details of pipelines.
- Activity running details of each pipeline.

1.3.4 Connector Creation and Management

DPS supports connector creation and management. With this function, you can directly use a created and configured connector as a data source, eliminating the need for duplicate data source configurations.
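The benefit of connectors, one stored configuration referenced by many pipelines, can be sketched as follows. The class and field names below are illustrative only, not part of DPS.

```python
# Hypothetical sketch of connector reuse: one stored configuration
# that many pipelines reference by name. Not a DPS API.
class ConnectorRegistry:
    def __init__(self):
        self._connectors = {}

    def create(self, name, **config):
        if name in self._connectors:
            raise ValueError(f"connector {name!r} already exists")
        self._connectors[name] = dict(config)

    def resolve(self, name):
        # Pipelines refer to the connector by name instead of repeating
        # host/port/credential settings in each data-source definition.
        return self._connectors[name]

registry = ConnectorRegistry()
registry.create("rds_orders", host="rds.example.com", port=3306, database="orders")
source_a = registry.resolve("rds_orders")  # used by one pipeline
source_b = registry.resolve("rds_orders")  # reused by another, no re-entry
```

Editing the stored connector would update every pipeline that resolves it, which is the point of centralizing the configuration.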


1.3.5 Resource Creation and Management

DPS provides a resource list, which allows you to create and manage cloud service resources. Using this list, you can configure resource management and scheduling tasks to automatically create and delete resources, facilitating the use of other cloud service resources.

1.4 Related Services

DPS works with the following services:

- MapReduce Service
  Big data activities supported by DPS run on MapReduce Service (MRS).

- Object Storage Service
  Object Storage Service (OBS) stores data, including the input data and output data of jobs.
  - Input data: user programs and data files.
  - Output data: result files and log files output by a job.

- Relational Database Service
  Relational Database Service (RDS) stores the input and output data of relational databases and processes data.

- Elastic Cloud Server
  Elastic Cloud Server (ECS) is used to deploy DPS Agent. DPS schedules the DPS Agent deployed on the ECS to execute tasks.

- Key Management Service
  Key Management Service (KMS) is used to encrypt and decrypt passwords and private keys that DPS uses to connect to storage or compute resources.

- Data Warehouse Service
  Data Warehouse Service (DWS) is used to store the input and output data of data warehouses and process data.

- Data Ingestion Service
  DPS allows you to manage Data Ingestion Service (DIS); that is, you can create and delete DIS streams on the DPS console.

- Cloud Data Migration
  DPS uses Cloud Data Migration (CDM) to orchestrate and schedule cloud data.

- Machine Learning Service
  DPS uses Machine Learning Service (MLS) to implement data orchestration and scheduling related to machine learning.

- Unified Query Service
  Unified Query Service (UQuery) is a fully managed data query service. With auto scaling and standard SQL interfaces, UQuery enables you to easily explore and analyze on-cloud data.

- Elasticsearch Service
  Elasticsearch Service (ES) provides a distributed RESTful data search and analysis engine.


- Identity and Access Management
  Identity and Access Management (IAM) authenticates access to DPS.

1.5 Permissions Required for Accessing DPS

Background

DPS uses access control lists (ACLs) to control users' permissions to data.

The MetaDB stores the configurations of pipelines created by users as well as the ACLs of pipelines. When a user attempts to retrieve a pipeline, DPS checks the user identity information to determine whether the user has permission to access this pipeline. This protects pipelines against unauthorized access and avoids information disclosure.
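The ACL check described above amounts to a membership test against the pipeline's stored ACL. The function and data layout below are a hypothetical sketch, not the MetaDB's actual schema.

```python
# Hypothetical sketch of the ACL check: retrieval of a pipeline is
# gated on the user (or one of the user's groups) appearing in the
# pipeline's ACL. Illustrative only, not DPS internals.
def can_access(acls, pipeline_id, user, groups):
    """acls: pipeline_id -> set of user and group names allowed access."""
    allowed = acls.get(pipeline_id, set())
    return user in allowed or any(g in allowed for g in groups)

acls = {"pipeline-42": {"alice", "dps-admins"}}
print(can_access(acls, "pipeline-42", "bob", ["dps-admins"]))  # True
print(can_access(acls, "pipeline-42", "bob", ["readers"]))     # False
```

An unknown pipeline resolves to an empty ACL, so access is denied by default rather than granted.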

Permission List

User operation permissions vary with the user groups to which the users belong.

Permissions required for creating a user and creating or modifying a user group must be set on the IAM console. For details, see the Identity and Access Management User Guide.

Table 1-1 describes the permissions of different user groups.

Table 1-1 Permission list

Node Name: Base
Permission Name: Tenant Administrator
Managed Cloud Resource: All services
Description: Permissions to operate all cloud resources owned by an enterprise.

Node Name: DPS
Permission Name: DPS Administrator
Managed Cloud Resource: Data Pipeline Service (DPS)
Description:
Users with both the Tenant Administrator and DPS Administrator permissions can perform the following operations:
- Create, delete, modify, and export pipelines; query the pipeline list.
- Run and stop pipelines; set the schedule configurations for pipelines.
- Create, delete, and modify connectors; query the connector list.
- Create, delete, and modify resources; query the resource list.
Users with only the DPS Administrator permission can perform the following operations:
- Delete, modify, and export pipelines; query the pipeline list.
- Stop pipelines; set the schedule configurations for pipelines.
- Delete connectors; query the connector list.
- Query the resource list.

1.6 Restrictions

Before using DPS, note the following restrictions to ensure that DPS runs properly:

- Recommended browsers for logging in to DPS:
  - Google Chrome 43.0 or later
  - Mozilla Firefox 38.0 or later
  - Internet Explorer 9.0 or later

  Login to the DPS console through Internet Explorer 9.0 may fail. This is because some Windows operating systems (such as Windows 7 Ultimate) disable the admin user by default. You are advised to run browsers as the admin user.

- Do not delete existing processes or files on the DPS Agent node. Otherwise, the Agent will become abnormal, affecting cluster and task running.

1.7 Basic Concepts

Regions and AZs

A region is a geographic area where resources used by your DPS services are located.


DPS services in the same region can communicate with each other over an intranet, but DPS services in different regions cannot.

Public cloud data centers are deployed worldwide, in places such as North America, Europe, and Asia. Creating DPS services in different regions can better suit certain user requirements. For example, applications can be designed to meet user requirements in specific regions or comply with local laws or regulations.

Each region contains many availability zones (AZs) whose power and networks are physically isolated. AZs in the same region can communicate with each other over an intranet. Each AZ provides cost-effective and low-latency network connections that are unaffected by faults that may occur in other AZs.

Project

A project is a collection of resources and the minimum unit for user authorization. Users' resources must be mounted to a project. DPS projects are used to isolate resources between different departments, different program teams, or different environments (such as R&D, test, and production environments) under the same program team.


2 Getting Started

2.1 Using MRS and OBS to Process Data

The procedure for using MRS and OBS to process data is as follows:

1. Scenario

2. Step 1: Logging In to DPS

3. Step 2: Creating Pipeline

4. Step 3: Configuring Pipeline

5. Step 4: Scheduling Pipeline

6. Step 5: Viewing Pipeline Running Information

Scenario

This section illustrates how to use DPS to transfer and process OBS data on the public cloud and save the processed data to a specified OBS bucket.

Figure 2-1 shows the data processing flow.

Figure 2-1 Data processing flow

Data transfer and processing flow:

1. The MapReduce activity of DPS transfers OBS data of the public cloud and the programs developed by the user to MRS.

2. After MRS processes the data, it stores the processed data in a specified OBS bucket.

Logging In to DPS

Step 1 Log in to the management console.

Step 2 Choose All Services > EI Enterprise Intelligence > Data Pipeline Service. The DPS console is opened.

----End

Creating Pipeline

Step 1 Click in the upper left corner on the DPS console and select your region and project.

Step 2 On the Pipeline Manager page, click Buy Pipeline.

Figure 2-2 Buying a pipeline

Step 3 On the Specify Details page, configure the required parameters (as shown in Figure 2-3) and click Buy Now.

Figure 2-3 Specifying service details

Step 4 On the Confirm Specifications page, confirm your order information, and click Next.

Figure 2-4 Confirming order information

Step 5 On the Pay page, select a payment mode and click OK.

After the pipeline is successfully bought, the system redirects you to the Pipeline Manager page.

----End

Configuring Pipeline

Step 1 On the Pipeline Manager page, click Edit in the Operation column for the newly created pipeline.

Step 2 Drag and drop two OBS data sources and one MapReduce activity to the edit grid area, and connect them as shown in Figure 2-5.

Figure 2-5 Connecting the data sources and activity

Step 3 Click the data sources and activity one by one. On the configuration page that is displayed at the right side of the edit grid area, configure the required parameters.

l Data source: For details about how to configure the data source, see Data Sources.

l Activity: For details about how to configure the activity, see Activities.

Step 4 Click . The system checks the parameter validity of the pipeline.

In the displayed dialog box with the message "Are you sure you want to save the pipeline?", click Yes. If the pipeline is valid, it is saved successfully.

----End

Scheduling Pipeline

Step 1 On the Pipeline Manager page, click Schedule in the Operation column for the newly created pipeline.

The Schedule Pipeline dialog box is displayed. Configure the pipeline schedule task as shown in Figure 2-6.

Figure 2-6 Configuring the pipeline schedule task

Step 2 Click OK.

Step 3 Click Run in the Operation column to start the schedule task for the pipeline.

----End

Viewing Pipeline Running Information

Viewing Pipeline Running Status:

Step 1 On the Pipeline Manager page, click the name of the newly created pipeline. You can view the pipeline running information in the Running History area of the displayed page.

Step 2 Click to refresh the pipeline and activity running information.

NOTE

To view the running status of each activity in the pipeline, click at the left side of each running record.

Figure 2-7 Viewing activity running status

----End

Viewing the Output OBS Data:

Step 1 Log in to OBS Browser.

For details, see Object Storage Service Browser Operation Guide.

Step 2 Go to the OBS bucket or directory that stores output data, and view detailed files.

----End

2.2 Using MRS, OBS, and RDS to Process Data

The procedure for using MRS, OBS, and RDS to process data is as follows:

1. Scenario

2. Step 1: Logging In to DPS

3. Step 2: Creating Pipeline

4. Step 3: Configuring Pipeline

5. Step 4: Scheduling Pipeline

6. Step 5: Viewing Pipeline Running Information

Scenario

This section illustrates how to use DPS to transfer and process OBS data on the public cloud and save the processed data to RDS.

Figure 2-8 shows the data processing flow.

Figure 2-8 Data processing flow

Data transfer and processing flow:

1. The MapReduce activity of DPS transfers OBS data of the public cloud and the programs developed by the user to MRS.

2. After MRS processes the data, it stores the processed data in the HDFS of MRS.

3. The Database<->HDFS activity of DPS transfers the data stored in the HDFS to the data table of RDS.

Logging In to DPS

Step 1 Log in to the management console.

Step 2 Choose All Services > EI Enterprise Intelligence > Data Pipeline Service. The DPS console is opened.

----End

Creating Pipeline

Step 1 Click in the upper left corner on the DPS console and select your region and project.

Step 2 On the Pipeline Manager page, click Buy Pipeline.

Figure 2-9 Buying a pipeline

Step 3 On the Specify Details page, configure the required parameters (as shown in Figure 2-10) and click Buy Now.

Figure 2-10 Specifying service details

Step 4 On the Confirm Specifications page, confirm your order information, and click Next.

Figure 2-11 Confirming order information

Step 5 On the Pay page, select a payment mode and click OK.

After the pipeline is successfully bought, the system redirects you to the Pipeline Manager page.

----End

Configuring Pipeline

Step 1 On the Pipeline Manager page, click Edit in the Operation column for the newly created pipeline.

Step 2 Drag and drop the OBS, HDFS, and RDS data sources and the MapReduce and RDS<->HDFS activities to the edit grid area, and connect them as shown in Figure 2-12.

Figure 2-12 Connecting data sources and activities

Step 3 Click the data sources and activities one by one. On the configuration page that is displayed at the right of the edit grid area, configure the required parameters.

l Data source: For details about how to configure the data source, see Data Sources.

l Activity: For details about how to configure the activity, see Activities.

Step 4 Click . The system checks the parameter validity of the pipeline.

In the displayed dialog box with the message "Are you sure you want to save the pipeline?", click Yes. If the pipeline is valid, it is saved successfully.

----End

Scheduling Pipeline

Step 1 On the Pipeline Manager page, click Schedule in the Operation column for the newly created pipeline.

The Schedule Pipeline dialog box is displayed. Configure the pipeline schedule task as shown in Figure 2-13.

Figure 2-13 Configuring the pipeline schedule task

Step 2 Click OK.

Step 3 Click Run in the Operation column to start the schedule task for the pipeline.

----End

Viewing Pipeline Running Information

Viewing Pipeline Running Status:

Step 1 On the Pipeline Manager page, click the name of the newly created pipeline. You can view the pipeline running information in the Running History area of the displayed page.

Step 2 Click to refresh the pipeline and activity running information.

NOTE

To view the running status of each activity in the pipeline, click at the left side of each running record.

If the pipeline fails to run, click View Log in the Operation column (as shown in Figure 2-14). The log helps you find the failure cause.

Figure 2-14 Viewing activity running record

----End

Viewing the Output RDS Data:

Step 1 Use the MySQL client to connect to the RDS MySQL instance as the root user.

For details, see section "Connecting to an RDS MySQL Instance" in Relational Database Service User Guide.

Step 2 Run the database commands to go to the RDS database table storing HDFS data and view the detailed data in the database table.

Commands:

l Selecting a database: use database_name

l Viewing a database table: select * from table_name

----End

3 Installing DPS Agent

3.1 Overview

3.1.1 Introduction to DPS Agent

DPS Agent is a platform provided by Data Pipeline Service (DPS) for running user-defined activities. With DPS Agent, you can develop your own activities, such as Shell scripts, and then schedule and manage them using DPS.

3.1.2 Installation Flow

Figure 3-1 illustrates the installation flow of DPS Agent.

Figure 3-1 DPS Agent installation flowchart

3.2 Installation Preparation

3.2.1 Purchasing Elastic Cloud Server (ECS)

Procedure

Step 1 Purchase an ECS.

For details, see Getting Started > Purchasing an ECS in Elastic Cloud Server Usage Guide.

Step 2 Log in to the ECS.

For details, see Getting Started > Logging In to an ECS in Elastic Cloud Server Usage Guide.

l If no elastic IP address (EIP) is purchased or bound to the ECS, you can log in to the ECS through Virtual Network Computing (VNC).

l If you have purchased an EIP and bound it to the ECS, you can log in to the ECS through Secure Shell (SSH).

NOTE

l You are advised to log in to a Linux-based ECS through SSH (during the login, you are required to enter the username and password). In this installation guide, logging in to the ECS through SSH is used as an example.

l If you need to bind an EIP to the ECS, see Binding EIP.

l If you need to unbind an EIP from the ECS, see Unbinding EIP. After the EIP is unbound from the ECS, you cannot log in to the ECS through SSH.

Step 3 (Optional) Set the security group.

The security group rules take effect in the following directions: inbound and outbound.

l Inbound: External services access the ECS server in the security group.

l Outbound: An ECS server in the security group accesses instances outside the security group.

To prevent malicious attacks, you are required to configure the inbound security group rule and set the outbound security group rule to any IP address (to ensure that DPS Agent can normally access DPS). For details, see Configuring Security Group.

NOTE

For more information about security groups, see Security > Security Group in Virtual Private Cloud User Guide.

----End

3.2.2 Obtaining an AK/SK Pair

Background

Access Key ID/Secret Access Key (AK/SK) files are created by Identity and Access Management (IAM) to authenticate calls to application programming interfaces (APIs) on the public cloud.

During the startup of DPS Agent, DPS Agent uses the AK/SK pair to access DPS. After DPS Agent is started, it uses the AK/SK pair to access and operate other public cloud services.

NOTICE

Before creating the AK/SK pair, ensure that your public cloud account (used to log in to the management console) has passed the real-name authentication.

Procedure

Step 1 Log in to the DPS console.

Step 2 Click your username in the upper right corner of the page, and select Basic Information from the drop-down list.

Step 3 On the Account Info page, click Manage my credentials.

Step 4 On the My Credential page, click the Access Keys tab. Then click Add Access Key. The Add Access Key dialog box is displayed.

If an AK/SK pair has been created, you can directly use it.

NOTE

Each user can create a maximum of two AK/SK pairs. If you want to create a new AK/SK pair, delete an existing one.

Step 5 Enter the required information as prompted and click OK to download the AK/SK file.

NOTE

l If you cancel the download of the AK/SK file, the file cannot be downloaded again.

l Save the downloaded AK/SK pair properly to prevent information leakage.

----End

Follow-up Procedure

If you find that your AK/SK pair is abnormally used (for example, lost or leaked) or will no longer be used, delete it in the IAM system or contact the OBS administrator to reset it.

NOTE

Deleted AK/SK pairs cannot be restored.

3.2.3 Installing JRE

Prerequisites

l You have downloaded the Java Runtime Environment (JRE) installation package of version 1.8.0 or later. Download address: https://www.java.com/en/download/manual.jsp.

l You have obtained the EIP and the root user password of the ECS server.

l PuTTY and WinSCP tools have been installed on the local Windows-based PC.

Procedure

Step 1 Use the PuTTY tool to remotely log in to the ECS server as the root user.

Step 2 Run the following command to create the /opt/jre directory on the ECS server for storing the JRE installation package:

mkdir -p /opt/jre

Step 3 Run the following command to assign permission 777 to the JRE installation directory:

chmod -R 777 /opt/jre

Step 4 Use the WinSCP tool to upload the JRE installation package to the /opt/jre directory.

Step 5 Run the following commands to decompress the JRE installation package:

cd /opt/jre

tar -zxvf 'JRE installation package name'.tar.gz

Step 6 Run the following command to edit the /etc/profile configuration file.

vim /etc/profile

Set the JAVA_HOME configuration item to the JRE installation directory.

export JAVA_HOME=/opt/jre/jre_filename
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

/opt/jre/jre_filename is the path to the JRE installation directory after the JRE installation package is decompressed. You can change it as required.

After the modifications are completed, enter :wq to save the modifications and exit.

Step 7 Run the following command to make the JRE configuration take effect:

source /etc/profile

----End
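The flow above can be rehearsed end to end without a real JRE package. The sketch below builds a dummy tarball under /tmp; the paths /tmp/opt/jre and the directory name jre_demo are illustrative assumptions, not the actual package layout:

```shell
# Rehearsal of Steps 2-7 with a dummy tarball instead of a real JRE
# package; /tmp/opt/jre and jre_demo are placeholder names only.
mkdir -p /tmp/opt/jre/jre_demo
echo demo > /tmp/opt/jre/jre_demo/release
cd /tmp/opt/jre
tar -czf jre_demo.tar.gz jre_demo && rm -r jre_demo
tar -zxvf jre_demo.tar.gz                 # mirrors Step 5 (decompress)
export JAVA_HOME=/tmp/opt/jre/jre_demo    # mirrors the /etc/profile edit
export PATH=$PATH:$JAVA_HOME/bin
echo "JAVA_HOME=$JAVA_HOME"
```

On a real ECS, substitute /opt/jre and the actual decompressed JRE directory, then run java -version to confirm the environment.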

Verification

Run the following command to query the JRE version. If the JRE version is earlier than 1.8.0, uninstall it and re-install JRE 1.8.0 or later.

java -version

3.2.4 Configuring hosts File

Prerequisites

l You have obtained the EIP and the root user password of the ECS server.

l The PuTTY tool has been installed on the local Windows-based PC.

Procedure

Step 1 Use the PuTTY tool to remotely log in to the ECS server as the root user.

Step 2 Run the following commands to view the IP address and host name of the ECS server:

ip address

hostname

As shown in the following output, 192.168.0.43 indicates the IP address of the ECS server. Note that 192.168.0.43 is only an example here.

[root@ecs-192c ~]# ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:59:10:ba brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.43/24 brd 192.168.0.255 scope global dynamic eth0
       valid_lft 73176sec preferred_lft 73176sec
    inet6 fe80::f816:3eff:fe59:10ba/64 scope link
       valid_lft forever preferred_lft forever
[root@ecs-192c ~]# hostname
ecs-192c

Step 3 Run the following command to modify the hosts file:

echo 'IP HOSTNAME' >> /etc/hosts

Parameter description:

IP and HOSTNAME indicate the IP address and host name obtained in Step 2, respectively.

----End
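As a sketch, the address field can also be extracted from the ip address output programmatically before building the hosts entry. The sample line and the target file /tmp/hosts.example below are illustrative stand-ins for the real command output and /etc/hosts:

```shell
# Extract the address from a sample `ip address` line with awk, then
# build the hosts entry; writing to /tmp keeps the example non-destructive.
SAMPLE='inet 192.168.0.43/24 brd 192.168.0.255 scope global dynamic eth0'
IP=$(echo "$SAMPLE" | awk '{sub(/\/.*/, "", $2); print $2}')   # strip the /24 suffix
echo "$IP ecs-192c" > /tmp/hosts.example
cat /tmp/hosts.example
```

On a real ECS you would append the resulting line to /etc/hosts instead, exactly as in Step 3.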

3.3 Deploying DPS Agent

3.3.1 Installing DPS Agent

Prerequisites

l All operations in Installation Preparation have been completed.

l You have obtained the EIP and the root user password of the ECS server.

l The PuTTY tool has been installed on the local Windows-based PC.

Procedure

Step 1 Use the PuTTY tool to remotely log in to the ECS server as the root user.

Step 2 Run the following command to create the /opt/dps directory on the ECS server for storing the DPS Agent installation package:

mkdir -p /opt/dps

NOTE

/opt/dps is the default installation directory. If this directory does not exist, you need to run the preceding command to create it.

Step 3 Run the following commands to download the DPS Agent installation package:

cd /opt/dps

wget http://obs.myhwclouds.com/dps-program/dps-agent.tar.gz.sha256

wget http://obs.myhwclouds.com/dps-program/dps-agent.tar.gz

Step 4 Run the following command to check whether the DPS Agent installation package has been modified:

sha256sum -c dps-agent.tar.gz.sha256

l If OK is returned, the DPS Agent installation package has not been modified. Then go to Step 5.

l If FAILED is returned, the DPS Agent installation package has been modified. Contact technical support to obtain a new installation package.
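The checksum mechanism behind Step 4 can be tried out locally with a stand-in file; the file names below are placeholders, not the real package:

```shell
# Rehearsal of the sha256sum -c integrity check using a locally created
# stand-in file, so no download is needed; names are placeholders.
cd /tmp
echo "example agent payload" > dps-agent-demo.tar.gz
sha256sum dps-agent-demo.tar.gz > dps-agent-demo.tar.gz.sha256
sha256sum -c dps-agent-demo.tar.gz.sha256   # prints "dps-agent-demo.tar.gz: OK"
```

If the payload file were altered after the .sha256 file was written, the same check would report FAILED and exit with a nonzero status.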

Step 5 Run the following command to decompress the DPS Agent installation package:

tar -zxf dps-agent.tar.gz

Step 6 Run the following command to go to the directory generated after the DPS Agent installation package is decompressed:

cd agent

Step 7 Run the following command to run the DPS Agent installation program:

bash bin/install.sh

Enter y as prompted to continue the installation.

After the DPS Agent installation completes, the system displays a message telling you that the installation succeeded and asking you to configure the related files.

----End

3.3.2 Configuring DPS Agent

Prerequisites

l You have obtained the EIP and the root user password of the ECS server.

l The PuTTY tool has been installed on the local Windows-based PC.

l If the certificate verification function is enabled, ensure that the API gateway certificate is available. For details about how to generate an API gateway certificate, see Generating API Gateway Certificate.

Procedure

Step 1 Use the PuTTY tool to remotely log in to the ECS server as the root user.

Step 2 Run the following command to go to the DPS Agent installation directory:

cd /opt/dps/agent

Step 3 Run the following command to modify the DPS Agent configuration file:

vim conf/agent.conf

Table 3-1 Configuration items of the DPS Agent

l agent.name (mandatory): DPS Agent name, which must be 1 to 64 characters long and contain only letters, digits, and underscores (_).

l agent.user.ak (mandatory): AK obtained in Obtaining an AK/SK Pair.
NOTICE: Before creating the AK/SK pair, ensure that your public cloud account (used to log in to the management console) has passed the real-name authentication.

l agent.user.sk (mandatory): Encrypted SK.
1. Obtain the SK. For details, see Obtaining an AK/SK Pair.
2. Use the WCC tool to encrypt the obtained SK. For details, see Using the WCC Tool to Encrypt Passwords.
NOTICE: Before creating the AK/SK pair, ensure that your public cloud account (used to log in to the management console) has passed the real-name authentication.

l agent.apigateway.endpoint (mandatory): Domain name of the public cloud API Gateway address, for example, https://dps.cn-north-1.myhuaweicloud.com. You can obtain the domain name in Regions and Endpoints.

l agent.obs.ip (mandatory): Domain name of the OBS server. You can obtain the domain name in Regions and Endpoints.

l agent.trusted.jks.enabled (mandatory): An indication of whether to enable certificate verification. Default value: false.
- false: Disable certificate verification.
- true: Enable certificate verification. If you enable certificate verification, the following parameters need to be configured: agent.trusted.jks.path, agent.trusted.jksPasswd, and agent.hostname.verify.

l agent.trusted.jks.path (optional): Path to the directory where the API Gateway certificate is stored. For details, see Generating API Gateway Certificate.

l agent.trusted.jksPasswd (optional): Ciphertext (encrypted) password of the API Gateway certificate.
1. Obtain the plaintext certificate password. For details, see Generating API Gateway Certificate.
2. Use the WCC tool to encrypt the plaintext password. For details, see Using the WCC Tool to Encrypt Passwords.

l agent.hostname.verify (optional): An indication of whether to enable domain name verification for the public cloud gateway.
- false: Disable domain name verification.
- true: Enable domain name verification.

----End
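Pulled together, a minimal agent.conf might look like the fragment below. The key=value layout and every value are illustrative assumptions for this guide, not confirmed syntax; the AK and SK in particular are placeholders, and the real SK must be the WCC-encrypted ciphertext:

```
agent.name=dps_agent_01
agent.user.ak=YOUR_ACCESS_KEY_ID
agent.user.sk=WCC_ENCRYPTED_SECRET_KEY
agent.apigateway.endpoint=https://dps.cn-north-1.myhuaweicloud.com
agent.obs.ip=obs.cn-north-1.myhuaweicloud.com
agent.trusted.jks.enabled=false
```

With agent.trusted.jks.enabled set to false, the three certificate-related items can be omitted.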

3.3.3 Starting DPS Agent

Prerequisites

l You have obtained the EIP and the root user password of the ECS server.

l The PuTTY tool has been installed on the local Windows-based PC.

Procedure

Step 1 Use the PuTTY tool to remotely log in to the ECS server as the root user.

Step 2 Run the following command to go to the DPS Agent installation directory:

cd /opt/dps/agent

Step 3 Run the following command to start DPS Agent:

bash bin/agent.sh start

Step 4 Log in to the DPS console.

Step 5 Create a pipeline. For details, see Buying a Pipeline.

Step 6 Edit the pipeline. For details, see Editing a Pipeline.

1. On the Edit page, drag and drop the Shell Script activity to the edit grid area (namely, the canvas), and click the activity.

2. The configuration page is displayed at the right side of the edit grid area. On this configuration page, click the Compute Resource drop-down list, and check whether the DPS Agent started in Step 3 is included in the drop-down list.

– If yes, DPS Agent is successfully started.

– If no, DPS Agent failed to start. Perform Step 3 to restart DPS Agent. If DPS Agent is still not included in the Compute Resource drop-down list after the restart completes, contact technical support.

----End

3.3.4 Verifying DPS Agent

Prerequisites

l You have obtained the EIP and the root user password of the ECS server.

l The PuTTY tool has been installed on the local Windows-based PC.

Procedure

Step 1 Log in to the DPS console.

Step 2 Select the pipeline to be edited, and click Edit in the Operation column. The Edit page is displayed.

Step 3 On the displayed page, drag and drop the Shell Script activity to the edit grid area, and click the activity.

Step 4 The configuration page is displayed at the right side of the edit grid area. On this configuration page, configure the Shell Script activity. Table 3-2 shows the example configurations of the Shell Script activity.

Table 3-2 Configuring the Shell Script activity

l Name (mandatory): Activity name. Example value: ShellScript_1271

l Compute Resource (mandatory): Name of the DPS Agent that has been registered in the ECS server. Example value: test
NOTE: If the installed DPS Agent is not available, contact technical support.

l Script Path (mandatory): Absolute path to the shell script on the ECS server. Example value: /tmp/test.sh

l Log Backup (mandatory): An indication of whether to back up logs. Example value: True

l Destination Log Path (mandatory): Log backup directory. Currently, logs can be backed up only on OBS. Example value: s3a://dpsfile/log

Step 5 Click . In the displayed dialog box, click OK. If all the configurations are valid, a message is displayed, indicating that the pipeline is successfully saved.

If the pipeline fails to be saved, the possible causes are as follows:

l There is a loop in the pipeline.

l There are more than 32 activities in the pipeline.

l The configurations of a data source or activity are invalid.

l The link relationships of an activity are not complete.

Step 6 Use the PuTTY tool to remotely log in to the ECS server as the root user. Run the following commands in the /tmp directory to create the test.sh script:

cd /tmp

touch test.sh

Step 7 Run the following command to edit the test.sh script:

vim test.sh

Enter i, and add the following lines to the test.sh script. Then enter :wq to save the modifications and exit.

BIN_HOME=$(dirname $0)    # Query the script path.
cd $BIN_HOME              # Switch to the directory where the script is stored.
echo "Hello World" > /tmp/result.txt

Step 8 Run the following command to check the test.sh script:

cat test.sh

[datasight@cce-masterinit tmp]# cat test.sh
echo "Hello World" > /tmp/result.txt

Step 9 Run the following command to set the execution permission for the test.sh script:

chmod 750 test.sh

[datasight@cce-masterinit tmp]# chmod 750 test.sh
[datasight@cce-masterinit tmp]# ls -l
total 8
-rwxr-xr-x 1 datasight datasight 37 Apr 10 22:07 test.sh

Step 10 Log in to the DPS console. On the Pipeline Manager page, select the pipeline to be run, and click Run in the Operation column.

Step 11 Click the name of the pipeline. You can view the pipeline and activity running information in the Running History area of the displayed page.

Step 12 Use the PuTTY tool to remotely log in to the ECS server as the root user. Run the following command, and check whether the message Hello World is displayed in the /tmp/result.txt file:

cat /tmp/result.txt

If the following information is displayed after the command is run, DPS Agent runs normally:

[datasight@cce-masterinit tmp]# cat /tmp/result.txt
Hello World

----End

3.3.5 Stopping DPS Agent

Prerequisites

l You have obtained the EIP and the root user password of the ECS server.

l The PuTTY tool has been installed on the local Windows-based PC.

Procedure

Step 1 Use the PuTTY tool to remotely log in to the ECS server as the root user.

Step 2 Run the following command to go to the DPS Agent installation directory:

cd /opt/dps/agent

Step 3 Run the following command to stop DPS Agent:

bash bin/agent.sh stop

----End

3.4 (Optional) Connecting to DWS Cluster

Background

Data Warehouse Service (DWS) is an online data processing database that runs on the public cloud architecture and platform.

DPS provides the DWS activity to help you quickly process and transfer data. For details, see DWS SQL. If you need to use the DWS activity provided by DPS, download and configure the DWS client by following the instructions provided in this section.

Prerequisites

l A DWS cluster has been created, and you have obtained the internal access address, port number, admin account, and password of the cluster.

l You have obtained the EIP and the root user password of the ECS server.

l PuTTY and WinSCP tools have been installed on the local Windows-based PC.

Procedure

Step 1 Download the DWS client file.

1. Log in to the DWS console.

2. Click Connection Management in the left navigation pane. On the displayed page, select the required client type, and click Download.

Step 2 Use the WinSCP tool to upload the DWS client file to the /tmp directory of the ECS server.

Step 3 Configure the DWS client and connect it to the DWS cluster.

1. Use the PuTTY tool to remotely log in to the ECS server as the root user.

2. Run the following command to go to the directory where the DWS client file is stored:

cd /tmp

Data Pipeline Service User Guide · 3 Installing DPS Agent
Issue 05 (2018-01-30) · Huawei Proprietary and Confidential · Copyright © Huawei Technologies Co., Ltd.


3. Run the following command to decompress the DWS client file:
   tar -xvf dws_client_redhat_x64.tar.gz

4. Run the following command to configure the DWS client:
   source gsql_env.sh
   If the following information is displayed, the DWS client is configured successfully:
   All things done.

5. Run the following command to use the gsql tool provided by the DWS client to connect to the database in the DWS cluster:
   gsql -d postgres -h IP -U dbadmin -p PORT -W Password
   Modify the following parameters based on the actual environment:
   - IP: internal access address of the DWS cluster.
   - dbadmin: administrator account of the DWS cluster.
   - PORT: port number of the DWS cluster.
   - Password: password of the administrator.
   If the following information is displayed, the gsql tool is successfully connected to the database:
   postgres=>

6. Run the following command to exit the gsql tool:
   \q

----End
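The connection command in Step 5 can be assembled from the cluster parameters before it is run, which makes scripted connectivity checks easier. A minimal sketch; the helper name and the placeholder values are ours, not part of the DWS client:

```shell
# Sketch: build the gsql command from the DWS cluster parameters listed in Step 5.
# Host, port, and password below are placeholders, not values from a real cluster.
build_gsql_cmd() {  # $1 = cluster IP, $2 = port, $3 = dbadmin password
  echo "gsql -d postgres -h $1 -U dbadmin -p $2 -W $3"
}
build_gsql_cmd 192.168.0.10 8000 'MyPassw0rd'
```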

3.5 Common Operations

3.5.1 Binding EIP

Procedure

Step 1 Log in to the management console.

Step 2 On the homepage, choose Network > Virtual Private Cloud.

Step 3 In the left navigation pane, click Elastic IP Address.

On the displayed Elastic IP Address page, you can purchase and bind the EIP. For details, see Network Components > EIP > Assigning an EIP and Binding It to an ECS in Virtual Private Cloud User Guide.

----End

3.5.2 Unbinding EIP

Procedure

Step 1 Log in to the management console.

Step 2 On the homepage, choose Network > Virtual Private Cloud.



Step 3 In the left navigation pane, click Elastic IP Address.

Step 4 Find the target EIP in the EIP list, and click Unbind in the Operation column.

Step 5 In the displayed dialog box, click OK.

----End

3.5.3 Configuring Security Group

Procedure

Step 1 Log in to the management console.

Step 2 On the homepage, select Network > Virtual Private Cloud.

Step 3 In the left navigation pane, click Security Group.

Step 4 On the displayed Security Group page, click Create Security Group, and then complete the creation of a security group as instructed.

Step 5 Find the newly created security group in the security group list, and click Add Rule in the Operation column. In the displayed Add Rule dialog box, configure the rules for the security group.

NOTE

- Inbound: Set this parameter based on the actual requirements.

- Outbound: Set the configurations by referring to Figure 3-2.

Figure 3-2 Adding security group rules

----End



Follow-Up Operations
After a security group is added, you need to add the newly purchased ECS server to the security group.

Step 1 Log in to the management console.

Step 2 On the homepage, choose Computing > Elastic Cloud Server.

Step 3 In the ECS list, click the name of the target ECS.

Step 4 On the displayed page, click the NIC tab, and click Change Security Group.

Step 5 In the displayed Change Security Group dialog box, select the security group created in Step 4.

----End

3.5.4 Generating API Gateway Certificate

Procedure

Step 1 Remotely log in to the ECS server as the root user.

Step 2 Run the following command to obtain the public cloud API gateway server certificate:

echo -n | openssl s_client -connect IP:PORT | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > apigateway.pem

Parameter description:

IP:PORT indicates the IP address and port number of the public cloud API gateway.

Step 3 Run the following command to generate the gateway.jks certificate:

keytool -import -file apigateway.pem -keystore gateway.jks

After this command is run, the system prompts you to configure the certificate password. This password will be used in subsequent operations. Keep it confidential to protect information security.

Step 4 Run the following command to copy the gateway.jks certificate to the conf directory:

cp gateway.jks /opt/dps/agent/conf/

----End
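Steps 2 to 4 can be collected into one script. The sketch below only prints the commands (a dry run) so the endpoint can be reviewed before anything runs; the gateway address is a placeholder and the helper function is ours, not part of DPS.

```shell
# Dry-run sketch: print the certificate commands from this section with the
# gateway endpoint filled in. GW_IP and GW_PORT are placeholder values.
GW_IP=100.125.0.1
GW_PORT=443
print_cert_cmds() {
  echo "echo -n | openssl s_client -connect $GW_IP:$GW_PORT | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > apigateway.pem"
  echo "keytool -import -file apigateway.pem -keystore gateway.jks"
  echo "cp gateway.jks /opt/dps/agent/conf/"
}
print_cert_cmds
```

Reviewing the printed commands before executing them helps catch a wrong gateway endpoint without touching the keystore.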

3.5.5 Using the WCC Tool to Encrypt Passwords

Procedure

Step 1 Remotely log in to the ECS server as the root user.

Step 2 Run the following command to go to the DPS Agent installation directory:

cd /opt/dps/agent

Step 3 Run the following command to run the WCC tool:

bash bin/encrypt-tool.sh



Enter the plaintext password as prompted.

The WCC tool encrypts the entered plaintext password.

Save the encrypted password properly.

----End

3.5.6 Modifying the Run User and User Group of DPS Agent

Background

After DPS Agent is installed, running DPS Agent as a non-root user is recommended to ensure system security.

This section describes how to change the run user of DPS Agent to the datasight user.

Procedure

Step 1 Remotely log in to the ECS server as the root user.

Step 2 Run the following commands to create the datasight user and user group:

groupadd datasight

useradd datasight -g datasight -m -s /bin/bash

usermod datasight -a -G datasight

Step 3 Run the following command to go to the directory where the DPS Agent installation package is stored, for example, /opt/dps:

cd /opt

Run the following command to modify the user and user group of the DPS Agent installation package:

chown -R datasight:datasight dps

The user and user group of the DPS Agent installation package are changed from root to datasight, as shown in the following:

[root@DPSNCM01 dps]# ll
total 4
-rw-r----- 1 datasight datasight 123027036 Apr 14 18:30 DPSAgent.zip
[root@DPSNCM01 dps]# ll ..
total 8
drwxr-x--- 2 datasight datasight 4096 Apr 14 18:30 dps
drwxr-x--- 2 root root 4096 Apr 14 18:31 jdk

Step 4 Run the following command to switch from the root user to the datasight user:

su datasight

The following information is displayed:

[root@cce-masterinit usr]# su datasight
[datasight@cce-masterinit usr]$

----End
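After the chown in Step 3, ownership can be verified programmatically. The sketch below checks a directory's owner against an expected run user; a temporary directory owned by the current user stands in for /opt/dps so the check can be tried anywhere, and the logic itself is ours, not part of DPS.

```shell
# Sketch: verify directory ownership, as Step 3 expects for /opt/dps.
# A temp directory stands in for the real path so no root access is needed.
dir=$(mktemp -d)
expected=$(id -un)   # in the real case this would be "datasight"
# GNU stat first, BSD stat as a fallback
owner=$(stat -c '%U' "$dir" 2>/dev/null || stat -f '%Su' "$dir")
if [ "$owner" = "$expected" ]; then
  echo "ownership OK"
else
  echo "owner is $owner, expected $expected"
fi
rmdir "$dir"
```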



3.5.7 Resetting the Password of API Gateway Certificate

Procedure

Step 1 Remotely log in to the ECS server as the root user.

Step 2 Run the following commands to go to the DPS Agent installation directory and stop DPS Agent:

cd /opt/dps/agent

bash bin/agent.sh stop

Step 3 Run the following command to go to the directory where the API gateway certificate is stored:

cd /opt/dps/agent/conf

Step 4 Run the following command to change the certificate password:

keytool -storepasswd -keystore gateway.jks

Enter the old certificate password as prompted, and then enter the new certificate password twice. Then the certificate password is changed successfully.

Step 5 Use the WCC tool to encrypt the new plaintext password, and write the generated ciphertext password in the agent.trusted.jksPasswd configuration item of the agent.conf file.

For details, see Using the WCC Tool to Encrypt Passwords and Configuring DPS Agent.

Step 6 Run the following command to start DPS Agent:

cd /opt/dps/agent

bash bin/agent.sh start

----End



4 Working With DPS

4.1 Pipeline Manager

4.1.1 Buying a Pipeline

Scenario

A pipeline is a logical group of activities that execute a data processing task collaboratively. Before using DPS, you need to purchase pipelines.

Prerequisites
- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- The number of pipelines does not exceed the quota.

NOTE

By default, a maximum of 10 pipelines can be created. If this quota cannot satisfy your requirements, click Apply for a higher quota to increase it.

Procedure

Step 1 Log in to the DPS console.

Step 2 Click the icon in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Pipeline Manager.

Step 4 On the Pipeline Manager page, click Buy Pipeline. The system displays the Buy Pipeline page.

Step 5 On the Basic Information page, configure pipeline parameters. Table 4-1 describes the pipeline parameters.



Table 4-1 Pipeline parameters

Pipeline Name: Name of the pipeline. A pipeline name is 1 to 62 characters long and contains only letters, digits, and underscores (_).

Description: Pipeline description.

Preset Data: Preset pipeline configuration mode.
- Not Configured: No configurations are preset for the pipeline.
- Import from File: Import the preset pipeline configurations from a JSON pipeline file.
- Import from Template: Use the preset pipeline configurations in a template provided by DPS.

Region: Current region.

Purchase Quantity: Validity period of the pipeline. After you determine the validity period, DPS automatically calculates the fees you need to pay.
NOTE: For billing details, click Price Details in Price.

Step 6 Click Buy Now.

Step 7 On the Confirm Specifications page, confirm your order information, and click Next.

Step 8 Select one of the following payment modes: Coupons, Balance, Online Payment, or Pay by transfer and remittance.

Step 9 Click OK. The service is purchased.

----End

4.1.2 Editing a Pipeline

Scenario
You can edit the data sources and activities of the pipeline that you have purchased if necessary.

Prerequisites
- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- Pipelines have been purchased.



Procedure

Step 1 Log in to the DPS console.

Step 2 Click the icon in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Pipeline Manager.

Step 4 On the Pipeline Manager page, enter the name of the pipeline in the search box at the upper right corner, and click the search icon.

Step 5 Click Edit in the Operation column for a created pipeline. The Edit page is displayed.

The left side of the Edit page is divided into two sections:

l Data Sources. For details, see Data Sources.

l Activities. For details, see Activities.

Step 6 Drag any data source or activity and drop it in the edit grid area (Canvas) on the right side. The following process uses an OBS data source as an example of how to configure a data source or activity:

1. Drag and drop the OBS data source to the edit grid area. Click the OBS data source.

2. The configuration page is displayed at the right side of the edit grid area. Configure the OBS properties.

NOTE

You can also click the import icon on the upper part of the edit grid area to import the pipeline file from your local directory. The newly imported pipeline information will overwrite the existing one.

Step 7 Hover over the icon of the OBS data source. Then a link icon appears. Drag this icon to link the OBS data source to an activity.

Figure 4-1 shows a successfully linked pipeline.

Figure 4-1 Successfully linked pipeline

Table 4-2 describes the link relationships between data sources and activities.



Table 4-2 Link relationships between data sources and activities

HDFS<->OBS
  OBS -> [HDFS<->OBS] -> HDFS
  HDFS -> [HDFS<->OBS] -> OBS

Database<->HDFS
  HDFS -> [Database<->HDFS] -> RDS
  RDS -> [Database<->HDFS] -> HDFS

HDFS->HBASE
  HDFS -> [HDFS->HBASE] -> HBase

UQuery<->OBS
  OBS -> [UQuery<->OBS] -> UQuery Table
  UQuery Table -> [UQuery<->OBS] -> OBS

ExecuteCDM
  CDM Source/OBS -> ExecuteCDM -> CDM Source/OBS

CDM Job
  Any data source -> CDM Job -> any data source
  The CDM Job activity can be connected to the Shell Script, CDM Job, Create OBS, and Delete OBS activities.

SparkSQL
  Any data source -> SparkSQL -> any data source

Spark
  OBS/HDFS/Dummy -> Spark -> OBS/HDFS/Dummy

Hive
  OBS/HDFS/Dummy -> Hive -> OBS/HDFS/Dummy

MapReduce
  OBS/HDFS/Dummy -> MapReduce -> OBS/HDFS/Dummy

Shell Script
  Any data source -> Shell Script -> any data source
  The Shell Script activity can be connected to the Shell Script, CDM Job, Create OBS, and Delete OBS activities.

MachineLearning
  HDFS/Dummy -> MachineLearning -> HDFS/Dummy

Elasticsearch
  OBS/ES Storage/Dummy -> Elasticsearch -> ES Storage
  ES Storage -> Elasticsearch -> OBS/ES Storage/Dummy

RDS SQL
  RDS -> RDS SQL -> RDS

DWS SQL
  DWS -> DWS SQL -> DWS

UQuery SQL
  OBS -> UQuery SQL -> UQuery Table
  UQuery Table -> UQuery SQL -> UQuery Table

Create OBS
  Any data source -> Create OBS -> any data source
  The Create OBS activity can be connected to the Shell Script, CDM Job, Create OBS, and Delete OBS activities.

Delete OBS
  Any data source -> Delete OBS -> any data source
  The Delete OBS activity can be connected to the Shell Script, CDM Job, Create OBS, and Delete OBS activities.



Step 8 Click the save icon. A dialog box with the message "Are you sure you want to submit the pipeline?" is displayed. Click Yes.

If the pipeline fails to be saved, the possible causes are as follows:

- An isolated data source exists in the pipeline.
- There is a loop in the pipeline.
- There are more than 32 activities in a pipeline.
- The configurations of a data source or activity are invalid.
- The link relationships of an activity are not complete.

Step 9 (Optional) Click the run icon to run the pipeline.

After the pipeline runs, you can view the current running status and running result of each activity in the edit grid area.

If you want to view the running status of the pipeline after you exit the editing page, use one of the following methods:

- On the Pipeline Manager page, click the icon on the left of the pipeline name.
- On the Pipeline Manager page, click the pipeline name. For details, see Monitoring a Pipeline.

Step 10 (Optional) Click the export icon to export pipeline data as a JSON pipeline file to your local PC.

NOTE

The Pipeline Manager page also provides a function for you to export pipeline data. For details, see Exporting a Pipeline.

Step 11 (Optional) Click the note icon to add a note to the edit grid area for remarks.

To add a note for a data source or activity, select the data source or activity and click the note icon. Alternatively, right-click the data source or activity in the edit grid area and choose Add Note from the shortcut menu.

Constraints on using notes:

- Each note can contain a maximum of 1000 English characters.
- Each data source or activity can have multiple notes.
- A pipeline can have a maximum of 40 notes.

----End

4.1.3 Scheduling a Pipeline

Scenario

After editing a pipeline, you can configure a pipeline scheduling mode: manual or automatic.



Prerequisites
- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- The pipeline has been edited and is stopped.

Procedure

Step 1 Log in to the DPS console.

Step 2 Click the icon in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Pipeline Manager.

Step 4 On the Pipeline Manager page, enter the name of the pipeline in the search box at the upper right corner, and click the search icon.

Step 5 Click Schedule in the Operation column of a pipeline. The Schedule Pipeline dialog box is displayed. Configure the pipeline schedule task by referring to Table 4-3.

Table 4-3 Pipeline schedule parameters

Schedule Type: Pipeline schedule type. Options are as follows:
- Run once: The pipeline will be run only once.
- Run periodically: The pipeline will be run periodically.

Running Cycle: Interval at which the pipeline runs. This parameter is displayed when Schedule Type is set to Run periodically.

Start Time: Time at which the pipeline schedule task starts. It must be earlier than the end time. This parameter is displayed when Schedule Type is set to Run periodically.

End Time: Time at which the pipeline schedule task ends. This parameter is displayed when Schedule Type is set to Run periodically.

Cross-Cycle Dependency: Dependency between instances of the same pipeline. Options are as follows:
- Not dependent on the previous schedule cycle.
- Self-dependent: The current schedule task can continue only after the previous schedule cycle ends.
This parameter is displayed when Schedule Type is set to Run periodically.



Dependency pipeline: Pipeline A cannot depend on Pipeline B in any of the following circumstances:
- The running cycle of Pipeline B is longer than that of Pipeline A.
- Pipeline B's running cycle is set to hours or minutes, while that of Pipeline A is set to weeks.
This parameter is displayed when Schedule Type is set to Run periodically.

Dependency execution strategy: Pipeline execution policy when a pipeline depends on other pipelines. Options are as follows:
- success: The current pipeline is executed only when the pipeline instances on which it depends are executed successfully.
- any result: The current pipeline is executed once the pipeline instances on which it depends have finished, regardless of the execution result.

Step 6 Click OK to save the schedule configurations.

Step 7 Click Run in the Operation column for the pipeline. Then the Run button changes to Pause.
- If Schedule Type is set to Run once, the pipeline starts to run.
- If Schedule Type is set to Run periodically, the pipeline starts to run at the preset time.

----End
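The dependency restrictions in Table 4-3 amount to a unit comparison plus one special case for weekly pipelines. The sketch below encodes just those two documented rules; the cycle units and helper names are assumptions for illustration, and DPS performs this validation itself:

```shell
# Sketch of the Table 4-3 rules: pipeline A may depend on pipeline B only if
# B's running cycle is not longer than A's, and a weekly A may not depend on
# an hourly or minute-level B.
cycle_minutes() {
  case "$1" in
    minutes) echo 1 ;;
    hours)   echo 60 ;;
    days)    echo 1440 ;;
    weeks)   echo 10080 ;;
  esac
}
can_depend() {  # $1 = cycle of pipeline A, $2 = cycle of pipeline B
  a=$(cycle_minutes "$1")
  b=$(cycle_minutes "$2")
  if [ "$b" -gt "$a" ]; then
    echo no                      # B's cycle is longer than A's
  elif [ "$1" = weeks ] && { [ "$2" = hours ] || [ "$2" = minutes ]; }; then
    echo no                      # weekly A may not depend on hourly/minute B
  else
    echo yes
  fi
}
can_depend days hours   # prints "yes"
```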

4.1.4 Monitoring a Pipeline

Scenario

You can view the running status and log information of pipelines and activities if necessary.

Prerequisites
- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- Pipelines have been purchased.

Procedure

Step 1 Log in to the DPS console.

Step 2 Click the icon in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Pipeline Manager.



Step 4 On the Pipeline Manager page, enter the name of the pipeline in the search box at the upper right corner, and click the search icon.

Step 5 Click the name of the pipeline to be monitored. You can view the pipeline monitoring information in the Running History area of the displayed page.

Step 6 Click the refresh icon to refresh the monitoring information.

Table 4-4 Pipeline monitoring parameters

Status: Status of the schedule task, which can be Success, Failed, Running, Paused, Deleted, or Canceled.
Running Duration (min): Running duration of the pipeline.
Start Time: Time at which the pipeline starts to run.
End Time: Time at which the pipeline stops running.
Instance Generation Time: Time at which the instance was generated.
Running Type: Scheduling mode of the pipeline.

Click the icon on the left of a pipeline running record. Then the running information about each activity of the pipeline is displayed.

Table 4-5 Activity monitoring parameters

Name: Activity name.
Type: Activity type.
Status: Activity status, which can be Success, Failed, Running, Paused, Deleted, or Canceled.
Running Duration (min): Running duration of the activity.
Start Time: Time at which the activity starts to run.
Retry Count: Number of retries upon an activity execution failure.



Operation: View Log. You can query the logs of activities in the Success or Failed state. Logs cannot be queried in the following scenarios:
- Logs of the SparkSQL and MachineLearning activities cannot be queried.
- If the log backup property for activities is set to false, no logs can be viewed. For details, see Activities.
NOTE: If an activity encounters a fault, logs help quickly locate and resolve the fault.

Error Message: Error message that is displayed.

----End

4.1.5 Exporting a Pipeline

Scenario

If you need to back up a pipeline, or use the existing pipeline as a new pipeline template, export a JSON pipeline file.

Prerequisites
- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- Pipelines have been purchased.

Procedure

Step 1 Log in to the DPS console.

Step 2 Click the icon in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Pipeline Manager.

Step 4 On the Pipeline Manager page, enter the name of the pipeline in the search box at the upper right corner, and click the search icon.

Step 5 Select the pipeline to be exported, and choose More > Export in the Operation column to export the pipeline data as a JSON pipeline file.

NOTE

The exported JSON pipeline file does not contain the pipeline schedule configurations or sensitive information (such as accounts and passwords).

----End



4.1.6 Stopping a Pipeline

Scenario

You can stop a pipeline if necessary.

Prerequisites
- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- The pipeline is not in the Stopped, Deleted, or Frozen state.

Procedure

Step 1 Log in to the DPS console.

Step 2 Click the icon in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Pipeline Manager.

Step 4 On the Pipeline Manager page, enter the name of the pipeline in the search box at the upper right corner, and click the search icon.

Step 5 Choose More > Stop in the Operation column for a pipeline. In the displayed dialog box, click OK to confirm your operation.

----End

4.1.7 Deleting a Pipeline

Scenario

You can delete a pipeline if the pipeline will not be used any longer.

Prerequisites
- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- Ensure that services are not affected after you delete the pipeline. If you need to back up pipelines, see Exporting a Pipeline.

Procedure

Step 1 Log in to the DPS console.

Step 2 Click the icon in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Pipeline Manager.

Step 4 On the Pipeline Manager page, enter the name of the pipeline in the search box at the upper right corner, and click the search icon.



Step 5 Choose More > Delete in the Operation column for a pipeline. In the displayed dialog box, click OK to confirm your operation.

NOTE

A deleted pipeline can be restored.

----End

4.2 Connector List

4.2.1 Creating a DataSource Connector

Scenario
A DataSource connector records the connection information of RDS and DWS data sources. A defined DataSource connector is available to RDS and DWS data sources.

A connector can be used by more than one DPS data source. If the information about the connector changes, you only need to modify the connector configurations in the connector list, and these modified configurations are automatically updated in the data sources of the pipeline.

Prerequisites
- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- The current number of connectors does not reach the connector quota. The connector quota is 20.
- You have obtained the username, password, and URL of the data source.

Procedure

Step 1 Log in to the DPS console.

Step 2 Click the icon in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Connector List.

Step 4 On the Connector List page, click Create Connector.

Step 5 In the displayed dialog box, select DataSource from the Connector Type drop-down list. Configure the DataSource parameters by referring to Table 4-6.



Table 4-6 DataSource parameters (all parameters are mandatory)

Connector Name: Connector name. A connector name is 1 to 64 characters long and contains only letters, digits, and underscores (_). Example: dps_database_123

Database Driver Name: Name of the database driver.
- com.mysql.jdbc.Driver: used for RDS data sources.
- org.postgresql.Driver: used for DWS data sources.
Example: com.mysql.jdbc.Driver

Connector URL: URL to the database. Example: jdbc:mysql://IP:PORT

Database Name: Database name. Example: dps

Username: Username for logging in to the database. Example: dpsadmin

Password: Password for logging in to the database.

Drive Path: Path to the JDBC driver. Download the JDBC driver from the MySQL official website as required and upload it to the OBS bucket. If Database Driver Name is set to com.mysql.jdbc.Driver, use the mysql-connector-java-5.1.21.jar driver. Example: s3a://dpsfile/mysql-connector-java-5.1.21.jar

KMS Encryption: Use KMS to encrypt and decrypt user passwords and private keys. Options: keys created in KMS. Example: dps/default

NOTE

KMS is the key management service provided by the public cloud. To create or manage keys, log in to the management console, and choose Security > Key Management Service on the homepage to open the KMS console.

Step 6 Click OK. A connector is created successfully.

----End
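The Connector URL in Table 4-6 follows the standard JDBC form jdbc:scheme://IP:PORT. The sketch below derives the URL from the driver name; the PostgreSQL scheme used for DWS is the standard JDBC one and is our assumption, since the table only shows the MySQL example, and the helper itself is not part of DPS.

```shell
# Sketch: compose the Connector URL from the driver name and server address.
# The jdbc:postgresql scheme is assumed from standard JDBC conventions.
make_jdbc_url() {  # $1 = driver class, $2 = host, $3 = port
  case "$1" in
    com.mysql.jdbc.Driver)  echo "jdbc:mysql://$2:$3" ;;
    org.postgresql.Driver)  echo "jdbc:postgresql://$2:$3" ;;
    *)                      echo "unsupported driver: $1" ;;
  esac
}
make_jdbc_url com.mysql.jdbc.Driver 192.168.0.5 3306   # prints "jdbc:mysql://192.168.0.5:3306"
```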



4.2.2 Creating a CDM Connector

Scenario
A CDMSource connector records the connection information of the input and output data sources of CDM. A defined CDMSource connector is available to CDM Source data sources.

A connector can be used by more than one DPS data source. If the information about the connector changes, you only need to modify the connector configurations in the connector list, and these modified configurations are automatically updated in the data sources of the pipeline.

Prerequisites
- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- The current number of connectors does not reach the connector quota. The connector quota is 20.
- You have obtained the information about the connectors, server IP addresses, and server ports of the input and output data sources of CDM.

Procedure

Step 1 Log in to the DPS console.

Step 2 Click the icon in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Connector List.

Step 4 On the Connector List page, click Create Connector.

Step 5 In the displayed dialog box, select CDMSource from the Connector Type drop-down list. Configure the CDMSource parameters by referring to Table 4-7.

Table 4-7 CDMSource parameters

- Connector Name (mandatory): Connector name. A connector name is 1 to 21 characters long and contains only letters, digits, and underscores (_). Example: dps_cdmsource_123


- Connector Type (mandatory): Connector type. Options are as follows:
  - obs-connector: Connects to the OBS data source. For details about parameter settings, see Table 4-8.
  - generic-jdbc-connector: Connects to DWS and MySQL. For details about parameter settings, see Table 4-9.

Table 4-8 Parameter settings for obs-connector

- Database IP Address (mandatory): IP address of the OBS server.
- Database Port (mandatory): Port number of the OBS server.
- AK (mandatory): AK used for accessing the OBS server.
- SK (mandatory): SK used for accessing the OBS server.
- KMS Encryption (mandatory): Uses KMS to encrypt and decrypt user passwords and private keys. Options: keys created in KMS.

NOTE

KMS is the key management service provided by the public cloud. To create or manage keys, log in to the management console, and choose Security > Key Management Service on the homepage to open the KMS console.

Table 4-9 Parameter settings for generic-jdbc-connector

- Database Type (mandatory): Type of the database. Options: DWS, MYSQL.
- Database Name (mandatory): Name of the database.
- Database IP Address (mandatory): IP address of the database server.


- Database Port (mandatory): Port number of the database server.
- Username (mandatory): Username used to log in to the database. Ensure that this user has permission to read and write data tables and read metadata in the database.
- Password (mandatory): Password used to log in to the database.
- KMS Encryption (mandatory): Uses KMS to encrypt and decrypt user passwords and private keys. Options: keys created in KMS.

Step 6 Click OK. The connector is created.

----End

4.2.3 Creating an ESSource Connector

Scenario

An ESSource connector records the connector information of an ES cluster. A defined ESSource connector is available to ES Storage data sources.

A connector can be used by more than one DPS data source. If the connector information changes, you only need to modify the connector configuration in the connector list; the modified configuration is automatically applied to the data sources of the pipeline.

Prerequisites

- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- The current number of connectors has not reached the connector quota (20).
- You have obtained the URL of the ES cluster.

Procedure

Step 1 Log in to the DPS console.

Step 2 Click in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Connector List.

Step 4 On the Connector List page, click Create Connector.

Step 5 In the displayed dialog box, select ESSource from the Connector Type drop-down list. Configure the ESSource parameters by referring to Table 4-10.


Table 4-10 ESSource parameters

- Connector Name (mandatory): Connector name. A connector name is 1 to 24 characters long and contains only letters, digits, and underscores (_). Example: dps_essource_123
- Connector URL (mandatory): IP address and port number used to access the ES cluster through the private network. Format: http://IP:PORT. Port 9200 is recommended for accessing the ES cluster. Example: http://128.10.46.226:9200

Step 6 Click OK. The connector is created.

----End
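The naming and URL rules above are enforced by the console when you click OK. They can also be checked client-side before submission; the sketch below is a hypothetical pre-check (the function name and error messages are illustrative, not part of DPS), with the limits taken from Table 4-10:

```python
import re

# Connector name: 1 to 24 characters; letters, digits, and underscores only
# (per Table 4-10; note the CDMSource limit in Table 4-7 is 21 characters).
NAME_RE = re.compile(r"^[A-Za-z0-9_]{1,24}$")

# Connector URL: http://IP:PORT, reaching the ES cluster over the private network.
URL_RE = re.compile(r"^http://(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})$")

def validate_essource(name: str, url: str) -> list:
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []
    if not NAME_RE.match(name):
        problems.append("name must be 1-24 letters, digits, or underscores")
    m = URL_RE.match(url)
    if not m:
        problems.append("URL must have the form http://IP:PORT")
    elif int(m.group(2)) != 9200:
        problems.append("port 9200 is recommended for the ES cluster")
    return problems
```

For example, `validate_essource("dps_essource_123", "http://128.10.46.226:9200")` returns an empty list, while a name containing a hyphen is rejected.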

4.2.4 Editing a Connector

Scenario

You can modify a connector that has been created if necessary.

Prerequisites

- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- Connectors have been created.

Procedure

Step 1 Log in to the DPS console.

Step 2 Click in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Connector List.

Step 4 On the Connector List page, click Edit in the Operation column for the connector to be edited.

Step 5 On the editing page, modify the connector information. For the description of each parameter, see the following:

- DataSource connector: Table 4-6
- CDMSource connector: Table 4-7
- ESSource connector: Table 4-10


Step 6 Click OK. The modified parameter settings are saved successfully.

----End

4.2.5 Deleting a Connector

Scenario

You are advised to delete a connector that will no longer be used, to reduce quota occupation.

Prerequisites

- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- Connectors have been created but are not used by any pipeline.

Procedure

Step 1 Log in to the DPS console.

Step 2 Click in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Connector List.

Step 4 On the Connector List page, click Delete in the Operation column for the connector to be deleted.

Step 5 In the displayed dialog box, click OK to delete the connector.

----End

4.3 Resource List

4.3.1 Creating a DIS Resource

Scenario

DPS allows you to create a Data Ingestion Service (DIS) stream immediately, or create or delete a DIS stream at a specific point in time. Reasonable DPS configurations help you fully utilize DIS streams while reducing usage cost.

Prerequisites

- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- The current number of resources has not reached the resource quota (20).
- Currently, only common Data Ingestion Service (DIS) streams can be created.


Procedure

Step 1 Log in to the DPS console.

Step 2 Click in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Resource List.

Step 4 On the Resource List page, click Create Resource.

Step 5 In the dialog box that is displayed, select DIS from the Type drop-down list. Configure the DIS resource parameters by referring to Table 4-11.

Table 4-11 DIS resource parameters

- Schedule Type (mandatory): Resource scheduling type. Options are as follows:
  - At Once: Resources are created at once. If the resources are no longer used, you need to manually delete them.
  - On Schedule: Resources are automatically scheduled, created, and deleted at your specified resource schedule cycle, creation time, and deletion time. Ensure that the interval between the resource creation time and the resource deletion time is longer than the resource schedule cycle.
- Stream Name (mandatory): Unique name of the DIS stream used to send or receive data. A stream name is 1 to 64 characters long and contains only letters, digits, hyphens (-), and underscores (_). Example: dis-5acb
- Partitions (mandatory): Number of partitions into which data records of the newly created DIS stream will be distributed. Value range: an integer from 1 to 50. Example: 10


- Data Dumping (mandatory): Location in which data from the DIS stream will be stored.
  - No Dump: Data will be stored only in DIS.
  - Dump to OBS: Data will be stored in DIS and periodically dumped to OBS. For details about parameter settings, see Table 4-12.
  NOTE: Data stored in DIS can be retained for only 24 hours. After this period expires, the data is automatically cleared.

Table 4-12 Parameter settings for dumping data to OBS

- Dumped To (mandatory): Name of the OBS bucket used to store data from the DIS stream.
- IAM Agency (mandatory): DIS uses an agency to access your specified resources, such as OBS buckets. Select an IAM agency from the drop-down list.
- Dump Type (mandatory): Data dumping type. Options are as follows:
  - Custom file: You can select which streaming data will be saved into which folder. Custom files are dumped to OBS immediately after they are generated in the chosen folder.
  - Periodic: Streaming data is automatically saved into files in the chosen directory. Files are then dumped to OBS at regular intervals.
- Dump File Directory (optional): User-defined directory storing files that will be dumped to OBS. Use slashes (/) to separate directory levels. This parameter is displayed only when Dump Type is set to Periodic.
- Dump Interval (s) (mandatory): User-defined interval at which data from the DIS stream is dumped to OBS. Value range: 60 to 900. This parameter is displayed only when Dump Type is set to Periodic.


Step 6 Click OK to complete the resource creation.

----End
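The constraints stated in Tables 4-11 and 4-12 (the On Schedule creation-to-deletion window must exceed the schedule cycle, partitions range from 1 to 50, and the periodic dump interval from 60 to 900 seconds) can be sketched as a pre-check. This is a hypothetical sketch; the function and parameter names are illustrative and not part of the DPS API:

```python
from datetime import datetime, timedelta

def check_dis_resource(create_at, delete_at, cycle, partitions, dump_interval_s=None):
    """Pre-check a DIS resource configuration against the documented rules.

    create_at/delete_at: datetime; cycle: timedelta (schedule cycle);
    dump_interval_s: seconds, only for Periodic dump. Returns a list of problems.
    """
    problems = []
    # On Schedule: the creation-to-deletion window must exceed the cycle.
    if delete_at - create_at <= cycle:
        problems.append("creation-to-deletion interval must exceed the schedule cycle")
    # Partitions: an integer from 1 to 50.
    if not 1 <= partitions <= 50:
        problems.append("partitions must be between 1 and 50")
    # Dump Interval (Periodic dump only): 60 to 900 seconds.
    if dump_interval_s is not None and not 60 <= dump_interval_s <= 900:
        problems.append("dump interval must be between 60 and 900 seconds")
    return problems
```

A one-day window with a one-hour cycle, 10 partitions, and a 300-second dump interval passes; a 30-minute window with the same cycle fails.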

4.3.2 Creating an MRS Resource

Scenario

DPS allows you to create an MRS cluster immediately, or create or delete an MRS cluster on demand or at a specific point in time. Reasonable DPS configurations help you fully utilize MRS clusters while reducing usage cost.

MRS clusters in the resource management list can be used by data sources and activities that need MRS clusters, for example, HDFS data sources and MapReduce activities.

Prerequisites

- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- The current number of resources has not reached the resource quota (20).

Procedure

Step 1 Log in to the DPS console.

Step 2 Click in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Resource List.

Step 4 On the Resource List page, click Create Resource.

Step 5 In the dialog box that is displayed, select MRS from the Type drop-down list. Configure the MRS resource parameters by referring to Table 4-13.


Table 4-13 MRS resource parameters

- Schedule Type (mandatory): Resource scheduling type. Options are as follows:
  - On Demand: Resources are automatically created and deleted based on the resource usage of the pipeline.
  - At Once: Resources are created at once. If the resources are no longer used, you need to manually delete them.
  - On Schedule: Resources are automatically scheduled, created, and deleted at your specified resource schedule cycle, creation time, and deletion time. Ensure that the interval between the resource creation time and the resource deletion time is longer than the resource schedule cycle.
- Resource Name (mandatory): Resource name. A resource name is 1 to 64 characters long and contains only letters, digits, hyphens (-), and underscores (_). Example: mrs_123
- Cluster Name (mandatory): Unique name of the MRS cluster. A cluster name is 1 to 64 characters long and contains only letters, digits, hyphens (-), and underscores (_). Example: mrs_11c6
- AZ (mandatory): An AZ is an area where power and networks are physically isolated. AZs in the same region can communicate with each other over an intranet. Currently, only the northchina 1 and eastchina 2 regions are supported: in the northchina 1 region, AZ 2 is available; in the eastchina 2 region, AZ 1 is available. Example: AZ 1
- VPC (mandatory): Virtual private cloud (VPC) in which the cluster is created. If there is no available VPC, create one in advance. Example: vpc-11
- Subnet (mandatory): Subnet of the cluster. If there is no available subnet, create one in the VPC in advance. Example: subnet-11


- Cluster Version (mandatory): Currently, MRS 1.3.0, MRS 1.5.0, MRS 1.5.1, and MRS 1.6.0 are supported. Example: MRS 1.3.0
- Key Pair (mandatory): Key pair used to access the master node of the cluster. If there is no available key pair, create or import one in advance. Example: KeyPair-7dbd
- Logging (mandatory): Whether to back up logs. If this parameter is set to Yes, you need to specify the OBS bucket where logs are stored.
- Select Component (mandatory): MRS component. Options: Hadoop, Spark, HBase, Hive. Example: Spark

Step 6 Click OK to complete the resource creation.

----End

4.3.3 Creating a CDM Resource

Scenario

DPS allows you to create a CDM cluster immediately, or create or delete a CDM cluster on demand or at a specific point in time. Reasonable DPS configurations help you fully utilize CDM clusters while reducing usage cost.

CDM clusters in the resource management list can be used by ExecuteCDM activities.

Prerequisites

- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- The current number of resources has not reached the resource quota (20).

Procedure

Step 1 Log in to the DPS console.

Step 2 Click in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Resource List.

Step 4 On the Resource List page, click Create Resource.


Step 5 In the dialog box that is displayed, select CDM from the Type drop-down list. Configure the CDM resource parameters by referring to Table 4-14.

Table 4-14 CDM resource parameters

- Schedule Type (mandatory): Resource scheduling type. Options are as follows:
  - On Demand: Resources are automatically created and deleted based on the resource usage of the pipeline.
  - At Once: Resources are created at once. If the resources are no longer used, you need to manually delete them.
  - On Schedule: Resources are automatically scheduled, created, and deleted at your specified resource schedule cycle, creation time, and deletion time. Ensure that the interval between the resource creation time and the resource deletion time is longer than the resource schedule cycle.
- Resource Name (mandatory): Resource name. A resource name is 1 to 64 characters long and contains only letters, digits, hyphens (-), and underscores (_). Example: cdm_source
- Cluster Name (mandatory): CDM cluster name. A CDM cluster name is 4 to 64 characters long, contains only letters, digits, underscores (_), and hyphens (-), and must start with a letter. Example: cdm-a4d3
- Version (mandatory): CDM service version. Currently, only CDM 1.0.8T is supported. Example: 1.0.8T
- VPC (mandatory): VPC in which the CDM cluster is created. Ensure that the CDM cluster and the data source to which it connects are in the same VPC. Example: vpc-cdm-source
- Subnet (mandatory): Subnet of the CDM cluster. Ensure that the subnet of the CDM cluster can communicate with that of the data source. Example: subnet-cdm-source
- Security Group (mandatory): Security group to which the CDM cluster belongs. Ensure that the security group can access the data source. Example: sg-cdm


- Node Configuration (mandatory): The following two node specifications are available:
  - cdm.medium: ECS server with a 4-core CPU and 8 GB memory, suitable for a single database table with fewer than 10 million records.
  - cdm.large: ECS server with an 8-core CPU and 16 GB memory, suitable for a single database table with more than 10 million records.
  Example: cdm.medium
- EIP (mandatory): Whether to bind an EIP to the CDM cluster. If the CDM cluster needs to access a data source on the Internet, bind an EIP to the CDM cluster. Example: Automatically Assign
- AZ (mandatory): AZ in which the CDM cluster is created. Example: cn-north-1b

Step 6 Click OK to complete the resource creation.

----End
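The node sizing rule in Table 4-14 reduces to a single threshold; a hypothetical helper (the function name is illustrative, only the 10-million-record cutoff and the two specification names come from the table):

```python
def pick_node_spec(table_rows: int) -> str:
    """Choose a CDM node specification from the documented sizing rule:
    cdm.medium for a single table under 10 million records, cdm.large otherwise."""
    return "cdm.medium" if table_rows < 10_000_000 else "cdm.large"
```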

4.3.4 Editing a Resource

Scenario

You can modify a resource that has been created if necessary.

Prerequisites

- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- Resources have been created.

Procedure

Step 1 Log in to the DPS console.

Step 2 Click in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Resource List.

Step 4 On the Resource List page, click Edit in the Operation column for the resource to be edited.

Step 5 On the editing page, modify the resource information. For the description of each parameter, see the following:


- DIS resource: Table 4-11
- MRS resource: Table 4-13
- CDM resource: Table 4-14

Step 6 Click OK. The modified parameter settings are saved successfully.

----End

4.3.5 Deleting a Resource

Scenario

You are advised to delete a resource that will no longer be used, to reduce quota usage.

Prerequisites

- The account used to access the public cloud management console has permission to access DPS. For details, see Permissions Required for Accessing DPS.
- Resources have been created but are not used by any pipeline.

Procedure

Step 1 Log in to the DPS console.

Step 2 Click in the upper left corner and select your region and project.

Step 3 In the navigation pane of the DPS console, click Resource List.

Step 4 On the Resource List page, click Delete in the Operation column for the resource to be deleted.

Step 5 In the displayed dialog box, click OK to delete the resource.

----End


5 Configuration Guide

5.1 Data Sources

A data source, such as OBS, RDS, or HDFS, indicates the location where data is stored.

5.1.1 RDS

Function

The RDS data source indicates MySQL of RDS and is used to store user data in the form of tables.

Configuration

On the Edit page, drag and drop the RDS data source to the edit grid area. Click the RDS data source.

- On the Input and Output tab pages at the left side of the edit grid area, check the activities to which the data source can connect when it serves as an input or output data source.

- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items shown in Table 5-1.

Table 5-1 RDS properties

- Name (mandatory): Data source name. Example: RDS_4171


- Database (mandatory): Select a suitable connector from the connector list and use it as the database. To create a connector, click or go to the DPS connector list page for creation. For details about parameter settings, see Creating a DataSource Connector. Example: DBname
- Table Name (mandatory): Name of the RDS table. You need to create the RDS table in advance. Example: test

5.1.2 HBase

Function

The HBase data source indicates the HBase distributed cloud storage system of MRS and is applicable to massive data storage.

Configuration

On the Edit page, drag and drop the HBase data source to the edit grid area. Click the HBase data source.

- On the Input and Output tab pages at the left side of the edit grid area, check the activities to which the data source can connect when it serves as an input or output data source.

- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items shown in Table 5-2.

Table 5-2 HBase properties

- Name (mandatory): Data source name. Example: HBase_4653
- HBASE Table Name (mandatory): Name of the HBase table. You need to create the HBase table in advance. Statement for creating a table: create 'test','d'. Example: datacsv2


- HBASE Columns (optional): Columns of the HBase table. The columns must be created in advance. Example: HBASE_ROW_KEY,d:c2,d:c3

5.1.3 HDFS

Function

The HDFS data source indicates the Hadoop distributed file system of MRS and is applicable to large-scale data storage.

Configuration

On the Edit page, drag and drop the HDFS data source to the edit grid area. Click the HDFS data source.

- On the Input and Output tab pages at the left side of the edit grid area, check the activities to which the data source can connect when it serves as an input or output data source.

- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items shown in Table 5-3.

Table 5-3 HDFS properties

- Name (mandatory): Data source name. Example: HDFS_0745
- MR Cluster (mandatory): MRS cluster. To create a cluster, perform any of the following operations:
  - Click to create an MRS cluster as required. For details about parameter settings, see Creating an MRS Resource.
  - Go to the DPS resource management list page and create an MRS cluster.
  - Go to the MRS management console and create an MRS cluster.
  Example: DPS_using_mrs


- HDFS Path (mandatory): Storage path of the HDFS file. When HDFS is used as an output data source, HDFS Path supports the following variables:
  - <scheduletime>: A directory named after the time at which the pipeline starts running will be automatically created for storing pipeline output data.
  - <date>: A directory named after the current date will be automatically created for storing pipeline output data.
  - <yesterday>: A directory named after the previous day will be automatically created for storing pipeline output data.
  Example: /user/omn/<scheduletime>
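The behavior of these path variables (which DPS expands internally; the same variables also apply to OBS Path in Table 5-4) can be illustrated with a small sketch. The substitution logic and date formats below are assumptions for illustration; only the three variable names and the dated-directory behavior come from the table above:

```python
from datetime import datetime, timedelta

def expand_path(template: str, schedule_time: datetime) -> str:
    """Expand the documented <scheduletime>, <date>, and <yesterday>
    variables in an HDFS (or OBS) path template."""
    today = schedule_time.date()
    return (template
            .replace("<scheduletime>", schedule_time.strftime("%Y%m%d%H%M%S"))
            .replace("<date>", today.strftime("%Y%m%d"))
            .replace("<yesterday>", (today - timedelta(days=1)).strftime("%Y%m%d")))
```

For example, expanding "/user/omn/&lt;date&gt;" for a pipeline that starts on 2018-01-30 yields "/user/omn/20180130". The exact timestamp format DPS uses is not specified in this guide; %Y%m%d is an assumption.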

5.1.4 OBS

Function

The OBS data source indicates the data storage function of OBS and is used to store unstructured data, including documents, images, and videos.

Configuration

On the Edit page, drag and drop the OBS data source to the edit grid area. Click the OBS data source.

- On the Input and Output tab pages at the left side of the edit grid area, check the activities to which the data source can connect when it serves as an input or output data source.

- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items shown in Table 5-4.

Table 5-4 OBS properties

- Name (mandatory): Data source name. Example: OBS_5680


- OBS Path (mandatory): Path to the directory where OBS data is stored. When OBS is used as an output data source, OBS Path supports the following variables:
  - <scheduletime>: A directory named after the time at which the pipeline starts running will be automatically created for storing pipeline output data.
  - <date>: A directory named after the current date will be automatically created for storing pipeline output data.
  - <yesterday>: A directory named after the previous day will be automatically created for storing pipeline output data.
  Example: s3a://dpsfile/<scheduletime>/

5.1.5 DWS

Function

The DWS data source indicates the data storage function of DWS.

Configuration

On the Edit page, drag and drop the DWS data source to the edit grid area. Click the DWS data source.

- On the Input and Output tab pages at the left side of the edit grid area, check the activities to which the data source can connect when it serves as an input or output data source.

- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items shown in Table 5-5.

Table 5-5 DWS properties

- Name (mandatory): Data source name. Example: DWS_0167


- Database (mandatory): Select a suitable connector from the connector list and use it as the database. To create a connector, click or go to the DPS connector list page for creation. For details about parameter settings, see Creating a DataSource Connector. Example: dws_test
- Table Name (mandatory): Name of the database table. The table must be created in advance. Example: tabledws

5.1.6 CDM Source

Function

CDM Source indicates that CDM is used as the input or output data source of pipelines.

Configuration

On the Edit page, drag and drop the CDM Source data source to the edit grid area. Click the CDM Source data source.

- On the Input and Output tab pages at the left side of the edit grid area, check the activities to which the data source can connect when it serves as an input or output data source.

- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items shown in Table 5-6.

Table 5-6 CDM Source properties

- Name (mandatory): Data source name. Example: CDM_Source_0167
- CDM Source Connector (mandatory): Select a suitable CDM connector from the connector list and use it as the CDM data source. To create a connector, click or go to the DPS connector list page for creation. For details about parameter settings, see Creating a CDM Connector. Example: cdm_s11


- CDM DB Schema (mandatory): Name of the database schema. This parameter is displayed only when Link Type is set to generic-jdbc-connector during the creation of the CDM connector. Example: columncdm
- Database Table Name (mandatory): Name of the database table. You need to create the database table in advance. This parameter is displayed only when Link Type is set to generic-jdbc-connector during the creation of the CDM connector. Example: tablecdm
- OBS Path (mandatory): Path to the directory where OBS data is stored. When CDM Source is used as the output data source of the ExecuteCDM activity, this parameter can be set to a directory that does not exist in the OBS bucket; the directory will be automatically created by the ExecuteCDM activity. However, if this parameter is set to an OBS bucket that does not exist, the ExecuteCDM activity fails to execute. This parameter is displayed only when Link Type is set to obs-connector during the creation of the CDM connector. Example: s3a://dpsfile/

5.1.7 Dummy

Function

Dummy is a data source that does not store any data. The Dummy data source is connected to an activity that does not require an input or output data source.

Configuration

On the Edit page, drag and drop the Dummy data source to the edit grid area. Click theDummy data source.

l On the Input and Output tab pages at the left side of the edit grid area, check theactivities to which the data source can connect when it serves as an input or output datasource.

l On the configuration page that is displayed at the right side of the edit grid area,configure the data source name.


5.1.8 UQuery Table

Function

UQuery Table indicates the tables supported by UQuery, such as UQuery tables and OBS tables, and is used together with the UQuery<->OBS and UQuery SQL activities. For details, see UQuery<->OBS and UQuery SQL.

Configuration

On the Edit page, drag and drop the UQuery Table data source to the edit grid area. Click the UQuery Table data source.

- On the Input and Output tab pages at the left side of the edit grid area, check the activities to which the data source can connect when it serves as an input or output data source.
- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items shown in Table 5-7.

Table 5-7 UQuery Table properties

- Name (mandatory): Data source name. Example value: UQuery_Table_9622
- Database Name (mandatory): Name of a created UQuery database. Select a created UQuery database, or select Create to open an input box and enter the name of the database to be created. Example value: uquery_db
- Table Type (mandatory): Location where data is stored. OBS: data is stored in OBS buckets. UQuery: data is stored in UQuery. View: data table view. This parameter is displayed when you select a created database for Database Name. Example value: OBS
- Table Name (mandatory): Name of a created UQuery data table. Select a created UQuery data table, or select Create to open an input box and enter the name of the data table to be created. This parameter is displayed when you select a created database for Database Name. Example value: uquery_table


5.1.9 ES Storage

Function

ES Storage indicates the data sources of ES and is used together with the Elasticsearch activity. For details, see Elasticsearch.

Configuration

On the Edit page, drag and drop the ES Storage data source to the edit grid area. Click the ES Storage data source.

- On the Input and Output tab pages at the left side of the edit grid area, check the activities to which the data source can connect when it serves as an input or output data source.
- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items shown in Table 5-8.

Table 5-8 ES Storage properties

- Name (mandatory): Data source name. Example value: ES_Storage_7212
- ES Connector (mandatory): Select a suitable ESSource connector from the connector list and use it as the ES data source. To create a connector, click the corresponding icon or go to the DPS connector list page for creation. For details about parameter settings, see Creating an ESSource Connector. Example value: dps_es

5.2 Activities

An activity defines the move or transfer operation performed on data. For example, the HDFS<->OBS activity can move data from OBS to HDFS.

5.2.1 HDFS->HBASE

Function

The HDFS->HBASE activity is used to import data files (in CSV format) stored in HDFS to HBase tables.

Configuration

On the Edit page, drag and drop the HDFS->HBASE activity to the edit grid area. Click the HDFS->HBASE activity.


- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-9.

Table 5-9 Link relationship between the HDFS->HBASE activity and the data sources

HDFS->HBASE: HDFS -> [HDFS->HBASE] -> HBase

- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items in the following section.

Parameters

Properties

Table 5-10 describes the HDFS->HBASE properties.

NOTE

As a common user, you do not need to configure Custom Jar File Path, Execution Class Name, or Execution Parameter File Path. Configure these three items only if you want to import custom data processing logic.

Table 5-10 HDFS->HBASE properties

- Name (mandatory): Activity name. Example value: HDFS_HBASE_3904
- MR Cluster (mandatory): MR cluster. To create a cluster, perform one of the following operations: click the corresponding icon to create an MRS cluster as required (for details about parameter settings, see Creating an MRS Resource); go to the DPS resource management list page and create an MRS cluster; or go to the MRS management console and create an MRS cluster. Example value: DPS_using_mrs
- Load Type (mandatory): Data loading type. Possible values: BULKLOAD (load a large amount of data to HBase) and INSERT (load a small amount of data to HBase). Example value: BULKLOAD or INSERT
- File Backup (mandatory): Whether to back up the loaded files. Example value: Yes or No
- Backup Path (mandatory): Path to the directory where loaded files are backed up. Example value: /user/omm/
- Custom Jar File Path (optional): Path to the custom Jar package. Example value: /user/omm/yu/loadhbase/customjar/customer.jar
- Execution Class Name (optional): Name of the execution class. Example value: com.company.datacraft.hbase.ImportTsvCustom
- Execution Parameter File Path (optional): Path to the extension parameter file. Example value: /user/omm/yu/loadhbase/arg/b.txt
- Log Path (mandatory): Path to the directory where logs are stored. The log path supports the following variables: <scheduletime> (a directory named after the time at which the pipeline starts running is created automatically for storing log files), <date> (a directory named after the current date), and <yesterday> (a directory named after the previous day). Example value: s3a://dps/log/
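The log-path variables above can be pictured as simple string substitutions performed when the pipeline starts. The sketch below is illustrative only; the exact directory-name formats DPS generates are assumptions, not documented here.

```python
from datetime import datetime, timedelta

def resolve_log_path(template: str, schedule_time: datetime) -> str:
    """Substitute the DPS log-path variables. The timestamp formats are
    illustrative assumptions; DPS may format directory names differently."""
    values = {
        "<scheduletime>": schedule_time.strftime("%Y%m%d%H%M%S"),
        "<date>": datetime.now().strftime("%Y%m%d"),
        "<yesterday>": (datetime.now() - timedelta(days=1)).strftime("%Y%m%d"),
    }
    for var, value in values.items():
        template = template.replace(var, value)
    return template

print(resolve_log_path("s3a://dps/log/<scheduletime>/",
                       datetime(2018, 1, 30, 8, 0, 0)))
# s3a://dps/log/20180130080000/
```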


Precondition

Table 5-11 describes the parameters of an HDFS->HBASE precondition.

NOTE

A maximum of five preconditions can be added.

Table 5-11 Precondition parameters

- Action After Precondition Check Failure (mandatory): Action that will be performed if the precondition is not met. Options: Exit (exit the activity) or Continue (execute the activity anyway).
- Check Method (mandatory): Options: Meet all types (the system executes the activity only when all preconditions are met) or Meet any type (the system executes the activity if any precondition is met).
- Precondition Type (mandatory): Type of the precondition. Options: check whether the file exists; check the number of files in the folder; check the file size.
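The Check Method logic above combines the outcomes of up to five precondition checks. A minimal sketch of that combination, assuming the individual checks (file exists, file count, file size) have already produced boolean results; DPS performs those checks server-side:

```python
# Sketch of how Check Method combines several precondition results.
def should_run(results: list, check_method: str) -> bool:
    """results: outcome of each precondition check (True = met)."""
    if check_method == "Meet all types":
        return all(results)   # every precondition must be met
    if check_method == "Meet any type":
        return any(results)   # a single met precondition suffices
    raise ValueError(f"unknown check method: {check_method}")

print(should_run([True, False], "Meet all types"))  # False: one check failed
print(should_run([True, False], "Meet any type"))   # True: one check passed
```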

Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-12 describes the advanced parameters.

Table 5-12 Advanced parameter settings

- Retry upon Failure (mandatory): Whether to re-execute the activity if the activity fails to be executed. Yes: re-execute the activity and configure Timeout Interval (timeout interval for activity execution), Maximum Retries (number of retries upon an execution failure), and Retry Interval (seconds) (interval between two retries). No: do not re-execute the activity. Default value: No.
- Failure policy (mandatory): Operation that will be performed if the activity re-execution still fails. Options: end the current job execution plan, or proceed to the next job.
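The Retry upon Failure policy amounts to a bounded retry loop: attempt the activity, and on failure retry up to Maximum Retries times with Retry Interval between attempts. The sketch below is illustrative; DPS's real timeout handling is not modelled, and the callable is a stand-in for an activity run.

```python
import time

def run_with_retry(activity, max_retries: int, retry_interval: float) -> bool:
    """Illustrative loop for 'Retry upon Failure = Yes'. `activity` is any
    callable returning True on success."""
    attempts = 0
    while True:
        if activity():
            return True                # activity succeeded
        if attempts >= max_retries:
            return False               # exhausted: the Failure policy applies next
        attempts += 1
        time.sleep(retry_interval)     # Retry Interval (seconds)

# A stand-in activity that succeeds on its third invocation:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    return calls["n"] >= 3

print(run_with_retry(flaky, max_retries=3, retry_interval=0))  # True
```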

5.2.2 HDFS<->OBS

Function

The HDFS<->OBS activity uses MapReduce jobs to implement distributed file copy and transfers data between HDFS and OBS.

Configuration

On the Edit page, drag and drop the HDFS<->OBS activity to the edit grid area. Click the HDFS<->OBS activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-13.

Table 5-13 Link relationships between the HDFS<->OBS activity and the data sources

HDFS<->OBS: OBS -> [HDFS<->OBS] -> HDFS; HDFS -> [HDFS<->OBS] -> OBS

- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items in the following section.


Parameters

Properties

Table 5-14 describes the HDFS<->OBS properties.

Table 5-14 HDFS<->OBS properties

- Name (mandatory): Activity name. Example value: HDFS_OBS_3603
- MR Cluster (mandatory): MR cluster. To create a cluster, perform one of the following operations: click the corresponding icon to create an MRS cluster as required (for details about parameter settings, see Creating an MRS Resource); go to the DPS resource management list page and create an MRS cluster; or go to the MRS management console and create an MRS cluster. Example value: DPS_using_mrs
- Job Name (mandatory): MRS job name. Example value: HDFS_OBS
- Log Path (optional): Path to the directory where logs are stored. Example value: s3a://dps/log/

Precondition

Table 5-15 describes the parameters of an HDFS<->OBS precondition.

NOTE

A maximum of five preconditions can be added.

Table 5-15 Precondition parameters

- Action After Precondition Check Failure (mandatory): Action that will be performed if the precondition is not met. Options: Exit (exit the activity) or Continue (execute the activity anyway).
- Check Method (mandatory): Options: Meet all types (the system executes the activity only when all preconditions are met) or Meet any type (the system executes the activity if any precondition is met).
- Precondition Type (mandatory): Type of the precondition. Options: check whether the file exists; check the number of files in the folder; check the file size.

Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-16 describes the advanced parameters.

Table 5-16 Advanced parameter settings

- Retry upon Failure (mandatory): Whether to re-execute the activity if the activity fails to be executed. Yes: re-execute the activity and configure Timeout Interval (timeout interval for activity execution), Maximum Retries (number of retries upon an execution failure), and Retry Interval (seconds) (interval between two retries). No: do not re-execute the activity. Default value: No.
- Failure policy (mandatory): Operation that will be performed if the activity re-execution still fails. Options: end the current job execution plan, or proceed to the next job.


5.2.3 Database<->HDFS

Function

The Database<->HDFS activity is used to transfer data between HDFS and RDS.

Configuration

On the Edit page, drag and drop the Database<->HDFS activity to the edit grid area. Click the Database<->HDFS activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-17.

Table 5-17 Link relationships between the Database<->HDFS activity and the data sources

Database<->HDFS: HDFS -> [Database<->HDFS] -> RDS; RDS -> [Database<->HDFS] -> HDFS

NOTE

In the pipeline RDS -> [Database<->HDFS] -> HDFS, ensure that the directory specified by the HDFS Path property of the HDFS data source does not exist.

For example, if HDFS Path is set to /user/omm/yourfile, /user/omm is an existing directory and yourfile is a user-defined directory that does not yet exist in the /user/omm directory.

- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items in the following section.

Parameters

Properties

Table 5-18 describes the Database<->HDFS properties.

Table 5-18 Database<->HDFS properties

- Name (mandatory): Activity name. Example value: Database_HDFS_5690
- MR Cluster (mandatory): MR cluster. To create a cluster, perform one of the following operations: click the corresponding icon to create an MRS cluster as required (for details about parameter settings, see Creating an MRS Resource); go to the DPS resource management list page and create an MRS cluster; or go to the MRS management console and create an MRS cluster. Example value: DPS_using_mrs
- Job Name (mandatory): MRS job name. Example value: RDS_HDFS
- Database<->HDFS Job Parameters (optional): Sqoop command parameters. NOTE: The Sqoop component is used in the Database<->HDFS activity; enter the Sqoop command parameters here. Example value: -m 1 (starts one Map process to execute the task)
- Log Path (mandatory): Path to the directory where logs are stored. The log path supports the following variables: <scheduletime> (a directory named after the time at which the pipeline starts running is created automatically for storing log files), <date> (a directory named after the current date), and <yesterday> (a directory named after the previous day). Example value: s3a://dps/log/
- HDFS Subdirectory (optional): Subdirectory under the directory specified by the HDFS Path property of the HDFS data source. This parameter specifies the path to the input data source of the Database<->HDFS activity only when HDFS is used as the input data source. NOTE: HDFS Path is a property of the HDFS data source; for details, see Table 5-3. Example value: /chd
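Because the Job Parameters are passed through to Sqoop, `-m 1` simply caps the transfer at a single map task. A hedged sketch of how such a Sqoop import invocation might be assembled; the connection values below are placeholders, not real endpoints, and the exact command DPS issues is not documented here.

```python
# Illustrative assembly of a Sqoop import command like the one the
# Database<->HDFS activity runs. All connection values are placeholders.
def build_sqoop_import(jdbc_url: str, table: str, target_dir: str,
                       extra_args: list) -> list:
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,   # must not already exist in HDFS
    ]
    return cmd + extra_args           # e.g. the activity's Job Parameters

cmd = build_sqoop_import("jdbc:mysql://example:3306/db", "orders",
                         "/user/omm/yourfile", ["-m", "1"])
print(" ".join(cmd))
```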

Precondition

Table 5-19 describes the parameters of a Database<->HDFS precondition.

NOTE

A maximum of five preconditions can be added.

Table 5-19 Precondition parameters

- Action After Precondition Check Failure (mandatory): Action that will be performed if the precondition is not met. Options: Exit (exit the activity) or Continue (execute the activity anyway).
- Check Method (mandatory): Options: Meet all types (the system executes the activity only when all preconditions are met) or Meet any type (the system executes the activity if any precondition is met).
- Precondition Type (mandatory): Type of the precondition. Options: check whether the file exists; check the number of files in the folder; check the file size; check whether the database table exists.


Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-20 describes the advanced parameters.

Table 5-20 Advanced parameter settings

- Retry upon Failure (mandatory): Whether to re-execute the activity if the activity fails to be executed. Yes: re-execute the activity and configure Timeout Interval (timeout interval for activity execution), Maximum Retries (number of retries upon an execution failure), and Retry Interval (seconds) (interval between two retries). No: do not re-execute the activity. Default value: No.
- Failure policy (mandatory): Operation that will be performed if the activity re-execution still fails. Options: end the current job execution plan, or proceed to the next job.

5.2.4 UQuery<->OBS

Function

The UQuery<->OBS activity is used to transfer data between an OBS bucket and a UQuery table. This activity supports data transfer only through CSV files.

Configuration

On the Edit page, drag and drop the UQuery<->OBS activity to the edit grid area. Click the UQuery<->OBS activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-21.


Table 5-21 Link relationship between the UQuery<->OBS activity and the data sources

UQuery<->OBS: OBS -> [UQuery<->OBS] -> UQuery Table; UQuery Table -> [UQuery<->OBS] -> OBS

NOTE

When data is transferred from the UQuery Table data source to the OBS data source, a folder is automatically created in OBS to store the transferred data. Ensure that the folder (specified by the OBS Path property of the OBS data source) does not already exist in the OBS bucket.

For example, if OBS Path is set to s3a://dpsfile/new/, new is the folder to be created automatically; ensure that this folder does not exist in the s3a://dpsfile/ directory.

- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items in the following section.

Parameters

Properties

Table 5-22 describes the UQuery<->OBS properties.

Table 5-22 UQuery<->OBS properties

- Name (mandatory): Activity name. Example value: UQuery_OBS_5844
- Import/Export (mandatory): Data transfer direction. When data is transferred from the OBS data source to the UQuery Table data source, select Import(OBS->UQuery). When data is transferred from the UQuery Table data source to the OBS data source, select Export(UQuery->OBS). Example value: Import
- Log Path (optional): Path of the execution log. Example value: s3a://dps/log/
- File Format (mandatory): Format of the file used to transfer data between the UQuery Table data source and the OBS data source. Currently, only CSV files are supported. Example value: csv
- Compression Format (mandatory): File compression type. Options: Not compress, gzip, bzip2, and deflate. This parameter is displayed when Import/Export is set to Export(UQuery->OBS). Example value: none
- Advanced Options (optional): Format of the target custom data table. This parameter is displayed when Import/Export is set to Import(OBS->UQuery). Example value: -

Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-23 describes the advanced parameters.

Table 5-23 Advanced parameter settings

- Retry upon Failure (mandatory): Whether to re-execute the activity if the activity fails to be executed. Yes: re-execute the activity and configure Timeout Interval (timeout interval for activity execution), Maximum Retries (number of retries upon an execution failure), and Retry Interval (seconds) (interval between two retries). No: do not re-execute the activity. Default value: No.
- Failure policy (mandatory): Operation that will be performed if the activity re-execution still fails. Options: end the current job execution plan, or proceed to the next job.

5.2.5 CDM Job

Function

The CDM Job activity is used to execute a CDM job that has been created in CDM.

Configuration

On the Edit page, drag and drop the CDM Job activity to the edit grid area. Click the CDM Job activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-24.

Table 5-24 Link relationship between the CDM Job activity and the data sources

CDM Job: Any data source -> CDM Job -> any data source. The CDM Job activity can also be connected to the Shell Script, CDM Job, Create OBS, and Delete OBS activities.

NOTE

Connecting the CDM Job activity to a data source or activity only forms a complete pipeline; running the CDM Job activity does not affect the connected data source or activity.


- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items in the following section.

Parameters

Properties

Table 5-25 describes the CDM Job properties.

Table 5-25 CDM Job properties

- Name (mandatory): Activity name. Example value: CDM_Job_1829
- CDM Cluster Name (mandatory): Cluster to which the CDM job belongs. Example value: cdm-dps
- CDM Job Name (mandatory): CDM job name. Example value: cdmjobt
- Log Path (optional): Path of the log. Example value: s3a://dps/log/

Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-26 describes the advanced parameters.

Table 5-26 Advanced parameter settings

- Retry upon Failure (mandatory): Whether to re-execute the activity if the activity fails to be executed. Yes: re-execute the activity and configure Timeout Interval (timeout interval for activity execution), Maximum Retries (number of retries upon an execution failure), and Retry Interval (seconds) (interval between two retries). No: do not re-execute the activity. Default value: No.
- Failure policy (mandatory): Operation that will be performed if the activity re-execution still fails. Options: end the current job execution plan, or proceed to the next job.


5.2.6 ExecuteCDM

Function

The ExecuteCDM activity creates jobs in the CDM cluster and migrates cloud data by executing those jobs.

Configuration

On the Edit page, drag and drop the ExecuteCDM activity to the edit grid area. Click the ExecuteCDM activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-27.

Table 5-27 Link relationship between the ExecuteCDM activity and the data sources

ExecuteCDM: CDM Source/OBS -> ExecuteCDM -> CDM Source/OBS

- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items in the following section.

Parameters

Properties

Table 5-28 describes the ExecuteCDM properties.

Table 5-28 ExecuteCDM properties

- Name (mandatory): Activity name. Example value: ExecuteCDM_2113
- CDM Job Name (mandatory): Name of a new CDM job. The name contains only letters and digits and is no longer than 21 characters. Example value: cdmjob
- CDM Cluster Name (mandatory): Cluster to which the CDM job belongs. To create a cluster, perform one of the following operations: click the corresponding icon to create a CDM cluster as required (for details about parameter settings, see Creating a CDM Resource); go to the DPS resource management list page and create a CDM cluster; or go to the CDM management console and create a CDM cluster. Example value: cdm-dps
- Log Path (optional): Path of the log. Example value: s3a://dps/log/
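The CDM Job Name constraint in Table 5-28 (letters and digits only, at most 21 characters) can be checked client-side before the pipeline runs. This is a sketch of that rule as stated here; the exact server-side validation may differ.

```python
import re

# Job-name rule from Table 5-28: letters and digits only, at most 21 chars.
NAME_RE = re.compile(r"^[A-Za-z0-9]{1,21}$")

def valid_cdm_job_name(name: str) -> bool:
    return bool(NAME_RE.match(name))

print(valid_cdm_job_name("cdmjob"))   # True
print(valid_cdm_job_name("cdm_job"))  # False: underscore not allowed
print(valid_cdm_job_name("a" * 22))   # False: longer than 21 characters
```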

Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-29 describes the advanced parameters.

Table 5-29 Advanced parameter settings

- Retry upon Failure (mandatory): Whether to re-execute the activity if the activity fails to be executed. Yes: re-execute the activity and configure Timeout Interval (timeout interval for activity execution), Maximum Retries (number of retries upon an execution failure), and Retry Interval (seconds) (interval between two retries). No: do not re-execute the activity. Default value: No.
- Failure policy (mandatory): Operation that will be performed if the activity re-execution still fails. Options: end the current job execution plan, or proceed to the next job.

5.2.7 Spark

Function

The Spark activity is used to execute a predefined Spark job on MRS.

Configuration

On the Edit page, drag and drop the Spark activity to the edit grid area. Click the Spark activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-30.

Table 5-30 Link relationship between the Spark activity and the data sources

Spark: OBS/HDFS/Dummy -> Spark -> OBS/HDFS/Dummy

- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items in the following section.

ParametersProperties

Table 5-31 describes the Spark properties.

Table 5-31 Spark properties

- Name (mandatory): Activity name. Example: Spark_2350
- MR Cluster (mandatory): MR cluster. To create a cluster, perform one of the following operations:
  - Click to create an MRS cluster as required. For details about parameter settings, see Creating an MRS Resource.
  - Go to the DPS resource management list page and create an MRS cluster.
  - Go to the MRS management console and create an MRS cluster.
  Example: DPS_using_mrs
- Job Name (mandatory): MRS job name. Example: Spark
- Jar File Path (mandatory): Path to the JAR package of the Spark job. Example: s3a://dpsfile/program/spark-test.jar
- Jar File Parameters (optional): Variables required for executing the JAR package. Example: com.spark.test.JavaWordCountWithSave
- Log path (mandatory): Path to the directory where logs are stored. The log path supports the following variables:
  - <scheduletime>: This indicates that a directory named after the time at which the pipeline starts running will be automatically created for storing log files.
  - <date>: This indicates that a directory named after the current date will be automatically created for storing log files.
  - <yesterday>: This indicates that a directory named after the previous day will be automatically created for storing log files.
  Example: s3a://dpsfile/log/<scheduletime>/

Data Pipeline Service User Guide · 5 Configuration Guide
Issue 05 (2018-01-30), Huawei Proprietary and Confidential, Copyright © Huawei Technologies Co., Ltd.
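The three log-path variables behave like simple string substitutions. The sketch below is illustrative only: the exact directory-name formats DPS generates are not documented here, so the `strftime` patterns are assumptions.

```python
from datetime import datetime, timedelta

def resolve_log_path(template, schedule_time):
    """Expand the DPS log-path variables into concrete directory names.

    Illustrative sketch: the timestamp formats below are assumptions, not
    the formats DPS actually uses.
    """
    day = schedule_time.date()
    substitutions = {
        "<scheduletime>": schedule_time.strftime("%Y-%m-%d_%H-%M-%S"),
        "<date>": day.strftime("%Y-%m-%d"),
        "<yesterday>": (day - timedelta(days=1)).strftime("%Y-%m-%d"),
    }
    for variable, value in substitutions.items():
        template = template.replace(variable, value)
    return template

# A pipeline that starts at 08:00 on 2018-01-30 would log under:
print(resolve_log_path("s3a://dpsfile/log/<scheduletime>/", datetime(2018, 1, 30, 8, 0, 0)))
# s3a://dpsfile/log/2018-01-30_08-00-00/
```

The same substitution applies wherever Log path accepts these variables in the activities below.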

Precondition

Table 5-32 describes the parameters of a Spark precondition.

NOTE: A maximum of five preconditions can be added.

Table 5-32 Precondition parameters

- Action After Precondition Check Failure (mandatory): Action that will be performed if the precondition is not met. Options are as follows:
  - Exit: Exit the activity.
  - Continue: Execute the activity.
- Check Method (mandatory): Options are as follows:
  - Meet all types: The system executes the activity only when all preconditions are met.
  - Meet any type: The system executes the activity if any precondition is met.
- Precondition Type (mandatory): Type of the precondition. Options are as follows:
  - Check whether the file exists.
  - Check the number of files in the folder.
  - Check the file size.

Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-33 describes the advanced parameters.

Table 5-33 Advanced parameter settings

- Retry upon Failure (mandatory): An indication of whether to re-execute the activity if the activity fails to be executed. Default value: No.
  - Yes: Re-execute the activity. Configure the following parameters:
    - Timeout Interval: Timeout interval for activity execution.
    - Maximum Retries: Number of retries upon an execution failure.
    - Retry Interval (seconds): Interval between two retries.
  - No: Do not re-execute the activity.
- Failure policy (mandatory): Operation that will be performed if the activity re-execution still fails. Options are as follows:
  - End the current job execution plan.
  - Proceed to the next job.
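The interplay of Maximum Retries, Retry Interval, and the failure policy can be pictured as a plain retry loop. This is an illustrative model of the documented behavior, not DPS code; per-attempt enforcement of Timeout Interval is omitted for brevity.

```python
import time

def run_with_retry(activity, maximum_retries=3, retry_interval_seconds=5):
    """Run `activity`; on failure, retry up to `maximum_retries` times,
    sleeping `retry_interval_seconds` between attempts. If every retry
    fails, the exception propagates to the failure policy (end the job
    execution plan, or proceed to the next job)."""
    for attempt in range(maximum_retries + 1):
        try:
            return activity()
        except Exception:
            if attempt == maximum_retries:
                raise  # retries exhausted: apply the failure policy
            time.sleep(retry_interval_seconds)
```

The same retry and failure-policy semantics apply to every activity's advanced settings in this chapter.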


5.2.8 SparkSQL

Function

The SparkSQL activity is used to execute the predefined SparkSQL statements on MRS.

Configuration

On the Edit page, drag and drop the SparkSQL activity to the edit grid area. Click the SparkSQL activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-34.

Table 5-34 Link relationship between the SparkSQL activity and the data sources

SparkSQL: Any data source -> SparkSQL -> any data source
NOTE: Connecting the SparkSQL activity to a data source only forms a complete pipeline; running the SparkSQL activity does not affect the connected data source.

- On the configuration page displayed at the right side of the edit grid area, view and edit the configuration items in the following sections.

Parameters

Properties

Table 5-35 describes the SparkSQL properties.

Table 5-35 SparkSQL properties

- Name (mandatory): Activity name. Example: SparkSQL_4667
- MR Cluster (mandatory): MR cluster. To create a cluster, perform one of the following operations:
  - Click to create an MRS cluster as required. For details about parameter settings, see Creating an MRS Resource.
  - Go to the DPS resource management list page and create an MRS cluster.
  - Go to the MRS management console and create an MRS cluster.
  Example: DPS_using_mrs
- Job Name (mandatory): MRS job name. Example: sparkSql
- Statements (mandatory): Spark SQL statements to be executed. Statements are separated by semicolons (;). Example: show tables;

Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-36 describes the advanced parameters.


Table 5-36 Advanced parameter settings

- Retry upon Failure (mandatory): An indication of whether to re-execute the activity if the activity fails to be executed. Default value: No.
  - Yes: Re-execute the activity. Configure the following parameters:
    - Timeout Interval: Timeout interval for activity execution.
    - Maximum Retries: Number of retries upon an execution failure.
    - Retry Interval (seconds): Interval between two retries.
  - No: Do not re-execute the activity.
- Failure policy (mandatory): Operation that will be performed if the activity re-execution still fails. Options are as follows:
  - End the current job execution plan.
  - Proceed to the next job.

5.2.9 Hive

Function

The Hive activity is used to execute the predefined Hive script files on MRS.

Configuration

On the Edit page, drag and drop the Hive activity to the edit grid area. Click the Hive activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-37.

Table 5-37 Link relationship between the Hive activity and the data sources

Hive: OBS/HDFS/Dummy -> Hive -> OBS/HDFS/Dummy

- On the configuration page displayed at the right side of the edit grid area, view and edit the configuration items in the following sections.

Parameters

Properties


Table 5-38 describes the Hive properties.

Table 5-38 Hive properties

- Name (mandatory): Activity name. Example: Hive_2909
- MR Cluster (mandatory): MR cluster. To create a cluster, perform one of the following operations:
  - Click to create an MRS cluster as required. For details about parameter settings, see Creating an MRS Resource.
  - Go to the DPS resource management list page and create an MRS cluster.
  - Go to the MRS management console and create an MRS cluster.
  Example: DPS_using_mrs
- Job Name (mandatory): MRS job name. Example: Hive
- Hive Script Path (mandatory): Path to the Hive script. Example: s3a://dpsfile/program/hivescript.sql
- Script Parameters (optional): Variables required for executing the Hive script. By default, this property is left blank.
- Log path (mandatory): Path to the directory where logs are stored. The log path supports the following variables:
  - <scheduletime>: This indicates that a directory named after the time at which the pipeline starts running will be automatically created for storing log files.
  - <date>: This indicates that a directory named after the current date will be automatically created for storing log files.
  - <yesterday>: This indicates that a directory named after the previous day will be automatically created for storing log files.
  Example: s3a://dpsfile/log/

Precondition

Table 5-39 describes the parameters of a Hive precondition.

NOTE: A maximum of five preconditions can be added.

Table 5-39 Precondition parameters

- Action After Precondition Check Failure (mandatory): Action that will be performed if the precondition is not met. Options are as follows:
  - Exit: Exit the activity.
  - Continue: Execute the activity.
- Check Method (mandatory): Options are as follows:
  - Meet all types: The system executes the activity only when all preconditions are met.
  - Meet any type: The system executes the activity if any precondition is met.
- Precondition Type (mandatory): Type of the precondition. Options are as follows:
  - Check whether the file exists.
  - Check the number of files in the folder.
  - Check the file size.

Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-40 describes the advanced parameters.

Table 5-40 Advanced parameter settings

- Retry upon Failure (mandatory): An indication of whether to re-execute the activity if the activity fails to be executed. Default value: No.
  - Yes: Re-execute the activity. Configure the following parameters:
    - Timeout Interval: Timeout interval for activity execution.
    - Maximum Retries: Number of retries upon an execution failure.
    - Retry Interval (seconds): Interval between two retries.
  - No: Do not re-execute the activity.
- Failure policy (mandatory): Operation that will be performed if the activity re-execution still fails. Options are as follows:
  - End the current job execution plan.
  - Proceed to the next job.


5.2.10 MapReduce

Function

The MapReduce activity is used to run the predefined MapReduce program on MRS.

Configuration

On the Edit page, drag and drop the MapReduce activity to the edit grid area. Click the MapReduce activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-41.

Table 5-41 Link relationship between the MapReduce activity and the data sources

MapReduce: OBS/HDFS/Dummy -> MapReduce -> OBS/HDFS/Dummy

- On the configuration page displayed at the right side of the edit grid area, view and edit the configuration items in the following sections.

Parameters

Properties

Table 5-42 describes the MapReduce properties.

Table 5-42 MapReduce properties

- Name (mandatory): Activity name. Example: MapReduce_8300
- MR Cluster (mandatory): MR cluster. To create a cluster, perform one of the following operations:
  - Click to create an MRS cluster as required. For details about parameter settings, see Creating an MRS Resource.
  - Go to the DPS resource management list page and create an MRS cluster.
  - Go to the MRS management console and create an MRS cluster.
  Example: DPS_using_mrs
- Job Name (mandatory): MRS job name. Example: MR
- Jar File Path (mandatory): Path to the JAR package. Example: s3a://dpsfile/program/hadoop-mapreduce-examples-2.7.1.jar
- Jar File Parameters (optional): Variables required for executing the JAR package. Example: wordcount
- Log path (mandatory): Path to the directory where logs are stored. The log path supports the following variables:
  - <scheduletime>: This indicates that a directory named after the time at which the pipeline starts running will be automatically created for storing log files.
  - <date>: This indicates that a directory named after the current date will be automatically created for storing log files.
  - <yesterday>: This indicates that a directory named after the previous day will be automatically created for storing log files.
  Example: s3a://dpsfile/log/

Precondition

Table 5-43 describes the parameters of a MapReduce precondition.

NOTE: A maximum of five preconditions can be added.

Table 5-43 Precondition parameters

- Action After Precondition Check Failure (mandatory): Action that will be performed if the precondition is not met. Options are as follows:
  - Exit: Exit the activity.
  - Continue: Execute the activity.
- Check Method (mandatory): Options are as follows:
  - Meet all types: The system executes the activity only when all preconditions are met.
  - Meet any type: The system executes the activity if any precondition is met.
- Precondition Type (mandatory): Type of the precondition. Options are as follows:
  - Check whether the file exists.
  - Check the number of files in the folder.
  - Check the file size.

Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-44 describes the advanced parameters.

Table 5-44 Advanced parameter settings

- Retry upon Failure (mandatory): An indication of whether to re-execute the activity if the activity fails to be executed. Default value: No.
  - Yes: Re-execute the activity. Configure the following parameters:
    - Timeout Interval: Timeout interval for activity execution.
    - Maximum Retries: Number of retries upon an execution failure.
    - Retry Interval (seconds): Interval between two retries.
  - No: Do not re-execute the activity.
- Failure policy (mandatory): Operation that will be performed if the activity re-execution still fails. Options are as follows:
  - End the current job execution plan.
  - Proceed to the next job.


5.2.11 Shell Script

Function

The Shell Script activity is used to execute user-specified shell scripts on the ECS server.

Configuration

On the Edit page, drag and drop the Shell Script activity to the edit grid area. Click the Shell Script activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-45.

Table 5-45 Link relationships between the Shell Script activity and the data sources

Shell Script: Any data source -> Shell Script -> any data source
The Shell Script activity can also be connected to the Shell Script, CDM Job, Create OBS, and Delete OBS activities.
NOTE: Connecting the Shell Script activity to a data source or activity only forms a complete pipeline; running the Shell Script activity does not affect the connected data source or activity.

- On the configuration page displayed at the right side of the edit grid area, view and edit the configuration items in the following sections.

Parameters

Properties

Table 5-46 describes the Shell Script properties.

Table 5-46 Shell Script properties

- Name (mandatory): Activity name. Example: ShellScript_9167
- Compute Resource (mandatory): Name of the DPS Agent that has been registered on the ECS server. NOTE: If the installed DPS Agent is not available, contact technical support. Example: test
- Script Path (mandatory): Absolute path to the shell script on the ECS server. Example: /tmp/test.sh
- Log Backup Required (mandatory): An indication of whether to back up logs. Example: True
- Log path (optional): Log backup directory. This parameter is required only when Log Backup Required is set to True. The log path supports the following variables:
  - <scheduletime>: This indicates that a directory named after the time at which the pipeline starts running will be automatically created for storing log files.
  - <date>: This indicates that a directory named after the current date will be automatically created for storing log files.
  - <yesterday>: This indicates that a directory named after the previous day will be automatically created for storing log files.
  Example: s3a://dpsfile/log/

Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-47 describes the advanced parameters.


Table 5-47 Advanced parameter settings

- Retry upon Failure (mandatory): An indication of whether to re-execute the activity if the activity fails to be executed. Default value: No.
  - Yes: Re-execute the activity. Configure the following parameters:
    - Timeout Interval: Timeout interval for activity execution.
    - Maximum Retries: Number of retries upon an execution failure.
    - Retry Interval (seconds): Interval between two retries.
  - No: Do not re-execute the activity.
- Failure policy (mandatory): Operation that will be performed if the activity re-execution still fails. Options are as follows:
  - End the current job execution plan.
  - Proceed to the next job.

5.2.12 MachineLearning

Function

The MachineLearning activity is used to execute the workflows of Machine Learning Service (MLS).

Configuration

On the Edit page, drag and drop the MachineLearning activity to the edit grid area. Click the MachineLearning activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-48.

Table 5-48 Link relationship between the MachineLearning activity and the data sources

MachineLearning: HDFS/Dummy -> MachineLearning -> HDFS/Dummy
NOTE: Connecting the MachineLearning activity to a data source only forms a complete pipeline; running the MachineLearning activity does not affect the connected data source.


- On the configuration page displayed at the right side of the edit grid area, view and edit the configuration items in the following sections.

Parameters

Properties

Table 5-49 describes the MachineLearning properties.

Table 5-49 MachineLearning properties

- Name (mandatory): Activity name. Example: MachineLearning_2113
- MLS Instance Name (mandatory): Name of the MLS instance. To create an instance, click Create MLS Instance or go to the MLS management console for creation. Example: mls-7da7
- MLS Project Name (mandatory): Name of the MLS project. Example: projectname
- MLS Workflow Name (mandatory): Name of the MLS workflow. Example: test

Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-50 describes the advanced parameters.


Table 5-50 Advanced parameter settings

- Retry upon Failure (mandatory): An indication of whether to re-execute the activity if the activity fails to be executed. Default value: No.
  - Yes: Re-execute the activity. Configure the following parameters:
    - Timeout Interval: Timeout interval for activity execution.
    - Maximum Retries: Number of retries upon an execution failure.
    - Retry Interval (seconds): Interval between two retries.
  - No: Do not re-execute the activity.
- Failure policy (mandatory): Operation that will be performed if the activity re-execution still fails. Options are as follows:
  - End the current job execution plan.
  - Proceed to the next job.

5.2.13 Elasticsearch

Function

The Elasticsearch activity is used to execute ES requests (GET, PUT, POST, HEAD, and DELETE requests).

Configuration

On the Edit page, drag and drop the Elasticsearch activity to the edit grid area. Click the Elasticsearch activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-51.

Table 5-51 Link relationship between the Elasticsearch activity and the data sources

Elasticsearch:
- OBS/ES Storage/Dummy -> Elasticsearch -> ES Storage
- ES Storage -> Elasticsearch -> OBS/ES Storage/Dummy

- On the configuration page displayed at the right side of the edit grid area, view and edit the configuration items in the following sections.


Parameters

Properties

Table 5-52 describes the Elasticsearch properties.

Table 5-52 Elasticsearch properties

- Name (mandatory): Activity name. Example: Elasticsearch_6150
- Compute Resource (mandatory): Name of the DPS Agent that has been registered on the ECS server.
  NOTE:
  - Ensure that the security group of the ES cluster in ES Storage is the same as the security group of the DPS Agent.
  - If the installed DPS Agent is not available, contact technical support.
  Example: test
- Request Type (mandatory): Request to be executed. Options are GET, POST, PUT, HEAD, and DELETE. Example: GET
- Request Parameter (optional): Request parameter. For example, to query the dpsdata mapping type of the dps_search index, the request parameter is /dps_search/dpsdata/_search. Example: /dps_search/dpsdata/_search
- Request Body (optional): JSON-format request body. Example: {"query": {"constant_score": {"filter": {"terms": {"price": [200, 300]}}}}}
- Log path (mandatory): Path to the directory where logs are stored. The log path supports the following variables:
  - <scheduletime>: This indicates that a directory named after the time at which the pipeline starts running will be automatically created for storing log files.
  - <date>: This indicates that a directory named after the current date will be automatically created for storing log files.
  - <yesterday>: This indicates that a directory named after the previous day will be automatically created for storing log files.
  Example: s3a://dpsfile/log/
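Conceptually, the activity combines Request Type, Request Parameter, and Request Body into a single HTTP request against the ES cluster. The sketch below illustrates that assembly with the standard library; the cluster endpoint is a hypothetical placeholder (the activity resolves the real address through the DPS Agent), and this is not DPS code.

```python
import json
import urllib.request

def build_es_request(endpoint, request_type, request_parameter, request_body=None):
    """Assemble one ES HTTP request from the activity's three request fields.

    `endpoint` is a hypothetical cluster address used for illustration.
    """
    url = endpoint.rstrip("/") + request_parameter
    data = None
    if request_body is not None:
        data = json.dumps(request_body).encode("utf-8")
    return urllib.request.Request(
        url, data=data, method=request_type,
        headers={"Content-Type": "application/json"},
    )

body = {"query": {"constant_score": {"filter": {"terms": {"price": [200, 300]}}}}}
request = build_es_request("http://es.example.com:9200", "GET",
                           "/dps_search/dpsdata/_search", body)
# urllib.request.urlopen(request) would then send it to the cluster.
```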

Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-53 describes the advanced parameters.


Table 5-53 Advanced parameter settings

- Retry upon Failure (mandatory): An indication of whether to re-execute the activity if the activity fails to be executed. Default value: No.
  - Yes: Re-execute the activity. Configure the following parameters:
    - Timeout Interval: Timeout interval for activity execution.
    - Maximum Retries: Number of retries upon an execution failure.
    - Retry Interval (seconds): Interval between two retries.
  - No: Do not re-execute the activity.
- Failure policy (mandatory): Operation that will be performed if the activity re-execution still fails. Options are as follows:
  - End the current job execution plan.
  - Proceed to the next job.

5.2.14 RDS SQL

Function

The RDS SQL activity transfers SQL statements (DML and DDL statements) to RDS, and RDS then executes the statements.

Configuration

On the Edit page, drag and drop the RDS SQL activity to the edit grid area. Click the RDS SQL activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-54.

Table 5-54 Link relationship between the RDS SQL activity and the data sources

RDS SQL: RDS -> RDS SQL -> RDS
NOTE: The input and output RDS data sources must be in the same database.

- On the configuration page displayed at the right side of the edit grid area, view and edit the configuration items in the following sections.


Parameters

Properties

Table 5-55 describes the RDS SQL properties.

Table 5-55 RDS SQL properties

- Name (mandatory): Activity name. Example: RDS_SQL_4574
- Compute Resource (mandatory): Running environment of the activity. Options are as follows:
  - MR Cluster. To create a cluster, perform one of the following operations:
    - Click to create an MRS cluster as required. For details about parameter settings, see Creating an MRS Resource.
    - Go to the DPS resource management list page and create an MRS cluster.
    - Go to the MRS management console and create an MRS cluster.
  - ComputeResource: DPS Agent that has been registered with ECS.
  NOTE:
  - If an MR cluster is selected as the running environment of the activity, ensure that the security group of RDS is the same as that of the master node of the MR cluster.
  - If the installed DPS Agent is not available, contact technical support.
  Example: DPS_using_mrs
- Log path (mandatory): Path to the directory where logs are stored. The log path supports the following variables:
  - <scheduletime>: This indicates that a directory named after the time at which the pipeline starts running will be automatically created for storing log files.
  - <date>: This indicates that a directory named after the current date will be automatically created for storing log files.
  - <yesterday>: This indicates that a directory named after the previous day will be automatically created for storing log files.
  Example: s3a://dps/log/
- Statements (mandatory): Statements to be executed. Use a semicolon (;) to separate statements. The following SQL statements are supported: CREATE, DROP, ALTER, INSERT, DELETE, UPDATE, and CALL. Example: INSERT INTO test VALUES ('values1', 25);
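Because the Statements field holds several statements separated by semicolons, they have to be split into individual statements before execution. A minimal sketch of that split (our illustration, not DPS code; it deliberately ignores semicolons inside quoted literals, which a real splitter must handle):

```python
def split_statements(statements_field):
    """Split the Statements property on semicolons, dropping whitespace-only
    fragments. Naive on purpose: semicolons inside string literals are not
    handled."""
    return [part.strip() for part in statements_field.split(";") if part.strip()]

print(split_statements("INSERT INTO test VALUES ('values1', 25); DELETE FROM test"))
# ["INSERT INTO test VALUES ('values1', 25)", 'DELETE FROM test']
```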

Precondition

Table 5-56 describes the parameters of an RDS SQL precondition.

NOTE:
- A maximum of five preconditions can be added.
- When Compute Resource is set to ComputeResource, preconditions cannot be configured.


Table 5-56 Precondition parameters

- Action After Precondition Check Failure (mandatory): Action that will be performed if the precondition is not met. Options are as follows:
  - Exit: Exit the activity.
  - Continue: Execute the activity.
- Check Method (mandatory): Options are as follows:
  - Meet all types: The system executes the activity only when all preconditions are met.
  - Meet any type: The system executes the activity if any precondition is met.
- Precondition Type (mandatory): Type of the precondition. The only option is:
  - Check whether the database table exists.

Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-57 describes the advanced parameters.

Table 5-57 Advanced parameter settings

- Retry upon Failure (mandatory): An indication of whether to re-execute the activity if the activity fails to be executed. Default value: No.
  - Yes: Re-execute the activity. Configure the following parameters:
    - Timeout Interval: Timeout interval for activity execution.
    - Maximum Retries: Number of retries upon an execution failure.
    - Retry Interval (seconds): Interval between two retries.
  - No: Do not re-execute the activity.
- Failure policy (mandatory): Operation that will be performed if the activity re-execution still fails. Options are as follows:
  - End the current job execution plan.
  - Proceed to the next job.


5.2.15 DWS SQL

Function

The DWS SQL activity transfers SQL statements (DML and DDL statements) to DWS, which then executes them.

Configuration

On the Edit page, drag and drop the DWS SQL activity to the edit grid area. Click the DWS SQL activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-58.

Table 5-58 Link relationship between the DWS SQL activity and the data sources

Activity: DWS SQL
Link Relationship: DWS -> DWS SQL -> DWS

- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items in the following section.

Parameters

Properties

Table 5-59 describes the DWS SQL properties.

Table 5-59 DWS SQL properties

Property: Name
Mandatory: Yes
Description: Activity name.
Example Value: DWS_SQL_2113

Property: Compute Resource
Mandatory: Yes
Description: Name of the DPS Agent that has been registered on the ECS server.
NOTE: If the installed DPS Agent is not available, contact technical support.
Example Value: test

Property: Statements
Mandatory: Yes
Description: SQL statements. Use a semicolon (;) to separate two statements. The following SQL statements are supported:
- CREATE, DROP, and ALTER
- INSERT, DELETE, UPDATE, and CALL
Example Value: CREATE TABLE test9 (callee_number varchar(20));

Property: Log Backup Required
Mandatory: Yes
Description: An indication of whether to back up logs.
Example Value: True

Property: Log path
Mandatory: No
Description: Log backup directory. This parameter is required only when Log Backup Required is set to True. Log path supports the following variables:
- <scheduletime>: A directory named after the time at which the pipeline starts running is automatically created for storing log files.
- <date>: A directory named after the current date is automatically created for storing log files.
- <yesterday>: A directory named after the previous day is automatically created for storing log files.
Example Value: s3a://dpsfile/log/
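The log-path variables expand to date- or time-named directories. A sketch of how such placeholders could resolve follows; the exact directory-name formats DPS uses are not documented here, so the formats below (YYYYMMDDHHMMSS and YYYYMMDD) are assumptions:

```python
from datetime import datetime, timedelta

def expand_log_path(path, now=None):
    """Expand <scheduletime>, <date>, and <yesterday> in a log path.

    Assumed formats: <scheduletime> -> YYYYMMDDHHMMSS,
    <date> and <yesterday> -> YYYYMMDD. DPS may use different formats.
    """
    now = now or datetime.now()
    today = now.date()
    return (path
            .replace("<scheduletime>", now.strftime("%Y%m%d%H%M%S"))
            .replace("<date>", today.strftime("%Y%m%d"))
            .replace("<yesterday>",
                     (today - timedelta(days=1)).strftime("%Y%m%d")))

print(expand_log_path("s3a://dpsfile/log/<date>/",
                      now=datetime(2018, 1, 30, 8, 0, 0)))
# s3a://dpsfile/log/20180130/
```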

Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-60 describes the advanced parameters.


Table 5-60 Advanced parameter settings

Parameter: Retry upon Failure
Mandatory: Yes
Description: An indication of whether to re-execute the activity if the activity fails to be executed.
- Yes: Re-execute the activity. Configure the following parameters:
  - Timeout Interval: Timeout interval for activity execution.
  - Maximum Retries: Number of retries upon an execution failure.
  - Retry Interval (seconds): Interval between two retries.
- No: Do not re-execute the activity.
Default value: No.

Parameter: Failure policy
Mandatory: Yes
Description: Operation that will be performed if the activity re-execution still fails.
- End the current job execution plan.
- Proceed to the next job.

5.2.16 UQuery SQL

Function

The UQuery SQL activity transfers SQL statements to UQuery to implement cloud data queries.

Configuration

On the Edit page, drag and drop the UQuery SQL activity to the edit grid area. Click the UQuery SQL activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-61.

Table 5-61 Link relationship between the UQuery SQL activity and the data sources

Activity: UQuery SQL
Link Relationship:
- OBS -> UQuery SQL -> UQuery Table
- UQuery Table -> UQuery SQL -> UQuery Table
NOTE: In the pipeline UQuery Table -> UQuery SQL -> UQuery Table, the input and output UQuery Table data sources must be set to the same database or data table.


- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items in the following section.

Parameters

Properties

Table 5-62 describes the UQuery SQL properties.

Table 5-62 UQuery SQL properties

Property: Name
Mandatory: Yes
Description: Activity name.
Example Value: UQuery_SQL_4796

Property: Queue Name
Mandatory: Yes
Description: Name of a created UQuery queue.
Example Value: dps_uquery

Property: Query
Mandatory: Yes
Description: Only SQL statements that start with CREATE, DROP, ALTER, or INSERT are supported. SQL statements may contain the following variables:
- <obspath>: OBS bucket path.
- <tablename>: UQuery data table name.
- <databasename>: UQuery database name.
Example Value: create table <tablename> (id int, name string) using csv options (path '<obspath>');

Property: Log Path
Mandatory: No
Description: Path of the execution log.
Example Value: s3a://dps/log/
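The <obspath>, <tablename>, and <databasename> variables are plain placeholders substituted into the query text before it runs. A sketch of that substitution; the function and the sample values are illustrative only, as DPS fills the real values in from the linked data sources:

```python
# Illustrative substitution of UQuery SQL placeholders. The values
# passed below are hypothetical, not real DPS resources.

def render_query(query, obspath=None, tablename=None, databasename=None):
    """Replace each placeholder for which a value was supplied."""
    values = {"<obspath>": obspath,
              "<tablename>": tablename,
              "<databasename>": databasename}
    for placeholder, value in values.items():
        if value is not None:
            query = query.replace(placeholder, value)
    return query

sql = ("create table <tablename> (id int, name string) "
       "using csv options (path '<obspath>')")
print(render_query(sql, obspath="s3a://dps/data/", tablename="demo_table"))
```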

Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-63 describes the advanced parameters.


Table 5-63 Advanced parameter settings

Parameter: Retry upon Failure
Mandatory: Yes
Description: An indication of whether to re-execute the activity if the activity fails to be executed.
- Yes: Re-execute the activity. Configure the following parameters:
  - Timeout Interval: Timeout interval for activity execution.
  - Maximum Retries: Number of retries upon an execution failure.
  - Retry Interval (seconds): Interval between two retries.
- No: Do not re-execute the activity.
Default value: No.

Parameter: Failure policy
Mandatory: Yes
Description: Operation that will be performed if the activity re-execution still fails.
- End the current job execution plan.
- Proceed to the next job.

5.2.17 Create OBS

Function

The Create OBS activity is used to create buckets or directories in OBS.

Configuration

On the Edit page, drag and drop the Create OBS activity to the edit grid area. Click the Create OBS activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-64.

Table 5-64 Link relationship between the Create OBS activity and the data sources

Activity: Create OBS
Link Relationship: Any data source -> Create OBS -> any data source
The Create OBS activity can be connected to the Shell Script, CDM Job, Create OBS, and Delete OBS activities.
NOTE: Connecting the Create OBS activity to a data source or activity only forms a complete pipeline; running the Create OBS activity does not affect the connected data source or activity.


- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items in the following section.

Parameters

Properties

Table 5-65 describes the Create OBS properties.

Table 5-65 Create OBS properties

Property: Name
Mandatory: Yes
Description: Activity name.
Example Value: Create_OBS_2113

Property: OBS Path
Mandatory: Yes
Description: Path to the OBS bucket or directory to be created.
- To create a bucket, enter the OBS bucket name following //. The OBS bucket name must be unique.
- To create an OBS directory, select the location where the OBS directory is to be created, and enter the directory name following the path to the location. The directory name must be unique.
OBS Path supports the following variables:
- <scheduletime>: An OBS bucket or directory named after the time at which the pipeline starts running is automatically created.
- <date>: An OBS bucket or directory named after the current date is automatically created.
- <yesterday>: An OBS bucket or directory named after the previous day is automatically created.
Example Value: s3a://newbucket/<scheduletime>/

Property: Log Path
Mandatory: No
Description: Path to the directory where logs are stored.
Example Value: s3a://dps/log/
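Whether the activity creates a bucket or a directory depends on the shape of the OBS path: the name directly after // is the bucket, and any further path segments name a directory inside it. The following sketch illustrates that distinction; it is an illustration of the rule above, not the actual DPS parser:

```python
# Illustration: classify an s3a:// OBS path as a bucket or a directory,
# mirroring the bucket/directory rule described in Table 5-65.

def classify_obs_path(path):
    """Return (kind, bucket, directory) for an s3a:// path."""
    prefix = "s3a://"
    if not path.startswith(prefix):
        raise ValueError("expected an s3a:// path")
    parts = path[len(prefix):].strip("/").split("/")
    bucket, directories = parts[0], parts[1:]
    if directories:
        return ("directory", bucket, "/".join(directories))
    return ("bucket", bucket, None)

print(classify_obs_path("s3a://newbucket/"))
# ('bucket', 'newbucket', None)
print(classify_obs_path("s3a://newbucket/2018/01/30/"))
# ('directory', 'newbucket', '2018/01/30')
```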

Advanced settings


The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-66 describes the advanced parameters.

Table 5-66 Advanced parameter settings

Parameter: Retry upon Failure
Mandatory: Yes
Description: An indication of whether to re-execute the activity if the activity fails to be executed.
- Yes: Re-execute the activity. Configure the following parameters:
  - Timeout Interval: Timeout interval for activity execution.
  - Maximum Retries: Number of retries upon an execution failure.
  - Retry Interval (seconds): Interval between two retries.
- No: Do not re-execute the activity.
Default value: No.

Parameter: Failure policy
Mandatory: Yes
Description: Operation that will be performed if the activity re-execution still fails.
- End the current job execution plan.
- Proceed to the next job.

5.2.18 Delete OBS

Function

The Delete OBS activity is used to delete buckets or directories in OBS.

Configuration

On the Edit page, drag and drop the Delete OBS activity to the edit grid area. Click the Delete OBS activity.

- On the Input and Output tab pages at the left side of the edit grid area, check the input and output data sources to which the activity connects, as shown in Table 5-67.


Table 5-67 Link relationship between the Delete OBS activity and the data sources

Activity: Delete OBS
Link Relationship: Any data source -> Delete OBS -> any data source
The Delete OBS activity can be connected to the Shell Script, CDM Job, Create OBS, and Delete OBS activities.
NOTE: Connecting the Delete OBS activity to a data source or activity only forms a complete pipeline; running the Delete OBS activity does not affect the connected data source or activity.

- On the configuration page that is displayed at the right side of the edit grid area, view and edit the configuration items in the following section.

Parameters

Properties

Table 5-68 describes the Delete OBS properties.

Table 5-68 Delete OBS properties

Property: Name
Mandatory: Yes
Description: Activity name.
Example Value: Delete_OBS_2113

Property: OBS Path
Mandatory: Yes
Description: Path to the OBS bucket or directory to be deleted. OBS Path supports the following variables:
- <scheduletime>: The OBS bucket or directory named after the time at which the pipeline starts running is automatically deleted.
- <date>: The OBS bucket or directory named after the current date is automatically deleted.
- <yesterday>: The OBS bucket or directory named after the previous day is automatically deleted.
NOTE: If an OBS bucket or directory is deleted, files stored in it are also deleted and cannot be restored. If you need to retain the files stored in the bucket or directory, back them up in advance.
Example Value: s3a://obs-6dc4/<scheduletime>/

Property: Log Path
Mandatory: No
Description: Path to the directory where logs are stored.
Example Value: s3a://dps/log/

Advanced settings

The advanced settings define the operation policy that takes effect if the activity fails to be executed. Table 5-69 describes the advanced parameters.


Table 5-69 Advanced parameter settings

Parameter: Retry upon Failure
Mandatory: Yes
Description: An indication of whether to re-execute the activity if the activity fails to be executed.
- Yes: Re-execute the activity. Configure the following parameters:
  - Timeout Interval: Timeout interval for activity execution.
  - Maximum Retries: Number of retries upon an execution failure.
  - Retry Interval (seconds): Interval between two retries.
- No: Do not re-execute the activity.
Default value: No.

Parameter: Failure policy
Mandatory: Yes
Description: Operation that will be performed if the activity re-execution still fails.
- End the current job execution plan.
- Proceed to the next job.


6 FAQs

6.1 What Is DPS?

DPS is one of the public cloud services. It helps you easily create and schedule pipelines. DPS has integrated with multiple cloud services, enabling you to conveniently use and transfer data stored in OBS and RDS. DPS allows you to create and schedule MRS-based data processing and analysis tasks.

6.2 Which Services Can DPS Schedule?

DPS can schedule the following services:

- OBS
- MRS
- RDS
- ECS
- DWS
- DIS
- CDM
- MLS
- UQuery
- ES

6.3 How Many Pipelines Can I Create Using the DPS Console?

By default, each user can create a maximum of 10 pipelines. If this quota cannot meet your requirement, you can apply for a higher quota.


6.4 What Can DPS Do?

- Using DPS, you can customize pipelines through simple drag and drop operations, schedule the execution of pipelines, and define the scripts and policies to be executed in case of task failures.
- DPS provides multiple data collection and processing methods, freeing you from complex pipeline compilation. This enables you to focus on data processing logic instead of programming.
- DPS supports connector creation and management. With this function, you can directly use a created and configured connector in a pipeline, eliminating the need for duplicate connector configurations.
- DPS offers pre-packaged templates, facilitating pipeline creation.
- DPS supports pipeline file import and export. It allows you to export pipeline files to your local PC and import pipeline files to create or edit pipelines.
- DPS provides the resource management function. Using this function, you can configure resource management and scheduling tasks to automatically create and delete resources.

6.5 What Is a Pipeline?

A pipeline is formed by a series of activities and data sources. Activities indicate actions performed on data; data sources indicate the locations of input and output data. Activities linked together are executed according to their linking sequence. That is, DPS executes the next activity only after the previous one is completed.
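The sequencing rule above can be sketched as a simple loop: each linked activity runs only after its predecessor completes. The activity names and the success/failure callables below are made up for illustration:

```python
# Sketch of pipeline sequencing: DPS runs the next activity only
# after the previous one completes. Activities here are stand-ins,
# each a (name, callable-returning-bool) pair.

def run_pipeline(activities):
    """Execute linked activities in order; stop at the first failure.

    Returns (list of completed activity names, failed name or None).
    """
    completed = []
    for name, action in activities:
        if not action():          # activity failed; successor never starts
            return completed, name
        completed.append(name)    # predecessor done; next may start
    return completed, None

pipeline = [
    ("CDM Job",    lambda: True),
    ("DWS SQL",    lambda: True),
    ("Delete OBS", lambda: True),
]
print(run_pipeline(pipeline))
# (['CDM Job', 'DWS SQL', 'Delete OBS'], None)
```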

6.6 What Is a Data Source?

A data source indicates the location of data processed in a pipeline. For example, an OBS data source indicates the data stored in OBS.


A Change History

Release Date: 2018-01-30
What's New: This issue is the fifth official release.
Added the following content:
- Creating an ES Source Connector
- Creating a CDM Resource
- UQuery Table
- ES Storage
- UQuery<->OBS
- MachineLearning
- Elasticsearch
- UQuery SQL
Modified the following content:
- Related Services
- Configuring DPS Agent
- Editing a Pipeline
- Scheduling a Pipeline
- Database<->HDFS
- Create OBS
- Delete OBS


Release Date: 2017-12-08
What's New: This issue is the fourth official release.
Added the following content:
- Basic Concepts
- Getting Started
- Creating a CDM Connector
- Creating a DIS Resource
- Creating an MRS Resource
- CDM Source
- Dummy
- ExecuteCDM
- Create OBS
- Delete OBS
- CDM Job
Modified the following content:
- Related Services
- Obtaining an AK/SK Pair
- Configuring DPS Agent
- Editing a Pipeline
- Monitoring a Pipeline
- Spark
- Hive
- RDS SQL
- HDFS->HBASE
- HDFS<->OBS
- MapReduce
- Database<->HDFS
- DWS SQL
- Which Services Can DPS Schedule?

Release Date: 2017-11-01
What's New: This issue is the third official release.
Modified the following content:
- Installation Flow
- Purchasing Elastic Cloud Server (ECS)
- Obtaining an AK/SK Pair
- (Optional) Connecting to DWS Cluster


Release Date: 2017-10-27
What's New: This issue is the second official release.
Added the following content:
- Connector Creation and Management
- Resource Creation and Management
- Installing DPS Agent
- Exporting a Pipeline
- Connector List
- Resource List
- DWS
- Shell Script
Modified the following content:
- Pipeline Creation and Management
- Related Services
- Permissions Required for Accessing DPS
- Buying a Pipeline
- Editing a Pipeline
- Monitoring a Pipeline
- RDS
- Activities

Release Date: 2017-08-26
What's New: This issue is the first official release.
