Altus Data Engineering

Important Notice

© 2010-2019 Cloudera, Inc. All rights reserved.

Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are trademarks of Cloudera and its suppliers or licensors, and may not be copied, imitated or used, in whole or in part, without the prior written permission of Cloudera or the applicable trademark holder. If this documentation includes code, including but not limited to, code examples, Cloudera makes this available to you under the terms of the Apache License, Version 2.0, including any required notices. A copy of the Apache License Version 2.0, including any notices, is included herein. A copy of the Apache License Version 2.0 can also be found here: https://opensource.org/licenses/Apache-2.0

Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation. All other trademarks, registered trademarks, product names and company names or logos mentioned in this document are the property of their respective owners. Reference to any products, services, processes or other information, by trade name, trademark, manufacturer, supplier or otherwise does not constitute or imply endorsement, sponsorship or recommendation thereof by us.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Cloudera.

Cloudera may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Cloudera, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property. For information about patents covering Cloudera products, see http://tiny.cloudera.com/patents.

The information in this document is subject to change without notice. Cloudera shall not be liable for any damages resulting from technical errors or omissions which may be present in this document, or from use of this document.

Cloudera, Inc.
395 Page Mill Road
Palo Alto, CA
[email protected]
US: 1-888-789-1488
Intl: 1-650-362-0488
www.cloudera.com

Release Information

Version: Cloudera Altus
Date: April 5, 2019


Table of Contents

Overview of Altus Data Engineering
  Altus Data Engineering Service Architecture

Altus Data Engineering Clusters
  Cluster Status
  Creating and Working with Clusters on the Console
    Creating a Data Engineering Cluster for AWS
    Creating a Data Engineering Cluster for Azure
    Viewing the Cluster Status
    Viewing the Cluster Details
    Deleting a Cluster
  Creating and Working with Clusters Using the CLI
    Creating a Cluster for AWS
    Creating a Cluster for Azure
    Viewing the Cluster Status
    Deleting a Cluster
  Connecting to a Cluster
  Clusters on AWS
    Worker Nodes
    Spot Instances
    Instance Reprovisioning
    System Volume

Altus Data Engineering Jobs
  Job Status
  Job Queue
  Running and Monitoring Jobs on the Console
    Submitting a Job on the Console
    Submitting Multiple Jobs on the Console
    Viewing Job Status and Information
    Viewing the Job Details
  Running and Monitoring Jobs Using the CLI
    Submitting a Spark Job
    Submitting a Hive Job
    Submitting a MapReduce Job
    Submitting a PySpark Job
    Submitting a Job Group with Multiple Job Types

Tutorial: Clusters and Jobs on AWS
  Prerequisites
  Altus Console Login
  Exercise 1: Installing the Altus Client
    Step 1. Install the Altus Client
    Step 2. Configure the Altus Client with the API Access Key
  Exercise 2: Creating a Spark Cluster and Submitting Spark Jobs
    Creating a Spark Cluster on the Console
    Submitting a Spark Job
    Creating a SOCKS Proxy for the Spark Cluster
    Viewing the Cluster and Verifying the Spark Job Output
    Creating a Spark Job using the CLI
    Terminating the Cluster
  Exercise 3: Creating a Hive Cluster and Submitting Hive Jobs
    Creating a Hive Cluster on the Console
    Submitting a Hive Job Group
    Creating a SOCKS Proxy for the Hive Cluster
    Viewing the Hive Cluster and Verifying the Hive Job Output
    Creating a Hive Job Group using the CLI
    Terminating the Hive Cluster

Tutorial: Clusters and Jobs on Azure
  Prerequisites
  Sample Files Upload
  Altus Console Login
  Exercise 1: Installing the Altus Client
    Step 1. Install the Altus Client
    Step 2. Configure the Altus Client with the API Access Key
  Exercise 2: Creating a Spark Cluster and Submitting Spark Jobs
    Creating a Spark Cluster on the Console
    Submitting a Spark Job
    Creating a SOCKS Proxy for the Spark Cluster
    Viewing the Cluster and Verifying the Spark Job Output
    Creating a Spark Job using the CLI
    Terminating the Cluster
  Exercise 3: Creating a Hive Cluster and Submitting Hive Jobs
    Creating a Hive Cluster on the Console
    Submitting a Hive Job Group
    Creating a SOCKS Proxy for the Hive Cluster
    Viewing the Hive Cluster and Verifying the Hive Job Output
    Creating a Hive Job Group using the CLI
    Terminating the Hive Cluster

Appendix: Apache License, Version 2.0

Overview of Altus Data Engineering

Altus Data Engineering enables you to create clusters and run jobs specifically for data science and engineering workloads. The Altus Data Engineering service offers multiple distributed processing engine options, including Hive, Spark, Hive on Spark, and MapReduce2 (MR2), which let you manage ETL, machine learning, and large-scale data processing workloads.

Altus Data Engineering Service Architecture

When you create an Altus Data Engineering cluster or submit a job in Altus, the Altus Data Engineering service accesses your AWS account or Azure subscription to create the cluster or run the job on your behalf.

If your Altus account uses AWS, an AWS administrator must set up a cross-account access role to provide Altus access to your AWS account. When a user in your Altus account creates an Altus Data Engineering cluster, the Altus Data Engineering service uses the Altus cross-account access credentials to create the cluster in your AWS account.

If your Altus account uses Azure, an administrator of your Azure subscription must provide consent for Altus to access the resources in your subscription. When a user in your Altus account creates an Altus Data Engineering cluster, your consent allows the Altus Data Engineering service to create the cluster in your Azure subscription.

Altus manages the clusters and jobs in your cloud provider account. You can configure your Altus Data Engineering cluster to be terminated when the cluster is no longer in use.

When you submit a job to run on a cluster, the Altus Data Engineering service creates a job queue for the cluster and adds the job to the job queue. The Altus Data Engineering service then runs the jobs in the cluster in your cloud provider account. In AWS, the jobs in the cluster access Amazon S3 object storage for data input and output. In Azure, the jobs in the cluster access Microsoft Azure Data Lake Store (ADLS) for data input and output.

The Altus Data Engineering service sends cluster diagnostic information and job execution metrics to Altus. It also stores the cluster and job information in your cloud object storage.

(Diagram: architecture and process flow of Altus Data Engineering.)

Altus Data Engineering Clusters

You can use the Cloudera Altus console or the command-line interface to create and manage Altus Data Engineering clusters. The Altus Data Engineering service provisions single-user, transient clusters.

By default, the Altus Data Engineering service creates a cluster that contains a master node and multiple worker nodes. The Altus Data Engineering service also creates a Cloudera Manager instance to manage the cluster. The Cloudera Manager instance provides visibility into the cluster but is not a part of the cluster. You cannot use the Cloudera Manager instance as a gateway node for the cluster.

Cloudera Manager configures the master node with roles that give it the capabilities of a gateway node. The master node has a resource manager, Hive server and metastore, Spark service, and other roles and client configurations that essentially turn the master node into a gateway node. You can use the master node as a gateway node in an Altus Data Engineering cluster to run Hive and Spark shell commands and Hadoop commands.

The Altus Data Engineering service creates a read-only user account to connect to the Cloudera Manager instance. When you create a cluster on the Altus console, specify the user name and password for the read-only user account. Use the user name and password to log in to Cloudera Manager.

When you create a cluster using the CLI and you do not specify a user name and password, the Altus Data Engineering service creates a guest user account with a randomly generated password. You can use the guest user name and password to log in to Cloudera Manager.

Altus appends tags to each node in a cluster. You can use the tags to identify the nodes and the cluster that they belong to.

When you create an Altus Data Engineering cluster, you specify which service runs in the cluster. Select the service appropriate for the type of job that you plan to run on the cluster.

The following list describes the services available in Altus clusters and the types of jobs you can run with each service (Service Type: Job Type):

• Hive: Hive jobs
• Hive on Spark: Hive jobs
• Spark 2.x: Spark or PySpark jobs
• Spark 1.6: Spark or PySpark jobs
• MapReduce2: MapReduce2 jobs
• Multi: Hive, Spark or PySpark, and MapReduce2 jobs. The Multi service cluster supports Spark 2.x; it does not support Spark 1.6.

Cluster Status

A cluster periodically changes status from the time that you create it until the time it is terminated.

An Altus cluster can have the following statuses:

• Creating. The cluster creation process is in progress.
• Created. The cluster was successfully created.
• Failed. The cluster can be in a failed state at creation or at termination time. View the failure message to get more information about the failure.
• Terminating. The cluster is in the process of being terminated.


When the cluster is terminated, it is removed from the list of clusters displayed in the Clusters page on the console. It is also not included in the list of clusters displayed when you run the list-clusters command.
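For example, you can confirm from the command line which clusters are still active by running the list-clusters command (a minimal sketch; run altus dataeng help for the full set of options):

altus dataeng list-clusters

A terminated cluster does not appear in the output.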

Creating and Working with Clusters on the Console

You can create a cluster on the Cloudera Altus console. You can also view the status and configuration of all clusters created through Altus in your cloud provider account.

Creating a Data Engineering Cluster for AWS

To create an Altus Data Engineering cluster on the console for AWS:

1. Sign in to the Cloudera Altus console:

https://console.altus.cloudera.com/

2. On the side navigation panel, click Clusters.

By default, the Clusters page displays the list of all the Altus Data Engineering clusters in your Altus account. The cloud icon next to the cluster name indicates the cloud service provider for the cluster. You can filter the list by environment and status. You can also search for clusters by name.

3. Click Create Cluster.
4. In the General Information section, specify the following information:

Cluster Name
The name to identify the cluster that you are creating. The cluster name is an alphanumeric string of any length. It can include dashes (-) and underscores (_). It cannot include a space.

Service Type
Indicates the service to be installed on the cluster. Select the service based on the types of jobs you plan to run on the cluster. You can select from the following service types:
• Hive
• Hive on Spark
• Spark 2.x
• Spark 1.6
  Select Spark 1.6 only if your application specifically requires Spark version 1.6. Altus supports Spark 1.6 only on CDH 5.11.
• MapReduce2
• Multi
  A cluster with service type Multi allows you to run different types of jobs. You can run the following types of jobs in a Multi cluster: Spark 2.x, Hive, and MapReduce2.
Note: If you are creating the cluster for a specific job or job group, the list includes only service types that can handle the job or group of jobs that you want to run.

CDH Version
The CDH version that the cluster will use. You can select from the following CDH versions:
• CDH 6.1
• Any version from CDH 5.11 to CDH 5.15
The CDH version that you select can affect the service that runs on the cluster:
• Spark 2.x or Spark 1.6. For a Spark service type, you must select the CDH version that supports the selected Spark version. Altus supports the following combinations of CDH and Spark versions:
  - CDH 6.1 with Spark 2.4
  - CDH 5.12 or later 5.x versions with Spark 2.2
  - CDH 5.11 with Spark 2.1 or Spark 1.6
• Hive on Spark. On CDH version 5.13 or later, dynamic partition pruning (DPP) is enabled for Hive on Spark by default. For details, see Dynamic Partition Pruning for Hive Map Joins in the Cloudera Enterprise documentation set.
The CDH version that you select also affects the SDX namespace you can use with the cluster:
• CDH 6.1. You can use a CDH 6.1 cluster only with a configured SDX namespace that points to version 6.1 of the Hive metastore and Sentry databases.
• CDH 5.x. You can use a CDH 5.x cluster only with a configured SDX namespace that points to version 5.x of the Hive metastore and Sentry databases.
The Cloudera Navigator integration option is not available for Altus Data Engineering clusters with CDH 5.11 or CDH 6.x.

Environment
Name of the Altus environment that describes the resources to be used for the cluster. The Altus environment specifies the network and instance settings for the cluster. If a lock icon appears next to the environment name, clusters that you create using this environment are secure. If you do not know which Altus environment to select, check with your Altus administrator.

5. In the Node Configuration section, specify the number of workers to create and the instance type to use for the cluster.

Worker
The worker nodes in a cluster can run data storage and computational processes. For more information, see Worker Nodes. You can configure the following properties for the worker nodes:
• Instance Type. Select the instance type from the list of supported instance types. Default: m4.xlarge (16.0 GB, 4 vCPUs).
  Note: The master node, worker nodes, and compute worker nodes use the same instance type. If you modify the instance type for the worker node, Altus configures the master node and compute worker nodes to use the same instance type.
• Number of Nodes. Select the number of worker nodes to include in the cluster. A cluster must have a minimum of 3 worker nodes. Default: 5.
  Note: An Altus cluster can have a total of 50 worker and compute worker nodes.
• EBS Storage. In the EBS Volume Configuration window, configure the following properties for the EBS volume:
  - Storage Type. Select the EBS volume type best suited for the job you want to run.
  - Storage Size. Set the storage size of the EBS volume expressed in gibibytes (GiB).
  - Volumes per Instance. Set the number of EBS volumes for each instance in the worker node. All EBS volumes are configured with the same volume size and type.
  If you do not configure the EBS volumes, Altus sets the optimum configuration for the EBS volumes based on the service type and instance type. For more information about Amazon EBS, see Amazon EBS Product Details on the AWS website.
• Purchasing Option. By default, the worker nodes use On-Demand instances. You cannot modify the worker nodes to use Spot instances.

Compute Worker
In addition to the worker nodes, an Altus cluster can have compute worker nodes. Compute worker nodes run only computational processes. For more information, see Worker Nodes. You can configure the following properties for the compute worker nodes:
• Instance Type. You cannot directly modify the instance type for a compute worker node.
  Note: The master node, worker nodes, and compute worker nodes use the same instance type. If you modify the instance type for the worker node, Altus configures the master node and compute worker nodes to use the same instance type.
• Number of Nodes. Select the number of compute worker nodes to include in the cluster. Default: 0.
  Note: An Altus cluster can have a total of 50 worker and compute worker nodes.
• EBS Storage. In the EBS Volume Configuration window, configure the following properties for the EBS volume:
  - Storage Type. Select the EBS volume type best suited for the job you want to run.
  - Storage Size. Set the storage size of the EBS volume expressed in gibibytes (GiB).
  - Volumes per Instance. Set the number of EBS volumes for each instance in the compute worker node. All EBS volumes are configured with the same volume size and type.
  If you do not configure the EBS volumes, Altus sets the optimum configuration for the EBS volumes based on the service type and instance type. For more information about Amazon EBS, see Amazon EBS Product Details on the AWS website.
• Purchasing Option. Select whether to use On-Demand instances or Spot instances. If you use Spot instances, you must specify the spot price. For more information about using Spot instances for compute worker nodes, see Spot Instances.

Master
Altus configures the master node for the cluster. You cannot modify the master node configuration. By default, Altus sets the following configuration for the master node:
• Instance Type: m4.xlarge (16.0 GB, 4 vCPUs)
  Note: The master node, worker nodes, and compute worker nodes use the same instance type. If you modify the instance type for the worker node, Altus configures the master and compute worker nodes to use the same instance type.
• Number of Nodes: 1
• EBS Storage: Altus sets the optimum configuration for the master node based on the service type and instance type.
• Purchasing Option: On-Demand instance

Cloudera Manager
Altus configures the Cloudera Manager instance for the cluster. You cannot modify the Cloudera Manager instance configuration. By default, Altus sets the following configuration for the Cloudera Manager instance:
• Instance Type: c4.2xlarge (15 GB, 8 vCPUs)
• Number of Nodes: 1
• EBS Storage: Altus sets the optimum configuration for the Cloudera Manager node based on the service type and instance type.
• Purchasing Option: On-Demand instance

6. In the Credentials section, provide the credentials for the user account to log in to Cloudera Manager.

Public SSH Key
You use an SSH key to access instances in the cluster that you are creating. You can provide a public key that Altus will add to the authorized_keys file on each node in the cluster. To connect to the cluster through SSH, use the private key that corresponds to the public key. Select File Upload to upload a file that contains the public key or select Direct Input to enter the full key code. If you select Skip and you do not provide an SSH public key, you cannot access the cluster through SSH or access the Cloudera Manager instance through a SOCKS proxy. For more information about connecting to Altus clusters through SSH, see SSH Connection.

Cloudera Manager Access
Altus creates a read-only user account that you can use to access the Cloudera Manager instance in the cluster. You can allow Altus to generate the user name and password for the user account or you can specify the user name and password for the account.
To allow Altus to generate the credentials, select Auto-generate. After you click Create Cluster, Altus displays a window with the user name and password for the Cloudera Manager instance. Save the credentials before you close the window.
To specify the user credentials, click Customize. Specify the user name and password for the user account and then confirm the password. Take note of the user name and password that you specify for the Cloudera Manager user account.
Note: For security reasons, you cannot view the credentials for the Cloudera Manager user account after the cluster is created. You must save the credentials for reference before you complete the cluster creation process. Altus does not provide a way to retrieve the credentials after you complete the process.
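After the cluster is created, you can open an SSH session to a cluster node with the private key that matches the public key you provided. A generic sketch: the key path and IP address are placeholders, and the altus login user shown here is an assumption, not a documented value:

ssh -i ~/.ssh/my-altus-key.pem altus@203.0.113.10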


7. In the Advanced Settings section, set the following optional properties:

Instance bootstrap script
Bootstrap script that is executed on all the cluster instances immediately after start-up, before any service is configured and started. You can use the bootstrap script to install additional OS packages or application dependencies. You cannot use the bootstrap script to change the cluster configuration. Select File Upload to upload a script file or select Direct Input to type the script on the screen. The bootstrap script must be a local file. It can be in any executable format, such as a Bash shell script or Python script. The size of the script cannot be larger than 4096 bytes.

Resource Tags
Tags that you define and that you want Altus to append to the cluster that you are creating. Altus appends the tags you define to the nodes and resources associated with the cluster. You create the tag as a name-value pair. Click + to add a tag name and set the value for that tag. Click - to delete a tag from the list. By default, Altus appends tags to the cluster instance to make it easy to identify nodes in a cluster. When you define tags for the cluster, Altus adds your tags in addition to the default tags. For more information about the tags that Altus appends to the cluster, see Altus Tags.

8. Verify that all required fields are set and click Create Cluster.

The Altus Data Engineering service creates a CDH cluster with the configuration you set. On the Clusters page, the new cluster displays at the top of the list of clusters.

Creating a Data Engineering Cluster for Azure

To create an Altus Data Engineering cluster on the console for Azure:

1. Sign in to the Cloudera Altus console:

https://console.altus.cloudera.com/

2. On the side navigation panel, click Clusters.

By default, the Clusters page displays the list of all the Altus Data Engineering clusters in your Altus account. The cloud icon next to the cluster name indicates the cloud service provider for the cluster. You can filter the list by environment and status. You can also search for clusters by name.

3. Click Create Cluster.
4. In the General Information section, specify the following information:

Cluster Name
The name to identify the cluster that you are creating. The cluster name is an alphanumeric string of any length. It can include dashes (-) and underscores (_). It cannot include a space.

Service Type
Indicates the service to be installed on the cluster. Select the service based on the types of jobs you plan to run on the cluster. You can select from the following service types:
• Hive
• Hive on Spark
  Dynamic partition pruning (DPP) is enabled for Hive on Spark by default. For details, see Dynamic Partition Pruning for Hive Map Joins in the Cloudera Enterprise documentation set.
• Spark 2.x
  Altus supports Spark 2.2 in clusters with CDH 5.x and Spark 2.4 in clusters with CDH 6.1.
• Spark 1.6
  Select Spark 1.6 only if your application specifically requires Spark version 1.6.
• MapReduce2
• Multi
  A cluster with service type Multi allows you to run different types of jobs. You can run the following types of jobs in a Multi cluster: Spark 2.x, Hive, and MapReduce2.
Note: If you are creating the cluster for a specific job or job group, the list includes only service types that can handle the job or group of jobs that you want to run.

CDH Version
The CDH version that the cluster will use. Altus supports CDH 5.14, CDH 5.15, and CDH 6.1. The CDH version that you select affects how you use the cluster:
• CDH 6.1
  - You can use a CDH 6.1 cluster only with a configured SDX namespace that points to version 6.1 of the Hive metastore and Sentry databases.
  - For clusters with CDH 6.1, Altus archives logs to ADLS Gen1 or Gen2, based on the folder you specify.
• CDH 5.x
  - You can use a CDH 5.x cluster only with a configured SDX namespace that points to version 5.x of the Hive metastore and Sentry databases.
  - For clusters with CDH 5.x, Altus archives logs to ADLS Gen1.

Environment
Name of the Altus environment that describes the resources to be used for the cluster. The Altus environment specifies the network and instance settings for the cluster. If you do not know which Altus environment to select, check with your Altus administrator.

5. In the Node Configuration section, specify the configuration of the nodes in the cluster.

Worker
The worker nodes in a cluster can run data storage and computational processes. You can configure the following properties for the worker nodes:
• Instance Type. Select the instance type to use for the worker nodes in the cluster. You can use one of the following instance types:
  - Standard_D4S_v3 (16 GiB, 4 vCPUs)
  - Standard_D8S_v3 (32 GiB, 8 vCPUs)
  - Standard_D16S_v3 (64 GiB, 16 vCPUs)
  - Standard_D32S_v3 (128 GiB, 32 vCPUs)
  - Standard_D64S_v3 (256 GiB, 64 vCPUs)
  - Standard_DS12_v2 (28 GiB, 4 vCPUs)
  - Standard_DS13_v2 (56 GiB, 8 vCPUs)
  - Standard_DS14_v2 (112 GiB, 16 vCPUs)
  - Standard_DS15_v2 (140 GiB, 20 vCPUs)
  - Standard_E4S_v3 (32 GiB, 4 vCPUs)
  - Standard_E8S_v3 (64 GiB, 8 vCPUs)
  - Standard_E16S_v3 (128 GiB, 16 vCPUs)
  - Standard_E32S_v3 (256 GiB, 32 vCPUs)
  - Standard_E64S_v3 (432 GiB, 64 vCPUs)
  Altus uses the same instance type for all the worker nodes in the cluster.
• Number of Nodes. Select the number of worker nodes to include in the cluster. A cluster must have a minimum of 3 worker nodes. Default: 5.
  Note: An Altus cluster can have a total of 50 worker nodes.
• Disk Configuration. In the Disk Configuration window, configure the following properties for the disks:
  - Storage Type. Select the storage type best suited for the job you want to run, premium or standard.
  - Storage Size. Set the storage size of the disk expressed in gibibytes (GiB).
  - Disks per Instance. Set the number of disks for each instance in the worker node.
  If you do not change the disk configuration, Altus sets the optimum configuration for the disks based on the service type and instance type. For more information about Azure Managed Disks, see Managed Disks on the Azure website.

Master
Altus configures the master node for the cluster. You cannot modify the master node configuration. By default, Altus sets the following configuration for the master node:
• Instance Type: Standard_DS12_v2 (28 GiB, 4 vCPUs)
  Note: The master node and worker nodes use the same instance type. If you modify the instance type for the worker node, Altus configures the master node to use the same instance type as the worker node.
• Number of Nodes: 1
• Disk Configuration: Altus sets the optimum configuration for the master node based on the service type and instance type.

Cloudera Manager
Altus configures the Cloudera Manager node for the cluster. You cannot modify the Cloudera Manager node configuration. By default, Altus sets the following configuration for the Cloudera Manager node:
• Instance Type: Standard_DS12_v2 (28 GiB, 4 vCPUs)
• Number of Nodes: 1
• Disk Configuration: Altus sets the optimum configuration for the Cloudera Manager node based on the service type and instance type.

6. In the Credentials section, provide the credentials for the user account to log in to Cloudera Manager.

Public SSH Key
You use an SSH key to access instances in the cluster that you are creating. You can provide a public key that Altus will add to the authorized_keys file on each node in the cluster. To connect to the cluster through SSH, use the private key that corresponds to the public key. Select File Upload to upload a file that contains the public key or select Direct Input to enter the full key code. If you select Skip and you do not provide an SSH public key, you cannot access the cluster through SSH or access the Cloudera Manager instance through a SOCKS proxy. For more information about connecting to Altus clusters through SSH, see SSH Connection.

Cloudera Manager Access
Altus creates a read-only user account that you can use to access the Cloudera Manager instance in the cluster. You can allow Altus to generate the user name and password for the user account or you can specify the user name and password for the account.
To allow Altus to generate the credentials, select Auto-generate. After you click Create Cluster, Altus displays a window with the user name and password for the Cloudera Manager instance. Save the credentials before you close the window.
To specify the user credentials, click Customize. Specify the user name and password for the user account and then confirm the password. Take note of the user name and password that you specify for the Cloudera Manager user account.
Note: For security reasons, you cannot view the credentials for the Cloudera Manager user account after the cluster is created. You must save the credentials for reference before you complete the cluster creation process. Altus does not provide a way to retrieve the credentials after you complete the process.

7. In the Advanced Settings section, set the following optional properties:

Instance bootstrap script
Bootstrap script that is executed on all the cluster instances immediately after start-up, before any service is configured and started. You can use the bootstrap script to install additional OS packages or application dependencies. You cannot use the bootstrap script to change the cluster configuration. Select File Upload to upload a script file or select Direct Input to type the script on the screen. The bootstrap script must be a local file. It can be in any executable format, such as a Bash shell script or Python script. The size of the script cannot be larger than 4096 bytes.

Resource Tags
Tags that you define and that you want Altus to append to the cluster that you are creating. Altus appends the tags you define to the nodes and resources associated with the cluster. You create the tag as a name-value pair. Click + to add a tag name and set the value for that tag. Click - to delete a tag from the list. By default, Altus appends tags to the cluster instance to make it easy to identify nodes in a cluster. When you define tags for the cluster, Altus adds your tags in addition to the default tags. For more information about the tags that Altus appends to the cluster, see Altus Tags.

8. Verify that all required fields are set and click Create Cluster.

The Altus Data Engineering service creates a CDH cluster with the configuration you set. On the Clusters page, the new cluster displays at the top of the list of clusters.

Viewing the Cluster Status

To view the status of clusters on the console:


1. Sign in to the Cloudera Altus console:

https://console.altus.cloudera.com/

2. On the side navigation panel, click Clusters.

By default, the Clusters page displays the list of all the Altus Data Engineering clusters in your Altus account. The cloud icon next to the cluster name indicates the cloud service provider for the cluster. You can filter the list by environment and status. You can also search for clusters by name.

The Clusters list shows the following information:

• Cluster name
• Status
  For more information about the different statuses that a cluster can have, see Cluster Status.
• Service type for the cluster
• Number of worker nodes
• Instance type for the cluster
• Date and time the cluster was created in Altus
• Version of CDH that runs in the cluster

3. You can click the Actions button for a cluster to perform the following tasks:

• Submit Jobs. Select this action to submit one or more jobs to run on the cluster.
• Clone Cluster. Select this action to create a cluster of the same type and characteristics as the cluster that you are viewing. On the Create Cluster page, you can create a cluster with the same properties as the cluster you are cloning. You can modify or add to the properties before you create the cluster.
• Delete Cluster. Select this action to terminate the cluster.

4. To view the details of a cluster, click the name of the cluster you want to view.

The Cluster Details page displays information about the cluster in more detail, including the list of jobs in the cluster.

Viewing the Cluster Details

To view the details of a cluster on the console:

1. Sign in to the Cloudera Altus console:

https://console.altus.cloudera.com/

2. On the side navigation panel, click Clusters.

By default, the Clusters page displays the list of all the Altus Data Engineering clusters in your Altus account. The cloud icon next to the cluster name indicates the cloud service provider for the cluster. You can filter the list by environment and status. You can also search for clusters by name.

3. Click the name of a cluster.

You can click Submit Jobs to run jobs on the cluster. Click View Jobs to go to the Jobs page and view the list of all jobs on the cluster. Clear the filter to view all jobs in the Altus account.

The details page for the selected cluster displays the status of the cluster and the following information:

Cluster Status

The details page displays information appropriate for the status of the cluster. For example, if a cluster failed at creation time, the details page displays the failure message that explains the reason for the failure, but does not display a link to the Cloudera Manager instance.


Submit Jobs and View Jobs

The Submit Jobs and View Jobs links take you to the Jobs page. You can view the jobs on the cluster or create and submit jobs to run on the cluster. For more information about the Jobs page, see Running and Monitoring Jobs on the Console.

Cloudera Manager Configuration

The Cloudera Manager Configuration section displays the instance type and connection details for the Cloudera Manager instance.

The cluster details page displays the private IP address assigned to the Cloudera Manager instance in the cluster. If the Public IPs option for the environment used to create the cluster is enabled, the page also displays the public IP addresses. You can log in to Cloudera Manager through the public or private IP. If the public IP addresses are available, you can click a link to view the Altus command to set up a SOCKS proxy server to access the Cloudera Manager instance in the cluster.

The Cloudera Manager Configuration section appears only if the Cloudera Manager instance is accessible. The Cloudera Manager instance might not be accessible when the cluster status is Creating or when the cluster failed at creation time.
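The console generates the exact proxy command for you. As a generic illustration of a SOCKS proxy over SSH (the key file and IP address are placeholders, and the altus login user is an assumption):

ssh -i ~/.ssh/my-altus-key.pem -CND 1080 altus@203.0.113.10

Your browser can then use localhost:1080 as a SOCKS v5 proxy to reach the Cloudera Manager web UI through the cluster.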

Node Configuration

The Node Configuration section displays the configuration of the nodes in the cluster.

For a cluster on AWS, the section displays the configuration of the master node, worker nodes, and any compute worker nodes that you add to the cluster. The section displays the number of nodes and their instance types, the EBS volume configuration, and the pricing option used to acquire the instances. If the cluster does not have compute worker nodes, the section displays zero for the number of compute worker nodes, but shows the default settings that the Altus Data Engineering service uses for compute worker nodes.

For a cluster on Azure, the section displays the configuration of the master node and worker nodes. The section displays the number of nodes, their instance types and storage volume configuration, and the number of disks per instance.

Cluster Details

• Log Archive Location shows where the cluster and job logs are archived.
• Termination condition shows the action that Altus takes when all jobs in the cluster complete.
• Uses instance bootstrap script? shows whether a bootstrap script runs before cluster startup.
• Security shows whether the cluster is secure or not, based on the setting for the Secure Clusters option in the environment.
• Resource Tags shows the resource tags set up for the cluster.

Service Type and other key information

• Service Type shows the service that runs in the cluster.
• Creation Time shows the time when a user created the cluster in Altus.
• Total Nodes shows the number of nodes in the cluster.
  For a cluster on AWS, the total number of nodes includes the master node, worker nodes, and compute worker nodes. The number does not include the Cloudera Manager instance. If the compute worker nodes use Spot instances, the number of compute worker nodes available might not be equivalent to the number of compute worker nodes configured for the cluster. The section shows the number of nodes available in the cluster and the total number of nodes configured for the cluster.
  For a cluster on Azure, the total number of nodes includes the number of master and worker nodes but not the Cloudera Manager instance.
  To view information about the nodes, click View. The Instances window displays the list of instances in the cluster, their instance IDs and IP addresses, and their roles in the cluster. The list of instances does not include the Cloudera Manager instance.
• Environment displays the name of the Altus environment used to create the cluster.
• Region indicates the region where the cluster is created.
• CDH Version shows the version of CDH in the cluster.
• CRN shows the Cloudera Resource Name (CRN) assigned to the cluster. Because the CRN is a long string of characters, Altus provides a copy icon so you can easily copy the CRN for any purpose.

Deleting a Cluster

To delete a cluster on the console:

1. Sign in to the Cloudera Altus console:

https://console.altus.cloudera.com/

2. On the side navigation panel, click Clusters.

By default, the Clusters page displays the list of all the Altus Data Engineering clusters in your Altus account. The cloud icon next to the cluster name indicates the cloud service provider for the cluster. You can filter the list by environment and status. You can also search for clusters by name.

3. Click the name of the cluster to terminate.

On the Cluster Details page, review the cluster information to verify that it is the cluster that you want to terminate.

Note: Before you terminate a cluster, verify that there are no jobs running on the cluster. If you terminate a cluster when a job is running, the job fails. If you terminate a cluster when a job is queued to run on it, the job is interrupted and cannot complete. You can submit the job again to run on another cluster.

4. Click Actions and select Delete Cluster.
5. Click OK to confirm that you want to terminate the cluster.

Creating and Working with Clusters Using the CLI

You can use the Cloudera Altus client to create a cluster, view the properties of a cluster, or terminate a cluster. You can use the commands listed here as examples for how to use the Cloudera Altus commands.

For more information about the commands available in the Altus client, run the following command:

altus dataeng help
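Before the dataeng commands can reach your account, the client must be configured with your Altus API access key; the tutorial exercises later in this document cover that step. As a sketch, the client follows the familiar configure pattern (treat the exact command and prompts as an assumption):

altus configure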

Creating a Cluster for AWS

You can use the following command to create a cluster:

altus dataeng create-aws-cluster \
--service-type=ServiceType \
--workers-group-size=NumberOfWorkers \
--cluster-name=ClusterName \
--instance-type=InstanceType \
--cdh-version=CDHVersion \
--public-key=FullPathAndFileNameOfPublicKeyFile \
--environment-name=AltusEnvironmentName \
--compute-workers-configuration='{"groupSize": NumberOfComputeWorkers, "useSpot": true, "bidUSDPerHr": BidPrice}'

Guidelines for using the create-aws-cluster command:

• You must specify the service to include in the cluster. In the service-type parameter, use one of the following service names to specify the service in the cluster:
  – HIVE
  – HIVE_ON_SPARK
  – SPARK
    Use this service type for Spark 2.x.
  – SPARK_16
    Use this service type only if your application specifically requires Spark version 1.6. If you specify SPARK_16 in the service-type parameter, you must specify CDH511 in the cdh-version parameter.
  – MR2
  – MULTI
    A cluster with service type Multi allows you to run different types of jobs. You can run the following types of jobs in a Multi cluster: Spark 2.x, Hive, and MapReduce2.

• You must specify the version of CDH to include in the cluster. In the cdh-version parameter, use one of the following version names to specify the CDH version:
  – CDH61
  – CDH515
  – CDH514
  – CDH513
  – CDH512
  – CDH511

• The CDH version that you specify can affect the service that runs on the cluster:
  – Spark 2.x or Spark 1.6: For a Spark service type, you must select the CDH version that supports the selected Spark version. Altus supports the following combinations of CDH and Spark versions:
    - CDH 6.1 with Spark 2.4
    - CDH 5.12 or later 5.x versions with Spark 2.2
    - CDH 5.11 with Spark 2.1 or Spark 1.6
  – Hive on Spark: On CDH version 5.13 or later, dynamic partition pruning (DPP) is enabled for Hive on Spark by default. For details, see Dynamic Partition Pruning for Hive Map Joins in the Cloudera Enterprise documentation set.
• The CDH version that you specify also affects the SDX namespace you can use with the cluster:
  – CDH 6.1: You can use a CDH 6.1 cluster only with a configured SDX namespace that points to version 6.1 of the Hive metastore and Sentry databases.
  – CDH 5.x: You can use a CDH 5.x cluster only with a configured SDX namespace that points to version 5.x of the Hive metastore and Sentry databases.

• The public-key parameter requires the full path and file name of a .pub file prefixed with file://. For example: --public-key=file:///my/file/path/to/ssh/publickey.pub
  Altus adds the public key to the authorized_keys file on each node in the cluster. If you do not have a key pair yet, see the key-generation sketch after this list.

• You can use the cloudera-manager-username and cloudera-manager-password parameters to set the Cloudera Manager credentials. If you do not provide a username and password, the Data Engineering service generates a guest username and password for the Cloudera Manager user account.

• The compute-workers-configuration parameter is optional. It adds compute worker nodes to the cluster in addition to worker nodes. Compute worker nodes run only computational processes. If you do not set the configuration for the compute workers, Altus creates a cluster with no compute worker nodes. An example value appears after this list.


• The response object for the create-aws-cluster command contains the credentials for the read-only account for the Cloudera Manager instance in the cluster. You must note down the credentials from this response since the credentials are not made available again.
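If you do not already have an SSH key pair, you can generate one with standard OpenSSH tooling. A generic example; the file name is arbitrary:

ssh-keygen -t rsa -b 2048 -f ~/.ssh/altus-cluster-key

This produces the private key ~/.ssh/altus-cluster-key and the public key ~/.ssh/altus-cluster-key.pub; pass the .pub file to the public-key parameter.

For the compute-workers-configuration parameter, the following value (made-up numbers, using the groupSize, useSpot, and bidUSDPerHr fields from the command template above) requests 10 compute worker nodes on Spot instances with a bid price of 0.25 USD per hour:

--compute-workers-configuration='{"groupSize": 10, "useSpot": true, "bidUSDPerHr": 0.25}'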

Example: Creating a Cluster in AWS for a PySpark Job

This example shows how to create a cluster with a bootstrap script and run a PySpark job on the cluster. The bootstrap script installs a custom Python environment in which to run the job. The Python script file is available in the Cloudera Altus S3 bucket of job examples.

The following command creates a cluster with a bootstrap script and runs a job to implement an alternating least squares (ALS) algorithm.

altus dataeng create-aws-cluster \
--environment-name=EnvironmentName \
--service-type=SPARK \
--workers-group-size=3 \
--cluster-name=ClusterName \
--instance-type=m4.xlarge \
--cdh-version=CDH512 \
--public-key YourPublicSSHKey \
--instance-bootstrap-script='file:///PathToScript/bootstrapScript.sh' \
--jobs '{
  "name": "PySpark ALS Job",
  "pySparkJob": {
    "mainPy": "s3a://cloudera-altus-data-engineering-samples/pyspark/als/als2.py",
    "sparkArguments" : "--executor-memory 1G --num-executors 2 --conf spark.pyspark.python=/tmp/pyspark-env/bin/python"
  }
}'

The bootstrapScript.sh in this example creates a Python environment using the default Python version shipped with Altus and installs the NumPy package. It has the following content:

#!/bin/bash

target="/tmp/pyspark-env"
mypip="${target}/bin/pip"
echo "Provisioning pyspark environment ..."

virtualenv ${target}
${mypip} install numpy

if [ $? -eq 0 ]; then
  echo "Successfully installed new python environment at ${target}"
else
  echo "Failed to install custom python environment at ${target}"
fi

Creating a Cluster for Azure

You can use the following command to create a cluster:

altus dataeng create-azure-cluster \
--service-type=ServiceType \
--workers-group-size=NumberOfWorkers \
--cluster-name=ClusterName \
--instance-type=InstanceType \
--cdh-version=CDHVersion \
--public-key=FullPathAndFileNameOfPublicKeyFile \
--environment-name=AltusEnvironmentName

Guidelines for using the create-azure-cluster command:

• You must specify the service to include in the cluster. In the service-type parameter, use one of the following service names to specify the service in the cluster:

– HIVE


– HIVE_ON_SPARK
– SPARK

Use this service type for Spark 2.x.

– SPARK_16

Use this service type only if your application specifically requires Spark version 1.6.

– MR2
– MULTI

A cluster with service type Multi allows you to run different types of jobs. You can run the following types of jobs in a Multi cluster: Spark 2.x, Hive, and MapReduce2.

• Altus supports CDH 5.14, CDH 5.15, and CDH 6.1.

Specify one of the following values for the cdh-version parameter: CDH514, CDH515, or CDH61.

• The CDH version that you select affects how you use the cluster:

CDH 6.1

• You can use a CDH 6.1 cluster only with a configured SDX namespace that points to version 6.1 of the Hive metastore and Sentry databases.

• For clusters with CDH 6.1, Altus archives logs to ADLS Gen1 or Gen2, based on the folder you specify.

CDH 5.x

• You can use a CDH 5.x cluster only with a configured SDX namespace that points to version 5.x of the Hive metastore and Sentry databases.

• For clusters with CDH 5.x, Altus archives logs to ADLS Gen1.

• The public-key parameter requires the full path and file name of a .pub file prefixed with file://. For example: --public-key=file:///my/file/path/to/ssh/publickey.pub

• You can use the cloudera-manager-username and cloudera-manager-password parameters to set the Cloudera Manager credentials. If you do not provide a username and password, the Altus Data Engineering service generates a guest username and password for the Cloudera Manager user account.

• The response object for the create-azure-cluster command contains the credentials for the read-only account for the Cloudera Manager instance in the cluster. You must note the credentials from this response since the credentials are not made available again.
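The following sketch combines these guidelines into one command. It creates a Hive cluster and sets explicit Cloudera Manager credentials. The cluster name, environment name, key path, and credential values are placeholders to replace with your own values:

altus dataeng create-azure-cluster \
--service-type=HIVE \
--workers-group-size=3 \
--cluster-name=my-hive-cluster \
--instance-type=STANDARD_DS12_V2 \
--cdh-version=CDH61 \
--public-key=file:///my/file/path/to/ssh/publickey.pub \
--environment-name=MyAltusEnvironment \
--cloudera-manager-username=MyCMUser \
--cloudera-manager-password=MyCMPassword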

Example: Creating a Cluster in Azure for a PySpark Job

This example shows how to create a cluster with a bootstrap script and run a PySpark job on the cluster. The bootstrap script installs a custom Python environment in which to run the job.

Cloudera provides the job example files and input files that you need to run the jobs. To use the following example, set up an Azure Data Lake Store account with permissions to allow read and write access when you run the Altus jobs. Then run the script that Altus provides to upload the files to the ADLS account so the job files and data files are available for your use. For instructions on uploading the jar file, see Sample Files Upload on page 66.

The following command creates a cluster with a bootstrap script and runs a job to implement an alternating least squares (ALS) algorithm.

altus dataeng create-azure-cluster \
--environment-name=EnvironmentName \
--service-type=SPARK \
--workers-group-size=3 \
--cluster-name=ClusterName \
--instance-type=STANDARD_DS12_V2 \
--cdh-version=CDH514 \
--public-key YourPublicSSHKey \
--instance-bootstrap-script='file:///PathToScript/bootstrapScript.sh' \
--jobs '{
  "name": "PySpark ALS Job",
  "pySparkJob": {
    "mainPy": "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/pyspark/als/als2.py",
    "sparkArguments" : "--executor-memory 1G --num-executors 2 --conf spark.pyspark.python=/tmp/pyspark-env/bin/python"
  }
}'

The bootstrapScript.sh in this example creates a Python environment using the default Python version shipped with Altus and installs the NumPy package. It has the following content:

#!/bin/bash

target="/tmp/pyspark-env"
mypip="${target}/bin/pip"
echo "Provisioning pyspark environment ..."

virtualenv ${target}
${mypip} install numpy

if [ $? -eq 0 ]; then
  echo "Successfully installed new python environment at ${target}"
else
  echo "Failed to install custom python environment at ${target}"
fi

Viewing the Cluster Status

When you create a cluster, you can immediately check its status. If the cluster creation process is not yet complete, you can view information regarding the progress of cluster creation.

You can use the following command to display the status of a cluster and other information:

altus dataeng describe-cluster --cluster-name=ClusterName

cluster-name is a required parameter.
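If you script cluster creation, you can poll the status until provisioning completes. The following bash sketch is illustrative only: the name of the status field in the describe-cluster JSON response and the terminal status values are assumptions, so inspect the actual response in your environment and adjust the pattern:

#!/bin/bash
# Hedged sketch: poll a cluster's status once a minute.
# The "status" field name and the CREATED/FAILED values are assumptions;
# check the real describe-cluster output before relying on them.
cluster="my-cluster"   # placeholder cluster name
while true; do
  status=$(altus dataeng describe-cluster --cluster-name="${cluster}" \
    | grep -o '"status": *"[A-Z_]*"')
  echo "$(date): ${status}"
  case "${status}" in
    *CREATED*|*FAILED*) break ;;
  esac
  sleep 60
done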

Deleting a Cluster

You can use the following command to delete a cluster:

altus dataeng delete-cluster --cluster-name=ClusterName

Connecting to a Cluster

You can access a cluster created in Altus in the same way that you access other CDH clusters. You can use SSH to connect to a service port in the cluster. If you use SSH, you might need to modify the security group in your cloud service provider to allow an SSH connection to your instances from the public Cloudera IP addresses. You can create clusters with public IP addresses using an environment that has the Public IPs option enabled.

You can use the Altus client to set up a SOCKS proxy server to access the Cloudera Manager instance in the cluster.

For more information about setting up an SSH connection to the cluster, see SSH Connection. For more information about using the CLI to set up a SOCKS proxy, see SOCKS Proxy.
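Because Altus adds your public key to the authorized_keys file on each cluster node, a plain SSH connection is usually sufficient once the security group allows it. The following one-liner is a sketch: the login user name and the node address depend on your cluster image and environment, so both are assumptions here:

# Hedged sketch: connect to a cluster node with the private key that matches
# the public key supplied at cluster creation. The user name ("centos") and
# the node address are placeholders; substitute values for your cluster.
ssh -i /my/file/path/to/ssh/privatekey centos@node-public-ip.example.com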


Clusters on AWS

If you create clusters on AWS, you can take advantage of the EC2 Spot instances that AWS offers at a discount. You can add compute worker nodes to your clusters and configure them to use Spot instances.

Worker Nodes

An Altus cluster on AWS can have the following types of worker nodes:

Worker node

A worker node runs both data storage and computational processes. Altus requires a minimum of three worker nodes in a cluster.

Compute worker node

A compute worker node is a type of worker node in an Altus cluster that runs only computational processes. It does not run data storage processes.

Altus does not require compute worker nodes in a cluster. You can configure compute worker nodes for a cluster to add compute power and improve cluster performance.

Compute worker nodes are stateless. They can be terminated and restarted without risking job execution.

A cluster can have a total of 50 worker and compute worker nodes. You determine the combination of worker and compute worker nodes that provides the best performance for your workload. The worker nodes and compute worker nodes use the same instance type.

If you add compute worker nodes to a cluster, Altus manages the provisioning of new instances to replace terminated or failed worker and compute worker instances in a cluster. For more information about reprovisioning cluster instances, see Instance Reprovisioning on page 26.

All compute worker nodes in a cluster use the same instance pricing. You can configure the compute worker nodes to use On-Demand instances or Spot instances. For more information about using Spot instances for compute worker nodes, see Spot Instances on page 25.

Spot Instances

A Spot instance is an EC2 instance for which the hourly price fluctuates based on demand. The hourly price for a Spot instance is typically much lower than the hourly price of an On-Demand instance. However, you do not have control over when Spot instances are available for your cluster. When you bid a price on Spot instances, your Spot instances run only when your bid price is higher than the current market price and terminate when your bid price becomes lower than the market price.

If an increase in the number of nodes in your cluster can improve job performance, you might want to use Spot instances for compute worker nodes in your cluster. To ensure that jobs continue to run when Spot instances are terminated, Altus allows you to use Spot instances only for compute worker nodes. Compute worker nodes are stateless and can be terminated and restarted without risking job execution.

Altus manages the use of Spot instances in a cluster. When a Spot instance with a running job terminates, Altus attempts to provision a new instance every 15 minutes. Altus uses the new instance to accelerate the running job.

Use the following guidelines when deciding to use Spot instances for compute worker nodes in a cluster:

• You can use Spot instances only for compute worker nodes. You cannot use Spot instances for worker nodes.

You can configure compute worker nodes to use On-Demand or Spot instances. If you configure compute worker nodes to use Spot instances, and no Spot instances are available, jobs run on the worker nodes.

To ensure that worker nodes are available in a cluster to run the processes required to complete a job, worker nodes must use On-Demand instances. You cannot configure worker nodes to use Spot instances.

• Set your bid price for Spot instances high enough to have a good chance of exceeding market price.


Generally, a bid price that is 75% of the On-Demand instance price is a good convention to follow. As you use Spot instances more, you can develop a better standard for setting a bid price that is reasonable but still has a good chance of exceeding market price.
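As a worked example of the 75% convention, assume a hypothetical On-Demand price of $0.20 per hour for the cluster's instance type. A bid of 0.75 x $0.20 = $0.15 per hour would then go into the compute worker configuration described earlier:

--compute-workers-configuration='{"groupSize": 10, "useSpot": true, "bidUSDPerHr": 0.15}'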

• Use fewer On-Demand instances than required and offset the shortfall with a larger number of Spot instances.

For example, suppose you know that a job must run on a cluster with 10 On-Demand instances to meet a service level agreement. You can use 5 On-Demand instances and 15 Spot instances to increase the number of instances on which the job runs at the same or lower cost.

With this strategy, most of the job processes run on the cheaper instances, which is a cost-effective way to meet the SLA.

For more information about AWS Spot instances, see Spot Instances on the AWS console.

Instance Reprovisioning

By default, if you add compute worker nodes to a cluster, Altus manages the provisioning of new instances to replace terminated or failed instances in the cluster.

Altus periodically attempts to replace failed or terminated worker nodes and compute worker nodes in the cluster. When an instance fails or terminates, Altus attempts to provision a new instance every 15 minutes.

Altus provisions new instances of worker nodes and compute worker nodes in the following manner:

• Altus provisions On-Demand instances to replace failed or terminated worker nodes and maintain the number of worker nodes configured for the cluster.

• If compute worker nodes are configured to use On-Demand instances, Altus provisions On-Demand instances to replace failed or terminated compute worker nodes and maintain the number of compute worker nodes configured for the cluster.

• If compute worker nodes are configured to use Spot instances, Altus provisions Spot instances to replace failed or terminated compute worker nodes and, as much as possible, maintain the number of compute worker nodes configured for the cluster. Depending on the availability of Spot instances, the number of compute worker nodes might not always match the number of compute worker nodes configured for the cluster.

Note: Altus cannot provision new instances to replace a terminated or failed master node or Cloudera Manager instance. When a master node or Cloudera Manager instance fails, the cluster fails.

System Volume

By default, when you create an Altus cluster for AWS, each node in the cluster includes a root device volume. In addition, Altus attaches an EBS volume to the node to store data generated by the cluster.

The EBS volume that Altus adds to the node is a system volume meant to hold logs and other data generated by Altus services and systems. Although Altus manages it, the system volume counts as a volume that you pay for in your instance. The system volume is deleted when the cluster is terminated.

Altus configures the cluster so that sensitive information is not written to the root volume, but to the system volume. When you enable the secure cluster option for Altus clusters, Altus encrypts the system volume and the EBS volumes that you configure for the cluster. Altus does not need to encrypt the root device volume since it does not contain sensitive data.


Altus Data Engineering Jobs

You can use the Cloudera Altus console or the command-line interface to run and monitor jobs. When you submit a job, configure it to run on a cluster that contains the service you require to run the job.

The following list describes the services available in Altus clusters and the types of jobs you can run with each service:

• Hive: Hive jobs
• Hive on Spark: Hive jobs
• Spark 2.x: Spark or PySpark jobs
• Spark 1.6: Spark or PySpark jobs
• MapReduce2: MapReduce2 jobs
• Multi: Hive, Spark or PySpark, and MapReduce2 jobs. The Multi service cluster supports Spark 2.x; it does not support Spark 1.6.

Altus creates a job queue for each cluster. When you submit a job, Altus adds the job to the queue of the cluster on which you configure the job to run. For more information about the job queue, see Job Queue on page 28.

Altus generates a job ID for every job that you submit. If you submit a group of jobs, Altus generates a group ID. If you do not specify a job name or a group name, Altus sets the job name to the job ID and the group name to the group ID.

If the Altus environment has Workload Analytics enabled, you can view performance information for a job after it ends, including health checks, baselines, and other execution information. Use this information to analyze a job's current performance and compare it to past runs of the same job.

Job Status

A job periodically changes status from the time that you submit it until the time it completes or is terminated. A user action or the configuration of the job or the cluster on which it runs can affect the status of the job.

A data engineering job can have the following statuses:

• Queued. The job is queued to run on the selected cluster.
• Submitting. The job is being added to the job queue.
• Running. The job is in progress.
• Interrupted. A job is set to Interrupted status in the following situations:

– If you create a cluster when you submit a job and the cluster is not successfully created, the job status is set to Interrupted. You can create a cluster and rerun the job on the new cluster.

– If the job is queued to run on a cluster but the cluster is deleted, the job status is set to Interrupted. You can rerun the job on another cluster.

– If the job does not run because a previous job in the queue has the Action on Failure option set to Interrupt Job Queue, the job status is set to Interrupted.

• Completed. The job completed successfully.
• Terminating. You have initiated termination of the job and the job is in the process of being terminated.
• Terminated. The job termination process is complete.
• Failed. The job did not complete.


Job Queue

When you create a cluster, Altus sets up a job queue for the jobs submitted to the cluster. Altus sets up one job queue for each cluster and adds all jobs that are submitted to a cluster to the same job queue.

Altus runs the jobs in the queue in the sequence that the job requests are received. Whether you submit single jobs or groups of jobs to the cluster, Altus runs the jobs sequentially in the order that each job request is received.

You can configure the following options to manage how jobs run in the queue:

• Job Failure Action

When you submit a job, you can specify the action that Altus takes when a job fails. You can use the job failure action to specify whether Altus runs the jobs in the queue following a failed job.

This option is useful for handling job dependencies. If a job must complete before the next job can run, you can set the option to interrupt the job queue so that, if the job fails, Altus does not run the rest of the jobs in the queue.

If a job failure does not affect subsequent jobs in the queue, you can set the option to an action of NONE so that, when a job fails, Altus continues to run the subsequent jobs in the queue.

If you do not specify any action, Altus sets the option to interrupt the job queue by default.

The following shows the option on the console and the parameter in the CLI that you can use to specify the action Altus takes when a job fails:

Console option: Action on Failure

When you submit a job, you can set the Action on Failure option to one of the following actions:

• None. When a job fails, Altus continues with job execution and performs no special action. Altus runs the next job in the queue.
• Interrupt Job Queue. When the job fails, Altus does not run any of the subsequent jobs in the queue. The jobs that do not run after a job fails are set to a status of Interrupted.

CLI parameter: failureAction

If you use the CLI to submit a job, you can set the failureAction parameter to one of the following actions:

• NONE. When a job fails, Altus continues with job execution and performs no special action. Altus runs the next job in the queue.
• INTERRUPT_JOB_QUEUE. When the job fails, Altus does not run any of the subsequent jobs in the queue. The jobs that do not run after a job fails are set to a status of Interrupted.
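As a sketch of the CLI parameter in practice, the following submit-jobs call sets failureAction on a job. The placement of failureAction inside the job object is an assumption based on the parameter description above; run altus dataeng help to confirm the exact request shape:

altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{
  "name": "nightly-etl",
  "failureAction": "INTERRUPT_JOB_QUEUE",
  "hiveJob": { "script": "file:///path/to/hiveScript.hql" }
}'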

• Cluster termination after all jobs are processed

When you create a cluster, you can configure how Altus handles a cluster when all jobs sent to the cluster are processed and the job queue becomes empty.

The following shows the option on the console and the parameter in the CLI that you can use to specify the condition by which Altus terminates a cluster:

Console option: Terminate cluster once jobs complete

When you submit a job and you create a cluster on which to run the job, you can enable the Terminate cluster once jobs complete option to terminate the cluster when the job queue is empty.

If you do not enable the option, Altus does not terminate the cluster when the job queue is empty. You must manually terminate the cluster if you do not plan to submit jobs to the cluster again.

CLI parameter: --automatic-termination-condition

If you use the CLI to create a cluster, you can use the --automatic-termination-condition parameter to specify whether to terminate the cluster when the job queue is empty. You can set the parameter to one of the following conditions:

• NONE. When the job queue is empty, Altus does not terminate the cluster.
• EMPTY_JOB_QUEUE. When all jobs in the queue are processed and the queue is empty, Altus terminates the cluster.

If you set the option to terminate the cluster, you must include the --jobs parameter and submit at least one job to the cluster.

The --automatic-termination-condition parameter is optional. If you do not include the parameter, Altus does not terminate the cluster when the job queue is empty.

Running and Monitoring Jobs on the Console

You can submit a single job or a group of jobs on the Cloudera Altus console. When you view the list of jobs, you can file a support ticket with Cloudera for any job that has issues with which you require help.

Submitting a Job on the Console

You can submit a job to run on an existing cluster or create a cluster specifically for the job.

To submit a job on the console:

1. Sign in to the Cloudera Altus console:

https://console.altus.cloudera.com/

2. On the side navigation panel, click Jobs.

By default, the Jobs page displays the list of all the Altus Data Engineering jobs in your Altus account. You can filter the list of jobs by Altus environment, the cluster on which the jobs run, or the time frame when the jobs run. You can also filter by the user who submitted the job and the job type and status.

3. Click Submit Jobs.
4. On the Job Settings page, select Single job.
5. Select the type of job you want to submit.

You can select from the following types of jobs:

• Hive
• MapReduce2
• PySpark
• Spark

6. Enter the job name.

The job name is optional. If you do not specify a name, Altus sets the job name to be the same as the job ID.

7. Specify the properties for the job based on the job type.

Hive Job Properties

Script

Required. The Hive script to execute. Select one of the following sources for the Hive script:

• Script Path. Specify the path and file name of the file that contains the script.
• File Upload. Upload a file that contains the script.
• Direct Input. Type in the script.

The Hive script can include parameters. Use the format ${Variable_Name} for the parameter. If the script contains parameters, you must specify the variable name and value for each parameter in the Hive Script Parameters field.

Hive Script Parameters

Required if the Hive script includes variables. Select the option and provide the definition of the variables used as parameters in the Hive script. You must define the value of all variables that you use in the script.

Click + to add a variable to the list. Click - to delete a variable from the list.

Job XML

Optional. XML document that defines the configuration settings for the job.

Select the option and provide the job configuration. Select File Upload to upload the configuration XML file or select Direct Input to type in the configuration settings.

Spark Job Properties

Main Class

Required. Main class and entry point of the Spark application.

Jars

Required. Path and file names of jar files to be added to the classpath. You can include jar files that are stored in AWS S3 or Azure ADLS cloud storage or in HDFS.

Click + to add a jar file to the list. Click - to delete a jar file from the list.

Application Arguments

Optional. Arguments to pass to the main method of the main class of the Spark application.

Click + to add an argument to the list. Click - to delete an argument from the list.

Spark Arguments

Optional. A list of Spark configuration properties for the job.

For example:

--executor-memory 4G --num-executors 50

MapReduce2 Job Properties

Main Class

Required. Main class and entry point of the MapReduce2 application.

Jars

Required. Path and file names of jar files to be added to the classpath. You can include jar files that are stored in AWS S3 or Azure ADLS cloud storage or in HDFS.

Click + to add a jar file to the list. Click - to delete a jar file from the list.

MapReduce Application Arguments

Optional. Arguments for the MapReduce2 application. The arguments are passed to the main method of the main class.

Click + to add an argument to the list. Click - to delete an argument from the list.

Java Options

Optional. A list of Java options for the JVM.

Job XML

Optional. XML document that defines the configuration settings for the job.

Select the option and provide the job configuration. Select File Upload to upload the configuration XML file or select Direct Input to type in the configuration settings.

PySpark Job Properties

Main Python File

Required. Path and file name of the main Python file for the Spark application. This is the entry point for your PySpark application. You can specify a file that is stored in cloud storage or in HDFS.

Python File Dependencies

Optional. Files required by the PySpark job, such as .zip, .egg, or .py files. Altus adds the path and file names of the files to the PYTHONPATH for Python applications. You can include files that are stored in cloud storage or in HDFS.

Click + to add a file to the list. Click - to delete a file from the list.

Application Arguments

Optional. Arguments to pass to the main method of the PySpark application.

Click + to add an argument to the list. Click - to delete an argument from the list.

Spark Arguments

Optional. A list of Spark configuration properties for the job.

For example:

--executor-memory 4G --num-executors 50

8. In Action on Failure, specify the action that Altus takes when the job fails.

Altus can perform the following actions:

• None. When a job fails, Altus runs the subsequent jobs in the queue.
• Interrupt Job Queue. When a job fails, Altus does not run any of the subsequent jobs in the queue. The jobs that do not run after a job fails are set to a status of Interrupted.

For more information about the Action on Failure option, see Job Queue on page 28.

9. In the Cluster Settings section, select the cluster on which the job will run:

• Use existing. Select from the list of clusters that is available for your use.

Altus displays only the names of clusters where the type of job you selected can run and that you have access to. The list displays the number of workers in the cluster.

• Create new. Configure and create a cluster for the job. If the cluster creation process is not yet complete when you submit the job, Altus adds it to the job queue and runs it when the cluster is created.

• Clone existing. Select the cluster on which to base the configuration of a new cluster.

10. If you create or clone a cluster, set the properties and select the options for the new cluster.

Complete the following steps:

a. To allow Altus to terminate the cluster after the job completes, select the Terminate cluster once jobs complete option.

If you create a cluster specifically for this job and you do not need the cluster after the job runs, you can have Altus terminate the cluster when the job completes. If the Terminate cluster once jobs complete option is selected, Altus terminates the cluster after the job runs, whether the job completes successfully or fails. This option is selected by default. If you do not want Altus to terminate the cluster, clear the selection.

b. You create a cluster within the Jobs page the same way that you create a cluster on the Clusters page.

To create a cluster for AWS, follow the instructions from Step 4 on page 8 to Step 7 on page 13 in Creating a Data Engineering Cluster for AWS on page 8.

To create a cluster for Azure, follow the instructions from Step 4 on page 13 to Step 7 on page 17 in Creating a Data Engineering Cluster for Azure on page 13.

11. Verify that all required fields are set and click Submit.

The Altus Data Engineering service submits the job to run on the selected cluster in your cloud provider account.

Submitting Multiple Jobs on the Console

You can group multiple jobs in one job submission. You can submit a group of jobs to run on an existing cluster or you can create a cluster specifically for the job group.


Note: When you create a job group on the console, you can only include the same type of jobs. A job group that you create on the console does not support multiple types of jobs, even if you run the job group on a Multi service cluster. Use the Altus CLI to create a job group with multiple types of jobs and run it in a Multi service cluster.

To submit a group of jobs on the console:

1. Sign in to the Cloudera Altus console:

https://console.altus.cloudera.com/

2. On the side navigation panel, click Jobs.

By default, the Jobs page displays the list of all the Altus Data Engineering jobs in your Altus account. You can filter the list of jobs by Altus environment, the cluster on which the jobs run, or the time frame when the jobs run. You can also filter by the user who submitted the job and the job type and status.

3. Click Submit Jobs.
4. On the Job Settings page, select Group of jobs.
5. Select the type of job you want to submit.

You can select from the following types of jobs:

• Hive
• MapReduce2
• PySpark
• Spark

6. Enter a name for the job group.

The job group name is optional. By default, Altus assigns an ID to the job group. If you do not specify a name, Altus sets the job group name to be the same as the job group ID.

7. Click Add <Job Type>.
8. On the Add Job window, enter the job name.

The job name is optional. By default, Altus assigns an ID to the job. If you do not specify a name, Altus sets the job name to be the same as the job ID.

9. Set the properties for the job.

Altus displays job properties based on the job type.

Hive Job Properties

Script

Required. The Hive script to execute. Select one of the following sources for the Hive script:

• Script Path. Specify the path and file name of the file that contains the script.
• File Upload. Upload a file that contains the script.
• Direct Input. Type in the script.

The Hive script can include parameters. Use the format ${Variable_Name} for the parameter. If the script contains parameters, you must specify the variable name and value for each parameter in the Hive Script Parameters field.

Hive Script Parameters

Required if the Hive script includes variables. Select the option and provide the definition of the variables used as parameters in the Hive script. You must define the value of all variables that you use in the script.

Click + to add a variable to the list. Click - to delete a variable from the list.

Job XML

Optional. XML document that defines the configuration settings for the job.

Select the option and provide the job configuration. Select File Upload to upload the configuration XML file or select Direct Input to type in the configuration settings.

Spark Job Properties

Main Class

Required. Main class and entry point of the Spark application.

Jars

Required. Path and file names of jar files to be added to the classpath. You can include jar files that are stored in AWS S3 or Azure ADLS cloud storage or in HDFS.

Click + to add a jar file to the list. Click - to delete a jar file from the list.

Application Arguments

Optional. Arguments to pass to the main method of the main class of the Spark application.

Click + to add an argument to the list. Click - to delete an argument from the list.

Spark Arguments

Optional. A list of Spark configuration properties for the job.

For example:

--executor-memory 4G --num-executors 50

MapReduce2 Job Properties

Main Class

Required. Main class and entry point of the MapReduce2 application.

Jars

Required. Path and file names of jar files to be added to the classpath. You can include jar files that are stored in AWS S3 or Azure ADLS cloud storage or in HDFS.

Click + to add a jar file to the list. Click - to delete a jar file from the list.

MapReduce Application Arguments

Optional. Arguments for the MapReduce2 application. The arguments are passed to the main method of the main class.

Click + to add an argument to the list. Click - to delete an argument from the list.

Java Options

Optional. A list of Java options for the JVM.

Job XML

Optional. XML document that defines the configuration settings for the job.

Select the option and provide the job configuration. Select File Upload to upload the configuration XML file or select Direct Input to type in the configuration settings.

PySpark Job Properties

Main Python File

Required. Path and file name of the main Python file for the Spark application. This is the entry point for your PySpark application. You can specify a file that is stored in cloud storage or in HDFS.

Python File Dependencies

Optional. Files required by the PySpark job, such as .zip, .egg, or .py files. Altus adds the path and file names of the files to the PYTHONPATH for Python applications. You can include files that are stored in cloud storage or in HDFS.

Click + to add a file to the list. Click - to delete a file from the list.

Application Arguments

Optional. Arguments to pass to the main method of the PySpark application.

Click + to add an argument to the list. Click - to delete an argument from the list.

Spark Arguments

Optional. A list of Spark configuration properties for the job.

For example:

--executor-memory 4G --num-executors 50

10. In Action on Failure, specify the action that Altus takes when a job fails.

Altus can perform the following actions:

• None. When a job fails, Altus runs the subsequent jobs in the queue.
• Interrupt Job Queue. When a job fails, Altus does not run any of the subsequent jobs in the queue. The jobs that do not run after a job fails are set to a status of Interrupted.

For more information about the Action on Failure option, see Job Queue on page 28.

11. Click OK.

The Add Job window closes and the job is added to the list of jobs for the group. You can edit the job or delete the job from the group. To add another job to the group, click Add <Job Type> and set the properties for the new job.

When you complete setting up all jobs in the group, specify the cluster on which the jobs will run.


12. In the Cluster Settings section, select the cluster on which the jobs will run:

• Use existing. Select from the list of clusters that is available for your use.

Altus displays only the names of clusters where the type of job you selected can run and that you have access to. The list displays the number of workers in the cluster.

• Create new. Configure and create a cluster for the job. If the cluster creation process is not yet complete when you submit the job, Altus adds it to the job queue and runs it when the cluster is created.

• Clone existing. Select the cluster on which to base the configuration of a new cluster.

13. If you create or clone a cluster, set the properties and select the options for the new cluster:

Complete the following steps:

a. To allow Altus to terminate the cluster after the job completes, select the Terminate cluster once jobs complete option.

If you create a cluster specifically for this job and you do not need the cluster after the job runs, you can have Altus terminate the cluster when the job completes. If the Terminate cluster once jobs complete option is selected, Altus terminates the cluster after the job runs, whether the job completes successfully or fails. This option is selected by default. If you do not want Altus to terminate the cluster, clear the selection.

b. You create a cluster within the Jobs page the same way that you create a cluster on the Clusters page.

To create a cluster for AWS, follow the instructions from Step 4 on page 8 to Step 7 on page 13 in Creating a Data Engineering Cluster for AWS on page 8.

To create a cluster for Azure, follow the instructions from Step 4 on page 13 to Step 7 on page 17 in Creating a Data Engineering Cluster for Azure on page 13.

14. Verify that all required fields are set and click Submit.

The Altus Data Engineering service submits the jobs as a group to run on the selected cluster in your cloud service account.

Viewing Job Status and Information

To view Altus Data Engineering jobs on the console:

1. Sign in to the Cloudera Altus console:

https://console.altus.cloudera.com/

2. On the side navigation panel, click Jobs.

By default, the Jobs page displays the list of all the Altus Data Engineering jobs in your Altus account. You can filter the list of jobs by Altus environment, the cluster on which the jobs run, or the time frame when the jobs run. You can also filter by the user who submitted the job and the job type and status.

The jobs list displays the name of the group to which the job belongs and the name of the cluster on which the job runs. Click the group name to view the details of the job group and the jobs in the group. Click the cluster name to view the cluster details.

The Jobs list displays the status of the job. For more information about the different statuses that a job can have, see Altus Data Engineering Jobs on page 27.

3. You can click the Actions button for the job to perform the following tasks:

• Clone a Job. To create a job of the same type as the job that you are viewing, select the Clone Job action. On the Submit Job page, you can submit a job with the same properties as the job you are cloning. You can modify or add to the properties before you submit the job.


• Terminate a Job. If the job has a status of Queued, Running, or Submitting, you can select Terminate Job to stop the process. If you terminate a job with a status of Running, the job run is aborted. If you terminate a job with a status of Queued or Submitting, the job will not run.

If the job status is Completed, the Terminate Job selection does not appear.

Viewing the Job Details

To view the details of a job on the console:

1. Sign in to the Cloudera Altus console:

https://console.altus.cloudera.com/

2. On the side navigation panel, click Jobs.

By default, the Jobs page displays the list of all the Altus Data Engineering jobs in your Altus account. You can filter the list of jobs by Altus environment, the cluster on which the jobs run, or the time frame when the jobs run. You can also filter by the user who submitted the job and the job type and status.

3. Click the name of a job.

The Job details page displays information about the job, including the job type and the properties and status of the job.

• The Job Settings section displays the properties configured for the job, according to the type of job. The section also shows the action that Altus takes if a job in the queue fails.

• The Job details page shows the name of the cluster on which the job runs and the user account that submitted the job. If the cluster is not terminated, the cluster name is a link to the details page of the cluster where you can see more information about the cluster.

• The Job details page also shows the timeline of when the job changed status as it moved through the job execution process. It shows the amount of time the job was in the queue and the amount of time it took to complete the job.

• The Job details page displays the job ID and CRN. The job CRN is a long string of characters. If you need to use the CRN to identify a job, for example in a command or a support case, the Job details page makes it easy to copy the CRN to the clipboard so you can paste it on the command line or into the support case.

Running and Monitoring Jobs Using the CLI

Use the Cloudera Altus client to submit a job or view the properties of a job. You can use the commands listed here as examples of how to submit jobs in Altus.

For more information about the commands available in the Altus client, run the following command:

altus dataeng help

Submitting a Spark Job

You can use the following command to submit a Spark job:

altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{
  "sparkJob": {
    "jars": [
      "PathAndFilenameOfJar1",
      "PathAndFilenameOfJar2"
    ]
  }
}'

You can include the applicationArguments parameter to pass values to the main method and the sparkArguments parameter to specify Spark configuration settings. If you use the application and Spark arguments parameters, you must escape the list of arguments. Alternatively, you can put the arguments into a file and pass the path and file name with the arguments parameters.

Use the following prefixes when you include jar files for the Spark job:

• For files in Amazon S3: s3a://
• For files in Azure Data Lake Store Gen1: adl://
• For files in Azure Data Lake Store Gen2: abfs(s)://
• For files in the local file system: file://
• For files in HDFS in the cluster: hdfs://

You can also add the mainClass parameter to specify the entry point of your application.
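Putting these parameters together, the following sketch submits a Spark job with a main class, Spark arguments, and application arguments. The jar path, class name, and argument values are placeholders; the structure follows the sparkJob examples later in this section:

altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{
  "sparkJob": {
    "jars": [ "s3a://my-bucket/path/to/app.jar" ],
    "mainClass": "com.example.MyApp",
    "sparkArguments": "--executor-memory 2G --num-executors 4",
    "applicationArguments": [ "s3a://my-bucket/input/", "s3a://my-bucket/output/" ]
  }
}'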

Note: You can find examples for submitting a Spark job in the Altus Data Engineering tutorials:

In the Altus tutorial for AWS: Creating a Spark Job using the CLI on page 54

In the Altus tutorial for Azure: Creating a Spark Job using the CLI on page 73

Spark Job Examples

The following examples show how to submit a Spark job to run on a cluster in AWS and in Azure.

Spark Job Examples for a Cluster in AWS

The following examples show how to submit a Spark job to run on a cluster in AWS:

Pi Estimation Example

Spark provides a library of code examples that illustrate how Spark works. The following example uses the Pi estimation example from the Spark library to show how to submit a Spark job using the Altus CLI.

You can use the following command to submit a Spark job to run the Pi estimation example:

altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{
  "sparkJob": {
    "jars": [
      "local:///opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples.jar"
    ],
    "sparkArguments" : "--executor-memory 1G --num-executors 2",
    "mainClass": "org.apache.spark.examples.SparkPi"
  }
}'

The --cluster-name parameter requires the name of a Spark cluster.

Medicare Example

The following example processes publicly available data to show the usage of Medicare procedure codes. The Spark job is available in a Cloudera Altus S3 bucket of job examples and reads input data from the Cloudera Altus example S3 bucket. You can create an S3 bucket in your account to write output data.

To use the example, set up an S3 bucket in your AWS account and set permissions to allow write access when you run the job.

To run the Spark job example:

1. Create a Spark cluster to run the job.


You can create a cluster with the Spark 2.x or Spark 1.6 service type. The version of the Spark service in the cluster must match the version of the Spark jar file:

• For Spark 2.x, use the example jar file named altus-sample-medicare-spark2x.jar
• For Spark 1.6, use the example jar file named altus-sample-medicare-spark1x.jar

For more information about creating a cluster, see Creating a Cluster for AWS on page 20.

2. Create an S3 bucket in your AWS account.
3. Use the following command to submit the Medicare job:

altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{
  "sparkJob": {
    "jars": [
      "s3a://cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-SparkVersion.jar"
    ],
    "mainClass": "com.cloudera.altus.sample.medicare.transform",
    "applicationArguments": [
      "s3a://cloudera-altus-data-engineering-samples/spark/medicare/input/",
      "s3a://NameOfOutputS3Bucket/OutputPath/"
    ]
  }
}'

The --cluster-name parameter requires the name of a cluster with a version of Spark that matches the version of the example jar file.

The jars parameter requires the name of the jar file that matches the version of the Spark service in the cluster.

Spark Job Example for a Cluster in Azure

This example processes publicly available data to show the usage of Medicare procedure codes.

Cloudera provides the job example files and input files that you need to run the jobs. To use the following example, set up an Azure Data Lake Store account with permissions to allow read and write access when you run the Altus jobs. Then run the script that Altus provides to upload the files to the ADLS account so the job files and data files are available for your use. For instructions on uploading the jar file, see Sample Files Upload on page 66.

To run the Spark job example:

1. Create a Spark cluster to run the job.

You can create a cluster with the Spark 2.x or Spark 1.6 service type. The version of the Spark service in the cluster must match the version of the Spark jar file:

• For Spark 2.x, use the example jar file named altus-sample-medicare-spark2x.jar
• For Spark 1.6, use the example jar file named altus-sample-medicare-spark1x.jar

For more information about creating a cluster, see Creating a Cluster for Azure on page 22.

2. Use the following command to submit the Medicare job:

altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{
  "sparkJob": {
    "jars": [
      "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-SparkVersion.jar"
    ],
    "mainClass": "com.cloudera.altus.sample.medicare.transform",
    "applicationArguments": [
      "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/input/",
      "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/output/"
    ]
  }
}'

The --cluster-name parameter requires the name of a cluster with a version of Spark that matches the version of the example jar file.

The jars parameter requires the name of the jar file that matches the version of the Spark service in the cluster.

Submitting a Hive Job

You can use the following command to submit a Hive job:

altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{ "hiveJob": { "script": "PathAndFilenameOfHQLScript" } }'

You can also include the jobXml parameter to pass job configuration settings for the Hive job.

The following is an example of the content of a Hive job XML that you can use with the jobXml parameter:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>hive.auto.convert.join</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.auto.convert.join.noconditionaltask.size</name>
    <value>20971520</value>
  </property>
  <property>
    <name>hive.optimize.bucketmapjoin.sortedmerge</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.smbjoin.cache.rows</name>
    <value>10000</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>-1</value>
  </property>
  <property>
    <name>hive.exec.reducers.max</name>
    <value>1099</value>
  </property>
</configuration>
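The following sketch shows one way the jobXml parameter might be attached to a Hive job. Whether jobXml takes a file path, as assumed here, or inline XML content is not shown in this guide, so verify the expected format with altus dataeng help before relying on it:

altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{
  "hiveJob": {
    "script": "file:///path/to/hiveScript.hql",
    "jobXml": "file:///path/to/jobConfig.xml"
  }
}'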

Note: You can find examples for submitting a Hive job in the Altus Data Engineering tutorials:

In the Altus tutorial for AWS: Creating a Hive Job Group using the CLI on page 63

In the Altus tutorial for Azure: Submitting a Spark Job on page 69

Hive Job Examples

The following examples show how to submit a Hive job to run on a cluster in AWS and in Azure.


Hive Job Example for a Cluster in AWS

The following example of a Hive job reads data from a CSV file in an S3 bucket in the Cloudera AWS account. It then writes the same data, with the commas changed to colons, to an S3 bucket in your AWS account.

To use the example, set up an S3 bucket in your AWS account and set permissions to allow write access when you run the example Hive script.

To run the Hive job example:

1. Create a cluster to run the Hive job.

You can run a Hive job on a Hive on MapReduce or Hive on Spark cluster. For more information about creating a cluster, see Creating a Cluster for AWS on page 20.

2. Create an S3 bucket in your AWS account.
3. Create a Hive script file on your local drive.

This example uses the file name hiveScript.hql.

4. Copy and paste the following script into the file:

DROP TABLE input;
DROP TABLE output;

CREATE EXTERNAL TABLE input(f1 STRING, f2 STRING, f3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3a://cloudera-altus-data-engineering-samples/hive/data/';

CREATE TABLE output(f1 STRING, f2 STRING, f3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':'
STORED AS TEXTFILE
LOCATION 's3a://NameOfOutputS3Bucket/OutputPath/';

INSERT OVERWRITE TABLE output SELECT * FROM input ORDER BY f1;

5. Modify the script and replace the name and path of the output S3 bucket with the name and path of the S3 bucket you created in your AWS account.

6. Run the following command:

altus dataeng submit-jobs \
--cluster-name=ClusterName \
--jobs '{ "hiveJob": { "script": "PathToHiveScript/hiveScript.hql" } }'

The --cluster-name parameter requires the name of a Hive or Hive on Spark cluster.

The script parameter requires the absolute path and file name of the script file prefixed with file://.

For example: --jobs '{ "hiveJob": { "script": "file:///file/path/to/my/hiveScript.hql"}}'

Hive Job Example for a Cluster in Azure

This example of a Hive job reads data from a CSV file and writes the same data, with the commas changed to colons, to an output folder in your Azure Data Lake Store (ADLS) account.

Cloudera provides the job example files and input files that you need to run the jobs. To use the following example, set up an Azure Data Lake Store account with permissions to allow read and write access when you run the Altus jobs. Then run the script that Altus provides to upload the files to the ADLS account so the job files and data files are available for your use. For instructions on uploading the jar file, see Sample Files Upload on page 66.

To run the Hive job example:

1. Create a cluster to run the Hive job.


You can run a Hive job on a Hive on MapReduce or Hive on Spark cluster. For more information about creating a cluster, see Creating a Cluster for Azure on page 22.

2. Create a Hive script file on your local drive.

This example uses the file name hiveScript.hql.

3. Copy and paste the following script into the file:

DROP TABLE input;
DROP TABLE output;

CREATE EXTERNAL TABLE input(f1 STRING, f2 STRING, f3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/';

CREATE TABLE output(f1 STRING, f2 STRING, f3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':'
STORED AS TEXTFILE
LOCATION 'adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/output/';

INSERT OVERWRITE TABLE output SELECT * FROM input ORDER BY f1;

4. Modify the script and replace the name of the ADLS account with the name of the ADLS account you set up for Altus examples.

5. Run the following command:

altus dataeng submit-jobs \
--cluster-name=ClusterName \
--jobs '{ "hiveJob": { "script": "PathToHiveScript/hiveScript.hql" } }'

The --cluster-name parameter requires the name of a Hive or Hive on Spark cluster.

The script parameter requires the absolute path and file name of the script file prefixed with file://.

For example: --jobs '{ "hiveJob": { "script": "file:///file/path/to/my/hiveScript.hql"}}'

Submitting a MapReduce Job

You can use the following command to submit a MapReduce job:

altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{
  "mr2Job": {
    "mainClass": "main.class.file",
    "jars": [
      "PathAndFilenameOfJar1",
      "PathAndFilenameOfJar2"
    ]
  }
}'

Altus uses Oozie to run MapReduce2 jobs. When you submit a MapReduce2 job in Altus, Oozie launches a Java action to process the MapReduce2 job request. You can specify configuration settings for your job in an XML configuration file. To load the Oozie configuration settings into the MapReduce2 job, load the job XML file into the Java main class of the MapReduce2 application.

For example, the following code snippet from a MapReduce2 application shows the oozie.action.conf.xml being loaded into the application:

public int run(String[] args) throws Exception {
    Job job = Job.getInstance(loadJobConfiguration(), "wordcount");
    ...
    // Launch MR2 Job
    ...
}

private Configuration loadJobConfiguration() {
    String ooziePreparedConfig = System.getProperty("oozie.action.conf.xml");
    if (ooziePreparedConfig != null) {
        // Oozie collects hadoop configs with job.xml into a single file.
        // So default config is not needed.
        Configuration actionConf = new Configuration(false);
        actionConf.addResource(new Path("file:///", ooziePreparedConfig));
        return actionConf;
    } else {
        return new Configuration(true);
    }
}

MapReduce Job Examples

The following examples show how to submit a MapReduce job to run on a cluster in AWS and in Azure.

MapReduce Job Example for a Cluster in AWS

The following example of a MapReduce job is available in the Cloudera Altus S3 bucket of job examples. The job reads input data from a poetry file in the Cloudera Altus example S3 bucket.

To use the example, set up an S3 bucket in your AWS account to write output data. Set the S3 bucket permissions to allow write access when you run the job.

You can use the following command to submit a MapReduce job to run the example:

altus dataeng submit-jobs \
  --cluster-name ClusterName \
  --jobs '{ "mr2Job": {
      "mainClass": "com.cloudera.altus.sample.mr2.wordcount.WordCount",
      "jars": [ "s3a://cloudera-altus-data-engineering-samples/mr2/wordcount/program/altus-sample-mr2.jar" ],
      "arguments": [
        "s3a://cloudera-altus-data-engineering-samples/mr2/wordcount/input/poetry/",
        "s3a://NameOfOutputS3Bucket/OutputPath/"
      ]
    } }'

The --cluster-name parameter requires the name of a MapReduce cluster.

MapReduce Job Example for a Cluster in Azure

This example is a simple job that reads input data from a file and counts the words.

Cloudera provides the job example files and input files that you need to run the jobs. To use the following example, set up an Azure Data Lake Store account with permissions to allow read and write access when you run the Altus jobs. Then run the script that Altus provides to upload the files to the ADLS account so the job files and data files are available for your use. For instructions on uploading the jar file, see Sample Files Upload on page 66.

You can use the following command to submit a MapReduce job to run the example:

altus dataeng submit-jobs \
  --cluster-name ClusterName \
  --jobs '{ "mr2Job": {
      "mainClass": "com.cloudera.altus.sample.mr2.wordcount.WordCount",
      "jars": [ "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/mr2/wordcount/program/altus-sample-mr2.jar" ],
      "arguments": [
        "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/mr2/wordcount/input/poetry/",
        "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/mr2/wordcount/output/"
      ]
    } }'

The --cluster-name parameter requires the name of a MapReduce cluster.

Submitting a PySpark Job

You can use the following command to submit a PySpark job:

altus dataeng submit-jobs \
  --cluster-name=ClusterName \
  --jobs '{ "name": "WordCountJob",
    "pySparkJob": {
      "mainPy": "PathAndFilenameOfthePySparkMainFile",
      "sparkArguments": "SparkArgumentsRequiredForYourApplication",
      "pyFiles": "PythonFilesRequiredForYourApplication",
      "applicationArguments": [ "PathAndFilenameOfFile1", "PathAndFilenameOfFile2" ]
    } }'

You can include the applicationArguments parameter to pass values to the main method and the sparkArguments parameter to specify Spark configuration settings. If you use the applicationArguments and sparkArguments parameters, you must escape the list of arguments. Alternatively, you can put the arguments into a file and pass the path and file name with the arguments parameters.
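For example, an argument that itself contains quotes must be escaped so that it survives JSON parsing. The following is an illustrative sketch only; the cluster name, paths, and Spark option are hypothetical:

# Hypothetical sketch: the embedded quotes around the Java option
# are escaped with \" so they survive JSON parsing.
altus dataeng submit-jobs \
  --cluster-name MySparkCluster \
  --jobs '{ "pySparkJob": {
      "mainPy": "s3a://my-bucket/app.py",
      "sparkArguments": "--executor-memory 1G --conf spark.driver.extraJavaOptions=\"-Duser.timezone=UTC\"",
      "applicationArguments": [ "s3a://my-bucket/input/", "s3a://my-bucket/output/" ]
    } }'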

The --cluster-name parameter requires the name of a Spark cluster.

The pyFiles parameter takes the path and file names of Python modules. For example:

"pyFiles" : ["s3a://path/to/module1.py", "s3a://path/to/module2.py"]

PySpark Job Examples

The following examples show how to submit a PySpark job to run on a cluster in AWS and in Azure.

PySpark Job Example for a Cluster in AWS

This example uses a PySpark job to count words in a text file and write the result to an S3 bucket that you specify. The Python file is available in a Cloudera Altus S3 bucket of job examples and also reads input data from the Cloudera Altus S3 bucket. You can create an S3 bucket in your account to write output data.

The job in this example runs on a cluster with the Spark 2.2 service.

You can use the following command to submit a PySpark job to run the word count example:

altus dataeng submit-jobs \
  --cluster-name=ClusterName \
  --jobs '{ "name": "Word Count Job",
    "pySparkJob": {
      "mainPy": "s3a://cloudera-altus-data-engineering-samples/pyspark/wordcount/wordcount2.py",
      "sparkArguments": "--executor-memory 1G --num-executors 2",
      "applicationArguments": [
        "s3a://cloudera-altus-data-engineering-samples/pyspark/wordcount/input/HadoopPoem0.txt",
        "s3a://NameOfOutputS3Bucket/PathToOutputFile"
      ]
    } }'


If you need to use a specific Python environment for your PySpark job, you can use the --instance-bootstrap-script parameter to include a bootstrap script to install a custom Python environment when Altus creates the cluster.

For an example of how to use a bootstrap script in the create-aws-cluster command to install a Python environmentfor a PySpark job, see Example: Creating a Cluster in AWS for a PySpark Job on page 22.

PySpark Job Example for a Cluster in Azure

This example uses a PySpark job to count words in a text file and write the result to an ADLS account that you specify. The job runs on a cluster with the Spark 2.2 service.

Cloudera provides the job example files and input files that you need to run the jobs. To use the following example, set up an Azure Data Lake Store account with permissions to allow read and write access when you run the Altus jobs. Then run the script that Altus provides to upload the files to the ADLS account so the job files and data files are available for your use. For instructions on uploading the jar file, see Sample Files Upload on page 66.

You can use the following command to submit a PySpark job to run the word count example:

altus dataeng submit-jobs \
  --cluster-name=ClusterName \
  --jobs '{ "name": "Word Count Job",
    "pySparkJob": {
      "mainPy": "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/pyspark/wordcount/wordcount2.py",
      "sparkArguments": "--executor-memory 1G --num-executors 2",
      "applicationArguments": [
        "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/pyspark/wordcount/input/HadoopPoem0.txt",
        "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/pyspark/wordcount/outputfile"
      ]
    } }'

If you need to use a specific Python environment for your PySpark job, you can use the --instance-bootstrap-script parameter to include a bootstrap script to install a custom Python environment when Altus creates the cluster.

For an example of how to use a bootstrap script in the create-azure-cluster command to install a Python environment for a PySpark job, see Example: Creating a Cluster in Azure for a PySpark Job on page 23.

Submitting a Job Group with Multiple Job Types

You can submit different types of jobs to run on a Multi type cluster.

Use the following command to submit a group of jobs that includes a PySpark job, a Hive job, and a MapReduce job:

altus dataeng submit-jobs \
  --cluster-name MultiClusterName \
  --job-submission-group-name "JobGroupName" \
  --jobs '[
    { "pySparkJob": { "mainPy": "PathAndFilenameOfthePySparkMainFile" } },
    { "hiveJob": { "script": "PathAndFilenameOfHQLScript" } },
    { "mr2Job": { "mainClass": "main.class.file", "jars": [ "PathAndFilenameOfJar1" ] } }
  ]'


Tutorial: Clusters and Jobs on AWS

This tutorial walks you through using the Altus console and CLI to create Altus Data Engineering clusters and submit jobs in Altus. The tutorial uses publicly available data that show the usage of Medicare procedure codes.

Cloudera has created a publicly accessible S3 bucket that contains the data, scripts, and other artifacts used in this tutorial. You must create an S3 bucket in your AWS account to write output data.

The tutorial has the following sections:

Prerequisites

To use this tutorial, you must have an Altus user account and the roles required to create clusters and run jobs in Altus.

Altus Console Login on page 66

Log in to the Altus console to perform the exercises in this tutorial.

Exercise 1: Installing the Altus Client on page 66

Learn how to install the Altus client and register an access key to use the CLI.

Exercise 2: Creating a Spark Cluster and Submitting Spark Jobs on page 48

Learn how to create a cluster with a Spark service and submit a Spark job using the Altus console and the CLI. This exercise provides instructions on how to create a SOCKS proxy and view the cluster and monitor the job in Cloudera Manager. It also shows you how to delete the cluster on the console.

Exercise 3: Creating a Hive Cluster and Submitting Hive Jobs on page 54

Learn how to create a cluster with a Hive service and submit a group of Hive jobs using the Altus console and the CLI. This exercise also walks you through the process of creating a SOCKS proxy and accessing Cloudera Manager. It also shows you how to delete the cluster on the console.

Note: The tutorials in this section perform the same tasks as the sample applications provided for the Altus SDK for Java. For more information, see Using the Altus SDK for Java.

Prerequisites

Before you start the tutorial, ensure that you have access to resources in your AWS account and an Altus user account with permission to create clusters and run jobs in Altus.

The following are prerequisites for the tutorial:

• Altus user account, environment, and roles. An Altus user account allows you to log in to the Altus console and perform the exercises in the tutorial. An Altus administrator must assign an Altus environment to your user account so that you have access to resources in your AWS account. The Altus administrator must also assign roles to your user account to allow you to create clusters and run jobs in Altus.

• Public key. You must provide a public key for Altus to use when creating and configuring clusters in your AWS account.

For more information about creating the SSH key in AWS, see Amazon EC2 Key Pairs. You can also create the SSH keys using other tools, such as ssh-keygen.
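For instance, a key pair for this tutorial could be generated locally with ssh-keygen; the file name below is an arbitrary choice:

# Generate an RSA key pair; the file name is a hypothetical example.
# The .pub file is the public key you provide to Altus.
ssh-keygen -t rsa -b 4096 -f ~/.ssh/altus-tutorial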

• S3 bucket for output. The tutorial provides read access to an S3 bucket that contains the jars, scripts, and input data used in the tutorial exercises. You must set up an S3 bucket in your AWS account for the output data generated by the jobs. The S3 bucket must have the permissions to allow write access when you run the Altus jobs.

For more information about creating an S3 bucket in AWS, see Creating and Configuring an S3 Bucket.
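If you have the AWS CLI configured, one way to create such a bucket is shown below; the bucket name and region are hypothetical placeholders:

# Create an output bucket; the name and region are hypothetical.
aws s3 mb s3://my-altus-tutorial-output --region us-west-2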


Altus Console Login

To access the Altus console, go to the following URL: https://console.altus.cloudera.com/.

Log in to Altus with your Cloudera user account. After you are authenticated, the Altus console displays your homepage.

The Data Engineering section displays on the side navigation panel. If you have been assigned roles and an environment in Altus, you can click on Clusters and Jobs to create clusters and submit jobs as you follow the tutorial exercises.

Exercise 1: Installing the Altus Client

To use the Altus CLI, you must install the Altus client and configure the client with an access key.

Altus manages access to the Altus services so that only users with a registered access key can run commands to create clusters, submit jobs, or use SDX namespaces. Generate and register an access key with the Altus client to create a credentials file so that you do not need to submit your access key with each command.

This exercise provides instructions to download and install the Altus client on Linux, generate a key, and run the CLIcommand to register the key.

To set up the Cloudera Altus client, complete the following tasks:

1. Install the Altus client.
2. Configure the Altus client with an access key.

Step 1. Install the Altus Client

To avoid conflicts with older versions of Python or other packages, Cloudera recommends that you install the Cloudera Altus client in a virtual environment. Use the virtualenv tool to create a virtual environment and install the client.

The following commands show how you can use pip to install the client on a virtual environment on Linux:

mkdir ~/altusclienv
virtualenv ~/altusclienv --no-site-packages
source ~/altusclienv/bin/activate
~/altusclienv/bin/pip install altuscli

To upgrade the client to the latest version, run the following command:

~/altusclienv/bin/pip install --upgrade altuscli

After the client installation process is complete, run the following command to confirm that the Altus client is working:

If virtualenv is activated: altus --version

If virtualenv is not activated: ~/altusclienv/bin/altus --version

Step 2. Configure the Altus Client with the API Access Key

You use the Altus console to generate the access key that you register with the client. Keep the window that displays the access key on the console open until you complete the key registration process.

To create and set up the client with a Cloudera Altus API access key:

1. Sign in to the Cloudera Altus console:

https://console.altus.cloudera.com/

2. Click your user account name and select My Account.
3. On the My Account page, click Generate Access Key.


Altus creates the key and displays the information on the screen. The following image shows an example of an Altus API access key as displayed on the Altus console:

Note: The Cloudera Altus console displays the API access key immediately after you create it.You must copy the access key information when it is displayed. Do not exit the console withoutcopying the keys. After you exit the console, there is no other way to view or copy the access key.

4. On the command line, run the following command to configure the client with the access key:

altus configure

5. Enter the following information at the prompt:

• Altus Access key. Copy and paste the access key ID that you generated in the Cloudera Altus console.
• Altus Private key. Copy and paste the private key that you generated in the Cloudera Altus console.

The private key is a very long string of characters. Make sure that you enter the full string.

The configuration utility creates the following file to store your user credentials: ~/.altus/credentials

6. To verify that the credentials were created correctly, run the following command:

altus iam get-user

The command displays your Altus client credentials.

7. After the credentials file is created, you can go back to the Cloudera Altus console and click OK to exit the access key window.

Exercise 2: Creating a Spark Cluster and Submitting Spark Jobs

This exercise shows you how to create a cluster with a Spark service on the Altus console and submit a Spark job on the console and the command line. It also shows you how to create a SOCKS proxy and access the cluster and view the progress of the job on Cloudera Manager.


In this exercise, you complete the following tasks:

1. Create a cluster with a Spark service on the console.
2. Submit a Spark job on the console.
3. Create a SOCKS proxy to access the Spark cluster on Cloudera Manager.
4. View the Spark cluster and verify the Spark job output.
5. Submit a Spark job using the CLI.
6. Terminate the Spark cluster.

Creating a Spark Cluster on the Console

You must be logged in to the Altus console to perform this task.

Note that it can take a while for Altus to complete the process of creating a cluster.

To create a cluster on the console:

1. In the Data Engineering section of the side navigation panel, click Clusters.
2. On the Clusters page, click Create Cluster.
3. Create a cluster with the following configuration:

Cluster Name: To help you easily identify your cluster, use your first initial and last name as a prefix for the cluster name. This tutorial uses the cluster name mjones-spark-tutorial as an example.

Service Type: Spark 2.x

CDH Version: CDH 5.13

Environment: The name of the Altus environment to which you have been given access for this tutorial. If you do not know which Altus environment to select, check with your Altus administrator.

Node Configuration: For the Worker node configuration, set the Number of Nodes to 3. Leave the rest of the node properties with their default settings.

Credentials: Configure your access credentials to Cloudera Manager:
• SSH Public Key: If you have your public key in a file, select File Upload and choose the key file. If you have the key available for pasting on screen, select Direct Input to enter the full key code.
• Cloudera Manager User: Set both the user name and password to guest.

The following figure shows the Create Cluster page with the settings for this tutorial:


4. Verify that all required fields are set and click Create Cluster.

The Altus Data Engineering service creates a CDH cluster with the configuration you set. On the Clusters page,the new cluster displays at the top of the list of clusters.

Submitting a Spark Job

Submit a Spark job to run on the cluster you created in the previous task.

To submit a Spark job on the console:

1. In the Data Engineering section of the side navigation panel, click Jobs.
2. Click Submit Jobs.
3. On the Job Settings page, select Single job.
4. Select the Spark job type.
5. Create a Spark job with the following configuration:

Job Name: Set the job name to Spark Medical Example.

Main Class: Set the main class to com.cloudera.altus.sample.medicare.transform

Jars: Use the tutorial jar file: s3a://cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-spark2x.jar

Application Arguments: Set the application arguments to the S3 bucket to use for job input and output.
• Add the tutorial S3 bucket for the job input: s3a://cloudera-altus-data-engineering-samples/spark/medicare/input/
• Click + and add the S3 bucket you created for the job output: s3a://Path/Of/The/Output/S3Bucket/

Cluster Settings: Use an existing cluster and select the cluster that you created in the previous task.

The following figure shows the Submit Jobs page with the settings for this tutorial:

6. Verify that all required fields are set and click Submit Jobs.

The Altus Data Engineering service submits the job to run on the selected cluster in your AWS account.

Creating a SOCKS Proxy for the Spark Cluster

Use the Altus CLI to create a SOCKS proxy to log in to Cloudera Manager and view the cluster and progress of the job.

To create a SOCKS proxy to access Cloudera Manager:

1. In the Data Engineering section of the side navigation panel, click Clusters.
2. On the Clusters page, find the cluster on which you submitted the job and click the cluster name.
3. On the cluster detail page, click View SOCKS Proxy CLI Command.

Altus displays the command that you can use to create a SOCKS proxy to log in to the Cloudera Manager instance for the Spark cluster that you created.


4. Click Copy.
5. In a terminal window, paste the command.
6. Modify the command to use the name of the cluster you created and your private key, and then run the command:

altus dataeng socks-proxy \
  --cluster-name "YourClusterName" \
  --ssh-private-key="YourPrivateKey" \
  --open-cloudera-manager="yes"

The Cloudera Manager Admin console opens in a Chrome browser.

Note: The command includes a parameter to open Cloudera Manager in a Chrome browser. If you do not use Chrome, remove the open-cloudera-manager parameter so that the command displays instructions for accessing the Cloudera Manager URL from any browser.
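For example, a browser-agnostic invocation might look like the following sketch; the cluster name and key path are hypothetical placeholders:

# Without --open-cloudera-manager, the command prints instructions
# for reaching the Cloudera Manager URL from any browser.
altus dataeng socks-proxy \
  --cluster-name "mjones-spark-tutorial" \
  --ssh-private-key="~/.ssh/altus-tutorial"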

Viewing the Cluster and Verifying the Spark Job Output

Log in to Cloudera Manager with the guest user account that you set up when you created the cluster.

To view the cluster and monitor the job on the Cloudera Manager Admin console:

1. Log in to Cloudera Manager using guest as the account name and password.
2. On the Home page, click Clusters on the top navigation bar.
3. On the cluster window, select YARN Applications.

The following screenshots show the cluster services and workload information that you can view on the Cloudera Manager Admin console:


When your Spark job completes, you can view the output of the Spark job in the S3 bucket that you specified for yourjob output. The Spark job creates the following files in your output S3 bucket:


• Success (0 bytes)
• part-00000 (65.5 KB)
• part-00001 (69.5 KB)

Note: If you want to use the same output S3 bucket for the next exercise, go to the AWS console and delete the files in the S3 bucket. You will recreate the files when you submit the same Spark job using the CLI.
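If you have the AWS CLI configured, you can also list and clear the output from the command line; the bucket name and path below are hypothetical placeholders:

# List the job output, then remove it so the next exercise can
# write to the same bucket. Bucket name and path are hypothetical.
aws s3 ls s3://my-altus-tutorial-output/medicare/
aws s3 rm s3://my-altus-tutorial-output/medicare/ --recursive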

Creating a Spark Job using the CLI

You can submit the same Spark job to run on the same cluster using the CLI. If you want to view the cluster and monitor the job on Cloudera Manager, stay logged in to Cloudera Manager.

To submit a Spark job using the CLI, run the following command:

altus dataeng submit-jobs \
  --cluster-name FirstInitialLastName-tutorialcluster \
  --jobs '{ "sparkJob": {
      "jars": [ "s3a://cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-spark2x.jar" ],
      "mainClass": "com.cloudera.altus.sample.medicare.transform",
      "applicationArguments": [
        "s3a://cloudera-altus-data-engineering-samples/spark/medicare/input/",
        "s3a://Path/Of/The/Output/S3Bucket/"
      ]
    } }'

To view the workload summary, go to the Cloudera Manager console and click Clusters > SPARK_ON_YARN-1. Cloudera Manager displays the same workload summary for this job as for the job that you submitted through the console.

To verify the output, go to the S3 bucket you specified for your job output and verify that it contains the files created by the Spark job:

• Success (0 bytes)
• part-00000 (65.5 KB)
• part-00001 (69.5 KB)

Terminating the Cluster

This task shows you how to terminate the cluster that you created for this tutorial.

To terminate the cluster on the Altus console:

1. On the Altus console, go to the Data Engineering section of the side navigation panel and click Clusters.
2. On the Clusters page, click the name of the cluster that you created for this tutorial.
3. On the Cluster details page, review the cluster information to verify that it is the cluster that you want to terminate.
4. Click Actions and select Delete Cluster.
5. Click OK to confirm that you want to terminate the cluster.

Exercise 3: Creating a Hive Cluster and Submitting Hive Jobs

This exercise shows you how to create a cluster with a Hive service on the Altus console and submit Hive jobs on the console and the command line. It also shows you how to create a SOCKS proxy and access the cluster and view the progress of the jobs on Cloudera Manager.

In this exercise, you complete the following tasks:


1. Create a cluster with a Hive service on the console.
2. Submit a group of Hive jobs on the console.
3. Create a SOCKS proxy to access the Hive cluster on Cloudera Manager.
4. View the Hive cluster and verify the Hive job output.
5. Submit a group of Hive jobs using the CLI.
6. Terminate the Hive cluster.

Creating a Hive Cluster on the Console

You must be logged in to the Altus console to perform this task.

Note that it can take a while for Altus to complete the process of creating a cluster.

To create a cluster with a Hive service on the console:

1. In the Data Engineering section of the side navigation panel, click Clusters.
2. On the Clusters page, click Create Cluster.
3. Create a cluster with the following configuration:

Cluster Name: To help you easily identify your cluster, use your first initial and last name as a prefix for the cluster name. This tutorial uses the cluster name mjones-hive-tutorial as an example.

Service Type: Hive

CDH Version: CDH 5.13

Environment: The name of the Altus environment to which you have been given access for this tutorial. If you do not know which Altus environment to select, check with your Altus administrator.

Node Configuration: For the Worker node configuration, set the Number of Nodes to 3. Leave the rest of the node properties with their default settings.

Credentials: Configure your access credentials to Cloudera Manager:
• SSH Public Key: If you have your public key in a file, select File Upload and choose the key file. If you have the key available for pasting on screen, select Direct Input to enter the full key code.
• Cloudera Manager User: Set both the user name and password to guest.

The following figure shows the Create Cluster page with the settings for this tutorial:


4. Verify that all required fields are set and click Create Cluster.

The Altus Data Engineering service creates a CDH cluster with the configuration you set. On the Clusters page,the new cluster displays at the top of the list of clusters.

Submitting a Hive Job Group

Submit multiple Hive jobs as a group to run on the cluster that you created in the previous step.

To submit a job group on the console:

1. In the Data Engineering section of the side navigation panel, click Jobs.
2. Click Submit Jobs.
3. On the Job Settings page, select Group of jobs.
4. Select the Hive job type.
5. Set the Job Group Name to Hive Medical Example.
6. Click Add Hive Job.
7. Create a job with the following configuration:

Job Name: Set the job name to Create External Tables.

Script: Select Script Path and enter the following script name: s3a://cloudera-altus-data-engineering-samples/hive/program/med-part1.hql

Hive Script Parameters: Select Hive Script Parameters and add the following variables and values:
• HOSPITALS_PATH: s3a://cloudera-altus-data-engineering-samples/hive/data/hospitals/
• READMISSIONS_PATH: s3a://cloudera-altus-data-engineering-samples/hive/data/readmissionsDeath/
• EFFECTIVECARE_PATH: s3a://cloudera-altus-data-engineering-samples/hive/data/effectiveCare/
• GDP_PATH: s3a://cloudera-altus-data-engineering-samples/hive/data/GDP/

Action on Failure: Select Interrupt Job Queue.

The following figure shows the Add Job window with the settings for this job:

8. Click OK to add the job to the group.

On the Submit Jobs page, Altus adds the Create External Tables job to the list of jobs in the group.

9. Click Add Hive Job.
10. Create a job with the following configuration:

Job Name: Set the job name to Clean Data.

Script: Select Script Path and enter the following script name: s3a://cloudera-altus-data-engineering-samples/hive/program/med-part2.hql

Action on Failure: Select Interrupt Job Queue.

The following figure shows the Add Job window with the settings for this job:


11. Click OK.

On the Submit Jobs page, Altus adds the Clean Data job to the list of jobs in the group.

12. Click Add Hive Job.
13. Create a job with the following configuration:

Job Name: Set the job name to Write Output.

Script: Select Script Path and enter the following script name: s3a://cloudera-altus-data-engineering-samples/hive/program/med-part3.hql

Hive Script Parameters: Select Hive Script Parameters and add the S3 bucket you created for the job output as a variable:
• OUTPUT_DIR: s3a://Path/Of/The/Output/S3Bucket/

Action on Failure: Select None.

The following figure shows the Add Job window with the settings for this job:


14. Click OK.

On the Submit Jobs page, Altus adds the Write Output job to the list of jobs in the group.

15. On the Cluster Settings section, select Use existing and select the Hive cluster you created for this exercise.

The list of clusters displayed includes only those clusters that can run Hive jobs.

16. Click Submit Jobs to run the job group on your Hive cluster.

Creating a SOCKS Proxy for the Hive Cluster

Use the Altus CLI to create a SOCKS proxy to log in to Cloudera Manager and view the progress of the job.

To create a SOCKS proxy to access Cloudera Manager:

1. In the Data Engineering section of the side navigation panel, click Clusters.
2. On the Clusters page, find the cluster on which you submitted the Hive job group and click the cluster name.
3. On the cluster detail page, click View SOCKS Proxy CLI Command.

Altus displays the command that you can use to create a SOCKS proxy to log in to the Cloudera Manager instance for the Hive cluster that you created.


4. Click Copy.
5. In a terminal window, paste the command.
6. Modify the command to use the name of the cluster you created and your private key, and then run the command:

altus dataeng socks-proxy \
  --cluster-name "YourClusterName" \
  --ssh-private-key="YourPrivateKey" \
  --open-cloudera-manager="yes"

The Cloudera Manager Admin console opens in a Chrome browser.

Note: The command includes a parameter to open Cloudera Manager in a Chrome browser. If you do not use Chrome, remove the open-cloudera-manager parameter so that the command displays instructions for accessing the Cloudera Manager URL from any browser.

Viewing the Hive Cluster and Verifying the Hive Job Output

Log in to Cloudera Manager with the guest user account that you set up when you created the Hive cluster.

To view the cluster and monitor the job on the Cloudera Manager Admin console:

1. Log in to Cloudera Manager using guest as the account name and password.
2. On the Home page, click Clusters on the top navigation bar.
3. On the cluster window, select YARN Applications.

The following screenshots show the cluster services and workload information that you can view on the Cloudera Manager Admin console:


4. Click Clusters on the top navigation bar and select the default Hive service named HIVE-1. Then click HiveServer2 Web UI.

The following screenshots show the workload information that you can view for the Hive service:


5. When the jobs complete, go to the S3 bucket you specified for your job output and verify the file created by the Hive jobs.

The Hive jobs create the following file in your output S3 bucket: 000000_0 (135.9 KB)

Creating a Hive Job Group using the CLI

You can submit the same group of Hive jobs to run on the same cluster using the CLI. If you want to view the cluster and monitor the job on Cloudera Manager, stay logged in to Cloudera Manager.

To submit a group of Hive jobs using the CLI, run the submit-jobs command and provide the list of jobs in the jobs parameter. Run it on the same cluster and use the same job group name.

Run the following command:

altus dataeng submit-jobs \
  --cluster-name FirstInitialLastName-tutorialcluster \
  --job-submission-group-name "Hive Medical Example" \
  --jobs '[
    { "name": "Create External Tables",
      "failureAction": "INTERRUPT_JOB_QUEUE",
      "hiveJob": {
        "script": "s3a://cloudera-altus-data-engineering-samples/hive/program/med-part1.hql",
        "params": [
          "HOSPITALS_PATH=s3a://cloudera-altus-data-engineering-samples/hive/data/hospitals/",
          "READMISSIONS_PATH=s3a://cloudera-altus-data-engineering-samples/hive/data/readmissionsDeath/",
          "EFFECTIVECARE_PATH=s3a://cloudera-altus-data-engineering-samples/hive/data/effectiveCare/",
          "GDP_PATH=s3a://cloudera-altus-data-engineering-samples/hive/data/GDP/"
        ]
      } },
    { "name": "Clean Data",
      "failureAction": "INTERRUPT_JOB_QUEUE",
      "hiveJob": { "script": "s3a://cloudera-altus-data-engineering-samples/hive/program/med-part2.hql" } },
    { "name": "Output Data",
      "failureAction": "NONE",
      "hiveJob": {
        "script": "s3a://cloudera-altus-data-engineering-samples/hive/program/med-part3.hql",
        "params": [ "outputdir=s3a://Path/Of/The/Output/S3Bucket/" ]
      } }
  ]'

You can go to the Cloudera Manager console to view the status of the Hive cluster and jobs:

• To view the workload summary, click Clusters > SPARK_ON_YARN-1.
• To view the job information, click Clusters > HIVE-1 > HiveServer2 Web UI.

Cloudera Manager displays the same workload summary and job queries for this job as for the job that you submittedthrough the console.

When the jobs complete, go to the S3 bucket you specified for your job output and verify the file created by the Hive jobs. The Hive job group creates the following file in your output S3 bucket: 000000_0 (135.9 KB)

Terminating the Hive Cluster

This task shows you how to terminate the cluster that you created for this tutorial.

To terminate the cluster on the Altus console:

1. On the Altus console, go to the Data Engineering section of the side navigation panel and click Clusters.
2. On the Clusters page, click the name of the cluster that you created for this tutorial.


3. On the Cluster details page, review the cluster information to verify that it is the cluster that you want to terminate.
4. Click Actions and select Delete Cluster.
5. Click OK to confirm that you want to terminate the cluster.


Tutorial: Clusters and Jobs on Azure

This tutorial walks you through using the Altus console and CLI to create Altus Data Engineering clusters and submit jobs in Altus. The tutorial uses publicly available data that show the usage of Medicare procedure codes.

You must set up an ADLS account to store the tutorial job examples and input data and to write output data. Cloudera has created a jar file that contains the job examples and input files that you need to successfully complete the tutorial. Before you start the exercises, upload the files to the ADLS account that you set up for the tutorial files.

The tutorial has the following sections:

Prerequisites

To use this tutorial, you must have an Altus user account and the roles required to create clusters and run jobs in Altus.

Sample Jar File Upload

Upload the files you need to complete the tutorial.

Altus Console Login on page 66

Log in to the Altus console to perform the exercises in this tutorial.

Exercise 1: Installing the Altus Client on page 66

Learn how to install the Altus client and register an access key to use the CLI.

Exercise 2: Creating a Spark Cluster and Submitting Spark Jobs on page 68

Learn how to create a cluster with a Spark service and submit a Spark job using the Altus console and the CLI. This exercise provides instructions on how to create a SOCKS proxy and view the cluster and monitor the job in Cloudera Manager. It also shows you how to delete the cluster on the console.

Exercise 3: Creating a Hive Cluster and Submitting Hive Jobs on page 73

Learn how to create a cluster with a Hive service and submit a group of Hive jobs using the Altus console and the CLI. This exercise also walks you through the process of creating a SOCKS proxy and accessing Cloudera Manager. It also shows you how to delete the cluster on the console.

Prerequisites

Before you start the tutorial, ensure that you have access to resources in your Azure subscription and an Altus user account with permission to create clusters and run jobs in Altus.

The following are prerequisites for the tutorial:

• Altus user account, environment, and roles. An Altus user account allows you to log in to the Altus console and perform the exercises in the tutorial. An Altus administrator must assign an Altus environment to your user account so that you have access to resources in your Azure subscription. The Altus administrator must also assign roles to your user account to allow you to create clusters and run jobs in Altus.

• Public key. When you create a cluster, provide an SSH public key that Altus can add to the cluster. You can then use the corresponding private key to access the cluster after the cluster is created.

• Azure Data Lake Store (ADLS) account. Set up an ADLS account to store sample jobs and input data files for use in the tutorial. You also write job output to the same account. The ADLS account must be set up with permissions to allow read and write access when you run the Altus jobs.

For more information about creating an ADLS account in Azure, see Get started with Azure Data Lake Store using the Azure portal.


Sample Files Upload

Cloudera provides jar files that contain the Altus job example files and the input files used in the tutorial. Before you start the tutorial, upload the jar file to your ADLS account so the job examples and data are available for your use. Use the Azure Cloud Shell to upload the file.

To upload the jar file to your ADLS account, complete the following steps:

1. Follow the instructions in the Azure documentation to set up an Azure Cloud Shell with a bash environment.
2. Run the following command to download the altus_adls_upload_examples.sh script:

wget https://raw.githubusercontent.com/cloudera/altus-azure-tools/master/upload-examples/altus_adls_upload_examples.sh

You use the script to upload the files that you need for the tutorials to your ADLS account.

3. In the Azure Cloud Shell, follow the instructions in the Azure documentation to log in to Azure using the Azure CLI.

The Azure CLI is installed with Azure Cloud Shell so you do not need to install it separately.

4. Run the script to upload the tutorial files to your ADLS account:

bash ./altus_adls_upload_examples.sh --adls-account YourADLSaccountname --adls-path cloudera-altus-data-engineering-samples

5. Verify that the tutorial examples and input data files are uploaded to your ADLS account in the Altus Data Engineering examples folder.
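One way to check from the Azure Cloud Shell is with the Azure CLI's Data Lake Store commands; this is a hedged sketch, with your own account name substituted for the placeholder:

# List the uploaded sample folder; YourADLSaccountname is a placeholder.
az dls fs list --account YourADLSaccountname \
  --path /cloudera-altus-data-engineering-samples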

Altus Console Login

To access the Altus console, go to the following URL: https://console.altus.cloudera.com/.

Log in to Altus with your Cloudera user account. After you are authenticated, the Altus console displays your homepage.

The Data Engineering section displays on the side navigation panel. If you have been assigned roles and an environment in Altus, you can click on Clusters and Jobs to create clusters and submit jobs as you follow the tutorial exercises.

Exercise 1: Installing the Altus Client

To use the Altus CLI, you must install the Altus client and configure the client with an access key.

Altus manages access to the Altus services so that only users with a registered access key can run commands to create clusters, submit jobs, or use SDX namespaces. Generate and register an access key with the Altus client to create a credentials file so that you do not need to submit your access key with each command.

This exercise provides instructions to download and install the Altus client on Linux, generate a key, and run the CLIcommand to register the key.

To set up the Cloudera Altus client, complete the following tasks:

1. Install the Altus client.
2. Configure the Altus client with an access key.

Step 1. Install the Altus Client

To avoid conflicts with older versions of Python or other packages, Cloudera recommends that you install the Cloudera Altus client in a virtual environment. Use the virtualenv tool to create a virtual environment and install the client.


The following commands show how you can use pip to install the client on a virtual environment on Linux:

mkdir ~/altusclienv
virtualenv ~/altusclienv --no-site-packages
source ~/altusclienv/bin/activate
~/altusclienv/bin/pip install altuscli

To upgrade the client to the latest version, run the following command:

~/altusclienv/bin/pip install --upgrade altuscli

After the client installation process is complete, run the following command to confirm that the Altus client is working:

If virtualenv is activated: altus --version

If virtualenv is not activated: ~/altusclienv/bin/altus --version

Step 2. Configure the Altus Client with the API Access Key

You use the Altus console to generate the access key that you register with the client. Keep the window that displays the access key on the console open until you complete the key registration process.

To create and set up the client with a Cloudera Altus API access key:

1. Sign in to the Cloudera Altus console:

https://console.altus.cloudera.com/

2. Click your user account name and select My Account.
3. On the My Account page, click Generate Access Key.

Altus creates the key and displays the information on the screen. The following image shows an example of an Altus API access key as displayed on the Altus console:

Note: The Cloudera Altus console displays the API access key immediately after you create it.You must copy the access key information when it is displayed. Do not exit the console withoutcopying the keys. After you exit the console, there is no other way to view or copy the access key.


4. On the command line, run the following command to configure the client with the access key:

altus configure

5. Enter the following information at the prompt:

• Altus Access key. Copy and paste the access key ID that you generated in the Cloudera Altus console.
• Altus Private key. Copy and paste the private key that you generated in the Cloudera Altus console.

The private key is a very long string of characters. Make sure that you enter the full string.

The configuration utility creates the following file to store your user credentials: ~/.altus/credentials

6. To verify that the credentials were created correctly, run the following command:

altus iam get-user

The command displays your Altus client credentials.

7. After the credentials file is created, you can go back to the Cloudera Altus console and click OK to exit the access key window.

Exercise 2: Creating a Spark Cluster and Submitting Spark Jobs

This exercise shows you how to create a cluster with a Spark service on the Altus console and submit a Spark job on the console and the command line. It also shows you how to create a SOCKS proxy and access the cluster and view the progress of the job on Cloudera Manager.

In this exercise, you complete the following tasks:

1. Create a cluster with a Spark service on the console.
2. Submit a Spark job on the console.
3. Create a SOCKS proxy to access the Spark cluster on Cloudera Manager.
4. View the Spark cluster and verify the Spark job output.
5. Submit a Spark job using the CLI.
6. Terminate the Spark cluster.

Creating a Spark Cluster on the Console

You must be logged in to the Altus console to perform this task.

Note that it can take a while for Altus to complete the process of creating a cluster.

To create a cluster on the console:

1. In the Data Engineering section of the side navigation panel, click Clusters.
2. On the Clusters page, click Create Cluster.
3. Create a cluster with the following configuration:

Cluster Name: To help you easily identify your cluster, use your first initial and last name as a prefix for the cluster name. This tutorial uses the cluster name mjones-spark-tutorial as an example.

Service Type: Spark 2.x

CDH Version: CDH 5.14

Environment: The name of the Altus environment to which you have been given access for this tutorial. If you do not know which Altus environment to select, check with your Altus administrator.

Node Configuration: For the Worker node configuration, set the Number of Nodes to 3. Leave the rest of the node properties with their default settings.

Credentials: Configure your access credentials to Cloudera Manager:
• SSH Public Key: If you have your public key in a file, select File Upload and choose the key file. If you have the key available for pasting on screen, select Direct Input to enter the full key code.
• Cloudera Manager User: Set both the user name and password to guest.

4. Verify that all required fields are set and click Create Cluster.

The Altus Data Engineering service creates a CDH cluster with the configuration you set. On the Clusters page,the new cluster displays at the top of the list of clusters.

Submitting a Spark Job

Submit a Spark job to run on the cluster you created in the previous task.

To submit a Spark job on the console:

1. In the Data Engineering section of the side navigation panel, click Jobs.
2. Click Submit Jobs.
3. On the Job Settings page, select Single job.
4. Select the Spark job type.
5. Create a Spark job with the following configuration:

Job Name: Set the job name to Spark Medical Example.

Main Class: Set the main class to com.cloudera.altus.sample.medicare.transform

Jars: Use the tutorial jar file: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-spark2x.jar

Application Arguments: Set the application arguments to the ADLS path to use for job input and output.
• Add the tutorial ADLS path for the job input: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/input/
• Click + and add the ADLS path for the job output: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/output

Cluster Settings: Use an existing cluster and select the cluster that you created in the previous task.

The following figure shows the Submit Jobs page with the settings for this tutorial:


6. Verify that all required fields are set and click Submit Jobs.

The Altus Data Engineering service submits the job to run on the selected cluster in your Azure subscription.

Creating a SOCKS Proxy for the Spark Cluster

Use the Altus CLI to create a SOCKS proxy to log in to Cloudera Manager and view the cluster and progress of the job.

To create a SOCKS proxy to access Cloudera Manager:

1. In the Data Engineering section of the side navigation panel, click Clusters.
2. On the Clusters page, find the cluster on which you submitted the job and click the cluster name.
3. On the cluster detail page, click View SOCKS Proxy CLI Command.

Altus displays the command that you can use to create a SOCKS proxy to log in to the Cloudera Manager instance for the Spark cluster that you created.


4. Click Copy.
5. In a terminal window, paste the command.
6. Modify the command to use the name of the cluster you created and your private key, and then run the command:

altus dataeng socks-proxy \
  --cluster-name "YourClusterName" \
  --ssh-private-key="YourPrivateKey" \
  --open-cloudera-manager="yes"

The Cloudera Manager Admin console opens in a Chrome browser.

Note: The command includes a parameter to open Cloudera Manager in a Chrome browser. If you do not use Chrome, remove the open-cloudera-manager parameter so that the command displays instructions for accessing the Cloudera Manager URL from any browser.

Viewing the Cluster and Verifying the Spark Job Output

Log in to Cloudera Manager with the guest user account that you set up when you created the cluster.

To view the cluster and monitor the job on the Cloudera Manager Admin console:

1. Log in to Cloudera Manager using guest as the account name and password.
2. On the Home page, click Clusters on the top navigation bar.
3. On the cluster window, select YARN Applications.

The following screenshots show the cluster services and workload information that you can view on the Cloudera Manager Admin console:


When your Spark job completes, you can view the output of the Spark job in the ADLS account that you specified for your job output. The Spark job creates the following files in your ADLS output folder:


• Success (0 bytes)
• part-00000 (65.5 KB)
• part-00001 (69.5 KB)

Note: If you want to use the same ADLS output folder for the next exercise, go to the Azure portal and delete the files in the ADLS output folder. You will recreate the files when you submit the same Spark job using the CLI.
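With the Azure CLI, one way to clear the output folder looks like the following hedged sketch; substitute your own account name for the placeholder:

# Delete the Spark job output recursively so it can be recreated.
az dls fs delete --account YourADLSaccountname \
  --path /cloudera-altus-data-engineering-samples/spark/medicare/output --recurse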

Creating a Spark Job using the CLI

You can submit the same Spark job to run on the same cluster using the CLI. If you want to view the cluster and monitor the job on Cloudera Manager, stay logged in to Cloudera Manager.

To submit a Spark job using the CLI, run the following command:

altus dataeng submit-jobs \
  --cluster-name FirstInitialLastName-tutorialcluster \
  --jobs '{ "sparkJob": {
      "jars": [ "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-spark2x.jar" ],
      "mainClass": "com.cloudera.altus.sample.medicare.transform",
      "applicationArguments": [
        "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/input/",
        "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/output"
      ]
    } }'

To view the workload summary, go to the Cloudera Manager console and click Clusters > SPARK_ON_YARN-1. Cloudera Manager displays the same workload summary for this job as for the job that you submitted through the console.

To verify the output, go to the ADLS account that you specified for your job output and verify that it contains the files created by the Spark job:

• Success (0 bytes)
• part-00000 (65.5 KB)
• part-00001 (69.5 KB)

Terminating the Cluster

This task shows you how to terminate the cluster that you created for this tutorial.

To terminate the cluster on the Altus console:

1. On the Altus console, go to the Data Engineering section of the side navigation panel and click Clusters.
2. On the Clusters page, click the name of the cluster that you created for this tutorial.
3. On the Cluster details page, review the cluster information to verify that it is the cluster that you want to terminate.
4. Click Actions and select Delete Cluster.
5. Click OK to confirm that you want to terminate the cluster.

Exercise 3: Creating a Hive Cluster and Submitting Hive Jobs

This exercise shows you how to create a cluster with a Hive service on the Altus console and submit Hive jobs on the console and the command line. It also shows you how to create a SOCKS proxy and access the cluster and view the progress of the jobs on Cloudera Manager.


In this exercise, you complete the following tasks:

1. Create a cluster with a Hive service on the console.
2. Submit a group of Hive jobs on the console.
3. Create a SOCKS proxy to access the Hive cluster on Cloudera Manager.
4. View the Hive cluster and verify the Hive job output.
5. Submit a group of Hive jobs using the CLI.
6. Terminate the Hive cluster.

Creating a Hive Cluster on the Console

You must be logged in to the Altus console to perform this task.

Note that it can take a while for Altus to complete the process of creating a cluster.

To create a cluster with a Hive service on the console:

1. In the Data Engineering section of the side navigation panel, click Clusters.
2. On the Clusters page, click Create Cluster.
3. Create a cluster with the following configuration:

Cluster Name: To help you easily identify your cluster, use your first initial and last name as a prefix for the cluster name. This tutorial uses the cluster name mjones-hive-tutorial as an example.

Service Type: Hive

CDH Version: CDH 5.14

Environment: The name of the Altus environment to which you have been given access for this tutorial. If you do not know which Altus environment to select, check with your Altus administrator.

Node Configuration: For the Worker node configuration, set the Number of Nodes to 3. Leave the rest of the node properties with their default settings.

Credentials: Configure your access credentials to Cloudera Manager:
• SSH Public Key: If you have your public key in a file, select File Upload and choose the key file. If you have the key available for pasting on screen, select Direct Input to enter the full key code.
• Cloudera Manager User: Set both the user name and password to guest.

4. Verify that all required fields are set and click Create Cluster.

The Altus Data Engineering service creates a CDH cluster with the configuration you set. On the Clusters page,the new cluster displays at the top of the list of clusters.
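
The same cluster can also be created from the command line. The following is a hedged sketch, not a definitive invocation: it assumes a create-azure-cluster subcommand and the flag names and value formats shown here, all of which should be checked against altus dataeng create-azure-cluster help in your CLI version:

# Sketch only: subcommand, flag names, and value formats are assumptions.
altus dataeng create-azure-cluster \
  --cluster-name mjones-hive-tutorial \
  --service-type HIVE \
  --cdh-version CDH514 \
  --environment-name YourAltusEnvironment \
  --workers-group-size 3 \
  --public-key file:///path/to/your/key.pub \
  --cloudera-manager-username guest \
  --cloudera-manager-password guest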

Submitting a Hive Job Group

Submit multiple Hive jobs as a group to run on the cluster that you created in the previous step.

To submit a job group on the console:

1. In the Data Engineering section of the side navigation panel, click Jobs.
2. Click Submit Jobs.


3. On the Job Settings page, select Group of jobs.
4. Select the Hive job type.
5. Set the Job Group Name to Hive Medical Example.
6. Click Add Hive Job.
7. Create a job with the following configuration:

Job Name: Set the job name to Create External Tables.

Script: Select Script Path and enter the following script name: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/program/med-part1.hql

Hive Script Parameters: Select Hive Script Parameters and add the following variables and values:
• HOSPITALS_PATH: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/hospitals/
• READMISSIONS_PATH: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/readmissionsDeath/
• EFFECTIVECARE_PATH: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/effectiveCare/
• GDP_PATH: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/GDP/

Action on Failure: Select Interrupt Job Queue.

[Figure: the Add Job window with the settings for this job]

8. Click OK to add the job to the group.


On the Submit Jobs page, Altus adds the Hive Medical Example job to the list of jobs in the group.

9. Click Add Hive Job.
10. Create a job with the following configuration:

Job Name: Set the job name to Clean Data.

Script: Select Script Path and enter the following script name: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/program/med-part2.hql

Action on Failure: Select Interrupt Job Queue.

[Figure: the Add Job window with the settings for this job]

11. Click OK.

On the Submit Jobs page, Altus adds the Clean Data job to the list of jobs in the group.

12. Click Add Hive Job.
13. Create a job with the following configuration:

Job Name: Set the job name to Write Output.

Script: Select Script Path and enter the following script name: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/program/med-part3.hql

Hive Script Parameters: Select Hive Script Parameters and add the ADLS folder that you created for the job output as a variable:
• OUTPUT_DIR: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/output/

Action on Failure: Select None.

[Figure: the Add Job window with the settings for this job]

14. Click OK.

On the Submit Jobs page, Altus adds the Write Output job to the list of jobs in the group.

15. In the Cluster Settings section, select Use existing and select the Hive cluster that you created for this exercise.

The list of clusters displayed includes only those clusters that can run Hive jobs.

16. Click Submit Jobs to run the job group on your Hive cluster.

Creating a SOCKS Proxy for the Hive Cluster

Use the Altus CLI to create a SOCKS proxy to log in to Cloudera Manager and view the progress of the job.

To create a SOCKS proxy to access Cloudera Manager:

1. In the Data Engineering section of the side navigation panel, click Clusters.
2. On the Clusters page, find the cluster on which you submitted the Hive job group and click the cluster name.
3. On the cluster detail page, click View SOCKS Proxy CLI Command.

Altus displays the command that you can use to create a SOCKS proxy to log in to the Cloudera Manager instance for the Hive cluster that you created.


4. Click Copy.
5. In a terminal window, paste the command.
6. Modify the command to use the name of the cluster you created and your private key, and then run the following command:

altus dataeng socks-proxy \
  --cluster-name "YourClusterName" \
  --ssh-private-key="YourPrivateKey" \
  --open-cloudera-manager="yes"

The Cloudera Manager Admin console opens in a Chrome browser.

Note: The command includes a parameter to open Cloudera Manager in a Chrome browser. If you do not use Chrome, remove the open-cloudera-manager parameter so that the command displays instructions for accessing the Cloudera Manager URL from any browser.
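
For example, on a machine without Chrome you would run the same command without that parameter and then follow the printed instructions in the browser of your choice:

# Same command as above, minus --open-cloudera-manager; prints access instructions instead.
altus dataeng socks-proxy \
  --cluster-name "YourClusterName" \
  --ssh-private-key="YourPrivateKey"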

Viewing the Hive Cluster and Verifying the Hive Job Output

Log in to Cloudera Manager with the guest user account that you set up when you created the Hive cluster.

To view the cluster and monitor the job on the Cloudera Manager Admin console:

1. Log in to Cloudera Manager using guest as the account name and password.
2. On the Home page, click Clusters on the top navigation bar.
3. On the cluster window, select YARN Applications.

[Screenshots: the cluster services and workload information that you can view on the Cloudera Manager Admin console]


4. Click Clusters on the top navigation bar and select the default Hive service named HIVE-1. Then click HiveServer2 Web UI.

[Screenshots: the workload information that you can view for the Hive service]


5. When the jobs complete, go to the ADLS account that you specified for your job output and verify the file created by the Hive jobs.

The Hive jobs create the following file in your ADLS output folder: 000000_0 (135.9 KB)

Creating a Hive Job Group using the CLI

You can submit the same group of Hive jobs to run on the same cluster using the CLI. If you want to view the cluster and monitor the job on Cloudera Manager, stay logged in to Cloudera Manager.

To submit a group of Hive jobs using the CLI, run the submit-jobs command and provide the list of jobs in the jobs parameter. Run it on the same cluster and use the same job group name.

Run the following command:

altus dataeng submit-jobs \
  --cluster-name FirstInitialLastName-tutorialcluster \
  --job-submission-group-name "Hive Medical Example" \
  --jobs '[
    {
      "name": "Create External Tables",
      "failureAction": "INTERRUPT_JOB_QUEUE",
      "hiveJob": {
        "script": "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/program/med-part1.hql",
        "params": [
          "HOSPITALS_PATH=adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/hospitals/",
          "READMISSIONS_PATH=adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/readmissionsDeath/",
          "EFFECTIVECARE_PATH=adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/effectiveCare/",
          "GDP_PATH=adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/GDP/"
        ]
      }
    },
    {
      "name": "Clean Data",
      "failureAction": "INTERRUPT_JOB_QUEUE",
      "hiveJob": {
        "script": "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/program/med-part2.hql"
      }
    },
    {
      "name": "Output Data",
      "failureAction": "NONE",
      "hiveJob": {
        "script": "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/program/med-part3.hql",
        "params": [
          "outputdir=adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/output/"
        ]
      }
    }
  ]'

You can go to the Cloudera Manager console to view the status of the Hive cluster and jobs:

• To view the workload summary, click Clusters > SPARK_ON_YARN-1.
• To view the job information, click Clusters > HIVE-1 > HiveServer2 Web UI.

Cloudera Manager displays the same workload summary and job queries for this job as for the job that you submitted through the console.

When the jobs complete, go to the ADLS account that you specified for your job output and verify the file created by the Hive jobs. The Hive job group creates the following file in your ADLS output folder: 000000_0 (135.9 KB)
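
If you prefer to check job status from the terminal rather than Cloudera Manager, the Altus CLI can report it. The subcommand and flag below are assumptions rather than confirmed syntax; run altus dataeng help to find the exact command in your CLI version:

# Hypothetical status check: subcommand and flag names are assumptions.
altus dataeng list-jobs \
  --cluster-name FirstInitialLastName-tutorialcluster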


Terminating the Hive Cluster

This task shows you how to terminate the cluster that you created for this tutorial.

To terminate the cluster on the Altus console:

1. On the Altus console, go to the Data Engineering section of the side navigation panel and click Clusters.
2. On the Clusters page, click the name of the cluster that you created for this tutorial.
3. On the Cluster details page, review the cluster information to verify that it is the cluster that you want to terminate.
4. Click Actions and select Delete Cluster.
5. Click OK to confirm that you want to terminate the cluster.


Appendix: Apache License, Version 2.0

SPDX short identifier: Apache-2.0

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or areunder common control with that entity. For the purposes of this definition, "control" means (i) the power, direct orindirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership offifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications, including but not limited to software sourcecode, documentation source, and configuration files.

"Object" form shall mean any form resulting frommechanical transformation or translation of a Source form, includingbut not limited to compiled object code, generated documentation, and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, asindicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendixbelow).

"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) theWork and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole,an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remainseparable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications oradditions to thatWork or DerivativeWorks thereof, that is intentionally submitted to Licensor for inclusion in theWorkby the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. Forthe purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent tothe Licensor or its representatives, including but not limited to communication on electronic mailing lists, source codecontrol systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose ofdiscussing and improving theWork, but excluding communication that is conspicuouslymarked or otherwise designatedin writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whoma Contribution has been receivedby Licensor and subsequently incorporated within the Work.

2. Grant of Copyright License.

Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.

3. Grant of Patent License.

Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.

4. Redistribution.

You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:

1. You must give any other recipients of the Work or Derivative Works a copy of this License; and
2. You must cause any modified files to carry prominent notices stating that You changed the files; and
3. You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
4. If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.

You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

5. Submission of Contributions.

Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.

6. Trademarks.

This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty.

Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.

8. Limitation of Liability.

In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability.


While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work

To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.

Copyright [yyyy] [name of copyright owner]

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
