
Predix Insights



Contents

Overview 1

About Predix Insights 1

Architecture 2

Predix Insights User Roles 3

Concepts 3

Get Started 5

Predix Insights Setup 5

Configuring Predix Vault as the Default Credential Store 7

Creating an OAuth2 Client for the UI 8

Creating a Predix Insights Service Instance 9

Creating a Predix Insights Service Instance Using the CLI 10

Binding an Application to a Predix Insights Instance 12

Creating Service Keys 14

Updating OAuth2 Client for Predix Insights 15

Updating OAuth2 Client for Predix Insights Using UAAC 17

Creating Users and Groups for Predix Insights Using UAAC 18

Logging into the Predix Insights UI 19

Updating a Predix Insights Service Instance 19

Reprovisioning a Predix Insights Instance 21

Deleting a Predix Insights Service Instance 22

Supported Python Libraries 23

Develop 25

Quick Start 25

Configure Flow Template 30

Configure Flows 36

Manage Dependencies 43

Orchestrate 46

Configure Orchestration 46

Monitor 58

Monitoring Flows 58

Monitoring Orchestration 61

Release Notes 67


Q1 2019 67

Predix Insights (Beta) Version 1.0 Release Notes 67

Troubleshooting 69

General Issues 69


Copyright GE Digital © 2020 General Electric Company.

GE, the GE Monogram, and Predix are either registered trademarks or trademarks of General Electric Company. All other trademarks are the property of their respective owners.

This document may contain Confidential/Proprietary information of General Electric Company and/or its suppliers or vendors. Distribution or reproduction is prohibited without permission.

THIS DOCUMENT AND ITS CONTENTS ARE PROVIDED "AS IS," WITH NO REPRESENTATION OR WARRANTIES OF ANY KIND, WHETHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF DESIGN, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. ALL OTHER LIABILITY ARISING FROM RELIANCE UPON ANY INFORMATION CONTAINED HEREIN IS EXPRESSLY DISCLAIMED.

Access to and use of the software described in this document is conditioned on acceptance of the End User License Agreement and compliance with its terms.


Overview

About Predix Insights

Predix Insights is a big data processing and analytics service. It provides native Apache Spark support and orchestration features using Apache Airflow. Use it to build pipelines and run orchestrations to process analytic data in your runtime environment.

Predix Insights provides a managed infrastructure so you can concentrate on building your application. Use it to build pipelines to collect, store, process, and analyze large volumes of data without having to manage a distributed computing infrastructure (such as Hadoop) yourself.

Predix Insights embeds an Apache Spark-based framework for writing Spark applications in a declarative way and configuring the output source with minimal coding. The service also supports creating a multi-step orchestration, where you can run more than one analytic as a single workflow while resolving interdependencies.


Architecture

The following diagram shows the functional architecture of Predix Insights in a Spark runtime environment.


Predix Insights User Roles

The following user roles are supported. Access to Predix Insights functionality varies according to the user type you are provisioned for.

Admin: The administrator has access to all Predix Insights functionality. The admin can create, edit, start, stop, restart, or kill a flow.
  Access to:
    Develop → Flow Templates, Flows, Orchestration, Dependencies
    Monitor → Flows, Orchestration

Operator: The operator has access to Predix Insights monitoring functionality. The operator can monitor a flow or orchestration.
  Access to:
    Monitor → Flows, Orchestration

Concepts

The following is a list of common terms used in this document and their definitions.

Dependencies
Common dependencies, such as libraries, are uploaded once and stored in Predix Insights for reuse. After being uploaded, a dependent file is available from the central location on demand whenever called by a specific job.

Directed Acyclic Graph (DAG)
A DAG file contains the collection of tasks intended to be run as a single unit and defines their order of execution at runtime. It describes how you want the flow to be run by defining the correct order of tasks. A DAG is defined in a standard Python file and can describe more than one task.

Flow Template
A flow template contains the analytic code, including the required configuration and library files, in ZIP format. Once configured, a single analytic (flow template) can be run against multiple assets. This allows you to upload the analytic code once and store it for repeated use. The runtime configuration is separately defined in a flow. Separating the analytic code and the runtime configuration enables you to create multiple customized flows to run against a single analytic. For example, you might create an analytic for a gas turbine and then create separate flows for each turbine customization. If the analytic code changes, you need to change only the flow template.

Flow
A flow file contains the configuration details that define how data is to be processed at runtime. More than one flow file can be associated with the same analytic (flow template). An individual flow can be configured with runtime parameters such as a dataset location, an ID, a sensor ID, and whatever else is needed, so that the flow is specific to the given asset. A flow can be launched multiple times, on demand or on a schedule (hourly, daily, and so on).

Instance
An instance is an individual execution of a flow.


Operator
An operator describes a single task in a workflow. The execution order of operators (tasks) is controlled by the DAG file. Supported Predix Insights operators are the PredixInsightsOperator macro, the PythonOperator function, and the BranchPythonOperator function.

Orchestration
An orchestration is a group of analytic flows to be run together as a single unit; the task order for execution is defined in the corresponding DAG files. You can configure, execute, validate, and monitor analytic execution.

Scheduler
The scheduler provides the ability to schedule the execution of analytics, or orchestrations of analytics, on time-based intervals. A scheduled execution is called a job. You can create a job, retrieve both job definitions and history, update definitions, and delete jobs.

Task
A DAG file contains the collection of tasks intended to be run as a single unit. After an operator is instantiated, it is called a task. Each task within a DAG represents a node in the graph; a task can be either a macro or a function, and tasks share a common set of parameters.

A task instance is a specific run of a task. A task instance has a state, such as RUNNING, SUCCESS, FAILED, and so on.
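To make these concepts concrete, the following is a minimal DAG sketch using only the open-source Apache Airflow 1.x API on which Predix Insights orchestration is built. This is an illustration, not service-specific code: the PredixInsightsOperator macro is not shown because its exact signature is service-specific, so the sketch uses the standard PythonOperator, and the DAG ID, task IDs, and callables are hypothetical. The schedule_interval setting illustrates the time-based scheduling described above.

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract():
    # Placeholder callable: pull the input data for the analytic.
    print("extract input data")

def run_analytic():
    # Placeholder callable: run the analytic against the extracted data.
    print("run analytic")

# The DAG defines the unit of work and its schedule (here, daily).
dag = DAG(
    dag_id="example_flow_dag",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# Instantiating an operator creates a task (a node in the graph).
t1 = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
t2 = PythonOperator(task_id="run_analytic", python_callable=run_analytic, dag=dag)

# The DAG controls execution order: t1 runs before t2.
t1 >> t2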


Get Started

Predix Insights Setup

A task roadmap provides the recommended order of steps to complete a process and serves as a planning guide. Use the following roadmaps as a guide for creating and configuring your Predix Insights service instance.

• Prerequisites on page 5
• Setting Up Predix Insights Service on page 6
• Managing Predix Insights Service Instance on page 7

Prerequisites

Like other Predix platform services, authentication access to Predix Insights is controlled by the designated trusted issuer and managed by the User Account and Authentication (UAA) web service. A UAA service instance must be set up as the trusted issuer before getting started with the Predix Insights service.

Before you begin setting up Predix Insights, ensure the following requirements are met.

Cloud Foundry CLI (latest stable binary version): Use the Cloud Foundry CLI to deploy and manage applications and services. Download the latest stable binary from https://github.com/cloudfoundry/cli#downloads.

Git (latest): Download Git from https://git-scm.com/downloads.

Maven: If you are developing with Java, you can use Maven to manage and organize dependencies for your project. You can download Maven from https://maven.apache.org/download.cgi.

(Optional) Configure your proxy settings: Depending on your location and network configuration, you may need to configure your proxy settings to access remote resources. See Defining Proxy Connections to Remote Resources.

Create a UAA service instance: See Creating a UAA Service Instance.

(Optional) Create a Vault service instance, if using Vault as your default credential store:

• If using the Vault service for security credential retrieval by Predix Insights, create your Vault service instance. See Configuring Predix Vault as the Default Credential Store on page 7.
• A best practice is to store credentials needed by Predix Insights in a folder (for example, "properties") in Vault.
• Using the Vault Dashboard, create the folder structure for the Vault path. For example, <vault_path> = /insights-credential-store/<folder_name>.
• See Managing Paths and Secrets Using Vault Dashboard in the Vault service documentation for more information.

Setting Up Predix Insights Service

The following table provides the steps to set up the Predix Insights service.

1. Create a dedicated OAuth2 client to manage authentication for the Predix Insights UI. See Creating an OAuth2 Client for the UI on page 8.

2. Create the Predix Insights service instance. See Creating a Predix Insights Service Instance Using the CLI on page 10.

3. Retrieve connection information about your Predix Insights service instance. Select one of the following methods:
   • Bind your application to your service instance to view details in the VCAP_SERVICES environment variable. See Binding an Application to a Predix Insights Instance on page 12.
   • Create a service key to generate credentials to access the UI and API. See Creating Service Keys on page 14.

4. Update the OAuth2 client to use Predix Insights. Select from the following methods:
   • Using UAA Dashboard: Updating OAuth2 Client for Predix Insights on page 15.
   • Using UAAC: Updating OAuth2 Client for Predix Insights Using UAAC on page 17.

5. Create users and groups to authorize access to the Predix Insights UI. See Creating Users and Groups for Predix Insights Using UAAC on page 18.

6. Log into the Predix Insights UI. See Logging into the Predix Insights UI on page 19.

7. Download the Predix Insights Postman collection. See About the Postman Collections for Predix Insights on page 25.

8. Review the Predix Insights code samples. See Predix Insights Code Samples on page 26.

Managing Predix Insights Service Instance

• Update your Predix Insights service instance: see Updating a Predix Insights Service Instance on page 19.
• Reprovision your Predix Insights service instance: see Reprovisioning a Predix Insights Instance on page 21.
• Remove your Predix Insights service instance: see Deleting a Predix Insights Service Instance on page 22.

Configuring Predix Vault as the Default Credential Store

Predix platform provides the Predix Credential Store and Encryption Vault service ("Vault service") for securely storing and accessing credentials such as tokens, passwords, and API keys.

Note: Alternative options for credential storage when using Predix Insights are AWS Systems Manager Parameter Store or passing credentials through a property file.

Your Vault service instance can be configured to act as the default credential store for retrieving security credentials needed by Predix Insights to complete requests. Datasource credentials can be stored in Vault for automatic retrieval instead of being provided in a runtime.properties file whenever you upload a flow template or a flow file. Examples of datasource credentials you can store include, in part: predix.timeseries.scope, predix.timeseries.url, predix.timeseries.zoneid, predix.ts.token.uri, predix.stream.EH-KPIS.url, predix.stream.alarm.sink, aws.accessKeyId, aws.secretKeyId, aws.s3url, apm.uaaClientId, apm.uaaClientSecret, apm.uaaPassword, apm.uaaUrl.

The Vault service instance must be created beforehand and the datasource information provided. See Vault Service Setup for information about creating and configuring the Vault service. A best practice is to store credentials needed by Predix Insights in a folder (for example, "properties") in your Vault instance. Then set up Vault to be the default credential store, either when creating a new Insights instance or by updating your existing Insights instance. For more information about how to do this, see:

• Creating a Predix Insights Service Instance Using the CLI on page 10
• Updating a Predix Insights Service Instance on page 19
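For illustration only, the secrets stored at the Vault path might look like the following property entries. The key names are taken from the examples above; every value is a placeholder, and the exact set of keys depends on the datasources your flows use.

predix.timeseries.url=<time-series-query-url>
predix.timeseries.zoneid=<time-series-zone-id>
predix.timeseries.scope=<time-series-token-scope>
aws.accessKeyId=<access-key-id>
aws.secretKeyId=<secret-key>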


Creating an OAuth2 Client for the UI

Best practice is to create a dedicated OAuth2 client to manage authentication for accessing the UI, for ease of maintenance. However, if you prefer, you can reuse the same admin client to manage all of the Predix Insights service.

About This Task

The following procedure describes how to use the UAA Dashboard to create a dedicated client for accessing the Insights UI. For more information about using the UAA Dashboard, see Creating an OAuth2 Client in the UAA service documentation.

If you are reusing the same service admin client, configure as follows.

Optionally, you can use the UAAC command uaac client add -i with the configuration parameters described below.

Before You Begin

You need the admin credentials for the UAA service instance acting as trusted issuer for Predix Insights.

Procedure

1. Log into UAA Dashboard using admin credentials, then navigate to the Client Management tab.

2. Select the Add a new client option.

3. Configure the client as follows.

Client ID: Best practice is to name it insights-ui-client, but you can specify any name. Save the name provided for future use.

Client name: Specify the client name. For ease of use, best practice is to use the same name as the client ID. Save the name provided for future use.

Client secret: Specify the secret. Save the value for future use.

Verify new client secret: Re-enter the same secret value to confirm.

Scope: Leave blank. You will update this value later.

Authorized Grant Types: Specify the following types:
• client_credentials
• authorization_code
• refresh_token
For more information on grant types, see RFC 6749.

Authorities: Leave as default. You will update this value later.

Access token validity: Leave as default.

Refresh token validity: Leave as default.

Redirect URI: Specify a placeholder value. You will update this value later.

Auto approve: Leave as default. You will update this value later.

Signup redirect url: Leave as default.


Creating a Predix Insights Service Instance

Create a new service instance using the Predix console as follows. Alternatively, you can create the service from the command line.

Before You Begin

The Predix Credential Store and Encryption Vault service ("Vault service") must be set up first if you are using it as the default credential store for your Predix Insights service instance.

Procedure

1. Sign into your Predix account, then navigate to Catalog > Public Cloud Services > Predix Insights.
2. Select the desired plan and click Subscribe.

The Predix Console launches and the New Service Instance page displays. Complete the fields in this form to create a new Predix Insights service instance. Required fields are noted by an asterisk (*).

3. Select the desired Space from the list. Once the creation process is completed, the new service instance will reside in the selected location in the Predix cloud.

4. From the User Account & Authentication (UAA) field, select the UAA instance to act as the authenticator for the service instance.

• Select existing UAA: Select the UAA instance name from the list. You can create or modify OAuth2 clients using the Client Management tab in the UAA Dashboard.

• Subscribe to New UAA: Select to create a new UAA instance. Click the plus (+) symbol for instructions.

5. Complete the configuration fields as follows. Save the values you provide for future use.

Service Plan: Select the plan option for your service instance.

Service Instance Name: Specify the name of your Predix Insights service instance.

Client ID: Specify the name of the client ID for the dedicated client managing authentication for the Insights UI.

Client Secret: Specify the client secret for the dedicated client managing authentication for the Insights UI.

Auto Scaling Min: Controls the minimum number of slave nodes the instance cluster will scale to.
• Default value is 2.
• The minimum value is 2 and the maximum value is 50.

Auto Scaling Max: Controls the maximum number of slave nodes the instance cluster will scale to.
• Default value is 25.
• The minimum value is 2 and the maximum value is 50.

Spark Version: Specify the Spark version installed on the service instance cluster.
• Supported Spark versions: 2.1.0
• The cluster version is selected based on the version specified.

Python Libraries: Specify the Python libraries to be installed on the Spark cluster. Enter multiple library names as a comma-separated list. See Supported Python Libraries on page 23.

Vault URL: Required if configuring your Vault service instance to be the default credential store. Specify the URL for your Vault service instance followed by the path to your Insights credential store. For example: '{"vault_url":"https://predix-vault-asv-sb.gecis.io/v1/secret/secret_key/<vault_path>"}', where <vault_path> = /insights-credential-store/<folder_name>.

Vault Token: Required if configuring your Vault service instance to be your default credential store. Specify the Vault service token value for your Vault service instance.

6. Click Create Service.

The service instance is created in the Org and Space specified.

Creating a Predix Insights Service Instance Using the CLI

Create a new service instance using the Cloud Foundry command line as follows.

Before You Begin

• Prerequisites
• Creating an OAuth2 Client for the UI on page 8

Procedure

1. Enter a command similar to the following. Omit or change the example parameter values shown as required for your instance. For example, if you are not configuring your Vault service instance to be the default credential store, omit the vault parameters.

cf create-service predix-insights <plan_name> <my_service_instance_name> -c '{"trustedIssuerIds":["https://my-uaa.predix.io/oauth/token"], "clientId":"<insights-ui-client-id>", "clientSecret":"<insights-ui-client-secret>", "auto_scaling_min":2, "auto_scaling_max":15, "spark_version":"2.1.0", "python_libraries":["numpy","scipy"], "vault_url":"https://predix-vault-asv-sb.gecis.io/v1/secret/secret_key/<vault_path>", "vault_token":"<vault_token>"}'

plan_name: The plan option selected for the service.

my_service_instance_name: The name of your Predix Insights service instance.

trustedIssuerIds:
• Specify the trustedIssuerIds associated with this service instance.
  Note: Only the first trustedIssuerId specified in the array can be used to log into the UI.
• The trustedIssuerId URL must end in oauth/token.
• You can use a comma-separated list to specify multiple trusted issuers.
• You can retrieve this URL from the VCAP_SERVICES environment variable after binding your service instance to an application.
• Example configuration object: '{"trustedIssuerIds":["https://uaa-url.predix.io/oauth/token"]}'.

clientId: Specify the name of the client ID for the dedicated client managing authentication for the Insights UI.

clientSecret: Specify the client secret for the dedicated client managing authentication for the Insights UI.

auto_scaling_min: Controls the minimum number of slave nodes the instance cluster will scale to.
• Default value is 2.
• The minimum value is 2 and the maximum value is 50.
• Example configuration object: '{"auto_scaling_min":3}'.

auto_scaling_max: Controls the maximum number of slave nodes the instance cluster will scale to.
• Default value is 25.
• The minimum value is 2 and the maximum value is 50.
• Example configuration object: '{"auto_scaling_max":30}'.

spark_version: Specifies the Spark version installed on the service instance cluster.
• Supported Spark versions: 2.1.0
• A version of the cluster is chosen based on the spark_version provided.
• Example configuration object: '{"spark_version":"2.1.0"}'.

python_libraries: Specifies the Python libraries to be installed on the Spark cluster.
• Specify a list of package names as strings; they will be installed with the command pip install package1 package2 --upgrade.
• By default, no packages are installed.
• Example configuration object: '{"python_libraries":["numpy","scipy"]}'.
• Note: If an invalid Python library is submitted, cluster creation will fail and the cluster will have to be reprovisioned, or the service instance deleted and recreated.

vault_url: Required if configuring your Vault service instance to be the default credential store.
• Specify the URL to your Vault service instance followed by the path to your Insights credential store.
• String data value.
• Example configuration object: '{"vault_url":"https://predix-vault-asv-sb.gecis.io/v1/secret/secret_key/<vault_path>"}', where <vault_path> = /insights-credential-store/<folder_name>.

vault_token: Required if configuring your Vault service instance to be your default credential store.
• Specify the Vault service token value for your Vault service instance.
• Example configuration object: '{"vault_token":"<vault_token>"}'.

Note: If service instance creation succeeds, a request to create a cluster is submitted. Cluster creation can take 10 minutes or more, depending on the auto_scaling_min value specified and how many Python libraries are being installed.

2. To check the status of cluster creation, run the cf service <my_service_instance_name> command.

create in progress: The cluster is still in the process of being provisioned.

create succeeded: The cluster has been created and the instance is ready to be used.

create failed: The cluster failed to provision and will have to be reprovisioned, or the service instance deleted and recreated. This issue is usually caused by specifying an invalid Python library.

Binding an Application to a Predix Insights Instance

When you bind an application to your Predix Insights instance, the connection details are stored in the VCAP_SERVICES environment variable. The Cloud Foundry runtime uses the VCAP_SERVICES environment variable to communicate with a deployed application about its environment.

About This Task

After creating your service instance, you have two options to obtain connection details. Select from the following:

• Bind an application: Use the bind-service command to bind an existing application to your Predix Insights instance. Use the following steps.

• Create service keys: Create service keys to generate credentials to access the UI and the API. Service keys also provide access to the service from outside of Predix Cloud Foundry. See Creating Service Keys on page 14.


Procedure

1. Run the following command to bind your application to your service instance.

cf bind-service <your_app_name> <predix_insights_service_instance_name>

2. Restage your application to ensure the environment variable changes take effect.

cf restage <application_name>

3. Run the following command to view the environment variables for your application.

cf env <application_name>

The result lists the VCAP_SERVICES environment variables, which include the API and UI endpoints. For example,

"VCAP_SERVICES": {"predix-insights": [{"credentials": {"api": {"uri": "https://insights-api.data-services.predix.io/api/v1","zone-http-header-name": "Predix-Zone-Id","zone-http-header-value": "ab782903-919d-4198-

b19e-169e0b3464d2","zone-token-scopes": ["predix-insights.zones.ab782903-919d-4198-

b19e-169e0b3464d2.user","predix-insights.zones.ab782903-919d-4198-

b19e-169e0b3464d2.admin","predix-insights.zones.ab782903-919d-4198-

b19e-169e0b3464d2.operator"]

},"ui": {"uri": "https://insights-ab782903-919d-4198-

b19e-169e0b3464d2.predix-apphub-prod.run.aws-usw02-pr.ice.predix.io","zone-http-header-name": "Predix-Zone-Id","zone-http-header-value": "ab782903-919d-4198-

b19e-169e0b3464d2","zone-token-scopes": ["predix-insights.zones.ab782903-919d-4198-

b19e-169e0b3464d2.user","predix-insights.zones.ab782903-919d-4198-

b19e-169e0b3464d2.admin","predix-insights.zones.ab782903-919d-4198-

b19e-169e0b3464d2.operator","predix-apphub-service.zones.2843f653-fed7-451a-9fd3-

fc2c241d4817.user"]

}},"label": "predix-insights","name": "predix-insights-test","plan": "Dedicated","provider": null,"syslog_drain_url": null,"tags": [

© 2020 General Electric Company 13

Page 18: Predix Insights - ge.com · Predix Insights is a big data processing and analytics service. It provides native Apache Spark support and orchestration features using Apache Airflow.

"Predix-Insights","Predix Insights"],"volume_mounts": []}]}}

Creating Service Keys

You can retrieve Predix Insights credentials without binding an application to your Insights instance by creating a service key. Service keys generate credentials to access the UI and API, in addition to accessing Predix Cloud Foundry remotely.

About This Task

After creating your service instance, you have two options to obtain connection details. Select from the following:

• Bind an application: Use the bind-service command to bind an existing application to your Predix Insights instance. See Binding an Application to a Predix Insights Instance on page 12.

• Create service keys: Create service keys to generate credentials to access the UI and the API. Service keys also provide access to the service from outside of Predix Cloud Foundry. Use the following steps.

Procedure

1. Run the following command to create your service keys.

cf create-service-key <service-instance-name> <service-key-name>

2. Run the following command to view the service key.

cf service-key <service-instance-name> <service-key-name>

The output lists the service key. This service key provides the credentials needed to log into the UI and API. Use them for access from outside of the Predix Cloud or to use the API.

{"credentials": {"api": {"uri": "https://insights-api.data-services.predix.io/api/v1","zone-http-header-name": "Predix-Zone-Id","zone-http-header-value": "ab782903-919d-4198-

b19e-169e0b3464d2","zone-token-scopes": ["predix-insights.zones.ab782903-919d-4198-

b19e-169e0b3464d2.user","predix-insights.zones.ab782903-919d-4198-

b19e-169e0b3464d2.admin","predix-insights.zones.ab782903-919d-4198-

b19e-169e0b3464d2.operator"]

},"ui": {"uri": "https://insights-ab782903-919d-4198-

b19e-169e0b3464d2.predix-apphub-prod.run.aws-usw02-pr.ice.predix.io",

14 © 2020 General Electric Company

Page 19: Predix Insights - ge.com · Predix Insights is a big data processing and analytics service. It provides native Apache Spark support and orchestration features using Apache Airflow.

"zone-http-header-name": "Predix-Zone-Id","zone-http-header-value": "ab782903-919d-4198-

b19e-169e0b3464d2","zone-token-scopes": ["predix-insights.zones.ab782903-919d-4198-

b19e-169e0b3464d2.user","predix-insights.zones.ab782903-919d-4198-

b19e-169e0b3464d2.admin","predix-insights.zones.ab782903-919d-4198-

b19e-169e0b3464d2.operator","predix-apphub-service.zones.2843f653-fed7-451a-9fd3-

fc2c241d4817.user"]

}}

}

Updating OAuth2 Client for Predix Insights

To use an OAuth2 client for secure access to your Predix Platform service instance from your application, you must update your OAuth2 client to add additional authorities or scopes that are specific to each service. The following describes how to use the UAA Dashboard to do this for Predix Insights.

About This Task

If you use the UAA Dashboard to create additional clients, the client is created for the default client_credentials grant type. Some required authorities and scopes are automatically added to the client. You must add additional authorities or scopes that are specific to each service.

In addition, the admin client is not assigned the default authority to change the user password. To change the user password, you must add the uaa.admin authority to your admin client.

You have the option to add the additional authorities or scopes that are specific to Predix Insights using UAAC. For more information, see Updating OAuth2 Client for Predix Insights Using UAAC on page 17.

Use the following procedure to update the OAuth2 client using the UAA Dashboard. For more information about how to use the UAA Dashboard, see Updating the OAuth2 Client for Services in the UAA service documentation.

Procedure

1. In the UAA Dashboard login page, specify your admin client secret and click Login.
2. In UAA Dashboard, select the Client Management tab.

The Client Management tab has two views, Clients and Services. The Services view displays the service instances that you have created for your services.

Note: The service instances displayed in the Services view are the instances that you created using the UAA that you are trying to configure. The service instances that you created using some other UAA instance are not displayed on this page.

3. Select the Switch to Services View option.
4. In the Services view, select the service that you need to update.
5. Choose the dedicated client (insights-ui-client) created to manage authentication for the Predix Insights UI (Creating an OAuth2 Client for the UI on page 8).
6. Click Submit.
7. Click the Switch to Clients View option.


8. In the Clients view, click the edit icon corresponding to the client added in the previous step.
9. Complete the Edit Client form.

Note: The following are example values only. Replace with actual values for your environment. For information about retrieving credentials for your environment, see Binding an Application to a Predix Insights Instance on page 12 and Creating Service Keys on page 14.

Redirect URI: Specify a redirect URI to redirect the client after login or logout. For example: "https://insights-ab782903-919d-4198-b19e-169e0b3464d2.predix-apphub-prod.run.aws-usw02-pr.ice.predix.io"

Scopes: Add the following scopes for Predix Insights. For example:
• "predix-apphub-service.zones.2843f653-fed7-451a-9fd3-fc2c241d4817.user"
• "predix-insights.zones.ab782903-919d-4198-b19e-169e0b3464d2.admin"
• "predix-insights.zones.ab782903-919d-4198-b19e-169e0b3464d2.operator"
• "predix-insights.zones.ab782903-919d-4198-b19e-169e0b3464d2.user"

Authorities: If you select the client_credentials grant type, you must update the authorities with service-specific authorities. For example:
• "predix-apphub-service.zones.2843f653-fed7-451a-9fd3-fc2c241d4817.user"

Auto Approved Scopes: Add the following scopes for Predix Insights. For example:
• "predix-apphub-service.zones.2843f653-fed7-451a-9fd3-fc2c241d4817.user"
• "predix-insights.zones.ab782903-919d-4198-b19e-169e0b3464d2.admin"
• "predix-insights.zones.ab782903-919d-4198-b19e-169e0b3464d2.operator"
• "predix-insights.zones.ab782903-919d-4198-b19e-169e0b3464d2.user"


Updating OAuth2 Client for Predix Insights Using UAAC

Default scopes and authorities are assigned when you create a new OAuth2 client. You must also add the additional authorities and scopes that are unique to each service. You can add the required Predix Insights authorities and scopes using the UAA Command Line Interface (UAAC) as follows.

About This Task

For more information on installing UAAC, see https://github.com/cloudfoundry/cf-uaac. If you are using UAAC, add these additional Predix zone token scopes to the JSON Web Token (JWT) to enable applications to access the Predix Insights service.

You can perform this task using the UAA Dashboard. For more information, see Updating OAuth2 Client for Predix Insights on page 15.

Note: The following are example values only. Replace with actual values for your environment. For information about retrieving credentials for your environment, see Binding an Application to a Predix Insights Instance on page 12 and Creating Service Keys on page 14.

Procedure

1. Log into the UAAC client to get the admin token. For example,

uaac target https://insights-uaa-stage.predix-uaa.run.aws-usw02-dev.ice.predix.io
uaac token client get admin

2. Configure the redirect URI. For example,

uaac client update insights-ui-client --redirect_uri "https://insights-ab782903-919d-4198-b19e-169e0b3464d2.predix-apphub-prod.run.aws-usw02-pr.ice.predix.io"

3. Configure the required authorities. For example,

uaac client update insights-ui-client --authorities "predix-apphub-service.zones.2843f653-fed7-451a-9fd3-fc2c241d4817.user"

4. Configure the required scopes. For example,

uaac client update insights-ui-client --scope "predix-apphub-service.zones.2843f653-fed7-451a-9fd3-fc2c241d4817.user predix-insights.zones.ab782903-919d-4198-b19e-169e0b3464d2.admin predix-insights.zones.ab782903-919d-4198-b19e-169e0b3464d2.operator predix-insights.zones.ab782903-919d-4198-b19e-169e0b3464d2.user"

5. Configure the auto-approved scopes. For example,

uaac client update insights-ui-client --autoapprove "predix-apphub-service.zones.2843f653-fed7-451a-9fd3-fc2c241d4817.user predix-insights.zones.ab782903-919d-4198-b19e-169e0b3464d2.admin predix-insights.zones.ab782903-919d-4198-b19e-169e0b3464d2.operator predix-insights.zones.ab782903-919d-4198-b19e-169e0b3464d2.user"


Creating Users and Groups for Predix Insights Using UAAC

You must create users and assign them to the required groups to authorize access to the Predix Insights UI.

About This Task

The following steps describe how to set up the required users and groups using UAAC. If you prefer to use the UAA Dashboard, see Creating Users in a UAA Instance and Creating Groups in a UAA Instance.

Note: The following are example values only. Replace with actual values for your environment. For information about retrieving credentials for your environment, see Binding an Application to a Predix Insights Instance on page 12 and Creating Service Keys on page 14.

Procedure

1. Log into the UAAC client to get the admin token. For example,

uaac target https://insights-uaa-stage.predix-uaa.run.aws-usw02-dev.ice.predix.io
uaac token client get admin

2. Create the required user groups. For example,

uaac group add predix-apphub-service.zones.2843f653-fed7-451a-9fd3-fc2c241d4817.user
uaac group add predix-insights.zones.ab782903-919d-4198-b19e-169e0b3464d2.operator
uaac group add predix-insights.zones.ab782903-919d-4198-b19e-169e0b3464d2.admin
uaac group add predix-insights.zones.ab782903-919d-4198-b19e-169e0b3464d2.user

3. Create the users. For example,

uaac user add <user-id> \
  --emails <user-email> \
  --password <user-password>

Note: Adding a given name (--given_name) and a family name (--family_name) is optional.

4. Add the users to the groups. Note the following:

• Being a member of a specific group grants one permission. If a user is added as a member of the group, they are granted the permission.

• Each user must be a member of the AppHub user group (predix-apphub-service.zones.<guid>.user) in order to access the UI.

• Each user must be a member of the UAA group (predix-insights.zones.<tenantID>.user). In addition, each user must be a member of either the Predix Insights Admin group (predix-insights.zones.<guid>.admin) or the Operator group (predix-insights.zones.<guid>.operator), depending on the type of access required. The Admin group grants access to all UI features. The Operator group grants access only to the Monitoring tab.


Add a user to a group as follows.

uaac member add <group-name> <user-id>
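For example, to grant a user admin access using the example zone ID from step 2 (replace with your own values):

uaac member add predix-insights.zones.ab782903-919d-4198-b19e-169e0b3464d2.admin <user-id>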

Logging into the Predix Insights UI

The Insights UI is supported on Chrome, Firefox, and Safari browsers.

You can obtain the Insights UI URI from the service credentials. For more information, see Binding an Application to a Predix Insights Instance on page 12 or Creating Service Keys on page 14.

Only users who have been added to the required groups authorizing access to the Insights UI can log into the UI (Creating Users and Groups for Predix Insights Using UAAC on page 18). To log in, use your <user-id> and <user-password>.

Updating a Predix Insights Service Instance

You can update your Predix Insights service instance configuration by running the cf update-service command with the new configuration parameter values.

About This Task

The following configuration values can be changed:

• auto_scaling_max
• auto_scaling_min
• trustedIssuerIds
• vault_url
• vault_token

If you want to change the python_libraries or spark_version configuration values, see Reprovisioning a Predix Insights Instance on page 21.

Procedure

1. To update your existing Predix Insights instance configuration, run the following command and provide the configuration options to be changed.

cf update-service <my_service_instance_name> -c '{<options>}'
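For example, to raise the maximum cluster size using the auto_scaling_max example configuration object from the table below:

cf update-service <my_service_instance_name> -c '{"auto_scaling_max":30}'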


my_service_instance_name: The name of your Predix Insights service instance.

trustedIssuerIds:
• The trustedIssuerIds that are associated with this service instance.
• The trustedIssuerId URL must end in oauth/token.
• You can use a comma-separated list to specify multiple trusted issuers.
• You can retrieve this URL from the VCAP_SERVICES environment variable after binding your service instance to an application.
• Example configuration object: '{"trustedIssuerIds":["https://uaa-url.predix.io/oauth/token"]}'.

auto_scaling_min: Controls the minimum number of slave nodes the instance cluster will scale to.
• Default value is 2.
• The minimum value is 2 and the maximum value is 50.
• Example configuration object: '{"auto_scaling_min":3}'.

auto_scaling_max: Controls the maximum number of slave nodes the instance cluster will scale to.
• Default value is 25.
• The minimum value is 2 and the maximum value is 50.
• Example configuration object: '{"auto_scaling_max":30}'.

vault_url: Required if configuring your Vault service instance to be your default credential store.
• Specify the URL to your Vault service instance followed by the path to your Insights credential store.
• String data value.
• Example configuration object: '{"vault_url":"https://predix-vault-asv-sb.gecis.io/v1/secret/secret_key/<vault_path>"}', where <vault_path> = /insights-credential-store/<folder_name>.

vault_token: Required if configuring your Vault service instance to be your default credential store.
• Specify the Vault service token value for your Vault service instance.
• Example configuration object: '{"vault_token":"<vault_token>"}'.

Note: Updating a cluster configuration can take 10 minutes or more, depending on the auto_scaling_min value specified and how many Python libraries are being installed.

2. To check cluster status, run the following command.

cf service <my_service_instance_name>


update in progress: The cluster configuration is still in the process of being updated.

update succeeded: The cluster configuration has finished updating and the instance is ready to be used.

update failed: The cluster configuration failed to update. Try updating again.

Reprovisioning a Predix Insights Instance

You can reprovision your Predix Insights service instance by running the cf update-service command with the reprovision option and providing the new configuration parameter values. If you do not specify new parameter values, the old values will be used.

About This Task

Reprovision your Predix Insights service instance in the following situations.

• Cluster creation fails during service creation. This commonly happens when an invalid Python library is specified.

• To change the python_libraries or spark_version values for your instance.

To update other configuration settings, see Updating a Predix Insights Service Instance on page 19.

Note: While reprovisioning your service instance, any running flows on the existing cluster will be stopped, and there will be a short downtime during which flows cannot be run.

Procedure

1. Run the following command to reprovision your service instance.

cf update-service <my_service_instance_name> -c'{"reprovision":true, <options>}’

For example, to change the spark version installed on the cluster to version 2.2.1 and reprovision:

cf update-service <my_service_instance_name> -c'{"reprovision":true, "spark_version":"2.2.1”}'

my_service_instance_name: The name of your Predix Insights service instance.

reprovision: Controls whether the cluster for the instance is reprovisioned.
• Specify true or false.
• true terminates any running flows on the existing cluster.
• There will be a short downtime during which flows cannot be run.
• Example configuration object: '{"reprovision":true}'.
• Note: The reprovision option value must be true for the configuration changes to take effect when updating your service instance.

spark_version: Specifies the Spark version installed on the service instance cluster.
• Supported Spark versions: 2.1.0
• A version of the EMR cluster is picked based on the spark_version provided.
• Example configuration object: '{"spark_version":"2.1.0"}'.

python_libraries: Specifies the Python libraries to be installed on the Spark cluster.
• Specify a list of package names as strings; they will be installed with the command pip install package1 package2 --upgrade.
• By default, no packages are installed.
• Example configuration object: '{"python_libraries":["numpy","scipy"]}'.
• Note: If an invalid Python library is submitted, cluster creation will fail and the cluster will have to be reprovisioned, or the service instance deleted and recreated.

2. To check cluster status, run the following command.

cf service <my_service_instance_name>

update in progress: The cluster is still in the process of being reprovisioned.

update succeeded: The cluster has finished reprovisioning and the instance is ready to be used.

update failed: The cluster failed to reprovision and will have to be reprovisioned again. This issue is usually caused by specifying an invalid Python library.

Deleting a Predix Insights Service Instance

Delete the Predix Insights service instance by issuing the following Cloud Foundry command.

cf delete-service <my_service_instance_name>

Where <my_service_instance_name> is the name of your Predix Insights service instance.


Supported Python Libraries

Predix Insights comes with an assortment of libraries built into the EMR clusters.

Table 1: Python Libraries Pre-Installed in Predix Insights

Library Version

Python 2.7

aws-cfn-bootstrap 1.4

awscli 1.11.29

Babel 0.9.4

backports.ssl-match-hostname 3.4.0.2

beautifulsoup4 4.4.1

boto 2.39.0

botocore 1.4.86

chardet 2.0.1

cloud-init 0.7.6

colorama 0.2.5

configobj 4.7.2

cycler 0.10.0

docutils 0.11

ecdsa 0.11

functools32 3.2.3.post2

futures 3.0.3

iniparse 0.3.1

Jinja2 2.7.2

jmespath 0.7.1

jsonpatch 1.2

jsonpointer 1.0

kitchen 1.1.1

lockfile 0.8

lxml 3.6.0

MarkupSafe 0.11

matplotlib 2.0.0

mysqlclient 1.3.7

nltk 3.2

nose 1.3.4

numpy 1.12.1

paramiko 1.15.1


PIL 1.1.6

pip 19.0.2

ply 3.4

pyasn1 0.1.7

pycrypto 2.6.1

pycurl 7.19.0

pygpgme 0.3

pyliblzma 0.5.3

pyparsing 2.2.0

pystache 0.5.3

python-daemon 1.5.2

python-dateutil 2.6.0

pytz 2017.2

pyxattr 0.5.0

PyYAML 3.11

requests 1.2.3

rsa 3.4.1

scipy 0.17.0

setuptools 12.2

simplejson 3.6.5

six 1.10.0

subprocess32 3.2.7

urlgrabber 3.10

urllib3 1.8.2

virtualenv 12.0.7

windmill 1.6

yum-metadata-parser 1.1.4

Note: Contact the Predix Insights team to request support for any Python libraries not in the list above.


Develop

Quick Start

About Developing Analytics

Embedded in Predix Insights is an Apache Spark-based framework for analytics to fully leverage resources in the big data, Predix, and Predix Asset Performance Management (Predix APM) ecosystems. The Predix Spark Framework (PSF) manages the retrieval and storage of data, and provides a simplified development and execution experience.

If you are new to Spark, the embedded framework helps you get up and running quickly. The framework takes in custom analytic and job configurations. The job configuration specifies data input/output through its data connectors. The framework parses the job configuration, retrieves the input data, and sends it to the analytic. It also retrieves the analytic output and writes it to the output source specified by the job configuration.

The framework's runtime dataset builder provides efficient data read/write capability through parallelization and caching, and provides a row-based data structure. A developer can focus on algorithm implementation using a SQL-like language without first learning about the detailed usage of every data source or how to integrate into runtime executors. Generic RESTful (REST) service-based and tenant-aware data providers (connectors) are available out of the box. The framework's runtime executor supports both streaming and batch jobs.

Stable versions of the framework libraries are pre-installed in your Predix Insights instance. Also provided are several data connectors (libraries) which can be used to read/write data from/to Predix and APM data services. Samples for using both the framework and data connectors are available. For more information, see Predix Insights Code Samples on page 26.

About the Postman Collections for Predix Insights

You can use Postman to test your Predix Insights REST API requests. After you have created your Predix Insights service instance, you can use the sample Postman collections to customize your REST requests and test them out.

Sample requests for all Predix Insights APIs can be found in the Postman collections provided for your use. These Postman collections can be used as is or customized to test your service REST API requests. Follow the instructions at https://github.com/PredixDev/predix-insights-examples/tree/master/predix-insights-postman to import these collections and to configure your Postman environment to connect to your service instance.

Configuring REST Request Headers

The Predix Insights service REST API endpoints provide a JSON interface for passing data. The application makes HTTPS requests to these API interfaces.

All REST requests must contain the following headers for multi-tenant support (provided by the Predix UAA service). Whenever the application makes a call to the service, these headers identify which tenant the application belongs to and must be configured as follows.


UAA Authorization
  Header Name: Authorization
  Header Value: "Bearer <UAA token>"

Predix Zone
  Header Name: Predix-Zone-Id
  Header Value: <Predix-Zone-Id>

The following figure shows an example of how to set up the headers for a REST request in Postman.

For information about retrieving VCAP environmental variables, see Binding an Application to a Predix Insights Instance on page 12.
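As a sketch of making the same request outside Postman, the following Python script sets both required headers. The endpoint path /flow-templates is hypothetical and shown only to illustrate header usage; the token and zone ID placeholders come from your UAA instance and your service key or VCAP_SERVICES.

import requests

# Placeholders: use the values from your own service key or VCAP_SERVICES.
API_URI = "https://insights-api.data-services.predix.io/api/v1"
ZONE_ID = "<Predix-Zone-Id>"
UAA_TOKEN = "<UAA token>"  # obtained from your UAA trusted issuer

headers = {
    "Authorization": "Bearer {}".format(UAA_TOKEN),
    "Predix-Zone-Id": ZONE_ID,
}

# Hypothetical endpoint path, used only to demonstrate the required headers.
response = requests.get(API_URI + "/flow-templates", headers=headers)
print(response.status_code)
print(response.text)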

Predix Insights Code Samples

Predix Insights provides several types of code samples to help you with your development.

The Predix Insights code samples are located in the following GitHub repository for your use and download: Predix Insights Examples. The following table provides a list and description of the samples available, organized by type.

Predix Insights Quick Start Examples

• Predix Insights Spark Quick Start Examples: Example files (Java, Python, Scala) are available with instructions to show you how to quickly start using Predix Insights to create and launch a flow. See README.md.

Predix Spark Framework Examples

• Predix Spark Framework Examples: Example files (Java, Python) are available with instructions to show you how to create and launch a flow using the sample analytic code and datasources provided. The Predix Spark Framework handles both deploy and runtime property loading. It provides a config-based approach for input and output data, and makes it easier to retrieve data from multiple sources. Stable versions of the framework libraries are preinstalled in your Insights instance. See README.md.

• Predix Spark Framework for Java Language Examples: The Predix Spark Framework for Java language examples provide the following benefits:
  • Handles both deploy and runtime property loading.
  • Provides a config-based approach for inputs and output.
  • Leverages Predix Connectors to retrieve data from all Predix data sources. Also works with open source connectors like blobstore (S3), Postgres (JDBC), etc.
  See spark-java-examples.

• Predix Spark Framework for Python Language Examples: The Predix Spark Framework for Python language examples provide the following benefits:
  • Handles both deploy and runtime property loading.
  • Provides a config-based approach for input and output.
  • Leverages Predix Connectors to retrieve data from all Predix data sources. Also works with open source connectors like blobstore (S3), Postgres (JDBC), etc.
  See spark-python-examples.

• Predix Insights DAG Examples: The following annotated DAG examples are provided:
  • quickstart-example-dag.py: Shows a simple example of calling user-defined flows.
  • iris_example_sequential.py: Shows an example of how to call different operators in a sequential order.
  • iris_example_parallel_dag: Shows an example of how two operators can be run together and executed in parallel, then brought back into a single execution pipeline.
  See workflow-examples.

Predix Spark Data Connectors

• Predix Spark Data Connectors: Example files with instructions and datasources provided for using the connectors to retrieve data and write output as a Spark DataFrame. See README.md.

• Predix Blobstore Connector Example: An Apache Spark-based connector example that reads Predix Time Series data from the S3 file. See spark-blob-example.

• Event Hub PySpark Connector Read Example (predix.stream): An Apache Spark-based connector example that reads generic messages from Event Hub. See pyspark-eventhub-read-example.

• Event Hub PySpark Connector Time Series Write Example (predix.stream.ts): An Apache Spark-based connector example that supports read/write functionality and will publish/subscribe to an Event Hub topic. This example shows how to publish Time Series data to Event Hub. See pyspark-eventhub-write-example.

• Predix Time Series PySpark Connector (predix.ts): An Apache Spark-based connector example that can pull data from Predix Time Series from multiple executors simultaneously, create an RDD in each, and output a DataFrame with the RDDs as a reference. This connector does the following:
  • Reads data from Predix Time Series and returns a Spark DataFrame.
  • Authentication calls to the UAA provider are automatically handled.
  • Supports reading time series data using different time intervals.
  See pyspark-timeseries-example.

• Event Hub Generic Stream Spark Connector Read Example (predix.stream): An Apache Spark-based connector example that reads generic messages from Event Hub. See spark-eventhub-read-stream-example.

• Event Hub Alarm Spark Connector Example (predix.stream.alarm): An Apache Spark-based connector example that publishes alarm messages (APM Alerts) to Event Hub. See spark-eventhub-alarm-write-stream-example.

• Event Hub Stream Spark Connector Time Series Write Example (predix.stream.ts): An Apache Spark-based connector example that supports read/write functionality and will publish/subscribe to an Event Hub topic. This example shows how to publish Time Series data to Event Hub. See spark-eventhub-ts-write-stream-example.

• Event Hub Generic Stream Spark Connector Write Example (predix.stream): An Apache Spark-based connector example that writes generic string messages to Event Hub. See spark-eventhub-generic-write-stream-example.

• Event Hub Connector Read Time Series Example (predix.stream): An Apache Spark-based connector example that shows how to read Time Series messages from Event Hub. See spark-eventhub-ts-read-stream-example.

• Predix Time Series Spark Connector Example (predix.ts): An Apache Spark-based connector example that can pull data from Predix Time Series from multiple executors simultaneously, create an RDD in each, and output a DataFrame with the RDDs as a reference. This connector does the following:
  • Reads data from Predix Time Series and returns a Spark DataFrame.
  • Authentication calls to the UAA provider are automatically handled.
  • Supports reading time series data using different time intervals.
  See spark-timeseries-example.

• Postgres Database Spark Connector Example: An Apache Spark-based connector example that shows how to read/write Spark DataFrames to/from a Postgres database. See spark-postgres-example.

• Open Source Spark Redis Connector: An Apache Spark-based connector example that shows how to read/write from/to Redis. See spark-redis-example.

Predix Asset Performance Management (Predix APM) Data Connectors

• Predix APM Asset (predix.apm.asset) Connector Example: An Apache Spark-based connector example that can read multiple assets/tags from APM Asset and Tags. Authentication calls to the UAA provider are automatically handled. See spark-apm-asset-example.

• Predix APM Time Series (predix.apm.ts) Connector Example

An Apache Spark-based connector

example that can read/write to/from APM

Time Series. Authentication calls to the

UAA provider will be automatically

handled.

See spark-apmts-example

Configure Flow Template

About Flow Template Files

A flow template contains the analytic code, including the required configuration and library files, in ZIP format. You define a single flow template for each analytic you want to run data against, and can run multiple flows or orchestrations against the same template.

Required File Structure

The flow template file is created outside of Predix Insights and can be added using the Develop > Flow Templates page in the UI or by using the REST APIs.

The following is an example of a possible structure for a template file, which includes the necessary component dependencies. For sample template files, see Predix Insights Code Samples on page 26.

Tip: Create the ZIP file from within the analytic's directory to ensure the archive has the correct internal structure (see the sketch after the following list).

• conf folder – contains the conf.json file. The conf.json file contains the basic flow template configuration in JSON format.
• lib folder – contains any required library files for the analytic code file. For example, if your analytic code requires a specific version of a library, add it here.
• .jar file(s) containing the analytic code.
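As a minimal sketch, the following Python snippet assembles a template archive with this layout. The archive and file names (my-analytic.zip, my-analytic.jar, custom-lib-1.0.jar) are hypothetical placeholders; substitute your own artifacts.

import zipfile

# Run this from within the analytic directory so that conf/ and lib/
# sit at the root of the archive, matching the structure above.
with zipfile.ZipFile("my-analytic.zip", "w") as archive:
    archive.write("conf/conf.json")          # basic flow template configuration
    archive.write("lib/custom-lib-1.0.jar")  # version-pinned library dependency
    archive.write("my-analytic.jar")         # the analytic code itself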

A flow run against a template inherits the analytics configuration defined in that flow template. This means you can create a flow template just once for an analytic, package it as described, and save it in Predix Insights for reuse in more than one flow (configuration). When defining a flow, you add Spark parameters to set the Spark configuration. For example, to run several Spark configurations against the same analytic code, you would define a separate flow for each Spark configuration, then run each flow against the same template file.


Additional metadata is required when defining the template files; it helps identify the template file and aids in searching.

Related Concepts

About Flow Files on page 36

A flow file describes the tasks which can be run independently and collectively define how your data is to be routed and processed. The flow file is created outside of Predix Insights and can be uploaded or edited using the UI.

Creating a Flow Template File Using UI

A flow template file contains the analytics code, configuration, and library files. The analytic code is added to the template as a ZIP file. You can use the Flow Template page in the UI to create a template file as follows.

Before You Begin

Bundle the analytic code into a ZIP file using the required directory structure described in About Flow Template Files on page 30.

Procedure

1. Navigate to Develop > Flow Templates.
2. Click Upload.
   The Upload a Flow Template dialog displays.


3. Provide information about the template in the following fields.

• Name – The name of the flow template.
• Description – The description of this template.
• Version – The version number of the template.
• Type – The analytics code technology type. Supported types are SPARK_JAVA, SPARK_PYTHON, and SPARK_R.
• Tags – Enter keywords descriptive of the template. The keywords provided in this field will be added as values in the Tags list when using Search in the Flow Templates and Flows pages.

4. Choose from the following methods to upload the template file.


• Navigate to the folder containing the template .zip file.
• Drag and drop the file to the UI drag and drop area.

5. Click Upload.

If the ZIP file is accepted, the template is added to the Flow Templates page.


Creating a Flow Template Using the API

A flow template file contains the analytics code, configuration, and library files that are the building blocks of flows (tasks). The analytic code is attached to the template as a ZIP file. You can use the REST APIs to create a flow template file as follows.

Before You Begin

Bundle the analytic code into a ZIP file as described in About Flow Template Files on page 30.

Create the flow template and attach the analytic code as a ZIP file by issuing the following REST API command.

POST <insights_api_uri>/api/v1/flow-templates

The request payload is a multipart/form-data type with the following parts. The following example shows sample values; modify as needed for your environment.

{"version": "1.0","user": "cf3-dev-client","name": "PREDIX-TS-EH-ALALRM3","description": "desc","type": "SPARK_PYTHON","tags": ["tag1:val1","tag2:val2"

]}

• file (Required) – Executable analytic ZIP file.
• version (Required) – Version of the analytic.
• user (Required) – Client id of UAA.
• name (Required) – Name of the analytic.
• description (Required) – Description of the analytic.
• type (Required) – The analytics code technology type. Supported types are: Java, Scala: SPARK_JAVA; Python: SPARK_PYTHON; R: SPARK_R.
• tags (Required) – Tags (key:value pairs) that are descriptive of the template. The keywords provided in this field can be used to search.
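As a sketch, the request can be issued from Python with the requests library. The URL, OAuth2 token, and ZIP file name are placeholder assumptions, and the exact multipart encoding of the metadata parts (for example, how tags are serialized) may differ in your environment.

import requests

url = "https://<insights_api_uri>/api/v1/flow-templates"
headers = {"Authorization": "Bearer <oauth2-token>"}  # placeholder token

metadata = {
    "version": "1.0",
    "user": "cf3-dev-client",
    "name": "PREDIX-TS-EH-ALALRM3",
    "description": "desc",
    "type": "SPARK_PYTHON",
    "tags": "tag1:val1,tag2:val2",  # assumption: tags sent as a simple form field
}

with open("my-analytic.zip", "rb") as template_zip:
    response = requests.post(
        url,
        headers=headers,
        data=metadata,                 # metadata parts of the multipart payload
        files={"file": template_zip},  # the executable analytic ZIP part
    )
response.raise_for_status()
print(response.json()["id"])  # id of the newly created flow template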

The following is a sample response showing example template attributes.

{"id": "6d62c36e-0ca1-4d87-9022-8f5fb3539224","created": 1520432270451,

34 © 2020 General Electric Company

Page 39: Predix Insights - ge.com · Predix Insights is a big data processing and analytics service. It provides native Apache Spark support and orchestration features using Apache Airflow.

"updated": 1520432270451,"version": "1.0","user": "cf3-dev-client","name": "PREDIX-TS-EH-ALALRM3","description": "desc","type": "SPARK_PYTHON","tags": ["tag2:val2","tag1:val1"

],"blobPath": "andromeda/tenants/d7406965-66a8-49e7-9ca2-7ab775144e39/

analytics/PREDIX-TS-EH-ALALRM4/1.0/PREDIX-TS-EH-ALARM1.zip","flows": []

}

You can use Postman to test your Predix Insights REST API requests. For more information, see About the Postman Collections for Predix Insights on page 25.

Deleting a Flow Template Using UI

Before You Begin

You must first delete all flows associated with the template.

Procedure

1. Navigate to Develop > Flow Templates.
2. To search for the template, complete the Search By Name field, select keywords from the Tags list, and click Filter.
3. Select the template from the Name column.
   The Template Details page displays for that template.
4. Click Delete Flow Template.

Deleting a Flow Template Using the API

Before You Begin

You must first delete all flows associated with the template.

Delete the flow template by issuing the following REST API command.

DELETE <insights_api_uri>/api/v1/flow-templates/{flowTemplateId}

You can use Postman to test your Predix Insights REST API requests. For more information, see About the Postman Collections for Predix Insights on page 25.


Configure Flows

About Flow Files

A flow file describes the tasks which can be run independently and collectively define how your data is to be routed and processed. The flow file is created outside of Predix Insights and can be uploaded or edited using the UI.

After upload, the following flow details are provided in the summary table.

• Name – The name of the flow. Click the name of the flow to view the Flow Details page or to edit the flow.
• Type – The technology on which the flow is based; either Java or Python.
• Created – The date on which the flow was created.
• Updated – The date on which the flow details were updated.
• Flow Template – The name of the flow template associated with the flow.
• Actions – The actions available for the flow (described below).

The Flows user interface contains the following components:

Name

The name of the flow. Click the name of the flow to view the Flow Details page or to edit the flow.

Type

Type represents the technology on which the flow is based; either Java or Python.

Created

Created is the date on which the flow was created.

Flow Template

The flow template is the basic building block for creating a flow.

Current Status

Current Status returns the status of the selected flow. The values are:

• Created – the flow has been created but not launched.
• Accepted – the flow has been created and accepted into the queue. These flows are visible in Monitor > Flows.

Actions

For each flow you can:

• Launch – starts the flow. When a flow is started, it can be viewed in Monitor > Flows.
• Delete – removes the flow. A tenant cannot delete a flow until it has completed.


• Review flow details by clicking the flow name.

Flow Details

The Flow Details page lists:

• Parent flow template.
• Type of technology on which the flow is based.
• The version.
• Tags ascribed to the flow, which are used by the Flow Templates screen for filtering.

Click the Parent Flow Template link to view the Template Details page, which displays the flow template on which the flow is built. Click Spark Arguments Edit to edit the Spark arguments created for this flow. You can also click through to review any other flows created with this parent flow template.

Adding Spark Parameters

You add Spark parameters to set the Spark configuration when defining a flow using the Template Details page in the UI.

You have two methods to add parameters. You can use the Select Argument list to select one or more key:value pairs, or paste a JSON configuration in the box provided.

Spark Option – Predix Insights JSON Property – Description

• --name – name – String.
• --driver-cores – sparkArguments.driverCores – Integer.
• --driver-memory – sparkArguments.driverMemory – String.
• --executor-cores – sparkArguments.executorCores – Integer.
• --executor-memory – sparkArguments.executorMemory – String.
• --num-executors – sparkArguments.numExecutors – Integer. This option is valid only when dynamicAllocation is disabled. By default dynamicAllocation is enabled; you should use 'spark.dynamicAllocation.initialExecutors', 'spark.dynamicAllocation.minExecutors', and 'spark.dynamicAllocation.maxExecutors' to scale the number of executors registered with this application up and down based on the workload. For example:

{
    "confs": {
        "spark.dynamicAllocation.maxExecutors": "8",
        "spark.dynamicAllocation.minExecutors": "4",
        "spark.dynamicAllocation.initialExecutors": "4"
    }
}

If you want to use static allocation with this option, set spark.dynamicAllocation.enabled=false. For example:

{
    "numExecutors": 2,
    "confs": {
        "spark.dynamicAllocation.enabled": "false"
    }
}

• --conf spark.extraListeners – sparkArguments.sparkListeners – String[].
• --conf spark.logConf – sparkArguments.logConf – String.
• --conf spark.executorEnv.* – sparkArguments.executorEnv – Map<String, String>.
• --conf spark.driver.extraJavaOptions – sparkArguments.driverJavaOptions – String.
• --conf spark.executor.extraJavaOptions – sparkArguments.executorJavaOptions – String.
• --conf spark.predix.vault.enabled – sparkArguments.vaultEnabled – String (value true or false).
• --class – sparkArguments.className – String.
• --conf key=value – sparkArguments.confs – Map<String, String>.
• -Dkey=value appended to driver/executor JavaOptions – sparkArguments.systemProps – Map<String, String>.
• (arguments appended to the application) – sparkArguments.applicationArgs – String[].
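For illustration, a pasted JSON configuration that combines several of the properties above might look like the following. All values, including the class name, are placeholders.

{
    "sparkArguments": {
        "className": "com.example.MyAnalytic",
        "driverMemory": "2g",
        "executorMemory": "2g",
        "executorCores": 2,
        "confs": {
            "spark.dynamicAllocation.maxExecutors": "8"
        },
        "applicationArgs": ["200000"]
    }
}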

Predix Insights infers the following options from the flow template file, and you do not need to provide this information separately.


Spark Submission Option – Value – Description

• --jars – Flow template lib jars – Comma-separated list of local .jar files to include on the driver and executor classpaths.
• --py-files – Flow template python files – Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
• --files – Flow template conf files; flow conf files – Comma-separated list of files to be placed in the working directory of each executor.
• --driver-class-path – Flow template conf files; flow conf files; cluster level lib path; cluster level conf path – Extra classpath entries to prepend to the classpath of the driver.
• --conf spark.executor.extraClassPath – Flow template conf files; flow conf files; cluster level lib path; cluster level conf path – Extra classpath entries to prepend to the classpath of the executor.

Creating a Flow File Using UI

A flow file contains configuration details that define how data is to be processed at runtime. More than one flow file can be associated with the same analytic (flow template).

About This Task

There are two methods to create a flow:

• A flow configuration file can be uploaded and saved in Predix Insights without being associated with an analytic (template) until later.

• A flow can be created and associated with an analytic (template) at same time.

Procedure

1. Navigate to the Develop > Flow Templates page.
2. From the Actions column, choose the flow template and click Create Flow.
3. Add the name of the flow.
4. Click Create.
   The newly created flow is listed in the Flows column. Next, the flow must be edited to include Spark parameters.
5. Click the flow name, or navigate to the Flows page and click the flow to open and edit it.
6. Enter the Spark arguments.
   Spark parameters can be entered manually or by pasting a JSON file. For a list of Spark parameters and their JSON properties, see Adding Spark Parameters on page 37.

   Note: If using a Python analytic, fileName is a required Spark argument. The value is the name of the .py file uploaded to the Flow Template ZIP file.


Creating a Flow Using the APIs

A flow file contains configuration details that define how data is to be processed at runtime.

Create the flow and attach the analytic code (flow template) as a ZIP file by issuing the following REST API command.

POST <insights_api_uri>/api/v1/{flow-template}/flow

The request payload is a multipart/form-data type with the following parts. The following example shows sample values; modify as needed for your environment.

{"name": "job1","description": "desc","sparkArguments": {"applicationArgs": ["200000"

],"fileName": "test.py"

},"tags": ["tag1:val1","tag2:val2"

],"file1": "/Users/..........test1.properties","file2": "/Users/..........test2.properties"

}

• name (Required) – Name of the flow.
• description (Required) – Description of the flow.
• sparkArguments (Required) – Add the Spark parameters to set the Spark configuration. Provide a list of Spark arguments to run spark-submit.
• fileName (Required) – Executable analytic ZIP file (flow template).
• tags (Required) – Tags (key:value pairs) that are descriptive of the template. The keywords provided in this field can be used to search.
• file1 (Optional) – Specify the filepath specific to the flow. The following are supported: *.{xml, conf, txt, properties, json, config, yml, yaml}.
• file2 (Optional) – Specify the filepath specific to the flow. The following are supported: *.{xml, conf, txt, properties, json, config, yml, yaml}.
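A sketch of this request in Python with the requests library follows. The URL, token, flow template id, and file names are placeholders, and whether the JSON fields are sent as individual form parts or as one JSON part may differ in your environment.

import json
import requests

url = "https://<insights_api_uri>/api/v1/<flow-template-id>/flow"  # placeholder ids
headers = {"Authorization": "Bearer <oauth2-token>"}

payload = {
    "name": "job1",
    "description": "desc",
    "sparkArguments": json.dumps({
        "applicationArgs": ["200000"],
        "fileName": "test.py",  # required for Python analytics
    }),
    "tags": "tag1:val1,tag2:val2",  # assumption: tags as a simple form field
}

# Optional flow-specific configuration files attached as extra parts.
with open("test1.properties", "rb") as conf_file:
    response = requests.post(
        url, headers=headers, data=payload, files={"file1": conf_file})
response.raise_for_status()
print(response.json()["id"])  # id of the newly created flow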

{"id": "41cd47ac-dd3f-48ad-bb78-c4c6be0ddd6a","created": 1520434632590,"updated": 1520434632590,

40 © 2020 General Electric Company

Page 45: Predix Insights - ge.com · Predix Insights is a big data processing and analytics service. It provides native Apache Spark support and orchestration features using Apache Airflow.

"version": "1.0","name": "arf-java","description": "desc","type": "SPARK_JAVA","tags": ["tag2:val2","tag1:val1"

],"blobPath": "andromeda/tenants/d7406965-66a8-49e7-9ca2-7ab775144e39/

analytics/arf-java-example/1.0/flows/arf-java","sparkArguments": {"applicationArgs": ["200000"

],"fileName": "test.py"

},"flowTemplate": {"id": "dd3050fb-be57-4a3d-9762-006361545158","name": "arf-java-example"

}}

You can use Postman to test your Predix Insights REST API requests. For more information, see About the Postman Collections for Predix Insights on page 25.

Launching a Flow Using the API

You can launch (run, execute) a flow by issuing the following REST API command.

POST <insights_api_uri>/api/v1/flow-templates/{flowTemplateId}/flows/{flowId}/launch
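A sketch of the launch call in Python with the requests library (the ids and token are placeholders):

import requests

base = "https://<insights_api_uri>/api/v1"
headers = {"Authorization": "Bearer <oauth2-token>"}

launch_url = f"{base}/flow-templates/<flowTemplateId>/flows/<flowId>/launch"
response = requests.post(launch_url, headers=headers)
response.raise_for_status()

instance = response.json()
# The response carries the YARN application id and the initial status
# (for example, ACCEPTED), which you can use to track the run.
print(instance["id"], instance["status"])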

The following is a sample flow execution response.

{"id": "application_1517268829965_26388","applicationType": "SPARK","startTime": 1520459003215,"finishTime": 0,"status": "ACCEPTED","tags": [],"flow": {"id": "c88b0c57-0779-4a9a-840a-304867ab0de3","name": "arf-java"

},"submitDetails": {"command": "/opt/spark_client/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --deploy-mode cluster --master yarn --confspark.yarn.submit.waitAppCompletion=false --conf

spark.eventLog.enabled=true--conf spark.hadoop.yarn.timeline-service.enabled=false --conf

spark.yarn.dist.archives=--conf spark.yarn.dist.files= --name arf-java --driver-class-path

runtime.properties:job.json:/predix/andromeda/d7406965-66a8-49e7-9ca2-7ab775144e39/

dependencies/lib/*:/predix/andromeda/d7406965-66a8-49e7-9ca2-7ab775144e39/dependencies/conf:/predix/

andromeda/frameworks/predix-spark-framework/latest/lib/*:/predix/andromeda/frameworks/

© 2020 General Electric Company 41

Page 46: Predix Insights - ge.com · Predix Insights is a big data processing and analytics service. It provides native Apache Spark support and orchestration features using Apache Airflow.

predix-spark-framework/latest/conf --conf spark.yarn.appMasterEnv.ANDROMEDA_LIB_DIR=/predix/

andromeda/d7406965-66a8-49e7-9ca2-7ab775144e39/dependencies/lib --conf spark.executorEnv.ANDROMEDA_LIB_DIR=/

predix/andromeda/d7406965-66a8-49e7-9ca2-7ab775144e39/dependencies/lib --conf

spark.yarn.appMasterEnv.ANDROMEDA_CONF_DIR=/predix/andromeda/d7406965-66a8-49e7-9ca2-7ab775144e39/

dependencies/conf --conf spark.executorEnv.ANDROMEDA_CONF_DIR=/predix/andromeda/d7406965-66a8-49e7-9ca2-7ab775144e39/

dependencies/conf --conf spark.yarn.appMasterEnv.PREDIX-SPARK-FRAMEWORK_ROOT_DIR=/predix/andromeda/frameworks/

predix-spark-framework/latest --conf spark.executorEnv.PREDIX-SPARK-FRAMEWORK_ROOT_DIR=/predix/andromeda/frameworks/

predix-spark-framework/latest --confspark.executor.extraClassPath=runtime.properties:job.json:/predix/andromeda/

d7406965-66a8-49e7-9ca2-7ab775144e39/dependencies/lib/*:/predix/andromeda/d7406965-66a8-49e7-9ca2-7ab775144e39/

dependencies/conf:/predix/andromeda/frameworks/predix-spark-framework/latest/lib/*:/predix/andromeda/frameworks/

predix-spark-framework/latest/conf --jars /tmp/andromeda_local/d7406965-66a8-49e7-9ca2-7ab775144e39/instances/

arf-java_162026a436d/analytics/sample-1.0.0-SNAPSHOT.jar --files /tmp/andromeda_local/d7406965-66a8-49e7-9ca2-7ab775144e39/

instances/arf-java_162026a436d/flows/runtime.properties,/tmp/andromeda_local/d7406965-66a8-49e7-9ca2-7ab775144e39/

instances/arf-java_162026a436d/analytics/conf/job.json --classcom.ge.arf.jobexec.SimpleSparkJobRunner /

predix/andromeda/frameworks/predix-spark-framework/latest/lib/spark-predixts-connector-1.0.5-SNAPSHOT.jar 200000",

"environment": {"HADOOP_CONF_DIR": "/tmp/andromeda_local/

d7406965-66a8-49e7-9ca2-7ab775144e39/clusters/j-14E34BOB4LFI8/hadoop","SPARK_HOME": "/opt/spark_client/spark-2.1.0-bin-hadoop2.7","SPARK_CONF_DIR": "/tmp/andromeda_local/

d7406965-66a8-49e7-9ca2-7ab775144e39/clusters/j-14E34BOB4LFI8/spark"}

},"user": "andromeda","name": "arf-java"

}

Updating a Flow Configuration File

Procedure

1. Navigate to the Monitor > Flows page.
2. Locate the flow instance in the table and click the flow name.
   The Flow Details page displays.
3. Edit the Spark arguments using the list, or paste the JSON file containing the changed configuration.
4. Click Save.


Manage Dependencies

Working with Dependencies

The Dependencies page allows you to manage the flow dependencies.

Dependencies are files, such as spark-cassandra-connector or postgres-jdbc, that are called by your flows or DAGs. Instead of having to remember to package the file for each analytic, you can upload the dependency so it is always available.

Note: If you want to specify a particular version of a dependency for an analytic, package the dependency files with the analytic upload instead when creating the flow template.

Uploading a Dependency File Using UI

Upload libraries and other common dependency files and store them in Predix Insights, to be called when needed.

Procedure

1. Navigate to the Develop > Dependencies page.
2. Click Upload.
   The Upload a dependency dialog displays.


3. Select the file Type from the list (lib or conf).
4. Drag and drop the file, or select the file from your local drive.
5. Click Upload.

Uploading a Dependency File Using the API

Upload the dependency file by issuing the following REST API command.

POST <insights_api_uri>/api/v1/dependencies/

The request payload is a multipart/form-data type with the following parts.

• file (Required) – The dependency file. For example, runtime.properties.
• type (Required) – File type. Supported values are: conf (configuration file) and lib (library JAR file).

The following is a sample execution response.

[{"id": "2cec65a6-5f11-4c21-8351-56d5b5464e85","name": "runtime.properties","tenant": "d7406965-66a8-49e7-9ca2-7ab775144e39","type": "conf","deployed": false

}]


Deploying a Dependency File Using the API

Deploy the dependency file by issuing the following REST API command.

POST <insights_api_uri>/api/v1/dependencies/deploy/{dependencyId}

Deleting a Dependency File Using the API

Delete the dependency file by issuing the following REST API command.

DELETE <insights_api_uri>/api/v1/dependencies/{dependencyId}
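Taken together, upload, deploy, and delete can be scripted. A minimal sketch in Python with the requests library (the URL, token, and file name are placeholders):

import requests

base = "https://<insights_api_uri>/api/v1/dependencies"
headers = {"Authorization": "Bearer <oauth2-token>"}

# Upload a library JAR as multipart/form-data, per the parts table above.
with open("spark-cassandra-connector.jar", "rb") as jar:
    uploaded = requests.post(
        base + "/",
        headers=headers,
        files={"file": jar},
        data={"type": "lib"},
    ).json()
dependency_id = uploaded[0]["id"]  # the response is a list of dependency records

# Deploy the uploaded dependency so flows can use it.
requests.post(f"{base}/deploy/{dependency_id}", headers=headers).raise_for_status()

# Later, delete it when it is no longer needed.
requests.delete(f"{base}/{dependency_id}", headers=headers).raise_for_status()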


Orchestrate

Configure Orchestration

About Orchestration

An orchestration is a collection of tasks to be run together, organized in a manner that reflects their mutual relationships. The orchestration defines a group of analytic flows you want to run together as a single unit; the execution task order is defined in the corresponding directed acyclic graph (DAG) files.

The lowest unit of orchestration is a DAG file. A DAG models the relationships between the constituent tasks through dependencies. A single orchestration definition can include either one DAG or multiple DAGs, as long as they fit coherently together as a group.

Each task within a DAG represents a node in the graph, which can be either a macro or a function; both share a common set of parameters. An operator allows for the generation of certain types of tasks that become nodes in the DAG when instantiated.

For example, suppose you have three different analytic flows to run and want to control the order of execution. You can create a multi-step orchestration where the analytic output from step 1 is next sent as input to step 2, and then the results are provided as input to step 3. You define each task and the order of execution using a DAG file.

Directed Acyclic Graph (DAG) Development

A directed acyclic graph (DAG) file contains the collection of tasks to be run as a single unit and defines the order of execution at runtime. A DAG describes how you want the flow to be run by defining the correct order of tasks.

• Overview on page 46
• DAG Parameters on page 47
• Code Sample: DAG File on page 47
• Common Operator Parameters on page 47
• Macros on page 48
• Code Sample: PredixInsightsOperator on page 49
• predix_insights_flow_params
• Jinja Templating
• Functions on page 51
• Task Dependencies on page 52
• Configuration Management on page 53
• Sample: DAG File with Example Values on page 53

Overview

Follow these guidelines when developing a DAG file for an orchestration being executed in the Spark runtime.

1. The first argument to the DAG function is a unique id for any user DAG.
2. The second argument to the DAG function is a list of default arguments that apply to the overall functioning of the DAG.


DAG Parameters

• start_date – Specify using one of the following formats: airflow.utils.dates.days_ago(<number>) for <number> days ago, or Python datetime format, for example datetime(2018, 2, 1).
• retries – The maximum number of retries in failed mode.
• retry_delay – The delay time between retries.
• schedule_interval – Specify using one of the following formats: a cron preset (None, "@once", "@hourly", "@daily", "@weekly", "@monthly", "@yearly"); a datetime.timedelta object (for example, datetime.timedelta(minutes=15)); or a string denoting a cron expression, for example "*/5 * * * *" for scheduling every 5 minutes.

Code Sample: DAG File

import airflow
from airflow import DAG
from airflow.operators.predix_insights_operator import PredixInsightsOperator
from datetime import timedelta
import os

default_args = {
    'owner': 'EMR-SA-86d0-4c60-a462-70f27b9d01c6',  # any string
    'start_date': airflow.utils.dates.days_ago(1),  # start date; can also be a Python datetime
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(os.path.splitext(os.path.basename(__file__))[0],
          default_args=default_args,
          schedule_interval="*/15 * * * *")

Common Operator Parameters

An operator describes a single task in the flow. A DAG file describes the correct order to execute operators (tasks). An operator allows for the generation of certain types of tasks that become nodes in the DAG when instantiated. The following parameters are common to all supported Predix Insights operators: the PredixInsightsOperator macro, the PythonOperator function, and the BranchPythonOperator function.


Table 2: Common Parameters Applicable to Operators

• task_id (string) – Unique identifier for the task.
• owner (string) – Owner of the task.
• retries (integer) – Number of retries before failing the task.
• retry_delay (timedelta) – Delay between retries.
• max_retry_delay (timedelta) – Maximum delay interval between retries.
• start_date (datetime) – Start date for the task. The start_date for the task determines the execution_date of the first task instance.
• end_date (datetime) – [Optional] End date for the task; no task instances will run beyond this date.
• execution_timeout (timedelta) – Maximum time allowed for the execution of a task instance. If the runtime of the task goes beyond this, the task will fail.
• on_failure_callback (callable) – The function to call when a task instance fails.
• on_retry_callback (callable) – A function to be called when a task instance retries.
• on_success_callback (callable) – A function to be called when a task instance succeeds.

Macros

Predix Insights supports the PredixInsightsOperator macro, which can be used to execute Predix Insights flows.

Multiple macros may be instantiated for a single DAG. Each instantiation represents a Predix Insights flow. When the DAG is executed, the macro launches an instance of the flow and waits for it to complete.

The macro is idempotent: it first checks whether a flow instance already exists in Predix Insights. If an instance exists and is running, the macro waits for the instance to complete.

The following parameters are specific to the PredixInsightsOperator macro.


Table 3: PredixInsightsOperator Macro Parameters

• predix_insights_flow_name (string) – The name of the flow that was previously created in Predix Insights.
• predix_insights_launch_options (dictionary) – Extra options for launching a flow instance. It is a dictionary of options, where the key is a string and the value depends on the option being modified.
• predix_insights_flow_params (dictionary) – Extra options for substituting dynamic parameter tags inside arguments defined for already created Predix Insights flows.
• force_relaunch (boolean) – Whether the task should launch the flow instance with the unique name even when it is already present in the history. This could occur if the task originally failed, the error was corrected, and the task now needs to run again. By default, the task will skip and return if an analytic with the same unique name was already run.
• timeout_seconds (integer) – Maximum time in seconds to wait for the flow instance to complete. The default timeout is set to -1 (indefinite), meaning no timeout is enforced. To create a timeout period, enter a value in seconds. For example, a value of 120 seconds sets a 2-minute timeout period. This addresses jobs that may hang and never complete.
• response_check (function) – A check against the flow instance run response object. Returns True for 'pass' and False otherwise.

Code Sample: PredixInsightsOperator

macro_flow1 = PredixInsightsOperator(
    task_id='pio-123',
    predix_insights_flow_name='SparkPyPiFlow',
    predix_insights_launch_options={'name': 'my_flow_instance_name'},
    predix_insights_flow_params={'confs': {'data_start': "2018-01-31 00:00:00",
                                           'data_end': "2018-01-31 00:00:15",
                                           'partitions': "3"}},
    timeout_seconds=30,
    dag=dag)

macro_flow2 = PredixInsightsOperator(
    task_id='pio-456',
    predix_insights_flow_name='SparkPyPiFlow',
    predix_insights_launch_options={'name': 'my_flow_instance_name'},
    predix_insights_flow_params={'confs': {'data_start': "2018-01-31 00:00:01",
                                           'data_end': "2018-01-31 00:00:16",
                                           'partitions': "4"},
                                 'applicationArgs': ["arg1", "2", "3"]},
    timeout_seconds=30,
    dag=dag)

macro_flow2.set_upstream(macro_flow1)

predix_insights_flow_params

The predix_insights_flow_params operator parameter is a dictionary that dynamically updates arguments for an existing Predix Insights flow at runtime. The arguments are contained in the confs value object and applicationArgs array in the predix_insights_flow_params JSON object.

The confs object contains key:value pairs that are matched to flow parameters; the key is matched to the parameter name, and the value is provided as an argument. If the key does not match an existing parameter, the key:value pair is appended to the flow parameter list in the Spark arguments.

The applicationArgs array is a list of values that are appended to the end of the Spark submit command.

For example:

Given the following Spark submit command …

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  /predix/andromeda/frameworks/predix-spark-framework/latest/lib/arf_runner.py \
  [application-arguments]

… and Spark arguments (as displayed in the user interface), …

{
    'applicationArgs': [<default values>],
    ...
    'confs': {
        'start_value': <default value>,
        'another_argument': <default value>
    },
    ...
}

… the predix_insights_flow_params operator parameter in this Airflow DAG …

...
flow1 = PredixInsightsOperator(
    task_id='flow1-id',
    ...
    predix_insights_flow_params={'confs': {'start_value': "Test Value",
                                           'another_argument': "Another Value",
                                           'new_argument_not_in_ui': "New Value"},
                                 'applicationArgs': ["10", "20", "30"]},
    ...
    dag=dag)
...


… updates the Spark submit command as follows:

./bin/spark-submit \
  ... \
  --conf start_value=Test Value \
  --conf another_argument=Another Value \
  --conf new_argument_not_in_ui=New Value \
  ... \
  /predix/andromeda/frameworks/predix-spark-framework/latest/lib/arf_runner.py 10 20 30

Jinja Templating

Predix Insights supports Jinja templating in the predix_insights_flow_params parameter of the PredixInsightsOperator macro.

For example (Airflow DAG):

...
def push_function(**context):
    context['ti'].xcom_push(key='fruit', value='apple')
    context['ti'].xcom_push(key='applicationArgs', value=['52', '67', '12'])
    return 'values pushed'

flow1_task1 = PythonOperator(
    task_id='push_values_task',
    provide_context=True,
    python_callable=push_function,
    op_kwargs={},
    dag=dag)

flow1_task2 = PredixInsightsOperator(
    task_id='id-123',
    provide_context=True,
    ...
    predix_insights_flow_params={
        'confs': {'veggie': "onion",
                  'fruit': "{{ task_instance.xcom_pull(task_ids='push_values_task', key='fruit') }}"},
        'applicationArgs': "{{ task_instance.xcom_pull(task_ids='push_values_task', key='applicationArgs') }}"},
    ...
    dag=dag)
...

Note: Jinja templating is supported only in the predix_insights_flow_params parameter of the PredixInsightsOperator macro.

See the Apache Airflow Templating with Jinja documentation for more information.

Functions

Predix Insights supports two functions to execute Predix Insights flows.

• PythonOperator, for generic Python code execution.
• BranchPythonOperator, for conditional branching.


The following parameters are specific to the PythonOperator function.

Table 4: PythonOperator Parameters

• python_callable – User Python function.
• op_args – Arguments passed to the PythonOperator function.
• op_kwargs – Keyword arguments passed to the PythonOperator function.
• dag – The user DAG object to which the operator belongs.

The following is an example of the structure of a PythonOperator function.

def f(*args, **kwargs):
    first_arg = args[0]                    # fetches 5
    first_kw_arg_value = kwargs['param1']  # fetches 20

def on_success(context):
    print("task succeeded")

pyop = PythonOperator(
    task_id='py_callable',
    python_callable=f,
    op_args=[5],
    op_kwargs={'param1': 20},
    on_success_callback=on_success,
    dag=dag)

The BranchPythonOperator function is very similar to PythonOperator, except that the python_callable parameter must return the task_id of the downstream task it wants to branch to.

For example, if you have three downstream tasks with IDs child_task_1, child_task_2, and child_task_3, then branching out to child_task_2 requires that the python_callable parameter return "child_task_2".

def f(*args, **kwargs):
    ...
    return "child_task_2"

Task Dependencies

Relationships between tasks can be defined using two methods:

To schedule step2 to start after step1 is complete, use the following:

step1.set_downstream(step2)

To require step3 to complete before step4 starts, use the following:

step4.set_upstream(step3)
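As a short sketch (the task names are illustrative operators defined elsewhere in the DAG), the two methods can be combined to fan a pipeline out and join it back together:

# step1 fans out to step2 and step3, which run in parallel;
# step4 waits for both branches to finish before it starts.
step1.set_downstream(step2)
step1.set_downstream(step3)
step4.set_upstream(step2)
step4.set_upstream(step3)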


Configuration Management

A DAG can be uploaded as a Python (.py) or ZIP (.zip) file artifact. Multiple files can be uploaded at the same time.

ZIP artifacts must contain the actual DAG files (as .py) and may optionally contain other files, such as configuration files, that are read by the DAG at runtime. For example, if you want to read specific parameters from a configuration file and use them inside a DAG file, you must include the configuration file in the ZIP artifact you upload, and write the DAG file code so it reads the configuration.

The following is an example of the DAG ZIP file structure.

To reference a configuration file inside the DAG file, create a reference to the configuration file before accessing it, such as read_config.path('mydagconfig.json'), where the configuration file is named mydagconfig.json. Assume the structure of mydagconfig.json is as follows:

{"dag" :

Unknown macro: { "dag_name" }}

from airflow.utils import config as read_configuration

# Create a file reference and base your read code off of the reference
# instead of the filename itself.
read_config = json.load(open(read_configuration.path('mydagconfig.json')))
...
dag = DAG(read_config['dag']['dag_name'], ...)
...

Sample: DAG File with Example Values

This annotated DAG sample includes PredixInsightsOperator, PythonOperator, and BranchPythonOperator definitions with example values.

import airflow
from airflow import DAG
from airflow.operators.predix_insights_operator import PredixInsightsOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.sensors import HttpSensor
from datetime import timedelta, datetime

import json
import random
import os

import logging

log = logging.getLogger(__name__)

# Sets the default arguments the DAG needs. The owner can be any
# desired user-defined string.
default_args = {
    'owner': 'd7406965-66a8-49e7-9ca2-7ab775144e39',
    'start_date': datetime(2018, 2, 14, 0, 0),
    'retries': 2,
    'retry_delay': timedelta(minutes=2),
}

# Define the DAG object to be used. First specify the name, then the schedule
# using cron syntax (currently every minute), then the default arguments
# created above.
dag = DAG('new_dag_name', schedule_interval="*/1 * * * *", default_args=default_args)
log.info("DAG instantiated")

dag.doc_md = __doc__
log.info("Doc provided")

# This creates an operator which allows the user to specify different operator
# flows based on user-defined logic. In this example, we perform some arbitrary
# logic to show how we can specify the next operator we want run. Here we pass
# a random number as an argument, then a mod number as a keyword argument.
# Using these arguments, we perform a calculation; based on the result, we
# return the name of the next operation that the DAG should run.
def branch_option(*args, **kwargs):
    first_arg = args[0]                     # fetches the random number
    first_kw_arg_value = kwargs['modSize']  # fetches 10
    rand_number = (first_arg % first_kw_arg_value) + 1
    print("Random number from 1-10: " + str(rand_number))
    if rand_number < 11 and rand_number > 0:
        return 'Run_Iris'
    return 'Should_Not_Run_Op'

def on_branch_success(context):
    print("Fetch succeeded")

t1 = BranchPythonOperator(
    task_id='branch_py_callable',
    python_callable=branch_option,
    op_args=[random.randint(0, 100)],
    op_kwargs={'modSize': 10},
    on_success_callback=on_branch_success,
    dag=dag)

# Here we define an operator that should never run, as we have a conditional
# statement in our branch operator that skips past it every time.
def return_error(*args):
    first_arg = args[0]
    return first_arg

def on_failure_occurance(context):
    print("Branch should never come here!")

t2 = PythonOperator(
    task_id='Should_Not_Run_Op',
    python_callable=return_error,
    op_args=[-1],
    on_success_callback=on_failure_occurance,
    dag=dag)

# This operator refers to a flow that we have defined inside Predix Insights
# (defined using the 'predix_insights_flow_name'). If running this sample,
# either define a test flow named Iris or update this flow name to a flow
# that is defined in your Predix Insights instance.
t3 = PredixInsightsOperator(
    task_id='Run_Iris',
    predix_insights_flow_name='iris_example_flow',
    dag=dag)

# Example of a Python operator using simple print statements.
def print_message(*args):
    print("Analytic Pipeline Completed")

def on_print_success(context):
    print("Final Print Completed")

t4 = PythonOperator(
    task_id='print_operation',
    python_callable=print_message,
    on_success_callback=on_print_success,
    dag=dag)

# Set the order of each operator in the DAG.
t2.set_upstream(t1)
t3.set_upstream(t1)
t4.set_upstream(t3)

For more DAG file samples, see Predix Insights Code Samples on page 26.

Uploading a DAG File Using UI

A DAG file contains the collection of tasks to be run as a single unit and defines the order of execution at runtime.

About This Task

Using the Develop > Orchestration page in the UI, you can deploy, update, or delete a DAG or flow. Once a DAG or flow has been deployed, it can be viewed in the Monitor > Flows and Monitor > Orchestration pages.

Procedure

1. Navigate to Develop > Orchestration.
2. Click Upload.


3. Drag the file to the upload dialog, or, alternatively, click Choose File and navigate to the file.

4. Click Upload.

The DAG file is added to the Orchestration page as a new row.

Uploading a DAG File Using the API

The orchestration defines a group of analytic flows you want to run together as a single unit; the execution task order is defined in the corresponding directed acyclic graph (DAG) files. A DAG is defined in standard Python files and uploaded to Predix Insights.

You can upload the DAG file by issuing the following REST API command.

POST <insights_api_uri>/api/v1/dags

The request payload is a multipart/form-data type with the following parts.

• file (Required) – DAG file in Python format. For example, demo-dag.py.
• name (Required) – Name of the DAG file. For example, demo_dag.

The following is an example response showing sample values.

{"id": "f6a4a377-8f02-438a-86e9-45bfde7c1096","created": 1520551509019,"updated": 1520551509019,"version": "v-snapshot","name": "demo_dag","description": null,"type": "AIRFLOW","tags": [],

56 © 2020 General Electric Company

Page 61: Predix Insights - ge.com · Predix Insights is a big data processing and analytics service. It provides native Apache Spark support and orchestration features using Apache Airflow.

"blobPath": "andromeda/tenants/d7406965-66a8-49e7-9ca2-7ab775144e39/airflow/dags/demo_dag_vani6/demo_dag.py","deployed": false

}

Deploying a DAG File Using the API

Deploy a DAG file using the following REST API command.

POST <insights_api_uri>/api/v1/dags/{dagName}/deploy
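A sketch of uploading and then deploying a DAG in Python with the requests library (the URL, token, and file names are placeholders):

import requests

base = "https://<insights_api_uri>/api/v1/dags"
headers = {"Authorization": "Bearer <oauth2-token>"}

# Upload the DAG file as multipart/form-data with the file and name parts.
with open("demo-dag.py", "rb") as dag_file:
    uploaded = requests.post(
        base,
        headers=headers,
        files={"file": dag_file},
        data={"name": "demo_dag"},
    ).json()

# A freshly uploaded DAG reports "deployed": false; deploy it by name.
requests.post(f"{base}/{uploaded['name']}/deploy", headers=headers).raise_for_status()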


Monitor

Monitoring Flows

Monitoring Flow Status

The Monitor Flows dashboard provides flow and job status details. From here you can drill down for more details and actions using the shortcut links provided in the summary table.

About This Task

Completed jobs older than 30 days are not available in the Monitor Flows page. Job data is purged from the system 30 days after the job is run.

Procedure

Navigate to Monitor > Flows.

The Monitor Flows page displays.


The table provides summary information about each flow and job, with additional links for more information.


• Name – The name of the flow instance and its unique ID. The name is a link to the Flow Instance Details page, which has more information about the instance.
• Status – The current status of the flow:
  ◦ Accepted – The flow has been scheduled to run and is queued.
  ◦ Failed – The flow did not complete processing.
  ◦ Finished – The flow successfully completed processing.
  ◦ Killed – The flow stopped before processing was completed.
  ◦ Running – The flow is currently running.
  ◦ Succeeded – The flow execution was successful.
• Job Result – The result of the job.
• Start Time – The time the flow job started.
• Finish Time – The time the flow job completed.
• Flow – The name of the flow job. If you select the flow name from the column, the Flow Details page displays.

Retrieving Flow Execution Status Using the API

You can retrieve the flow execution instance status, including a list of containers created, using the APIs as follows.

The response provides the following information for each container: containerId, startTime, finishTime, node, memoryMB, and vcores.

You can retrieve flow instance status by issuing the following REST API command.

GET <insights_api_uri>/api/v1/instances/{instanceId}/containers/

The following is an example response showing sample values.

[{"containerId": "container_1517268829965_26388_01_000001","startTime": 1520459003697,"finishTime": 1520459008652,"node": "http://ip-10-72-153-242.us-west-2.compute.internal:8042","memoryMB": 1408,"vcores": 1

},{"containerId": "container_1517268829965_26388_02_000001","startTime": 1520459008754,"finishTime": 1520459014104,"node": "http://ip-10-72-153-228.us-west-2.compute.internal:8042","memoryMB": 1408,"vcores": 1

}]


Retrieving Logs Using the API

Retrieve the log files (stdout, stderr) for a container by issuing the following REST API command.

GET <insights_api_uri>/api/v1/instances/{instanceId}/containers/{containerId}/logs

To retrieve a specific log file type, use the following command.

GET <insights_api_uri>/api/v1/instances/{instanceId}/containers/{containerId}/logs/{logName}
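A sketch that lists an instance's containers and pulls the stderr log for each, in Python with the requests library (the URL, token, and instance id are placeholders):

import requests

base = "https://<insights_api_uri>/api/v1"
headers = {"Authorization": "Bearer <oauth2-token>"}
instance_id = "application_1517268829965_26388"  # from the flow launch response

containers = requests.get(
    f"{base}/instances/{instance_id}/containers/", headers=headers).json()

for container in containers:
    container_id = container["containerId"]
    # Fetch only the stderr log for this container.
    log = requests.get(
        f"{base}/instances/{instance_id}/containers/{container_id}/logs/stderr",
        headers=headers)
    print(container_id, log.text[:200])  # first part of each container's log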

Monitoring Orchestration

Monitoring DAG Status Using the UI

The Predix Insights UI provides a variety of options for viewing task status. Use Monitor > Orchestration to access the DAG dashboard (Figure 1: DAG Dashboard on page 62) to see a summary status. From here, you can drill down for more details, actions, and visualization options using the shortcut links to additional pages. Hover text provides additional details and options throughout the dashboard.

To access additional status pages, select from the Browse list in the dashboard for more options.


Figure 1: DAG Dashboard

Additional visualization options are available from the Links column in the DAG dashboard (Figure 1: DAG Dashboard on page 62) and from a menu bar in some of the Orchestration Monitoring pages (Figure 2: Orchestration Monitoring Visualization Menu on page 63).


Figure 2: Orchestration Monitoring Visualization Menu

Graph View
Use to visualize the dependencies and status for the selected DAG run.

Tree View
Use to visualize a tree representation of DAG runs spanning time. Can be used to identify different steps in a pipeline and identify blockers.

Task Duration
Use to visualize the duration of the different tasks across previous runs. Can be used to find any outliers and interpret how time is spent across DAG runs.

Task Tries
Use to visualize the specified number of task runs across time as a line graph.

Landing Times
Use to visualize the time when a job completes (minutes) against the time when the job should have started.

Gantt
Use to visualize task duration and overlap. Can be used to identify bottlenecks in the pipeline and how time is spent for the selected DAG run.

Details
Use to view selected DAG details in depth: number of times runs failed or succeeded, schedule interval, concurrency, default Spark arguments, task count, task IDs, filepath, and owner.

Code
Use to quickly see the code that generated the selected DAG.

Retrieving DAG Run Status Using the API

Retrieve a list of DAG runs and their status by issuing the following REST API command.

GET <insights_api_uri>/api/v1/dags/status/{dagName}/runs

The following is an example response showing sample values.

"demo_dag_vani5": [{"run_id": "scheduled__2018-03-08T23:30:00","dag_name": "demo_dag_vani5","dag_tenant_id": "d7406965-66a8-49e7-9ca2-7ab775144e39","end_date": "None","state": "running","execution_date": "2018-03-08 23:30:00","external_trigger": "False","dag_owner": "d7406965-66a8-49e7-9ca2-7ab775144e39","dag_id": "demo_dag_vani5","start_date": "2018-03-08 23:50:28.112710"}

© 2020 General Electric Company 65

Page 70: Predix Insights - ge.com · Predix Insights is a big data processing and analytics service. It provides native Apache Spark support and orchestration features using Apache Airflow.

]}

Retrieving Task Status Using the API

Retrieve a DAG task status by issuing the following REST API command.

GET <insights_api_uri>/api/v1/dags/status/{dagName}/tasks/runs/{runId}

The following is an example response showing sample values.

{"tasks": [{"task_id": "Prepare_data-vani1","end_date": "2018-03-08 23:52:24.842813","execution_date": "2018-03-08 23:30:00","state": "up_for_retry","duration": "2.48294","start_date": "2018-03-08 23:52:22.359873","job_id": "652114"

},{"task_id": "Run_Analytic-vani1","end_date": "None","execution_date": "2018-03-08 23:30:00","state": "None","duration": "None","start_date": "None","job_id": "None"

}],"end_date": null,"run_id": "scheduled__2018-03-08T23:30:00","external_trigger": false,"dag_name": "demo_dag_vani5","execution_date": "2018-03-08T23:30:00Z","start_date": "2018-03-08T23:50:28Z","state": "running","dag_tenant": "d7406965-66a8-49e7-9ca2-7ab775144e39","dag_owner": "d7406965-66a8-49e7-9ca2-7ab775144e39","dag_id": "demo_dag_vani5"

}
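Combined, these two endpoints support simple status polling. A sketch in Python with the requests library (the URL, token, and DAG name are placeholders):

import requests

base = "https://<insights_api_uri>/api/v1/dags/status"
headers = {"Authorization": "Bearer <oauth2-token>"}
dag_name = "demo_dag_vani5"

# List the runs for the DAG and pick out any that are still running.
runs = requests.get(f"{base}/{dag_name}/runs", headers=headers).json()
for run in runs.get(dag_name, []):
    if run["state"] == "running":
        # Drill into per-task status for the running DAG run.
        detail = requests.get(
            f"{base}/{dag_name}/tasks/runs/{run['run_id']}",
            headers=headers).json()
        for task in detail["tasks"]:
            print(task["task_id"], task["state"])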


Release Notes

Q1 2019

Enhancements, issues, and fixes for the first quarter 2019 release.

Enhancements

Addition of confs and applicationArgs to predix_insights_flow_params

The predix_insights_flow_params operator parameter can now include confs and applicationArgs properties. Either or both of the properties may be included.

The confs property is a value object that enables dynamic updating of parameters in the Spark arguments. Key:value pairs of the confs object are appended to the flow parameter list when the key cannot be matched to an existing flow parameter. Formerly, unmatched predix_insights_flow_params key:value pairs caused an error.

The applicationArgs property enables the addition of an array of application arguments that are appended to the Spark submit command.

See the predix_insights_flow_params section of the Directed Acyclic Graph (DAG) Development on page 46 documentation for more information.

Full Support for Jinja Templating

Jinja templating may be used in the confs value object and applicationArgs array in the predix_insights_flow_params operator parameter.

See the Jinja Templating section of the Directed Acyclic Graph (DAG) Development on page 46 documentation to learn more.

Discontinuation of $ Templating of Flow Parameters

Flow parameters may instead be updated using the new confs property of the predix_insights_flow_params operator parameter.

See the predix_insights_flow_params section of the Directed Acyclic Graph (DAG) Development on page 46 documentation.

Note: DAGs that use $ templating in their flow parameters will no longer function.

Predix Insights (Beta) Version 1.0 Release Notes

Note:

To request participation in the Predix Insights Limited Access Beta release, send email to the Predix Insights Beta distribution list @Digital Predix Insights Beta ([email protected]).

February 2018 Release

These are the new features, enhancements, and known and resolved issues for Predix Insights.

New Features

This release contains the following new features:

• Built to be agnostic to execution runtime (Hadoop distributions and cloud providers).


• Extensible to support multiple big data processing frameworks (Spark, Apex, etc.). Currently supports Spark.
• Provides a managed infrastructure so a developer can focus on building their applications.
• Provides the ability to create a directed acyclic graph (DAG) to define the order of tasks for a flow, as well as orchestrate and schedule their execution.
• Provides distributed log aggregation and monitoring of the big data applications.
• Supports SQL (Spark based) on Predix Blobstore and other data stores such as Predix Columnar Store, Predix-Search, etc.
• Auto-scales both orchestration and analytic execution.
• Provides dedicated plans with robust security.
• Provides data connectors to connect to Predix data stores such as Predix Time Series, Predix Event Hub, Predix Asset, etc.
• Provides the embedded Predix Spark Framework, which enables users to write Spark applications in a declarative manner using JSON with data inputs and sinks/outputs.

Enhancements

This release contains the following enhancements.

Resolved Issues

The following issues have been resolved in this release:


Known Issues

This release contains the following known issues.



Troubleshooting

General Issues
