The Analytics Pipeline and Data Flow - MeriTalk · The Analytics Pipeline and Data Flow September...

The Analytics Pipeline and Data Flow September 20, 2018 Linton Ward, PhD IBM Distinguished Engineer OpenPower Cognitive Solutions



Emmanuel Macron Talks to WIRED About France's AI Strategy

EM: I think artificial intelligence will disrupt all the different business models and it’s the next disruption to come. So I want to be part of it. Otherwise I will just be subjected to this disruption without creating jobs in this country. So that’s where we are. And there is a huge acceleration and as always the winner takes all in this field.

https://www.wired.com/story/emmanuel-macron-talks-to-wired-about-frances-ai-strategy/

Nicholas Thompson business 03.31.18 06:00 am

One Month of Civilian Agency news …

HHS CTO Report Calls Data Silos to Task

A new report from the Department of Health and Human Services’ (HHS) CTO calls out the department and its individual agencies for keeping their data in silos, and calls for a department-wide data governance framework. “Whether surveillance, survey, or claims data, HHS expends an enormous amount of financial resources to report on the state of the health of the population it serves,”

https://www.meritalk.com/articles/hhs-cto-report-calls-data-silos-to-task/

House Bill to Codify CDM Moves to Senate

“Cyberattacks are escalating at an alarming rate, making it vital that our Federal agencies have access to programs and tools to help mitigate these risks,”

https://www.meritalk.com/articles/house-bill-to-codify-cdm-moves-to-senate/

New DHS S&T Program Targets Internet, Critical Infrastructure Disruption

The new program – the Predict, Assess Risk, Identify (and Mitigate) Disruptive Internet-scale Network Events (PARIDINE) project – aims to study Network/Internet-scale Disruptive Events (NIDE), which can cut internet or network connectivity, leading to disruptions of “energy and water systems, the finance sector, commerce, and public safety and emergency communications systems, as well as other essential systems.”

https://www.meritalk.com/articles/new-dhs-st-program-paridine/

NIST Wants to Know: Can You Trust Your IoT?

The draft publication outlines 17 trust-related issues “that may negatively impact the adoption of IoT products and services,” spanning scalability, predictability, difficulty in measurement, lack of certification criteria, all the way down to usability, performance, and reliability.

https://www.meritalk.com/articles/can-you-trust-your-iot/

GAO Releases Updated Cyber Risk Report

The Government Accountability Office (GAO) today released an updated version of a report it issued in July detailing major cybersecurity challenges facing the Federal government and critical actions needed to address them.

https://www.meritalk.com/articles/gao-releases-updated-cyber-risk-report/

DHS CIO Says Priorities Include Modernization, Workforce, Supply Chain

The Department of Homeland Security (DHS) is focused on modernizing its mindset to tackle a host of pressing issues including reducing its reliance on legacy systems, competing to attract cybersecurity talent, and combating supply chain threats, said DHS CIO John Zangardi today at the Billington Cybersecurity Summit.

“We’re in a very, very different world than we have been in the past,” said Zangardi. “I’ve been in government for a long time. We’re really good at routine. But cyber threats are asymmetrical. The adversary’s not thinking about routine, the adversary is thinking about how to do things differently.”

https://www.meritalk.com/articles/dhs-cio-says-priorities-include-modernization-workforce-supply-chain/

SEC Looking for Social Media Monitoring Tool

The Securities and Exchange Commission (SEC) on Thursday issued a solicitation for “a web-based subscription to a Commercial-Off the Shelf (COTS) social media monitoring tool that provides emailed alerts to SEC staff based on keyword searches for relevant topics with ability to monitor social media sites.”

https://www.meritalk.com/articles/sec-looking-for-social-media-monitoring-tool/

State Department Looking for Platform to Track, Analyze Online Info

The State Department has issued a request for information for systems that collect relevant online information to “analyze and track global developments in (near) real-time.” … The State Department listed its needs for a monitoring system, including: aiding the ability to verify the credibility of a source; ensuring the accuracy of machine-generated content from different languages; and distributing information quickly.

https://www.meritalk.com/articles/state-department-looking-for-platform-to-track-analyze-online-info/

The Administration is developing a Federal Data Strategy to leverage data as a strategic asset to grow the economy, increase the effectiveness of the Federal Government, facilitate oversight, and promote transparency.

• Strategy 1: Enterprise Data Governance. Set priorities for managing Government data as a strategic asset, including establishing data policies, specifying roles and responsibilities for data privacy, security, and confidentiality protection, and monitoring compliance with standards and policies …

• Strategy 2: Access, Use, and Augmentation. Develop policies and procedures and incent investments that enable stakeholders to effectively and efficiently access and use data assets by: (1) improving dissemination, making data available more quickly and in more useful formats; (2) maximizing the amount of non-sensitive data shared with the public; and (3) leveraging new technologies and best practices to increase access to sensitive or restricted data while protecting the privacy, security, and confidentiality, and interests of data providers.

• Strategy 3: Decision-Making and Accountability. Improve the use of data assets for decision-making and accountability for the Federal Government, including both internal and external uses. This includes: (1) providing high quality and timely information to inform evidence-based decision-making and learning; (2) facilitating external research on the effectiveness of Government programs and policies which will inform future policymaking; and (3) fostering public accountability and transparency ....

• Strategy 4: Commercialization, Innovation, and Public Use. Facilitate the use of Federal Government data assets by external stakeholders at the forefront of making Government data accessible and useful through commercial ventures, innovation, or for other public uses. This includes use by the private sector and scientific and research communities; by states, localities, and tribes for public policy purposes; for education; and in enabling civic engagement.

PRESIDENT'S MANAGEMENT AGENDA

Application Transformation

Digital transformation is driving demand for new applications, new databases, and new insights

• Engagement (Mobile, Web, Call centers, Edge & IoT): engaging partners, clients, employees & machines
• Record (Business Logic, Operational Data Platforms): operating business process flows
• Insight (Cognitive Platforms, Analytic Data Platforms): enabling data-driven decisions

Data flows shown in the diagram: Insights, Trained Models, Scores; Customer context, queries; Machine data; Customer, Transaction Data

New: In-line analytics, Scoring capability, Context Data

New: Applications, Data types & representations, Database types

Systems of Insight Landscape

Diagram labels, shown across a Conventional and an Emerging landscape:

• Enterprise Data Warehouses
• Business Intelligence Tools
• Data Ingest
• Hadoop Data Lakes
• AI Grid
• Open Source: Python, R
• Data Science Workbench
• Modern Databases
• Modern Business Intelligence
• Data Governance
• Statistics Tools
• Application Development Platform
• SQL
• File Systems

Success with analytics projects (ways to succeed)

How do we derisk analytics projects?

• Clarity on the question: apply critical thinking techniques with buy-in
• Enable faster exploration: the data science workbench, to create an ad hoc workflow quickly
• Enable quicker wins: a data science sandbox, to deliver a prototype from the data scientist rather than a presentation alone
• Scale to production: the AI Grid, a multi-tenant, high-stability, high-efficiency cluster

Cognitive Platform: Analytic Project Lifecycle

Progression from Data Science Workbench to operationalized insights

Diagram labels:

• Stages: Prototype, Pilot, Scale
• Highly Stable, Highly Agile
• Minimize Investment; Demonstrate Value; Operationalize Value; Optimized Value / $
• Sandbox; Production
• Model Build; Common Data; Maintain Model Currency; Sustain Value
• Dev Ops; Stable; Streamlined Maintenance
• Innovation; Early Libraries; Unstable
• Explainable; Mature Libraries; Stable

© 2015 IBM Corporation


The Data Science Workbench


Workload flow and data flow are key to results

Data sources: Traditional Business; IoT & Sensors; Collaboration Partners; Mobile Apps & Social Media; Legacy

Stages: Data Source, Data Preparation, Model Training, Inference

• Data Preparation: Pre-Processing, producing a Training Dataset and a Testing Dataset
• Model Training: AI Deep Learning Frameworks (TensorFlow & Caffe); Network Models; Hyper-Parameters; Distributed & Elastic Deep Learning (Fabric); Parallel Hyper-Parameter Search & Optimization; Instrumentation; Monitor & Advise; Iterate
• Inference: Trained Model; Deploy in Production using Trained Model; New Data

Timescales: Years of Data; Hours of preparation; Weeks & months of training; Seconds to results

Heavy IO
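The stages named on this slide (data preparation, a training and a testing dataset, a parallel hyper-parameter search, then inference on new data) can be sketched end to end in a few lines. This is a minimal illustration using only the Python standard library, not the PowerAI or TensorFlow workflow itself: the toy linear model, the function names, the learning-rate grid, and the synthetic data are all assumptions made for the example. Threads stand in for the parallel search; a real deployment would spread the candidates across processes or GPUs.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def prepare(raw):
    """Data preparation: drop incomplete records, scale the feature to [0, 1]."""
    rows = [(x, y) for x, y in raw if x is not None]
    lo = min(x for x, _ in rows)
    hi = max(x for x, _ in rows)

    def scale(x):
        return (x - lo) / (hi - lo)

    return [(scale(x), y) for x, y in rows], scale


def split(rows, test_fraction=0.25, seed=0):
    """Shuffle deterministically and split into training and testing datasets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train(rows, lr, epochs=500):
    """Fit y = w*x + b by gradient descent on half the mean squared error."""
    w, b = 0.0, 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in rows) / n
        grad_b = sum((w * x + b - y) for x, y in rows) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


def mse(model, rows):
    """Mean squared error of a (w, b) model on the given rows."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in rows) / len(rows)


def search(train_rows, test_rows, learning_rates):
    """Train one model per learning rate concurrently; keep the best on test data."""
    with ThreadPoolExecutor() as pool:
        models = list(pool.map(lambda lr: train(train_rows, lr), learning_rates))
    scores = [mse(m, test_rows) for m in models]
    best = min(range(len(models)), key=scores.__getitem__)
    return learning_rates[best], models[best]


def infer(model, scale, x_new):
    """Inference: apply the trained model to a new, unscaled input."""
    w, b = model
    return w * scale(x_new) + b


# Synthetic "years of data": y = 2x + 1, plus one incomplete record to clean out.
raw = [(float(x), 2.0 * x + 1.0) for x in range(20)] + [(None, 3.0)]
rows, scale = prepare(raw)
train_rows, test_rows = split(rows)
learning_rates = [0.01, 0.1, 0.5]
best_lr, model = search(train_rows, test_rows, learning_rates)
prediction = infer(model, scale, 10.0)
print(best_lr, round(prediction, 2))
```

Even in this toy form the cost profile from the slide is visible: preparation and inference are single passes over the data, while training repeats the full pass for every epoch and every hyper-parameter candidate.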

Cognitive Systems – Capabilities in the Data Science Workbench

The Data Science Workbench comprises a set of capabilities

• Data types: Structured, Text, Audio, Image, Video
• Ingest: Streaming, Message, Batch
• Data Platforms: Yarn (Map-Reduce), Spark, Streams
• Analysts Toolbox: Visualization; Exploration; Interpretive Environments; NLP Text Analytics; Graph Analytics; Image Analytics; Geospatial Analytics; Streaming Analytics; Statistics & Classification; Machine Learning; Deep Learning
• HPDA, HPC
• Data stores: HDFS; Spectrum Scale; OpenStack Swift; Cloud Object Store; Cassandra; Redis; Mongo; Titan; Neo4j; Postgres

Execution Frameworks and AI Grid

Diagram labels:

• Data Science Workbench
• AI Grid: IBM Spectrum Conductor
• PowerAI: Optimized Open Source ML Frameworks: Large Model Support (LMS); Distributed Deep Learning (DDL)
• PowerAI: Open Source ML Frameworks
• PowerAI Enterprise
• Distribution; Package Manager; Efficient multi-tenant Resource Scheduler; Python & R Ecosystem
• Deep Learning Impact; PowerAI Vision
• Productivity & Simplification: Data & Model Management, Visualize, Advise; Auto-hyperparameter optimization
• End to End Image Classification
• DRIVERLESS AI: Auto ML
• Scale DL to Hundreds of GPUs
• DL for much higher resolution

ANACONDA Accelerates Adoption of Open Data Science for Enterprises

• Easy to install

• Agile data exploration

• Powerful data analysis

• Simple to collaborate

• Accessible to everyone

PYTHON & R OPEN SOURCE ANALYTICS

NumPy, SciPy, Pandas, Scikit-learn, Jupyter/IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX, IRKernel, dplyr, shiny, ggplot2, tidyr, caret, PySpark & 720+ packages

IBM AI / Data Science Workbench: DSX Local

• DSX (Data Science Experience): Jupyter Notebooks & RStudio, Model & Data Management, Hyper-parameter Tuning, GUI
• IBM ML Libraries
• H2O & Anaconda
• PowerAI DL Distribution
• Spark, Data Lake, Connectors to DBs
• Cognitive Systems Data Stores
• Hadoop; Spark; Object Store
• PowerAI: Deep Learning Frameworks; DDL: Distributed Deep Learning; Hyper-Parameter Tuning, GUI
• Spectrum Conductor

Legend: Non-IBM Products

The AI Grid

IBM Software Defined Infrastructure

Multi-scale Infrastructure for High Performance Computing & Analytics

• High Performance Computing: Design / Simulation / Modeling
• ‘New-gen Workloads’: Hadoop, Spark, Containers
• Workload Aware Scheduling
• Shared Resource Management
• Shared Multi-tier Data Management
• Hybrid Cloud Infrastructure: Servers & Storage; Disk; Flash; Tape; Power; Cloud

IBM Spectrum Conductor


Faster Time to Results

• Proven High-performance scalable resource and job scheduler

• Multitenant resource sharing

Simplified Deployment & Management

• Complete solution: scheduling, monitoring, alerting, reporting & diagnostics

• Lifecycle management supporting multiple concurrent and different versions

Lower Infrastructure Costs with Optimized Resource Sharing

Secure multi-tenant: deploy and manage modern computing frameworks & services

Components: Workload Management; Services Management; Resource Management and Orchestration; Services and Support; Monitoring and Reporting

Coming Soon

• Enhanced Notebook & Anaconda Integrations
• Job Dependencies
• DSX Integration
• Fine Grained Resource Allocation
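The slide lists job dependencies only as a feature name. As a generic illustration of what a scheduler has to do with dependencies, the sketch below orders a small set of jobs so that each one runs after the jobs it depends on. It uses Python's standard `graphlib` module (Python 3.9+); the job names and the `run_order` helper are hypothetical, and nothing here represents the IBM Spectrum Conductor API.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each job maps to the set of jobs it depends on.
job_dependencies = {
    "prepare_data": set(),
    "train_model": {"prepare_data"},
    "evaluate_model": {"train_model"},
    "deploy_model": {"evaluate_model"},
    "build_report": {"prepare_data"},
}


def run_order(dependencies):
    """Return one valid execution order in which dependencies always come first."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(job_dependencies)
print(order)
```

Jobs with no ordering constraint between them, such as `train_model` and `build_report` here, are the ones a resource scheduler is free to run concurrently.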

Delivering Value for Data Science


Cognitive Systems are built with optimized hardware and software

Not Just About Hardware Design

It’s about co-optimized hardware + software which just work for Machine Learning, Deep Learning, and AI

• Open Source Software
• Partner Software
• Industry Solutions
• Dev Ecosystem
• Accelerator Roadmaps
• Open Accelerator Interfaces
• Optimized Libraries

Figure labels: Recognition, 9 days; 54x learning runs with Power 8; 4 hours (label repeated across the figure)

What will you do?

• Iterate more and create more accurate models?
• Create more models?
• Both?

Faster Training Time with Distributed Deep Learning
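The slide shows a 9-day duration, a 4-hour duration, and a 54x figure. Assuming the 54x compares those two durations (an interpretation, since the transcript does not say so explicitly), the arithmetic is consistent, as this short check shows. The variable names are illustrative.

```python
BASELINE_DAYS = 9        # longer duration shown on the slide
ACCELERATED_HOURS = 4    # shorter duration shown on the slide

baseline_hours = BASELINE_DAYS * 24                 # 9 days expressed in hours
speedup = baseline_hours / ACCELERATED_HOURS        # ratio of the two durations
runs_in_baseline = baseline_hours // ACCELERATED_HOURS  # 4-hour runs that fit in 9 days

print(baseline_hours, speedup, runs_in_baseline)
```

Read this way, the slide's question follows directly: the time once spent on a single run now covers 54 runs, which can go toward more iterations on one model, more models, or both.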

Snap ML: Distributed GPU-Accelerated Machine Learning Library

• APIs for Popular ML Frameworks
• Logistic Regression; Linear Regression; Support Vector Machines (SVM); More Coming Soon
• Distributed Training
• Distributed Hyper-Parameter Optimization
• libGLM (C++ / CUDA Optimized Primitive Lib)

(coming soon)

Snap Machine Learning (ML) Library
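The slide names distributed training and logistic regression without showing how they fit together. One common approach, data-parallel training, can be sketched as follows: each worker computes a gradient on its own shard of the data, and the partial gradients are combined before every update. This is a plain-Python illustration of that general idea, not the Snap ML API or its algorithm; all function names and the tiny dataset are assumptions, and the workers run sequentially here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def shard_gradient(w, b, shard):
    """Sum of logistic-loss gradients over one worker's shard of the data."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in shard:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
        for j, xj in enumerate(x):
            grad_w[j] += err * xj
        grad_b += err
    return grad_w, grad_b


def train_data_parallel(data, n_workers=2, lr=0.5, epochs=200):
    """Gradient descent where each step combines per-shard partial gradients."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    shards = [data[i::n_workers] for i in range(n_workers)]
    for _ in range(epochs):
        # Each worker handles its own shard (sequential here; parallel in practice).
        partials = [shard_gradient(w, b, shard) for shard in shards]
        # Combine step: sum the partial gradients and average over all examples.
        grad_w = [sum(p[0][j] for p in partials) / len(data) for j in range(dim)]
        grad_b = sum(p[1] for p in partials) / len(data)
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def log_loss(w, b, data):
    """Mean logistic loss of the model (w, b) on the data."""
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


# Tiny illustrative dataset: label is 1 when both features are large.
data = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 0),
        ([1.0, 1.0], 1), ([2.0, 2.0], 1), ([2.0, 1.5], 1)]

w_single, b_single = train_data_parallel(data, n_workers=1)
w_multi, b_multi = train_data_parallel(data, n_workers=3)
initial_loss = log_loss([0.0, 0.0], 0.0, data)
final_loss = log_loss(w_multi, b_multi, data)
print(final_loss < initial_loss)
```

Because the combined gradient equals the full-data gradient, the sharded run reaches the same model as the single-worker run up to floating-point rounding. That property is what lets this style of training scale out without changing the result.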

An Optimized AI Infrastructure Stack

Stack layers, with examples:

• Applications and Services: Segment Specific (Finance, Retail, Healthcare)
• Cognitive APIs (Eg: Watson): Speech, Vision, NLP, Sentiment
• In-House APIs
• Machine & Deep Learning Libraries & Frameworks: TensorFlow, Caffe, SparkML
• Distributed Computing: Spark, MPI
• Data Lake & Data Stores: Hadoop HDFS, NoSQL DBs
• Accelerated Infrastructure: Accelerated Servers, Storage

Other diagram labels:

• Data Platform
• PowerAI
• AI Grid
• Data Science Workbench
• Open Source and ISV Tools: Function Specific (Finance, Retail, Healthcare)
• Open Source Programming Ecosystem: Python, R, etc. Languages and Libs

• Open Source Software
• Partner Software
• Industry Solutions
• Dev Ecosystem
• Accelerator Roadmaps
• Open Accelerator Interfaces
• Optimized Libraries

Time to value for new intelligence

Data Science Productivity

Data Productivity

AI for the rest of us

“We can do new science”

Solve larger problems

Solve previously intractable problems