Easily Consumable Hybrid Cloud Data Analytics Solution

Reference Architecture | Intel Builders | Enterprise Data Center

Run traditional enterprise workloads plus newer demanding AI workloads both on-premises and in the cloud with an end-to-end solution optimized by Intel and VMware.

Intel Data Center Group Authors: Patryk Wolsza, Cloud Solutions Architect; Karol Brejna, Senior Architect; Marcin Hoffmann, Cloud Solutions Engineer

Intel Data Center Group Contributors: Lokendra Uppuluri, Software Architect; Ewelina Kamyszek, Undergraduate Technical Intern; Marek Małczak, Cloud Solutions Engineer; Lukasz Sitkiewicz, Software Engineer

VMware Contributor: Enrique Corro Fuentes, Data Science Staff Engineer, Office of the CTO

2nd Generation Intel Xeon Scalable Processors: The Hybrid Cloud Data Analytics Solution features 2nd Generation Intel Xeon Gold and Platinum processors.

Executive Summary

This reference architecture describes a hybrid cloud solution that can handle existing enterprise workloads (such as SQL and NoSQL databases for transactional processing) but also extends capabilities to include compute- and memory-hungry artificial intelligence (AI) jobs. From data warehousing to machine learning and deep learning, the Hybrid Cloud Data Analytics Solution is just what enterprises need as data volumes continue to swell and data centers are increasingly pressured to provide real-time insights to launch their businesses into the future.

This solution is a unique combination of the latest hardware innovations from Intel, VMware’s broad portfolio of virtualization software products, container orchestration, and AI tools optimized to run on 2nd Generation Intel® Xeon® Scalable processors. These processors feature built-in inferencing acceleration through Intel Deep Learning Boost (Intel DL Boost) with Vector Neural Network Instructions (VNNI).

With this end-to-end solution, enterprises can quickly operationalize database processing and AI to discover insights hidden in their data—and scale to meet future needs.

Figure 1. Building blocks of the reference architecture for the Hybrid Cloud Data Analytics Solution with vSAN ReadyNodes: a cloud infrastructure layer based on VMware Cloud Foundation (VMware vSphere clusters, vCenter Server 6.7, vSAN 6.7, NSX-T, ESXi 6.7, SDDC Manager, and VMware Enterprise PKS orchestration) running on Intel Data Center Blocks for Cloud – VMware (vSAN ReadyNodes VRN2208WFAF82R and VRN2208WFAF83R); an Intel architecture-optimized building-block layer for analytics/AI, machine-learning software, and data warehousing (Intel Distribution for Python, Deep Learning Reference Stack, Intel MKL, Intel MKL-DNN, H2O.ai, SQL and NoSQL databases); and a consistent high-performance hardware layer based on Intel technology (2nd Gen Intel Xeon Scalable processors Gold 6230/6248 and Platinum 8260, Intel Optane SSD DC P4800X Series, Intel SSD DC P4510 Series NVMe-based 3D NAND, and Intel Ethernet 700 Series).



Why Hybrid Cloud for Machine Learning?

While many enterprises choose to run certain workloads in the public cloud, other workloads are better suited to staying on-premises. Enterprises increasingly want a hybrid cloud option for flexibility and business agility, especially as artificial intelligence (AI) and machine-learning workloads become more prevalent. Forbes says that, “Artificial intelligence and machine learning will be the leading catalyst driving greater cloud computing adoption by 2020.”1

VMware and Intel are firm proponents of a hybrid cloud strategy as the most effective way to tackle the requirements of machine-learning development and execution. For example, training machine-learning models using regulated datasets containing sensitive data can be done on-premises; once trained, the models can be affordably deployed in the public cloud to take advantage of special capabilities such as extended geographical coverage, redundancy across availability zones, and on-demand increased capacity for seasonal demand. What’s more, a cloud environment such as VMware Cloud on Amazon Web Services (AWS) can provide access to the latest innovative Intel architecture-based high-performance infrastructure that can handle ever-larger data warehouses and computationally intensive workloads that benefit from hardware acceleration. The hybrid cloud environment can strike the right balance between cost and the feature set of an IT infrastructure that can support successful adoption of AI.

A hybrid cloud strategy for AI relies on the assumption that machine-learning workloads are portable; that is, they will run properly at any point in the hybrid cloud deployment without the need for laborious modifications. The combination of VMware Cloud Foundation (deployed on-premises) and VMware Cloud on AWS solves machine-learning workload mobility challenges by delivering a hybrid cloud service that integrates VMware’s Software-Defined Data Center (SDDC) technologies for compute, storage, and network virtualization. This integration enables companies to use the same tools and skills to operate VMware SDDCs deployed both on-premises and in the public cloud (see Figure 2).

Table of Contents

Executive Summary
Why Hybrid Cloud for Machine Learning?
Solution Overview
  Software Overview
  Hardware Overview
Solution Architecture Details
  Intel Data Center Blocks for Cloud
  Cloud Infrastructure
  Analytics and AI Building Blocks
Platform-Verified Workloads
  Deep Learning on VMware Cloud Foundation
  Data Warehousing on VMware Cloud Foundation
Bill of Materials
Deployment Considerations
  VMware Cloud Foundation
  VMware vSAN
  VMware NSX
  VMware Enterprise PKS (Kubernetes)
  Integration of VMware Enterprise PKS, VMware Cloud Foundation, NSX-T, and vSAN
Environment Provisioning
  Hardware and Software Requirements
  VMware Cloud Foundation Deployment
  NSX-T Installation
  NSX-T Workload Domain Creation
  Adding NSX-T Edge Host to the Newly Created Workload Domain
  Configure the Remaining NSX-T Resources for VMware Enterprise PKS
  VMware Enterprise PKS Solution Deployment on an NSX-T Workload Domain
Summary
Appendices
  Appendix A – Solution Features Validation and Benchmarking
  Appendix B – Deep-Learning Experiment Setup Instructions
  Appendix C – Deploying a Kubernetes Cluster Using VMware Enterprise PKS
  Appendix D – Data Warehouse Benchmarking

Figure 2. VMware Cloud Foundation is a cloud solution managed through VMware SDDC Manager and built on VMware vSphere, VMware vSAN, and VMware NSX, spanning both private and public clouds.


Using a Hybrid Cloud Data Analytics Solution based on VMware Cloud on AWS and VMware Cloud Foundation, both running on the latest Intel technologies, enterprises can expect the following benefits:

• Efficiency. Dramatically reduce the effort required to achieve seamless machine-learning workload mobility.

• Decreased risk. Run your applications on familiar and proven VMware environments and Intel hardware, combined with the global AWS footprint, reach, and scale.

• Simplicity and consistency. No staff retraining or revamping of operational processes is required.

• Cost effectiveness. Applications do not need to be refactored.

• Flexibility. You can choose to use resources in the cloud or on-premises, according to business rules, potentially reducing cost and improving efficiency.

Solution Overview

This reference architecture presents configuration details for building a hybrid cloud solution that can handle existing enterprise workloads—such as SQL and NoSQL databases for transaction processing—but can also extend capabilities to include compute- and memory-intensive AI jobs.

The reference architecture consists of three layers (Figure 1):

• Intel Data Center Blocks for Cloud – VMware (vSAN ReadyNodes), which provide the hardware foundation from Intel

• Cloud/virtualization technology from VMware

• Application building blocks optimized for Intel architecture

The remainder of this document provides detailed configuration information for building a unified cloud solution through which an organization can run applications hosted on both VMs and containers located in an on-premises data center as well as in the cloud. The hybrid cloud nature of the solution allows enterprises to extend available resources and easily migrate workloads from on-premises to the cloud.

Software Overview

The solution consists of:

• VMware Cloud Foundation, which provides a management domain and the ability to create and manage workload domains.

• VMware NSX-T Data Center networking virtualization solution, which enables software-defined networking in virtualized environments.

• VMware Enterprise PKS, which delivers a native Kubernetes solution to the stack.

VMware Cloud Foundation deployed with VMware Enterprise PKS offers a simple solution for securing and supporting containers within existing environments that already support VMs based on VMware ESXi, without requiring any retooling or rearchitecting of the network.

With this easy-to-deploy and comprehensive solution, enterprises can quickly operationalize database processing and AI to unlock the insights hidden in their data—and scale the solution as future needs dictate. The hybrid cloud capability provides flexibility in workload placement as well as business agility.

Hardware Overview

For fast data analytics and inferencing, the hardware for the solution can scale from a single rack of just a few servers to about 1,000 servers with the following components:

• Workload-optimized 2nd Generation Intel Xeon Scalable processors with support for Intel Deep Learning Boost (Intel DL Boost) with Vector Neural Network Instructions (VNNI)

• High-capacity Intel DC 3D NAND SATA SSDs for the vSAN capacity tier, which can scale to PBs if necessary

• Low-latency Intel Optane™ DC SSDs for the vSAN caching tier

• Reliable, fast Intel Ethernet networking components

Solution Architecture Details

This section describes the building blocks in each of the reference architecture’s layers: vSAN ReadyNodes with Intel hardware, cloud infrastructure, and analytics and AI building blocks. For the complete bill of materials for the Base and Plus configurations, refer to Table 1 in the Bill of Materials section.

Intel Data Center Blocks for Cloud – VMware (vSAN ReadyNodes) with Intel Hardware

Intel Data Center Blocks make it easier to adopt and qualify the latest Intel technology, helping you address the demands of today’s data centers. You can choose a workload-optimized, pre-configured system or customize a server for your unique needs. Intel Data Center Blocks for Cloud are pre-certified, fully validated purpose-built systems designed to simplify and accelerate cloud deployments. In particular, Intel Data Center Blocks for Cloud – VMware are certified vSAN ReadyNodes. This reference architecture uses two Intel Data Center Blocks for Cloud – VMware: Intel Server System VRN2208WFAF82R and Intel Server System VRN2208WFAF83R.

2nd Generation Intel Xeon Scalable Processors

2nd Generation Intel Xeon Scalable processors are designed for the most demanding data-centric and in-memory database workloads. These processors incorporate a performance-optimized multi-chip package that delivers up to 48 cores per CPU, 12 DDR4 memory channels per socket, and support for Intel Optane DC persistent memory DIMMs, which provide large-capacity memory to the system.


For the “Base” configuration, the Intel Xeon Gold 6248 processor provides an optimized balance of price and performance in a mainstream configuration. The Intel Xeon Platinum 8260 processor powers the “Plus” configuration, which is designed for high-density deployments or more demanding, latency-sensitive environments. Even higher-performance processors can also be used in either configuration.

Intel SSD Data Center Family

The Intel Optane SSD DC P4800X series is the first product to combine the attributes of memory and storage. With an industry-leading combination of high throughput, low latency, high quality of service (QoS), and high endurance, this innovative solution is optimized to break through data access bottlenecks. The Intel Optane SSD DC P4800X and P4801X accelerate applications with fast caching and fast storage to increase scale per server and reduce transaction costs for latency-sensitive workloads. In addition, the Intel Optane DC P4800X helps enable data centers to deploy bigger and more affordable datasets to gain new insights from large memory pools.

VMware vSAN performs best when the cache tier is using fast SSDs with low latency and high endurance. Workloads that require high performance can benefit from equipping the cache tier with the highest-performing SSDs rather than mainstream Serial ATA (SATA) SSDs. Therefore, in this reference architecture, Intel Optane DC SSDs with Non-Volatile Memory Express (NVMe) are used to power the cache tier. Intel Optane DC SSDs offer high IOPS per dollar with low latency, coupled with 30 drive-writes-per-day (DWPD) endurance, so they are ideal for write-heavy cache functions.2

The vSAN capacity tier is served by Intel DC 3D NAND SSDs with NVMe, delivering optimized read performance with a combination of data integrity, performance consistency, and drive reliability.

Intel Ethernet Connections and Intel Ethernet Adapters

The Intel Ethernet 700 series accelerates the performance of VMware vSAN platforms powered by Intel Xeon Scalable processors, delivering validated performance ready to meet high-quality thresholds for data resiliency, service reliability, and ease of provisioning.3, 4, 5, 6

Intel Optane DC Persistent Memory

Intel Optane DC persistent memory represents a new class of memory and storage technology that allows organizations to maintain larger amounts of data closer to the processor with consistently low latency and near-DRAM performance. Organizations can use Intel Optane DC persistent memory with VMware vSAN deployments to cost-effectively expand the capacity of memory available to support more or larger VMs in virtual desktop infrastructure (VDI) deployments, or higher quantities of “hot” data available for processing with in-memory databases, analytics, and other demanding workloads.

Physical Networking Layer

This reference architecture uses two models of network switches:

• Data Plane: 2x Arista DCS-7060CX2-32S-R

• Management Plane: 1x Arista DCS-7010T-48-R

An enterprise-grade router solution is also necessary to ensure adequate routing capabilities for the multiple virtual local area networks (VLANs) that are required in the solution.

Cloud Infrastructure

VMware Enterprise PKS

VMware Enterprise PKS is a container services solution that enables Kubernetes to operate in multi-cloud environments. VMware Enterprise PKS simplifies the deployment and management of Kubernetes clusters with Day 1 and Day 2 operations support. VMware Enterprise PKS manages container deployment from the application layer all the way to the infrastructure layer, according to the requirements for production-grade software, using BOSH and Pivotal Ops Manager. VMware Enterprise PKS supports high availability, autoscaling, health checks, self-repair of underlying VMs, and rolling upgrades for the Kubernetes clusters.

VMware Cloud Foundation and VMware Cloud on AWS

VMware Cloud Foundation is a unified SDDC platform for both private and public clouds. It brings together a hypervisor platform; software-defined services for compute, storage, network, and security; and network virtualization into an integrated stack whose resources are managed through a single administrative tool. VMware Cloud Foundation provides an easy path to hybrid cloud through a simple, security-enabled, and agile cloud infrastructure that can run on-premises and in as-a-service public cloud environments.

VMware Cloud on AWS is a highly scalable, secure, hybrid cloud service that enables users to run VMware SDDC on AWS with enterprise tools like vSphere, vSAN, and NSX as a managed service. It allows organizations to seamlessly migrate and extend their on-premises VMware-based environments to the AWS cloud.

VMware SDDC Manager

VMware SDDC Manager manages the start-up of the Cloud Foundation system, creates and manages workload domains, and performs lifecycle management to ensure the software components remain up to date. SDDC Manager also monitors the logical and physical resources of VMware Cloud Foundation.

VMware vSphere

VMware vSphere extends virtualization to storage and network services and adds automated, policy-based provisioning and management. As the foundation for VMware’s complete SDDC platform, vSphere is the starting point for building your SDDC.


VMware HCX

VMware HCX extends continuous, hybrid cloud capabilities to VMs. It enables customers to migrate workloads between public clouds and data centers without any modification to application or VM configuration (see Figure 3). It provides full compatibility with the VMware software stack and helps make the migration simple, secure, and scalable.

The HCX Multi-Site Service mesh provides a security-enabled pipeline for migration, extension, and VM protection between two connected VMware HCX sites. It can be used to extend VLANs and retain IP and MAC addresses, as well as existing network policies, during migration between two sites. It also enables flexibility when planning complex, growing workloads across physical sites.

Analytics and AI Building Blocks

Enterprises need high-performance data analytics and AI to remain competitive. They require flexible solutions that can run traditional data analytics and AI applications. The Hybrid Cloud Data Analytics Solution includes performance optimizations that take advantage of Intel hardware in a VMware infrastructure. These building blocks enable enterprises to quickly operationalize analytics, AI, and machine-learning workloads. Because the building blocks are already optimized for Intel architecture, you don’t have to spend days or weeks fine-tuning parameters. And using any of the building blocks typically requires either no or minimal changes to your applications (such as adding a single line of code).
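For example, a single added call is often all that is needed to route an existing scikit-learn script onto Intel-optimized kernels. The sketch below assumes the Intel Distribution for Python (which bundles the daal4py package) is installed; it is an illustration, not part of the validated solution:

# Illustrative only: accelerate an existing scikit-learn workload with Intel DAAL.
# Assumes the Intel Distribution for Python (which bundles daal4py) is installed.
import daal4py.sklearn
daal4py.sklearn.patch_sklearn()  # reroute supported estimators to Intel DAAL kernels

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100000, n_features=50, random_state=42)
model = LogisticRegression(max_iter=200).fit(X, y)  # now backed by DAAL where supported
print("Training accuracy: %.3f" % model.score(X, y))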

Deep Learning Reference Stack

The Deep Learning Reference Stack (see Figure 4) is an integrated, highly performant, containerized open source stack optimized for Intel Xeon Scalable processor-based platforms. This open source community release is part of an effort to ensure AI developers have easy access to all features and functionality of Intel platforms. Highly tuned and built for cloud-native environments, the Deep Learning Reference Stack enables developers to quickly prototype by reducing the complexity associated with integrating multiple software components, while still giving them the flexibility to customize their solutions.

The Deep Learning Reference Stack includes highly tuned software components across the OS (Clear Linux), deep-learning frameworks (TensorFlow and PyTorch), Intel Distribution of OpenVINO™ toolkit, and other software components. The following sections provide a few details about some of the components in the Deep Learning Reference Stack.

• Intel Distribution of OpenVINO Toolkit. The Intel Distribution of OpenVINO toolkit (short for Open Visual Inference and Neural Network Optimization) provides developers with improved neural network performance on a variety of Intel processors and helps to further unlock cost-effective, real-time vision applications. The toolkit enables deep-learning inference and easy heterogeneous execution across multiple Intel architecture-based platforms, providing implementations across cloud architectures to edge devices and across all types of computer vision accelerators—CPUs, GPUs, Intel Movidius™ Neural Compute Sticks, and Intel field-programmable gate arrays (Intel FPGAs)—using a common API. The toolkit’s library of functions and pre-optimized kernels helps speed time to market.

• Optimized Versions of TensorFlow and PyTorch. The Deep Learning Reference Stack’s version of TensorFlow is an end-to-end open source platform for machine learning that is optimized to run on Intel hardware. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers explore state-of-the-art machine learning and enables developers to easily build and deploy machine-learning-powered applications.

PyTorch is a Torch-based open source machine-learning library for Python that developers can use for deep-learning applications such as natural language processing. Developers use this scientific computing package as a replacement for NumPy.
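Both frameworks in the stack are built against Intel MKL-DNN, so thread placement strongly affects throughput. The sketch below shows commonly recommended environment and session settings for the TensorFlow 1.x API on a 16-vCPU VM; the values are illustrative assumptions to be tuned per VM size, not settings prescribed by this reference architecture:

# Minimal sketch (not from the reference architecture appendices): common thread-affinity
# settings for MKL-DNN-backed TensorFlow 1.x on a 16-vCPU VM. Values are illustrative
# assumptions and should be tuned for the actual VM size.
import os
os.environ["OMP_NUM_THREADS"] = "16"          # one OpenMP thread per vCPU
os.environ["KMP_BLOCKTIME"] = "1"             # release threads quickly between ops
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

import tensorflow as tf

config = tf.ConfigProto(
    intra_op_parallelism_threads=16,          # parallelism inside a single op
    inter_op_parallelism_threads=2)           # concurrent independent ops
session = tf.Session(config=config)           # pass this session to the inference graph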

Figure 3. VMware HCX enables VM mobility and network extension between a source and a target environment over the HCX Interconnect (Internet, MPLS/private data center networks, or Direct Connect). MPLS = Multiprotocol Label Switching. Source: docs.vmware.com/en/VMware-HCX/services/user-guide/GUID-5D2F1312-EB62-4B25-AF88-9ADE129EDB57.html

Figure 4. Deep Learning Reference Stack in a multi-node configuration: containers with Clear Linux OS, Python, TensorFlow, PyTorch, and OpenVINO run on a container runtime and a Linux guest OS on the VMware ESXi hypervisor and Intel hardware, with optional container controllers (Kubernetes and Kubeflow).


• Intel Distribution for Python. Intel Distribution for Python is a ready-to-use, integrated package that delivers faster application performance on Intel architecture-based platforms. With it you can:

  - Accelerate compute-intense applications—including numeric, scientific, data analytics, and machine-learning applications—that use NumPy, SciPy, scikit-learn, and more.

  - Optimize performance with native performance libraries and parallelism techniques.

  - Implement and scale production-ready algorithms for scientific computing and machine-learning workloads.

• Intel MKL. Intel MKL optimizes code with minimal effort for future generations of Intel processors. It is compatible with a wide choice of compilers, languages, operating systems, and linking and threading models. Intel MKL features highly optimized, threaded, and vectorized math functions that maximize performance on each processor family. It uses industry-standard C and Fortran APIs for compatibility with popular Basic Linear Algebra Subprograms (BLAS), LAPACK, and Fastest Fourier Transform in the West (FFTW) functions. No code changes are required to use Intel MKL, and it automatically dispatches optimized code for each processor without the need to branch code; a quick way to verify this from Python is sketched after this list. You can also take advantage of Priority Support, which connects you directly to Intel engineers for confidential answers to technical questions.

• Intel MKL‑DNN. Intel MKL-DNN is an open source performance library for deep-learning applications that includes basic building blocks for neural networks, optimized for Intel architecture. Intel MKL-DNN is intended for deep-learning applications and framework developers interested in improving application performance on Intel processors. Note that Intel MKL-DNN is distinct from Intel MKL, which is a general math performance library.
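A quick way to confirm that a Python environment is actually dispatching to Intel MKL is to inspect NumPy's build configuration and time a large matrix multiplication, as in the illustrative check below (not part of the validated benchmarks):

# Illustrative check that NumPy is linked against Intel MKL and that a large
# matrix multiplication uses the threaded, vectorized BLAS path.
import time
import numpy as np

np.show_config()                       # BLAS/LAPACK sections should reference MKL

a = np.random.rand(4096, 4096)
b = np.random.rand(4096, 4096)
start = time.time()
c = a @ b                              # dispatched to MKL's dgemm when MKL is present
print("4096x4096 matmul: %.2f s" % (time.time() - start))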

H2O.ai

This solution demonstrates H2O, a popular machine-learning platform that has been optimized for Intel architecture. H2O Driverless AI is a high-performance platform for automatic development and rapid deployment of state-of-the-art predictive analytics models. H2O Driverless AI automates several time-consuming aspects of a typical data science workflow, including data visualization, model optimization, feature engineering, predictive modeling, and scoring pipelines. When combined with the Intel Data Analytics Acceleration Library (Intel DAAL), H2O Driverless AI can take advantage of algorithms optimized for Intel architecture, such as the XGBoost algorithm.
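H2O Driverless AI is a commercial product with its own web UI and Python client, so the snippet below is only a stand-in illustration that uses the open source H2O-3 platform's AutoML API to show the same automated model-search pattern:

# Stand-in illustration only: the reference architecture validates H2O Driverless AI;
# this uses the open source H2O-3 AutoML flow to show the automated model-search pattern.
import h2o
from h2o.automl import H2OAutoML
from sklearn.datasets import load_iris

h2o.init()                                      # start or attach to a local H2O node
iris = load_iris(as_frame=True).frame           # pandas DataFrame with a "target" column
frame = h2o.H2OFrame(iris)
frame["target"] = frame["target"].asfactor()    # treat the target as a classification label

aml = H2OAutoML(max_models=5, seed=1)
aml.train(y="target", training_frame=frame)
print(aml.leaderboard.head())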

Data Warehousing Building Blocks

Data warehouses are considered one of the core components of business intelligence. They are a central location to store data from one or more disparate sources as well as current and historical data. Numerous methods can be used to organize a data warehouse. Hardware, software, and data resources are the main components of this architecture, and VMware Cloud Foundation is an excellent platform on which to deploy data warehousing solutions (see Figure 5).

In addition to traditional SQL data warehouses, VMware Cloud Foundation also accommodates NoSQL databases, and it’s an efficient platform for running the Apache Hadoop framework and all of its related services that support big data and data mining. You can run Apache services like Hive, Kafka, and HBase to achieve Bigtable-like capabilities on top of Hadoop and the Hadoop Distributed File System (HDFS), and easily scale them according to your needs. Everything runs on vSAN, which provides additional policy configuration options for data redundancy and can be used by both platform administrators and end users (for example, when processing persistent volume claims for Kubernetes deployments) to make maximum use of the platform’s storage.
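As an illustration of how end users consume vSAN-backed storage from Kubernetes, the sketch below requests a volume through a PersistentVolumeClaim using the Kubernetes Python client. The StorageClass name "vsan-gold" is an assumption; in a PKS deployment the platform administrator would map such a class to a vSAN storage policy:

# Hypothetical sketch: requesting vSAN-backed storage from Kubernetes with a
# PersistentVolumeClaim. The StorageClass name "vsan-gold" is an assumption.
from kubernetes import client, config

config.load_kube_config()                      # uses the current kubeconfig context

pvc = client.V1PersistentVolumeClaim(
    api_version="v1",
    kind="PersistentVolumeClaim",
    metadata=client.V1ObjectMeta(name="warehouse-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="vsan-gold",        # assumed class mapped to a vSAN policy
        resources=client.V1ResourceRequirements(requests={"storage": "200Gi"})))

client.CoreV1Api().create_namespaced_persistent_volume_claim(namespace="default", body=pvc)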

Platform-Verified Workloads

This section discusses performance testing results for deep learning and data warehousing.

Deep Learning on VMware Cloud Foundation

Image classification is one of the most popular use cases for deep learning. Our tests benchmarked the ResNet50 inferencing model, using both TensorFlow and the OpenVINO toolkit from the Deep Learning Reference Stack. The accuracy of the model was validated using the ImageNet dataset. For detailed instructions on downloading the model, installing the frameworks, and running the benchmark, refer to the “Deep Learning” section in Appendix A – Solution Features Validation and Benchmarking; this appendix also includes more extensive discussion of our benchmarking results.
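The validated benchmark procedure is documented in the appendices; the snippet below is only a minimal sketch of how inference throughput (images per second) can be measured with the pre-2022 OpenVINO Inference Engine Python API, assuming a ResNet50 IR model (resnet50.xml/resnet50.bin) produced by the Model Optimizer:

# Minimal throughput sketch (the validated benchmark procedure is in Appendix B).
# Assumes an OpenVINO IR model resnet50.xml/resnet50.bin from the Model Optimizer.
import time
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="resnet50.xml", weights="resnet50.bin")
input_name = next(iter(net.input_info))
exec_net = ie.load_network(network=net, device_name="CPU")

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)   # batch size 1, NCHW layout
iterations = 200
start = time.time()
for _ in range(iterations):
    exec_net.infer({input_name: batch})
elapsed = time.time() - start
print("throughput: %.1f images/sec" % (iterations / elapsed))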

Figure 5. VMware Cloud Foundation is an excellent platform for all your data analytics and machine-learning workloads: data warehousing VMs and containerized ML workloads run on ESXi hosts managed by VMware vCenter, with VMware vSAN software-defined storage and VMware NSX software-defined networking in the VMware Cloud Foundation SDDC.


Deep-learning workloads are compute- and memory-intensive. Therefore, sizing the resources of the VMs is crucial to obtain optimal performance. We ran various experiments with VM sizes ranging from two virtual CPUs (vCPUs) to 80 vCPUs. We determined that for ResNet50 v1 inference workloads, a VM with 16 vCPUs provides the optimal performance for both the Base and Plus configurations of the Hybrid Cloud Data Analytics Solution.

We measured throughput scaling by adding new VMs, each with 16 vCPUs, and measuring throughput for the following batch sizes: 1, 16, 32, 64, and 128. Figure 6 illustrates the results for batch sizes 1 and 128 for clusters with 1, 2, 3, and 4 VMs, using the Base configuration. The measured data has been normalized with batch size 1 on 1 VM as the baseline. As you can see, in a multi-node system, as the VMs running the workload are scaled, the overall throughput of the ResNet50 v1 workload scales efficiently.7 This demonstrates the effectiveness of the ESXi scheduler.

Data Warehousing on VMware Cloud Foundation

Our data warehouse test used Microsoft SQL Server 2017, with both the Base and Plus configurations of the reference architecture. Microsoft SQL Server is a relational database management system (RDBMS) that uses Transact-SQL (T-SQL) as its query language. T-SQL is an SQL language extension that allows the use of basic programming constructions such as variables, loops, and conditional instructions.

For the benchmark, we used HammerDB—an open source tool that uses specially generated tables and virtual users to measure and test database workloads. HammerDB supports several database engines; in our tests, we used the Microsoft SQL Server engine because it’s a popular choice for benchmarking tests. Although it’s possible to run a HammerDB instance directly on the SQL Server host, we recommend creating a separate Microsoft Windows Server instance and testing the SQL Server databases remotely. For more information, visit the HammerDB documentation and the GitHub HammerDB web page.

HammerDB provides an online transaction processing (OLTP) workload that allowed us to measure maximum throughput. The key performance indicators (KPIs) from Microsoft SQL Server used to evaluate the results are as follows (a sketch of collecting such KPIs appears below):

• Transactions per minute (TPM) from HammerDB

• Latency CPU Time:Requests (the time the CPU spent on the request)

• CPU Time:Total (the total time the CPU spent on the batch)

In addition, we considered KPIs from the infrastructure to measure resource saturation (CPU, memory, storage, and networking).
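HammerDB drives the actual load (see Appendix D); the sketch below only illustrates how per-request CPU-time statistics can be pulled from SQL Server's dynamic management views with pyodbc so that KPIs like those above can be tracked during a run. The server name, credentials, and driver string are placeholders:

# Illustrative KPI collection only (HammerDB drives the benchmark; see Appendix D).
# Pulls average CPU time per execution from SQL Server's query-stats DMV via pyodbc.
# Server name, credentials, and driver string are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sql-benchmark.example.local;DATABASE=tpcc;UID=monitor;PWD=placeholder")

query = """
SELECT TOP 20
       total_worker_time / execution_count / 1000.0 AS avg_cpu_ms,
       execution_count
FROM sys.dm_exec_query_stats
ORDER BY avg_cpu_ms DESC;
"""
for avg_cpu_ms, execution_count in conn.cursor().execute(query):
    print("avg CPU %.2f ms over %d executions" % (avg_cpu_ms, execution_count))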

The intention of the benchmarking was to compare the Base and Plus configurations, using two different VMware Cloud Foundation workload domains, under load, while simultaneously measuring the maximum density from the client perspective without violating the service-level agreement (SLA).

In our case, the SLA metrics are based on latency measurements:

• SQL latency CPU Time:Requests 99th percentile < 5 ms

• SQL latency CPU Time:Total 80th percentile < 20 ms

Our tests confirmed that, compared to the Base configuration, the Plus configuration can deliver up to 20 percent more throughput and, while hosting up to 50 percent more data warehouses, can achieve up to 50 percent higher user density without violating the SLA (see Figure 7).8 See Appendix D – Data Warehouse Benchmarking for full benchmarking details.

Figure 6. Scalability of the overall throughput of the ResNet50 v1 workload: relative throughput versus number of 16-vCPU VMs (1-4) on the Base configuration, for batch sizes 1 and 128, normalized to batch size 1 on 1 VM.

Figure 7. Data warehousing benchmarks reveal that the Plus configuration (2S Intel Xeon Platinum 8260) of the Hybrid Cloud Data Analytics Solution can increase performance and density compared to the Base configuration (2S Intel Xeon Gold 6248): up to 20 percent more HammerDB transactions per minute and up to 50 percent higher data warehouse and user density for online transaction processing on SQL Server 2017.


Bill of Materials

Hardware Specifications

The Hybrid Cloud Data Analytics Solution can scale from a single rack with just eight servers up to 15 workload domains with a total of 960 servers (15 domains of up to 64 servers each). This reference architecture uses 12 Intel architecture-based servers (see Table 1). Each rack contains two top-of-rack (ToR) Arista switches and a single out-of-band Arista management switch. Additional racks can be added at the time of purchase or later. In multi-rack deployments, an additional set of spine switches is recommended (usually installed in the second rack). With the introduction of VMware Cloud Foundation 3.0 and Bring Your Own Network (BYON), VMware no longer certifies switch compatibility with VMware Cloud Foundation.

The initial software imaging requires an additional server or laptop running virtualization software and a privately managed switch. These components are not part of the solution and are not needed after completing the VMware Cloud Foundation imaging and start-up process.

To demonstrate support for heterogeneous hardware configurations, this reference architecture uses two types of servers, which use different CPU models, memory size, and number of drives. Customers can modify the Base vSAN ReadyNode configuration to some extent, adding more memory or drives or replacing the CPU with a higher core-count or better clock-speed model. The general rules are described in the blog, “What You Can (and Cannot) Change in a vSAN ReadyNode.”

For the full vSAN ReadyNode hardware specification, see Intel Data Center Blocks for Cloud – vSAN Ready Node: System Deployment and Configuration Guide.

Software and Firmware Specifications

This reference architecture consists of two main software component suites: VMware Cloud Foundation and VMware Enterprise PKS. VMware Enterprise PKS requires multiple components and supporting infrastructure. In addition, several networking services, such as an enterprise NTP server and a DNS server, are needed for seamless integration with external networks and global time synchronization. For a complete list of requirements and prerequisites, refer to the official VMware documentation. Table 2 details the main VMware Cloud Foundation products and services.

Table 1. Hardware Bill of Materials for the Hybrid Cloud Data Analytics Solution Reference Architecture

Management Cluster (4 nodes), per node:
• Base SKU: 1x Intel Server System VRN2208WFAF82R
• Mainboard: 1x Intel Server Board S2600WF0R
• CPU: 2x Intel Xeon Gold 6230 processor (20 cores, 2.10 GHz)
• Memory: 12x 32 GB RDIMM DDR4-2933
• Caching Tier: 2x 375 GB Intel Optane SSD DC P4800X Series (PCIe x4, U.2)
• Capacity Tier: 6x 4 TB Intel SSD DC P4510 Series (2.5” NVMe U.2)
• Boot Device: 1x 480 GB Intel SSD D3-S4510 Series (M.2, 80 mm)
• NIC: 1x Intel Ethernet Converged Network Adapter XXV710-DA2

Base Workload Domain (4 nodes), per node:
• Base SKU: 1x Intel Server System VRN2208WFAF82R
• Mainboard: 1x Intel Server Board S2600WF0R
• CPU: 2x Intel Xeon Gold 6248 processor (20 cores, 2.50 GHz)
• Memory: 12x 32 GB RDIMM DDR4-2933
• Caching Tier: 2x 375 GB Intel Optane SSD DC P4800X Series (PCIe x4, U.2)
• Capacity Tier: 6x 4 TB Intel SSD DC P4510 Series (2.5” NVMe U.2)
• Boot Device: 1x 480 GB Intel SSD D3-S4510 Series (M.2, 80 mm)
• NIC: 1x Intel Ethernet Converged Network Adapter XXV710-DA2

Plus Workload Domain (4 nodes), per node:
• Base SKU: 1x Intel Server System VRN2208WFAF83R
• Mainboard: 1x Intel Server Board S2600WF0R
• CPU: 2x Intel Xeon Platinum 8260 processor (24 cores, 2.40 GHz)
• Memory: 24x 32 GB RDIMM DDR4-2666
• Caching Tier: 4x 375 GB Intel Optane SSD DC P4800X Series (PCIe x4, U.2)
• Capacity Tier: 12x 4 TB Intel SSD DC P4510 Series (2.5” NVMe U.2)
• Boot Device: 1x 480 GB Intel SSD D3-S4510 Series (M.2, 80 mm)
• NIC: 1x Intel Ethernet Converged Network Adapter XXV710-DA2

Table 2. VMware Cloud Foundation Main Products and Services. For other components, refer to the VMware Cloud Foundation Release Notes.

• VMware Cloud Foundation Bundle: version 3.8.0, build 14172583
• VMware Cloud Builder VM: version 2.1.0.0, build 14172583
• VMware ESXi Hypervisor: version ESXi670-201906002, build 13981272
• VMware vSAN: version 6.7 Express Patch 10, build 13805960
• VMware NSX Data Center for vSphere: version 6.4.5, build 13282012
• VMware NSX-T Data Center: version 2.4.1, build 13716575
• VMware vCenter Server Appliance: version 6.7 Update 2c, build 14070457
• VMware SDDC Manager: version 3.8.0, build 14172583
• VMware vRealize Suite Lifecycle Manager: version 2.1 Patch 1, build 13685821
• VMware Enterprise PKS: version 1.4.1, build 1.4.1.0-24363153



From the hardware perspective, Table 3 provides the firmware and driver versions that were used in this solution.

Table 3. BIOS and Firmware Specifications for the Hybrid Cloud Data Analytics Solution Reference Architecture

• BIOS: SE5C620.86B.02.01.0008.031920191559
• BMC: 1.93
• ME: 04.01.04.251
• SDR: 1.93
• NIC firmware: 6.80 0x8003d05 1.2007.0
• NIC version: 1.7.17
• Intel Optane SSD DC P4800X firmware: E2010435
• Intel SSD DC P4510 firmware: VDV10152
• Microcode: Base 0x05000021; Plus 0x05000021; Management 0x0400001c

Deployment Considerations

The goal of using solutions like VMware Cloud Foundation, NSX-T, vSAN, and VMware Enterprise PKS is the transformation to an SDDC, where administrators can define, deploy, and manage clusters and resources based on actual demand from users. Each of these components is a standalone product and can be used independently. The following sections provide some deployment considerations for these solution components.

VMware Cloud Foundation

VMware Cloud Foundation consists of four core components: compute virtualization (VMware vSphere), network virtualization (VMware NSX), storage virtualization (VMware vSAN), and cloud monitoring (VMware vRealize Suite). VMware Cloud Foundation allows you to build enterprise-ready cloud infrastructure for the private and public cloud.

The standard architecture model for VMware Cloud Foundation includes a dedicated management domain (one per instance) for all management components and up to 15 virtual infrastructure workload domains created by users.

Management Domain

The management domain is a special-purpose workload domain that is used to host the infrastructure components needed to instantiate, manage, and monitor the VMware Cloud Foundation infrastructure. It is automatically created using VMware Cloud Builder on the first rack in a VMware Cloud Foundation system during start-up, and it contains management components such as SDDC Manager, vCenter Server, NSX, and vRealize Log Insight. The management domain uses vSAN as primary storage and requires a minimum of four nodes to work properly. If you add more racks to your system, the management domain covers the additional components automatically.

Workload Domain

The workload domain represents a cluster of resources that can contain up to 64 servers with its own vCenter Server appliance, integrated vSAN, and NSX. A workload domain can span multiple racks; if you add more racks, you can scale any existing workload domains to the additional racks as needed. All tasks related to the workload domains are performed using the SDDC Manager web interface. This includes the creation, expansion, and deletion of workload domains, along with physical infrastructure monitoring and management.

VMware vSAN

vSAN is storage virtualization software that is fully integrated with VMware vSphere; it joins all storage devices across a vSphere cluster into a shared data pool (see Figure 8). Two vSAN cluster configurations are possible: hybrid and all-flash. A hybrid vSAN cluster uses two types of storage devices—flash devices for the cache tier and magnetic drives for the capacity tier. In an all-flash vSAN configuration, both the cache and capacity tiers use flash drives. This approach eliminates the need for external shared storage. vSAN offers users the flexibility to define policies on demand and makes storage for containers easy to manage.

Figure 8. VMware vSAN configuration: ESXi hosts managed by VMware vCenter contribute their local SSDs to a shared vSAN datastore across the vSphere cluster. Source: vmware.com/content/dam/digitalmarketing/vmware/en/pdf/products/vsan/vmware-vsan-datasheet.pdf

VMware NSX

VMware NSX is a network virtualization solution that allows you to build software-defined networks in virtualized data centers. Just as VMs are abstracted from physical server hardware, virtual networks (including switches, ports, routers, firewalls, etc.) are constructed in the virtual space. Virtual networks are provisioned and managed independent of the underlying hardware.

VMware NSX allows you to define network connectivity among virtualized elements running on vSphere and to harden network security through micro-segmentation rules. Virtual network functions (VNFs) defined by VMware NSX include switching, routing, firewalling, load balancing, and virtual private networks (VPNs—specifically IPsec and SSL).


NSX-V Compared to NSX-T

VMware offers two variants of the NSX software-defined networking platform: NSX-V and NSX-T.

• NSX‑V (NSX for vSphere) is designed for vSphere deployments only and is architected so that a single NSX-V manager platform is tied to a single VMware vCenter Server instance. The NSX-V platform is the original NSX platform that has been available for several years.

• NSX-T (NSX-Transformers) was designed for different virtualization platforms and multi-hypervisor environments; it can also be used in cases where NSX-V is not applicable. While NSX-V supports software-defined networking only for VMware vSphere, NSX-T also supports a network virtualization stack for KVM, Docker, Kubernetes, and OpenStack, as well as AWS-native workloads. VMware NSX-T can be deployed without a vCenter Server and is adapted to heterogeneous compute environments. It is designed to address emerging application frameworks and architectures that have heterogeneous endpoints and technology stacks. A popular use case for NSX-T is containers, because it includes the NSX-T Container Networking Interface (CNI) plugin that allows developers to configure network connectivity for container applications.

With NSX-T, as used in this reference architecture, VMware has shifted from the Virtual Extensible LAN (VXLAN)-based encapsulation used by NSX-V to the newer GENEVE encapsulation. This tunneling protocol preserves the traditional offload capabilities available on network interface cards (NICs) for high performance. Additional metadata can be added to the overlay headers to improve context differentiation when processing information such as end-to-end telemetry, data tracking, encryption, and security in the data-transfer layer. This additional metadata is carried as TLVs (Type, Length, Value). GENEVE (co-developed by VMware, Intel, Red Hat, and Microsoft) is based on the best concepts of the VXLAN, STT, and NVGRE encapsulation protocols.

The maximum transmission unit (MTU) value for jumbo frames must be at least 1,700 bytes when using GENEVE encapsulation because of the additional variable-length metadata field in GENEVE headers. (The VXLAN protocol, by comparison, requires an MTU value of 1,600 or higher.)
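A back-of-envelope calculation shows where this guidance comes from; the header sizes below are standard IPv4 values, and the GENEVE option budget is an assumed figure for the variable TLV metadata:

# Back-of-envelope MTU check for a GENEVE overlay (standard IPv4 header sizes;
# the GENEVE option budget is an assumption for the variable TLV metadata).
inner_ethernet = 14     # the encapsulated frame keeps its own Ethernet header
inner_payload  = 1500   # standard guest MTU
geneve_base    = 8
geneve_options = 128    # assumed TLV metadata budget (variable length)
outer_udp      = 8
outer_ipv4     = 20

required_mtu = inner_ethernet + inner_payload + geneve_base + geneve_options + outer_udp + outer_ipv4
print("physical-network MTU needed: %d bytes" % required_mtu)   # 1678 with these assumptions

With a larger option budget the total approaches 1,700 bytes, which is why the higher MTU is recommended for GENEVE overlays.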

NSX Components

The main components of VMware NSX are NSX Manager, NSX Controllers, and NSX Edge gateways.

• NSX Manager is a centralized component of NSX that is used for network management. It is a virtual appliance that provides the GUI and the RESTful APIs for creating, configuring, orchestrating, and monitoring NSX-T Data Center components (such as logical switching and routing, networking and edge services, and security and distributed firewall services), as well as NSX Edge services gateways. NSX Manager is the management plane for the NSX-T Data Center ecosystem: it provides an aggregated system view and is the centralized network management component of NSX-T Data Center.

• NSX Controller is a distributed state-management system that is used to manage overlay transport tunnels and control virtual networks; it can be deployed as VMs on VMware ESXi or KVM hypervisors. The NSX Controller manages all logical switches within the network and handles information about VMs, hosts, switches, and VXLANs. Having three controller nodes ensures data redundancy in case one NSX Controller node fails.

• NSX Edge is a gateway service that provides access to physical and virtual networks for VMs. It can be installed as a distributed virtual router or as a services gateway. The following services can be provided: dynamic routing, firewalls, NAT, DHCP, VPNs, load balancing, and high availability.

An NSX Edge VM has four internal interfaces: eth0, fp-eth0, fp-eth1, and fp-eth2. eth0 is reserved for management, while the other interfaces are assigned to the Data Plane Development Kit (DPDK) fastpath. These interfaces are allocated for uplinks to top-of-rack (ToR) switches and for NSX-T Data Center overlay tunneling.

NSX Edge can connect to two transport zones—one for overlay and the other for north-south peering with external devices. These two transport zones define the limits of logical network distribution on the NSX Edge (see Figure 9):

  - Overlay Transport Zone. Any traffic that originates from a VM participating in an NSX-T Data Center domain might require reachability to external devices or networks. This is typically described as external north-south traffic. The NSX Edge node is responsible for decapsulating the overlay traffic received from compute nodes as well as encapsulating the traffic sent to compute nodes.

  - VLAN Transport Zone. In addition to the encapsulate or decapsulate traffic function, NSX Edge nodes also need a VLAN transport zone to provide uplink connectivity to the physical infrastructure.

Note: You must match the physical interfaces of an NSX-T Edge node with profiles you have previously created.

Transport Nodes (TNs) and virtual switches are the NSX components that carry data traffic. A TN is an NSX-compatible device that participates in traffic transmission and the NSX networking overlay. To serve as a transport node, a node must contain a hostswitch. NSX-V requires the use of a vSphere distributed virtual switch (VDS), as usual in vSphere; standard virtual switches cannot be used for NSX-V. NSX-T presumes that you have deployed an NSX-T virtual distributed switch (N-VDS): Open vSwitch (OVS) is used for KVM hosts, while VMware vSwitches are used for ESXi hosts.


N-VDS is a software NSX component on the transport node that performs traffic transmission. It is the primary component of the transport node’s data plane: it forwards traffic and owns at least one physical NIC. The N-VDS instances on different transport nodes are independent, but they can be grouped by assigning them the same name for centralized management.

Transport zones are available for both NSX-V and NSX-T; they define the limits of logical network distribution. Each transport zone is linked to its own NSX switch (N-VDS). Transport zones for NSX-T are not linked to clusters. Because of GENEVE encapsulation, there are two types of transport zone in VMware NSX-T: Overlay and VLAN. In VMware NSX-V, a transport zone defines the distribution limits of VXLAN only.

Figure 9. VMware NSX‑T Data Center transport zones. Source: docs.vmware.com/en/VMware-NSX-T-Data-Center/2.3/com.vmware.nsxt.install.doc/GUID-F47989B2-2B9D-4214-B3BA-5DDF66A1B0E6.html


VMware Enterprise PKS (Kubernetes)

VMware Enterprise PKS is a solution for deploying Kubernetes in multi-cloud environments. It simplifies Kubernetes cluster deployment with Day 1 and Day 2 operations support and manages container deployment from the application layer all the way to the infrastructure layer. The deployed Kubernetes cluster is available in its native form; there are no add-ons or proprietary extensions, and the native Kubernetes command-line interface (CLI) can be used.

VMware Enterprise PKS uses BOSH for instantiating, deploying, and managing Kubernetes clusters on a cloud platform. After the VMware Enterprise PKS solution is deployed on the Ops Manager Dashboard, users can provision Kubernetes clusters using the CLI and run container-based workloads on the clusters with the Kubernetes CLI (kubectl).
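The validated deployment procedure is covered in Appendix C; the sketch below only illustrates the general provisioning flow from an operator workstation. The cluster name, external hostname, and plan name are placeholders, and the PKS CLI is assumed to be already logged in (pks login):

# Illustrative provisioning flow only (the validated procedure is in Appendix C).
# Cluster name, external hostname, and plan name are placeholders.
import subprocess
from kubernetes import client, config

subprocess.run(["pks", "create-cluster", "analytics-k8s",
                "--external-hostname", "analytics-k8s.example.local",
                "--plan", "small"], check=True)
# Cluster creation is asynchronous; poll `pks cluster analytics-k8s` until it succeeds.
subprocess.run(["pks", "get-credentials", "analytics-k8s"], check=True)  # writes kubeconfig context

config.load_kube_config()
for node in client.CoreV1Api().list_node().items:
    print(node.metadata.name, node.status.node_info.kubelet_version)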

VMware Enterprise PKS Infrastructure

VMware Enterprise PKS consists of several components:

• Ops Manager is a graphical dashboard that deploys with BOSH. Ops Manager works with the BOSH Director to manage, configure, and upgrade Pivotal Cloud Foundry (PCF) products such as Pivotal Application Service (PAS), VMware Enterprise PKS, and PCF services and partner products. Ops Manager represents PCF products as tiles with multiple configuration panes that let you input or select configuration values needed for the product. Ops Manager generates BOSH manifests that contain the user-supplied configuration values and sends them to the Director. After you install Ops Manager and BOSH, you use Ops Manager to deploy almost all PCF products.

• BOSH is an open source tool that lets you run software systems in the cloud. BOSH and its infrastructure-as-a-service (IaaS) cloud provider interfaces (CPIs) are what enable PCF to run on multiple instances of infrastructure as a service (IaaS). VMware Enterprise PKS uses BOSH to run and manage Kubernetes container clusters. PKS is based on the Cloud Foundry Foundation’s open source Container Runtime (formerly Kubo) project.

• VMware Harbor Registry is an enterprise-class registry server that stores and distributes container images. Harbor allows you to store and manage images for use with VMware Enterprise PKS.

VMware Enterprise PKS Control Plane Overview

The VMware Enterprise PKS control plane enables users to deploy and manage Kubernetes clusters. It manages the lifecycle of Kubernetes clusters deployed using PKS. A dedicated command-line tool (the PKS CLI) is used to communicate with the VMware Enterprise PKS control plane; it allows users to:

• View cluster plans

• Create clusters

• View information about clusters

• Obtain credentials to deploy workloads to clusters

• Scale clusters

• Delete clusters

• Create and manage network profiles for VMware NSX-T
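For orientation, here is a minimal sketch of how those operations map to PKS CLI commands; the cluster name, plan name, and node count are placeholders rather than values from this reference architecture:

pks plans                                                                     # view cluster plans
pks create-cluster demo --external-hostname demo --plan small --num-nodes 3   # create a cluster
pks cluster demo                                                              # view information about the cluster
pks get-credentials demo                                                      # obtain credentials for kubectl
pks resize demo --num-nodes 5                                                 # scale the cluster
pks delete-cluster demo                                                       # delete the cluster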

VMware Enterprise PKS Control Plane Architecture
The VMware Enterprise PKS control plane is located on a single VM and includes the following components:
• PKS API server

• PKS Broker

• User Account and Authentication (UAA) server

Note: The PKS API Load Balancer is used for deployments without NSX-T. If the deployment is NSX-T-based, a destination NAT (DNAT) rule on the enterprise external router is configured for the PKS API host.


Components of the control plane interact with each other as shown in Figure 10.

Figure 10. VMware Enterprise PKS control plane. Source: docs.pivotal.io/runtimes/pks/1-4/control-plane.html


Virtual Infrastructure Overview for VMware Enterprise PKS with NSX‑T Workload Domains
The complete ecosystem of VMware Enterprise PKS deployed on VMware Cloud Foundation is illustrated in Figure 11. Note that you may have multiple workload domains, and thus multiple VMware Enterprise PKS instances, within a single VMware Cloud Foundation environment.

The hosts in the VMware Cloud Foundation workload domain provide resources for the VMware Enterprise PKS workloads; they host Kubernetes clusters deployed by VMware Enterprise PKS and the containerized applications that run on them. On VMware vSphere, Kubernetes clusters consist of a series of master and worker nodes that run as VMs. As defined within your cluster plans, Kubernetes clusters can reside either within or across physical vSphere clusters.

Cluster Workload Management
Since the Kubernetes version delivered by PKS is in its native form, the default kubectl command-line tool is used for managing the containers and deployments on the Kubernetes clusters. Full documentation for the tool can be found in the official Kubernetes documentation.
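For example, once the cluster credentials are in place (for instance, via pks get-credentials), day-to-day management is plain kubectl; the deployment name and container image below are placeholders used only for illustration:

kubectl get nodes                              # list the worker nodes provisioned by VMware Enterprise PKS
kubectl create deployment demo --image=nginx   # deploy a sample workload
kubectl get pods -o wide                       # check where the pods were scheduled
kubectl scale deployment demo --replicas=3     # scale the workload
kubectl delete deployment demo                 # clean up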

Networking in VMware Enterprise PKS
VMware Enterprise PKS relies on virtual networks that must be configured and ready prior to VMware Enterprise PKS deployment. VMware Enterprise PKS components are installed on a logical switch over a Tier-1 NSX-T router. Pod and node networks are configured over different Tier-1 routers. All Tier-1 routers must be connected to a physical network over Tier-0 routers with a proper uplink configuration (Figure 12).

Figure 11. Complete ecosystem of VMware Enterprise PKS deployed on VMware Cloud Foundation. Source: docs.vmware.com/en/VMware-Validated-Design/5.1/sddc-architecture-and-design-for-vmware-enterprise-pks-with-vmware-nsx-t-workload-domains/GUID-8F4D6F40-8126-4C41-952D-192A45AF5AF3.html

The figure shows a shared edge and compute cluster of ESXi transport nodes that hosts the Kubernetes workload clusters together with the PKS virtual infrastructure (PCF Ops Manager, BOSH Director, VMware Enterprise PKS, VMware Harbor Registry, and NSX-T Edges), plus additional compute clusters, all managed by the compute vCenter Server. A separate management cluster of ESXi transport nodes, managed by the management vCenter Server, hosts the virtual infrastructure management components (vCenter Servers, NSX-T Manager, NSX-V Manager, NSX-V Controller, NSX-V Edge, and other management applications). The clusters are joined by NSX-T transport zones, with N-VDS switching for compute and vDS for management, and connect to the internal SDDC network and the external network.

Figure 12. Logical diagram of a single VMware Enterprise PKS cluster deployment. Source: docs.vmware.com/en/VMware-Enterprise-PKS/1.3/vmware-enterprise-pks-13/GUID-nsxt-multi-pks.html

The figure shows two PKS foundations, A and B, alongside VMware vCenter, NSX-T Manager, and NSX-T Controllers on the vSphere management network. Each foundation has its own Tier-0 router connected to the physical L2/L3 switches (10.0.0.0/16 for foundation A and 20.0.0.0/16 for foundation B) and dedicated Tier-1 routers for the PKS control plane infrastructure (Ops Manager, BOSH, and the PKS API), shared services such as RabbitMQ and MySQL, the Kubernetes node network for master and worker nodes, Kubernetes namespaces, and the load balancers used for Kubernetes services and SNAT.


Integration of VMware Enterprise PKS, VMware Cloud Foundation, NSX‑T, and vSAN
VMware Cloud Foundation combines compute, storage, networking, security, and cloud management services, making it an ideal platform for running enterprise workloads and containerized applications. vSAN offers users flexibility to define policies on demand and delivers ease of management of storage for containers. Developers can consume storage as code by abstracting the complexity of the underlying storage infrastructure. With the help of NSX-T, end users no longer need to know the underlying network architecture. NSX-T can automatically create load balancers, routers, and switches for use by VMware Enterprise PKS.

The close integration of VMware Enterprise PKS, NSX-T, and vSAN based on a VMware Cloud Foundation network for containers in Kubernetes clusters makes it easy to manage ephemeral and persistent storage as well as to access vSAN’s availability and data service features. In addition, vSphere High Availability and vSphere Fault Tolerance can protect VMs from physical server failure. The combination of these technologies makes PKS on VMware Cloud Foundation a complete solution, perfect for Kubernetes administrators and developers.

Environment Provisioning
As mentioned in the “Solution Overview” section, the complete environment consists of three main products: VMware Cloud Foundation, VMware NSX-T, and VMware Enterprise PKS. The following sections describe how to provision these products.

Hardware and Software Requirements
Multiple requirements must be met before deploying the VMware Cloud Foundation platform. The complete details are listed in the VMware Cloud Foundation Planning and Preparation Guide.

There are specific requirements for the network (jumbo frames and 802.1Q tagging), network pools, VLANs and IP pools, hostnames, and IP addresses. Familiarize yourself with the “Planning and Preparation Guide” before taking the next steps. The entire setup relies heavily on multiple VLANs: each domain requires its own set of isolated VLANs for management, VMware vMotion, vSAN, VXLAN (NSX virtual tunnel end-point (VTEP)), and uplink. Read VMware’s “VLANs and IP Subnets” documentation for more details.

VMware Cloud Foundation Deployment
The VMware Cloud Foundation deployment process includes several steps after you obtain all necessary hardware components, install them in a rack, and provide the necessary power, cooling, and uplink connections to the data center infrastructure. First, you deploy the Cloud Builder VM, which can be used for imaging the ESXi software on the VMware Cloud Foundation servers (you may also install the ESXi OS manually). Then you download and complete the Deployment Parameter Sheet, and finally you initiate the VMware Cloud Foundation start-up process.

The entire deployment of VMware Cloud Foundation is described in the VMware Cloud Foundation Architecture and Deployment Guide – VMware Cloud Foundation 3.8 document, in the chapter “Deploying Cloud Foundation.”

Step 1: Deploy Cloud Builder VM
Cloud Builder VM is used to deploy VMware Cloud Foundation; it also includes the VMware Imaging Appliance (VIA), which can be used for imaging the ESXi servers. For details go to the “Software Requirements” chapter of the VMware Cloud Foundation Planning and Preparation Guide.

The detailed deployment procedure of Cloud Builder VM is available in VMware’s “Deploy Cloud Builder VM” documentation.

Step 2: Install ESXi Software on VMware Cloud Foundation Servers
Imaging ESXi servers using Cloud Builder VM (which is done with VIA) is optional. If you already have servers with a supported version of ESXi, you do not need to use VIA. You may also install ESXi manually on each machine. Using VIA has some advantages, as it not only installs ESXi, but also deploys any additional vSphere Installation Bundles (VIBs) and configures standard passwords across all machines.

Detailed information on how to prepare hosts and set up VIA for ESXi installation is available in the “Installing ESXi Software on Cloud Foundation Servers” chapter of the “Cloud Foundation Guide.”

Best Known Method
Be sure to add to the VIA Bundle any required or custom VIBs that you need. In most cases, those will be specific drivers for NICs or SSDs. VIBs can be added directly in the VIA Web interface by going to Bundle ➔ Modify VIBs. For the reference architecture described in this document, we added the following VIBs (a verification sketch follows the list):

• NIC driver and update tool for Intel Network adapter: i40en‑1.7.17‑1OEM.670.0.0.8169922.x86

• Intel Volume Management Device (Intel VMD) driver for NVMe: intel‑nvme‑vmd‑1.7.0.1000‑1OEM.670.0.0.8169922.x86_64

• Intel SSD Data Center Tool for updating NVMe firmware: intel_ssd_data_center_tool‑3.0.19‑400
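After imaging, one way to confirm that these VIBs landed on a host is to query them with esxcli; this is a sketch, and the name filters must match the VIBs you actually added:

esxcli software vib list | grep -iE "i40en|nvme-vmd|data_center_tool"   # confirm the custom VIBs are installed
esxcli software vib get -n i40en                                        # show full details for the NIC driver VIB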


Step 3: Download and Complete the Deployment Parameter Sheet
You import the parameter spreadsheet (Excel file) during the VMware Cloud Foundation start-up process. Before you begin the start-up process, collect all the information needed to configure network connectivity, including a list of VLANs, network addresses, and uplinks, and put that data into the Excel file. You should also plan for the DNS infrastructure and subdomain name reserved for VMware Cloud Foundation.

A detailed description of all the necessary fields is available in the VMware Cloud Foundation Architecture and Deployment Guide “About the Deployment Parameter Sheet.”

Best Known Method
Be sure that all the passwords you provide meet the password complexity criteria.

Step 4: Start Up VMware Cloud Foundation
Once the imaging process is completed and the parameter sheet is ready, move to the final phase: starting up VMware Cloud Foundation.

Best Known Method
At this point, we recommend making a snapshot of the Cloud Builder VM. In case of a failed start-up, you will be able to quickly restore the VM and start a fresh process instead of reinstalling the whole Cloud Builder.

During start up, SDDC Manager, all vCenter Servers, NSX Managers and Controllers, Platform Services Controller (PSC), vSAN, and vRealize Log Insight components are deployed, creating the management domain of VMware Cloud Foundation. The process takes about two hours.

After the start-up process is complete you should see a notification with a link to the new SDDC Manager Web interface. The SDDC Manager interface is accessible through a standard web browser.

The complete description of the VMware Cloud Foundation start-up process is included in VMware’s “Initiate the Cloud Foundation Bring-Up Process” documentation.

The management workload domain is now created and contains all the components needed to manage the infrastructure. It will also host additional components like VMware Enterprise PKS or NSX-T, which will be installed later. You should not deploy any user applications on this management cluster. Instead, create one or more workload domains that comprise a separate vSphere cluster with vSAN and NSX preinstalled and configured, along with a dedicated instance of vCenter Server for management purposes. All instances of vCenter Server and NSX Manager are deployed on the management domain.

NSX‑T Installation
Installation of NSX-T usually requires deploying NSX Manager, deploying three NSX controllers, creating an NSX controller cluster, installing VIBs (kernel modules) on ESXi hosts, and installing NSX Edge VMs. When using VMware Cloud Foundation, most of these steps are executed automatically during workload domain deployment; however, the NSX Edge VMs must be deployed and configured manually. PKS workload domains have some additional requirements, which are discussed later, in the “Configure the Remaining NSX-T Resources for VMware Enterprise PKS” section.

NSX‑T Workload Domain Creation
You create an NSX-T workload domain using the Virtual Infrastructure (VI) Configuration wizard that is accessible from the SDDC Manager Web interface. During several of the wizard steps you will specify the storage, name, compute, and NSX platform details for the new workload domain. Each workload domain has its own vCenter Server instance for isolation and easier updates without impacting other domains.

Prerequisites for an NSX‑T‑based Workload Domain
Since we are using NSX-T, the following additional considerations and requirements must be met before a workload domain can be created:

• The NSX-T binaries must be downloaded locally and uploaded to the SDDC Manager. Refer to the “Download Bundles” section of the VMware Cloud Foundation product documentation for details. The NSX-T bundle on the SDDC Manager is required because, for the first NSX-T workload domain to be created, the SDDC Manager deploys a cluster of three NSX-T Manager VMs in the management domain. All subsequent NSX-T workload domains share this NSX-T Manager cluster.

• An NSX-T workload domain needs an NSX Edge VM to enable overlay and public networks for north-south traffic. NSX Edge VMs are NOT deployed during workload domain creation; you need to deploy them manually after the workload domain is created. Further details on how to deploy and configure an NSX Edge VM can be found in VMware’s “Deploy and Configure NSX Edges in Cloud Foundation” documentation.

• A DHCP server must be configured on the VXLAN VLAN of the management domain. VXLAN VTEPs for the workload domain are assigned IP addresses by the DHCP server.

Complete details are available in VMware’s “Prerequisites for a Workload Domain” documentation.


Creating a Workload Domain
Creating a new workload domain on VMware Cloud Foundation is a fairly easy process because it is controlled and orchestrated by SDDC Manager, which installs and configures all the needed components (including vCenter Server, vSAN, and NSX). It then performs all the steps needed to integrate these components, including creating new VLANs on the switches and reconfiguring the ports leading to selected servers. To deliver the new workload domain, the end user must simply use the VI Configuration wizard and provide all necessary data.

VMware’s “Start the VI Configuration Wizard” documentation provides a complete guide for creating a new workload domain. Use the following step-by-step guide only as a reference and familiarize yourself with the official documentation before starting your deployment.

Best Known Method
Ensure all ESXi hosts accept SSH connections. Traffic from the SDDC Manager must be allowed. If you provisioned ESXi servers using VIA and then migrated the hosts to another management network, you will most likely need to modify the network firewall rules on the ESXi servers to add the new management subnet and the SDDC Manager IP.

Update DNS records. For the first workload domain, you will need to create entries for the three nsxt01/02/03 machines as well as for the NSX-T cluster virtual IP that will be used, and then redirect traffic to the three machines in the Management Domain VM subnet. You also need to create a dedicated entry for vCenter and the Edge VM.
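A quick way to confirm the records resolve before starting the wizard is shown below; the domain suffix and the host names other than nsxt01/02/03 are placeholders for whatever names you chose:

for host in nsxt01 nsxt02 nsxt03 nsxt-vip vcenter-wld01 edge01; do
  nslookup ${host}.example.local
done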

After all creation tasks are completed, go to the newly created vCenter server for the workload domain and restart each ESXi server to resolve the “vSAN requires host restart” notification.

The time needed to create a workload domain depends on the server configuration and the requested infrastructure size. In the example environment, the process took about one hour and 40 minutes to provision the complete infrastructure of a workload domain with four servers. Depending on system and network configuration, the environment can therefore be provisioned in as little as two hours, instead of the several days such a deployment took before the automation provided by SDDC Manager in VMware Cloud Foundation was available.

Moreover, because the whole process is automated, there is a much lower risk of a misconfiguration that can often occur during manual installation. Such configuration errors could lead to serious issues or additional delays in infrastructure provisioning.

Later, you can use SDDC Manager to add new servers to an existing workload domain as needed. And when you no longer need the infrastructure used previously by the workload domain, you can remove it using SDDC Manager and return the unused servers to the pool of available resources.

Adding NSX‑T Edge Host to the Newly Created Workload Domain
NSX Edges are needed to enable overlay virtual infrastructure and public networks for north-south traffic. NSX Edges are not deployed automatically.

Before you read the documentation linked to in the following sections, keep in mind that VMware Cloud Foundation automates some of the configuration during the workload domain creation process. You may expect that the NSX-T logical switches, both transport zones (overlay-tz-... and vlan-tz-... ) and some segments for management, vSAN, and vMotion will be already present in the environment as you proceed through the documented steps.

Create the Transport Zones for System and Overlay Traffic
In this step you create transport zones for overlay traffic, uplink, and ESXi management. To create them, follow the procedure in VMware’s “Create the Transport Zones for System and Overlay Traffic” documentation.

Create Uplink Profiles and the Network I/O Control
Uplink profiles set the policies for the links from ESXi hosts to NSX-T segments and from NSX Edge nodes to the top-of-rack switches. Later, these profiles will be matched with physical NICs of the NSX-T Edge VM. To create profiles, follow the procedure documented in VMware’s “Create Uplink Profiles and the Network I/O Control Profile” documentation.

Create the NSX‑T Segments for System, Uplink, and Overlay Traffic
Segments connect nodes that send VLAN and overlay traffic. Create them using the procedure in VMware’s “Create the NSX-T Segments for System, Uplink, and Overlay Traffic” documentation.

Configure Dynamic Routing in the Shared Edge and Compute Cluster
Dynamic routing is needed to enable communication between tenant workloads by using application virtual networks in NSX-T and to connect tenant workloads to the external network. The complete procedure is described in VMware’s “Configure Dynamic Routing in the Shared Edge and Compute Cluster” documentation.

Note: When you join the NSX-T Edge node to the management plane, you must provide the name/IP of the NSX-T controller that was used to obtain the thumbprint ID of the certificate. When logging in using the NSX-T Manager domain name, you are always redirected to one of three NSX-T controllers.


Configure the Remaining NSX‑T Resources for VMware Enterprise PKS
To properly configure the networking infrastructure to host VMware Enterprise PKS, additional manual steps are needed. The following sections describe the required actions.

Adding IP Pools
PKS requires two special IP pools. First, create an IP pool for the floating Kubernetes network. It must be a routable network and will be used to access the Kubernetes cluster. The other IP pool is for the VTEPs; it will be used for the GENEVE tunnels between ESXi nodes and NSX-T Edge nodes. The tunnel endpoints are the source and destination IP addresses used in the external IP header to uniquely identify the hypervisor hosts originating and terminating the NSX-T Data Center encapsulation of overlay frames. Alternatively, DHCP may be used for tunnel endpoint IP addresses.

To create an IP pool, open NSX-T Manager and go to Advanced Networking & Security ➔ Inventory ➔ Groups ➔ IP Pools, then click the Add button.

Adding IP Blocks
Two IP blocks must be present for Kubernetes: one for nodes (PKS-NODES) and one for pods (PKS-PODS). To add them, open the NSX-T Manager and go to Advanced Networking & Security ➔ Networking ➔ IPAM, then click the Add button.

Configuring Switches for Overlay
1. Create a logical switch for PKS management traffic on the overlay transport zone (name it PKS-MGMT).
2. Create a logical switch for PKS service traffic on the overlay transport zone (name it PKS-SERVICE).

Adding Tier‑0 Routers
1. Create and configure a Tier-0 gateway and connect it to the external routable segment.
2. Open NSX-T Manager, go to Advanced Networking & Security ➔ Networking ➔ Routers, then click + ➔ Tier‑0 Router.
3. In the new router, go to Configuration ➔ Logical Router Ports, and add a connection to the external routable segment (the Transport Node) ➔ Uplink.

Enabling Route Redistribution on the Tier‑0 Router
1. Open the NSX-T Manager and go to Advanced Networking & Security ➔ Networking ➔ Routers ➔ <Your Tier‑0 Router> ➔ Routing ➔ Route Redistribution ➔ Add.
2. In the new submenu, check all of the Sources (leave the Route Map empty) and click Add.

Creating and Configuring Tier‑1 Routers
1. In the Routing tab that was used to create the Tier-0 router, click Add Tier‑1 Router and specify the Tier-0 router created earlier, the edge cluster, and the edge transport node.
2. Configure the downlink ports to the PKS Management and Service logical switches. Go to the configuration section of the new Tier-1 router and click Router Ports ➔ Add. Select the name, downlink type, and logical switch for the particular router, and define the uplink address for the logical network.

Enabling Route Advertisement for Each Router
1. Configure Route Advertisement for each of the Tier-1 routers.
2. In the Routing/Route Advertisement settings, select Edit, enable it, and then check all “Advertise Routes” options.

VMware Enterprise PKS Solution Deployment on an NSX‑T Workload Domain

Prepare the Environment
From network.pivotal.io, download the images for Pivotal Ops Manager, Pivotal Container Services, and VMware Harbor.

The complete procedure is described on the “Deploying VMware PKS on VMware Cloud Foundation” community page.

Deploy Ops Manager
For the complete Ops Manager deployment description, follow the steps described in the “Deploying Ops Manager with NSX-T for Enterprise PKS” documentation.

Here is a brief description of the process:

1. Deploy the Open Virtualization Format (OVF) template of Ops Manager to vSphere:
a. Select the storage type.
b. Choose a network.
c. Provide the admin password, choose a hostname, and then provide the DNS, default gateway, NTP server, and management IP address that Ops Manager should have after installation.

2. After deployment is finished, access the Ops Manager using the previously configured management IP address and configure the authentication system for the Web interface.

Replace NSX Certificates
For information on the NSX Manager Superuser Principal Identity, refer to “Generating and Registering the NSX Manager Superuser Principal Identity Certificate and Key.”

For information on generating and registering the NSX Manager certificate, refer to “Generating and Registering the NSX Manager Certificate for PKS.”


Configure BOSH Director for vSphere
Follow the steps in the “Configure BOSH Director for vSphere” section in Chapter 3 of the “Deploying VMware PKS on VMware Cloud Foundation” documentation.

Here is a brief description of the process:
1. Log in to the Ops Manager Web interface and click the BOSH Director tile.

2. Provide information about vCenter for this workload domain: IP address, admin username and password, and so on.

3. Provide information about NSX networking for this workload domain: IP address, admin username, and password.

4. Provide NSX CA Certificate info in BOSH.

5. Provide NTP server information and create availability zones.

6. Provide information about the networks created in the “Configure the Remaining NSX-T Resources for VMware Enterprise PKS” section earlier in this document.

7. Match the PKS-MGMT availability zone with the management network where the BOSH Director will be deployed. Note that it gets the first IP address after the specified reserved range.

8. Proceed to the main page and start the deployment by selecting Review Pending Changes, and then Apply Changes.

Install VMware Enterprise PKS
General guidance from Pivotal is available in the “Installing Enterprise PKS on vSphere with NSX-T” documentation.

Here is a brief description of the process:
1. Log in to the Ops Manager Web interface.

2. Select Import Product and upload the downloaded image of VMware Enterprise PKS. After that, you should be able to configure VMware Enterprise PKS.

3. Select the previously created network as PKS Management Network; the Service network will be created for you by VMware Enterprise PKS.

4. Specify the FQDN of the PKS API and create that record in your DNS server, then choose to generate an RSA certificate for *.cluster.pks.example (wildcard domain).

5. Configure Plans, which are sizing flavors of Kubernetes clusters.

6. Set the Kubernetes Cloud Provider as vSphere and provide vCenter details for this workload domain.

7. On the Networking tab, specify the NSX-T Manager and provide super-user certificates (which were created in the “Replace NSX Certificates” section) that will be used by VMware Enterprise PKS to automatically create new networks and resources.

8. Provide the IDs of the NSX-T resources created for VMware Enterprise PKS: the Pods and Nodes IP blocks, the Floating IP pool, and the Tier-0 router created in the “Configure the Remaining NSX-T Resources for VMware Enterprise PKS” section earlier in this document.

9. Apply pending changes and start the deployment, which takes about an hour.

Set up the Harbor Registry
Follow the steps in the “VMware Harbor Registry” documentation.

Install the VMware Enterprise PKS CLI Client
For instructions on installing the VMware Enterprise PKS CLI client, see “Installing the PKS CLI.”

Summary
With the need for high-performance data analytics and AI on the rise in an increasingly digital world, enterprises seek flexible solutions that can run traditional data analytics and AI applications on-premises or in the public cloud. The Hybrid Cloud Data Analytics Solution combines the best of Intel hardware, VMware virtualization software, and Pivotal’s Kubernetes management expertise. With this end-to-end solution that is ready to deploy, enterprises are poised to run both their traditional data analytics workloads and the AI and machine-learning workloads of the future.

At the same time, be aware that data warehousing workloads are latency- and I/O-sensitive, while analytics workloads are CPU- and memory-intensive. This reference architecture has been validated to meet expected key performance indicators in demanding customer workload scenarios, such as completing the 99th percentile of requests within 5 ms for SQL workloads and achieving nearly linear throughput improvement with additional resources for AI inference workloads, proving excellent scalability.

Find the solution that is right for your organization. Contact your Intel representative or visit Intel Select Solutions for VMware vSAN.

Learn More
You may find the following resources useful:

• Intel Data Center Blocks for Cloud – vSAN ReadyNodes

• Intel Optane DC Persistent Memory “Memory Mode” Virtualized Performance Study

• Deep Learning Reference Stack

• 2nd Generation Intel Xeon Scalable processors

• Intel Select Solutions for VMware vSAN ReadyNodes

• Intel Solid State Drives Data Center Family

• Intel Distribution of OpenVINO Toolkit

• Intel Deep Learning Boost

• Intel Framework Optimizations

• Intel Optane DC persistent memory

• VMware vSAN


Appendix A – Solution Features Validation and Benchmarking
This appendix provides information about running machine-learning and deep-learning workloads on the Hybrid Cloud Data Analytics Solution and presents some benchmarking results. For instructions on reproducing the benchmark tests, see Appendix B – Deep-Learning Experiment Setup Instructions.

Machine Learning
The platform makes it easy to run machine-learning workloads, as demonstrated by running the H2O scorer. Follow these instructions to train a machine-learning model and perform inferencing/scoring on new data using the model.

1. Install H2O Driverless AI (refer to the “Installing H2O Driverless AI” section in Appendix B).

2. Train the machine-learning model. From the H2O Driverless AI graphical user interface, choose a new experiment. Upload or import a dataset to train. H2O Driverless AI comes with a few sample data sets for training; alternatively, you could import a dataset from your local file system or from a database using a JDBC connector. In our example, we use the creditcard.csv sample dataset to demonstrate training. A target feature/column is selected for the model to be trained on. In this case, the Default on Next Payment field is selected. After adjusting a few seed parameters like accuracy, time, and interpretability, the training job can be started. The Progress dial on the user interface shows the progress of training. Once training is completed, the user interface presents an option to download the trained model/scorer pipeline.

3. Download, containerize, and publish the scorer (refer to the “Preparing H2O Scorer Artifacts” section in Appendix B, up to and including the “Publishing the Docker Image” subsection).

4. Prepare a Kubernetes environment to deploy the scorer (refer to the “Deploying in Kubernetes” section in Appendix B).

5. Perform inferencing or scoring using new data.

A single scorer service can host multiple models. The model ID is required to use the model; the ID can be read from the model details in the H2O Driverless AI user interface, or obtained by listing the models supported by a given service:

curl -XGET -H "Content-Type: application/json" <<scorer-service-url>>/models

Prepare a JSON object containing the new data points to be inferenced/scored, with all the values required by the model. For the current example, here is a sample JSON object saved as scorer-request.json:

{
  "fields": ["LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE", "PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6", "BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6", "PAY_AMT1", "PAY_AMT2", "PAY_AMT3", "PAY_AMT4", "PAY_AMT5", "PAY_AMT6"],
  "includeFieldsInOutput": ["LIMIT_BAL"],
  "rows": [
    ["20000", "2", "2", "1", "24", "-2", "2", "-1", "-1", "-2", "-2", "3913", "3102", "689", "0", "0", "0", "0", "689", "0", "0", "0", "0"],
    ["10", "2", "2", "1", "24", "-99", "-100", "0", "0", "0", "0", "3013", "4102", "989", "0", "0", "0", "0", "89", "1", "1", "1", "1"]
  ]
}

Then using a REST client, the model in the Kubernetes environment is queried for inferences using a request as shown here:

curl -XPOST -H "Content-Type: application/json" -d @scorer-request.json <<scorer-service-url>>/models/<<model id>>/score

Deep Learning
Because TensorFlow is one of the most popular deep-learning frameworks, it was a natural choice for validating the deep-learning capabilities of the Hybrid Cloud Data Analytics Solution. In addition, the OpenVINO toolkit can be used to develop and deploy vision-oriented solutions on Intel architecture-based platforms. It allows developers to import models (including TensorFlow models) and prepare them for optimal execution with the Deep Learning Inference Engine.

The validation and benchmarking of this feature was done for the deep-learning inferencing workload using the standard models ResNet50 v1 and Inception v3, pre-trained on the Imagenet dataset at FP32 precision. The objective was to maximize the throughput of the model on this solution. (For details of the benchmark setups, refer to the “Running the TensorFlow Benchmark” and “Running the OpenVINO Benchmark” sections in Appendix B – Deep-Learning Experiment Setup Instructions.)

To achieve maximum throughput and utilization of system resources, we recommend the following:

• Run multiple instances of the model with one instance deployed per VM.

• Vary VM resource allocation to find appropriate VM sizing for optimal throughput from the system.

• Vary batch sizes and streams of inference requests to the model.


Deep‑Learning Benchmark – TensorFlow Results
All benchmarks were done on a VMware Enterprise PKS cluster deployed with the Hybrid Cloud Data Analytics Solution on the Base and Plus configurations. Each VMware Enterprise PKS node uses 16 vCPUs. The batch size defines the number of samples that will be propagated through the network. The nireq value defines the number of inference requests for pipelined mode. The fps value is the number of frames per second analyzed.

Software optimizations are crucial to deep-learning performance. Figure A1 shows the performance improvements gained by using the Intel Optimization for TensorFlow, which takes advantage of the Intel Math Kernel Library (Intel MKL) and the Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN). Throughput was measured in a single VM with 16 virtual CPUs (vCPUs). The measured data has been normalized to a batch size of 1, using the stock version of TensorFlow as the baseline. For a batch size of 1, the optimized version of TensorFlow delivered 1.36x the throughput of the stock version; for a batch size of 128, the gain was even greater, up to 3.64x.9

As described previously, the OpenVINO toolkit can run trained models from other frameworks like TensorFlow. The OpenVINO toolkit converts the models into an internal representation that takes advantage of Intel architecture and also performs model optimizations like layer fusion to further improve the inference performance of the model. Figure A2 shows the relative performance of the stock version of TensorFlow and OpenVINO for maximum throughput (achieved by measuring throughput with various batch sizes ranging from 1 to 512 and taking the maximum of all those values) in a single VM with 16 vCPUs, running on the Base configuration. OpenVINO delivered a phenomenal performance improvement of up to 2.85x, without retraining the model.10

Tables A1 and A2 describe the software and system set-up details. See Table A1 for the full description of the Base configuration of the Hybrid Cloud Data Analytics Solution.

Figure A1. Performance gains from the Intel Optimization for TensorFlow.

The chart compares ResNet50 v1 throughput for stock TensorFlow and the Intel Optimization for TensorFlow on one VM with 16 vCPUs, normalized to stock TensorFlow at batch size 1. At batch size 1, the optimized build reaches 1.36x the baseline; at batch size 128, the stock build reaches 1.99x and the optimized build 3.64x.

Figure A2. Maximum throughput of stock TensorFlow compared to OpenVINO toolkit.

The chart compares maximum ResNet50 v1 throughput on one VM with 16 vCPUs, normalized to stock TensorFlow: TensorFlow is the 1x baseline and the OpenVINO toolkit reaches 2.85x.

Table A1. Software Specifications

COMPONENT               DETAIL
OS                      Ubuntu
OS Build                4.15.0-50-generic
Compiler                GCC 4.8.5
Workload and Version    ResNet50: TensorFlow, OpenVINO toolkit
Frameworks and Version  OpenVINO toolkit 2019 R1.0.0.094; intelaipg/intel-optimized-tensorflow:1.14-rc0-devel-mkl
Other Software          Intel MKL-DNN v0.18
Dataset                 Synthetic

Table A2. Virtual Hardware Specifications

COMPONENT     DETAIL
vCPUs per VM  16
vRAM          16 GB
vNIC          VMXNET3


Appendix B – Deep‑Learning Experiment Setup Instructions
The experiments described in Appendix A are deployed on a VMware Enterprise PKS Kubernetes cluster. To reproduce them, the following are required:

• A Kubernetes cluster (refer to Appendix C – Deploying a Kubernetes Cluster Using VMware Enterprise PKS)

• Access to the Kubernetes cluster (refer to “Accessing a Kubernetes Cluster with the kubectl CLI” in Appendix C)

The following sections present information on preparing particular experiments.

Installing H2O Driverless AI
H2O Driverless AI is an AI platform for automated machine learning. During the development of this reference architecture, H2O Driverless AI was used to train a scoring model and produce running scorer code. The first step is to install the binaries as follows:

1. Confirm that the prerequisites are met.

2. Obtain the binaries; version 1.7.0 was used for this experiment.

3. Proceed with the installation instructions.

Preparing H2O Scorer Artifacts
The H2O scorer is a RESTful web service that performs inference via HTTP requests. To deploy the model to the cluster, the following elements are required:

• A trained model with its MOJO runtime downloaded.

• A JAR file containing a generic Java scorer implementation, based on Spring Boot.

• A Docker image built and pushed to a Docker registry.

• Kubernetes descriptors for the deployment.

These elements are discussed in the following sections.

MOJO Runtime
After training the model, you can download MOJO (Model ObJect, Optimized) using the H2O Driverless AI user interface. The key element here is pipeline.mojo, which contains the complete definition of the model.

Generic Java Scorer Implementation
The scorer (local-rest-scorer) is part of the Driverless AI Deployment Templates project. Use the build instructions to produce the required JAR file. For this experiment, we used local-rest-scorer-0.2.6-SNAPSHOT.jar.
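A minimal build sketch, assuming the scorer comes from the H2O.ai dai-deployment-templates repository and its standard Gradle build (the repository URL, task name, and version suffix are assumptions):

git clone https://github.com/h2oai/dai-deployment-templates.git
cd dai-deployment-templates
./gradlew :local-rest-scorer:build   # produces local-rest-scorer/build/libs/local-rest-scorer-<version>.jar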

Building a Docker Image
Put your license key file, MOJO runtime, REST scorer JAR file, and the following Dockerfile into one directory:

FROM clearlinux/openjdk
COPY pipeline.mojo /dai/pipeline.mojo
COPY local-rest-scorer-0.2.6-SNAPSHOT.jar /dai/local-rest-scorer.jar
CMD ["/usr/bin/java", \
     "-Dmojo.path=/dai/pipeline.mojo", \
     "-Dai.h2o.mojos.runtime.license.file=/dai/license.key", \
     "-jar", "/dai/local-rest-scorer.jar"]

You can build and tag the image like this:

docker build -t shortdemo/h2o-dai-rest-scorer:0.0.2 -t shortdemo/h2o-dai-rest-scorer:latest .

To verify that the image works, you can start a container (exposing port 8080):

docker run -it -p 8080:8080 shortdemo/h2o-dai-rest-scorer:latest

and issue a “get available models” request:

curl -XGET -H "Content-Type: application/json" http://localhost:8080/models

It should return a JSON array with exactly one model ID.

Publishing the Docker Image
When running in production, the Docker image is typically pushed to a private Docker repository. For the purpose of this experiment, we demonstrate how to publish the image to a public hub:

docker push shortdemo/h2o-dai-rest-scorer:0.0.2
docker push shortdemo/h2o-dai-rest-scorer:latest


Deploying in Kubernetes
Deploying in Kubernetes requires the following steps:
1. Create a secret that contains the H2O license key.
2. Create a deployment for the REST scorer web service on the Kubernetes cluster (scorer.deployment.yaml manifest, given below).

Note that the Kubernetes clusters must be able to pull the required Docker image (h2o-dai-rest-scorer).

These steps are described below.

CREATING THE SECRET
Using H2O Driverless AI requires a license key. To safely store the license, you can use a Kubernetes secret. To create the secret from an existing license key file, use the following command:

kubectl create secret generic dai --from-file=./license.key

CREATING A DEPLOYMENT
The following manifest (scorer.deployment.yaml) can be used for creating the desired deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: scorer-deployment
  labels:
    app: scorer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: scorer
  template:
    metadata:
      labels:
        app: scorer
    spec:
      containers:
      - name: scorer
        image: shortdemo/h2o-dai-rest-scorer:0.0.2
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
        volumeMounts:
        - mountPath: "/dai/license.key"
          name: dai-volume
          subPath: license.key
      volumes:
      - name: dai-volume
        secret:
          secretName: dai

The script assumes the following:

• The license key is stored in a secret called dai.

• The Docker image is published under shortdemo/h2o-dai-rest-scorer:0.0.2.

To use the script, issue the following command:

kubectl apply -f scorer.deployment.yaml

Validating the Scorer
If the Kubernetes deployment is successful, it can be exposed as a service. A single scorer service can host multiple models. Listing the supported models is a way of validating the availability of the service:

curl -XGET -H "Content-Type: application/json" <<scorer-service-url>>/models
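As a minimal sketch of how <<scorer-service-url>> can be obtained (the service name, type, and port are illustrative assumptions):

kubectl expose deployment scorer-deployment --name scorer-service --type LoadBalancer --port 8080 --target-port 8080
kubectl get service scorer-service   # the EXTERNAL-IP column provides the host portion of <<scorer-service-url>>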


Running the TensorFlow Benchmark
You first need to use SSH to connect to the Ops Manager machine for Pivotal Cloud Foundry (PCF).

Then you must install Helm according to the available documentation.
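Because the helm install commands later in this document use the --name flag, a Helm 2.x release is assumed; a minimal installation sketch (the exact version is an assumption, and RBAC-enabled clusters may additionally need a Tiller service account) looks like this:

wget -q https://get.helm.sh/helm-v2.14.3-linux-amd64.tar.gz
tar -xzf helm-v2.14.3-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm
helm init       # Helm 2 only: installs the Tiller component into the cluster
helm version    # verify that both the client and Tiller respond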

When Helm is ready, run the helm create tf-benchmark command to create a Helm Chart for deployment of the TensorFlow benchmark.

Inside the tf-benchmark directory, edit the following files and copy/paste the content:

values.yaml:

image:
  repository: intelaipg/intel-optimized-tensorflow
  tag: 1.14-rc0-devel-mkl

Chart.yaml:

apiVersion: v1
appVersion: "1.0"
description: A Helm chart for Kubernetes
name: tf-benchmark
version: 0.1.0

templates/job.yaml (you need to adjust the CPU values to the maximum available in your environment):

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    role: tf-benchmark
  name: tf-benchmark
spec:
  template:
    metadata:
      labels:
        role: tf-benchmark
      name: tf-benchmark
    spec:
      containers:
      - name: tf-benchmark
        image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
        imagePullPolicy: Always
        resources:
          requests:
            cpu: "16"
            memory: "8Gi"
          limits:
            cpu: "16"
            memory: "8Gi"
        command:
        - "bash"
        args:
        - "/usr/local/bin/job.sh"
        volumeMounts:
        - name: usr-local-bin
          mountPath: "/usr/local/bin/job.sh"
          subPath: job.sh
      restartPolicy: Never
      volumes:
      - name: usr-local-bin
        configMap:
          name: tf-benchmark-job


templates/configmap.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: tf-benchmark-job
data:
  job.sh: |
    #!/bin/bash
    mkdir /tf && cd /tf
    apt install -y numactl google-perftools
    git clone https://github.com/tensorflow/models.git tf-models
    git clone https://github.com/IntelAI/models.git intel-models
    wget -q https://storage.googleapis.com/intel-optimized-tensorflow/models/resnet50v1_5_int8_pretrained_model.pb
    wget -q https://zenodo.org/record/2535873/files/resnet50_v1.pb
    wget -q https://storage.googleapis.com/intel-optimized-tensorflow/models/inceptionv3_int8_pretrained_model.pb
    wget -q https://storage.googleapis.com/intel-optimized-tensorflow/models/inceptionv3_fp32_pretrained_model.pb
    echo "resnet50v1_5_int8";
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/resnet50v1_5_int8_pretrained_model.pb --model-name resnet50v1_5 --framework tensorflow --precision int8 --mode inference --batch-size=1 --benchmark-only
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/resnet50v1_5_int8_pretrained_model.pb --model-name resnet50v1_5 --framework tensorflow --precision int8 --mode inference --batch-size=16 --benchmark-only
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/resnet50v1_5_int8_pretrained_model.pb --model-name resnet50v1_5 --framework tensorflow --precision int8 --mode inference --batch-size=32 --benchmark-only
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/resnet50v1_5_int8_pretrained_model.pb --model-name resnet50v1_5 --framework tensorflow --precision int8 --mode inference --batch-size=64 --benchmark-only
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/resnet50v1_5_int8_pretrained_model.pb --model-name resnet50v1_5 --framework tensorflow --precision int8 --mode inference --batch-size=128 --benchmark-only
    echo "resnet50v1_5_fp32";
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/resnet50_v1.pb --model-name resnet50v1_5 --framework tensorflow --precision fp32 --mode inference --batch-size=1 --benchmark-only
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/resnet50_v1.pb --model-name resnet50v1_5 --framework tensorflow --precision fp32 --mode inference --batch-size=16 --benchmark-only
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/resnet50_v1.pb --model-name resnet50v1_5 --framework tensorflow --precision fp32 --mode inference --batch-size=32 --benchmark-only
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/resnet50_v1.pb --model-name resnet50v1_5 --framework tensorflow --precision fp32 --mode inference --batch-size=64 --benchmark-only
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/resnet50_v1.pb --model-name resnet50v1_5 --framework tensorflow --precision fp32 --mode inference --batch-size=128 --benchmark-only
    echo "inceptionv3_int8";
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/inceptionv3_int8_pretrained_model.pb --model-name inceptionv3 --framework tensorflow --precision int8 --mode inference --batch-size 1 --benchmark-only
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/inceptionv3_int8_pretrained_model.pb --model-name inceptionv3 --framework tensorflow --precision int8 --mode inference --batch-size 16 --benchmark-only
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/inceptionv3_int8_pretrained_model.pb --model-name inceptionv3 --framework tensorflow --precision int8 --mode inference --batch-size 32 --benchmark-only
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/inceptionv3_int8_pretrained_model.pb --model-name inceptionv3 --framework tensorflow --precision int8 --mode inference --batch-size 64 --benchmark-only
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/inceptionv3_int8_pretrained_model.pb --model-name inceptionv3 --framework tensorflow --precision int8 --mode inference --batch-size 128 --benchmark-only
    echo "inceptionv3_fp32";
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/inceptionv3_fp32_pretrained_model.pb --model-name inceptionv3 --framework tensorflow --precision fp32 --mode inference --batch-size 1 --benchmark-only
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/inceptionv3_fp32_pretrained_model.pb --model-name inceptionv3 --framework tensorflow --precision fp32 --mode inference --batch-size 16 --benchmark-only
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/inceptionv3_fp32_pretrained_model.pb --model-name inceptionv3 --framework tensorflow --precision fp32 --mode inference --batch-size 32 --benchmark-only
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/inceptionv3_fp32_pretrained_model.pb --model-name inceptionv3 --framework tensorflow --precision fp32 --mode inference --batch-size 64 --benchmark-only
    python /tf/intel-models/benchmarks/launch_benchmark.py --in-graph /tf/inceptionv3_fp32_pretrained_model.pb --model-name inceptionv3 --framework tensorflow --precision fp32 --mode inference --batch-size 128 --benchmark-only

When you have prepared the tf-benchmark Helm Chart, you can run it on the Kubernetes cluster by running:

helm install --namespace default --name tf-benchmark tf-benchmark/

After this command, the Kubernetes job is spawned on the cluster. To obtain the benchmark output, run (adjust the name of your pod in the command):

kubectl logs tf-benchmark-kzcnc -f


Running the OpenVINO Benchmark
You first need to use SSH to connect to the Ops Manager machine for PCF. Then you must install Helm according to the available documentation.

Build the Docker image based on this file:

openvino.Dockerfile:

FROM centos:7.6.1810
RUN yum install -y epel-release
RUN yum install -y ansible git git-lfs wget which jq cmake3 gcc-c++ sudo python36 python36-devel python36-setuptools && \
    easy_install-3.6 pip
RUN git clone --depth=1 -b 2019_R1.1 https://github.com/opencv/dldt.git /root/dldt \
    && cd /root/dldt && git submodule init && git submodule update --recursive \
    && /root/dldt/inference-engine/install_dependencies.sh -y \
    && wget -q -P /root/dldt https://github.com/intel/mkl-dnn/releases/download/v0.18/mklml_lnx_2019.0.3.20190220.tgz && \
    tar -xzvf /root/dldt/mklml_lnx_2019.0.3.20190220.tgz
RUN mkdir -p /root/dldt/inference-engine/build && cd /root/dldt/inference-engine/build && \
    cmake3 -DGEMM=MKL -DMKLROOT=/root/dldt/mklml_lnx_2019.0.3.20190220 -DENABLE_MKL_DNN=ON -DTHREADING=OMP -DCMAKE_BUILD_TYPE=Release .. \
    && make --jobs=$(nproc --all)
RUN git clone --depth=1 https://github.com/opencv/open_model_zoo.git /root/open_model_zoo
RUN pip3 install -r /root/open_model_zoo/tools/downloader/requirements.in && \
    pip3 install -r /root/dldt/model-optimizer/requirements.txt
RUN mkdir -p /root/models && cd /root/models && \
    /root/open_model_zoo/tools/downloader/downloader.py --name resnet-50 && \
    /root/open_model_zoo/tools/downloader/converter.py --name resnet-50 --mo /root/dldt/model-optimizer/mo.py && \
    wget -q -P /root https://software.intel.com/sites/default/files/managed/f5/63/president_reagan-62x62.png
WORKDIR /root
CMD ["/bin/bash"]

Upload the built Docker image to the Docker Registry.
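For example (the registry path is a placeholder; use a registry your cluster can pull from, such as the Harbor registry deployed earlier):

docker build -f openvino.Dockerfile -t <your-registry>/openvino-benchmark:latest .
docker push <your-registry>/openvino-benchmark:latest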

When Helm is ready, run the helm create openvino-benchmark command to create a Helm Chart for deployment of the OpenVINO benchmark.

Inside the openvino-benchmark directory, edit the following files and copy/paste the content:

values.yaml:

image:
  repository: <FILL_WITH_LOCATION_TO_BUILT_IMAGE>
  tag: latest

Chart.yaml:

apiVersion: v1
appVersion: "1.0"
description: A Helm chart for Kubernetes
name: openvino-benchmark
version: 0.1.0


templates/job.yaml (you need to adjust CPU values to the maximum in your environment):

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    role: openvino-benchmark
  name: openvino-benchmark
spec:
  template:
    metadata:
      labels:
        role: openvino-benchmark
      name: openvino-benchmark
    spec:
      containers:
      - name: openvino-benchmark
        image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
        imagePullPolicy: Always
        resources:
          requests:
            cpu: "16"
            memory: "8Gi"
          limits:
            cpu: "16"
            memory: "8Gi"
        command:
        - "bash"
        args:
        - "/usr/local/bin/job.sh"
        volumeMounts:
        - name: usr-local-bin
          mountPath: "/usr/local/bin/job.sh"
          subPath: job.sh
      restartPolicy: Never
      volumes:
      - name: usr-local-bin
        configMap:
          name: openvino-benchmark-job

templates/configmap.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: openvino-benchmark-job
data:
  job.sh: |
    #!/bin/bash
    echo "batch_size, nireq, fps, latency"
    for batch_size in 1 16 32 64 128
    do
      for nireq in 1 2 4 8 16
      do
        benchmark_app_output="$(/root/dldt/inference-engine/bin/intel64/Release/benchmark_app -stream_output -api async -d CPU -m /root/models/classification/resnet/v1/50/caffe/FP32/resnet-50.xml -l /root/dldt/inference-engine/bin/intel64/Release/lib/libcpu_extension.so -i /root/president_reagan-62x62.png -b ${batch_size} -nireq ${nireq} -niter 100)"
        latency="$(echo $benchmark_app_output | sed 's/.*Latency: *\([0-9]*\.[0-9]*\) *ms.*/\1/')"
        if [[ -z "${latency}" ]]; then
          latency='-1'
        fi
        fps="$(echo $benchmark_app_output | sed 's/.*Throughput: *\([0-9]*\.[0-9]*\) *FPS.*/\1/')"
        if [[ -z "${fps}" ]]; then
          fps='-1'
        fi
        echo "${batch_size}, ${nireq}, ${fps}, ${latency}"
      done
    done

When you have prepared the openvino-benchmark Helm Chart, you can run it on the Kubernetes cluster by running:

helm install --namespace default --name openvino-benchmark openvino-benchmark/

After this command, the Kubernetes job is spawned on the cluster. To get the benchmark output, run (adjust the name of your pod in the command):

kubectl logs openvino-benchmark-kzcnc -f


Appendix C – Deploying a Kubernetes Cluster Using VMware Enterprise PKS
You complete several steps to create a working instance of a Kubernetes cluster provisioned by VMware Enterprise PKS. You will need an active Plan in VMware Enterprise PKS; then you will use that plan to provision the Kubernetes cluster using the VMware Enterprise PKS CLI. Finally, you will set up your credentials for the kubectl tool to manage your cluster.

Creating a Kubernetes Plan
When VMware Enterprise PKS is deployed, it comes with a default PKS Plan 1, which is a default template for creating a simple Kubernetes cluster. If you have specific requirements for the amount of resources your Kubernetes cluster should have, create your own plan:

1. Log in to PCF Ops Manager ➔ VMware Enterprise PKS.
2. From the left menu panel, explore Plans 1 through 10 and find the first one that is inactive.

3. Pick the Active option; a new set of options will appear. You will need to provide several parameters; the mandatory fields are as follows:
a. Name of the new Plan – this will be used by the VMware Enterprise PKS CLI when creating a new cluster
b. Description
c. Number and size of the Kubernetes master and etcd instances
d. Size of the persistent storage disk for the master
e. Availability zone for the master and etcd
f. Maximum number of workers for the cluster
g. Number and size of Kubernetes nodes
h. Size of the persistent storage disk for nodes
i. Availability zone for the nodes

4. Click Save.

5. Go back to the Ops Manager main page and click the Review Pending Changes on the right.

6. Click Apply Changes next to the VMware Enterprise PKS to deploy the new settings. The process may take up to 15 minutes.

7. Once the changes are complete, you can use your new plan with the CLI. The new plan should be visible when executing the pks plans command.
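For example, after logging in with the PKS CLI (as described in the next section), list the available plans with:

pks plans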

Provisioning a Kubernetes Cluster
The PKS CLI is used for creating the cluster. Follow these steps to create a cluster:

1. Obtain the user account and authentication (UAA) Admin Password from the PCF Ops Manager for the given workload domain:
a. Go to the PCF Ops Manager Web interface.
b. From the Installation Dashboard, go to Enterprise PKS ➔ Credentials ➔ Uaa Admin Password.
c. You will see something similar to this:

{"credential":{"type":"secret","value":{"secret":"hBf-aZ5p7U2JPQhhXYmP2cJdf1LMK6GO"}}}

d. Copy the secret hash.

2. Use SSH to log in as the ubuntu user to the Ops Manager machine where the PKS and kubectl CLIs are installed. For example:

ssh ubuntu@<Ops Manager address>

3. Log in to PKS using the secret hash from step 1:

pks login -a pks-w2.vdap.local -u admin -p <secret hash for Uaa Admin Password> -k

4. You can now use the PKS CLI to manage your deployments.

5. Create a new cluster using the predefined plan:

pks create-cluster <cluster name> --external-hostname <cluster name> --plan <name of the plan> --num-nodes <number of worker nodes>

For example:

pks create-cluster MLtest --external-hostname MLtest --plan small --num-nodes 20

6. You can monitor the process of provisioning the cluster with:

pks cluster <cluster name>
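The command returns a point-in-time status. If the watch utility is available on the Ops Manager host (an assumption, not part of the original procedure), you can poll the status periodically, for example:

watch -n 30 pks cluster MLtest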


7. Once the cluster has been created, the pks cluster command should show output similar to the following:

ubuntu@pcf-manager02-vdap-local:~$ pks cluster P1
Name:                     P1
Plan Name:                testpriv
UUID:                     12e5249d-2f26-4774-b29a-00a74e8420eb
Last Action:              CREATE
Last Action State:        succeeded
Last Action Description:  Instance provisioning completed
Kubernetes Master Host:   P1
Kubernetes Master Port:   8443
Worker Nodes:             4
Kubernetes Master IP(s):  10.116.0.10
Network Profile Name:

Optional: BOSH Tool Usage
At some point you may need to debug the environment using the BOSH tool. Enable the BOSH tool by following these steps:

1. Go to the PCF Ops Manager Web interface. Go to Installation Dashboard ➔ BOSH Director for vSphere ➔ Credentials ➔ Bosh Commandline Credentials. Note the credentials. You should see something similar to this:

{"credential":"BOSH_CLIENT=ops_manager BOSH_CLIENT_SECRET=UYeqHs_Vn6zFYJUc2FUgQKnzupdgmtpO BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate BOSH_ENVIRONMENT=10.113.0.2 bosh "}

2. Set all the listed environment variables in your local session in order to use the bosh command:

export BOSH_CLIENT=ops_manager; export BOSH_CLIENT_SECRET=UYeqHs_Vn6zFYJUc2FUgQKnzupdgmtpO; export BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate; export BOSH_ENVIRONMENT=10.113.0.2
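With the variables set, a quick sanity check (an optional step, assuming the BOSH CLI is installed on this host) is to list the VMs managed by the BOSH Director; the output also shows the deployment IDs used later in this appendix:

bosh vms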

Accessing a Kubernetes Cluster with the kubectl CLI
You will need a kube configuration file before you can use the kubectl CLI. To obtain this file, follow these steps:

1. Generate the file using PKS:

KUBECONFIG=<path to the kube config file> pks get-credentials <pks cluster name>

For example:

KUBECONFIG=~/Plus20.kubeconfig pks get-credentials Plus20

2. Edit the newly created kube configuration file and replace the cluster name with the Kubernetes Master Node IP in the server: https:// section. You can do it manually or use sed:

sed -i s/<cluster name>:8443/`pks cluster <cluster name> | grep "Kubernetes Master IP" | awk '{print $4}'`:8443/ <name of the kube config file>

For example:

sed -i s/Plus20n:8443/`pks cluster Plus20n | grep "Kubernetes Master IP" | awk '{print $4}'`:8443/ Plus20n.kubeconfig

After the modification, the kube configuration file in the “clusters: cluster: server:” field should contain an IP address and NOT the fully qualified domain name (FQDN) address, followed by the :8443 port number:

    certificate-authority-data: <...truncated certificate data...>
    server: https://10.116.0.2:8443
  name: P1
contexts:
- context:
    cluster: Plus20
    user: d0c3e8d4-0444-48a7-bb6c-67050e8c7387
  name: P1

3. Export the modified kube config file to a global variable, as in the following example:

export KUBECONFIG=~/Plus20.kubeconfig

4. You can now use kubectl to access the Kubernetes cluster. Switch the KUBECONFIG variable to another file to access other clusters. You may also pull the cluster credentials to a single configuration file using the command pks get-credentials <cluster name> and then just use kubectl config use-context <cluster name> to switch between multiple clusters.
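For example, assuming two clusters named Plus20 and MLtest (names used here only for illustration), you could switch to the second cluster and verify access like this:

pks get-credentials MLtest
kubectl config use-context MLtest
kubectl get nodes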


Upgrading VM Compatibility to Enable Intel AVX‑512
In the current version of the software, VMs for a Kubernetes cluster are created with a compatibility mode that doesn’t allow for utilization of newer CPU capabilities like Intel AVX-512. In order to use these features, the following procedure is required:

1. Connect to the Ops Manager machine (using SSH for PCF).

2. Use the bosh vms command to find your PKS cluster deployment ID and the names of its VMs. Once you have the deployment ID, stop the worker nodes with the following command:

bosh -d <DEPLOYMENT_ID> stop worker

3. Stop the master node:

bosh -d <DEPLOYMENT_ID> stop master

4. Go to VMware vSphere and find the VMs that are correlated with your deployment ID, based on the output of the previously issued bosh vms command.

5. Highlight all the VMs in vSphere and right-click on the highlighted VMs. Choose Compatibility ➔ Schedule VM Compatibility Upgrade … then choose Compatible with: ESXi 6.7 Update 2 and later from the dropdown menu and click OK. After this you need to restart the VMs, so right-click on the highlighted VMs and choose Power ➔ Reset. Wait a few moments while all VMs reboot correctly.

6. Next, you need to start the master node with the command:

bosh -d <DEPLOYMENT_ID> start master

and worker nodes with:

bosh -d <DEPLOYMENT_ID> start worker

After completing these steps, VM compatibility will have been successfully upgraded.
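As an optional check that is not part of the original procedure, you can log in to one of the worker VMs (for example, with bosh -d <DEPLOYMENT_ID> ssh worker/0) and confirm that the guest OS now exposes the AVX-512 CPU flags:

grep -o 'avx512[a-z0-9]*' /proc/cpuinfo | sort -u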


Appendix D – Data Warehouse Benchmarking
For data warehouse benchmarking, we used HammerDB 3.2 as the client and multiple instances of Microsoft SQL Server 2017 as the system under test. Both sides ran on Windows Server 2019 Datacenter Edition with up-to-date patches. Detailed software versions are provided in Table D1; the Microsoft SQL Server configuration is described in Table D2; and the HammerDB option settings are provided in Table D3.

The HammerDB client VM had eight vCPUs and 10 GB of memory. There were separate clients for the Base and Plus configurations—each located within the same cluster as the system under test but separated from VMs where Microsoft SQL Server was deployed.

The HammerDB TPC-C test was performed in several iterations using data sets of 1,000, 2,000, 3,000, 4,000, and 5,000 warehouses. Each test scenario used a number of virtual users equal to 1/10 the number of warehouses; for example, 400 users were used for the 4,000-warehouse data set. Each test consisted of five minutes of warm-up followed by a 15-minute stress test. Tests were repeated three times.

The best result achieved with the Base configuration was for the experiment using 2,000 warehouses. The best result achieved with the Plus configuration was for the experiment using 3,000 warehouses. The other experiments did not meet the SLA latency requirements from the CPU Time:Requests or CPU Time:Total perspective.

Table D1. Detailed Software Versions

COMPONENT               CONFIGURATION
OS and Build            Windows Server 2019 Datacenter – 17763.rs5_release.180914-1434
Workload and Version    Microsoft SQL Server 2017 14.0.3223.3
Other Software          HammerDB 3.2

Table D2. Microsoft SQL Server VM Configuration

VIRTUAL HARDWARE             BASE CONFIGURATION (2 VMs EACH)                  PLUS CONFIGURATION (2 VMs EACH)
vCPUs                        20                                               24
vRAM                         180 GB                                           360 GB
vNIC                         VMXNET3                                          VMXNET3
Disk Controllers             3x VMware Paravirtualized SCSI – PVSCSI          3x VMware Paravirtualized SCSI – PVSCSI
                             (database data disks, tempdb, transaction disks) (database data disks, tempdb, transaction disks)
Database Data Disks          4x 250 GB, 4 files per disk                      4x 250 GB, 4 files per disk
Disks for tempdb Data        2x 32 GB, 4 files per disk                       2x 32 GB, 4 files per disk
Tempdb Log Data Disk         1x 50 GB                                         1x 50 GB
Transaction Log Data Disk    1x 100 GB                                        1x 100 GB

Table D3. Detailed HammerDB Option Settings

DRIVER SCRIPT SETTING        VALUE                  VIRTUAL USER SETTING     VALUE
TPC-C Driver Script          Timed Driver Script    Virtual Users            1/10 the number of warehouses
Exit on SQL Server Error     True                   User Delay               500
Keying and Thinking Time     False                  Repeat Delay             500
Checkpoint when Complete     True                   Iterations               1
Minutes to Rampup Time       5                      Show Output              True
Minutes to Test Duration     15                     Log Output to Temp       True
Use All Warehouses           False                  Use Unique Log Name      True
Time Profile                 False                  No Log Buffer            False
                                                    Log Timestamps           True

Test Procedure
1. Restore the clean database.
2. Check the SQL parameters.
3. Restart Microsoft SQL Server.
4. Prepare the snapshot for the SQL latency KPI.
5. Configure HammerDB parameters.
6. Start HammerDB.
7. Get KPI metrics.
8. Get HammerDB and latency logs.


1 Forbes, January 2018, “83% of Enterprise Workloads Will Be In The Cloud By 2020.” forbes.com/sites/louiscolumbus/2018/01/07/83-of-enterprise-workloads-will-be-in-the-cloud-by-2020/#52cba6206261

2 Intel. “Product Brief: Intel Optane SSD DC P4800X Series.” intel.com/content/www/us/en/solid-state-drives/optane-ssd-dc-p4800x-brief.html Based on internal Intel testing.

3 Intel Ethernet 700 Series includes extensively tested network adapters, accessories (optics and cables), hardware, and software along with broad operating system support. A full list of the product portfolio’s solutions is available at intel.com/ethernet. Hardware and software are thoroughly validated across Intel Xeon Scalable processors and the networking ecosystem. The products are optimized for Intel architecture and a broad OS ecosystem: Windows, Linux kernel, FreeBSD, RHEL, SUSE, Ubuntu, Oracle Solaris, and VMware ESXi.

4 Intel Ethernet 700 Series network adapters are backed with global support infrastructure for customers pre- and post-sales.
5 Intel Ethernet 700 Series network adapters’ supported connections and media types are direct-attach copper and fiber SR/LR (QSFP+, SFP+, SFP28, XLPPI/CR4, 25G-CA/25G-SR/25G-LR), twisted-pair copper (1000BASE-T/10GBASE-T), and backplane (XLAUI/XAUI/SFI/KR/KR4/KX/SGMII). Note that Intel is the only vendor offering the QSFP+ media type.

6 Intel Ethernet 700 Series network adapters’ supported speeds include 1 GbE, 10 GbE, 25 GbE, and 40 GbE.
7 Testing done by Intel. See Table 1 for complete Base configuration information. See Table A1 and Table A2 in Appendix A for software and virtual hardware specifications. Performance results are based on testing as of August 20, 2019 and may not reflect all publicly available security updates.

8 Testing conducted on ISV software comparing Intel Xeon Platinum 8260 processor to 2S Intel Xeon Gold 6248 processor. Performance results are based on testing as of August 20, 2019 and may not reflect all publicly available security updates. See Appendix D – Data Warehouse Benchmarking for full benchmarking methodology details.

9 See endnote 7.
10 See endnote 7.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn more at intel.com, or from the OEM or retailer.

No product or component can be absolutely secure. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit intel.com/benchmarks.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Intel is a sponsor and member of the Benchmark XPRT Development Community, and was the major developer of the XPRT family of benchmarks. Principled Technologies is the publisher of the XPRT family of benchmarks. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases.

Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation in the U.S. and/or other countries. Other names and brands may be claimed as the property of others. © Intel Corporation 1019/RCNO/KC/PDF 340881-001US