Processing Large Datasets for ADAS applications using Spark


Transcript of Processing Large Datasets for ADAS applications using Spark

1

© 2020 The MathWorks, Inc.

Processing Large Datasets for ADAS applications using Spark

Arvind Hosagrahara ([email protected]), Lead Solutions Architect, May 2021

2

Outline

§ The Big Picture – Problem Statement / Challenges

§ Architecture of the system (ETL, Training, Inference)

§ Deep-Dive into Spark Workflows
– Cluster Management
– Interactive exploration
– Pushdown of analytics to Spark Clusters
– Experiment and metadata management

§ Design considerations
– Scaling, Accuracy and Interoperability
– Security, Governance, Model Lifecycle management

§ Conclusions / Key Takeaways

§ Q&A

3

The Big Picture – Automated Driving

How can I analyze & simulate scenarios?

How can I design & deploy algorithms?

How can I integrate & test systems?

(Control / Planning / Perception)

4

Semantic Segmentation

What?
§ Semantic segmentation is the classification of every pixel in an image or video (a minimal example follows below).

Why?
§ Semantic segmentation: higher-quality perception outcomes, easier analysis and validation of localization modules for AD.
§ Automation: faster design iterations
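A minimal illustration of "every pixel gets a class": the sketch below assumes a previously trained segmentation network net and a hypothetical frame file; semanticseg and labeloverlay are standard Computer Vision / Image Processing Toolbox calls, everything else is a placeholder.

% Classify every pixel of one camera frame with a trained network (sketch).
I = imread('frame0001.png');     % placeholder H-by-W-by-3 camera frame
C = semanticseg(I, net);         % categorical label per pixel (H-by-W)
B = labeloverlay(I, C);          % overlay the labels for quick inspection
imshow(B), title('Per-pixel class labels')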

5

“…autonomous vehicles would have to be driven hundreds of millions of miles and sometimes hundreds of billions of miles to demonstrate their reliability in terms of fatalities and injuries.” [1]

[1] Driving to Safety, RAND Corporation, https://www.rand.org/content/dam/rand/pubs/research_reports/RR1400/RR1478/RAND_RR1478.pdf

6

Problem Statement

How can we:

1. Accelerate development of semantic segmentation models from prototype to production in order to build better perception software

2. Scale the workflows, capabilities and throughput of engineering processes by leveraging automation, cloud storage, compute and other infrastructure

7

Challenges

Business Challenges
§ Prototype to production: establish system & software engineering processes
§ Verification and validation: connect data, embedded software and simulation
§ Workforce development: hire & integrate engineers with disparate backgrounds

Technical Challenges
§ Handling large datasets and specialized formats
§ Leveraging the cloud to scale computational requirements
§ Enabling self-serve analytics for the data scientist / engineer

8

Demonstration / Results (image-level features with batch normalization)

9

10

Demo / Results: 3-D simulation data to train a semantic segmentation network and fine-tune it to real-world data using generative adversarial networks (GANs)

11

12

Big Picture: Architecture

13

Architecture of the demo system


14


15


16

A simplified view of the system architecture

[Architecture diagram: Recorded and Simulated Data (Batch and Streaming Ingest); Storage (Large Datasets / Models / Metadata); Development Environment (Desktop / Cloud); Cloud-based training system (Spark Clusters / GPU support / Fast networks) with Burst capacity; Model Inference Engine for ETL and Performance Monitoring; Monitoring / Visualization Front-ends; Ingest and Data access paths; Experiment Management / Governance / Security spanning all components]

17

The ETL (Extract Transform Load) Pipeline

[ETL pipeline highlighted in the architecture: Recorded and Simulated Data (Batch and Streaming Ingest), Storage (Large Datasets / Models / Metadata), and the Model Inference Engine for ETL and Performance Monitoring, under Experiment Management / Governance / Security]

18

The Training Pipeline

[Training pipeline highlighted in the architecture: Storage (Large Datasets / Models / Metadata), Development Environment (Desktop / Cloud), and the Cloud-based training system (Spark Clusters / GPU support / Fast networks), with Burst capacity and Data access, under Experiment Management / Governance / Security]

19

The Inference Pipeline

[Inference pipeline highlighted in the architecture: Storage (Large Datasets / Models / Metadata), the Model Inference Engine for ETL and Performance Monitoring, and Monitoring / Visualization Front-ends, under Experiment Management / Governance / Security]

20


21

The Software Stack

[Software stack overlaid on the architecture: MATLAB® and Simulink® for the development environment; MATLAB Production Server™ for the inference engine; MATLAB Web App Server™ and third-party dashboards for the monitoring / visualization front-ends; Amazon S3™ for storage; AWS™ with Identity and Access Management (IAM) for security; MLflow™ for experiment management]

22

The DevOps perspective

23

Workflows supporting DevOps

[DevOps loop spanning the Development Environment (Plan, Design, Build, Test) and the Operational Environment (Deploy, Operate, Monitor, Feedback), mapped to the following workflows:]

§ Cluster Management
§ JDBC-based connectivity to data sources
§ Databricks-connect and interactive exploration
§ CI/CD integration
§ Jobs and Spark-Submit
§ Model Lifecycle Management
§ Experiment Management
§ Coder-based acceleration
§ Embedded Applications
§ Inference engine
§ Central result, config and data management

24

Deep-Dive: Spark Workflows

25

Workflow: Cluster management and Data Engineering via REST

[Workflow diagram: engineers, developers, scientists and data analysts drive a compute cluster (cloud/on-prem) from MATLAB (jobsubmission.m) through an HTTP client against the Databricks REST 2.0 API, covering:]

§ Create Clusters
§ CRUD with DBFS
§ Create Jobs
§ Authentication / Tokens
§ Secrets
§ Notifications
§ Libraries
§ Workspace

[The cluster runs a Databricks-specific runtime image with the MATLAB Runtime provisioned through an init_script or a Dockerfile (dockerized runtime) to execute the compiled analytics.jar.]
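The sketch below outlines what a jobsubmission.m-style script can look like against the Databricks REST 2.0 API using MATLAB's webwrite. The workspace URL, runtime version, node type, paths and token handling are placeholder assumptions, and error handling is omitted.

% Hedged sketch: create a cluster and push the compiled app to DBFS over REST.
host  = 'https://my-workspace.cloud.databricks.com';   % placeholder workspace
token = getenv('DATABRICKS_TOKEN');                    % personal access token
opts  = weboptions('HeaderFields', {'Authorization', ['Bearer ' token]}, ...
                   'MediaType', 'application/json');

% 1) Create a cluster (Clusters API: POST /api/2.0/clusters/create)
clusterSpec = struct( ...
    'cluster_name',  'adas-etl-cluster', ...
    'spark_version', '7.3.x-scala2.12', ...            % placeholder runtime
    'node_type_id',  'i3.xlarge', ...                  % placeholder node type
    'num_workers',   8);
resp      = webwrite([host '/api/2.0/clusters/create'], clusterSpec, opts);
clusterId = resp.cluster_id;                           % reused by later job/library calls

% 2) Upload analytics.jar to DBFS (DBFS API: POST /api/2.0/dbfs/put).
%    The single-shot put suits small files; large jars use the
%    create / add-block / close sequence of the same API.
fid = fopen('analytics.jar', 'r');
jarBytes = fread(fid, inf, '*uint8');
fclose(fid);
payload = struct('path',      'dbfs:/apps/analytics.jar', ...
                 'contents',  matlab.net.base64encode(jarBytes'), ...
                 'overwrite', true);
webwrite([host '/api/2.0/dbfs/put'], payload, opts);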

26

Video 1 Outline

§ Planning capacity and performance
§ Cluster creation
§ Data and application transfer via the DBFS API

27

Workflow: JDBC and Database toolbox (a)

[Workflow diagram: engineers, developers, scientists and data analysts connect from MATLAB Database Toolbox through ODBC 2.6.x / JDBC 2.6.x drivers to an existing compute cluster (cloud/on-prem) and issue SQL queries.]

28

Workflow: JDBC and Database toolbox (b)

[Workflow diagram: SQL submitted through the ODBC 2.6.x / JDBC 2.6.x driver is executed by a SparkSession on the existing compute cluster (cloud/on-prem) against databases / Delta tables.]
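As a sketch of the SQL path, the snippet below assumes the Databricks (Simba) JDBC 2.6.x driver has already been registered as a data source named DatabricksJDBC; table and column names are illustrative only.

% Hedged sketch: query a Delta table over JDBC with Database Toolbox.
conn = database('DatabricksJDBC', '', '');     % token-based auth configured in the data source
sqlquery = ['SELECT frame_id, camera, capture_time ' ...
            'FROM adas.frame_metadata WHERE camera = ''front_center'''];
metadata = fetch(conn, sqlquery);              % results come back as a MATLAB table
close(conn)
head(metadata)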

29

Workflow: Databricks-connect

[Workflow diagram: engineers, developers, scientists and data analysts exercise the Spark API from the desktop through Databricks-connect, which forwards the Spark DAG to the Spark master on an existing compute cluster (cloud/on-prem).]

30

Video 2 Outline

§ Databricks-Connect workflows
§ Read/Write Delta tables
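One way to sketch the interactive path: with databricks-connect configured on the desktop and its Spark JARs on the MATLAB Java class path, the Spark API can be driven through MATLAB's Java interface. This is an assumption-heavy illustration rather than the only integration option; the Delta table path is a placeholder.

% Hedged sketch: drive the Spark API through databricks-connect from MATLAB.
spark = org.apache.spark.sql.SparkSession.builder().getOrCreate();

df = spark.read().format('delta').load('dbfs:/delta/frame_metadata');  % remote Delta table
disp(df.count())                             % executed on the remote cluster
sample = df.limit(10).toJSON().collect();    % bring back a small sample only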

31

The Training Pipeline

32

The Training Pipeline (Local Experimentation)

§ Optimized for GPU
§ High-level API ("write less and do more")
§ Language abstractions to improve accuracy (e.g. imageDataAugmenter)
§ Workflows for ease of use and debugging (a minimal training sketch follows below)
– apps for model selection
– monitoring of accuracy, loss and training performance
– many quick-start examples
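A minimal local-experimentation sketch: datastore locations, class names and the layer graph lgraph are placeholders; imageDataAugmenter, pixelLabelImageDatastore, trainingOptions and trainNetwork are the standard Deep Learning / Computer Vision Toolbox calls.

% Hedged sketch: augment, train and monitor a segmentation model locally.
imds = imageDatastore('data/images');                            % camera frames
pxds = pixelLabelDatastore('data/labels', ...
           ["Road" "Vehicle" "Pedestrian" "Background"], 1:4);   % per-pixel labels

augmenter    = imageDataAugmenter('RandXReflection', true, ...
                                  'RandXTranslation', [-10 10]);
trainingData = pixelLabelImageDatastore(imds, pxds, 'DataAugmentation', augmenter);

opts = trainingOptions('sgdm', ...
    'InitialLearnRate',     1e-3, ...
    'MaxEpochs',            30, ...
    'MiniBatchSize',        8, ...
    'ExecutionEnvironment', 'gpu', ...              % optimized for local GPUs
    'Plots',                'training-progress');   % monitor accuracy and loss

net = trainNetwork(trainingData, lgraph, opts);     % lgraph: segmentation network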

33

Workflow: Push-down via MATLAB Compiler

[Workflow diagram: analytics.m, written against the MATLAB API for Spark or tall arrays, is packaged with MATLAB Compiler / Compiler SDK into analytics.jar; the jar is pushed to the compute cluster (cloud/on-prem) either as a spark-submit job for new clusters or as a library for existing clusters, where the Spark master executes the resulting Spark DAG.]
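A hedged sketch of what the compiled analytics.m can look like when written against the MATLAB API for Spark (RDD abstraction); the master URL, dataset location, key extraction and partition count are placeholders, and the tall-array route is an alternative to the same end.

% Hedged sketch: analytics.m using the MATLAB API for Spark, later packaged
% into analytics.jar for spark-submit or library attachment.
conf = matlab.compiler.mlspark.SparkConf( ...
          'AppName', 'adasFrameStats', ...
          'Master',  'yarn');                 % placeholder master URL
sc = matlab.compiler.mlspark.SparkContext(conf);

lines  = sc.textFile('dbfs:/data/frame_index.csv');
% Count frames per recording id (first CSV column) as a key/value reduction
pairs  = lines.map(@(l) {extractBefore(l, ','), 1});
counts = pairs.reduceByKey(@(a, b) a + b, 4);  % 4 partitions (placeholder)
result = counts.collect();                     % small result back to the driver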

34

Video 3 Outline

§ SparkSession and RDD abstractions
§ Tall-based workflows

§ Deploytool and app-based workflows
– Compiler SDK for Java

§ mcc – Spark Submit (see the packaging sketch below)

§ DBFS API to push the application to Databricks
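For the packaging step, the command below is a hedged sketch of the mcc Spark target; the exact 'Spark:' bundle string and options should be checked against the MATLAB Compiler documentation for the release in use.

% Hedged sketch: compile analytics.m into analytics.jar for spark-submit.
mcc -vCW 'Spark:adasFrameStats' analytics.m

% The resulting analytics.jar can then be pushed to DBFS (see the REST sketch
% earlier) and referenced by a spark-submit task or attached as a cluster library.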

35

Build and Test

§ Improve quality of code and models by testing on CI Servers (self-hosted or Cloud-hosted)

§ Package libraries for deployment

§ Adhere to agile development practices

https://www.mathworks.com/solutions/continuous-integration.html
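A small, hedged example of what the CI step can run: execute the project's unit tests and emit JUnit-style XML that most CI servers ingest; folder and file names are placeholders.

% Hedged CI sketch: run unit tests and publish JUnit-style results.
import matlab.unittest.TestRunner
import matlab.unittest.plugins.XMLPlugin

suite  = testsuite('tests');                               % discover tests
runner = TestRunner.withTextOutput();
runner.addPlugin(XMLPlugin.producingJUnitFormat('test-results.xml'));
results = runner.run(suite);
assert(all([results.Passed]), 'Unit tests failed')         % fail the CI job on any failure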

36

The Training Pipeline (Cloud-based training)

§ Infrastructure defined as code
§ Leverage the latest hardware, optimized drivers and toolkits
§ APIs for command and control of jobs, artifacts, notifications, execution, etc. (see the sketch below)
§ Capital Expenses (CapEx) → Operating Expenses (OpEx)
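As a hedged illustration of "APIs for command and control", the sketch below creates and launches a Databricks job that spark-submits the compiled training application; the workspace URL, cluster sizing, main class and jar path are placeholders.

% Hedged sketch: launch a cloud training run through the Databricks Jobs API.
host = 'https://my-workspace.cloud.databricks.com';          % placeholder
opts = weboptions('HeaderFields', ...
         {'Authorization', ['Bearer ' getenv('DATABRICKS_TOKEN')]}, ...
         'MediaType', 'application/json');

jobSpec = struct( ...
    'name',        'adas-training-run', ...
    'new_cluster', struct('spark_version', '7.3.x-gpu-ml-scala2.12', ...  % placeholder GPU runtime
                          'node_type_id',  'g4dn.xlarge', ...
                          'num_workers',   4), ...
    'spark_submit_task', struct('parameters', ...
        {{'--class', 'TrainingApp', 'dbfs:/apps/analytics.jar'}}));       % placeholder main class

job     = webwrite([host '/api/2.0/jobs/create'], jobSpec, opts);
runInfo = webwrite([host '/api/2.0/jobs/run-now'], struct('job_id', job.job_id), opts);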

37

Management of Experiments and tracking of performance

38

Video 4 Outline

§ MLflow Demo (see the sketch below)
– Creation of Experiments / Runs
– Logging of Parameters and Metrics
– Integration with Experiment Management features
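A hedged sketch of the MLflow integration over its REST API: the tracking server URL is a placeholder, an authorization header would be added for a hosted tracking server, and the experiment is created once and reused afterwards.

% Hedged sketch: create an MLflow run and log a parameter and a metric.
mlflow = 'https://my-tracking-server/api/2.0/mlflow';      % placeholder URL
opts   = weboptions('MediaType', 'application/json');
nowMs  = @() round(posixtime(datetime('now', 'TimeZone', 'UTC')) * 1000);

expInfo = webwrite([mlflow '/experiments/create'], ...
                   struct('name', 'semantic-segmentation'), opts);   % create once
runInfo = webwrite([mlflow '/runs/create'], ...
                   struct('experiment_id', expInfo.experiment_id, ...
                          'start_time', nowMs()), opts);
runId = runInfo.run.info.run_id;

webwrite([mlflow '/runs/log-parameter'], ...
         struct('run_id', runId, 'key', 'miniBatchSize', 'value', '8'), opts);
webwrite([mlflow '/runs/log-metric'], ...
         struct('run_id', runId, 'key', 'validationAccuracy', ...
                'value', 0.92, 'timestamp', nowMs()), opts);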

39

Training Benchmarks and Performance

VGG-16 based network on NVIDIA Quadro GV100

40

The Inference Pipeline

41

The Inference Pipeline

§ Directed Acyclic Graphs (DAG) used to resolve changes to:
– Training models, ETL changes
– The training data
– New data / changes to the cache

§ Trained models can be deployed as cloud services (a minimal deployable function is sketched below)

§ Cloud Reference Architectures available on github.com to integrate with existing systems
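A minimal sketch of a deployable inference function, assuming the trained network is shipped as a MAT-file inside the deployable archive; this is the kind of function that MATLAB Compiler SDK packages for MATLAB Production Server.

function labels = segmentFrame(rgbFrame)
% segmentFrame  Hedged sketch of a deployable perception inference function.
% The MAT-file name is a placeholder; the network is loaded once per worker.
persistent net
if isempty(net)
    data = load('trainedSegmentationNet.mat');   % shipped with the archive
    net  = data.net;
end
C = semanticseg(uint8(rgbFrame), net);           % categorical H-by-W labels
labels = uint8(double(C));                       % numeric class indices for transport
end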

42

Scaling Inference on the MATLAB Production Server

§ Scales gracefully
§ Optimal usage of available CPU and GPU resources
§ Can increase throughput as desired
§ Pay-as-you-go (PAYG) models available on the cloud
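A hedged client-side sketch against the MATLAB Production Server RESTful API (JSON payload with nargout and rhs); host, port and archive name are placeholders, and for large frames the MPS client libraries or the large-array JSON notation are better suited.

% Hedged sketch: call the deployed segmentFrame function over the MPS REST API.
frame = imread('frame0001.png');                     % placeholder frame
body  = struct('nargout', 1, 'rhs', {{frame}});      % JSON request payload
opts  = weboptions('MediaType', 'application/json', 'Timeout', 60);
resp  = webwrite('http://mps-host:9910/perception/segmentFrame', body, opts);
labels = resp.lhs;                                   % decoded output argument(s)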

43

The ETL (Extract-Transform-Load) Pipeline

44

The ETL Pipeline

§ Wide variety of automotive and engineering formats (e.g. ROSBag, MP4, MDF, BLF, PCAP); an extraction sketch follows below
§ RESTful endpoints or client-based access
§ Connectivity to cloud storage (S3) and distributed streaming systems (Kafka)
§ Horizontally scalable and designed to run 24x7x365 on headless cloud systems
§ Provides data proximity when executing on the cloud
§ Accelerated CUDA execution on NVIDIA® GPUs via GPU Coder™
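As a hedged extraction sketch for one of the listed formats: read camera frames from a ROS bag with the ROS Toolbox and stage them to S3, assuming a MATLAB release where image I/O accepts s3:// URLs; file, topic and bucket names are placeholders.

% Hedged ETL sketch: pull camera frames out of a ROS bag and stage them to S3.
bag  = rosbag('drive_2021_05_01.bag');                        % placeholder log
cam  = select(bag, 'Topic', '/camera/front_center/image_raw');
msgs = readMessages(cam, 1:10);                               % first ten frames only

for k = 1:numel(msgs)
    img = readImage(msgs{k});                                 % sensor_msgs/Image -> RGB
    imwrite(img, sprintf('s3://my-adas-bucket/frames/frame_%04d.png', k));
end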

45

Putting it all together

46

Build Graph (DAG)

§ Engine code
§ Trained models
§ New/cached data
§ Operates on S3 version metadata
§ Can be triggered manually or automatically (e.g. via CI/CD pipelines)

47

Design Considerations

48

Scaling considerations

Storage
§ Cloud-based storage allows ingestion of larger datasets

Compute
§ Quicker training iterations both locally and on the cloud
§ Faster inference pipeline

Cost
§ CapEx → OpEx

Workflows
§ Self-serve analytics optimized for agility along a maturity framework

"Glue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated “research” and “engineering” roles… …engineers and researchers are embedded together on the same teams (and indeed, are often the same people) can help reduce this source of friction significantly." [1]

[1] Hidden Technical Debt in Machine Learning Systems, https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf

49

Accuracy considerations

§ More training data
§ Simulation / scene generation
– RoadRunner

50

Interop considerations

§ ONNX interop (see the sketch below)
§ Transfer learning techniques
§ Import/Export to:
– OpenDRIVE®
– FBX®
– glTF™
– OpenFlight
– OpenSceneGraph
– OBJ and USD formats

§ Automated driving simulators and game engines (e.g. CARLA, VIRES VTD, NVIDIA DRIVE Sim®, Metamoto®, LGSVL, Baidu Apollo®, Unity®, and Unreal Engine®)
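A short sketch of the ONNX round trip with the Deep Learning Toolbox converters (shipped as support packages), assuming the trained network net from the training sketch; file names are illustrative.

% Hedged sketch: export the trained network to ONNX and import it back.
exportONNXNetwork(net, 'segmentationNet.onnx');

importedNet = importONNXNetwork('segmentationNet.onnx', ...
    'OutputLayerType', 'pixelclassification');   % rebuild as a MATLAB network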

51

Embedded Targeting

§ GPU Coder™ Support Package for NVIDIA GPUs

§ NVIDIA® DRIVE™ and Jetson hardware platforms
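A hedged targeting sketch with GPU Coder: generate CUDA code for the segmentFrame sketch from the inference section and build it against a Jetson board previously configured through the support package; the input size and deep-learning library choice are assumptions.

% Hedged sketch: generate CUDA code for segmentFrame and target an NVIDIA Jetson.
cfg                    = coder.gpuConfig('lib');
cfg.Hardware           = coder.hardware('NVIDIA Jetson');     % support package hardware
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');   % cuDNN-based inference

codegen -config cfg segmentFrame -args {ones(720,1280,3,'uint8')} -report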

52

Governance and Lifecycle management

§ Local Experimentation
§ Centralized Tracking

53

Security Considerations

§ Cloud storage best practices
§ ACL-based permissions
§ IAM-based identity and access

54

Conclusions

§ Well-architected systems [1] accelerate development of automated semantic segmentation against large datasets

§ Local development provides smooth workflows for the development and refinement of deep learning models

§ Cloud-based scaling of compute and storage can be leveraged on Databricks to enable self-service analytics

§ Techniques that are relevant for ADAS development are equally applicable across other domains such as medical, geo-exploration, etc.

[1] AWS Well-Architected, https://aws.amazon.com/architecture/well-architected/

55

Key Takeaways

§ Build upon proven, safety compliant, and DevOps-ready development tools

§ Leveraging a best-in-class simulation integration platform and modeling tools leads to faster time-to-market for ADAS development

§ Domain specific tooling allows practitioners to write less and do more

§ If this is interesting, please contact us at [email protected] for more details or to get started.
